Regurgitative Training: The Value of Real Data in Training Large Language Models

Jinghui Zhang School of Economics and Management, Tsinghua University Dandan Qiao School of Computing, National University of Singapore Mochen Yang Carlson School of Management, University of Minnesota Qiang Wei School of Economics and Management, Tsinghua University
(Draft Date: 7/3/2024)
Abstract

What happens if we train a new Large Language Model (LLM) using data that are at least partially generated by other LLMs? The explosive success of LLMs, such as ChatGPT and LLAMA, means that a substantial amount of content online will be generated by LLMs rather than humans, which will inevitably enter the training datasets of next-generation LLMs. In this paper, we evaluate the implications of such “regurgitative training” on LLM performance. Through fine-tuning GPT-3.5 with data generated either by itself or by other LLMs in a machine translation task, we find strong evidence that regurgitative training clearly handicaps the performance of LLMs. The ease of getting large quantities of LLM-generated data cannot compensate for performance loss –- even training with a fraction of real data is enough to outperform regurgitative training. The same performance loss of regurgitative training is observed on transformer models that we train from scratch. We carry out textual analyses to compare LLM-generated data with real human-generated data, and find suggestive evidence that the performance disadvantage of regurgitative training can be attributed to at least two mechanisms: (1) higher error rates and (2) lower lexical diversity in LLM-generated data as compared to real data. Based on these mechanisms, we propose and evaluate three different strategies to mitigate the performance loss of regurgitative training. In the first strategy, we devise data-driven metrics to gauge the quality of each LLM-generated data instance, and then carry out an ordered regurgitative training process where high-quality data are added before low-quality ones. In the second strategy, we combine data generated by multiple different LLMs (as an attempt to increase lexical diversity). In the third strategy, we train an AI detection classifier to differentiate between LLM- and human-generated data, and include LLM-generated data in the order of resemblance to human-generated data. All three strategies can improve the performance of regurgitative training to some extent but are not always able to fully close the gap from training with real data. Our results highlight the value of real, human-generated data in training LLMs, which cannot be easily substituted by synthetic, LLM-generated data. Given the inevitability of having some LLM-generated data in the training sets of future LLMs, our work serves as both a cautionary tale of its performance implication as well as a call-to-action for developing effective mitigation strategies.

Keywords: Generative AI, Large Language Model, AI-Generated Data, Synthetic Data, Machine Learning

1 Introduction

Large language models (LLMs) are trained on inexplicably large amounts of data. Although the exact training datasets are undisclosed, popular LLMs such as ChatGPT, Llama, Claude, and Mistral are believed to have been trained on a combination of content on public Internet (e.g., Common Crawl), proprietary datasets licensed from third-parties, as well as crowd-generated data (Brown et al.,, 2020; Ouyang et al.,, 2022; Achiam et al.,, 2023; Touvron et al.,, 2023; Anthropic,, 2024). With their explosive successes come explosive adoption – people use LLMs in an ever increasing set of tasks, including writing (Noy and Zhang,, 2023; Chen and Chan,, 2023), coding (Chen et al.,, 2021), knowledge management (Lewis et al.,, 2020), scientific discovery (Bran et al.,, 2023; Vert,, 2023), and many more.

A natural consequence of such pervasive use is that, going forward, a substantial amount of content online will be created (at least partially) by LLMs. When building the next-generation LLMs, data generated by existing LLMs is likely to enter the training datasets. This produces a scenario which we refer to as Regurgitative Training, where a new LLM is trained using data that are at least partially generated by itself or other LLMs. The overarching question we seek to answer in this paper is: how does regurgitative training affect the performance of LLMs?

Regurgitative training may be inevitable. Indeed, there is evidence suggesting that a large part of the open web is already generated by machine translation models (LLMs included, Thompson et al.,, 2024). Even data that are supposed to be human-generated (e.g., manual labels on crowdsourcing platforms) are often generated by LLMs (Veselovsky et al.,, 2023). As LLMs get better, it will be increasingly hard to distinguish between LLM-generated data from human-generated data post-hoc (Yang et al.,, 2023). Addition, some LLM developers have explicitly chosen to inject LLM-generated data into their training datasets. For example, Apple acknowledges that its multi-modal LLM named MM1 has been trained on instruction-response pairs generated from GPT-4 and Llama (McKinzie et al.,, 2024).

A priori, the impact of regurgitative training on LLM performance is unclear. On one hand, it represents an appealing opportunity to obtain large quantities of synthetic training data at relatively low costs, thereby offering a data quantity advantage. On the other hand, however, LLM-generated data may have lower quality than real, human-generated data – they may contain more errors (Shumailov et al.,, 2023) or suffer more from the “hallucination problem” (Rawte et al.,, 2023).111Throughout the paper, we use “real data” or ”real human-generated data”, in contrast with synthetic LLM-generated data, to refer to data that are generated by an organic process (typically by humans). Importantly, we do not assume real, human-generated data to be completely error-free; instead, we are interested in the comparison of real vs. synthetic data in model training. In fact, although major players in the LLM arena, including Microsoft, Google and Meta, are all reported to use synthetic data for LLM training, the practice has caused doubts from mainstream media.222Sources: Microsoft, Google and Meta Bet on Fake Data to Build AI Models, The AI Revolution Is Already Losing Steam, For Data-Guzzling AI Companies, the Internet Is Too Small, and AI-Generated Data Can Poison Future AI Models.

To understand the performance implications of regurgitative training, we carry out analyses under two different settings: fine-tuning and training from scratch, both of which represent realistic practice in building LLMs. First, using machine translation as an example generative task, we fine-tune the GPT-3.5 model with data generated by (i) GPT-3.5 itself, (ii) another LLM, namely GPT-4 or LLAMA2, and (iii) ground-truth real data. We then compare the out-of-sample translation performance between models fine-tuned with LLM-generated data versus those fine-tuned with real data. Second, we also build smaller-scale transformer-based models from scratch and repeat the above regurgitative training experiments. This is done for both machine translation and another generative task – Q&A – to enhance the generalizability of our findings.

Across different generative tasks and model settings, we consistently observe that LLMs with regurgitative training underperforms those trained with real data. Given the same base model, training with more real data typically improves performance, whereas training with more LLM-generated data leads to quickly plateaued performance or even performance drops. Moreover, training with even a small proportion of real data is enough to outperform training only with LLM-generated data. The performance disadvantages of regurgitative training are especially pronounced when the LLM responsible for generating training data is not good at the task.

We also perform several textual analyses to make sense of regurgitative training’s performance disadvantages. As can be expected, errors in LLM-generated data is one of the culprits. Interestingly, we find evidence that it may not be the only contributing mechanism. Specifically, LLM-generated data exhibit lower degrees of lexical diversity than real data, echoing several recent research (Padmakumar and He,, 2023; Doshi and Hauser,, 2023; Anderson et al.,, 2024) in the contexts of academic writing or creative ideation. Lower lexical diversity in LLM-generated data may also partially explain the performance disadvantages of regurgitative training.

In light of these findings, we propose and evaluate three different strategies to more carefully leverage LLM-generated data in regurgitative training. The first strategy borrows from the semi-supervised learning literature and prioritizes using high-quality LLM-generated data over low-quality ones, where “quality” is gauged either by prediction confidence or by an external supervised learning model. Second, as an attempt to address the diversity deficit of LLM-generate data, we combine data generated by a mixture of different LLMs in training. Finally, the third strategy makes use of “AI detectors”, i.e., classification models that try to distinguish between LLM- vs. human-generated content. We train and deploy a capable AI detector model on LLM-generated data, and use LLM-generated data in regurgitative training in the order of predicted probability of being generated by humans (i.e., prioritizing LLM-generated data that resemble human-generated data). Our results demonstrate that all three strategies have some power to improve the performance of regurgitative training. In a few cases with transformer models trained from scratch, the improved performance even surpasses training with real data. Meanwhile, effectiveness of these strategies tends to be small in the fine-tuning setting, and none of them can fully close the gap from training with real data. This further highlights the unique value of real, human-generated data in LLM training.

Our work makes several contributions to the fast growing literature on generative AI and LLMs. Aside from all the amazing capabilities of modern LLMs, we offer a sobering analysis of regurgitative training, which may become inevitable as LLMs get more deeply integrated into various content generation tools and channels. Our empirical evidence demonstrates that regurgitative training stalls or hurts LLM performance, because LLM-generated data, as coherent or convincing as they may seem, still fall short of real data. Therefore, more productive regurgitative training requires a more careful use of LLM-generated data. The three mitigation strategies we propose and test represent practical design artifacts that can mitigate the performance loss associated with regurgitative training. In the meantime, the fact that no mitigation strategy we have explored can catch up with the performance of training with real data is an indication that real data remain one of the most valuable assets of LLM training, and cannot be easily substituted by synthetic data produced by existing LLMs.

2 Relevant Literature

Our work is closely related to self training in the semi-supervised learning literature and data augmentation in the deep learning literature, both of which are briefly reviewed in this section. As will be discussed later, although regurgitative training in LLMs is fundamentally different from the conventional schemes of self-training or data augmentation, both offer some valuable ideas that can inform our understanding of regurgitative training as well as potential approaches to manage its performance downsides.

2.1 Self Training

Self training is one of the classic approaches in semi-supervised learning to train a machine learning model using both labeled and unlabeled data (Scudder,, 1965; Nigam and Ghani,, 2000). Take classification tasks as an example, the idea is to first build a model on the labeled data via standard supervised learning procedures, obtain the model’s predictions on the unlabeled data, then take the most confident predictions (e.g., data instances with most extreme predicted probabilities) and treat them as additional labeled data to re-train the model. As a way to convert some unlabeled data into labeled data, self training is useful especially when the original labeled data are scarce.

There is an extensive literature on self training, both in traditional machine learning (see Pise and Kulkarni,, 2008; Triguero et al.,, 2015, for surveys of the topic) and in modern deep learning (e.g., Xie et al., 2020b, ). One key insights from this body of work is that the effectiveness of self training depends heavily on the ability to accurately estimate “prediction confidence”. Because predictions with high confidence are used as “pseudo-labels”, having accurate confidence scores imply that the “pseudo-labels” are more likely to be correct (i.e., the same as ground-truth labels).

Regurgitative training of LLMs resembles self training in that model-predicted pseudo labels are used to further train the model. However, it is unclear whether the conventional wisdom of self training would still apply for in the case of regurgitative training, because of the difficulty in assessing confidence of LLM outputs (Lin et al.,, 2023). In classification tasks, predicted class probabilities naturally serve as the measure to quantify the uncertainty in a classifier’s predictions. However, LLMs are much more complex than classifiers – they generate multi-token answers in response to prompts. Unless in highly restricted scenarios (e.g., evaluating a single-digit response to the question “what is 2+2”), it is generally not straightforward to define or measure confidence in LLM outputs. As a result, current LLMs do not automatically produce confidence scores for their responses and uncertainty quantification in LLMs remain an open question with many ongoing research, including asking LLMs themselves for confirmation (“self-reflection”, Chen and Mueller,, 2023), re-running the same prompt multiple times and measure internal consistency among responses (Kotelanski et al.,, 2023), and tapping into human expertise (Shankar et al.,, 2024).

2.2 Data Augmentation

Data augmentation represents another approach to enrich potentially limited labeled data. It works by injecting noises into existing labeled data instances to artificially create new data instances that can be assigned the same labels. In language tasks, a common data augmentation strategy is back-translation (Yu et al.,, 2018), where a sentence is first translated into a different language then back to the original language to achieve paraphrasing. In vision tasks, data augmentation may involve image transformation techniques such as rotation, color / contrast modification, etc. (Cubuk et al.,, 2020; Xie et al., 2020a, ). These augmentation strategies can benefit model performance if the injected noises do not change the labels, thereby creating more training data with valid labels.

Although data augmentation is procedurally quite different from regurgitative training, it does offer an insight that can help enhance the performance of regurgitative training. Xie et al., 2020a found that data augmentation is more effective when the augmentation strategy can generate a diverse set of instances rather than only introducing small, local perturbations. Learning from a diverse set of augmented data can enable the model to achieve competitive performance with fewer examples. Conceptually, this finding is also consistent with observations made in other machine learning research outside of data augmentation (e.g., Gong et al.,, 2019), where the diversity of training data instances is positively associated with predictive performance. Later, we leverage this insight in one of the strategies designed to mitigate performance loss of regurgitative training, by mixing data generated by different LLMs as an attempt to introduce greater diversity to the training process.

2.3 Regurgitative Training

Regurgitative training of generative AI models represents a new problem that has only begun to receive scholarly attention very recently. The earliest work we could identify is Shumailov et al., (2023), which document that using model-generated data to train next-generation models can create irreversible performance losses, a phenomenon they term “model collapse”. They demonstrate this in common generative AI architectures such as variational autoencoders, Gaussian mixture models, and small-scale LLMs. Moreover, they provide theoretical intuitions that model collapse arises due to errors in model-generated data, which accumulates over more iterations of regurgitative training. Subsequently, the phenomenon of model collapse has also been observed in generative image models (Alemohammad et al.,, 2023; Bertrand et al.,, 2023).

In the meantime, efforts to mitigate model collapse are underway. Bertrand et al., (2023) show that model collapse can be avoided if (i) the proportion of real data is sufficiently high and (ii) model-generated data approximate the distributions of real data well enough. Furthermore, Gerstgrasser et al., (2024) propose to alleviate model collapse by “accumulating data”; that is, use the totality of real and model-generated data (rather than just the model-generated data) to train new models.

We build upon this nascent stream of research and aim to make several distinct contributions. First, we consider regurgitative training of a LLM not only by data generated by itself, but also by other LLMs with varying degrees of capabilities. This is already taking place in practice (e.g., McKinzie et al.,, 2024) but has not been systematically explored in the literature. Second, prior work such as Shumailov et al., (2023) focused on early versions of generative models (e.g., non-transformer-based models or small pre-trained models). Instead, we carry out comprehensive experiments with leading LLMs at the time of research (e.g., GPT-4 and Llama2) as well as transformer models trained from scratch, thereby providing a more up-to-date understanding of regurgitative training. Finally, we propose several new mitigation strategies beyond what has been tested so far, and empirically evaluate their effectiveness.

3 Performance Impact of Regurgitative Training

In this section, we aim to understand how regurgitative training affect the performance of an LLM through two sets of experiments, respectively constructed to reflect two representative practices in LLM training: (i) fine-tuning and (ii) training from scratch. Fine-tuning allows users to adapt an existing LLM to their own use cases and, as mentioned before, is a widely adopted practice in the industry (e.g., McKinzie et al.,, 2024). We expect a lot of LLM training will take the form of fine-tuning, because training a state-of-the-art LLM from scratch is highly complex and resource-intensive. Meanwhile, we also consider the case of training a smaller-scale LLM from scratch, which may be necessary for companies that cannot leverage third-party LLMs due to data security and privacy issues.

For both fine-tuning and training from scratch, we focus on machine translation as the generative task of interest. Translation represents a common application for LLMs, and the performance of a translation model can be evaluated with well-established standards and metrics in the literature. This enables us to robustly assess the performance variations resulting from regurgitative training. In the case of training from scratch, we also replicate the main findings with a different generative task, namely Q&A.

3.1 Experiments with Fine-Tuning

To carry out LLM fine-tuning for translation, we use the Europarl parallel corpus (Koehn,, 2005). Sourced from the proceedings of the European Parliament, the corpus contains parallel sentences in multiple European languages. We specifically use pairs of German-English sentences. After basic pre-processing steps (e.g., removing special HTML tags, eliminating noisy characters, and handling null values), we end up with 1,908,849 sentence pairs for our analyses. We treat these sentence pairs as real data.

A popular and widely used metric to evaluate the performance of a translation model is the BLEU (BiLingual Evaluation Understudy) score (Papineni et al.,, 2002). It evaluates the quality of a model-generated translation (also called a “hypothesis translation”) in comparison to one or more reference translations.333The BLEU score can be analogously defined for a corpus of hypothesis translations and the corresponding corpus of reference translations. We describe the simpler case with a single hypothesis translation here for ease of understanding, and refer readers to Papineni et al., (2002) for the more general case. The BLEU score is calculated as the n-gram overlap between the hypothesis translation and reference translations. It ranges from 0 to 1 and a higher BLEU score generally indicates better quality translations. Formally, the BLEU score is defined as

BLEU=min{1,exp(1rc)}exp(n=1Nwnlogpn)BLEU11𝑟𝑐superscriptsubscript𝑛1𝑁subscript𝑤𝑛subscript𝑝𝑛\text{BLEU}=\min\left\{1,\exp\left(1-\frac{r}{c}\right)\right\}\cdot\exp\left(% \sum_{n=1}^{N}w_{n}\log{p_{n}}\right)BLEU = roman_min { 1 , roman_exp ( 1 - divide start_ARG italic_r end_ARG start_ARG italic_c end_ARG ) } ⋅ roman_exp ( ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT roman_log italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) (1)

In the first term, c𝑐citalic_c is the length of hypothesis translation and r𝑟ritalic_r is the “effective” length of reference translations (defined as the length of the one reference translation that best matches the hypothesis translation). This term serves as a “brevity penalty” that assigns a higher score for a better match in lengths between hypothesis and reference translations. In the second term, pnsubscript𝑝𝑛p_{n}italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT denotes the n𝑛nitalic_n-gram precision and is defined as

pn=n-gramhypothesisCountmatched(n-gram)ngramhypothesisCount(n-gram)subscript𝑝𝑛subscript𝑛-𝑔𝑟𝑎𝑚𝑦𝑝𝑜𝑡𝑒𝑠𝑖𝑠𝐶𝑜𝑢𝑛subscript𝑡𝑚𝑎𝑡𝑐𝑒𝑑𝑛-𝑔𝑟𝑎𝑚subscript𝑛𝑔𝑟𝑎𝑚𝑦𝑝𝑜𝑡𝑒𝑠𝑖𝑠𝐶𝑜𝑢𝑛𝑡𝑛-𝑔𝑟𝑎𝑚p_{n}=\frac{\sum_{n\mbox{-}gram\in hypothesis}Count_{matched}(n\mbox{-}gram)}{% \sum_{ngram\in hypothesis}Count(n\mbox{-}gram)}italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = divide start_ARG ∑ start_POSTSUBSCRIPT italic_n - italic_g italic_r italic_a italic_m ∈ italic_h italic_y italic_p italic_o italic_t italic_h italic_e italic_s italic_i italic_s end_POSTSUBSCRIPT italic_C italic_o italic_u italic_n italic_t start_POSTSUBSCRIPT italic_m italic_a italic_t italic_c italic_h italic_e italic_d end_POSTSUBSCRIPT ( italic_n - italic_g italic_r italic_a italic_m ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_n italic_g italic_r italic_a italic_m ∈ italic_h italic_y italic_p italic_o italic_t italic_h italic_e italic_s italic_i italic_s end_POSTSUBSCRIPT italic_C italic_o italic_u italic_n italic_t ( italic_n - italic_g italic_r italic_a italic_m ) end_ARG (2)

where Countmatched(n-gram)𝐶𝑜𝑢𝑛subscript𝑡𝑚𝑎𝑡𝑐𝑒𝑑𝑛-𝑔𝑟𝑎𝑚Count_{matched}(n\mbox{-}gram)italic_C italic_o italic_u italic_n italic_t start_POSTSUBSCRIPT italic_m italic_a italic_t italic_c italic_h italic_e italic_d end_POSTSUBSCRIPT ( italic_n - italic_g italic_r italic_a italic_m ) counts the number of n𝑛nitalic_n-gram matches between the hypothesis translation and reference translations. In Equation (1), the n𝑛nitalic_n-gram precision scores are then weighted by wnsubscript𝑤𝑛w_{n}italic_w start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT (e.g., uniform weighting wn=1Nsubscript𝑤𝑛1𝑁w_{n}=\frac{1}{N}italic_w start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_N end_ARG) to compute the overall BLEU score.

To implement and evaluate regurgitative fine-tuning, several components need to be defined first, including a baseline LLM to be fine-tuned, a set of training data (either real or generated by other LLMs) used for fine-tuning, and a fine-tuned LLM for evaluation. In our context, we use the GPT-3.5 model as the baseline LLM,444At the time of our research, OpenAI’s fine-tuning service was restricted to the GPT-3.5 model. then fine-tune it with (i) real human-generated data, (ii) data generated by GPT-3.5 itself, and (iii) data generated by two other LLMs, namely GPT-4 and LLAMA2. This creates four fine-tuned LLMs, all of which are evaluated on the same testing data for performance comparison.

More specifically, we randomly select 5,000 sentence pairs from the original corpus for fine-tuning and 10,000 sentence pairs as the testing data. When fine-tuning with real data, the 5,000 German sentences are used as inputs and the corresponding 5,000 English sentences are used as target translations. When fine-tuning with LLM-generated data, the same 5,000 German sentences are used as inputs, but the target translations are generated by the corresponding LLM. For GPT-3.5, GPT-4, and LLAMA2, we obtain their translations with the same system prompt: “You are a chatbot that can translate German to English.”, and the German sentences are given to the LLMs as user inputs. Using each set of data, we carry out progressive fine-tuning over five batches, adding 1,000 data instances per batch and recording translation performance on the testing data after each batch.

We show the results in Figure 1. Each line represents the translation perform of a particular model over five fine-tuned batches. The X𝑋Xitalic_X-axis indicates batch index, marking the number of data instances utilized in the fine-tuning process. The Y𝑌Yitalic_Y-axis represents the BLEU score, where a higher value corresponds to better translation performance.

Refer to caption
Figure 1: Performance of Fine-Tuning GPT-3.5 Model

From the figure, it is evident that the performance of fine-tuning with LLM-generated data (both from the baseline LLM itself and from other LLMs) clearly lags behind the performance of fine-tuning with real human-generated data. Moreover, regurgitative training with different LLMs have differential performance impact. Fine-tuning with data generated by GPT-3.5 itself does not significantly change performance, and fine-tuning with GPT-4 generated data only results in barely noticeable performance improvement over the baseline model (i.e., at point 0 on the X𝑋Xitalic_X-axis). However, fine-tuning with LLAMA2 generated data significantly degrades performance compared to the baseline. This is likely because the three LLMs have different translation capabilities. Since we have the ground-truths translations for the 5,000 fine-tuning data, we can directly compute the BLEU scores of translations generated by the three LLMs, and indeed find GPT-4 to be the best (BLEU=0.3454𝐵𝐿𝐸𝑈0.3454BLEU=0.3454italic_B italic_L italic_E italic_U = 0.3454), followed by GPT-3.5 (BLEU=0.3428𝐵𝐿𝐸𝑈0.3428BLEU=0.3428italic_B italic_L italic_E italic_U = 0.3428) and LLAMA2 (BLEU=0.2417𝐵𝐿𝐸𝑈0.2417BLEU=0.2417italic_B italic_L italic_E italic_U = 0.2417).

These results underscore the overall negative, and potentially detrimental, effects of regurgitative training. Compared to training with real data, regurgitative training largely stalls learning. Regurgitative training with a better-performing LLM improves performance only marginally and is not sufficient to catch up with the performance on real data. Worse yet, regurgitative training with a less capable LLM can significantly hurt performance.

Note that in the above experiments, we use utilize a small set of data for fine-tuning. This decision stems from the remarkable few-shot learning capabilities of modern LLMs (Brown et al.,, 2020). In addition, we also conduct a robustness check to understand whether the performance of regurgitative training may be different if more fine-tuning data are available. Specifically, we augment the fine-tuning data size by 20 times, to a total of 100,000, and incrementally add 10,000 per batch. For efficiency and cost considerations, we only run fine-tuning with real data and data generated by GPT-3.5 itself. We then evaluate each fine-tuned models on the same testing data as before. The results are presented in Figure 2. We again observe that regurgitative training is unable to improve translation performance and substantially underperforms training with real data.

Refer to caption
Figure 2: Performance of Fine-Tuning GPT-3.5 Model (Augmented Data Size)

3.2 Experiments with Models Trained from Scratch

We now turn to training models from scratch and understanding the performance impact of regurgitative training in this case. Specifically, we build transformer models using the translation data. The transformer architecture serves as a foundational component powering the majority of modern LLMs, and has found extensive applications in machine translation and a variety of other natural language tasks (Vaswani et al.,, 2017; Wolf et al.,, 2020). We therefore choose to train small-scale transformer models from scratch, as an attempt to approximate the practice of training transformer-based models without leveraging third-party LLMs.

We follow Vaswani et al., (2017) to build the baseline translation models. Transformer has an encoder-decoder architecture, which uses stacked layers of multi-head self-attention and point-wise, fully connected feed-forward networks for both encoder and decoder. It also employs a residual connection on each sub-layer, followed by layer normalization. For implementational details of these transformer elements, we refer to Vaswani et al., (2017).

As our previous fine-tuning results have shown, the performance impact of regurgitative training can vary with the capability of the model used to generate training data. Therefore, we train both a “low-performance” and a ”high-performance” baseline models. This is done by gradually adding 50,000 sentence pairs (randomly sampled from the German-English corpus) per batch for training, and evaluate the model’s translation performance on a fixed testing dataset of 50,000 sentence pairs. As shown in Figure 3, we observe that the model’s performance improves quickly with the initial increase in training data size, and saturates after being trained with sufficient data. We choose the model trained with 50,000 data instances as our low-performance baseline and the one trained with 500,000 data instances as the high-performance baseline.

Refer to caption
Figure 3: Performance of Transformer Models with Varying Training Data Sizes

These two baseline models, corresponding to different performance levels, are then used to evaluate the effects of regurgitative training. We randomly sample a total of 300,000 data instances (outside of the training data of both the low- and high-performance baseline models) designated for regurgitative training. In batches of 10,000 data instances, we continue training both the low-performance baseline model and the high-performance baseline model with (i) real human-generated data, (ii) data generated by the low-performance model, and (iii) data generated by the high-performance model. After each batch of training, we evaluate the all models’ performance on the same testing data of 50,000 instances. The results are presented in Figure 4.

Refer to caption
Figure 4: Performance of Regurgitative Training Transformer Models (two plots have different y𝑦yitalic_y-axis scales for better readability)

For both low-performance and high-performance baseline models, regurgitative training with data generated from the low-performance model clearly underperforms training with real data. The same is true for regurgitative training of high-performance model with data generated by itself, though the performance gap is fairly small. Curiously, regurgitative training of low-performance model with data generated by high-performance model actually outperforms training with real data for the first 19 batches (i.e., top two lines in the right plot). To understand whether this is a sustainable performance advantage, we sample more data to carry out another 20 batches of regurgitative training in this case. The results, shown in Figure 5, show that regurgitative training performance starts to plateau around 30 batches, and underperforms training with real data thereafter.

Refer to caption
Figure 5: Performance of Regurgitative Training Low-Performance Model (Augmented Data Size)

The above results further demonstrate the performance cost of regurgitative training, even when businesses create and train their models from scratch. Consistent with our observations under regurgitative fine-tuning, regurgitative training with data from a more capable model is better than those from a less capable model – training with data generated from the low-performance model clearly harms performance. Regurgitative training with the more capable high-performance model can match or even surpass the performance of training with real data, but such advantages usually fade away as the size of regurgitative training data grows.

In reality, it is plausible that a mixture of both model-generated data and real data are used for training. We therefore carry out another set of experiments to check how the proportion of model-generated data in the mix affect model performance. We simulate five scenarios, where the proportion of model-generated data is 100%, 75%, 50%, 25%, and 0% respectively, and the rest of each mixture consists of real data. Naturally, the scenarios with 100% and 0% of model-generated data are the same as training purely with model-generated data or purely with real data. For clarity, we focus on training each baseline model with data generated by itself (mixed with different proportions of real data). Other experiment settings are the same as before, and the results are shown in Figure 6.

Refer to caption
Figure 6: Performance of Regurgitative Training Transformer Models with Different Proportions of Real Data (two plots have different y𝑦yitalic_y-axis scales for better readability)

In the case of high-performance baseline (i.e., left plot), because the performance gap between regurgitative training and training with real data is relatively small to begin with, the pattern is obfuscated by local performance fluctuations. Nonetheless, having a higher proportion of real data still generally leads to better performance. In contrast, the pattern becomes much clearer in the case of low-performance baseline (i.e., right plot). We can see that even a small amount of model-generated data is enough to slow down learning. As a higher proportion of model-generated data is used, the model’s performance continues to deteriorate.

3.3 Replication: Question Answering Task

In addition to machine translation, we also conduct a replication study with another common generative language task – Question Answering (Q&A). We use the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al.,, 2016), which is a widely used benchmarking dataset for developing and testing Q&A methods. SQuAD is a reading comprehension dataset composed of questions created by crowd-workers based on a collection of Wikipedia articles, with answers being segments of texts from the corresponding passages in the articles. The dataset includes 87,599 entries in the training dataset (constructed from 442 articles) and 10,570 entries in the testing dataset (constructed from 48 other articles).

Instead of end-to-end training (as was done in the previous section), here we use a pre-trained multilingual BERT model (bert-base-cased, Devlin et al.,, 2018) to extract word embeddings, which are fed into a feedforward neural network model for Q&A. Doing so allows us to evaluate regurgitative training under yet another widely adopted strategy for training generative language models (i.e., leveraging pre-trained representation models).555This strategy is often also referred to as “fine-tuning” a pre-trained model. We refrain from using this term here in order to avoid confusion with our fine-tuning experiments in Section 3.1, which are performed on top of existing LLMs (rather than a BERT-like representation model). We follow Rajpurkar et al., (2016) to evaluate Q&A performance with two metrics: Exact Match and F-1 score. Exact match measures the percentage of predicted answers that match the ground-truth answers exactly, and F-1 score measures the overlap between the predicted answers and the ground-truth answers by treating both predictions and ground-truths as bags of tokens. We calculate the average F-1 over all questions in the testing data.

Following the same procedure as in the previous section, we use increasing amounts of real data to train baseline models in order to identify a low-performance model and a high-performance model (see Appendix A for detailed results). The low-performance model is trained on all entries from 40 articles and achieves 70.77% exact match rate and 80.34% average F-1 score, whereas the high-performance model is trained on all entries from 200 articles and achieves 78.63% exact match rate and 86.59% average F-1 score.

Next we use these two baseline models for regurgitative training on entries from the remaining 242 articles (not used in training the two baseline models). In batches of 10 articles, we continue training both the low- and high-performance baselines with (i) real human-generated data, (ii) data generated by the low-performance model, and (iii) data generated by the high-performance model, for a total of 20 batches. After each batch of training, we evaluate the all models’ performance on the same testing dataset provided by SQuAD. The results are included in Figure 7, where the first row shows performance of regurgitative training the high-performance baseline model and the second row shows performance of regurgitative training the low-performance baseline model.

Refer to caption
Figure 7: Performance of Regurgitative Training Transformer Models on Q&A Tasks

We again observe that regurgitative training with model-generated data negatively affects Q&A performance, compared to training with real data. Different from the translation task, Q&A regurgitative training with data generated by the high-performance model does not improve performance (and certainly does not outperform training with real data), even though it still weakly outperforms regurgitative training with low-performance model generated data. In other words, the peril of regurgitative training is not limited to translation task and is even more severe in Q&A task.

4 Understanding Performance Loss from Regurgitative Training

Why does regurgitative training hurt performance compared to training with real data? In this section, we offer some preliminary evidence into the underlying mechanisms. Using the translation task as an example, we focus on characterizing the differences between LLM-generated training data and real data, and discuss how these differences may impact the performance.

The first mechanism is error – LLMs are not perfect and data generated by them can contain more errors than real data. New models trained on these error-prone data can therefore have inferior performance. This is also the mechanism identified and studied in prior work (Shumailov et al.,, 2023). We test this mechanism in the fine-tuning setting of Section 3.1 with the 5,000 data points used for regurgitative training. Recall that, with access to ground-truths for these data points, we have already calculated the BLEU scores of translations generated by GPT-3.5, GPT-4, and LLAMA2. We have confirmed that translations generated by GPT-4 have a slightly higher BLEU score than those generated by GPT-3.5, and both clearly have higher BLEU scores than Llama2-generated data. This aligns well with the testing performance of the corresponding fine-tuned LLMs (Figure 1).

Although BLEU is widely used to measure translation quality, it also has an important limitation that it does not explicitly account for the semantic meaning of words. A translation that uses different words than those in the ground-truth will have a low BLEU score even if it is semantically correct. In other words, having a lower BLEU score does not necessarily mean that the translation is more erroneous. In light of this, we construct two new measures, both aiming at quantifying the semantic differences of LLM-generated data vs. real data. We take each set of training data (generated by one of the LLMs or human) and perform several pre-processing steps, including (i) lower-casing, (ii) removing punctuation, (iii) removing stopwords, and (iv) lemmatization (i.e., reducing a word to its stem form). These pre-processing steps allow us to focus only on the substantive content of each translation.666For example, two different translations “Tomorrow will be raining!” and “Tomorrow will rain.” will both become “tomorrow rain” after pre-processing, as it lower-cases both sentences, removes punctuation, removes stopwords “will” and “be”, and reduces “raining” to its stem “rain”. The first metric is computed as the average cosine similarity between the embeddings of LLM and ground-truth translations, where the embeddings are obtained from the Sentence Transformer model (Reimers and Gurevych,, 2019). After pre-processing, a smaller cosine similarity implies greater semantic discrepancies of LLM translations from the ground-truths, which is indicative of translation errors. The second metric counts the number of word tokens in a ground-truth translation that satisfy two conditions: (1) they do not show up in the corresponding LLM translation and (2) even their synonyms (retrieved based on WordNet, Miller,, 1995) do not show up in the LLM translation. These non-synonymous deviations likely represent words mistranslated by LLM. Results of these two metrics are reported in the second and third rows of Table 1. We see that the two sets of GPT-generated data have higher semantic similarities with ground-truths and lower non-synonym deviations than LLAMA2, again supporting the mechanism that translation errors are partially responsible for the performance reduction of regurgitative training.

Table 1: Metrics of Translation Errors and Comparison Results
GPT-3.5 vs. Real GPT-4 vs. Real LLAMA2 vs. Real
Average Cosine Similarity 0.8047 0.8059 0.7506
Total # of Non-Synonymous Deviations 21836 21619 27962

Beyond errors, we also test a different mechanism related to lexical diversity. Several recent work suggest that LLM-generated content appears to be more homogeneous than human-generated content (Doshi and Hauser,, 2023; Anderson et al.,, 2024; Zhou and Lee,, 2024). We suspect that regurgitative training with less diverse LLM-generated data may hinder the model’s ability to generalize and result in lower testing performance. We quantify lexical diversity with two metrics. The first is a straightforward count of the total number of unique word tokens in ground-truth or LLM translations. The second adopts the self-BLEU metric proposed by Zhu et al., (2018). Self-BLEU is the BLEU score of a given text against all other texts in a corpus. Because BLEU captures lexical similarity, self-BLEU accordingly reflects how similar a text is with the rest of the corpus (higher self-BLEU implies lower diversity). For LLM-generated translations, errors may artificially decrease self-BLEU without meaningfully increase lexical diversity. We therefore remove the previously mentioned non-synonymous deviations (as approximation of errors) from LLM translations. We then average self-BLEU over the 5,000 training data points. Results of both metrics are reported in Table 2. Ground-truth translations consistently use more unique tokens and have lower average self-BLEU than LLM translations. GPT translations use more unique tokens than LLAMA2 and the three LLMs have similar average self-BLEU.

Table 2: Metrics of Lexical Diversity and Comparison Results
Real GPT-3.5 GPT-4 LLAMA2
Total # of Unique Tokens 14604 13690 13731 13081
Average Self-BLEU 0.1048 0.1154 0.1154 0.1126

Given the black-box nature of LLMs, we acknowledge that the exact process through which errors or lack of lexical diversity in training data affect model performance remains unclear. Nonetheless, these explorations provide plausible explanations for the negative performance impact of regurgitative training. More importantly, they naturally give rise to potential strategies to mitigate performance loss due to regurgitative training. We investigate a few different strategies in the next section.

5 Mitigating Performance Loss from Regurgitative Training

In this section, we propose and test a few strategies to mitigate the adverse performance impact of regurgitative training. Designing effective mitigation strategies requires first understanding the mechanisms of the adverse effects. Our explorations in the previous section provide suggestive evidence that errors and lack of lexical diversity may both be at play. Accordingly, we design three mitigation strategies to address one or both of these mechanisms:

  • Strategy 1 relies on quality quantification to gauge the likelihood of errors in synthetic data, and prioritize the use of data with high quality (i.e., low error likelihood) in regurgitative training;

  • Strategy 2 seeks to enhance lexical diversity by mixing together synthetic data generated by different LLMs in regurgitative training;

  • Strategy 3 builds an AI detection model to differentiate between synthetic vs. real data, and prioritize the use of synthetic data that most resemble real data for regurgitative training. As a competent AI detector may pick up on both errors and lexical diversity as predictive features, this strategy is designed to address both issues.

Details of each strategy and the corresponding evaluations on the translation task are discussed in the rest of this section.777As will be discussed later, the first, quality-based mitigation strategy naturally also works for the Q&A task, which we will demonstrate as part of Section 5.1.2. However, it is worth noting up front that the goal of mitigation is not to completely close the gap from the performance of training with real data – this may not be realistic in the short term. Instead, the goal is to use LLM-generated synthetic data in a more careful manner to reduce performance loss.888One might suggest the best strategy to reduce performance loss is not to use any synthetic data at all. However, as we discussed in Section 1, real human-generated data alone may not be sufficient to train next-generation LLMs.

5.1 Mitigation Strategy based on Quality Quantification

The first strategy is to identify a method to assess the quality of synthetic data, and subsequently select higher-quality data for regurgitative training. This requires defining a metric that accurately measures, or at least correlates with, data quality specific to the task at hand. One such metric, commonly used in classification contexts, is prediction confidence score. Higher prediction confidence scores usually correlate with greater probability of correct predictions, and the semi-supervised learning literature routinely uses prediction confidence as a quality metric (e.g., Scudder,, 1965). Because modern LLMs generate content by autoregressively predicting the next token, it is viable to also adopt prediction confidence, calculated based on predicted probabilities over the vocabulary, to quantify the quality of LLM-generated data. However, a practical obstacle is that when using third-party LLMs, prediction probabilities may not always be available. Therefore, we devise an alternative quality metric to guide the quality-based mitigation in the setting of LLM fine-tuning, assuming prediction probabilities are unavailable (Section 5.1.1). We also demonstrate the same mitigation strategy with transformers trained from scratch, assuming prediction probabilities are fully available (Section 5.1.2).

5.1.1 Evaluation in Fine-Tuning Setting.

In translation task, in the absence of raw prediction probabilities, the BLEU score can be used as another metric to gauge data quality. We propose to train a supervised learning model to predict the BLEU score of a LLM-generated translation. To train such a BLEU prediction model, we randomly sample 150,000 German-English sentence pairs (not previously used in Section 3.1) and obtain the translations generated by GPT-3.5, GPT-4, and LLAMA2. For each pair of German sentence and LLM translation, we compute the BLEU score using the ground-truth translation as the reference. These LLM-generated translation pairs, along with their BLEU scores, form the labeled dataset for training the BLEU prediction model.

The labeled dataset is randomly split into 80% for training and 20% for testing. Each instance of the labeled dataset is structured as input=(g1,g2,,gM,[SEP],e1,e2,,eN,[SEP]),label=BLEUformulae-sequence𝑖𝑛𝑝𝑢𝑡subscript𝑔1subscript𝑔2subscript𝑔𝑀delimited-[]𝑆𝐸𝑃subscript𝑒1subscript𝑒2subscript𝑒𝑁delimited-[]𝑆𝐸𝑃𝑙𝑎𝑏𝑒𝑙𝐵𝐿𝐸𝑈input=(g_{1},g_{2},\ldots,g_{M},[SEP],e_{1},e_{2},\ldots,e_{N},[SEP]),label=BLEUitalic_i italic_n italic_p italic_u italic_t = ( italic_g start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_g start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_g start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT , [ italic_S italic_E italic_P ] , italic_e start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_e start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_e start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT , [ italic_S italic_E italic_P ] ) , italic_l italic_a italic_b italic_e italic_l = italic_B italic_L italic_E italic_U, where gisubscript𝑔𝑖g_{i}italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT represents tokens in German sentences, ejsubscript𝑒𝑗e_{j}italic_e start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT represents tokens in English translations, and [SEP]delimited-[]𝑆𝐸𝑃[SEP][ italic_S italic_E italic_P ] denotes the special separation token. Similar to an approach used in Chowdhury et al., (2021), we derive embedding of the entire input sequence from BERT (with the bert-base-multilingual-uncased pre-trained model), which are then used as input features to eight different supervised learning techniques for BLEU prediction. We train a separate BLEU prediction models for each of the three LLMs, and the testing performance of these BLEU prediction models are summarized in Appendix B. We find that the Bayesian Ridge technique exhibits relatively superior BLEU prediction performance (achieving lower MAE and MSE values).

Using the best-performing BLEU prediction model for each LLM, we predict the BLEU scores of the 5,000 LLM-generated translations previously used for regurgitative fine-tuning. Next, we rank the LLM-generated translations by their predicted BLEU scores, from high to low, then proceed to fine-tune the baseline GPT-3.5 model in batches of 1,000 data instances and evaluate the resulting performance. The batch-wise fine-tuning procedure and the testing data partition used for performance evaluation are exactly the same as in Section 3.1. We present the results in Figure 8. Please note that, because BLEU scores from regurgitative training with GPT models have much smaller variations than those from real data or LLAMA2 generated data, we also add a plot on the right side of the Figure to zoom in on GPT-related results.

Refer to caption
Figure 8: Quality-Based Mitigation Strategy: Results on LLM Fine-Tuning (plot on the right zooms in on GPT-related results for better readability)

We can see that regurgitative training using quality-ranked data shows some improvements compared to using the corresponding LLM-generated data without quality consideration, thereby supporting the utility of quality-based mitigation strategy. However, across the three LLMs we have tested, the magnitudes of performance improvement are all rather small and still far from reaching the performance level of training with real data.

5.1.2 Evaluation in Training-from-Scratch Setting.

When businesses build their own language models, as described in Section 3.2, a naturally available metric for evaluating data quality is the prediction confidence score, typically calculated based on class probability predictions. In the transformer architecture, these probabilities are the outputs of the softmax layer. Rather than using the highest (i.e., top-1) predicted probability to measure data quality, which has been shown to lead to overconfidence (Zhang et al.,, 2021; Lyu et al.,, 2020), we follow Fomicheva et al., (2020) and use the entropy of the probability distribution over the entire vocabulary. Mathematically, given a translation with T𝑇Titalic_T tokens, we calculate the entropy of probability distribution over vocabulary V𝑉Vitalic_V for each generated token t{1,,T}𝑡1𝑇t\in\{1,\ldots,T\}italic_t ∈ { 1 , … , italic_T }, then average the token-level entropy scores to form an overall translation-level entropy score:

Translation Entropy=1Tt=1Tv=1Vp(ytv)logp(ytv))\text{Translation Entropy}=-\frac{1}{T}\sum_{t=1}^{T}\sum_{v=1}^{V}p(y_{t}^{v}% )\log{p(y_{t}^{v})})Translation Entropy = - divide start_ARG 1 end_ARG start_ARG italic_T end_ARG ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_v = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_V end_POSTSUPERSCRIPT italic_p ( italic_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_v end_POSTSUPERSCRIPT ) roman_log italic_p ( italic_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_v end_POSTSUPERSCRIPT ) ) (3)

where p(ytv)𝑝superscriptsubscript𝑦𝑡𝑣p(y_{t}^{v})italic_p ( italic_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_v end_POSTSUPERSCRIPT ) denotes the predicted probability of candidate token vV𝑣𝑉v\in Vitalic_v ∈ italic_V at position t𝑡titalic_t. A lower entropy score indicates a more confident translation.

We carry out regurgitative training by incorporating model-generated data ranked by their translation entropy scores, from low to high (equivalent to ranking data based on translation confidence, from high to low). Recall that we have trained both a low-performance baseline model and a high-performance baseline model and, accordingly, the quality-based regurgitative training is done for both models. The rest of the experiment settings, including the progressive training and evaluation procedure, are kept exactly the same as in Section 3.2. The results are displayed in Figure 9. Note that, unlike Figure 4, this figure does not contain results from adding data generated by high-performance model to train the low-performance model or vice versa. This is because it may not be reasonable to expect probability predictions to be readily available from a different model other than the baseline.

Refer to caption
Figure 9: Quality-Based Mitigation Strategy: Results on Transformer Models Trained from Scratch (two plots have different y𝑦yitalic_y-axis scales for better readability)

When the baseline model has low performance (right plot), we find that regurgitative training with quality-ranked data can mitigate performance loss to some extent, similar to what has been observed under the fine-tuning setting. The performance gain is especially evident when a relatively small amount of data (roughly one third of all model-generated data) are added. However, when the baseline model has high performance (left plot), we do not observe consistent performance benefits of the quality-based mitigation strategy. This is likely because the performance gap from training with real data is already small and performance fluctuations (e.g., due to randomness in data) may obfuscate a clear pattern of the quality-based mitigation strategy in this case.

Finally, we also apply the quality-based mitigation strategy on the Q&A task. For a given question, we analogously derive an entropy score for a model-generated answer to reflect the uncertainty of the probability distribution over all possible answers. For Q&A task, the transformer model generates a candidate answer by predicting the positions of a start token and an end token in a given passage that potentially contains the answer (Rajpurkar et al.,, 2016). The start / end tokens then jointly determine the answer text. Therefore, the probability score of a candidate answer is the product of the probabilities associated with the start and end tokens, and the entropy score can be computed as

Answer Entropy=a=1Ap(yastart)p(yaend)log{p(yastart)p(yaend)}Answer Entropysuperscriptsubscript𝑎1𝐴𝑝superscriptsubscript𝑦𝑎𝑠𝑡𝑎𝑟𝑡𝑝superscriptsubscript𝑦𝑎𝑒𝑛𝑑𝑝superscriptsubscript𝑦𝑎𝑠𝑡𝑎𝑟𝑡𝑝superscriptsubscript𝑦𝑎𝑒𝑛𝑑\text{Answer Entropy}=-\sum_{a=1}^{A}p(y_{a}^{start})p(y_{a}^{end})\log\left\{% p(y_{a}^{start})p(y_{a}^{end})\right\}Answer Entropy = - ∑ start_POSTSUBSCRIPT italic_a = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT italic_p ( italic_y start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s italic_t italic_a italic_r italic_t end_POSTSUPERSCRIPT ) italic_p ( italic_y start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_e italic_n italic_d end_POSTSUPERSCRIPT ) roman_log { italic_p ( italic_y start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s italic_t italic_a italic_r italic_t end_POSTSUPERSCRIPT ) italic_p ( italic_y start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_e italic_n italic_d end_POSTSUPERSCRIPT ) } (4)

where aA𝑎𝐴a\in Aitalic_a ∈ italic_A is a candidate answer; p(yastart)𝑝superscriptsubscript𝑦𝑎𝑠𝑡𝑎𝑟𝑡p(y_{a}^{start})italic_p ( italic_y start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s italic_t italic_a italic_r italic_t end_POSTSUPERSCRIPT ) and p(yaend)𝑝superscriptsubscript𝑦𝑎𝑒𝑛𝑑p(y_{a}^{end})italic_p ( italic_y start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_e italic_n italic_d end_POSTSUPERSCRIPT ) respectively denote the predicted probabilities of the start / end tokens that demarcate answer a𝑎aitalic_a. Same as before, a lower entropy corresponds to a more confident model-generated answer.

We then follow the same settings in Section 3.3 to conduct regurgitative training on both low-performance and high-performance baseline models, except that the model-generated data are added in the order of answer entropy (from low to high). At the article level, we prioritize articles based on the lowest entropy value among their constituent answers. The results are shown in Figure 10.

Refer to caption
Figure 10: Quality-Based Mitigation Strategy: Results on Q&A Task

We again observe some performance gains of regurgitative training with quality-ranked data, but the differences tend to be fairly small. To summarize, while the experiments in this section generally support the potential of the quality-based mitigation, we note that the performance gap from training with real data remains substantial. Put differently, the value of real data in training language models cannot be substituted by quality-aware regurgitative training.

5.2 Mitigation Strategy based on Data Mixture

As discussed in Section 4, a lack of lexical diversity in LLM-generated content may also contribute to the performance loss of regurgitative training. This observation prompts us to explore a mitigation strategy aimed at enhancing lexical diversity within LLM-generated data. Specifically, we propose mixing together data generated from multiple LLMs into the regurgitative training process. For example, LLMs developed by different companies, each potentially trained on somewhat distinct datasets, may consequently produce data with unique characteristics and nuances. Combining data from different “breeds” of LLMs can therefore introduce greater variability than relying on a single LLM.

We first test this data mixture strategy under LLM fine-tuning setting. With three LLMs, there are 3 possible mixture configurations: (i) mixing GPT-3.5 with GPT-4, (ii) mixing GPT-3.5 with LLAMA2, and (iii) mixing GPT-4 with LLAMA2. We consider configurations (i) and (iii) in particular. Configuration (iii) mixes two top-of-the-line LLMs developed by different companies and, based on our explorations in Section 4, they exhibit different lexical diversity. In other words, this configuration is a more direct evaluation of the proposed strategy. However, it is unfortunately confounded by the fact that LLAMA2-generated data have higher translation error rates than GPT-generated data, so the performance outcomes of this configuration cannot be solely attributed to changes in lexical diversity levels. In comparison, configuration (i) is less confounded because the two GPT models have similar translation quality, although their mixture also bring in less additional lexical variations. Taken together, we present results from both configurations to offer a more comprehensive evaluation of the data mixture strategy.

Furthermore, for each batch of 1,000 regurgitative training instances, there are two ways to add the mixture data. First, we can add (randomly selected) 500 instances from one LLM and 500 from the other LLM, thereby keeping the total training batch size unchanged. Alternatively, we can add all 1,000 instances from both models, which amounts to a training batch size of 2,000. Both are reasonable from a practical perspective, and we report both sets of results. After each batch of regurgitative training, we evaluate translation performance on the same testing data used in Section 3.1. Figure 11 shows results for GPT-4 / LLAMA2 mixture and Figure 12 shows results for the GPT-3.5 / GPT-4 mixture.

Refer to caption
Figure 11: Data Mixture-Based Mitigation Strategy: GPT-4 / LLAMA2 Mixture Results on LLM Fine-Tuning
Refer to caption
Figure 12: Data Mixture-Based Mitigation Strategy: GPT-3.5 / GPT-4 Mixture Results on LLM Fine-Tuning (plot on the right zooms in on GPT-related results for better readability)

Interestingly, mixing data generated by GPT-4 and LLAMA2 results in performance levels that fall in between using the two constituent LLMs alone, but mixing data generated by GPT-3.5 and GPT-4 can match the performance of GPT-4 (the better-performing LLM of the two) and even slightly outperform it when twice the regurgitative training data are added. These results suggest that the effectiveness of the data mixture strategy is nuanced. When constituent LLMs differ both in terms of quality and lexical diversity, mixing their data together may not lead to better regurgitative training performance, because the potential benefit of greater lexical diversity is offset by having a more error-prone LLM in the mix. In the same vein, if two LLMs with similar quality yet different lexical diversity can be identified, their mixture can indeed mitigate performance loss of regurgitative training to some extent.

We also deploy the same strategy on transformer models trained from scratch. With two baseline models, there is only one mixture configuration, namely mixing data generated by the low-performance model and high-performance model. However, these two models naturally differ on both quality and lexical diversity (just like the GPT-4 / LLAMA2 mixture). Therefore, we also consider a mixture of data generated by GPT-3.5 and high-performance model (to mimic the GPT-4 / GPT-3.5 mixture configuration). Here, GPT-3.5 serves as a high-quality translation model that is also very different from the model we trained from scratch. The results are shown in Figures 13 and 14.

Refer to caption
Figure 13: Data Mixture-Based Mitigation Strategy: High/Low Performance Model Mixture Results on Transformer Models Trained from Scratch (two plots have different y𝑦yitalic_y-axis scales for better readability)
Refer to caption
Figure 14: Data Mixture-Based Mitigation Strategy: High Performance Model / GPT-3.5 Mixture Results on Transformer Models Trained from Scratch (two plots have different y𝑦yitalic_y-axis scales for better readability)

We find highly consistent results as in the fine-tuning case. Mixing data generated by the two baseline models, which differ on quality, results in regurgitative training performance that falls between the two constituent models. Moreover, when training the high-performance baseline model with the data mixture (left plot of Figure 13), the performance follows a downward trend, likely because data generated by low-performance model substantially contaminate the quality of data mixture. Consequently, using twice the amount of mixture data in regurgitative training is worse in this case, as it further accelerates the performance decline. In contrast, mixing data generated by high-performance model and GPT-3.5, both of which have good quality, shows promising results. Regurgitative training with the data mixture matches the performance of training with real data when the same amount of data is used, and in fact outperforms it when twice the amount is used. The performance gain is especially evident when training the low-performance baseline model (right plot of Figure 14). Finally, we note that the data generated by high-performance model and GPT-3.5 have very similar quality (BLEU difference smaller than 0.3%). Therefore, the encouraging results from the high-performance model / GPT-3.5 mixture also lend additional support for our mechanism analyses in Section 4 – errors in generated data is not the only factor affecting regurgitative training performance, and other factors (such as lexical diversity) can be at play.

5.3 Mitigation Strategy based on AI Detection

The abundance of AI-generated content online has prompted academia and industry to develop various methods to distinguish between human- and AI-generated content. For instance, GPTZero is a leading AI detector used to identify whether a document was written by LLMs such as ChatGPT. This inspires us to consider using AI detection tools to mitigate the harm of regurgitative training. In particular, instead of trying to “catch” AI-generated content, we re-purpose AI detection tools to identify AI-generated content that closely resembles human-generated content. Then, we prioritize using AI-generated data that are indistinguishable from human-generated data (from the perspective of the AI detector) in regurgitative training. This mitigation strategy is essentially an “imitation” approach – regardless of why human-generated data are different from AI-generated data (error rates, lexical diversity, or other characteristics), AI-generated data that imitate human-generated data sufficiently well may be more advantageous for regurgitative training.

Starting from the fine-tuning setting, we train an AI detection classifier using randomly sampled 75,000 German-English sentence pairs (not previously used in regurgitative fine-tuning) and their corresponding translations by GPT-3.5, GPT-4, and LLAMA2. For each LLM, we construct a balanced labeled dataset with 150,000 instances, where each of the 75,000 German sentences has exactly one real human-generated translation and one LLM-generated translation. Each data instance is structured as input=(g1,g2,,gM,[SEP],e1,e2,,eN,[SEP]),label{0,1}formulae-sequence𝑖𝑛𝑝𝑢𝑡subscript𝑔1subscript𝑔2subscript𝑔𝑀delimited-[]𝑆𝐸𝑃subscript𝑒1subscript𝑒2subscript𝑒𝑁delimited-[]𝑆𝐸𝑃𝑙𝑎𝑏𝑒𝑙01input=(g_{1},g_{2},\ldots,g_{M},[SEP],e_{1},e_{2},\ldots,e_{N},[SEP]),label\in% \{0,1\}italic_i italic_n italic_p italic_u italic_t = ( italic_g start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_g start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_g start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT , [ italic_S italic_E italic_P ] , italic_e start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_e start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_e start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT , [ italic_S italic_E italic_P ] ) , italic_l italic_a italic_b italic_e italic_l ∈ { 0 , 1 } where label=0𝑙𝑎𝑏𝑒𝑙0label=0italic_l italic_a italic_b italic_e italic_l = 0 marks that the English translation is generated by an LLM and 1 otherwise. Same as how we have trained the BLEU prediction model in the previous section, we retrieve token embeddings from BERT. For each LLM, we train a separate classifier on 80% of labeled data and evaluate it on the remaining 20%. Performance scores of various supervised techniques are listed in Appendix C. The Linear Discriminant Analysis (LDA) turns out to have highest predictive performance for both GPT-3.5 and GPT-4, and Logistic Regression has the best performance for LAMMA2.

Next, we apply the best-performing AI detection classifier for each of the three LLMs on the 5,000 LLM-generated translations used for regurgitative fine-tuning. Because we know the translations are generated by LLMs, if a translation receives a higher class 1 predicted probability from the AI detection classifier, then it has a greater resemblance to real translation. Therefore, we carry out regurgitative training by adding LLM-generated data in the order of their class 1 predicted probabilities, from high to low. Other experiment settings are kept the same as in Section 3.1, and the results are shown in Figure 15.

Refer to caption
Figure 15: AI Detection-Based Mitigation Strategy: Results on LLM Fine-Tuning (plot on the right zooms in on GPT-related results for better readability)

We can see that regurgitative training in the order of resemblance with real data can indeed mitigate the performance loss for LLAMA2 and GPT-3.5, and the performance gain is larger for LLAMA2. This strategy, however, does not seem to be effective for GPT-4. We suspect such variations in mitigation effectiveness has to do with the capability of AI detection classifier – indeed, AI detection is most accurate for LLAMA2-generated data and least accurate for GPT-4-generated data (see Appendix C).

Next, we apply the same mitigation strategy on transformer models trained from scratch. Performance evaluations of various AI detection classifiers are again listed in Appendix C. We choose a Logistic Regression classifier for the high-performance baseline model and a LDA classifier for the low-performance baseline model. The regurgitative training results, with data ranked by class 1 predicted probabilities, are shown in Figure 16.

Refer to caption
Figure 16: AI Detection-Based Mitigation Strategy: Results on Training Transformer Models from Scratch (two plots have different y𝑦yitalic_y-axis scales for better readability)

The results confirm the effectiveness of AI detection-based mitigation strategy for regurgitative training with transformers trained from scratch. Notably, different from the finding in quality-based mitigation, we observe performance gain for the high-performance baseline model as well (left plot). In fact, regurgitative training with ranked data keeps up with, and even slightly outperforms, training with real data for more than 20 batches.

These encouraging results highlight the promising utility of AI detection outside of its conventional use case. Besides identifying AI-generated content, a capable AI detector can also be re-purposed to guide more meaningful regurgitative training. Meanwhile, we acknowledge that this mitigation strategy clearly does not have unlimited capacity. Under both the fine-tuning setting and with a low-performance transformer model trained from scratch, even the AI detection-enhanced regurgitative training cannot catch up with the performance of training with real data.

6 Discussions

In 1950, Alan Turing envisioned the “Imitation Game” (later termed the “Turing test”) as a test of intelligence, where a machine is treated as exhibiting intelligence if a questioner cannot reliably differentiate conversations generated by the machine or by a human being (Turing,, 1950). Now, popular LLMs on the market possess astounding capabilities to generate coherent texts and mirror human thoughts, leading some to argue that Turing test is no longer appropriate or sufficient to assess artificial intelligence (Sejnowski,, 2023; Biever,, 2023). If LLMs can already generate human-like content, a natural question to ask is whether they can effectively generate new data to keep training themselves.

Our analyses in this paper give a negative answer to this question. Training a new LLM using data generated (at least partially) by other LLMs, a process we refer to as regurgitative training, generally results in lower performance than training with real data. While performance loss of regurgitative training has been documented in Shumailov et al., (2023) with early versions of generative models (non-transformer-based models or pre-trained models before GPT-3.5), our work provides more comprehensive evidence by both fine-tuning commercial LLMs (including GPT-3.5, GPT-4, and LLAMA2) and training small-scale transformer models from scratch. Our explorations reveal more nuanced performance effects of regurgitative training. Under both fine-tuning and training-from-scratch settings, regurgitative training with data generated by a competent model may still improve performance to a small extent and at a much lower speed / magnitude than training with real data, whereas regurgitative training with data generated by a poor model hurts performance. These effects manifest even when only a small proportion of training data are synthetic. Even in the rare case where regurgitative training outperforms training with real data (i.e., training low-performance model with high-performance model generated data, Figure 4), such advantage disappears after a large amount of synthetic data is used.

To make sense of the overall negative performance impact of regurgitative training, we compare the textual data generated by LLMs vs. humans. In the context of machine translation, we find supporting evidence that LLM-generated data not only contain more translation errors but also lower lexical diversity, both of which may contribute to the performance disadvantages. These findings align with multiple recent research that documents a “diversity shortage” of LLM-generated content (e.g., Padmakumar and He,, 2023; Doshi and Hauser,, 2023; Anderson et al.,, 2024; Zhou and Lee,, 2024), and associate it with the performance loss of regurgitative training.

These explorations of underlying mechanisms also produce potential strategies to mitigate performance loss of regurgitative training. In total, we propose and test three different strategies, respectively designed to address the issues of data quality / error, lack of lexical diversity, and both. The quality-based mitigation strategy prioritizes the use of high-quality data for regurgitative training, where “quality” can be gauged either by prediction confidence or via a supervised learning approach. The data mixture strategy seeks to enhance lexical diversity by mixing together data generated from different LLMs. The AI detection-based strategy re-purposes an AI detection classifier to identify LLM-generate data that resemble real data, then prioritize their use in regurgitative training. While all three strategies can reduce performance loss to some extent, their relative effectiveness also demonstrates some interesting nuances. First, they tend to be more effective on transformer models trained from scratch than fine-tuned LLMs. Under the fine-tuning setting, none of the migration strategies can bridge the performance gap between regurgitative training and training with real data; in contrast, data mixture and AI detection based strategies can outperform training with real data on models trained from scratch. Second, the data mixture strategy does not perform well if constituent LLMs differ both in terms of quality and lexical diversity – the drop in data quality offsets the benefits of increased diversity. Instead, this strategy performs much better if constituent LLMs have comparable quality but still contribute diversity benefits (e.g., using competent models trained on different data or architectures). Finally, success of the AI detection strategy hinges on the ability to differentiate LLM- vs. human-generated data. Greater capable in AI detection generally results in better regurgitative training performance.

Several implications for both researchers and practitioners working with LLMs are worth noting. First, we urge caution when utilizing synthetic LLM-generated data when training or fine-tuning LLMs. Despite the multitude of amazing capabilities of LLMs, at their current stage, regurgitative training cannot create sustained performance improvement. Therefore, datasets that are organically generated and carefully curated (such as the Europarl corpus for machine translation and SQuAD corpus for Q&A) remain part of the core assets of LLM development. Second, the prevalent use of LLMs implies that LLM-generated data would likely take up a non-trivial proportion of online content in the near future, and some degree of regurgitative training may be unavoidable. Recognizing this trend, we advocate for more careful use of LLM-generated data. Our results suggest that “data quality” trumps “data quantity” in regurgitative training – it is generally more advantageous to use data with higher prediction confidence, greater linguistic richness, and higher resemblance to real data than merely using a larger quantity of data with questionable quality. Finally, the baseline performance of the model being regurgitatively trained also matters. All else being equal, a more capable baseline model suffers less performance loss due to regurgitative training. Therefore, it is important for researchers and businesses to first thoroughly train their baseline models before engaging in regurgitative training, which helps control the adverse impact of regurgitative training.

Our work also opens up a few interesting future research directions. Capabilities and performance of modern LLMs are constantly evolving. Is the negative performance impact of regurgitative training just a transient pattern reflecting limitations of available LLMs (which may disappear as more powerful LLMs are created in the future), or is it a fundamental issue of current paradigm of generative AI? Existing work such as Shumailov et al., (2023) and Gerstgrasser et al., (2024) attempt to answer this question by resorting to analyzing simplified models (e.g., one-dimensional Gaussian processes). Future work can employ more advanced theoretical frameworks to derive deeper understandings. Moreover, effective regurgitative training with synthetic data represents an emerging field of increasing importance. Our proposed mitigation strategies are only the first steps rather than final words, and we encourage future work to design more potent strategies that can be adopted in practice. We believe more productive use of synthetic data for LLM training requires both theoretical understanding of its performance upper bound, as well as practical algorithms and methods to achieve what is possible.

References

  • Achiam et al., (2023) Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  • Alemohammad et al., (2023) Alemohammad, S., Casco-Rodriguez, J., Luzi, L., Humayun, A. I., Babaei, H., LeJeune, D., Siahkoohi, A., and Baraniuk, R. G. (2023). Self-consuming generative models go mad. arXiv preprint arXiv:2307.01850.
  • Anderson et al., (2024) Anderson, B. R., Shah, J. H., and Kreminski, M. (2024). Homogenization effects of large language models on human creative ideation. arXiv preprint arXiv:2402.01536.
  • Anthropic, (2024) Anthropic, A. (2024). The claude 3 model family: Opus, sonnet, haiku. Claude-3 Model Card.
  • Bertrand et al., (2023) Bertrand, Q., Bose, A. J., Duplessis, A., Jiralerspong, M., and Gidel, G. (2023). On the stability of iterative retraining of generative models on their own data. arXiv preprint arXiv:2310.00429.
  • Biever, (2023) Biever, C. (2023). Chatgpt broke the turing test-the race is on for new ways to assess ai. Nature, 619(7971):686–689.
  • Bran et al., (2023) Bran, A. M., Cox, S., Schilter, O., Baldassari, C., White, A. D., and Schwaller, P. (2023). Chemcrow: Augmenting large-language models with chemistry tools. arXiv preprint arXiv:2304.05376.
  • Brown et al., (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  • Chen and Mueller, (2023) Chen, J. and Mueller, J. (2023). Quantifying uncertainty in answers from any language model via intrinsic and extrinsic confidence assessment. arXiv preprint arXiv:2308.16175.
  • Chen et al., (2021) Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
  • Chen and Chan, (2023) Chen, Z. and Chan, J. (2023). Large language model in creative work: The role of collaboration modality and user expertise. Available at SSRN 4575598.
  • Chowdhury et al., (2021) Chowdhury, S., Baili, N., and Vannah, B. (2021). Ensemble fine-tuned mbert for translation quality estimation. arXiv preprint arXiv:2109.03914.
  • Cubuk et al., (2020) Cubuk, E. D., Zoph, B., Shlens, J., and Le, Q. V. (2020). Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 702–703.
  • Devlin et al., (2018) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Doshi and Hauser, (2023) Doshi, A. R. and Hauser, O. (2023). Generative artificial intelligence enhances creativity. Available at SSRN.
  • Fomicheva et al., (2020) Fomicheva, M., Sun, S., Yankovskaya, L., Blain, F., Guzmán, F., Fishel, M., Aletras, N., Chaudhary, V., and Specia, L. (2020). Unsupervised quality estimation for neural machine translation. Transactions of the Association for Computational Linguistics, 8:539–555.
  • Gerstgrasser et al., (2024) Gerstgrasser, M., Schaeffer, R., Dey, A., Rafailov, R., Sleight, H., Hughes, J., Korbak, T., Agrawal, R., Pai, D., Gromov, A., et al. (2024). Is model collapse inevitable? breaking the curse of recursion by accumulating real and synthetic data. arXiv preprint arXiv:2404.01413.
  • Gong et al., (2019) Gong, Z., Zhong, P., and Hu, W. (2019). Diversity in machine learning. Ieee Access, 7:64323–64350.
  • Koehn, (2005) Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X: Papers, pages 79–86, Phuket, Thailand.
  • Kotelanski et al., (2023) Kotelanski, M., Gallo, R., Nayak, A., and Savage, T. (2023). Methods to estimate large language model confidence. arXiv preprint arXiv:2312.03733.
  • Lewis et al., (2020) Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., et al. (2020). Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474.
  • Lin et al., (2023) Lin, Z., Trivedi, S., and Sun, J. (2023). Generating with confidence: Uncertainty quantification for black-box large language models. arXiv preprint arXiv:2305.19187.
  • Lyu et al., (2020) Lyu, Z., Duolikun, D., Dai, B., Yao, Y., Minervini, P., Xiao, T. Z., and Gal, Y. (2020). You need only uncertain answers: Data efficient multilingual question answering. TWorkshop on Uncertainty and Ro-Bustness in Deep Learning.
  • McKinzie et al., (2024) McKinzie, B., Gan, Z., Fauconnier, J.-P., Dodge, S., Zhang, B., Dufter, P., Shah, D., Du, X., Peng, F., Weers, F., et al. (2024). Mm1: Methods, analysis & insights from multimodal llm pre-training. arXiv preprint arXiv:2403.09611.
  • Miller, (1995) Miller, G. A. (1995). Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41.
  • Nigam and Ghani, (2000) Nigam, K. and Ghani, R. (2000). Analyzing the effectiveness and applicability of co-training. In Proceedings of the ninth international conference on Information and knowledge management, pages 86–93.
  • Noy and Zhang, (2023) Noy, S. and Zhang, W. (2023). Experimental evidence on the productivity effects of generative artificial intelligence. Science, 381(6654):187–192.
  • Ouyang et al., (2022) Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744.
  • Padmakumar and He, (2023) Padmakumar, V. and He, H. (2023). Does writing with language models reduce content diversity? arXiv preprint arXiv:2309.05196.
  • Papineni et al., (2002) Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.
  • Pise and Kulkarni, (2008) Pise, N. N. and Kulkarni, P. (2008). A survey of semi-supervised learning methods. In 2008 International conference on computational intelligence and security, volume 2, pages 30–34. IEEE.
  • Rajpurkar et al., (2016) Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
  • Rawte et al., (2023) Rawte, V., Sheth, A., and Das, A. (2023). A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922.
  • Reimers and Gurevych, (2019) Reimers, N. and Gurevych, I. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
  • Scudder, (1965) Scudder, H. (1965). Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory, 11(3):363–371.
  • Sejnowski, (2023) Sejnowski, T. J. (2023). Large language models and the reverse turing test. Neural computation, 35(3):309–342.
  • Shankar et al., (2024) Shankar, S., Zamfirescu-Pereira, J., Hartmann, B., Parameswaran, A. G., and Arawjo, I. (2024). Who validates the validators? aligning llm-assisted evaluation of llm outputs with human preferences. arXiv preprint arXiv:2404.12272.
  • Shumailov et al., (2023) Shumailov, I., Shumaylov, Z., Zhao, Y., Gal, Y., Papernot, N., and Anderson, R. (2023). The curse of recursion: Training on generated data makes models forget. arXiv preprint arXiv:2305.17493.
  • Thompson et al., (2024) Thompson, B., Dhaliwal, M. P., Frisch, P., Domhan, T., and Federico, M. (2024). A shocking amount of the web is machine translated: Insights from multi-way parallelism. arXiv preprint arXiv:2401.05749.
  • Touvron et al., (2023) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  • Triguero et al., (2015) Triguero, I., García, S., and Herrera, F. (2015). Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. Knowledge and Information systems, 42:245–284.
  • Turing, (1950) Turing, A. M. (1950). I.—COMPUTING MACHINERY AND INTELLIGENCE. Mind, LIX(236):433–460.
  • Vaswani et al., (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
  • Vert, (2023) Vert, J.-P. (2023). How will generative ai disrupt data science in drug discovery? Nature Biotechnology, 41(6):750–751.
  • Veselovsky et al., (2023) Veselovsky, V., Ribeiro, M. H., and West, R. (2023). Artificial artificial artificial intelligence: Crowd workers widely use large language models for text production tasks. arXiv preprint arXiv:2306.07899.
  • Wolf et al., (2020) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al. (2020). Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pages 38–45.
  • (47) Xie, Q., Dai, Z., Hovy, E., Luong, T., and Le, Q. (2020a). Unsupervised data augmentation for consistency training. Advances in neural information processing systems, 33:6256–6268.
  • (48) Xie, Q., Luong, M.-T., Hovy, E., and Le, Q. V. (2020b). Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10687–10698.
  • Yang et al., (2023) Yang, X., Pan, L., Zhao, X., Chen, H., Petzold, L., Wang, W. Y., and Cheng, W. (2023). A survey on detection of llms-generated content. arXiv preprint arXiv:2310.15654.
  • Yu et al., (2018) Yu, A. W., Dohan, D., Luong, M.-T., Zhao, R., Chen, K., Norouzi, M., and Le, Q. V. (2018). Qanet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541.
  • Zhang et al., (2021) Zhang, S., Gong, C., and Choi, E. (2021). Knowing more about questions can help: Improving calibration in question answering. arXiv preprint arXiv:2106.01494.
  • Zhou and Lee, (2024) Zhou, E. and Lee, D. (2024). Generative artificial intelligence, human creativity, and art. PNAS nexus, 3(3):pgae052.
  • Zhu et al., (2018) Zhu, Y., Lu, S., Zheng, L., Guo, J., Zhang, W., Wang, J., and Yu, Y. (2018). Texygen: A benchmarking platform for text generation models. In The 41st international ACM SIGIR conference on research & development in information retrieval, pages 1097–1100.

Appendix A Identifying Low-/High-Performance Models in Q&A Task

We partition the articles in the training set of SQuAD randomly into 11 batches of data (each containing 40 articles). We incrementally add training data, one batch at a time, and evaluate the model performance on the testing data. Figure 17 shows the performance in terms of exact match and average F-1 score. Based on these results, we choose the model trained with one batch of data as the low-performance baseline, and the model trained with five batches of data as the high-performance baseline.

Refer to caption
Figure 17: Performance of Transformer Models on Q&A Task with Varying Training Data Sizes

Appendix B BLEU Prediction Performance in Quality-Based Mitigation Strategy

The following Tables 3-5 summarize the BLEU prediction performance of various supervised techniques for translations generated by GPT-3.5, GPT-4, and LLAMA2, respectively.

Table 3: Performance of BLEU Prediction Models with GPT-3.5 Data
Model MAE MSE RMSE
Bayesian Ridge 0.1602 0.0415 0.2038
Ridge Regression 0.1601 0.0416 0.2038
Linear Regression 0.1602 0.0416 0.2039
Light Gradient Boosting Machine 0.1608 0.0417 0.2043
Orthogonal Matching Pursuit 0.1632 0.0428 0.2069
Extra Trees Regressor 0.1613 0.0421 0.2051
Gradient Boosting Regressor 0.1636 0.0427 0.2067
Extreme Gradient Boosting 0.1617 0.0430 0.2074
Table 4: Performance of BLEU Prediction Models with GPT-4 Data
Model MAE MSE RMSE
Bayesian Ridge 0.1627 0.0429 0.2072
Ridge Regression 0.1627 0.0430 0.2072
Linear Regression 0.1627 0.0430 0.2073
Light Gradient Boosting Machine 0.1631 0.0430 0.2073
Orthogonal Matching Pursuit 0.1657 0.0442 0.2103
Extra Trees Regressor 0.1636 0.0434 0.2083
Gradient Boosting Regressor 0.1659 0.0440 0.2098
Extreme Gradient Boosting 0.1643 0.0445 0.2109
Table 5: Performance of BLEU Prediction Models with LLAMA2 Data
Model MAE MSE RMSE
Bayesian Ridge 0.1350 0.0301 0.1735
Ridge Regression 0.1349 0.0301 0.1735
Linear Regression 0.1349 0.0301 0.1736
Light Gradient Boosting Machine 0.1374 0.0306 0.1749
Orthogonal Matching Pursuit 0.1381 0.0311 0.1764
Extra Trees Regressor 0.1397 0.0314 0.1771
Gradient Boosting Regressor 0.1400 0.0314 0.1772
Extreme Gradient Boosting 0.1371 0.0314 0.1772

Appendix C Performance Evaluation of AI Detection Classifiers

The following Tables 6-8 summarize the classification performance of AI detection models for LLM-generated translations. Tables 9-10 summarize the classification performance of AI detection models for translations generated by low-/high-performance transformer models.

Table 6: Performance of AI Detection Classifiers on GPT-3.5 Data
Model Accuracy AUC Recall Precision
Logistic Regression 0.6794 0.7468 0.6774 0.6796
Linear Discriminant Analysis 0.6800 0.7471 0.6777 0.6803
Extreme Gradient Boosting 0.6531 0.7183 0.6495 0.6538
Light Gradient Boosting Machine 0.6564 0.7206 0.6465 0.6590
Random Forest 0.6328 0.6897 0.6197 0.6359
Ada Boost 0.6169 0.6665 0.6214 0.6153
K Nearest Neighbors 0.6010 0.6433 0.6248 0.5959
Naive Bayes 0.5879 0.6268 0.6188 0.5823
Table 7: Performance of AI Detection Classifiers on GPT-4 Data
Model Accuracy AUC Recall Precision
Logistic Regression 0.6778 0.7429 0.6743 0.6791
Linear Discriminant Analysis 0.6785 0.7446 0.6751 0.6798
Extreme Gradient Boosting 0.6516 0.7158 0.6460 0.6533
Light Gradient Boosting Machine 0.6535 0.7176 0.6419 0.6572
Random Forest 0.6301 0.6851 0.6107 0.6353
Ada Boost 0.6152 0.6618 0.6094 0.6166
K Nearest Neighbors 0.5983 0.6380 0.6289 0.5926
Naive Bayes 0.5870 0.6250 0.6050 0.5840
Table 8: Performance of AI Detection Classifiers on LLAMA2 Data
Model Accuracy AUC Recall Precision
Logistic Regression 0.7316 0.8095 0.7394 0.7276
Linear Discriminant Analysis 0.7303 0.8087 0.7402 0.7254
Extreme Gradient Boosting 0.6987 0.7750 0.7068 0.6951
Light Gradient Boosting Machine 0.6925 0.7659 0.7103 0.6854
Random Forest 0.6628 0.7263 0.6810 0.6566
Ada Boost 0.6469 0.7034 0.6605 0.6425
K Nearest Neighbors 0.6347 0.6837 0.6507 0.6301
Naive Bayes 0.6104 0.6550 0.6631 0.5994
Table 9: Performance of AI Detection Classifiers on High-Performance Transformer Data
Model Accuracy AUC Recall Precision
Logistic Regression 0.8384 0.9209 0.8606 0.8237
Linear Discriminant Analysis 0.8343 0.9172 0.8726 0.8103
Extreme Gradient Boosting 0.7952 0.8853 0.8212 0.7803
Light Gradient Boosting Machine 0.7824 0.8707 0.8255 0.7597
Random Forest 0.7476 0.8306 0.7754 0.7343
Ada Boost 0.7260 0.8064 0.7336 0.7221
K Nearest Neighbors 0.6906 0.7607 0.6598 0.7025
Naive Bayes 0.6594 0.7189 0.6778 0.6533
Table 10: Performance of AI Detection Classifiers on Low-Performance Transformer Data
Model Accuracy AUC Recall Precision
Logistic Regression 0.6820 0.7528 0.7036 0.6740
Linear Discriminant Analysis 0.6825 0.7526 0.7093 0.6727
Extreme Gradient Boosting 0.6512 0.7159 0.6644 0.6469
Light Gradient Boosting Machine 0.6557 0.7200 0.6759 0.6491
Random Forest 0.6287 0.6848 0.6268 0.6287
Ada Boost 0.6227 0.6736 0.6359 0.6190
K Nearest Neighbors 0.5923 0.6265 0.5972 0.5909
Naive Bayes 0.5919 0.6303 0.6389 0.5835