Learning from Naturally Occurring Feedback

Shachar Don-Yehiya¹ Leshem Choshen ^2,3 Omri Abend¹
¹The Hebrew University of Jerusalem, ²MIT, ³MIT-IBM Watson AI Lab
{first.last}@mail.huji.ac.il

Abstract

Human feedback data is a critical component in developing language models. However, collecting this feedback is costly and ultimately not scalable. We propose a scalable method for extracting feedback that users naturally include when interacting with chat models, and leveraging it for model training. We are further motivated by previous work that showed there are also qualitative advantages to using naturalistic (rather than auto-generated) feedback, such as less hallucinations and biases. We manually annotated conversation data to confirm the presence of naturally occurring feedback in a standard corpus, finding that as much as $\sim$ 30% of the chats include explicit feedback. We apply our method to over 1M conversations to obtain hundreds of thousands of feedback samples. Training with the extracted feedback shows significant performance improvements over baseline models, demonstrating the efficacy of our approach in enhancing model alignment to human preferences.

Shachar Don-Yehiya¹ Leshem Choshen ^2,3 Omri Abend¹ ¹The Hebrew University of Jerusalem, ²MIT, ³MIT-IBM Watson AI Lab {first.last}@mail.huji.ac.il

Refer to caption — Figure 1: Naturally occurring feedback example. The user asks the model a question, and responds to its answer with an “Ask For Clarification” feedback and later with “Positive Feedback”.

1 Introduction

Human feedback is a valuable resource for model development. The current standard model training process includes a pretraining phase (Radford et al., 2019), followed by an alignment phase, where the model is usually fine-tuned and trained with reinforcement learning on human preference data (Bai et al., 2022), often iteratively (Touvron et al., 2023). The more data at hand, the better the model (Kaplan et al., 2020; Roberts et al., 2023). However, collecting such data usually requires costly human labor, limiting its scalability.

Humans nevertheless do not need commentators to know that their conversation partner is satisfied. Rather, they infer it from the communication itself. We suggest employing a similar rationale with language models (LMs), and extract natural human feedback (see Fig. 1). While naturalistic feedback can occur in many forms (e.g., the user continues to the next question once they are satisfied with the model’s response), in this work we focus on unambiguous and explicit cues, such as when the user directly refers to the quality of the model’s response (e.g., “thank you!”, or “that’s wrong”) or rephrases and asks the same question again (cf. §3.1).

With the introduction of general assistant models like ChatGPT (OpenAI et al., 2024) and OpenAsistant (Köpf et al., 2024), human-model interactions have become very prominent, not only among machine learning experts but also among the general public. Thus, there are huge amounts of data potentially available.

Unlike model as a judge methods (Liu et al., 2023b; Zheng et al., 2023b), naturally occurring feedback is anchored in the human response, and therefore is less prone to “hallucinations” (Lewis et al., 2021) and biases (Saito et al., 2023), and easier to explain and verify.

Another advantage of this approach is that this form of feedback is potentially closer to the feedback given by two human interlocutors (Bassiri, 2011; Werts et al., 1995), possibly containing relevant information for better alignment.

We manually annotate and show that naturally occurring feedback is indeed prevalent in conversation data (§3.2). Furthermore, we find that naturally occurring feedback is more common in recently collected data than in older data, possibly due to users raising their expectations and being able to conduct a more “human-like” conversation with the model (§4). This further underscores the importance of ever-growing data resources, over static datasets. Models keep improving and therefore the data used to align them should evolve too.

We introduce a method to automatically extract the naturally occurring feedback from human-model interactions (§5). We validate our method, both quantitatively and qualitatively, finding it manages to correctly extract the feedback to a reasonable degree. We use our method to obtain over $170k$ feedback samples from 1M non-annotated human-model conversations. We release it as a dataset (§5.3).¹¹1Code and data: https://github.com/shachardon/naturally_occurring_feedback, https://huggingface.co/datasets/shachardon/naturally_occurring_feedback

We use the extracted data to train a model to better align with human preferences. Our model demonstrated superior performance, outperforming the pretrained model in up to $78\%$ of the test cases (§6).

2 Background

To compile a preference dataset, human annotators are asked to rank/score the generated responses of large language models (LLMs) at the time of the interaction (Chiang et al., 2024), or in retrospect (Bai et al., 2022; Ethayarajh et al., 2022). To save this costly human effort, sometimes other models are doing the ranking (Cui et al., 2023; Lee et al., 2023; Zhu et al., 2023) at the expense of introducing noise and biases (Zheng et al., 2024). For example, it was shown that LLMs tend to prefer longer responses regardless of their quality (Saito et al., 2023).

Another line of work collects data samples online during the interaction, by eliciting free-text feedback from the user. This feedback is then used in various ways for training (Shi et al., 2022; Jin et al., 2023; Scheurer et al., 2022). Hancock et al. (2019) suggested estimating user satisfaction and only if it is low, to elicit feedback from the users.

We focus on naturally occurring feedback, i.e., spontaneous unsolicited feedback. When two humans talk, they do not score each other’s responses nor explicitly ask for feedback (at least not often). Rather, the interlocutors actively signal their understanding and agreement through the use of verbal and visual responses, such as “hmm”, “yeah” or facial expressions, head nods, etc. (Vranjes et al., 2018; Bavelas and Gerwing, 2011).

We show that also in a human-model textual conversation, such feedback signals exist. Finding them, ideally automatically, will allow us to extract freely annotated training examples. Employing such extraction on an endless stream of new conversation data (Don-Yehiya et al., 2023b) has the potential to grow and improve unboundedly.

3 Naturally Occurring Feedback

We begin by defining a taxonomy for naturally occurring feedback. We then manually annotate conversations to account for the statistics of such feedback types in conversations.

Throughout our discussion when we consider feedback we refer to (a part of) a human response that refers to (a part of) the last model’s response.

3.1 Feedback Taxonomy

We define the following categories, split into four negative feedback categories and one positive:

1.

Repeat or Rephrase (rephrase): The user repeats or rephrases their last response, explaining again what they wants.
2.

Make Aware with Correction (aware + correct): The user points to the model that it was wrong, and provides information regarding the error/how to fix it. E.g., No, I wanted…
3.

Make Aware Without Correction (aware -correct): The user points to the model that it was wrong, without providing any additional information. E.g., That’s incorrect
4.

Ask for Clarification (clarify): The user asks for additional resolution that was expected to be in the the previous response, but was missing. E.g., Was it like that?
5.

Positive Feedback (positive): The user confirms that the model did a good job, possibly thanking it. E.g., Thank you!

We now turn to motivating this set of categories. The two main design features are simplicity and text-anchoredness, i.e., the feedback should be directly and explicitly derived from the text, without requiring complex subjective interpretation.

Following this line, the feedback type that appears the most explicitly in the text is probably “Positive Feedback”. Although we found it to be less common (see §5.3), positive feedback can usually be recognized at the vocabulary level. The user thanks the model for its response (e.g., ty), says it did a good job (e.g., great!) or that it was right (e.g., that’s correct).

The negative feedback cases, on the other hand, are much more diverse. There are vocabulary-level feedback cases (that’s wrong), but also more semantically complex instances ("actually, I was asking about…"). Thus, using the feedback patterns from Petrak et al. (2023), we break the negative feedback cases into finer categories to avoid too general a definition. Also, more detailed categories provide additional information that can be used later for better training/inference.

We found the “Ask for Clarification” category to be somewhat in the middle in terms of sentiment and feedback nature between the Positive Feedback and the rest of the negative categories. The user asks for more information or confirmation, indicating that the model’s response was in the right direction, so not entirely wrong, but still provides some subtle feedback. This category is very common (see §5), and we expect these cases to be even more frequent as models improve.

Another distinction we found useful is between “Make Aware with Correction” and “Make Aware without Correction”. The first holds clear potential for training/inference, as the user provides information regarding the required fix. The latter is less useful, but still can be used as a strictly negative example (in contrast to Ask for Clarification).

’Repeat or Rephrase’ is unique compared to the other feedback forms as it does not leverage the model’s ability to process multi-turn interactions. Instead, the user ignores the previous response and rephrases again what they want, as if it was the beginning of the conversation. Assuming that the following model’s response would be better, the two one-turn user-model interactions can be used as a preference pair for training. However, it is important to note that the context of the full conversation is crucial to recognize this feedback form.

In their taxonomy, Petrak et al. (2023) also have an “Ignore and Continue” category, where the user ignores an error. We leave it out as it does not contain feedback, but rather implies that none was left despite an error. It is only meaningful when accompanied by an annotated error in the previous model response, which we do not have in our setting.

3.2 Manual Annotation

To get an initial impression of the distribution of categories in this taxonomy, one of the authors manually annotated the first $300$ conversations from the LMSYS-Chat-1M dataset (Zheng et al., 2023a, see §5.1). After filtering out non-English conversations and offensive/unsettling conversations, we were left with $223$ conversations. We find $77$ conversations with a total of $101$ feedback cases: $37$ Repeat or Rephrase, $18$ Make Aware with Correction, $13$ Make Aware without Correction, $24$ Ask for Clarification and $9$ Positive Feedback (see Fig. 2). The fact that $\sim 30\%$ of conversations include feedback is an encouraging result. As the percentage is so high it is likely that simple methods would already suffice to extract notable amounts of feedback data or easily filter for specific quality data.

To validate our manual annotation, we ask an in-house annotator to re-annotate the first $100$ conversations, of which $68$ pass the filtering. We get a Cohen’s kappa of $0.65$ for the binary task of feedback recognition. Of the feedback cases that both annotators agreed upon, they also agreed on the category in $0.79$ of the cases.

4 Up-to-Date Feedback

Comparing the models of two years ago with those of today seems like comparing apples to oranges, and even at a shorter time scale, the state-of-the-art advances rapidly (Beeching et al., 2023). In the interim, as models get better, users expect more. Users use the models for new scenarios that were not possible before (Zhao et al., 2024) and do so in a more natural way (except in extreme cases Don-Yehiya et al. (2023a)). We expect that with more fluent and diverse conversations comes more feedback.

We measure that empirically by comparing the new annotation of current models to annotation efforts of earlier models. Out of the six datasets that were annotated by Petrak et al. (2023), only the Self-Feeding Chatbot dataset (Hancock et al., 2019) is both human-model and open domain, and thus comparable. The Self-Feeding dataset was created in $2019$ , and so is the model that was used to generate it. Only $11$ feedback instances were found within a random sample of $100$ conversations. This is less than half the feedback frequency found in the newer LMSYS-Chat-1M dataset (omitting the positive feedback category as it was introduced by us). We note that there are $48$ annotated errors in the $100$ Self-Feeding dataset sample, and hence it is unlikely that it was a lack of errors that caused the users to give less feedback.

Our findings suggest that more than ever, naturally occurring feedback can serve as a valuable resource for feedback data. We believe in the future not only would models be better, continuing the above trend, but natural feedback itself may become a known resource, one which users expect the models to use.

5 Automatically Extracting Feedback

Given that natural feedback is already present in current datasets and ongoing human-model conversations, we propose a method to automatically extract this feedback.

Based on the five feedback categories (§3.1), we instruct an LLM to recognize spans – part of the human responses that contain feedback in a given conversation and classify them. We then use a Python script to parse the generated response and extract all the feedback instances. We discuss the implementation details next.

5.1 Extraction Implementation Details

Data.

We use the LMSYS-Chat-1M dataset (Zheng et al., 2023a), a collection of real-world conversations with 25 state-of-the-art LLMs. We select this dataset for its size and variety of state-of-the-art models and conversation topics. We filter out conversations with less than two turns, as there is no human feedback in a one turn conversation (one user query followed by one model response).

Model.

We use Mixtral-8x7B-Instruct-v0.1 (Jiang et al., 2024) with 4-bit quantization to fit our GPUs. See App.§A for the full generation parameters. During development, we also experimented with Yi-34B-Chat (Young et al., 2024) and GPT-3.5, but found that Mixtral surpasses them. It is reasonable to assume that the non-quantized version of the model or other stronger models would allow more accurate feedback extraction. Our experiments here are intended as a POC, where the model and other aspects described below can be substantially improved if necessary for practical uses (see §Limitations).

Prompt.

After experimenting with a couple of versions, we found the prompt in Fig 3 to perform best. One key point is asking the model to provide its output in JSON format, to make parsing easier. Using few-shot examples seems to confuse the model, probably because of the length of the conversations and the difficulty to separate different conversations.

Parsing.

If the generated text contains the prompt, we delete the prompt. We then extract all JSON objects and confirm they contain the “User Response Pattern” and “User Response Text” fields. For each of the JSON objects we verify that the “User Response Text” is indeed contained in one of the user responses and that the category is valid (one of the 5 possibilities). If any of these do not hold, we discard the example.

5.2 Extraction Evaluation

To evaluate model performance in correctly identifying and classifying the naturally occurring feedback, we use the 300 manually annotated conversations from §3.2. We analyze the two abilities separately; extraction of feedback span and classifying it into the correct category. A feedback span is considered correct if it is a sub-string of a manually annotated feedback span and is at least half as long. We report both precision and recall. We define the span-precision as the percentage of the correctly identified feedbacks out of the total number of identified feedbacks. Correspondingly, the span-recall is the ratio of those correctly identified feedbacks, but divided by the total number of manually annotated feedbacks. We define the category precision/recall as the number of feedbacks that were both identified and classified correctly to the right category, divided by the total number of identified/manually annotated feedbacks respectively.

While our manual annotation found $101$ feedback cases, our automatic method found $134$ feedback cases, out of which $58$ are correct. Employing bootstrap with 1000 repetitions, this results to $0.43\pm 0.05$ precision and $0.58\pm 0.06$ recall. Taking the categories into account, we get $0.28\pm 0.04$ precision and $0.38\pm 0.06$ recall. Fig 4 presents the confusion matrix for the categories. We can see that “Repeat and Rephrase” and “Make Aware with Correction” are the most frequent categories for both manual annotation and the automatic method, and that the automatic method did not predict the “Make Aware without Correction” category at all. If we reduce it to binary categories, i.e., positive/negative, we can see that there is no confusion between the positive category and the negative categories.

Examining the false positives, we see that many of them are debatable. For example, the user started by telling the model “I am interested to know how you work”. The model then responded with “I am an AI language model that uses machine learning algorithms to understand and generate human-like text. I am trained on a large dataset of text… If you have any specific questions about how I work, feel free to ask!”. The user then asked “Can I make you on my own computer?” and our extraction method marked this text as a “Ask for Clarification” feedback case. We did not annotate this response as feedback as we considered it to be a new request and not a clarification of the previous one. However, we do see why it could be pointing to missing information in the original question.

Using better models for the extraction or finding better instruction prompts would probably improve results. Still, given that it seems that also the false positives encode a relevant signal, we hypothesize that this extracted feedback data already would be beneficial for preference training.

5.3 The Natural Feedback Dataset

Using the manual annotation as a test set, and our extraction method to acquire more feedback, we create a large Natural Feedback Dataset. We run the described extraction method on all 1M conversations of the LMSYS-Chat-1M dataset. After filtering out two turn conversations (see §5.1), we are left with $334,319$ conversations. We apply our method and end up with $173,859$ feedback examples from $115,312$ different conversations. See Fig. 5 for the category distribution. In terms of positive/negative examples, we have about $15$ times more negative examples, similar to the ratio we had in the manual annotation (§3.2). Note that this ratio is not surprising, as correcting a model is potentially beneficial for the user (helping the model to help me), while thanking it is less practical.

We examine the statistics of the conversations that were found to contain feedback. The average number of turns in a conversation in the LMSYS-Chat-1M dataset is $2$ , while the average number of turns in a conversation that contains feedback is $5.5$ . This is not surprising as the minimum number of turns in conversation that contains feedback is $2$ as user feedback can appear after at least one model response, i.e., starting from the second turn only. The average feedback turn is $3.1$ , and the average length of the feedback span is $52.5$ tokens.

6 Training on the Extracted Feedback

To demonstrate the usefulness of the extracted data, we use it to train LLMs and show the improvement.

Our data contains both positive and negative examples. We start by using the positive examples only, to finetune the models. We then present some initial results for preference training, with both positive and negative examples.

6.1 Training Details

We randomly split the positive examples to 80/20% for train/val data, remaining with $8448$ training examples. We use three models: EleutherAI/pythia-1.4b, EleutherAI/pythia-2.8b (Biderman et al., 2023), and mistralai/Mistral-7B-v0.1 (Jiang et al., 2023). For more details see App.§A.

6.2 Model Performance Evaluation

To measure the improvement of the models given the new data, we use the validation split of the OpenAssistant dataset (Köpf et al., 2024). We generate the last response with both our trained models and the corresponding pretrained models. See App.§A for the generation parameters.

Human Evaluation.

We perform human evaluation of the model outputs to acquire a reliable evaluation. To do so an in-house human annotator was asked to rate not consistently ordered pairs of $100$ model responses for each of the models, without knowing what model created which response (the pretrained baseline or the finetuned version). Our trained models won $69\%$ / $81.5\%$ / $77\%$ over their corresponding pretrained versions.

Evaluation by Open Models.

In addition to manual evaluation, we perform automatic evaluation, which allows more flexibility in the analysis. To prioritize replicable science, we also explore the usability of open models as evaluators in our scenario. The RewardBench leaderboard (Lambert et al., 2024) evaluates the capabilities of models in the task of rating model responses, and its top models outperform some closed models that are frequently used as judges. Based on the leaderboard, we take openbmb/Eurus-RM-7b (Yuan et al., 2024) and sfairXC/FsfairX-LLaMA3-RM-v0.1 (leaders of the leaderboard, when conducting these experiments), and compare the pretrained to the finetuned models responses. However, we were disappointed to find that these models do a poor job comparing the outputs of the smaller models. openbmb/Eurus-RM-7b reported $31\%$ and $38\%$ wins for the 1.4B and 2.8B models, and sfairXC/FsfairX-LLaMA3-RM-v0.1 reported $48\%$ and $60\%$ . We assume this is due to the distribution of the data they were trained on, which only represents stronger models. The results for the 7B model on the other hand are comparable to those we got in the human (and GPT-4) evaluation. openbmb/Eurus-RM-7b and sfairXC/FsfairX-LLaMA3-RM-v0.1 reported $70\%$ and $72\%$ wins for the trained model respectively.

GPT as a Judge.

To complete the picture, we run a GPT-4 as a judge evaluation (Zheng et al., 2023b). We use the Reward Bench (Lambert et al., 2024) implementation to instruct GPT-4 to compare response pairs. Our trained models won $65\%$ / $74\%$ / $78\%$ over their corresponding pretrained versions.

Conducting a binominal test on these results, we find that all above reported results are significant with $p<10e^{-9}$ . We conclude that our automatically extracted training data is indeed beneficial. Training on about $8k$ positive examples yields a significant improvement for all our tested model sizes, more as the model is larger.

6.3 Random Chats Baseline

As an additional baseline, we replace our extracted positive examples with a random sample of chat examples from the LMSYS-Chat-1M dataset of the same size. These examples are not necessarily positive, but they are in a chat format and of relatively well-performing models and therefore might be useful for knowledge distillation nonetheless (Honovich et al., 2023). We want to test whether training on our extracted data has any advantage over this randomly sampled data. We finetune the 7B model on them, and evaluate their performance. openbmb/Eurus-RM-7b reports $64\%$ wins, sfairXC/FsfairX-LLaMA3-RM-v0.1 reports $68\%$ wins, and GPT-4 reports $75\%$ wins, all outperformed by our main results. This strengthens our previous conclusion that our extracted data is beneficial.

6.4 KTO Training

So far, we have shown promising results for finetuning. While finetuning is a performant way to use feedback, it only trains on positive examples (see §6.1). To test the benefits of the negative examples we try other training methods.

The positive and negative examples are not arranged in pairs that use the same prompt and hence are not suitable for DPO training (Rafailov et al., 2024). Instead, we use KTO (Ethayarajh et al., 2024), which is capable of handling non-paired preference data. As mentioned in §5.3, there are many more negative examples than positive ones. To balance this, we use only the "Make Aware with Correction" and "Make Aware without Correction" categories, and on top of that we down-sample. We chose these categories as we assume their ’negative’ signal is the strongest. For the hyperparameters, see App.§A.

We run this experiment with the 7b model only, as KTO is not beneficial for smaller models (Ethayarajh et al., 2024). We start from the previously finetuned model and then train it with KTO.

We evaluate the model. openbmb/Eurus-RM-7b reports $74\%$ wins, sfairXC/FsfairX-LLaMA3-RM-v0.1 reports $75\%$ wins, and GPT-4 reports $79\%$ wins. These scores are all overwhelmingly better than the pretrained, and somewhat better than those we had for the finetuned model (about $1$ - $3$ points improvement). We conclude that our negative data (or at least some of the categories) is indeed useful for training.

7 Ablation Experiments

We analyze different aspects of our extraction method, including our choice of feedback taxonomy.

7.1 Taxonomy Effect on the Extraction

Here we examine the effect of the feedback taxonomy on the success of the model in accurately extracting the feedback spans from the conversations, finding that our model benefits from our taxonomy design decisions.

We examine several taxonomy alternatives. We evaluate each by calculating the precision and recall relative to the 300 manually annotated conversations from §3.2.

7.1.1 No Categories

Our taxonomy introduces $5$ different feedback categories (see §3.1). Here we examine whether there is even a need for any taxonomy at all.

We change our prompt such that it will not contain any category definition. We instruct the model to recognize spans of text that are informitive as to the satisfaction of the user, and rate them on a scale of 1-5. See App. §C for the prompt.

Running the model in this setting, the model found $693$ (!) text spans, while not even one of them matches the manually annotated feedback examples. Manually looking at a few of them, it seems that the model fails miserably at identifying relevant text spans. For example, it often suggests seeing the original user’s requests as an indication of user satisfaction (e.g., “Show me how to implement a toy version of a relational database. Begin by writing a toy query planner that convert SQL…”), which is of course not a valid user feedback as it precedes the model response.

We conclude that an overly general extraction prompt is harder for the model to handle, and that a detailed taxonomy is helpful for the automatic extraction process.

7.1.2 Limited Categorization

We examine the effect of using fewer feedback categories on the extraction process. This followed the hypothesis that focusing on a smaller set of categories would allow for better precision.

We limit ourselves to the “Repeat and Rephrase” and “Positive Feedback” categories, as we recognize that they are both relatively easier for the model to distinguish from the other categories (see 4). We instruct the model to extract these two types for feedback only. See App. §C for the prompt.

For the “Positive Feedback” category, the model manages to achieve $0.5$ precision for both span and category precision, i.e., all positive cases that the model found were classified correctly.

For the “Repeat or Rephrase” category, the model managed to achieve $0.43$ text-span precision and $0.17$ category precision. Given that there are only two possible categories, there is a relatively large gap between the span and category precision. Looking at the extracted examples, we see that the model tends to invent new categories, for example “Asking for Assistance” or “Ask for Examples”.

We conclude that focusing on fewer feedback categories is not necessarily easier for the model.

7.2 Confidence Level

We want to examine the usefulness of asking the model to generate a “confidence level” value, to better filter the extracted feedback samples such that we will get a higher precision score.

To do so, in addition to the “User Response Pattern” and “User Response Text” fields, we instruct the model to provide a “Confidence Level (1-5)” field. See App. §C for the prompt.

Looking at the distribution of confidence scores the model assigned, we find that over $96\%$ of the feedback cases received a $5$ score. The other $4\%$ are mostly “No Feedback” or hallucinations that are automatically removed at the parsing stage (see §5.1). We conclude that this method is ineffective.

8 Related Work

In addition to the LMSYS-Chat-1M dataset which we used due to its size and inclusion of multiple models, there are other recent, public conversation datasets such as WildChat (Zhao et al., 2024), Collective Cognition ChatGPT Conversations,²²2https://huggingface.co/datasets/CollectiveCognition/chats-data-2023-10-16?row=11 and PRISM (Kirk et al., 2024).

Petrak et al. (2023) investigated the types of errors and user responses in $6$ different datasets. Although we adopt and modify their feedback taxonomy, we take two steps further. We focus on user responses and extract them automatically, and we show the importance of using up-to-date conversation data (§4), contrary to their conclusion.

We opted for KTO to train a model on our data, but there are many more options for training on non-positive examples (Christiano et al., 2017). Ouyang et al. (2022) and Shi et al. (2022) suggested to create possible corrections for negative examples and train on them. Other methods use pairs of positive and negative examples and train on them both to predict their scores (Liu et al., 2023a). Peng et al. (2024) and Wu et al. (2024) showed the possible gain of fine-grained feedback.

Another set of related works are ones creating synthetic data from datasets (Yehudai et al., 2024) or augmenting user feedback (Sudalairaj et al., 2024). We note their similarity in better utilizing human effort for creating data samples for training. Those, however, differ from our work in the problem they address. Such efforts rely on boosting existing training signals, whether found in the human explicit annotations, the model, or both. In contrast, our approach aims to identify signals in a scalable fashion. In fact, the output of our method can be used as their input (Bartolomé et al., 2024).

9 Discussion and Future Work

This paper advocates the use of naturally occurring feedback and introduces a method to extract it. We find that naturally occurring feedback is common in human-model chats. We use our method to extract over $170k$ feedback samples and train models on them, demonstrating their usefulness.

Our method can be improved with a better model, prompt, and more sophisticated extraction algorithm. Nonetheless, we hope our results encourage more work on naturally occurring feedback.

We observed in §4 that newer conversation data tends to contain more naturally occurring feedback. Buschmeier and Kopp (2018) showed the importance of “listener feedback” (subtle verbal signals, head gestures, and facial expressions) for the ability to communicate. They showed that this feedback encourages human interlocutors interacting with a model to provide more feedback by themself, and to rate the conversation as more helpful. Therefore, we expect voice assistant conversation data to contain even more feedback.

Another interesting line of future work is the incorporation of feedback into chats in real-time, with interactive reinforcement learning for example, or at least in a manner that would directly affect future user conversations, making giving feedback more beneficial for the user.

Limitations

Although the original LMSYS-Chat-1M dataset contains some non-English conversations, we filter those out during evaluation, as our annotators are not familiar with these languages.

As mentioned a couple of times along the paper, our automatic extraction method can be improved further to achieve better precision and recall. We believe that the fact that even the current relatively low precision data managed to achieve good training results underscores the importance and potential of naturally occurring feedback. With the abundance of data, future work might seek better precision or keep high recall depending on their goals. One could train on a lot of low quality data, focus on specific subsets of interest (e.g., a domain) or focus on quality annotation throwing a lot and still no ending up wanting, each requiring different precision-recall tradeoffs.

We used GPT as a judge and other models for evaluating the trained model. This approach is both costly and known to have biases (e.g., Dubois et al., 2024; Panickssery et al., 2024). Therefore, we use it only to complement the human evaluation and the open-models evaluation. We would like to emphasize that our extraction method itself does not use GPT or any proprietary models.

Ethics Statement

This work has been approved by the IRB of our institution. We abide by the terms and conditions of the LMSYS-Chat-1M dataset (see the license here ³³3https://huggingface.co/datasets/lmsys/lmsys-chat-1m#lmsys-chat-1m-dataset-license-agreement. As mentioned by the LMSYS-Chat-1M dataset authors, the LMSYS-Chat-1M dataset contains unsafe conversations that may be perceived as offensive or unsettling. The provided OpenAI moderation API tag can be used to filter it. We informed our annotators of this and instructed them to skip these conversations.

Acknowledgments and Thanks

We thank John (autoMeta) Cook, Ramon Astudillo and Ben Burtenshaw for the deep discussions that helped us converge with our thoughts.

References

Bai et al. (2022) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
Bartolomé et al. (2024) Álvaro Bartolomé, Gabriel Martín-Blázquez, Agustín Piqueres-Lajarín, and Daniel Vila-Suero. 2024. Distilabel: An AI feedback (AIF) framework for building datasets with and for LLMs.
Bassiri (2011) Mohammad Amin Bassiri. 2011. Interactional feedback and the impact of attitude and motivation on noticing l2 form. English Language and Literature Studies, 1(2):61.
Bavelas and Gerwing (2011) Janet Beavin Bavelas and Jennifer Gerwing. 2011. The listener as addressee in face-to-face dialogue. International Journal of Listening, 25(3):178–198.
Beeching et al. (2023) Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. 2023. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard.
Biderman et al. (2023) Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. 2023. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR.
Buschmeier and Kopp (2018) Hendrik Buschmeier and Stefan Kopp. 2018. Communicative listener feedback in human-agent interaction: Artificial speakers need to be attentive and adaptive. In Proceedings of the 17th international conference on autonomous agents and multiagent systems, pages 1213–1221.
Chiang et al. (2024) Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. 2024. Chatbot arena: An open platform for evaluating llms by human preference. arXiv preprint arXiv:2403.04132.
Christiano et al. (2017) Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30.
Cui et al. (2023) Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. 2023. Ultrafeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377.
Don-Yehiya et al. (2023a) Shachar Don-Yehiya, Leshem Choshen, and Omri Abend. 2023a. Human learning by model feedback: The dynamics of iterative prompting with midjourney. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4146–4161, Singapore. Association for Computational Linguistics.
Don-Yehiya et al. (2023b) Shachar Don-Yehiya, Leshem Choshen, and Omri Abend. 2023b. Sharelm: Crowd-sourcing human feedback for open-source llms together. https://sharelm.github.io/. (Accessed on 04/15/2024).
Dubois et al. (2024) Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. 2024. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475.
Ethayarajh et al. (2022) Kawin Ethayarajh, Yejin Choi, and Swabha Swayamdipta. 2022. Understanding dataset difficulty with $\mathcal{V}$ -usable information. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 5988–6008. PMLR.
Ethayarajh et al. (2024) Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. 2024. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306.
Hancock et al. (2019) Braden Hancock, Antoine Bordes, Pierre-Emmanuel Mazare, and Jason Weston. 2019. Learning from dialogue after deployment: Feed yourself, chatbot! In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3667–3684, Florence, Italy. Association for Computational Linguistics.
Honovich et al. (2023) Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. 2023. Unnatural instructions: Tuning language models with (almost) no human labor. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14409–14428, Toronto, Canada. Association for Computational Linguistics.
Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.
Jiang et al. (2024) Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2024. Mixtral of experts. Preprint, arXiv:2401.04088.
Jin et al. (2023) Di Jin, Shikib Mehri, Devamanyu Hazarika, Aishwarya Padmakumar, Sungjin Lee, Yang Liu, and Mahdi Namazifar. 2023. Data-efficient alignment of large language models with human feedback through natural language. arXiv preprint arXiv:2311.14543.
Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
Kirk et al. (2024) Hannah Rose Kirk, Alexander Whitefield, Paul Röttger, Andrew Bean, Katerina Margatina, Juan Ciro, Rafael Mosquera, Max Bartolo, Adina Williams, He He, Bertie Vidgen, and Scott A. Hale. 2024. The prism alignment project: What participatory, representative and individualised human feedback reveals about the subjective and multicultural alignment of large language models. Preprint, arXiv:2404.16019.
Köpf et al. (2024) Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Duc Nguyen, Oliver Stanley, Richárd Nagyfi, et al. 2024. Openassistant conversations-democratizing large language model alignment. Advances in Neural Information Processing Systems, 36.
Lambert et al. (2024) Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al. 2024. Rewardbench: Evaluating reward models for language modeling. arXiv preprint arXiv:2403.13787.
Lee et al. (2023) Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. 2023. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267.
Lewis et al. (2021) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2021. Retrieval-augmented generation for knowledge-intensive nlp tasks. Preprint, arXiv:2005.11401.
Liu et al. (2023a) Hao Liu, Carmelo Sferrazza, and Pieter Abbeel. 2023a. Chain of hindsight aligns language models with feedback. arXiv preprint arXiv:2302.02676.
Liu et al. (2023b) Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023b. G-eval: NLG evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore. Association for Computational Linguistics.
OpenAI et al. (2024) OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael, Pokorny, Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. 2024. Gpt-4 technical report. Preprint, arXiv:2303.08774.
Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pages 27730–27744. Curran Associates, Inc.
Panickssery et al. (2024) Arjun Panickssery, Samuel R. Bowman, and Shi Feng. 2024. Llm evaluators recognize and favor their own generations. Preprint, arXiv:2404.13076.
Peng et al. (2024) Andi Peng, Yuying Sun, Tianmin Shu, and David Abel. 2024. Pragmatic feature preferences: Learning reward-relevant preferences from human input. Preprint, arXiv:2405.14769.
Petrak et al. (2023) Dominic Petrak, Nafise Moosavi, Ye Tian, Nikolai Rozanov, and Iryna Gurevych. 2023. Learning from free-text human feedback – collect new datasets or extend existing ones? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 16259–16279, Singapore. Association for Computational Linguistics.
Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
Rafailov et al. (2024) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2024. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.
Roberts et al. (2023) Adam Roberts, Hyung Won Chung, Gaurav Mishra, Anselm Levskaya, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, et al. 2023. Scaling up models and data with t5x and seqio. Journal of Machine Learning Research, 24(377):1–8.
Saito et al. (2023) Keita Saito, Akifumi Wachi, Koki Wataoka, and Youhei Akimoto. 2023. Verbosity bias in preference labeling by large language models. arXiv preprint arXiv:2310.10076.
Scheurer et al. (2022) Jérémy Scheurer, Jon Ander Campos, Jun Shern Chan, Angelica Chen, Kyunghyun Cho, and Ethan Perez. 2022. Training language models with language feedback. Preprint, arXiv:2204.14146.
Shi et al. (2022) Weiyan Shi, Emily Dinan, Kurt Shuster, Jason Weston, and Jing Xu. 2022. When life gives you lemons, make cherryade: Converting feedback from bad responses into good labels. arXiv preprint arXiv:2210.15893.
Sudalairaj et al. (2024) Shivchander Sudalairaj, Abhishek Bhandwaldar, Aldo Pareja, Kai Xu, David D Cox, and Akash Srivastava. 2024. Lab: Large-scale alignment for chatbots. arXiv preprint arXiv:2403.01081.
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. Preprint, arXiv:2307.09288.
Vranjes et al. (2018) Jelena Vranjes, Geert Brône, and Kurt Feyaerts. 2018. Dual feedback in interpreter-mediated interactions: On the role of gaze in the production of listener responses. Journal of Pragmatics, 134:15–30.
Werts et al. (1995) Margaret G. Werts, Mark Wolery, Ariane Holcombe, and David L. Gast. 1995. Instructive feedback: Review of parameters and effects. Journal of Behavioral Education, 5(1):55–75.
Wu et al. (2024) Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A Smith, Mari Ostendorf, and Hannaneh Hajishirzi. 2024. Fine-grained human feedback gives better rewards for language model training. Advances in Neural Information Processing Systems, 36.
Yehudai et al. (2024) Asaf Yehudai, Boaz Carmeli, Yosi Mass, Ofir Arviv, Nathaniel Mills, Assaf Toledo, Eyal Shnarch, and Leshem Choshen. 2024. Genie: Achieving human parity in content-grounded datasets generation. arXiv preprint arXiv:2401.14367.
Young et al. (2024) Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, and Zonghong Dai. 2024. Yi: Open foundation models by 01.ai. Preprint, arXiv:2403.04652.
Yuan et al. (2024) Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, and Maosong Sun. 2024. Advancing llm reasoning generalists with preference trees. Preprint, arXiv:2404.02078.
Zhao et al. (2024) Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. 2024. Wildchat: 1m chatgpt interaction logs in the wild. arXiv preprint arXiv:2405.01470.
Zheng et al. (2024) Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. 2024. Large language models are not robust multiple choice selectors. Preprint, arXiv:2309.03882.
Zheng et al. (2023a) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric. P Xing, Joseph E. Gonzalez, Ion Stoica, and Hao Zhang. 2023a. Lmsys-chat-1m: A large-scale real-world llm conversation dataset. Preprint, arXiv:2309.11998.
Zheng et al. (2023b) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023b. Judging llm-as-a-judge with mt-bench and chatbot arena. Preprint, arXiv:2306.05685.
Zhu et al. (2023) Lianghui Zhu, Xinggang Wang, and Xinlong Wang. 2023. Judgelm: Fine-tuned large language models are scalable judges. arXiv preprint arXiv:2310.17631.

Appendix A Models and Parameters

All the models we used were released with a apache-2.0 license, except to llama3 which is released with the llama3 license and OpenAI models with their own terms of use.

To run our extraction process, we run the model with 0.2 temperature, 256 maximum new tokens, top-p 0.95, and 1.0 repetition penalty. Overall, the model processed approximately one conversation per 10 seconds on an NVIDIA 40A GPU.

We use a learning rate of $5e-7$ and RMSprop optimizer. We use NVIDIA RTX 6000 for the 1.4B model, NVIDIA A40 for the 2.8B and 2 A40 for the 7B models. To fit our GPUs we restrict the maximum input length to 1024, and accumulate gradients to achieve a batch size of $32$ . We run training for up to $20$ epochs, and select the best model according to its performance on the validation set.

For the KTO training, we use the same hyperparameters as in the finetuning experiment, and take the ones specific to KTO from an existing KTO implementation ⁴⁴4https://github.com/ContextualAI/HALOs.

Training each model took up to five days, depending on its size and the GPU used.

To evaluate the models, we use the same generation parameters as above: 0.2 temperature, 256 max new tokens, 0.95 top p, and 1.0 repetition penalty.

We used NVIDIA RTX 6000 for both generating the outputs and for running the open rewards models. Generating the outputs took up to two days for the 7B models and much less for the smaller ones. Running the rewards models took up to $15$ minutes for each. Using GPT-4 as a judge cost us $70\$$ .

Appendix B Annotators Instructions

Here we describe the annotators guidelines.

B.1 Manually Feedback Annotation

For the feedback annotation task, the annotator was given the following guidelines:

There are five different patterns in user responses subsequent to errors in assistant utterances: Repeat or Rephrase (UR1) - The user repeats or rephrases their concern, explaining again what they want. Make Aware with Correction (UR2) - The user points to the model that it was wrong, and provides information regarding the error/how to fix it. "No, I wanted…" Make Aware without Correction (UR3) - The user points to the model that it was wrong, without providing any additional information. "That’s incorrect" Ask for Clarification (UR4) - The user asks for additional resolution that was expected to be in the the previous response, but was missing. "Was it like that?”, “Can you provide a code solution for this?” Positive Feedback (UR5) - The user confirms that the model did a good job, possibly thanking it. "Thank you!", “Great!” Given these guidelines, please recognize such user responses in the following dialogue. Note that there might be conversations that contain unsafe, offensive or unsettling content. If you encounter such conversation, please skip it. To speed up the annotation, you can start by going through the user’s responses only, and then if there is a feedback candidate, look at the entire conversation context.

B.2 Human Evaluation Annotation

For the human evaluation annotation task we used the following guidelines:

Each example includes a prompt, and two possible assistant responses: response_a and response_b. Your task is to choose the better response (‘A’ or ‘B’). If the responses seems about the same, or you are unsure about which one to pick, pick ‘TIE’.

Note that some of the prompts already contain some assistant responses, as they are the beginning of a multi-trun conversation. In that case, your task is to choose the better next assistant response.

Example:

prompt:

# user: Describe the impact of climate change on coastal cities.

# assistant: response_a:

# user: Describe the impact of climate change on coastal cities.

# assistant: Climate change leads to rising sea levels, which can cause flooding in coastal cities. It also increases the frequency of extreme weather events, damaging infrastructure and displacing residents. Efforts to mitigate these impacts include building sea walls and developing better drainage systems. response_b:

# user: Describe the impact of climate change on coastal cities.

# assistant: Climate change affects coastal cities by causing sea levels to rise and increasing the frequency of severe storms. These changes result in frequent flooding, infrastructure damage, and forced relocation of communities. Strategies to combat these effects include constructing barriers like sea walls, enhancing urban drainage, and implementing comprehensive coastal management plans.

Answer: ‘B’

Appendix C Alternative Extraction Prompts

For the “No Category” setting, we use the prompt in Fig. 6.

For the “Limiting Categories” setting, we use the prompt in Fig. 7.

For the Confidence Level setting, we use the prompt in Fig. 8.

Appendix D Ai Assistants In Research Or Writing

We used copilot for writing code scripts, and also used Chat-GPT a little for sentence rephrasing.