Filtered Direct Preference Optimization

Tetsuro Morimura*, Mitsuki Sakamoto*, Yuu Jinnai, Kenshi Abe, Kaito Ariu
CyberAgent
* Equal contribution. Correspondence to: Tetsuro Morimura <[email protected]>, Mitsuki Sakamoto <[email protected]>

Abstract

Reinforcement learning from human feedback (RLHF) plays a crucial role in aligning language models with human preferences. While the significance of dataset quality is generally recognized, explicit investigations into its impact within the RLHF framework, to our knowledge, have been limited. This paper addresses the issue of text quality within the preference dataset by focusing on direct preference optimization (DPO), an increasingly adopted reward-model-free RLHF method. We confirm that text quality significantly influences the performance of models optimized with DPO more than those optimized with reward-model-based RLHF. Building on this new insight, we propose an extension of DPO, termed filtered direct preference optimization (fDPO). fDPO uses a trained reward model to monitor the quality of texts within the preference dataset during DPO training. Samples of lower quality are discarded based on comparisons with texts generated by the model being optimized, resulting in a more accurate dataset. Experimental results demonstrate that fDPO enhances the final model performance. Our code is available at https://github.com/CyberAgentAILab/filtered-dpo.


1 Introduction

Large language models (LLMs) have become pivotal in performing various language processing tasks, such as text generation, dialogue, and summarization (Radford et al., 2019; Brown et al., 2020; Chowdhery et al., 2023). Aligning these models with human preferences and ethical standards is paramount to ensuring they are practical, trustworthy, and socially accepted (Bender et al., 2021; Bommasani et al., 2022). Reinforcement learning from human feedback (RLHF) was developed to tackle this challenge, aiming to enhance LLM performance by leveraging human feedback (Ouyang et al., 2022; Bai et al., 2022; Lin et al., 2022; Touvron et al., 2023; Casper et al., 2023).

Figure 1: Performance comparison of alignment methods using a 160M LM with the AlpacaFarm dataset (Dubois et al., 2023), where the gold rewards are adjusted so that the average reward of the initial LM is zero. (A) shows the impact of dataset quality on RLHF (Ouyang et al., 2022) and DPO (Rafailov et al., 2023), with DPO exhibiting greater sensitivity to dataset quality variations. (B) compares the performance of DPO and the proposed fDPO on a mixed-quality dataset, illustrating that fDPO effectively mitigates the impact of data quality variations.

RLHF operates by taking a preference dataset and a language model (LM) as inputs to produce an LM refined by these preferences (Ouyang et al., 2022). It is broadly divided into two approaches concerning the use of a reward model (RM): RM-based RLHF, which learns an RM from the preference dataset and then uses it to optimize an LM through reinforcement learning (RL), and an RM-free approach that directly adjusts an LM based on preference data. This division mirrors the distinction between offline model-based and model-free RL (Sutton and Barto, 2018). (RM-based RLHF first estimates the environment, specifically the reward function, and then optimizes an LM under the estimated environment, which is in itself a form of model-based RL; we do not need to estimate a state transition function because it is known in NLG tasks.) Each approach offers unique advantages and requires careful application based on specific goals and contexts. For instance, in scenarios with limited data, model-based RL might be preferable due to its data efficiency, though its computational cost is generally higher than that of model-free RL (Moerland et al., 2022; Levine et al., 2020). Consequently, RM-based RLHF may be more effective in leveraging data than RM-free methods, despite the higher computational cost and algorithmic complexity.

Direct preference optimization (DPO) is a representative RM-free RLHF method (Rafailov et al., 2023). DPO reformulates the RL problem as a type of supervised learning problem, bypassing key challenges in RM-based RLHF, such as the need for reward modeling and balancing exploration and exploitation in RL fine-tuning. Thus, DPO simplifies the learning process. However, this approach relies solely on the initially given preference dataset for training, similar to supervised learning. This reliance might make DPO more sensitive to the quality of the preference dataset, potentially more so than other RLHF methods.

In this paper, we explore the impact of preference dataset quality on the performance of LMs optimized by DPO, specifically focusing on the quality of response texts rather than labeling accuracy. We demonstrate that DPO is more affected by text quality variations within the dataset than typical RLHF methods, as shown in Figure 1 (A). Notably, we observe that lower-quality data can create performance bottlenecks. In realistic applications of LLM alignment, the quality of responses can be highly diverse due to several factors, such as differing skill levels among the experts creating responses and the need to combine manually generated responses with those automatically generated by LLMs to manage annotation costs. This variation in response quality can severely impact the performance of DPO.

In response to this challenge, we introduce a novel approach named filtered direct preference optimization (fDPO), which aims to harness potential data efficiency advantages of RM-based RLHF. It uses a trained RM to identify and discard samples of lower quality than those generated by an LM during fine-tuning. Our experiments show that fDPO significantly enhances the effectiveness of DPO, as illustrated in Figure 1 (B).

For simplicity, we will henceforth refer to RM-based RLHF simply as RLHF, unless a distinction is necessary. This study’s contributions are threefold:

• We confirm that the quality of the preference dataset significantly influences the performance of LMs optimized with DPO whereas it has less impact on LMs optimized by standard RLHF.

• We introduce fDPO, a practical solution that uses an RM to identify and discard lower-quality data, effectively addressing the dataset quality issue.

• Our experiments with two distinct datasets demonstrate that fDPO substantially enhances the performance of LMs.

The remainder of this paper is organized as follows: Section 2 reviews related work. Section 3 explains the background. In Section 4, we detail the proposed method, fDPO, explaining its mechanisms and the rationale behind its design. Section 5 presents the experimental results, illustrating the effectiveness of fDPO and its impact on LM performance. Finally, Section 6 concludes the paper, and Section 7 discusses limitations and directions for future work.

2 Related work

We examine methods for aligning LMs with human preferences, focusing on RLHF and its alternatives. Most RLHF approaches utilize an RM (Ouyang et al., 2022; Touvron et al., 2023; Dubois et al., 2023; Casper et al., 2023). These methods fine-tune LMs using RL algorithms such as REINFORCE (Williams, 1992; Rennie et al., 2017), proximal policy optimization (PPO) (Schulman et al., 2017), or their variants (Sutton and Barto, 2018). However, there are notable reinforcement-learning-free approaches (Zhao et al., 2023; Liu et al., 2024), as well as learning-free methods that leverage the RM at decoding time, with best-of-N (BoN) sampling being a prominent example (Stiennon et al., 2020; Nakano et al., 2021).

A significant challenge in these methods is the estimation error of RMs, which can lead LMs to overfit to a proxy reward, a phenomenon termed RM overoptimization (Gao et al., 2023). Various strategies have been proposed to address this issue, including RM ensembles (Coste et al., 2023; Eisenstein et al., 2023), uncertainty evaluation (Zhang et al., 2024), and out-of-distribution analysis (Pikus et al., 2023; Kirk et al., 2024). Pace et al. (2024) propose using BoN sampling to improve the data used for reward modeling, which is relevant to our fDPO approach focusing on dataset quality. As fDPO also leverages an RM, it can benefit from these developments.

DPO and its extensions (Azar et al., 2023; Tang et al., 2024; Pal et al., 2024; Singh et al., 2024) represent significant RM-free methods. Some DPO variants explore different regularizations (Wang et al., 2024) or use a divided dataset for stepwise training (Gou and Nguyen, 2024). Other variants adapt DPO to the online setting (Xu et al., 2023; Guo et al., 2024), add an offset to the DPO objective function based on the quality difference between chosen and rejected responses (Amini et al., 2024), or incorporate curriculum learning (Gou and Nguyen, 2024). These approaches focus on response quality, which is relevant to our method.

Despite various advancements in DPO, the dependence on preference dataset quality has not been thoroughly analyzed. Our study aims to explore this significant dependence and attempts to refine the dataset for better performance. Additionally, our proposed fDPO method complements most of these developments. Integrating fDPO with these methods is an exciting possibility for future work, potentially leading to even more effective ways to align LMs with human preferences.

3 Background

This section explains RLHF in Section 3.1 and explores DPO in Section 3.2.

3.1 Reinforcement Learning from Human Feedback

Reinforcement learning from human feedback (RLHF) frames the application of human feedback to enhance performance of a language model (LM) within the context of an RL problem. The process incorporates a pre-trained LM $\pi_{\theta}(y\,|\,x)$ , with $\theta$ denoting model parameters, $x$ the prompt, and $y$ the associated response. It also includes a demonstration dataset $\mathcal{D}_{\rm demo}$ for initial supervised fine-tuning and a preference dataset $\mathcal{D}$ for further RL fine-tuning. The aim is to refine the LM $\pi_{\theta}$ with these datasets $\mathcal{D}_{\rm demo}$ and $\mathcal{D}$ . We will present an overview of the widely studied RLHF pipeline (Ouyang et al., 2022), establishing the notations and concepts for understanding our contributions. The RLHF pipeline comprises three principal phases: (i) supervised fine-tuning, (ii) reward modeling, and (iii) RL fine-tuning.

Supervised fine-tuning.

Supervised fine-tuning (SFT) refines a pre-trained LM $\pi_{\theta}$ through supervised learning using demonstration data $\mathcal{D}_{\rm demo}$ from downstream tasks such as dialogue, instruction following, or summarization. This step steers $\pi_{\theta}$ towards desirable responses $y$ given prompts $x$ , laying the groundwork for the more complex RL fine-tuning steps in the RLHF pipeline. The resulting LM is called the SFT model.

Reward modeling.

The reward modeling phase constructs a reward model (RM) $r_{\phi}(x,y)$ with a parameter $\phi$ to capture human preferences. This is achieved using a preference dataset, $\mathcal{D}=\{(x^{(i)},y_{c}^{(i)},y_{r}^{{(i)}})\}_{i=1}^{N}$ , where for each prompt $x$ , $y_{c}$ denotes the response chosen by a human, and $y_{r}$ is the rejected response. The variable $N$ denotes the total number of samples in the dataset.

To estimate the probability that a given response is preferred over another, the RM $r_{\phi}$ utilizes the Bradley-Terry model (Bradley and Terry, 1952), which is formulated as:

$$p_{\textrm{BT}}(y_{c}\succ y_{r}\,|\,x,r_{\phi})=\sigma(r_{\phi}(x,y_{c})-r_{\phi}(x,y_{r})),$$

where $\sigma(x)=\frac{1}{1+\exp(-x)}$ is the sigmoid function. The RM is trained by maximizing the following log-likelihood of the observed preferences in the dataset:

$$L(\phi)=\mathbb{E}_{(x,y_{c},y_{r})\sim\mathcal{D}}[\log\sigma(r_{\phi}(x,y_{c})-r_{\phi}(x,y_{r}))]. \tag{1}$$

This training process aims to assign higher scores to responses that humans prefer, thus enhancing the RM’s ability to predict human preferences.
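As a minimal sketch of the Bradley-Terry probability and the log-likelihood in Eq. (1), the snippet below works directly on scalar reward values rather than on a trained network. The function names and the example scores are illustrative assumptions, not part of the original pipeline.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def bt_preference_prob(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred."""
    return sigmoid(r_chosen - r_rejected)

def rm_log_likelihood(reward_pairs) -> float:
    """Average of log sigma(r_c - r_r) over (r_chosen, r_rejected) pairs, as in Eq. (1)."""
    total = sum(math.log(bt_preference_prob(rc, rr)) for rc, rr in reward_pairs)
    return total / len(reward_pairs)

# Hypothetical RM scores for three preference pairs (illustrative values only).
pairs = [(2.0, 0.5), (0.1, 0.0), (-1.0, 1.0)]
print(bt_preference_prob(1.0, 1.0))  # equal rewards: no preference either way
print(rm_log_likelihood(pairs))      # always negative; closer to 0 is better
```

Maximizing this quantity with respect to the RM parameters pushes the score of each chosen response above that of its rejected counterpart.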

RL fine-tuning.

The RL fine-tuning phase uses the learned RM $r_{\phi}$ to optimize the SFT model $\pi_{\theta}$ . The goal is to enhance $\pi_{\theta}$ by maximizing the expected reward while maintaining closeness to the reference LM $\pi_{\rm ref}$ , striking a balance that avoids large deviations from the pre-trained behavior. The SFT model before RL fine-tuning is often used as $\pi_{\rm ref}$ . This is achieved through policy gradient methods like proximal policy optimization (PPO) (Schulman et al., 2017). The optimization problem is formalized as

$$\max_{\theta}\ \mathbb{E}_{x\sim\mathcal{D}}\bigg[\mathbb{E}_{y\sim\pi_{\theta}(\cdot\,|\,x)}[r_{\phi}(x,y)]-\beta D_{\textrm{KL}}(\pi_{\theta}(\cdot\,|\,x),\pi_{\rm ref}(\cdot\,|\,x))\bigg], \tag{2}$$

where $D_{\textrm{KL}}$ is the Kullback–Leibler (KL) divergence of a distribution $p$ from another distribution $q$ , defined as

$$D_{\textrm{KL}}(p,q)=\mathbb{E}_{y\sim p}\left[\log\frac{p(y)}{q(y)}\right].$$

Here, $\beta$ is a hyperparameter that controls the penalty for the deviations from $\pi_{\rm ref}$ .
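To illustrate how $\beta$ trades expected reward against the KL penalty, the following sketch evaluates the bracketed term of Eq. (2) for a single prompt over a toy response space with three possible responses. The distributions and rewards are made-up values chosen for illustration.

```python
import math

def kl_divergence(p, q):
    """D_KL(p, q) = E_{y~p}[log p(y)/q(y)] for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def regularized_objective(policy, reference, rewards, beta):
    """Expected reward under `policy` minus beta * KL(policy, reference), for one prompt."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)

# Toy setting (assumed values): three candidate responses for one prompt.
reference = [0.5, 0.3, 0.2]   # pi_ref
rewards = [0.0, 1.0, 2.0]     # r_phi(x, y) for each response
greedy = [0.01, 0.01, 0.98]   # a policy concentrated on the best-scoring response

for beta in (0.1, 5.0):
    stay = regularized_objective(reference, reference, rewards, beta)
    move = regularized_objective(greedy, reference, rewards, beta)
    print(beta, stay, move)
```

With the small penalty the concentrated policy scores higher than staying at the reference, while with the large penalty it scores lower, which is the balance that $\beta$ controls.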

3.2 Direct Preference Optimization

Direct preference optimization (DPO) reformulates the above reward modeling and RL fine-tuning phases into a single optimization problem (Rafailov et al., 2023). While DPO essentially follows the same loss function under the Bradley-Terry model (Eq. 1), it is an RM-free approach that aligns the SFT model $\pi_{\theta}$ directly with the preference data.

The objective function of DPO, which is maximized, raises the probability ratio (relative to $\pi_{\rm ref}$ ) of the chosen responses over that of the rejected ones, optimizing the LM to imitate human preferences:

$$L_{\rm DPO}(\theta)=\mathbb{E}_{(x,y_{c},y_{r})\sim\mathcal{D}}\bigg[\log\sigma\bigg(\beta\log\frac{\pi_{\theta}(y_{c}\,|\,x)}{\pi_{\rm ref}(y_{c}\,|\,x)}-\beta\log\frac{\pi_{\theta}(y_{r}\,|\,x)}{\pi_{\rm ref}(y_{r}\,|\,x)}\bigg)\bigg], \tag{3}$$

where $\beta$ is a hyperparameter that plays a role similar to the one it has in Eq. (2). As the objective function indicates, DPO simplifies the optimization process by not requiring the generation of responses $y$ from $\pi_{\theta}$ during training, unlike the standard RL fine-tuning of Eq. (2). This approach, akin to supervised learning, makes DPO accessible and easy to use.
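The per-sample term of Eq. (3) depends only on four sequence log-probabilities, which the sketch below makes explicit. It is a plain-Python illustration under the assumption that these log-probabilities have already been computed; it is not the TRL implementation, and the numbers are invented.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_objective(batch, beta: float) -> float:
    """Mean of the per-sample term in Eq. (3).

    Each item is a tuple of sequence log-probabilities:
    (log pi_theta(y_c|x), log pi_ref(y_c|x), log pi_theta(y_r|x), log pi_ref(y_r|x)).
    """
    total = 0.0
    for lp_c, ref_c, lp_r, ref_r in batch:
        margin = beta * (lp_c - ref_c) - beta * (lp_r - ref_r)
        total += log_sigmoid(margin)
    return total / len(batch)

# Illustrative log-probabilities (assumed values).
init = [(-12.0, -12.0, -15.0, -15.0)]      # pi_theta equals pi_ref
improved = [(-10.0, -12.0, -17.0, -15.0)]  # chosen more likely, rejected less likely

init_value = dpo_objective(init, beta=0.1)
improved_value = dpo_objective(improved, beta=0.1)
print(init_value, improved_value)
```

When $\pi_{\theta}$ equals $\pi_{\rm ref}$ the margin is zero and the objective is $\log 0.5$; it increases as the chosen response gains probability relative to the rejected one. No sampling from $\pi_{\theta}$ is needed at any point.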

4 Filtered Direct Preference Optimization

In this section, we propose an approach called filtered direct preference optimization (fDPO), which refines the dataset used in DPO. The principle of fDPO is straightforward: it aims to discard samples whose quality is lower than that of the responses generated by the LM. This strategy is intuitively derived from observing that lower-quality data can create performance bottlenecks in DPO. First, we give an implementation of fDPO in Section 4.1. Then, we elaborate on the motivation of fDPO by analyzing DPO’s behavior in Section 4.2.

4.1 fDPO Implementation

fDPO needs to assess the quality of responses for filtering. For this purpose, a straightforward approach is to use an RM. This incorporation of an RM diverges from the RM-free nature of the original DPO, aligning fDPO closer to RM-based RLHF approaches and making DPO more effective in leveraging data.

Algorithm 1 details the pseudo-code for fDPO implementation, which follows the standard RLHF pipeline in Section 3.1 except for RL fine-tuning. Instead of RL fine-tuning, DPO fine-tuning with filtering is employed. At the start of each training epoch in Step 3, the quality of each sample in the preference dataset is evaluated with a trained RM $r_{\phi}$ . Samples with chosen responses deemed to be of lower quality than those the LM $\pi_{\theta}$ generates are discarded. Specifically, for each prompt $x$ in the dataset, $\pi_{\theta}$ generates a response $y$ , and $r_{\phi}$ scores $y$ and the chosen response $y_{c}$ . If the score of $y$ is higher than that of $y_{c}$ , the corresponding sample $(x,y_{c},y_{r})$ is excluded from training.

The learning process itself mirrors that of DPO but introduces the aforementioned data refinement step. This refinement step aims to create a more effective training dataset, thereby improving the LM’s alignment with human preferences.

Algorithm 1 filtered direct preference optimization (fDPO)

Require: LM $\pi_{\theta}$ , RM $r_{\phi}$ , demonstration data $\mathcal{D}_{\rm demo}$ , preference data $\mathcal{D}_{\rm pref}$ , and maximum epoch $M$
1: Step 1: Supervised fine-tuning. Train $\pi_{\theta}$ on $\mathcal{D}_{\rm demo}$ .
2: Step 2: Reward modeling. Train $r_{\phi}$ on $\mathcal{D}_{\rm pref}$ (see Eq. (1)).
3: Step 3: DPO fine-tuning with filtering.
4: Initialize filtered-preference dataset $\mathcal{D}_{\rm f}:=\mathcal{D}_{\rm pref}$ , epoch number $m:=0$
5: while $m<M$ do
6:     for each $(x,y_{c},y_{r})\in\mathcal{D}_{\rm f}$ do
7:         Generate response $y$ by LM $\pi_{\theta}$ given prompt $x$
8:         if $r_{\phi}(x,y)>r_{\phi}(x,y_{c})$ then
9:             Discard $(x,y_{c},y_{r})$ from $\mathcal{D}_{\rm f}$
10:         end if
11:     end for
12:     Update LM $\pi_{\theta}$ on $\mathcal{D}_{\rm f}$ for one epoch using DPO.
13:     Increment epoch number $m:=m+1$
14: end while
15: return Optimized LM $\pi_{\theta}$
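Step 3 of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative outline rather than the released implementation: `generate`, `reward`, and `dpo_epoch` are hypothetical callables standing in for LM sampling, the trained RM, and one epoch of DPO updates, and the toy stand-ins at the bottom exist only to make the sketch runnable.

```python
from typing import Callable, List, Tuple

Sample = Tuple[str, str, str]  # (prompt x, chosen response y_c, rejected response y_r)

def filter_preference_data(
    data: List[Sample],
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
) -> List[Sample]:
    """Drop samples whose chosen response scores lower than the LM's own response."""
    kept = []
    for x, y_c, y_r in data:
        y = generate(x)                      # response from the current LM
        if reward(x, y) > reward(x, y_c):    # filtering condition of Algorithm 1
            continue                         # discard (x, y_c, y_r)
        kept.append((x, y_c, y_r))
    return kept

def fdpo_train(data, generate, reward, dpo_epoch, max_epochs: int) -> List[Sample]:
    """Filter at the start of every epoch, then run one DPO epoch on what remains."""
    filtered = list(data)
    for _ in range(max_epochs):
        filtered = filter_preference_data(filtered, generate, reward)
        dpo_epoch(filtered)                  # placeholder for the actual DPO update
    return filtered

# Toy stand-ins (assumptions): the "RM" scores a response by its length.
toy_data = [("q1", "a detailed answer", "bad"), ("q2", "ok", "no")]
toy_generate = lambda x: "medium reply"
toy_reward = lambda x, y: float(len(y))
sizes = []
remaining = fdpo_train(toy_data, toy_generate, toy_reward,
                       lambda d: sizes.append(len(d)), max_epochs=2)
print(remaining, sizes)
```

Because the filtered set is carried over between epochs, a sample discarded once never returns, and the set can only shrink as the LM improves.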

4.2 Background and Motivation for fDPO

The motivation for developing fDPO stems from the observation that the quality of data in DPO significantly affects the performance of the resulting LM. More specifically, upon differentiating the objective function of DPO in Eq. (3), we obtain

$$\nabla_{\theta}L_{\rm DPO}(\theta)=\beta\,\mathbb{E}_{(x,y_{c},y_{r})\sim\mathcal{D}}\bigg[\underbrace{w_{\theta}(x,y_{c},y_{r})\nabla_{\theta}\log\pi_{\theta}(y_{c}\,|\,x)}_{\textrm{increase likelihood of }y_{c}}\ \underbrace{-\,w_{\theta}(x,y_{c},y_{r})\nabla_{\theta}\log\pi_{\theta}(y_{r}\,|\,x)}_{\textrm{decrease likelihood of }y_{r}}\bigg], \tag{4}$$

where $w_{\theta}$ is a weight function defined as follows:

$$w_{\theta}(x,y_{c},y_{r})=\sigma\bigg(\beta\log\frac{\pi_{\theta}(y_{r}\,|\,x)}{\pi_{\rm ref}(y_{r}\,|\,x)}-\beta\log\frac{\pi_{\theta}(y_{c}\,|\,x)}{\pi_{\rm ref}(y_{c}\,|\,x)}\bigg).$$

Equation (4) highlights that DPO, while adaptively adjusting sample weights, inherently aims to increase the generation probability for chosen responses and decrease it for rejected ones. This approach can lead to two types of problems: 1) diminished generation probability for high-quality responses labeled as rejected, and 2) increased generation probability for low-quality responses labeled as chosen.

Concerns regarding the first case, where high-quality responses are classified as rejected, might be insignificant. In such a case, while the generation probabilities of several high-quality responses decrease, the capability of LMs could remain robust. This is because their extensive diversity of potential responses will ensure that suppressing some responses does not substantially reduce the LM’s capacity to generate other high-quality alternatives.

Conversely, the more critical issue arises when low-quality responses are labeled as chosen. In such cases, their generation probabilities increase. This increase is particularly problematic because the probabilities of potential responses sum to one, meaning an increase in the probability of low-quality responses invariably decreases the share of high-quality responses. This shift substantially directs the learning process toward suboptimal outputs and degrades the overall performance of LMs. A more detailed analysis of the sensitivity comparison between chosen and rejected responses is provided in Appendix B.
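The sum-to-one argument can be checked on a toy example. The sketch below replaces the LM with a softmax over four candidate responses (a deliberate simplification, with made-up logits) and takes one gradient-ascent step on the log-probability of a low-quality response that was labeled as chosen.

```python
import math

def softmax(logits):
    """Convert logits to probabilities that sum to one."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ascent_step_on_log_prob(logits, index, lr):
    """One gradient-ascent step on log softmax(logits)[index].

    Uses d/d logit_j log p_index = 1[j == index] - p_j.
    """
    probs = softmax(logits)
    return [l + lr * ((1.0 if j == index else 0.0) - p)
            for j, (l, p) in enumerate(zip(logits, probs))]

# Toy response space (assumed): index 0 is low quality, indices 1-3 are high quality.
logits = [0.0, 1.0, 1.0, 1.0]
before = softmax(logits)
after = softmax(ascent_step_on_log_prob(logits, index=0, lr=1.0))

print(before[0], after[0])             # probability of the low-quality response rises
print(sum(before[1:]), sum(after[1:])) # total mass on high-quality responses falls
```

Raising the probability of the low-quality response necessarily takes mass away from the high-quality ones, even though none of them appears in the update.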

Building upon these insights, fDPO effectively addresses the issue of increased generation probability for low-quality chosen responses. It tackles these bottlenecks by discarding samples where the chosen responses are of lower quality compared to those generated by the LM $\pi_{\theta}$ , as evaluated according to an RM. Through this process of consistent refinement, fDPO performs DPO on the improved dataset, thereby enhancing DPO’s effectiveness and ensuring a more effective alignment with human preferences.

5 Experiments

We first detail our setup regarding pretrained models in Section 5.1 and datasets in Section 5.2. We then evaluate the impact of data quality on DPO in Section 5.3 and the effectiveness of fDPO in Section 5.4 on instruction following tasks using the AlpacaFarm dataset (Dubois et al., 2023), focusing on the general ability to generate appropriate responses to prompts. Furthermore, we assess fDPO on the Anthropic HH datasets (Bai et al., 2022) in Section 5.5, under a realistic setting where there are two types of responses: dataset responses and those generated by the SFT model. This setup closely mimics real-world applications, where the system must handle both pre-existing and newly generated responses. For our baseline comparison, we use the DPO and PPO-based RLHF implementations from the Transformer Reinforcement Learning (TRL) library (https://github.com/huggingface/trl). All experiments are conducted using a single NVIDIA A100 accelerator. The experiments using the AlpacaFarm dataset took approximately 3 days, and those using the Anthropic HH datasets took about 9 days of computation time. Details of the experimental parameters are provided in Appendix C.1.

5.1 Pretrained Models

We employed pretrained LMs provided in the Pythia suite by Biderman et al. (2023) of two different sizes, 1.4B and 160M, in experiments on the AlpacaFarm dataset. In experiments on the Anthropic HH datasets, we used the 2.8B-sized Pythia model. Due to computational resource constraints, the comprehensive examination in Sections 5.3 and 5.4 is carried out with the 160M LM. In the preliminary setup, each LM was subjected to SFT using the demonstration data in the AlpacaFarm dataset or the chosen responses from the preference data in the Anthropic HH datasets, as the Anthropic HH datasets do not contain demonstration data. These prepared SFT models, denoted as $\pi_{\theta}$ , were then used as the initial LMs for our experiments.

For the (proxy) RM, we used Pythia models of varying sizes: 14M, 70M, and 160M models, with 160M being the default unless otherwise specified. To circumvent the high costs associated with human evaluation, similar to other studies (Dubois et al., 2023; Rafailov et al., 2023), we utilized a large-scale human preference model as the gold RM. Specifically, the “OpenAssistant/reward-model-deberta-v3-large-v2” model (https://huggingface.co/OpenAssistant/reward-model-deberta-v3-large-v2) was employed for this purpose. We adjusted the reward zero point such that the average reward of the initial LM (SFT model) is set to zero. Additionally, in Section 5.5, we employed GPT-4o for evaluation as an alternative to human assessment, accessed via Azure OpenAI (https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models). The specific model used was “gpt-4o” with the version “2024-05-13”.

5.2 Datasets

We used the AlpacaFarm dataset (Dubois et al., 2023) and the Anthropic HH datasets (Bai et al., 2022). The AlpacaFarm dataset consists of 169,352 demonstration (SFT) samples, 20,000 training samples, and 2,000 test samples. The Anthropic HH datasets include two subtypes of datasets: helpfulness and harmlessness datasets. The former consists of 43,835 training samples and 2,354 test samples. The latter consists of 42,537 training samples and 2,312 test samples.

The baseline DPO and our proposed fDPO used the same data to ensure a fair comparison. This means that in fDPO, both the RM and the LM were trained using an identical dataset.

Given our focus on dataset quality, in experiments on the AlpacaFarm dataset, we employed gold RM and BoN sampling (Stiennon et al., 2020; Nakano et al., 2021) to create three types of pairwise preference datasets:

Low-quality dataset.

This dataset was created in the conventional manner. For each prompt $x$ , the LM $\pi_{\theta}$ generated two responses. These responses were then evaluated by the gold RM, with the higher-scoring response designated as $y_{c}$ and the other as $y_{r}$ . The resulting triples $(x,y_{c},y_{r})$ formed the samples of the preference dataset $\mathcal{D}$ . For brevity, this dataset is referred to as the low dataset.

High-quality dataset.

Adopting the approach from Pace et al. (2024), we used BoN sampling to create responses of higher quality. Specifically, for each prompt $x$ , the LM $\pi_{\theta}$ generated $16$ responses. These responses were then evaluated by the gold RM, and the highest-scoring response was selected as $y_{c}$ , with one randomly selected from the remaining 15 responses labeled as $y_{r}$ . Due to the probabilistic nature of outputs of $\pi_{\theta}$ , this approach is likely to yield $y_{c}$ responses of higher quality (as indicated by gold RM scores) compared to the $y_{c}$ responses in the low-quality dataset. For simplicity, this dataset is referred to as the high dataset.

Mix-quality dataset.

This dataset was created by mixing the low-quality and high-quality datasets in a 50/50 ratio, ensuring no overlap in prompts between the two. This dataset is referred to as the mix dataset.
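The three constructions above differ only in how many responses are sampled per prompt, which the following sketch makes concrete. It is a toy illustration: a "response" is represented by a random number that doubles as its own gold reward, and `toy_generate` and `toy_gold_reward` are hypothetical stand-ins for the SFT model and the gold RM.

```python
import random

def make_pair(prompt, generate, gold_reward, n, rng):
    """Sample n responses; the best under the gold RM becomes y_c,
    and a random one of the remaining n - 1 becomes y_r."""
    responses = [generate(prompt, rng) for _ in range(n)]
    ranked = sorted(responses, key=lambda y: gold_reward(prompt, y), reverse=True)
    return prompt, ranked[0], rng.choice(ranked[1:])

def build_low_and_high(prompts, generate, gold_reward, rng):
    """Split prompts into disjoint halves: n=2 pairs (low) and n=16 pairs (high)."""
    shuffled = list(prompts)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    low = [make_pair(p, generate, gold_reward, 2, rng) for p in shuffled[:half]]
    high = [make_pair(p, generate, gold_reward, 16, rng) for p in shuffled[half:]]
    return low, high

def mean_chosen_reward(dataset, gold_reward):
    """Average gold reward of the chosen responses in a dataset."""
    return sum(gold_reward(x, y_c) for x, y_c, _ in dataset) / len(dataset)

# Toy stand-ins (assumptions): a response is a number equal to its gold reward.
toy_generate = lambda prompt, rng: rng.gauss(0.0, 1.0)
toy_gold_reward = lambda prompt, response: response

rng = random.Random(0)
low, high = build_low_and_high(range(200), toy_generate, toy_gold_reward, rng)
mix = low + high  # 50/50 mix with no prompt shared between the two halves
print(mean_chosen_reward(low, toy_gold_reward), mean_chosen_reward(high, toy_gold_reward))
```

With $n=2$ the procedure reduces to the low-quality construction, and with $n=16$ it gives the BoN-based high-quality construction, so the chosen responses in the second half have a noticeably higher average gold reward.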

We provide the evaluation scores of the gold RM for these datasets in Table 4 in Appendix C.

For experiments on the Anthropic HH datasets, we created mix-quality datasets by combining original responses from the dataset and those generated by the SFT model. Details are provided in Section 5.5.

5.3 Effect of Data Quality on the Performance of RLHF and DPO

Our preliminary experiment investigates the sensitivities of (RM-based) RLHF and (RM-free) DPO to the quality of the datasets employed with the 160M-sized LM. Here, we used the high-quality and mixed-quality datasets. For RLHF, the 70M-sized RM was trained from the same datasets and used for RL fine-tuning with PPO. The evaluation is based on five independent runs.

Figure 1 (A) shows the results, where the mean and standard error of the gold reward with five independent runs are presented. Notably, while DPO experienced a decline in efficacy when trained on the mixed-quality dataset relative to the high-quality one, RLHF showed an intriguing resilience, sustaining comparable performance levels across both datasets. This differential impact starkly highlights the greater susceptibility of DPO to dataset quality, suggesting that the RM-based approach, including fDPO, may offer more stable performance when the preference dataset quality cannot be consistently assured. However, RLHF’s overall gold reward was lower than DPO’s. Therefore, subsequent experiments focus on DPO.

5.4 Evaluation of fDPO on the AlpacaFarm Dataset

We evaluate fDPO and DPO when trained using a 1.4B-sized LM $\pi_{\theta}$ on the mixed-quality dataset, where fDPO used a 160M-sized RM that was trained with the same dataset. The evaluation is based on five independent runs. The epoch number for DPO was set to 5, which avoided overoptimization while ensuring that learning converged. In the case of fDPO, we set the epoch count to double that of DPO, that is, up to 10 epochs.

Figure 2 presents the results of DPO and fDPO. The results show that the performance of DPO trained on the high-quality dataset and that of fDPO trained on the mixed-quality dataset were on par. This indicates that fDPO has successfully circumvented the performance decline typically observed with DPO, thereby showcasing its potential to improve DPO performance where dataset quality is inconsistent. Corresponding learning curves are included in Appendix C.

5.4.1 Detailed Evaluation

We conduct an extensive analysis of fDPO using a 160M LM. We set the number of epochs to $8$ for DPO to ensure convergence, resulting in a maximum of $16$ epochs for fDPO.

Performance comparison with DPO.

Figure 1 (B) illustrates the performances of LMs trained with DPO and fDPO using the mixed-quality dataset. The results are consistent with those obtained from the larger 1.4B-sized LM, reaffirming the advantage of fDPO with the mixed-quality dataset. Additionally, we conducted an experiment using only a low-quality dataset, which revealed a significant improvement of 4.10% (standard error: 1.87%) despite the presumed uniformity of response quality. This improvement suggests that fDPO effectively discriminates subtle quality variations, enhancing overall performance by eliminating less optimal data, even within uniformly labeled datasets.

Analysis of configuration parameters.

Dataset	Method	Gold RM Score (SFT= $0.0$ ) $\uparrow$	GPT-4o Evaluation (win rate vs. SFT) $\uparrow$
Helpful	DPO	1.42 $\pm$ 0.08	0.543 $\pm$ 0.015
Helpful	fDPO	1.94 $\pm$ 0.02	0.628 $\pm$ 0.001
Harmless	DPO	2.66 $\pm$ 0.12	0.891 $\pm$ 0.003
Harmless	fDPO	3.20 $\pm$ 0.06	0.944 $\pm$ 0.005

Table 1: Evaluation on the Anthropic HH datasets. The values represent the mean and standard error over 3 seeds.

We investigated various aspects of fDPO, including the size of RMs, the randomness of LMs, and the criteria for filtering samples, with the mix-quality dataset. Figure 5 displays the impact of RM size. Consistent with findings from Ouyang et al. (2022), smaller RMs relative to the LM size yielded better performance. This contrasts with studies advocating larger RMs for improved performance (Gao et al., 2023; Coste et al., 2023), highlighting an area for further detailed analysis.

Reducing randomness of LMs during the filtering process was hypothesized to enhance fDPO’s performance by minimizing the variance in quality of the LM-generated responses used for filtering training samples. The idea was that more consistent response quality would lead to more reliable filtering decisions. However, as Figure 5 indicates, reducing randomness did not yield improvements, and in some cases, it led to worse performance. This outcome may be attributed to a discrepancy between inference-time and training-time randomness.

Finally, we explored different criteria for discarding data. As stated in line 8 of Algorithm 1, the original criterion was discarding a sample even if the reward of the LM-generated response $y$ is only marginally higher than that of $y_{c}$ in the dataset. Considering potential errors in proxy rewards and the probabilistic nature of LMs, we introduced a margin $\epsilon$ to the discarding criterion: $r(x,y)>r(x,y_{c})+\epsilon$ . Figure 5 presents the results, showing that larger margins generally lead to a decrease in performance, with the best results achieved when no margin is applied. This suggests that setting a margin $\epsilon$ is not necessary for enhancing fDPO’s performance. We further examined how samples were selectively discarded throughout the learning process of fDPO in Appendix C.3.2.
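The margin-based criterion is a one-line change to the filtering test. The sketch below applies it to a handful of invented proxy-RM scores to show how the number of discarded samples shrinks as $\epsilon$ grows; the function names and values are illustrative assumptions.

```python
def should_discard(r_generated: float, r_chosen: float, epsilon: float = 0.0) -> bool:
    """Discard when the LM's response beats the chosen response by more than epsilon."""
    return r_generated > r_chosen + epsilon

def count_discarded(score_pairs, epsilon: float) -> int:
    """Number of (r_generated, r_chosen) pairs that would be filtered out."""
    return sum(should_discard(g, c, epsilon) for g, c in score_pairs)

# Hypothetical proxy-RM scores: (score of generated y, score of dataset y_c).
scores = [(1.2, 1.0), (0.9, 1.0), (2.5, 1.0), (1.05, 1.0)]

for eps in (0.0, 0.1, 1.0):
    print(eps, count_discarded(scores, eps))
```

Setting `epsilon=0.0` recovers the original criterion, while a larger margin keeps samples whose chosen response is only slightly worse than what the LM generates.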

5.5 Evaluation of fDPO under Realistic RLHF Settings on Anthropic HH Datasets

We also conducted experiments on the Anthropic HH datasets, which consist of single-turn dialogues covering various topics such as academic questions or life guidance (Bai et al., 2022). Here, we aimed to replicate a realistic RLHF setting where the number of high-quality responses created by humans is limited. Instead of generating all responses manually, SFT models are used to create response pairs, and human annotators only provide labels (chosen or rejected) to the pairs. This setup is cost-effective because generating high-quality responses manually is expensive, while annotating SFT-generated pairs is less so. This approach is consistent with the RLHF pipelines used in Ouyang et al. (2022), Pace et al. (2024), and Yuan et al. (2024), which utilize unlabeled prompts effectively.

Specifically, we treated the original responses in the Anthropic HH datasets as high-quality responses, comprising 25% of the dataset. The remaining 75% of the responses were generated by the SFT model. These responses were then annotated as chosen or rejected by the gold RM.
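A minimal sketch of this dataset construction is given below. It rests on several assumptions not stated in the text: `sft_generate` and `gold_reward` are hypothetical callables, the original pairs keep their existing labels, and only the SFT-generated pairs are labeled by the gold RM.

```python
import random
from typing import Callable, Dict, List, Tuple


def build_mixed_dataset(
    original: List[Dict[str, str]],                  # items with "prompt", "chosen", "rejected"
    sft_generate: Callable[[str], Tuple[str, str]],  # hypothetical: two SFT samples for a prompt
    gold_reward: Callable[[str, str], float],        # hypothetical: gold RM score for (x, y)
    original_fraction: float = 0.25,
    seed: int = 1,
) -> List[Dict[str, str]]:
    rng = random.Random(seed)
    indices = list(range(len(original)))
    rng.shuffle(indices)
    n_original = round(original_fraction * len(original))
    keep_original = set(indices[:n_original])

    mixed = []
    for i, item in enumerate(original):
        prompt = item["prompt"]
        if i in keep_original:
            # Treated as a high-quality pair; labels are kept (assumption).
            mixed.append({**item, "source": "original"})
            continue
        a, b = sft_generate(prompt)
        # The gold RM plays the annotator: the higher-scoring response is "chosen".
        if gold_reward(prompt, a) >= gold_reward(prompt, b):
            chosen, rejected = a, b
        else:
            chosen, rejected = b, a
        mixed.append(
            {"prompt": prompt, "chosen": chosen, "rejected": rejected, "source": "sft"}
        )
    return mixed
```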

The evaluation metrics used in this study included the gold RM score, as described in the previous sections, and an additional evaluation using GPT-4o to determine the win rate. The win rate indicates how often responses generated by the trained LM were preferred over those generated by the initial SFT model.
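As a small illustration of the win-rate metric, the sketch below assumes each comparison has already been reduced to a boolean that is true when the trained LM's response was preferred. How ties or unparseable judgments are handled is not specified here, so they are simply assumed to be absent.

```python
from typing import Sequence


def win_rate(trained_lm_preferred: Sequence[bool]) -> float:
    """Fraction of pairwise comparisons in which the trained LM beat the SFT model."""
    if not trained_lm_preferred:
        raise ValueError("need at least one comparison")
    return sum(trained_lm_preferred) / len(trained_lm_preferred)
```

For example, three wins out of four comparisons give a win rate of 0.75.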

As shown in Table 1, based on three independent runs, fDPO outperformed the baseline in both evaluation metrics of the gold RM scores and GPT-4o win rates. The superior GPT-4o evaluation results suggest that fDPO is not merely optimizing for the reward model but is also learning to generate higher-quality responses from a human evaluation perspective. This demonstrates the effectiveness of our approach under realistic RLHF settings, providing a viable solution for scenarios where high-quality responses are limited.

6 Conclusions

This study explores how the quality of a preference dataset impacts LMs optimized using DPO, especially when compared with the RLHF method. We found that the quality of chosen responses significantly influences DPO performance. To address this, we proposed filtered DPO (fDPO), which uses a reward model to identify and discard lower-quality data, refining the DPO process. Our experiments demonstrated that fDPO improved DPO’s performance, effectively handling datasets with quality discrepancies. While the use of a reward model introduces additional computational costs and complexity, it allows for more effective leveraging of limited data. Overall, this highlights the practical value of fDPO’s approach, especially in scenarios where data quality is heterogeneous.

7 Limitations

The fDPO method shows promise, but it has some limitations. First, the method requires a reward model, which might be a drawback as it increases the complexity and computational time of the method. However, the availability of high-quality reward models provides an opportunity to leverage these high-end models within the DPO framework. Exploring the use of implicit rewards in DPO instead of an explicit reward model could also address some complications associated with training a separate reward model. Second, the algorithm is implemented in its simplest form, suggesting significant room for improvement and optimization. Third, our approach does not account for rejected responses, which could further enhance performance if considered. Finally, our experiments are limited to relatively small LLMs and comparisons with DPO. Future work should explore combining fDPO with other DPO-related extensions and conducting comparisons with other RLHF methods, especially with larger LLMs.

References

Amini et al. (2024) Afra Amini, Tim Vieira, and Ryan Cotterell. 2024. Direct preference optimization with an offset. arXiv preprint arXiv:2402.10571.
Azar et al. (2023) Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and Rémi Munos. 2023. A general theoretical paradigm to understand learning from human preferences. arXiv preprint arXiv:2310.12036.
Bai et al. (2022) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
Bender et al. (2021) Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In ACM Conference on Fairness, Accountability, and Transparency, pages 610–623.
Biderman et al. (2023) Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. 2023. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430.
Bommasani et al. (2022) Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang Lisa Li, Xuechen Li, Tengyu Ma, Ali Malik, Christopher D. Manning, Suvir Mirchandani, Eric Mitchell, Zanele Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Ben Newman, Allen Nie, Juan Carlos Niebles, Hamed Nilforoshan, Julian Nyarko, Giray Ogut, Laurel Orr, Isabel Papadimitriou, Joon Sung Park, Chris Piech, Eva Portelance, Christopher Potts, Aditi Raghunathan, Rob Reich, Hongyu Ren, Frieda Rong, Yusuf Roohani, Camilo Ruiz, Jack Ryan, Christopher Ré, Dorsa Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, Krishnan Srinivasan, Alex Tamkin, Rohan Taori, Armin W. Thomas, Florian Tramèr, Rose E. Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael Xie, Michihiro Yasunaga, Jiaxuan You, Matei Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, and Percy Liang. 2022. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
Bradley and Terry (1952) Ralph A. Bradley and Milton E. Terry. 1952. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39:324–345.
Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems.
Casper et al. (2023) Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, and Dylan Hadfield-Menell. 2023. Open problems and fundamental limitations of reinforcement learning from human feedback. Transactions on Machine Learning Research.
Chowdhery et al. (2023) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2023. PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113.
Coste et al. (2023) Thomas Coste, Usman Anwar, Robert Kirk, and David Krueger. 2023. Reward model ensembles help mitigate overoptimization. In International Conference on Learning Representations.
Dubois et al. (2023) Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. AlpacaFarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387.
Eisenstein et al. (2023) Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D’Amour, DJ Dvijotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran, Peter Shaw, and Jonathan Berant. 2023. Helping or herding? reward model ensembles mitigate but do not eliminate reward hacking. arXiv preprint arXiv:2312.09244.
Gao et al. (2023) Leo Gao, John Schulman, and Jacob Hilton. 2023. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866.
Gou and Nguyen (2024) Qi Gou and Cam-Tu Nguyen. 2024. Mixed preference optimization: Reinforcement learning with data selection and better reference model. arXiv preprint arXiv:2403.19443.
Guo et al. (2024) Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot, Johan Ferret, and Mathieu Blondel. 2024. Direct language model alignment from online AI feedback. arXiv preprint arXiv:2402.04792.
Kirk et al. (2024) Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, and Roberta Raileanu. 2024. Understanding the effects of RLHF on LLM generalisation and diversity. In International Conference on Learning Representations.
Levine et al. (2020) Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. 2020. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
Lin et al. (2022) Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring how models mimic human falsehoods. In Association for Computational Linguistics, pages 3214–3252.
Liu et al. (2024) Tianqi Liu, Zhen Qin, Junru Wu, Jiaming Shen, Misha Khalman, Rishabh Joshi, Yao Zhao, Mohammad Saleh, Simon Baumgartner, Jialu Liu, Peter J. Liu, and Xuanhui Wang. 2024. LiPO: Listwise preference optimization through learning-to-rank. arXiv preprint arXiv:2402.01878.
Moerland et al. (2022) Thomas M. Moerland, Joost Broekens, Aske Plaat, and Catholijn M. Jonker. 2022. Model-based reinforcement learning: A survey. arXiv preprint arXiv:2006.16712.
Nakano et al. (2021) Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. 2021. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pages 27730–27744.
Pace et al. (2024) Alizée Pace, Jonathan Mallinson, Eric Malmi, Sebastian Krause, and Aliaksei Severyn. 2024. West-of-N: Synthetic preference generation for improved reward modeling. arXiv preprint arXiv:2401.12086.
Pal et al. (2024) Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, and Colin White. 2024. Smaug: Fixing failure modes of preference optimisation with DPO-Positive. arXiv preprint arXiv:2402.13228.
Pikus et al. (2023) Ben Pikus, Will LeVine, Tony Chen, and Sean Hendryx. 2023. A baseline analysis of reward models’ ability to accurately analyze foundation models under distribution shift. arXiv preprint arXiv:2311.14743.
Radford et al. (2019) A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. 2019. Language models are unsupervised multitask learners. In OpenAI blog 1.8.
Rafailov et al. (2023) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems.
Rennie et al. (2017) S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. 2017. Self-critical sequence training for image captioning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1179–1195.
Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Singh et al. (2024) Anikait Singh, Fahim Tajwar, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, and Aviral Kumar. 2024. Understanding preference fine-tuning for large language models. In International Conference on Machine Learning.
Stiennon et al. (2020) Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. 2020. Learning to summarize from human feedback. In Advances in Neural Information Processing Systems.
Sutton and Barto (2018) R. S. Sutton and A. G. Barto. 2018. Reinforcement Learning, 2nd edition. MIT Press.
Tang et al. (2024) Yunhao Tang, Zhaohan Daniel Guo, Zeyu Zheng, Daniele Calandriello, Rémi Munos, Mark Rowland, Pierre Harvey Richemond, Michal Valko, Bernardo Ávila Pires, and Bilal Piot. 2024. Generalized preference optimization: A unified approach to offline alignment. arXiv preprint arXiv:2402.05749.
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Wang et al. (2024) Chaoqi Wang, Yibo Jiang, Chenghao Yang, Han Liu, and Yuxin Chen. 2024. Beyond reverse KL: Generalizing direct preference optimization with diverse divergence constraints. In International Conference on Learning Representations.
Williams (1992) R. J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256.
Xu et al. (2023) Jing Xu, Andrew Lee, Sainbayar Sukhbaatar, and Jason Weston. 2023. Some things are more CRINGE than others: Iterative preference optimization with the pairwise cringe loss. arXiv preprint arXiv:2312.16682.
Yuan et al. (2024) Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. 2024. Self-rewarding language models. arXiv preprint arXiv:2401.10020.
Zhang et al. (2024) Xiaoying Zhang, Jean-Francois Ton, Wei Shen, Hongning Wang, and Yang Liu. 2024. Overcoming reward overoptimization via adversarial policy optimization with lightweight uncertainty estimation. arXiv preprint arXiv:2403.05171.
Zhao et al. (2023) Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J. Liu. 2023. SLiC-HF: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425.

Appendix A Ethical considerations

This study addresses the challenge of aligning large language models with human preferences. We used publicly available datasets (AlpacaFarm and Anthropic HH), ensuring data transparency and privacy. While this study did not specifically evaluate models for biases, we acknowledge the significance of these considerations and commit to addressing them in future work.

Appendix B Justification on filtering chosen responses

To understand the impact of the quality of chosen responses on the performance of the DPO algorithm, we present a theoretical analysis focused on the differential sensitivity of the DPO algorithm to chosen ($y_{c}$) and rejected ($y_{r}$) responses. The analysis elucidates how the DPO update affects the probability of chosen responses relative to rejected ones, which is a key consideration in designing our proposed approach fDPO. This understanding is vital to enhance the efficiency of DPO, which fDPO achieves by selectively discarding low-quality $y_{c}$ samples during training. For simplicity in this analysis, we will occasionally omit the prompt $x$, denoting $\pi_{\theta}(y\,|\,x)$ simply as $\pi_{\theta}(y)$.

Proposition B.1.

Let the following assumptions hold:

• the magnitudes of the gradients for $\log\pi_{\theta}(y_{c})$ and $\log\pi_{\theta}(y_{r})$ are similar, i.e.,

$$\|\nabla_{\theta}\log\pi_{\theta}(y_{c})\|\simeq\|\nabla_{\theta}\log\pi_{\theta}(y_{r})\|,$$

• the gradients for $\log\pi_{\theta}(y_{c})$ and $\log\pi_{\theta}(y_{r})$ are nearly orthogonal, i.e.,

$$\nabla_{\theta}\log\pi_{\theta}(y_{c})^{\!\top}\nabla_{\theta}\log\pi_{\theta}(y_{r})\simeq 0,$$

• the ratio of the probabilities is given by $\pi_{\theta}(y_{c})/\pi_{\theta}(y_{r})=\delta$.

When the DPO algorithm updates the parameter $\theta$ with

$$\Delta\theta=\alpha\beta w(y_{c},y_{r})\left(\nabla_{\theta}\log\pi_{\theta}(y_{c})-\nabla_{\theta}\log\pi_{\theta}(y_{r})\right),$$

where $\alpha$ is the learning rate and is sufficiently small, the sensitivity of $\pi_{\theta}(y_{c})$ , defined as the magnitude of change in probability, $\Delta\pi_{\theta}(y)$ , is approximately $\delta$ times higher than that of $\pi_{\theta}(y_{r})$ .

Proof:

Since $\alpha$ is sufficiently small, which implies that the higher-order terms can be ignored, the variation in probabilities can be approximated as

$$\begin{aligned}
\Delta\pi_{\theta}(y) &= \Delta\theta^{\!\top}\nabla_{\theta}\pi_{\theta}(y)+\mathcal{O}(\Delta\theta^{\!\top}\Delta\theta) \\
&\simeq \pi_{\theta}(y)\,\Delta\theta^{\!\top}\nabla_{\theta}\log\pi_{\theta}(y).
\end{aligned}$$

Given the assumptions, the magnitudes of the gradients for $\log\pi_{\theta}(y_{c})$ and $\log\pi_{\theta}(y_{r})$ are similar, and these gradients are nearly orthogonal. Hence, the impact of $\Delta\theta$ on $\log\pi_{\theta}(y_{c})$ and $\log\pi_{\theta}(y_{r})$ would be similar in magnitude but differ in direction. However, due to the ratio $\pi_{\theta}(y_{c})/\pi_{\theta}(y_{r})=\delta$, the rate of change in $\pi_{\theta}(y_{c})$ is amplified by a factor of $\delta$ compared to $\pi_{\theta}(y_{r})$. Thus, under the DPO update, $\pi_{\theta}(y_{c})$ demonstrates a sensitivity that is approximately $\delta$ times higher than that of $\pi_{\theta}(y_{r})$. ∎
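For readers who want the intermediate step spelled out, the argument can be written as follows. This is a sketch obtained by substituting the update $\Delta\theta$ into the first-order approximation and applying the three assumptions; it additionally assumes $w(y_{c},y_{r})\neq 0$ so that the common factor cancels in the ratio.

```latex
\begin{aligned}
\Delta\theta^{\top}\nabla_{\theta}\log\pi_{\theta}(y_{c})
  &= \alpha\beta w(y_{c},y_{r})\left(\|\nabla_{\theta}\log\pi_{\theta}(y_{c})\|^{2}
     - \nabla_{\theta}\log\pi_{\theta}(y_{r})^{\top}\nabla_{\theta}\log\pi_{\theta}(y_{c})\right) \\
  &\simeq \alpha\beta w(y_{c},y_{r})\,\|\nabla_{\theta}\log\pi_{\theta}(y_{c})\|^{2}, \\
\Delta\theta^{\top}\nabla_{\theta}\log\pi_{\theta}(y_{r})
  &= \alpha\beta w(y_{c},y_{r})\left(\nabla_{\theta}\log\pi_{\theta}(y_{c})^{\top}\nabla_{\theta}\log\pi_{\theta}(y_{r})
     - \|\nabla_{\theta}\log\pi_{\theta}(y_{r})\|^{2}\right) \\
  &\simeq -\alpha\beta w(y_{c},y_{r})\,\|\nabla_{\theta}\log\pi_{\theta}(y_{r})\|^{2}, \\
\frac{|\Delta\pi_{\theta}(y_{c})|}{|\Delta\pi_{\theta}(y_{r})|}
  &\simeq \frac{\pi_{\theta}(y_{c})\,\|\nabla_{\theta}\log\pi_{\theta}(y_{c})\|^{2}}
               {\pi_{\theta}(y_{r})\,\|\nabla_{\theta}\log\pi_{\theta}(y_{r})\|^{2}}
   \simeq \frac{\pi_{\theta}(y_{c})}{\pi_{\theta}(y_{r})} = \delta.
\end{aligned}
```

The opposite signs of the first two results correspond to the update raising $\pi_{\theta}(y_{c})$ and lowering $\pi_{\theta}(y_{r})$ when $w(y_{c},y_{r})>0$.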

As the training progresses in DPO, it is generally observed that the ratio $\delta=\pi_{\theta}(y_{c})/\pi_{\theta}(y_{r})$ , representing how much more likely $y_{c}$ is compared to $y_{r}$ , tends to exceed $1$ . This phenomenon indicates an increased sensitivity towards the chosen responses, emphasizing the criticality of their quality within the DPO framework. Consequently, the presence of low-quality chosen responses in the dataset can significantly impede the effectiveness of DPO. Our proposed fDPO addresses this issue by selectively discarding samples with low-quality chosen responses during training, thereby enhancing the overall performance and robustness of the model.

However, it is essential to acknowledge that the assumptions leading to these observations are strong and may not hold in some contexts and datasets. Therefore, further experimental work is necessary to validate these assumptions. Additionally, considering rejected responses in fDPO represents a separate but exciting area for future exploration, potentially offering new insights into data refinement approaches of preference-based model optimization.

Appendix C Details of experiments

C.1 Hyperparameters

We provide details of the hyperparameters used in our experiments. The hyperparameters were chosen to optimize the performance of DPO. Table 2 presents the training parameters for DPO and RLHF on the AlpacaFarm dataset and for DPO on the Anthropic HH dataset. Table 3 lists the parameters for response generation.

Parameter	DPO 160M (AlpacaFarm)	DPO 1.4B (AlpacaFarm)	RLHF (AlpacaFarm)	DPO 2.8B (Anthropic HH)
epochs	8 (16 for fDPO)	4 (8 for fDPO)	3	4 (8 for fDPO)
batch size	64	64	32	64
learning rate	$1\times 10^{-6}$	$2\times 10^{-7}$	$1\times 10^{-6}$	$5\times 10^{-6}$
$\beta$	0.1	0.1	0.1	0.1
PPO epochs	-	-	4	-
number of rollouts	-	-	256	-

Table 2: Experimental parameters of DPO and RLHF

Parameter	AlpacaFarm	Anthropic HH
Top p	0.9 (1.0 for fDPO and RLHF)	0.9 (1.0 for fDPO)
Top k	0	0
Temperature	1.0	1.0
Max response tokens	64	256

Table 3: Response generation parameters

C.2 Prompts for GPT-4o evaluation

We utilize the evaluation prompt for assessing helpfulness as described in Rafailov et al. (2023). The prompt is as follows:

For the following query to a chatbot, which response is more helpful?

Query: <the user query>

Response A:
<either the test method or baseline>

Response B:
<the other response>

FIRST provide a one-sentence comparison of the two responses and explain
which you feel is more helpful. SECOND, on a new line, state only "A" or
"B" to indicate which response is more helpful. Your response should use
the format:
Comparison: <one-sentence comparison and explanation>
More helpful: <"A" or "B">

Similarly, we adopt the evaluation prompt for assessing harmlessness. The prompt is as follows:

For the following query to a chatbot, which response is more harmless?

Query: <the user query>

Response A:
<either the test method or baseline>

Response B:
<the other response>

FIRST provide a one-sentence comparison of the two responses and explain
which you feel is more harmless. SECOND, on a new line, state only "A" or
"B" to indicate which response is more harmless.
Your response should use the format:
Comparison: <one-sentence comparison and explanation>
More harmless: <"A" or "B">

C.3 Additional results on AlpacaFarm dataset

C.3.1 Detailed evaluation

The evaluation scores of the gold reward model for the preference datasets (high-quality, low-quality, mix-quality) of the AlpacaFarm dataset are detailed in Table 4. Each mix-quality dataset (Mix 1-5) combines randomly sampled data from the high-quality and low-quality datasets in equal proportions (50% each), using random seeds 1-5, respectively.

Model Size	Dataset Quality	Chosen Mean	Rejected Mean	Overall Mean
160M	High	-0.950	-2.786	-1.868
	Low	-2.153	-3.180	-2.667
	Mix 1	-1.549	-2.978	-2.263
	Mix 2	-1.547	-2.984	-2.265
	Mix 3	-1.555	-2.984	-2.270
	Mix 4	-1.551	-2.983	-2.267
	Mix 5	-1.545	-2.983	-2.264
1.4B	High	1.220	-0.996	0.113
	Low	-0.240	-1.482	-0.860
	Mix 1	0.500	-1.233	-0.367
	Mix 2	0.487	-1.236	-0.375
	Mix 3	0.487	-1.247	-0.380
	Mix 4	0.496	-1.231	-0.367
	Mix 5	0.495	-1.234	-0.370

Table 4: The evaluation scores of gold reward for AlpacaFarm dataset
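As a quick consistency check on Table 4, an equal mixture of the high-quality and low-quality datasets should have a chosen-response mean close to the average of the two. The snippet below copies the chosen means from the table and compares them; the helper name is illustrative.

```python
# Chosen-response gold-reward means copied from Table 4.
table4_chosen = {
    "160M": {"high": -0.950, "low": -2.153,
             "mix": [-1.549, -1.547, -1.555, -1.551, -1.545]},
    "1.4B": {"high": 1.220, "low": -0.240,
             "mix": [0.500, 0.487, 0.487, 0.496, 0.495]},
}


def expected_mix_mean(high: float, low: float, high_fraction: float = 0.5) -> float:
    """Mean of a mixture drawing `high_fraction` of its samples from the high-quality set."""
    return high_fraction * high + (1.0 - high_fraction) * low


# Largest absolute gap between each reported Mix mean and the expected mixture mean.
gaps = {}
for size, row in table4_chosen.items():
    expected = expected_mix_mean(row["high"], row["low"])
    gaps[size] = max(abs(m - expected) for m in row["mix"])
    print(f"{size}: expected {expected:.4f}, largest gap {gaps[size]:.4f}")
```

For both model sizes the reported Mix means lie within roughly 0.01 of the expected value, which is consistent with random sampling at a 50/50 ratio.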

Figure 6 provides the learning curves for DPO and fDPO with the 160M-sized LM, corresponding to the final performances depicted in Figure 1 (B) of the main text. The curves show that even though fDPO processes double the number of epochs compared to DPO, the total number of steps for fDPO is fewer than that for DPO. This reduction is due to the filtering process of fDPO, which decreased the data over epochs, resulting in fewer steps per epoch, as demonstrated in Figure 8. Additionally, when assessed using KL divergence, the performance of fDPO shows a trend towards converging with the DPO trained on the high-quality dataset, suggesting that fDPO can reduce the performance gap even when trained on mixed-quality data.
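To see why doubling the epochs can still yield fewer total steps, consider the toy calculation below. The batch size (64) and epoch counts (8 and 16) follow Table 2 for the 160M model, but the dataset size and the per-epoch retention factor of 0.8 are purely hypothetical values chosen for illustration.

```python
import math
from typing import Sequence


def total_steps(n_samples: int, batch_size: int, kept_ratio_per_epoch: Sequence[float]) -> int:
    """Sum optimizer steps over epochs, where epoch i uses kept_ratio_per_epoch[i] of the data."""
    return sum(
        math.ceil(n_samples * ratio / batch_size) for ratio in kept_ratio_per_epoch
    )


# DPO: 8 epochs over the full (hypothetical) dataset of 6400 samples.
dpo_steps = total_steps(6400, 64, [1.0] * 8)

# fDPO: 16 epochs, with the remaining data shrinking by a hypothetical factor of 0.8 per epoch.
fdpo_steps = total_steps(6400, 64, [0.8 ** epoch for epoch in range(16)])

print(dpo_steps, fdpo_steps)
```

Under these assumed numbers the geometric shrinkage more than offsets the doubled epoch count.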

Figure 7 presents the learning curves for DPO and fDPO applied to the mix-quality dataset with the 1.4B LM and the low-quality dataset with the 160M LM. In both contexts, fDPO consistently improved the performance of DPO over steps, echoing the results observed in the mix-quality dataset scenario with the 160M LM.

C.3.2 Analysis of filtered samples of fDPO

We examined how data was selectively discarded throughout the learning process of fDPO with the mix-quality dataset. Figure 8 presents the unfiltered ratio, accuracy, precision, and recall at each epoch. The unfiltered ratio reflects the proportion of data that remains after filtering. Accuracy reflects the overall correctness of the filtering decisions, both for deletion and retention of samples, based on their gold reward quality. Precision measures how accurately the samples decided for deletion were actually of lower quality, while recall evaluates the success in identifying and discarding all samples that warranted removal. The result of the unfiltered ratio indicates an exponential decay in the number of samples used in each epoch. The consistency of accuracy and precision across epochs suggests that data was discarded with a constant efficiency. The lower precision compared to accuracy can be attributed to the relatively small number of samples that warranted removal. Conversely, recall decreases with progressing epochs. This decline can be tied to the static errors within the proxy RM, leading to consistently overestimated $y_{c}$ samples, thus increasing their relative proportion over time. The figure contrasts various margin settings with the no-margin condition ( $\epsilon=0$ ), revealing that larger margins lead to slower filtering speeds. Notably, as the margin increases, precision improves at the expense of recall. This trade-off indicates the importance of carefully tuning the margin parameter $\epsilon$ to balance filtering efficacy.
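The four quantities can be computed as in the sketch below. It assumes two boolean flags per sample: `discarded`, the filtering decision made with the proxy RM, and `should_discard`, a hypothetical ground-truth label derived from the gold reward. The unfiltered ratio here is taken relative to the samples passed in, which may differ from the normalization used in Figure 8.

```python
from typing import Dict, Sequence


def filtering_metrics(
    discarded: Sequence[bool], should_discard: Sequence[bool]
) -> Dict[str, float]:
    """Unfiltered ratio, accuracy, precision, and recall of filtering decisions."""
    pairs = list(zip(discarded, should_discard))
    tp = sum(d and s for d, s in pairs)          # correctly discarded
    fp = sum(d and not s for d, s in pairs)      # discarded but should have been kept
    fn = sum(not d and s for d, s in pairs)      # kept but should have been discarded
    tn = sum(not d and not s for d, s in pairs)  # correctly kept
    n = len(pairs)
    return {
        "unfiltered_ratio": (fn + tn) / n,
        "accuracy": (tp + tn) / n,
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
    }
```

Raising the margin $\epsilon$ makes `discarded` true less often, which tends to reduce false positives (higher precision) while leaving more removable samples in place (lower recall), matching the trade-off described above.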

C.4 Additional results on Anthropic HH Datasets

C.4.1 Detailed evaluation

The evaluation scores of the gold reward model for our preference datasets (original, SFT-model-generated, mix-quality) of the Anthropic HH dataset are detailed in Table 5. The mix-quality datasets (Mix 1-3) each consist of 25% randomly sampled responses from the original Anthropic HH dataset and 75% from the responses generated by the SFT model, using random seeds 1-3, respectively.

Dataset	Type	Chosen Mean	Rejected Mean	Overall Mean
Helpful	Original	-0.294	-1.549	-0.922
	SFT Generated	-0.613	-1.931	-1.272
	Mix 1	-0.537	-1.836	-1.187
	Mix 2	-0.532	-1.839	-1.185
	Mix 3	-0.536	-1.833	-1.185
Harmless	Original	-3.142	-4.622	-3.882
	SFT Generated	-4.164	-5.455	-4.810
	Mix 1	-3.905	-5.245	-4.575
	Mix 2	-3.907	-5.250	-4.579
	Mix 3	-3.914	-5.250	-4.582

Table 5: The evaluation scores of gold reward for Anthropic HH datasets.

Figure 9 provides the learning curves for DPO and fDPO on the helpful dataset and the harmless dataset of the Anthropic HH datasets. The curves show that even though fDPO processes double the number of epochs compared to DPO, the total number of steps for fDPO is fewer due to the filtering process.

C.4.2 Further evaluation

In addition to the aforementioned experiment, we conducted an experiment with 50% original and 50% SFT-generated responses. The detailed results of this experiment are provided in Table 6.

Dataset	Method	Gold RM Score (SFT= $0.0$ ) $\uparrow$	GPT-4o Evaluation (win rate vs. SFT) $\uparrow$
Helpful	DPO	1.72 $\pm$ 0.03	0.575 $\pm$ 0.007
Helpful	fDPO	1.85 $\pm$ 0.05	0.602 $\pm$ 0.010
Harmless	DPO	2.74 $\pm$ 0.07	0.856 $\pm$ 0.012
Harmless	fDPO	3.23 $\pm$ 0.01	0.955 $\pm$ 0.007

Table 6: Evaluation on the Anthropic HH datasets, where the responses consist of 50% original and 50% SFT-generated responses. The values represent the mean and standard error over 3 seeds.
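For reference, a mean and standard error over a handful of seeds can be computed as below. The use of the sample standard deviation (dividing by $n-1$) is an assumption, since the exact estimator is not stated, and the input values in any usage are up to the reader.

```python
import math
from typing import Sequence, Tuple


def mean_and_standard_error(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard error), using the sample variance with n - 1 in the denominator."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    mean = sum(values) / n
    sample_variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(sample_variance / n)
```

With three seeds, the standard error is the sample standard deviation divided by $\sqrt{3}$.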