Understanding Reference Policies in Direct Preference Optimization

Yixin Liu¹ Pengfei Liu² Arman Cohan^1,3
¹Yale University ²Shanghai Jiao Tong University ³Allen Institute for AI
[email protected], [email protected], [email protected]

Abstract

Direct Preference Optimization (DPO) has become a widely used training method for the instruction fine-tuning of large language models (LLMs). In this work, we explore an under-investigated aspect of DPO – its dependency on the reference model or policy. Such reference policies, typically instantiated as the model to be further fine-tuned, are important since they can impose an upper limit on DPO’s effectiveness. Therefore, we address three related research questions in this work. First, we explore the optimal strength of the KL-divergence constraint in DPO, which penalizes deviations from the reference policy, and find that DPO is sensitive to this strength. Next, we examine the necessity of reference policies for instruction fine-tuning by providing both theoretical and empirical comparisons between DPO and related learning objectives, demonstrating DPO’s superiority. Additionally, we investigate whether DPO benefits from stronger reference policies, finding that a stronger reference policy can lead to improved performance, but only when it is similar to the model being fine-tuned. Our findings highlight the confounding role of reference policies in DPO and offer insights for best practices, while also identifying open research questions for future studies.


1 Introduction

Recently, alternative training algorithms have been proposed to address the limitations of the standard supervised fine-tuning (SFT) algorithm of large language models (LLMs). Specifically, unlike the SFT training objective which relies on a single gold reference, advanced training algorithms, such as Direct Preference Optimization (DPO) (Rafailov et al., 2023) and contrastive learning (Yuan et al., 2023; Zhao et al., 2023b), require the LLMs to learn to differentiate the quality of various candidate outputs and to generate outputs that are more likely to be preferred, i.e., preference learning Ouyang et al. (2022). Among such methods, DPO is especially intriguing because of its connection with reinforcement learning (RL). Specifically, Rafailov et al. (2023) show that an LLM trained under the KL-constrained RL setting implicitly learns a reward model which is used in the DPO training objective: $r_{\theta}(x,y)=\beta\log\frac{p_{\theta}(y|x)}{p_{\mathrm{ref}}(y|x)}$ , where $x$ and $y$ represent the input and output texts, $p_{\theta}$ and $p_{\mathrm{ref}}$ are the distributions parameterized by the LLM being fine-tuned and the reference LLM respectively, and $\beta$ controls the strength of the KL-divergence regularization applied from the reference LLM.

This reward function parameterization has interesting implications. Specifically, there can exist a discrepancy between the prediction of $r_{\theta}(x,y)$ , the optimization target, and $p_{\theta}(y|x)$ , the learned distribution parameterized by the fine-tuned LLM. Consequently, given an input $x$ and a pair of outputs $y_{w}$ and $y_{l}$ , of which $y_{w}$ has better quality, even if the reward model correctly predicts that $r_{\theta}(x,y_{w})>r_{\theta}(x,y_{l})$ , this does not guarantee $p_{\theta}(y_{w}|x)>p_{\theta}(y_{l}|x)$ , i.e., that the fine-tuned model learns to assign a higher probability to the better output Chen et al. (2024). We verified this observation in a preliminary study, where we found that the learned reward model $r_{\theta}$ has an accuracy of 77.3% while the learned distribution $p_{\theta}$ only achieves an accuracy of 46.2% in comparing human-annotated output pairs, using the DPO-tuned and reference LLMs in Tulu 2 Ivison et al. (2023) (complete results are in §4.2).

The discrepancy noted above highlights an under-explored aspect of DPO – its dependence on the reference model/policy $p_{\mathrm{ref}}$ (we use the terms “model” and “policy” interchangeably to reflect the contexts of both LLM fine-tuning and RL). The reference model in practice is usually instantiated by the SFT model, which is also the starting point for DPO fine-tuning (Rafailov et al., 2023). As a result, the fine-tuned model cannot arbitrarily deviate from, or thereby improve upon, the reference (SFT) model because the KL-divergence constraint penalizes such deviations, leading to the discrepancy in the ranking performance between $r_{\theta}$ and $p_{\theta}$ . It thus seems that removing or reducing such constraints might be beneficial. Recent work Gorbatovski et al. (2024) does find that updating the reference policy helps improve DPO’s performance. However, eliminating this dependence can actually lead to performance degradation because of model degeneration (Rafailov et al., 2023). These findings suggest a complicated relationship between the reference policy and the DPO performance.

Therefore, we aim to further understand the effects and the role of the reference policy in DPO. To this end, we choose the task of instruction fine-tuning of LLMs Ouyang et al. (2022) as the test bed and two series of open-source pre-trained LLMs, Tulu 2 and Mistral Jiang et al. (2023), to analyze the reference policy in DPO on the AlpacaEval benchmark Li et al. (2023). Specifically, we investigate three main research questions (RQs).

First, assuming that reference policies are necessary for instruction fine-tuning, we explore RQ1: what the optimal strength of the regularization from the reference policy should be (§4). We find that a smaller constraint typically leads to better performance, until the constraint becomes too small, at which point performance begins to degrade. We then conduct an in-depth analysis of this phenomenon, examining it from both a sequence-level perspective, focusing on the ranking accuracy of learned models, and a token-level perspective, analyzing the fine-grained differences between the fine-tuned and reference models. This analysis reveals interesting findings. For example, after DPO fine-tuning, the predicted probability of the end-of-sequence token decreases by more than 10,000 times on average, with smaller constraints leading to greater differences in probability assignments.

Taking a step back, we also investigate RQ2: whether the reference policy is necessary for preference learning (§5). To this end, we examine two related reward parameterizations for instruction fine-tuning, specifically using either the posterior probability $p_{\theta}(y|x)$ or the likelihood function $p_{\theta}(x|y)$ as the reward function, serving as an alternative to the DPO parameterization $\frac{p_{\theta}(y|x)}{p_{\mathrm{ref}}(y|x)}$ . We discuss their connections with RL under the analytical framework proposed in DPO, as well as provide empirical comparisons. We find that while these alternative approaches can also improve the original SFT model performance, they cannot outperform DPO. The experimental results also reveal that the reference policy helps stabilize model behavior, as without it, the model can learn to generate outputs twice as long, potentially overfitting the human annotators’ preference for more verbose outputs.

Given the importance of the reference policy in the DPO objective, the third question (RQ3) we study is: whether DPO benefits from a stronger reference policy (§6). In detail, rather than the standard practice of using the SFT model to be fine-tuned as the reference model, we employ two more capable LLMs as the reference model. Our experiments reveal an interesting finding: stronger reference models can offer more benefits than the SFT model, but they only improve performance when they are compatible with the model being fine-tuned. This finding raises questions about the compatibility of the reference model with the model to be fine-tuned in DPO, suggesting a need for future work to analyze and measure their intrinsic similarity.

The experiments and analyses in this work emphasize the confounding effect of the reference policy in DPO. Our findings shed light on the best practices for DPO: (1) when using the SFT model as the reference policy, a smaller yet sufficiently strong constraint typically improves performance; (2) a stronger reference policy can improve performance but must be compatible with the model to be fine-tuned. Meanwhile, they also highlight open research questions regarding the necessity of the reference policy and call for more theoretical analysis of the relationship between the reference policy and the DPO training performance and dynamics. The training scripts and fine-tuned checkpoints are released at https://github.com/yale-nlp/refdpo.

2 Preliminaries

2.1 Instruction Fine-tuning

Instruction fine-tuning aims to align the pre-trained language models with the users’ intentions Ouyang et al. (2022). The frequently used training methods of instruction fine-tuning include standard supervised fine-tuning (SFT), reinforcement learning (RL) methods Stiennon et al. (2020); Ouyang et al. (2022) such as Proximal Policy Optimization Schulman et al. (2017), and reward-based training methods Rafailov et al. (2023); Yuan et al. (2023); Zhao et al. (2023a) such as Direct Preference Optimization Rafailov et al. (2023). Both RL and reward-based training methods offer advantages over standard SFT by leveraging human feedback, typically in the form of human-annotated rewards assigned to various candidates, enabling more effective alignment of LLMs with human preferences. Therefore, they are widely adopted in the instruction fine-tuning of large language models (LLMs) such as GPT-4 Achiam et al. (2023), Gemini Team et al. (2023), and Llama-3 (technical report at https://ai.meta.com/blog/meta-llama-3/).

2.2 Direct Preference Optimization

In Rafailov et al. (2023), DPO is proposed as an alternative to RL for training LLMs with human feedback. The DPO objective is derived from the KL-constrained reinforcement learning setting:

$\max_{p_{\theta}}\mathbb{E}_{x\sim D,y\sim p_{\theta}(y|x)}[r(x,y)]-\beta\mathbb{D}_{\mathrm{KL}}[p_{\theta}(y|x)||p_{\mathrm{ref}}(y|x)],$

(1)

where $x$ is the input text, $y$ is the output text, $p_{\theta}$ is the probability distribution parameterized by the LLM under fine-tuning with trainable parameters $\theta$ , $p_{\mathrm{ref}}$ is a reference distribution which is often instantiated by another LLM, $\beta$ is a parameter controlling the strength of the KL-divergence penalty:

$\mathbb{D}_{\mathrm{KL}}[p_{\theta}(y|x)||p_{\mathrm{ref}}(y|x)]=\mathbb{E}_{y\sim p_{\theta}(y|x)}[\log\frac{p_{\theta}(y|x)}{p_{\mathrm{ref}}(y|x)}]$ .

(2)
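To make the role of $\beta$ in Eq. 1 concrete, the following is a minimal sketch on a toy problem with a single input and three candidate outputs. The reference probabilities and rewards are hypothetical values chosen for illustration. The sketch evaluates the objective and the closed-form maximizer $p^{*}(y|x)\propto p_{\mathrm{ref}}(y|x)\exp(r(x,y)/\beta)$ that underlies the DPO derivation in Rafailov et al. (2023).

```python
import math

def kl_constrained_objective(p, p_ref, reward, beta):
    """Value of Eq. 1 for one input x over a small discrete output space."""
    expected_reward = sum(pi * ri for pi, ri in zip(p, reward))
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, p_ref) if pi > 0)
    return expected_reward - beta * kl

def closed_form_optimum(p_ref, reward, beta):
    """Maximizer of Eq. 1: proportional to p_ref(y|x) * exp(r(x, y) / beta)."""
    weights = [qi * math.exp(ri / beta) for qi, ri in zip(p_ref, reward)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical toy values (not from the paper).
p_ref = [0.7, 0.2, 0.1]   # reference policy over three candidate outputs
reward = [0.0, 1.0, 2.0]  # reward of each candidate

for beta in (10.0, 1.0, 0.1):
    p_star = closed_form_optimum(p_ref, reward, beta)
    value = kl_constrained_objective(p_star, p_ref, reward, beta)
    print(f"beta={beta}: p*={[round(p, 3) for p in p_star]}, objective={value:.3f}")
```

With a large $\beta$ the maximizer stays close to the reference policy, while a small $\beta$ lets it concentrate on the highest-reward output, which is the trade-off studied in §4.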

$r(x,y)$ is the reward of $y$ given $x$ under the Bradley-Terry (BT) model Bradley and Terry (1952):

$p(y_{1}\succ y_{2}|x)=\frac{\exp(r(x,y_{1}))}{\exp(r(x,y_{1}))+\exp(r(x,y_{2}))}$ ,

(3)

where $p(y_{1}\succ y_{2}|x)$ is the probability of output $y_{1}$ being better than output $y_{2}$ . This reward model can be optimized by maximum likelihood estimation:

$\begin{aligned} \mathcal{L}(r)&=-\mathbb{E}_{(x,y_{w},y_{l})\sim D}[\log p(y_{w}\succ y_{l}|x)]\\ &=-\mathbb{E}_{(x,y_{w},y_{l})\sim D}[\log\sigma(r(x,y_{w})-r(x,y_{l}))].\end{aligned}$

(4)

Here $x$ is an input in the dataset $D$ , $y_{l}$ and $y_{w}$ are a pair of outputs of which $y_{w}$ has better quality, and $\sigma(\cdot)$ is the sigmoid function.
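The step from the BT probability (Eq. 3) to the sigmoid form in Eq. 4 is a short identity. As a worked derivation using only the definitions above:

```latex
\begin{aligned}
p(y_{w}\succ y_{l}|x)
&=\frac{\exp(r(x,y_{w}))}{\exp(r(x,y_{w}))+\exp(r(x,y_{l}))}\\
&=\frac{1}{1+\exp\bigl(-(r(x,y_{w})-r(x,y_{l}))\bigr)}\\
&=\sigma\bigl(r(x,y_{w})-r(x,y_{l})\bigr).
\end{aligned}
```

Only the reward difference enters the loss, so adding any term that depends on $x$ alone to $r(x,y)$ leaves the preference probability unchanged.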

Rafailov et al. (2023) show that optimizing the aforementioned RL objective (Eq. 1) is equivalent to optimizing the following objective, which learns a reward model parameterized by the model’s policy $p_{\theta}$ (and the reference policy $p_{\mathrm{ref}}$ ) under the BT model:

$\begin{aligned} &\mathcal{L}_{\mathrm{DPO}}(p_{\theta};p_{\mathrm{ref}})=\\ &-\mathbb{E}_{(x,y_{w},y_{l})\sim D}[\log\sigma(\beta\log\frac{p_{\theta}(y_{w}|x)}{p_{\mathrm{ref}}(y_{w}|x)}-\beta\log\frac{p_{\theta}(y_{l}|x)}{p_{\mathrm{ref}}(y_{l}|x)})],\end{aligned}$

(5)

where a reward model is implicitly learned:

$r_{\theta}(x,y)=\beta\log\frac{p_{\theta}(y|x)}{p_{\mathrm{ref}}(y|x)}$ .

(6)

We note that the reference policy $p_{\mathrm{ref}}$ used in DPO is typically the LLM after supervised fine-tuning, which is the same LLM that will be fine-tuned using DPO. For simplicity, we refer to this model as the SFT model.
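As an illustration of Eqs. 5 and 6, the following is a minimal sketch that computes the DPO loss from precomputed sequence log-probabilities. It assumes the log-probabilities $\log p_{\theta}(y|x)$ and $\log p_{\mathrm{ref}}(y|x)$ have already been obtained by summing token log-probabilities; the function names and the tuple layout are hypothetical and are not taken from the released training scripts.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def implicit_reward(logp_theta, logp_ref, beta):
    """Eq. 6: beta * log(p_theta(y|x) / p_ref(y|x)) from sequence log-probs."""
    return beta * (logp_theta - logp_ref)

def dpo_loss(batch, beta):
    """Eq. 5 averaged over a batch.

    Each item is (theta_w, theta_l, ref_w, ref_l): sequence log-probabilities
    of the preferred (w) and dispreferred (l) outputs under the model being
    fine-tuned (theta) and the frozen reference model (ref).
    """
    total = 0.0
    for theta_w, theta_l, ref_w, ref_l in batch:
        margin = (implicit_reward(theta_w, ref_w, beta)
                  - implicit_reward(theta_l, ref_l, beta))
        total -= log_sigmoid(margin)
    return total / len(batch)

# At initialization the fine-tuned model equals the reference model, so both
# implicit rewards are zero and the loss is log(2) regardless of beta.
print(dpo_loss([(-10.0, -12.0, -10.0, -12.0)], beta=0.1))
```

Because the margin is scaled by $\beta$, a smaller $\beta$ requires a larger change in the log-probability ratios to achieve the same loss reduction, which is one way to read the weaker constraint.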

2.3 Related Work on Preference Learning

Apart from DPO, various training methods of preference learning have been proposed recently. For example, IPO (Azar et al., 2024) has been proposed to address the potential limitation of DPO, which tends to overlook the KL-divergence regularization, making it easy to overfit. On the other hand, many preference learning algorithms are reference-policy-free. Among these, contrastive learning is a major category that interprets the model-predicted probability of an output as a quality score Liu et al. (2022); Zhao et al. (2023b); Yuan et al. (2023); Xu et al. (2024). Other related methods also share similar training objectives that are based on the model-predicted probability Xu et al. (2023); Hong et al. (2024); Meng et al. (2024). These methods have been shown to achieve better or comparable performance to DPO in various settings, without the need for a reference model. However, we note that additional normalization or regularization is still required for these methods. For example, an MLE objective on the positive examples is used by SLiC-HF Zhao et al. (2023a), CPO Xu et al. (2024), ORPO Hong et al. (2024), among others. Meanwhile, SimPO Meng et al. (2024) highlights the importance of length normalization of the sequence log-likelihood in improving method effectiveness. In this work, we focus on analyzing DPO since it has a more well-established theoretical background while presenting a comparison with reference-free training objectives in §5.

3 DPO for Instruction Fine-tuning

In this work, we center our analysis in the context of DPO’s applications in instruction fine-tuning. Therefore, here we outline the experimental settings for our analysis in the following sections.

Datasets

Following the setup of Zephyr (Tunstall et al., 2023), we use the UltraFeedback Cui et al. (2023) dataset, specifically its binarized version (available at https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized), for DPO fine-tuning. It contains 64K data examples, each consisting of one user instruction and a positive-negative output pair scored by GPT-4. Further details of the datasets are in Appendix A.1.

Models

We mainly use two SFT LLMs as the starting point for DPO fine-tuning from two recent works on instruction fine-tuning. The first LLM, mistral-7b-sft-beta (https://huggingface.co/HuggingFaceH4/mistral-7b-sft-beta) from Zephyr (Tunstall et al., 2023), is fine-tuned from the Mistral 7B base model (Jiang et al., 2023) on the UltraChat Ding et al. (2023) dataset (https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k). We will refer to this model as mistral-7b for simplicity. The second LLM, tulu-2-7b (https://huggingface.co/allenai/tulu-2-7b) from Tulu 2 (Ivison et al., 2023), is fine-tuned from the Llama-2-7B base model Touvron et al. (2023) on the TULU v2 SFT mixture dataset (https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture). We choose these two models for better transparency and reproducibility, since their experimental artifacts are publicly accessible.

Training Settings

We follow the training recipe proposed in Tunstall et al. (2023) for the experiments, which is also adopted by Ivison et al. (2023). Specifically, the models are fine-tuned with DPO for 3 epochs on the UltraFeedback dataset, with the batch size set to 32 and linear learning rate scheduling with warmup (more details are in Appendix A.2). Tunstall et al. (2023) note that while the models fine-tuned using DPO tend to overfit, this overfitting does not hurt the models’ performance, and they thus picked the final checkpoint for evaluation. However, to better understand the effects of the DPO training objective on the final model performance, we instead select the checkpoints based on their loss on the validation set of the UltraFeedback dataset.
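The schedule and checkpoint selection described above can be sketched as follows. This is a minimal illustration, not the released training script: the warmup fraction, peak learning rate, and checkpoint names are hypothetical placeholders (the actual values are in Appendix A.2), and only the epoch count, batch size, and approximate dataset size come from the text.

```python
def linear_schedule_with_warmup(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at an optimizer step: linear ramp-up, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

def select_checkpoint(validation_losses):
    """Return the checkpoint name with the lowest validation loss."""
    return min(validation_losses, key=validation_losses.get)

steps_per_epoch = 64_000 // 32         # about 64K examples, batch size 32
total_steps = 3 * steps_per_epoch      # 3 epochs
warmup_steps = int(0.1 * total_steps)  # ASSUMED warmup fraction (hypothetical)
peak_lr = 5e-7                         # ASSUMED peak learning rate (hypothetical)

for step in (0, warmup_steps, total_steps // 2, total_steps):
    lr = linear_schedule_with_warmup(step, total_steps, warmup_steps, peak_lr)
    print(step, lr)

# Hypothetical validation losses, one per saved checkpoint.
print(select_checkpoint({"epoch-1": 0.61, "epoch-2": 0.58, "epoch-3": 0.64}))
```

Selecting by validation loss rather than taking the final checkpoint means an overfit final epoch is not evaluated, which matches the motivation given above.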

Evaluations

We mainly focus on evaluating the LLMs’ instruction-following capabilities. To this end, we use the AlpacaEval2 Li et al. (2023) benchmark for the main evaluation, which consists of 805 test examples for evaluating instruction-following and uses GPT-4 to evaluate the system outputs by comparing them with the outputs generated by GPT-4 (gpt-4-1106-preview is used both to generate the reference outputs and to perform the pairwise output comparison). Notably, its length-controlled version Dubois et al. (2024a), which mitigates the length bias of GPT-4’s evaluations by predicting and canceling it, achieves a 0.98 Spearman correlation with LMSYS’ Chatbot Arena Chiang et al. (2024) evaluation results collected from crowd-sourced human participants. Therefore, we mainly report the length-controlled AlpacaEval2 scores in the following sections.

4 RQ1: What Is the Optimal KL Constraint Strength for DPO?

$\beta$	mistral-7b	tulu-2-7b
$\infty$ (original)	7.57	8.50
0.1 (previous work*)	13.42	9.20
0.1	14.03	9.38
0.05	13.29	9.96
0.02	16.06	10.46
0.01	16.25	7.86
0.005	12.36	[degenerate]

Table 1: Model performance on AlpacaEval2 with varying values of $\beta$ (regularization strength with respect to the reference model). The DPO fine-tuning is started from mistral-7b and tulu-2-7b respectively. *: the DPO fine-tuned models in previous work, Zephyr Tunstall et al. (2023) and Tulu-2 Ivison et al. (2023).

The regularization from the reference policy in DPO is introduced by the KL-divergence constraint (Eq. 2). Therefore, we start our analysis of reference policies by investigating the impact of this KL constraint. Specifically, in KL-constrained RL and DPO, the weighting coefficient $\beta$ controls the trade-off between maximizing the reward $r$ and minimizing the deviation from the reference policy $p_{\mathrm{ref}}$ (Eq. 1). Therefore, we vary the value of $\beta$ to understand the impact of this KL constraint.

4.1 Main Results

In Table 1, we show the AlpacaEval2 performance of the models fine-tuned using DPO from mistral-7b and tulu-2-7b. Following the original setting of DPO, we also used these two SFT models as the reference models. We found a similar trend with the two series of fine-tuned models – a smaller KL constraint generally improves performance, until the constraint becomes too small and leads to performance degradation.

The above observation can be intuitively explained by the need for the fine-tuned model to deviate from the reference model, which is the SFT model targeted for improvement, while maintaining necessary regularization. However, we note the algorithm’s sensitivity to the strength of the KL constraint: while the studies by Zephyr Tunstall et al. (2023) and Tulu 2 Ivison et al. (2023) both set the value of $\beta$ to be $0.1$ , the results in Table 1 suggest that a smaller value of $\beta$ may be more effective, yet with an increasing risk of performance degradation. Next, we present further analysis of this KL-constraint strength sensitivity.

4.2 Ranking Accuracy

Recent work Chen et al. (2024) noted a discrepancy between the learned (implicit) reward model and the learned policy model in DPO in terms of their capabilities of correctly ranking candidate outputs according to their quality. Particularly, the ranking accuracy of the learned reward model (Eq. 6) is usually high, indicating that the better output $y_{w}$ is assigned a higher reward, i.e., $r_{\theta}(x,y_{w})>r_{\theta}(x,y_{l})$ . However, Chen et al. (2024) found that the ranking accuracy of the learned policy $p_{\theta}$ remains low even after DPO fine-tuning. They partly attributed this to the generally low ranking accuracy of the reference policy $p_{\mathrm{ref}}$ , noting that DPO fine-tuning rarely reverses the preference (i.e., $p(y_{w}|x)$ vs. $p(y_{l}|x)$ ) due to the KL constraint.

We noticed a similar trend in our experiments. Specifically, Table 2 shows the ranking accuracy of the learned reward model $r_{\theta}$ and the policy model $p_{\theta}$ on the validation set of UltraFeedback. We observe a negative correlation between the strength of the KL constraint ( $\beta$ ) and the ranking accuracy of $p_{\theta}$ – a sufficiently small $\beta$ is necessary to exceed the $50\%$ ranking accuracy of the random oracle baseline. Furthermore, although a smaller $\beta$ can lead to a higher policy ranking accuracy, it can also cause model degradation, as noted in §4.1 (Chen et al. (2024) have also noted that the ranking accuracy and the generation performance of the learned policy are not always positively correlated).

$\beta$	mistral-7b		tulu-2-7b
$\beta$	$r_{\theta}$	$p_{\theta}$	$r_{\theta}$	$p_{\theta}$
$\infty$ (original)	0.500	0.435	0.500	0.439
0.1	0.718	0.495	0.773	0.462
0.05	0.744	0.513	0.778	0.483
0.02	0.746	0.544	0.766	0.536
0.01	0.751	0.605	0.758	0.605
0.005	0.766	0.704	0.750	0.680

Table 2: Ranking accuracy of the learned reward model $r_{\theta}(x,y)$ and the policy model $p_{\theta}(y|x)$ on the UltraFeedback validation set.
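The two ranking accuracies in Table 2 can be computed from the same sequence log-probabilities. The sketch below is a minimal illustration under stated assumptions: the input tuples are hypothetical values, and counting a tie as half a hit is an assumption made here (it yields the 0.500 reward accuracy reported for the original SFT model, whose implicit reward is zero for every output), not a documented detail of the evaluation.

```python
def ranking_accuracies(pairs):
    """Return (reward accuracy, policy accuracy) over preference pairs.

    Each pair is (theta_w, theta_l, ref_w, ref_l): sequence log-probabilities
    of the preferred (w) and dispreferred (l) outputs under the fine-tuned
    model (theta) and the reference model (ref).
    """
    def score(better, worse):
        # ASSUMPTION: a tie counts as half a hit.
        if better > worse:
            return 1.0
        if better == worse:
            return 0.5
        return 0.0

    reward_total = policy_total = 0.0
    for theta_w, theta_l, ref_w, ref_l in pairs:
        # Implicit reward ordering (Eq. 6); beta > 0 does not change the ranking.
        reward_total += score(theta_w - ref_w, theta_l - ref_l)
        # Policy ordering: compare p_theta(y_w|x) with p_theta(y_l|x) directly.
        policy_total += score(theta_w, theta_l)
    n = len(pairs)
    return reward_total / n, policy_total / n

# Hypothetical pairs. In the first, the reference strongly prefers y_l, so the
# reward ranks the pair correctly while the policy still prefers y_l.
pairs = [(-46.0, -42.0, -50.0, -40.0), (-30.0, -35.0, -31.0, -33.0)]
print(ranking_accuracies(pairs))
```

The first pair shows how the discrepancy arises: the fine-tuned model has moved toward $y_{w}$ relative to the reference, which is all the reward measures, but not far enough to overturn the reference model's original preference.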

4.3 Token-Level Difference

Figure 1: The average token log-probability difference distribution between the DPO-fine-tuned and reference models (Eq. 8). The models are fine-tuned from mistral-7b with different values of $\beta$ .

Rafailov et al. (2024) have recently shown that DPO implicitly parameterizes a dense reward function:

$r_{\theta}(y^{(i)},s^{(i)})=\beta\log\frac{p_{\theta}(y^{(i)}|x,y^{(<i)})}{p_{\mathrm{ref}}(y^{(i)}|x,y^{(<i)})}$ ,

(7)

where $y^{(i)}$ is the $i$ -th token (action step) in the output $y$ , $y^{(<i)}$ is the prefix of $y$ before $y^{(i)}$ . $s^{(i)}$ is the current “state”, which is determined by the input $x$ and the prefix $y^{(<i)}$ . Intuitively, as noted by Rafailov et al. (2024), this formulation implies that DPO can learn token-level credit assignment, which is proportional to the difference between the log probabilities assigned by the trained model and the reference model. Therefore, we use the statistic, token log-probability difference, to study the token-level difference between the trained and reference models:

$\hat{r}_{\theta}(y^{(i)},s^{(i)})=\log\frac{p_{\theta}(y^{(i)}|x,y^{(<i)})}{p_{\mathrm{ref}}(y^{(i)}|x,y^{(<i)})}$ .

(8)
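Aggregating Eq. 8 into a per-token statistic can be sketched as follows. This is a minimal illustration: it assumes the per-position log-probabilities under both models are already available as (token, log p_theta, log p_ref) records, and the record format, function name, and example values are hypothetical. The `min_count` argument is an optional frequency filter so that rare tokens do not dominate the ranking.

```python
from collections import defaultdict

def average_token_logprob_difference(records, min_count=1):
    """Average Eq. 8 per vocabulary token.

    records: iterable of (token, logp_theta, logp_ref), one per output position.
    Returns {token: mean(logp_theta - logp_ref)} for tokens seen >= min_count times.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for token, logp_theta, logp_ref in records:
        sums[token] += logp_theta - logp_ref
        counts[token] += 1
    return {tok: sums[tok] / counts[tok]
            for tok in sums if counts[tok] >= min_count}

# Hypothetical records for illustration only.
records = [
    ("</s>", -12.0, -2.0),
    ("</s>", -10.0, -2.0),
    ("the", -1.0, -1.5),
]
diffs = average_token_logprob_difference(records)
# Most downweighted tokens first.
for token, diff in sorted(diffs.items(), key=lambda kv: kv[1]):
    print(token, diff)
```

A negative average means the fine-tuned model assigns the token a lower probability than the reference model does when the token appears in the data.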

Figure 1 shows the distribution of the average log-probability difference of each token in the model’s vocabulary on the UltraFeedback validation set, with models fine-tuned from mistral-7b. We note the following:

(1) Similar to the findings in Rafailov et al. (2024), we found that DPO fine-tuning leads to a decrease in token probabilities in general.

(2) The distribution of token-level log-probability differences is imbalanced, with a small portion of tokens receiving very different probabilities.

(3) A smaller KL constraint leads to a larger portion of tokens with large probability differences.

Figure 2 displays the top 10 tokens with the largest probability differences that appear at least 100 times. Notably, the end-of-sequence token (EOS) is the most downweighted among them. Additionally, a smaller KL constraint results in a larger difference for the EOS token. Consequently, we observe that a smaller KL constraint generally leads to longer outputs, suggesting that the models have learned a preference for output length.

Discussion

Our token-level analysis reveals that as the strength of the KL constraint decreases, the DPO-fine-tuned model begins to assign significantly different probabilities to a small subset of specific tokens compared to the reference model. The extremeness of these log-probability differences is notable: a difference of $-10$ indicates that the token is downweighted by $e^{10}\approx 22000$ times on average. This partially explains the sensitivity to $\beta$ , as the increasing extremity of log-probability differences can destabilize the model. Moreover, it highlights a limitation of the sequence-level KL constraint: it does not effectively prevent extreme token-level differences.

5 RQ2: Is a Reference Policy Necessary for Effective Preference Learning?

We now take a step back to examine the necessity of using the SFT model as the reference policy for preference learning algorithms. As discussed in §2.3, there are already recent studies that proposed reference-free algorithms. We are, however, more interested in a closer comparison with DPO in terms of the training objective similarity and the connection with the RL objective, to better understand the role of the reference policies in DPO.

5.1 Training Objectives

To this end, we investigate two related training objectives and provide theoretical analyses in the framework established by Rafailov et al. (2023): optimizing the model using (1) the posterior probability ( $p_{\theta}(y|x)$ ) and (2) the likelihood function ( $p_{\theta}(x|y)$ ) as the reward function in the BT model.

Posterior probability as the reward

Recall that Rafailov et al. (2023) have shown that the DPO training objective learns a reward model: $r_{\theta}(x,y)=\beta\log\frac{p_{\theta}(y|x)}{p_{\mathrm{ref}}(y|x)}$ . An alternative option is to use the posterior probability with a scaling hyperparameter $\beta$ as the reward function, which does not require a reference model:

$\tilde{r}_{\theta}(x,y)=\beta\log p_{\theta}(y|x).$

(9)

The training objective under the BT model (Eq. 4) then becomes

$\begin{aligned} &\mathcal{L}_{\mathrm{Prob}}(p_{\theta})=\\ &-\mathbb{E}_{(x,y_{w},y_{l})\sim D}[\log\sigma(\beta\log p_{\theta}(y_{w}|x)-\beta\log p_{\theta}(y_{l}|x))].\end{aligned}$

(10)

We note the connection between Eq. 10 and Expected Risk Minimization Edunov et al. (2018). Specifically, $\mathcal{L}_{\mathrm{Prob}}(p_{\theta})$ can be rewritten as

$\mathcal{L}_{\mathrm{Prob}}(p_{\theta})=-\mathbb{E}_{(x,y_{w},y_{l})\sim D}[\log\frac{p_{\theta}(y_{w}|x)^{\beta}}{p_{\theta}(y_{w}|x)^{\beta}+p_{\theta}(y_{l}|x)^{\beta}}],$

(11)

while the Expected Risk Minimization objective in a similar setting is

$\begin{aligned} &\mathcal{L}_{\mathrm{Risk}}(p_{\theta})=\\ &\mathbb{E}_{(x,y_{w},y_{l})\sim D}[\sum_{y\in\{y_{w},y_{l}\}}\mathrm{cost}(y,x)\frac{p_{\theta}(y|x)}{p_{\theta}(y_{w}|x)+p_{\theta}(y_{l}|x)}].\end{aligned}$

(12)

Thus, when the cost function is binary, i.e., $\mathrm{cost}(y_{w},x)=0$ and $\mathrm{cost}(y_{l},x)=1$ , and the scaling factor $\beta$ is set to 1, the only difference between Eq. 11 and Eq. 12 is the logarithm function in Eq. 11. This objective has also been investigated in recent work Xu et al. (2024) for fine-tuning LLMs for machine translation.
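For completeness, the rewriting of Eq. 10 as Eq. 11 follows from the definition of the sigmoid function. As a worked step, writing $a=p_{\theta}(y_{w}|x)$ and $b=p_{\theta}(y_{l}|x)$ (shorthand introduced here for readability):

```latex
\begin{aligned}
\sigma(\beta\log a-\beta\log b)
&=\frac{1}{1+\exp(-\beta\log a+\beta\log b)}\\
&=\frac{1}{1+b^{\beta}/a^{\beta}}\\
&=\frac{a^{\beta}}{a^{\beta}+b^{\beta}}.
\end{aligned}
```

Taking the logarithm and the negative expectation over $(x,y_{w},y_{l})\sim D$ gives Eq. 11.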

Furthermore, following the analytical framework in Rafailov et al. (2023), we can observe that optimizing $\mathcal{L}_{\mathrm{Prob}}(p_{\theta})$ is equivalent to optimizing the policy model under the following RL objective:

$\max_{p_{\theta}}\mathbb{E}_{x\sim D,y\sim p_{\theta}(y|x)}[r(x,y)]+\beta\mathrm{H}[p_{\theta}(y|x)]$ ,

(13)

where $\mathrm{H}$ is the entropy function:

$\mathrm{H}[p_{\theta}(y|x)]=\mathbb{E}_{y\sim p_{\theta}(y|x)}[-\log p_{\theta}(y|x)].$

(14)

The proof for the equivalence of Eq. 10 and Eq. 13 is in Appendix B, with further analyses showing that Eq. 10 shares similar properties with DPO.

We note that Eq. 13 resembles the Maximum Entropy (MaxEnt) RL setting Ziebart et al. (2008); Eysenbach and Levine (2021) – apart from optimizing for a higher reward, this objective discourages the model from becoming overly deterministic.
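The full proof is in Appendix B. As a brief sketch of the standard argument (not a reproduction of the appendix), consider a fixed input $x$ and maximize the entropy-regularized objective over distributions $p(\cdot|x)$ with a Lagrange multiplier $\lambda$ for normalization; $\lambda$ and $Z(x)$ are symbols introduced here:

```latex
\begin{aligned}
&\frac{\partial}{\partial p(y|x)}\Bigl[\sum_{y}p(y|x)\,r(x,y)-\beta\sum_{y}p(y|x)\log p(y|x)-\lambda\Bigl(\sum_{y}p(y|x)-1\Bigr)\Bigr]=0\\
&\Rightarrow\; r(x,y)-\beta\log p^{*}(y|x)-\beta-\lambda=0\\
&\Rightarrow\; p^{*}(y|x)=\frac{\exp(r(x,y)/\beta)}{Z(x)},\qquad Z(x)=\sum_{y}\exp(r(x,y)/\beta)\\
&\Rightarrow\; r(x,y)=\beta\log p^{*}(y|x)+\beta\log Z(x).
\end{aligned}
```

Substituting this reward into the BT model, the term $\beta\log Z(x)$ depends only on $x$ and cancels in the difference $r(x,y_{w})-r(x,y_{l})$ , leaving $\sigma(\beta\log p^{*}(y_{w}|x)-\beta\log p^{*}(y_{l}|x))$ , which has the form of Eq. 10.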

Likelihood function as the reward

Apart from the posterior probability, another candidate reward function is the likelihood function $p(x|y)$ . Specifically, by Bayes’ theorem, we have

$p(y|x)\propto p(x|y)p(y).$

(15)

That is, the posterior distribution $p(y|x)$ that the model aims to learn can be factorized into the likelihood function $p(x|y)$ and the prior distribution $p(y)$ . The prior distribution $p(y)$ is the learning objective in the language model pre-training. Meanwhile, under the task of instruction-following, $p(x|y)$ depicts how likely the instruction $x$ is related to the model response $y$ .

With the above factorization, we derive the learning objective with $p(x|y)$ as the reward function. Specifically, using Bayes’ theorem again, we have

$p(x|y)=\frac{p(y|x)p(x)}{p(y)}.$

(16)

To model $p(x|y)$ , we use the LLM under fine-tuning to parameterize $p(y|x)$ , i.e., $p_{\theta}(y|x)$ , and use a fixed LLM to parameterize $p(y)$ , which we label $p_{\mathrm{LM}}(y)$ (we use the SFT model in this work). We will show later that $p(x)$ does not need an explicit parameterization. We have

$p_{\theta}(x|y)=\frac{p_{\theta}(y|x)p(x)}{p_{\mathrm{LM}}(y)}.$

(17)

Using $p_{\theta}(x|y)$ as the reward function with a scaling factor $\beta$ , the objective under the BT model (Eq. 4) becomes

$\begin{aligned} &\mathcal{L}_{\mathrm{Likelihood}}(p_{\theta})\\ &=-\mathbb{E}_{(x,y_{w},y_{l})\sim D}[\log\sigma(\beta\log p_{\theta}(x|y_{w})-\beta\log p_{\theta}(x|y_{l}))]\\ &=-\mathbb{E}_{(x,y_{w},y_{l})\sim D}[\log\frac{p_{\theta}(x|y_{w})^{\beta}}{p_{\theta}(x|y_{w})^{\beta}+p_{\theta}(x|y_{l})^{\beta}}]\\ &=-\mathbb{E}_{(x,y_{w},y_{l})\sim D}[\log\frac{(\frac{p_{\theta}(y_{w}|x)}{p_{\mathrm{LM}}(y_{w})})^{\beta}}{(\frac{p_{\theta}(y_{w}|x)}{p_{\mathrm{LM}}(y_{w})})^{\beta}+(\frac{p_{\theta}(y_{l}|x)}{p_{\mathrm{LM}}(y_{l})})^{\beta}}].\end{aligned}$

(18)

Eq. 18 is equivalent to using $\frac{p_{\theta}(y|x)}{p_{\mathrm{LM}}(y)}$ as the reward function, which has an intuitive interpretation – the conditional-dependent quality score of $y$ is its conditional probability given $x$ normalized by its unconditional probability.

Furthermore, under the analytical framework of DPO, we note that Eq. 18 is equivalent to the KL-constrained RL setting (Eq. 1) with the prior probability $p_{\mathrm{LM}}(y)$ as the reference policy. That is,

$\max_{p_{\theta}}\mathbb{E}_{x\sim D,y\sim p_{\theta}(y|x)}[r(x,y)]-\beta\mathbb{D}_{\mathrm{KL}}[p_{\theta}(y|x)||p_{\mathrm{LM}}(y)].$

(19)
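The two alternative reward parameterizations of this section can be sketched side by side. This is a minimal illustration on precomputed sequence log-probabilities: the function names and numeric values are hypothetical, and how $\log p_{\mathrm{LM}}(y)$ is scored in practice is not specified here. The sketch also checks numerically that the $p(x)$ term in Eq. 17 cancels in the pairwise loss, which is why it needs no explicit parameterization.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def pairwise_loss(reward_w, reward_l):
    """BT negative log-likelihood (Eq. 4) for a single preference pair."""
    return -log_sigmoid(reward_w - reward_l)

def prob_reward(logp_theta, beta):
    """Eq. 9: beta * log p_theta(y|x); no reference model is involved."""
    return beta * logp_theta

def likelihood_reward(logp_theta, logp_lm, beta, log_px=0.0):
    """beta * log p_theta(x|y) via Eq. 17; log_px is shared by y_w and y_l."""
    return beta * (logp_theta + log_px - logp_lm)

# Hypothetical sequence log-probabilities for one pair (y_w, y_l).
theta_w, theta_l = -40.0, -38.0  # log p_theta(y|x)
lm_w, lm_l = -60.0, -45.0        # log p_LM(y), unconditional
beta = 0.01

loss_prob = pairwise_loss(prob_reward(theta_w, beta), prob_reward(theta_l, beta))
loss_lik = pairwise_loss(likelihood_reward(theta_w, lm_w, beta),
                         likelihood_reward(theta_l, lm_l, beta))
# Adding any value of log p(x) to both rewards leaves the loss unchanged.
loss_lik_px = pairwise_loss(likelihood_reward(theta_w, lm_w, beta, log_px=-25.0),
                            likelihood_reward(theta_l, lm_l, beta, log_px=-25.0))
print(loss_prob, loss_lik, loss_lik_px)
```

In this toy pair the posterior reward prefers $y_{l}$, while normalizing by the unconditional probability favors $y_{w}$, which illustrates how the choice of normalizer changes which output the objective rewards.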

5.2 Result Analysis

Method	Score	Accuracy	Length	$\boldsymbol{\beta}$
original (SFT)	7.57	0.435	205.0	-
DPO (Eq. 5)	16.25	0.605	359.4	0.01
probability (Eq. 10)	12.84	0.697	744.5	100.0
likelihood (Eq. 18)	13.63	0.602	389.3	0.01

Table 3: Performance comparison of different training objectives for fine-tuning mistral-7b. The best performance with the optimal $\beta$ is reported. The AlpacaEval2 score, the ranking accuracy of the model’s probability, and the output length are reported.

We use mistral-7b as the base model for fine-tuning to compare DPO with the other two reward parameterizations discussed above. Table 3 summarizes the performance of different algorithms with the optimal value of $\beta$ . We found that

(1) DPO outperforms other fine-tuning methods, and using the unconditional probability as the reference policy (Eq. 18) outperforms the reference-free training objective (Eq. 10).

(2) All fine-tuning methods improve the performance of the original SFT model.

(3) Using posterior probability as the reward function (Eq. 10) achieves the highest ranking accuracy based on the predicted probability and results in significantly longer outputs. This objective also results in a large optimal value of $\beta$ , likely because increasing the entropy helps prevent degeneration.

Case Study

To further understand the difference in these training objectives, we present a case study with the EOS token. Specifically, as discussed in §4.3, the model trained using DPO learns to assign significantly different probabilities to the EOS token compared with the SFT model. We found a similar pattern with the other two training algorithms compared in this section. Therefore, here we provide a further investigation.

Specifically, we calculate two statistics: (1) the average log-probability assigned to the EOS token; (2) the average difference in the log-probability assigned to the EOS token between positive ( $y_{w}$ ) and negative ( $y_{l}$ ) examples, illustrating how the model behaves differently in these two scenarios with the EOS token. Figure 3 presents these two statistics of models fine-tuned with different reward function parameterizations (averaged over different values of $\beta$ ). We note the difference between DPO and the fine-tuning method that uses posterior probability as the reward function: the former achieves a larger distinction between positive and negative examples in terms of the probability assignment to the EOS token, while the latter shows a smaller distinction, despite generally assigning more extreme probabilities to the EOS token, which leads to significantly longer outputs. This suggests that the reference policy in preference learning algorithms helps to stabilize the model behavior. We present a qualitative case study of the outputs generated with different training objectives in Appendix C.
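The two statistics can be sketched as follows. This is a minimal illustration that assumes the EOS log-probability at the end of each positive and negative example has already been extracted from the model; the function name and the input values are hypothetical, and averaging statistic (1) over both positive and negative examples is an assumption of this sketch.

```python
from statistics import mean


def eos_statistics(eos_logp_pos, eos_logp_neg):
    """Return (average EOS log-probability, average positive-minus-negative gap).

    eos_logp_pos[i] and eos_logp_neg[i] are the log-probabilities assigned to
    the EOS token at the end of y_w and y_l for the i-th preference pair.
    """
    # Statistic (1): average over all examples (assumed: positives and negatives pooled).
    avg_logp = mean(list(eos_logp_pos) + list(eos_logp_neg))
    # Statistic (2): average per-pair difference between y_w and y_l.
    avg_gap = mean(p - n for p, n in zip(eos_logp_pos, eos_logp_neg))
    return avg_logp, avg_gap


# Hypothetical values for two preference pairs.
avg_logp, avg_gap = eos_statistics([-1.0, -2.0], [-3.0, -6.0])
print(avg_logp, avg_gap)
```

A larger gap means the model separates positive and negative examples more clearly through the EOS token, while a very low average indicates a general reluctance to end the output.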

6 RQ3: Does DPO Benefit from Stronger Reference Policies?

In DPO, the reference policy is typically instantiated as the SFT model that is to be further fine-tuned (Rafailov et al., 2023). In the previous sections, we have demonstrated the importance of reference policies in DPO. We now investigate the next question: does DPO benefit from stronger reference policies?

6.1 Experimental Settings

We choose two LLMs as the stronger reference policies in DPO for the fine-tuning of mistral-7b and tulu-2-7b: (1) mistral-7b-instruct-v0.2 (Jiang et al., 2023; https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) and (2) llama-3-70b-instruct (https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct). We choose these two models because they have shown better instruction-following capabilities than the two base models we used in DPO fine-tuning, achieving length-controlled AlpacaEval2 scores of 20.55 and 34.40, respectively. We refer to them as mistral-v0.2 and llama3 for simplicity.

6.2 Result Analysis

Table 4 shows the performance of models after DPO fine-tuning with various values of $\beta$ with different reference models. For easier comparison, we also show the model performance with optimal values of $\beta$ using different reference models in Table 5. We note the following:

(1) A stronger reference policy in DPO fine-tuning can improve DPO’s effectiveness. Specifically, the model fine-tuned from mistral-7b performs best with mistral-v0.2 as the reference, and the model fine-tuned from tulu-2-7b performs best with llama3 as the reference.

(2) However, a stronger reference policy in DPO fine-tuning does not always lead to better performance. Notably, DPO fine-tuning of tulu-2-7b with mistral-v0.2 as the reference policy did not improve tulu-2-7b’s original performance. We provide further discussion about this in §6.3.

(3) The optimal KL constraint strength is larger with a stronger, suitable reference policy. In particular, as shown in Table 4, the optimal value of $\beta$ for mistral-7b and tulu-2-7b is 1.0 with their respective suitable reference policies. Conversely, when the SFT model is used as the reference policy, the optimal $\beta$ is 0.01 or 0.02, as shown in Table 1.

Fine-tuned model	mistral-7b	mistral-7b	tulu-2-7b	tulu-2-7b
$\beta$ / Reference	mistral-v0.2	llama3	mistral-v0.2	llama3
10.0	18.74	13.29	7.61	9.79
1.00	20.25	9.59	7.85	11.17
0.10	19.58	10.99	[degenerate]	10.31
0.01	17.18	15.37	[degenerate]	9.16
0.005	15.34	11.70	[degenerate]	3.29

Table 4: Model performance on AlpacaEval2 with varying values of $\beta$ , fine-tuned from mistral-7b and tulu-2-7b. Two reference models, mistral-v0.2 and llama3, are used.
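The selection of the optimal $\beta$ per setting can be sketched as a simple argmax over the rows of Table 4. The snippet below copies the scores from Table 4, encodes degenerate runs as `None`, and recovers the best $\beta$ and score for each pair of fine-tuned model and reference model; the data structure and function name are illustrative only.

```python
# AlpacaEval2 scores from Table 4, keyed by (fine-tuned model, reference model).
# None marks a degenerate run.
TABLE_4 = {
    ("mistral-7b", "mistral-v0.2"): {10.0: 18.74, 1.0: 20.25, 0.1: 19.58, 0.01: 17.18, 0.005: 15.34},
    ("mistral-7b", "llama3"): {10.0: 13.29, 1.0: 9.59, 0.1: 10.99, 0.01: 15.37, 0.005: 11.70},
    ("tulu-2-7b", "mistral-v0.2"): {10.0: 7.61, 1.0: 7.85, 0.1: None, 0.01: None, 0.005: None},
    ("tulu-2-7b", "llama3"): {10.0: 9.79, 1.0: 11.17, 0.1: 10.31, 0.01: 9.16, 0.005: 3.29},
}


def best_beta(scores):
    """Return (beta, score) with the highest score, skipping degenerate runs."""
    valid = {beta: score for beta, score in scores.items() if score is not None}
    beta = max(valid, key=valid.get)
    return beta, valid[beta]


for (model, reference), scores in TABLE_4.items():
    beta, score = best_beta(scores)
    print(f"{model} with {reference}: beta={beta}, score={score}")
```

For the two compatible pairs (mistral-7b with mistral-v0.2, and tulu-2-7b with llama3), the selected value is $\beta=1.0$, whereas mistral-7b with llama3 peaks at $\beta=0.01$.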

	original	self	mistral-v0.2	llama3
mistral-7b	7.57	16.25	20.25	15.37
tulu-2-7b	8.50	10.46	7.85	11.17

Table 5: Model performance on AlpacaEval2. The best performance across different values of $\beta$ with different reference models is compared, including the SFT model itself (self), mistral-v0.2, and llama3.

6.3 Discussion

Our investigation reveals that, while a stronger reference model can be helpful in DPO, this is not always the case. One possible explanation is that the reference model should be similar enough to the model under training to be compatible. This could explain why mistral-7b performs better with mistral-v0.2 as a reference, while tulu-2-7b performs better with llama3: the first pair are fine-tuned from the same base model. Meanwhile, tulu-2-7b is fine-tuned from llama-2-7b, which shares a similar model architecture with llama3 and likely uses similar pre-training corpora as well.

Regarding model similarity and the choice of the reference policy, we acknowledge the need for analytical tools and call for further research in this area. The KL divergence is a potential method, but estimating it is challenging due to the large sampling space, leading to high variance.
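The estimation difficulty can be illustrated with a toy Monte Carlo estimator of $\mathbb{D}_{\mathrm{KL}}[p||q]$ . The sketch below uses two small categorical distributions as stand-ins for model output distributions, so the exact value is available for comparison; for LLMs the output space is far too large to enumerate, and each sample requires a full generation. The distributions, sample sizes, and function names are all hypothetical.

```python
import math
import random


def exact_kl(p, q):
    """Exact KL(p || q) for two categorical distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def monte_carlo_kl(p, q, num_samples, rng):
    """Estimate KL(p || q) as the sample mean of log p(y) - log q(y) with y ~ p."""
    outcomes = list(range(len(p)))
    samples = rng.choices(outcomes, weights=p, k=num_samples)
    return sum(math.log(p[y] / q[y]) for y in samples) / num_samples


p = [0.5, 0.3, 0.2]  # toy stand-in for the model being fine-tuned
q = [0.2, 0.3, 0.5]  # toy stand-in for the reference model
rng = random.Random(0)

print("exact:", exact_kl(p, q))
for n in (10, 100, 20000):
    print(n, "samples:", monte_carlo_kl(p, q, n, rng))
```

With few samples the estimate fluctuates noticeably around the exact value, and this effect is expected to be more pronounced for sequence-level distributions whose log-ratios vary over a much wider range.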

We note an interesting aspect in the fine-tuning of tulu-2-7b using llama3 as the reference policy. Specifically, these models use different tokenizers, meaning that they do not share the same token-level action space. The fact that this combination yields further improvement suggests that action spaces do not always need to be aligned in DPO, despite related work proving that DPO can implicitly parameterize a dense reward function, which requires a consistent action space Rafailov et al. (2024).

7 Conclusion

In this work, we studied the effects of reference policies in DPO, which can stabilize the training while introducing a potential performance upper bound. We found that DPO’s performance is sensitive to the strength of constraints from the reference policy and demonstrated the benefits of using reference policies in preference learning by comparing DPO to related training objectives. We also showed that a stronger reference policy can enhance DPO performance, but only when it is compatible with the model being fine-tuned. Our findings highlight the confounding role of reference policies in DPO, providing empirical insights and emphasizing the need for more in-depth theoretical analysis. We call for future work to provide more theoretical and empirical guidelines for choosing the constraint strength from the reference policy, as well as for understanding the similarity and compatibility between the model to be trained and the reference model.

Limitations

Evaluations

We mainly use the AlpacaEval2 benchmark to evaluate the instruction-following capabilities of LLMs in this work, on which the LLM-based evaluation achieves a high correlation with human evaluations Dubois et al. (2024a). The test examples on this benchmark cover a wide range of instruction types Dubois et al. (2024b), which enhances the generalizability of the evaluation results. However, we acknowledge that the LLM-based automatic evaluation has limitations, such as low self-consistency rates in their predictions and a preference for the LLMs’ own outputs Liu et al. (2023); Wang et al. (2023); Panickssery et al. (2024). Therefore, while we did not perform human evaluations because the large number of analyses we conducted makes it cost-ineffective, we acknowledge its importance in terms of achieving more faithful evaluation results.

Model Scale

The LLMs we studied in this work primarily consist of around 7 billion parameters. Due to computing resource constraints, we did not investigate LLMs with more parameters. Therefore, whether our observations are applicable on a larger scale remains to be investigated.

Acknowledgements

We thank Hamish Ivison for helpful suggestions regarding the training details of DPO. We are grateful for the TPU compute support provided by the Google TRC program.

References

Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
Azar et al. (2024) Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. 2024. A general theoretical paradigm to understand learning from human preferences. In International Conference on Artificial Intelligence and Statistics, pages 4447–4455. PMLR.
Bradley and Terry (1952) Ralph Allan Bradley and Milton E Terry. 1952. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345.
Chen et al. (2024) Angelica Chen, Sadhika Malladi, Lily H Zhang, Xinyi Chen, Qiuyi Zhang, Rajesh Ranganath, and Kyunghyun Cho. 2024. Preference learning algorithms do not learn preference rankings. arXiv preprint arXiv:2405.19534.
Chiang et al. (2024) Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. 2024. Chatbot arena: An open platform for evaluating llms by human preference. arXiv preprint arXiv:2403.04132.
Cui et al. (2023) Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. 2023. Ultrafeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377.
Ding et al. (2023) Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. 2023. Enhancing chat language models by scaling high-quality instructional conversations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3029–3051, Singapore. Association for Computational Linguistics.
Dubois et al. (2024a) Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. 2024a. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475.
Dubois et al. (2024b) Yann Dubois, Chen Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy S Liang, and Tatsunori B Hashimoto. 2024b. Alpacafarm: A simulation framework for methods that learn from human feedback. Advances in Neural Information Processing Systems, 36.
Edunov et al. (2018) Sergey Edunov, Myle Ott, Michael Auli, David Grangier, and Marc’Aurelio Ranzato. 2018. Classical structured prediction losses for sequence to sequence learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 355–364, New Orleans, Louisiana. Association for Computational Linguistics.
Eysenbach and Levine (2021) Benjamin Eysenbach and Sergey Levine. 2021. Maximum entropy RL (provably) solves some robust RL problems. In International Conference on Learning Representations.
Gorbatovski et al. (2024) Alexey Gorbatovski, Boris Shaposhnikov, Alexey Malakhov, Nikita Surnachev, Yaroslav Aksenov, Ian Maksimov, Nikita Balagansky, and Daniil Gavrilov. 2024. Learn your reference model for real good alignment. arXiv preprint arXiv:2404.09656.
Hong et al. (2024) Jiwoo Hong, Noah Lee, and James Thorne. 2024. Reference-free monolithic preference optimization with odds ratio. arXiv preprint arXiv:2403.07691.
Ivison et al. (2023) Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A. Smith, Iz Beltagy, and Hannaneh Hajishirzi. 2023. Camels in a changing climate: Enhancing lm adaptation with tulu 2. Preprint, arXiv:2311.10702.
Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.
Li et al. (2023) Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval.
Liu et al. (2023) Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-eval: NLG evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore. Association for Computational Linguistics.
Liu et al. (2022) Yixin Liu, Pengfei Liu, Dragomir Radev, and Graham Neubig. 2022. BRIO: Bringing order to abstractive summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2890–2903, Dublin, Ireland. Association for Computational Linguistics.
Meng et al. (2024) Yu Meng, Mengzhou Xia, and Danqi Chen. 2024. SimPO: Simple preference optimization with a reference-free reward. arXiv preprint arXiv:2405.14734.
Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pages 27730–27744. Curran Associates, Inc.
Panickssery et al. (2024) Arjun Panickssery, Samuel R Bowman, and Shi Feng. 2024. LLM evaluators recognize and favor their own generations. arXiv preprint arXiv:2404.13076.
Rafailov et al. (2024) Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea Finn. 2024. From $r$ to $q^{*}$ : Your language model is secretly a Q-function. arXiv preprint arXiv:2404.12358.
Rafailov et al. (2023) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems.
Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Stiennon et al. (2020) Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback. In Advances in Neural Information Processing Systems, volume 33, pages 3008–3021. Curran Associates, Inc.
Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Tunstall et al. (2023) Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, et al. 2023. Zephyr: Direct distillation of LM alignment. arXiv preprint arXiv:2310.16944.
Wang et al. (2023) Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. 2023. Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926.
Xu et al. (2024) Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, and Young Jin Kim. 2024. Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation. arXiv preprint arXiv:2401.08417.
Xu et al. (2023) Jing Xu, Andrew Lee, Sainbayar Sukhbaatar, and Jason Weston. 2023. Some things are more cringe than others: Preference optimization with the pairwise cringe loss. arXiv preprint arXiv:2312.16682.
Yuan et al. (2023) Hongyi Yuan, Zheng Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. 2023. RRHF: Rank responses to align language models with human feedback. In Thirty-seventh Conference on Neural Information Processing Systems.
Zhao et al. (2023a) Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J Liu. 2023a. SLiC-HF: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425.
Zhao et al. (2023b) Yao Zhao, Mikhail Khalman, Rishabh Joshi, Shashi Narayan, Mohammad Saleh, and Peter J Liu. 2023b. Calibrating sequence likelihood improves conditional language generation. In The Eleventh International Conference on Learning Representations.
Ziebart et al. (2008) Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. 2008. Maximum entropy inverse reinforcement learning. In Proc. AAAI, pages 1433–1438.

Appendix A Additional Experimental Details

A.1 Datasets Details

Here we outline additional details regarding the datasets we used (§3). Following the setup of Zephyr (Tunstall et al., 2023), we use the UltraFeedback Cui et al. (2023) dataset, specifically its binarized version (available at https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized), for DPO fine-tuning. The original UltraFeedback contains 64K data examples, each consisting of one user instruction and 4 candidate outputs generated by different LLMs, which are scored by GPT-4. Tunstall et al. (2023) binarized UltraFeedback to adapt the data format for DPO fine-tuning, by constructing a positive-negative example pair out of the 4 candidate outputs. Specifically, the output with the highest GPT-4 score is selected as the positive example $y_{w}$ used in DPO fine-tuning (Eq. 5), while the negative example $y_{l}$ is randomly sampled from the remaining 3 candidate outputs. We note that since UltraFeedback contains LLM-generated, GPT-4 graded outputs, the instruction fine-tuning conducted on UltraFeedback is a distillation setting from more capable LLMs, instead of learning directly from human feedback. UltraFeedback is released under the MIT license.
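The binarization step described above can be sketched as follows. This is an illustrative reimplementation under simple assumptions (candidates given as text-score tuples, ties broken by taking the first highest-scoring candidate); it is not the code used by Tunstall et al. (2023), and the function name and example data are hypothetical.

```python
import random


def binarize(candidates, rng):
    """Build one (chosen, rejected) pair from scored candidate outputs.

    candidates: list of (output_text, score) tuples, e.g. 4 GPT-4-scored outputs.
    The highest-scoring output becomes the positive example y_w; the negative
    example y_l is sampled uniformly from the remaining outputs.
    """
    best_index = max(range(len(candidates)), key=lambda i: candidates[i][1])
    chosen = candidates[best_index][0]
    remaining = [text for i, (text, _) in enumerate(candidates) if i != best_index]
    rejected = rng.choice(remaining)
    return chosen, rejected


# Hypothetical candidates for a single instruction.
candidates = [("output a", 6.5), ("output b", 9.0), ("output c", 7.0), ("output d", 4.0)]
chosen, rejected = binarize(candidates, random.Random(0))
print(chosen, "|", rejected)
```

Because the negative example is sampled at random rather than chosen as the lowest-scoring output, the score margin between $y_{w}$ and $y_{l}$ varies across training pairs.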

A.2 Additional Training Details

Here we provide additional training details for §3. We use a linear learning rate scheduler during model training, with 10% of the steps allocated for learning rate warmup and the remainder for linearly decreasing the rate. We perform checkpoint selection based on the model validation loss on the UltraFeedback validation set, with a checkpointing interval of 500 steps. We conduct model training on 8 NVIDIA RTX 6000 Ada cards with 48GB memory each. The 3-epoch training takes around 9 hours to finish.
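The learning rate schedule can be sketched as a piecewise-linear function of the training step. The snippet below is a minimal illustration; the peak learning rate and total step count are placeholder values, since they are not specified in this appendix, and the function name is hypothetical.

```python
def linear_schedule(step, total_steps, peak_lr, warmup_ratio=0.1):
    """Linear warmup for the first warmup_ratio of steps, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: increase linearly from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: decrease linearly from peak_lr to 0 at total_steps.
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


# Placeholder values for illustration only.
total_steps, peak_lr = 1000, 5e-7
for step in (0, 50, 100, 550, 1000):
    print(step, linear_schedule(step, total_steps, peak_lr))
```

In this sketch the rate reaches its peak exactly at the end of warmup (step 100 of 1000) and returns to zero at the final step.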

Appendix B Extending DPO: Posterior Probability as Reward Function

Following the analytical framework in Rafailov et al. (2023), we now prove the equivalence of Eq. 10 and Eq. 13 as previously discussed in §5.1.

B.1 Optimum of RL with Maximum Entropy

We will first derive the optimal policy, $p^{*}$ , for the RL with Maximum Entropy setting (Eq. 13):

$\max_{p}\mathbb{E}_{x\sim D,y\sim p(y|x)}[r(x,y)]+\beta\mathrm{H}[p(y|x)]$ .

(20)

Given any reward function $r$ , we have

$\begin{aligned} &\max_{p}\mathbb{E}_{x\sim D,y\sim p(y|x)}[r(x,y)]+\beta\mathrm{H}[p(y|x)]\\ =&\max_{p}\mathbb{E}_{x\sim D,y\sim p(y|x)}[r(x,y)-\beta\log p(y|x)]\\ =&\min_{p}\mathbb{E}_{x\sim D,y\sim p(y|x)}[\log p(y|x)-\frac{1}{\beta}r(x,y)]\\ =&\min_{p}\mathbb{E}_{x\sim D,y\sim p(y|x)}[\log\frac{p(y|x)}{\frac{1}{Z(x)}\exp(\frac{1}{\beta}r(x,y))}-\log Z(x)],\end{aligned}$

(21)

where $Z(x)$ is the partition function:

$Z(x)=\sum_{y}\exp(\frac{1}{\beta}r(x,y))$ .

(22)

As $Z(x)$ is only a function of $x$ and does not depend on $p$ , we can define the following probability distribution:

$p^{*}(y|x)=\frac{1}{Z(x)}\exp(\frac{1}{\beta}r(x,y))$ .

(23)

Since $Z(x)$ is not a function of $y$ , we can re-organize the final objective of Eq. 21 as:

$\begin{aligned} &\min_{p}\mathbb{E}_{x\sim D}[\mathbb{E}_{y\sim p(y|x)}[\log\frac{p(y|x)}{p^{*}(y|x)}]-\log Z(x)]\\ =&\min_{p}\mathbb{E}_{x\sim D}[\mathbb{D}_{\mathrm{KL}}[p(y|x)||p^{*}(y|x)]-\log Z(x)]\end{aligned}$

(24)

Since $Z(x)$ does not depend on $p(y|x)$ , the final objective of Eq. 24 is equivalent to

$\min_{p}\mathbb{E}_{x\sim D}[\mathbb{D}_{\mathrm{KL}}[p(y|x)||p^{*}(y|x)]]$

(25)

By Gibbs’ inequality, the KL-divergence achieves the minimum of 0 if and only if the two distributions are identical. Therefore, we have the optimal solution of Eq. 25 (and Eq. 13) as

$p(y|x)=p^{*}(y|x)=\frac{1}{Z(x)}\exp(\frac{1}{\beta}r(x,y)).$

(26)
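The result in Eq. 26 can be checked numerically on a toy problem. The sketch below fixes a single input $x$ with three candidate outputs, computes the softmax policy $p^{*}$ , and verifies that its objective value equals $\beta\log Z(x)$ and is not exceeded by randomly drawn distributions. The reward values, $\beta$ , and function names are hypothetical.

```python
import math
import random


def maxent_objective(p, rewards, beta):
    """E_{y~p}[r(x, y)] + beta * H[p] for a single input x."""
    expected_reward = sum(pi * ri for pi, ri in zip(p, rewards))
    entropy = -sum(pi * math.log(pi) for pi in p if pi > 0)
    return expected_reward + beta * entropy


def log_partition(rewards, beta):
    """log Z(x) = log sum_y exp(r(x, y) / beta), computed stably."""
    m = max(rewards)
    return m / beta + math.log(sum(math.exp((r - m) / beta) for r in rewards))


def optimal_policy(rewards, beta):
    """p*(y|x) = exp(r(x, y) / beta) / Z(x), as in Eq. 26."""
    log_z = log_partition(rewards, beta)
    return [math.exp(r / beta - log_z) for r in rewards]


rewards, beta = [1.0, 0.5, -0.2], 0.5  # toy values
p_star = optimal_policy(rewards, beta)
best = maxent_objective(p_star, rewards, beta)
print("objective at p*:", best)
print("beta * log Z   :", beta * log_partition(rewards, beta))

rng = random.Random(0)
for _ in range(3):
    raw = [rng.random() + 1e-9 for _ in rewards]
    q = [v / sum(raw) for v in raw]
    print("random policy  :", maxent_objective(q, rewards, beta))
```

The agreement between the objective at $p^{*}$ and $\beta\log Z(x)$ follows from substituting Eq. 26 into the objective, where the reward terms cancel.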

B.2 Extending DPO Objective for RL with Maximum Entropy

We can now express the reward function $r(x,y)$ under the RL with maximum entropy using the optimal policy $p^{*}(y|x)$ by rearranging Eq. 26:

$r^{*}(x,y)=\beta\log p^{*}(y|x)+\beta\log Z(x)$

(27)

Therefore, we know that the optimal policy $p^{*}$ under the BT model (Eq. 3) satisfies the following preference model:

$\begin{aligned} &p^{*}(y_{1}\succ y_{2}|x)\\ =&\frac{\exp(\beta\log p^{*}(y_{1}|x)+\beta\log Z(x))}{\exp(\beta\log p^{*}(y_{1}|x)+\beta\log Z(x))+\exp(\beta\log p^{*}(y_{2}|x)+\beta\log Z(x))}\\ =&\frac{1}{1+\exp(\beta\log p^{*}(y_{2}|x)-\beta\log p^{*}(y_{1}|x))}\\ =&\sigma(\beta\log p^{*}(y_{1}|x)-\beta\log p^{*}(y_{2}|x)).\end{aligned}$

(28)

We can now derive the maximum likelihood estimation objective based on Eq. 28 for the model $p_{\theta}$ :

$\begin{aligned} \min_{p_{\theta}}&-\mathbb{E}_{(x,y_{w},y_{l})\sim D}[\log\sigma(\beta\log p_{\theta}(y_{w}|x)-\beta\log p_{\theta}(y_{l}|x))]\end{aligned}$

(29)

This is exactly the objective function we defined in §5.1 that uses the posterior probability as the reward function (Eq. 10).

B.3 Further Theoretical Analysis

We continue to follow the analytical framework of DPO to discuss the theoretical properties of the training objective defined in Eq. 29 (and Eq. 10). Specifically, as we discussed in §5.1, the training objective of Eq. 29 is equivalent to a BT model with the reparameterization:

$r_{\theta}(x,y)=\beta\log p_{\theta}(y|x).$

(30)

Following the proof framework in Rafailov et al. (2023), we will now prove:

all reward classes under the BT model can be represented with the reparameterization $r(x,y)=\beta\log p(y|x)$ for some model $p(y|x)$ .

Preparation

Our proof uses the definition of the equivalence class of reward functions given by Rafailov et al. (2023):

“Two reward functions $r(x,y)$ and $r^{\prime}(x,y)$ are equivalent iff $r(x,y)-r^{\prime}(x,y)=f(x)$ for some function $f$ .”

We also note a lemma given by Rafailov et al. (2023).

“Under the Bradley-Terry preference framework, two reward functions from the same class induce the same preference distribution.”

We now provide another lemma as an extension of a similar lemma in Rafailov et al. (2023):

Two reward functions from the same equivalence class induce the same optimal policy under the RL with Maximum Entropy setting.

Proof. Given two reward functions from the same class, such that $r^{\prime}(x,y)=r(x,y)+f(x)$ , from Eq. 26 we know that $r^{\prime}$ induces an optimal policy $p_{r^{\prime}}$ :

$\begin{aligned} p_{r^{\prime}}(y|x)&=\frac{1}{\sum_{y}\exp(\frac{1}{\beta}r^{\prime}(x,y))}\exp(\frac{1}{\beta}r^{\prime}(x,y))\\ &=\frac{1}{\sum_{y}\exp(\frac{1}{\beta}(r(x,y)+f(x)))}\exp(\frac{1}{\beta}(r(x,y)+f(x)))\\ &=\frac{1}{\exp(\frac{1}{\beta}f(x))\sum_{y}\exp(\frac{1}{\beta}r(x,y))}\exp(\frac{1}{\beta}r(x,y))\exp(\frac{1}{\beta}f(x))\\ &=\frac{1}{\sum_{y}\exp(\frac{1}{\beta}r(x,y))}\exp(\frac{1}{\beta}r(x,y))\\ &=p_{r}(y|x).\end{aligned}$

(31)

Proof end. As Rafailov et al. (2023) suggested, this lemma indicates that for a certain reward equivalence class, any arbitrary reward function within the class will induce the same optimal policy.
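The two lemmas can be illustrated numerically for a single input $x$ . The sketch below shifts a toy reward vector by a constant, which plays the role of $f(x)$ , and checks that both the maximum-entropy optimal policy and the Bradley-Terry preference probability are unchanged up to floating-point error. All values and function names are hypothetical.

```python
import math


def maxent_policy(rewards, beta):
    """Optimal policy of Eq. 26 for one input x: softmax of r(x, y) / beta."""
    m = max(rewards)
    weights = [math.exp((r - m) / beta) for r in rewards]
    z = sum(weights)
    return [w / z for w in weights]


def bt_preference(r1, r2):
    """Bradley-Terry probability that y1 is preferred over y2."""
    return 1.0 / (1.0 + math.exp(r2 - r1))


beta = 0.5
rewards = [1.0, 0.2, -0.7]               # r(x, y) for three candidate outputs
shift = 3.0                              # f(x): constant across y for a fixed x
shifted = [r + shift for r in rewards]   # r'(x, y) = r(x, y) + f(x)

policy = maxent_policy(rewards, beta)
policy_shifted = maxent_policy(shifted, beta)
print(policy)
print(policy_shifted)
print(bt_preference(rewards[0], rewards[1]), bt_preference(shifted[0], shifted[1]))
```

The shift cancels in both cases because the policy depends on rewards only through ratios of exponentials, and the preference probability only through reward differences.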

Proof

For any reward function $r(x,y)$ , from Appendix B.1, we know it induces the following optimal policy according to Eq. 26:

$p_{r}(y|x)=\frac{1}{Z(x)}\exp(\frac{1}{\beta}r(x,y)).$

(32)

Therefore, we have

$r(x,y)=\beta\log p_{r}(y|x)+\beta\log Z(x),$

(33)

where $Z(x)=\sum_{y}\exp(\frac{1}{\beta}r(x,y))$ . We now define a projection $f$ :

$f(r;\beta)=r(x,y)-\beta\log\sum_{y}\exp{(\frac{1}{\beta}r(x,y))},$

(34)

which projects $r$ into a new reward function:

$\begin{aligned} r^{\prime}(x,y)&=f(r;\beta)(x,y)\\ &=r(x,y)-\beta\log\sum_{y}\exp{(\frac{1}{\beta}r(x,y))}\\ &=\beta\log p_{r}(y|x).\end{aligned}$

(35)

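We can observe that $r^{\prime}$ is within the equivalence class of $r$ , since the difference $r(x,y)-r^{\prime}(x,y)=\beta\log Z(x)$ depends only on $x$ . Because $r^{\prime}(x,y)=\beta\log p_{r}(y|x)$ , every reward equivalence class contains a member of the form $\beta\log p(y|x)$ , which completes the proof.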
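The projection in Eq. 34 can also be checked on a toy example. The sketch below applies it to an arbitrary reward vector for a single input $x$ and verifies that the projected reward differs from the original by a constant, and that $\exp(r^{\prime}(x,y)/\beta)$ sums to one over $y$ , so that $r^{\prime}$ is $\beta$ times the log of a valid probability distribution. The values and function names are hypothetical.

```python
import math


def log_partition(rewards, beta):
    """log Z(x) = log sum_y exp(r(x, y) / beta), computed stably."""
    m = max(rewards)
    return m / beta + math.log(sum(math.exp((r - m) / beta) for r in rewards))


def project(rewards, beta):
    """Eq. 34 for one input x: r'(x, y) = r(x, y) - beta * log Z(x)."""
    log_z = log_partition(rewards, beta)
    return [r - beta * log_z for r in rewards]


beta = 0.5
rewards = [1.0, 0.2, -0.7]  # toy reward values r(x, y)
projected = project(rewards, beta)

# exp(r'/beta) should be a normalized distribution, namely p_r(y|x).
implied_policy = [math.exp(r / beta) for r in projected]
# r - r' should be the same for every y, i.e. a function of x only.
differences = [r - rp for r, rp in zip(rewards, projected)]

print(implied_policy, sum(implied_policy))
print(differences)
```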

Appendix C Output Examples from Different Training Objectives

We show a specific output example for each of the training objectives investigated in §5. Compared with the output generated from the DPO objective in Table 6, the posterior probability objective (Eq. 10) produces longer outputs with more complex structures that appear overly verbose in Table 7. Conversely, the likelihood function objective (Eq. 18) results in less detailed outputs, as shown in Table 8.

Instruction

I like to host guests at my home from time to time, and I am gathering recipes of different dishes and drinks to keep things interesting. I am interested in trying some Somali dishes. Can you give me a recipe for Canjeero?

Output

Certainly! Here’s a traditional recipe for Canjeero (also known as Cannoli or Xaaj maanta in Somali), a popular Somali flatbread made with teff flour or a mix of teff and wheat flour.

Ingredients:
- 2 cups teff flour (you can find this at specialty stores or online)
- 1 cup wheat flour (or more teff flour if you prefer a stronger teff flavor)
- 2 tsp salt
- 1 package or 2 1/4 tsp (7g) active dry yeast
- 3 cups warm water (around 110°F)

Instructions:

1. In a large mixing bowl, combine the teff flour, wheat flour, and salt. Mix well.

2. In a separate bowl, dissolve the yeast in the warm water. Let it sit for 5 minutes until it becomes frothy.

3. Pour the yeast mixture into the flour mixture. Mix well until a smooth, sticky dough forms.

4. Let the dough rest in a warm place for 1-2 hours until it doubles in size.

5. When the dough has risen, divide it into six equal portions. Shape each portion into a ball.

6. Preheat a flat griddle or a non-stick pan over medium heat.

7. Take one ball of the dough and flatten it into a circle, about 1/4 inch thick.

8. Place the flattened dough on the hot griddle or pan. Cook for 2-3 minutes until the bottom is golden brown. Flip the Canjeero and cook for another 1-2 minutes until the other side is also golden brown.

9. Remove the Canjeero from the griddle or pan and keep it warm and moist by covering it with a clean towel or aluminum foil.

10. Repeat steps 7-9 with the remaining dough portions.

Canjeero is often served with a stew or soup. Enjoy!

Table 6: Output example with the DPO objective (Eq. 5).

Instruction

Output

Here’s a traditional and delicious Somali recipe for Canjeero (also known as Canjelo or Canjif) - a popular flatbread that’s often enjoyed for breakfast or as a snack. This dish is versatile and can be customized with various toppings to suit your preferences. Here’s how to make it:

Ingredients:
1. 2 cups all-purpose or bread flour
2. 1/2 cup fine semolina (sooji or rava)
3. 1 tsp salt
4. 1 1/2 cups water (or as needed)
5. 2 tbsp vegetable oil (or ghee/clarified butter)
6. Toppings of your choice:
    a. Yogurt
    b. Fresh or dried fruits (bananas, mangoes, dates, raisins)
    c. Honey or sugar
    d. Cardamom powder
    e. Peanuts or other nuts
    f. Qaat (a popular Somali leaf) - optional and not commonly found outside of Somalia

Instructions:

1. In a large mixing bowl, combine the all-purpose or bread flour, semolina, and salt. Mix well to combine.

2. Gradually pour in the water while stirring with a wooden or plastic spoon. Be cautious while adding water, as the dough can become too sticky if excess water is added. Add water until the dough comes together as a smooth ball.

3. Knead the dough for 3-4 minutes until it becomes soft and elastic. Cover the bowl with a damp cloth or plastic wrap and let the dough rest for 10-15 minutes. This allows the gluten to develop, resulting in a fluffier bread.

4. Heat a non-stick or cast-iron skillet on medium-high heat. Add the vegetable oil (or ghee/clarified butter) and let it heat up.

5. Divide the dough into equal-sized balls (usually about 6-8 pieces). Flatten each ball into a disc using your palms. Make sure the disc is not too thin, as this will result in a crispier bread.

6. Once the skillet is hot, carefully place the flattened discs into the skillet. Do not overcrowd the skillet; cook the bread in batches if necessary. Cook the bread for 1-2 minutes on each side, or until golden brown spots appear. Press down lightly on the bread with a spatula to fluff it up.

7. Remove the Canjeero from the skillet and place it on a plate. Repeat the process with the remaining dough balls.

8. Serve the hot Canjeero with your choice of toppings. Here are some popular combinations:
    a. Yogurt and honey or sugar
    b. Fresh fruit and cardamom powder
    c. Qaat and peanuts
    d. Peanut butter and banana
    e. Honey and sliced almonds
    f. Cinnamon and raisins

Note: Canjeero is a versatile dish that can be enjoyed any time of day. It’s often served with tea or coffee as a snack or as part of a larger meal. You can also experiment with different types of flour, such as whole-wheat or spelt, to create unique flavors.

Enjoy your homemade Canjeero! This dish is a staple in Somali cuisine and is a great way to explore the rich flavors and traditions of this vibrant culture. If you have any questions or need further assistance, please don’t hesitate to ask. Happy cooking!

Table 7: Output example with the posterior probability objective (Eq. 10).

Instruction

Output

Certainly! Here’s a simple recipe for Canjeero (Somali flatbread) that serves 4-6 people:

Ingredients:
- 2 cups (240 grams) of Somali flour (also known as “soor” or “sorghum flour”)
- 2 cups (480 ml) of water
- 1/2 teaspoon of salt

Instructions:

1. In a large mixing bowl, combine the somali flour and salt.

2. Gradually add the water to the flour mixture while stirring with a wooden spoon or spatula until the mixture is smooth, free of lumps, and has a thick consistency.

3. Cover the bowl with a clean kitchen towel or plastic wrap and let it rest at room temperature for at least 1 hour (or up to 24 hours) to allow the flour to hydrate and the gluten to develop.

4. When ready to make the canjeero, preheat a non-stick pan or griddle over medium heat.

5. Using a measuring cup or ladle, pour about 1/3 to 1/2 cup of batter into the center of the pan.

6. Immediately, use the back of the ladle or a spatula to gently spread the batter into a thin, round shape (about 6-8 inches in diameter).

7. Let the canjeero cook for 1-2 minutes or until the bottom is golden brown and the top is beginning to set.

8. Flip the canjeero over and let it cook for another 30 seconds to 1 minute on the other side, or until golden brown.

9. Repeat with the remaining batter, using more oil or butter to lubricate the pan as needed.

10. Serve the canjeero warm with a side of ghee (clarified butter) or other Somali stews or sauces like “Wat” or “Xalwo.”

Enjoy your traditional Somali Canjeero!

Table 8: Output example with the likelihood function objective (Eq. 18).