Understanding Reference Policies in Direct Preference Optimization

Yixin Liu1  Pengfei Liu2  Arman Cohan1,3
1Yale University 2Shanghai Jiao Tong University  3Allen Institute for AI
[email protected], [email protected], [email protected]
Abstract

Direct Preference Optimization (DPO) has become a widely used training method for the instruction fine-tuning of large language models (LLMs). In this work, we explore an under-investigated aspect of DPO – its dependency on the reference model or policy. Such reference policies, typically instantiated as the model to be further fine-tuned, are important since they can impose an upper limit on DPO’s effectiveness. Therefore, we address three related research questions in this work. First, we explore the optimal strength of the KL-divergence constraint in DPO, which penalizes deviations from the reference policy, and find that DPO is sensitive to this strength. Next, we examine the necessity of reference policies for instruction fine-tuning by providing both theoretical and empirical comparisons between DPO and related learning objectives, demonstrating DPO’s superiority. Additionally, we investigate whether DPO benefits from stronger reference policies, finding that a stronger reference policy can lead to improved performance, but only when it is similar to the model being fine-tuned. Our findings highlight the confounding role of reference policies in DPO and offer insights for best practices, while also identifying open research questions for future studies.


1 Introduction

Recently, alternative training algorithms have been proposed to address the limitations of the standard supervised fine-tuning (SFT) algorithm of large language models (LLMs). Specifically, unlike the SFT training objective, which relies on a single gold reference, advanced training algorithms such as Direct Preference Optimization (DPO) (Rafailov et al., 2023) and contrastive learning (Yuan et al., 2023; Zhao et al., 2023b) require the LLMs to learn to differentiate the quality of various candidate outputs and to generate outputs that are more likely to be preferred, i.e., preference learning (Ouyang et al., 2022). Among such methods, DPO is especially intriguing because of its connection with reinforcement learning (RL). Specifically, Rafailov et al. (2023) show that an LLM trained under the KL-constrained RL setting implicitly learns a reward model, which is used in the DPO training objective: $r_{\theta}(x,y)=\beta\log\frac{p_{\theta}(y|x)}{p_{\mathrm{ref}}(y|x)}$, where $x$ and $y$ represent the input and output texts, $p_{\theta}$ and $p_{\mathrm{ref}}$ are the distributions parameterized by the LLM being fine-tuned and the reference LLM respectively, and $\beta$ controls the strength of the KL-divergence regularization applied from the reference LLM.

This reward function parameterization has interesting implications. Specifically, there can exist a discrepancy between the prediction of $r_{\theta}(x,y)$, the optimization target, and $p_{\theta}(y|x)$, the learned distribution parameterized by the fine-tuned LLM. Consequently, given an input $x$ and a pair of outputs $y_w$ and $y_l$ between which $y_w$ has better quality, even if the reward model correctly predicts that $r_{\theta}(x,y_w)>r_{\theta}(x,y_l)$, this does not guarantee that $p_{\theta}(y_w|x)>p_{\theta}(y_l|x)$, i.e., that the fine-tuned model learns to assign a higher probability to the better output (Chen et al., 2024). We verified this observation in a preliminary study, where we found that the learned reward model $r_{\theta}$ has an accuracy of 76.3% while the learned distribution $p_{\theta}$ only achieves an accuracy of 46.2% in comparing human-annotated output pairs, using the DPO-tuned and reference LLMs in Tulu 2 (Ivison et al., 2023). (Complete results are in §4.2.)
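One way to see why the two rankings can disagree is to expand the reward margin using the definition of $r_{\theta}$ given above. The following is a short worked step under that definition; the numeric values in the comments are illustrative and are not taken from the experiments.

```latex
% Reward margin between the preferred output y_w and the dispreferred output y_l
r_{\theta}(x,y_w)-r_{\theta}(x,y_l)
  = \beta\left[\log\frac{p_{\theta}(y_w|x)}{p_{\theta}(y_l|x)}
  - \log\frac{p_{\mathrm{ref}}(y_w|x)}{p_{\mathrm{ref}}(y_l|x)}\right]
% Illustrative values: if the reference log-ratio is -5 and the policy log-ratio is -2,
% the margin is 3\beta > 0 even though p_{\theta}(y_w|x) < p_{\theta}(y_l|x).
```

For $\beta>0$, the reward therefore ranks the pair correctly whenever the fine-tuned model favors $y_w$ more strongly than the reference model does, regardless of which output the fine-tuned model itself assigns the higher probability.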

The discrepancy noted above highlights an under-explored aspect of DPO: its dependence on the reference model/policy $p_{\mathrm{ref}}$. (We will interchangeably use the terms “model” and “policy” to reflect the contexts of both LLM fine-tuning and RL.) The reference model in practice is usually instantiated by the SFT model, which is also the starting point for DPO fine-tuning (Rafailov et al., 2023). As a result, the fine-tuned model cannot arbitrarily deviate from, and thereby improve upon, the reference (SFT) model because the KL-divergence constraint penalizes such deviations, leading to the discrepancy in the ranking performance between $r_{\theta}$ and $p_{\theta}$. It thus seems that removing or reducing such constraints might be beneficial. Indeed, recent work (Gorbatovski et al., 2024) finds that updating the reference policy helps improve DPO’s performance. However, eliminating this dependence entirely can lead to performance degradation because of model degeneration (Rafailov et al., 2023). These findings suggest a complicated relationship between the reference policy and the DPO performance.

Therefore, we aim to further understand the effects and the role of the reference policy in DPO. To this end, we choose the task of instruction fine-tuning of LLMs Ouyang et al. (2022) as the test bed and two series of open-source pre-trained LLMs, Tulu 2 and Mistral Jiang et al. (2023), to analyze the reference policy in DPO on the AlpacaEval benchmark Li et al. (2023). Specifically, we investigate three main research questions (RQs).

First, assuming that reference policies are necessary for instruction fine-tuning, we explore RQ1: what the optimal strength of the regularization from the reference policy should be (§4). We find that a smaller constraint typically leads to better performance, until the constraint becomes too small, at which point performance begins to degrade. We then conduct an in-depth analysis of this phenomenon, examining it from both a sequence-level perspective, focusing on the ranking accuracy of learned models, and a token-level perspective, analyzing the fine-grained differences between the fine-tuned and reference models. This analysis reveals interesting findings. For example, after DPO fine-tuning, the predicted probability of the end-of-sequence token decreases by a factor of more than 10,000 on average, with smaller constraints leading to greater differences in probability assignments.

Taking a step back, we also investigate RQ2: whether the reference policy is necessary for preference learning (§5). To this end, we examine two related reward parameterizations for instruction fine-tuning, specifically using either the posterior probability $p_{\theta}(y|x)$ or the likelihood function $p_{\theta}(x|y)$ as the reward function, serving as an alternative to the DPO parameterization $\frac{p_{\theta}(y|x)}{p_{\mathrm{ref}}(y|x)}$. We discuss their connections with RL under the analytical framework proposed in DPO, and provide empirical comparisons. We find that while these alternative approaches can also improve the original SFT model performance, they cannot outperform DPO. The experimental results also reveal that the reference policy helps stabilize model behavior: without it, the model can learn to generate outputs twice as long, potentially overfitting the human annotators’ preference for more verbose outputs.

Given the importance of the reference policy in the DPO objective, the third question (RQ3) we study is whether DPO benefits from a stronger reference policy (§6). In detail, rather than following the standard practice of using the SFT model to be fine-tuned as the reference model, we employ two more capable LLMs as the reference model. Our experiments reveal an interesting finding: stronger reference models can offer more benefits than the SFT model, but they only improve performance when they are compatible with the model being fine-tuned. This finding raises questions about the compatibility of the reference model with the model to be fine-tuned in DPO, suggesting a need for future work to analyze and measure their intrinsic similarity.

The experiments and analyses in this work emphasize the confounding effect of the reference policy in DPO. Our findings shed light on the best practices for DPO: (1) when using the SFT model as the reference policy, a smaller yet sufficiently strong constraint typically improves performance; (2) a stronger reference policy can improve performance but must be compatible with the model to be fine-tuned. Meanwhile, this work also highlights open research questions regarding the necessity of the reference policy and calls for more theoretical analysis of the relationship between the reference policy and the DPO training performance and dynamics. (The training scripts and fine-tuned checkpoints are released at https://github.com/yale-nlp/refdpo.)

2 Preliminaries

2.1 Instruction Fine-tuning

Instruction fine-tuning aims to align the pre-trained language models with the users’ intentions (Ouyang et al., 2022). The frequently used training methods of instruction fine-tuning include standard supervised fine-tuning (SFT), reinforcement learning (RL) methods (Stiennon et al., 2020; Ouyang et al., 2022) such as Proximal Policy Optimization (Schulman et al., 2017), and reward-based training methods (Rafailov et al., 2023; Yuan et al., 2023; Zhao et al., 2023a) such as Direct Preference Optimization (Rafailov et al., 2023). Both RL and reward-based training methods offer advantages over standard SFT by leveraging human feedback, typically in the form of human-annotated rewards assigned to various candidates, enabling more effective alignment of LLMs with human preferences. Therefore, they are widely adopted in the instruction fine-tuning of large language models (LLMs) such as GPT-4 (Achiam et al., 2023), Gemini (Team et al., 2023), and Llama-3 (technical report: https://ai.meta.com/blog/meta-llama-3/).

2.2 Direct Preference Optimization

In Rafailov et al. (2023), DPO is proposed as an alternative to RL for training LLMs with human feedback. The DPO objective is derived from the KL-constrained reinforcement learning setting:

$$\max_{p_{\theta}}\ \mathbb{E}_{x\sim D,\,y\sim p_{\theta}(y|x)}[r(x,y)]-\beta\,\mathbb{D}_{\mathrm{KL}}[p_{\theta}(y|x)\,||\,p_{\mathrm{ref}}(y|x)], \tag{1}$$

where $x$ is the input text, $y$ is the output text, $p_{\theta}$ is the probability distribution parameterized by the LLM under fine-tuning with trainable parameters $\theta$, $p_{\mathrm{ref}}$ is a reference distribution which is often instantiated by another LLM, and $\beta$ is a parameter controlling the strength of the KL-divergence penalty:

$$\mathbb{D}_{\mathrm{KL}}[p_{\theta}\,||\,p_{\mathrm{ref}}]=\mathbb{E}_{y\sim p_{\theta}(y|x)}\left[\log\frac{p_{\theta}(y|x)}{p_{\mathrm{ref}}(y|x)}\right]. \tag{2}$$

$r(x,y)$ is the reward of $y$ given $x$ under the Bradley-Terry (BT) model (Bradley and Terry, 1952):

$$p(y_1\succ y_2|x)=\frac{\exp(r(x,y_1))}{\exp(r(x,y_1))+\exp(r(x,y_2))}, \tag{3}$$

where $p(y_1\succ y_2|x)$ is the probability of output $y_1$ being better than output $y_2$. This reward model can be optimized by maximum likelihood estimation:

$$\begin{aligned} \mathcal{L}(r)&=-\mathbb{E}_{(x,y_w,y_l)\sim D}[\log p(y_w\succ y_l|x)]\\ &=-\mathbb{E}_{(x,y_w,y_l)\sim D}[\log\sigma(r(x,y_w)-r(x,y_l))]. \end{aligned} \tag{4}$$

Here $x$ is an input in the dataset $D$, $y_w$ and $y_l$ are a pair of outputs between which $y_w$ has better quality, and $\sigma(\cdot)$ is the sigmoid function.
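The Bradley-Terry model and its maximum-likelihood objective (Eqs. 3 and 4) can be sketched in a few lines of plain Python. This is a minimal illustration that assumes scalar rewards are already available for each output; the function names `log_sigmoid`, `bt_preference_prob`, and `bt_nll` are hypothetical and chosen only for this example.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def bt_preference_prob(r1: float, r2: float) -> float:
    """Eq. 3: probability that y1 is preferred over y2, given their rewards.

    exp(r1) / (exp(r1) + exp(r2)) equals sigmoid(r1 - r2).
    """
    return math.exp(log_sigmoid(r1 - r2))


def bt_nll(reward_pairs) -> float:
    """Eq. 4: mean negative log-likelihood over (r_w, r_l) reward pairs."""
    pairs = list(reward_pairs)
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)


# Illustrative rewards only (r_w for the preferred output, r_l for the other).
print(bt_preference_prob(2.0, 0.0))
print(bt_nll([(2.0, 0.0), (0.5, 1.0)]))
```

Only the reward difference enters the loss, so adding the same constant to both rewards of a pair leaves the objective unchanged.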

Rafailov et al. (2023) show that optimizing the aforementioned RL objective (Eq. 1) is equivalent to optimizing the following objective, which learns a reward model parameterized by the model’s policy $p_{\theta}$ (and the reference policy $p_{\mathrm{ref}}$) under the BT model:

$$\begin{aligned} &\mathcal{L}_{\mathrm{DPO}}(p_{\theta};p_{\mathrm{ref}})=\\ &-\mathbb{E}_{(x,y_w,y_l)\sim D}\left[\log\sigma\left(\beta\log\frac{p_{\theta}(y_w|x)}{p_{\mathrm{ref}}(y_w|x)}-\beta\log\frac{p_{\theta}(y_l|x)}{p_{\mathrm{ref}}(y_l|x)}\right)\right], \end{aligned} \tag{5}$$

where a reward model is implicitly learned:

$$r_{\theta}(x,y)=\beta\log\frac{p_{\theta}(y|x)}{p_{\mathrm{ref}}(y|x)}. \tag{6}$$

We note that the reference policy $p_{\mathrm{ref}}$ used in DPO is typically the LLM after supervised fine-tuning, which is the same LLM that will be fine-tuned using DPO. For simplicity, we refer to this model as the SFT model.
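To make Eqs. 5 and 6 concrete, the sketch below computes the DPO loss from sequence-level log-probabilities in plain Python. It is an illustration under stated assumptions rather than the training code used in this work: the log-probabilities are assumed to be precomputed, and the names `implicit_reward` and `dpo_loss` are hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def implicit_reward(logp_theta: float, logp_ref: float, beta: float) -> float:
    """Eq. 6: r_theta(x, y) = beta * (log p_theta(y|x) - log p_ref(y|x))."""
    return beta * (logp_theta - logp_ref)


def dpo_loss(batch, beta: float = 0.1) -> float:
    """Eq. 5, averaged over a batch.

    Each item is (logp_w, logp_l, ref_logp_w, ref_logp_l): sequence
    log-probabilities of the preferred (w) and dispreferred (l) outputs
    under the fine-tuned policy and under the reference policy.
    """
    items = list(batch)
    total = 0.0
    for logp_w, logp_l, ref_logp_w, ref_logp_l in items:
        margin = (implicit_reward(logp_w, ref_logp_w, beta)
                  - implicit_reward(logp_l, ref_logp_l, beta))
        total -= log_sigmoid(margin)
    return total / len(items)


# Illustrative values: the policy has shifted probability toward y_w
# relative to the reference, so the loss falls below log(2).
print(dpo_loss([(-9.0, -13.0, -10.0, -12.0)], beta=0.1))
```

When the policy equals the reference, every implicit reward is zero and the loss equals $\log 2$, which is the value at the start of training when the SFT model serves as both the initialization and the reference.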

2.3 Related Work on Preference Learning

Apart from DPO, various training methods of preference learning have been proposed recently. For example, IPO (Azar et al., 2024) has been proposed to address the potential limitation of DPO, which tends to overlook the KL-divergence regularization, making it easy to overfit. On the other hand, many preference learning algorithms are reference-policy-free. Among these, contrastive learning is a major category that interprets the model-predicted probability of an output as a quality score Liu et al. (2022); Zhao et al. (2023b); Yuan et al. (2023); Xu et al. (2024). Other related methods also share similar training objectives that are based on the model-predicted probability Xu et al. (2023); Hong et al. (2024); Meng et al. (2024). These methods have been shown to achieve better or comparable performance to DPO in various settings, without the need for a reference model. However, we note that additional normalization or regularization is still required for these methods. For example, an MLE objective on the positive examples is used by SLiC-HF Zhao et al. (2023a), CPO Xu et al. (2024), ORPO Hong et al. (2024), among others. Meanwhile, SimPO Meng et al. (2024) highlights the importance of length normalization of the sequence log-likelihood in improving method effectiveness. In this work, we focus on analyzing DPO since it has a more well-established theoretical background while presenting a comparison with reference-free training objectives in §5.

3 DPO for Instruction Fine-tuning

In this work, we center our analysis in the context of DPO’s applications in instruction fine-tuning. Therefore, here we outline the experimental settings for our analysis in the following sections.

Datasets

Following the setup of Zephyr (Tunstall et al., 2023), we use the UltraFeedback (Cui et al., 2023) dataset, specifically its binarized version (available at https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized), for DPO fine-tuning. It contains 64K data examples, each consisting of one user instruction and a positive-negative output pair scored by GPT-4. (Further details of datasets are in Appendix A.1.)

Models

We mainly use two SFT LLMs from two recent works on instruction fine-tuning as the starting points for DPO fine-tuning. The first LLM, mistral-7b-sft-beta (https://huggingface.co/HuggingFaceH4/mistral-7b-sft-beta) from Zephyr (Tunstall et al., 2023), is fine-tuned from the Mistral 7B base model (Jiang et al., 2023) on the UltraChat (Ding et al., 2023) dataset (https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k). We will refer to this model as mistral-7b for simplicity. The second LLM, tulu-2-7b (https://huggingface.co/allenai/tulu-2-7b) from Tulu 2 (Ivison et al., 2023), is fine-tuned from the Llama-2-7B base model (Touvron et al., 2023) on the TULU v2 SFT mixture dataset (https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture). We choose these two models for better transparency and reproducibility, since their experimental artifacts are publicly accessible.

Training Settings

We follow the training recipe proposed in Tunstall et al. (2023) for the experiments, which is also adopted by Ivison et al. (2023). Specifically, the models are fine-tuned with DPO for 3 epochs on the UltraFeedback dataset, with the batch size set to 32 and linear learning rate scheduling with warmup. (More details are in Appendix A.2.) Tunstall et al. (2023) note that while the models fine-tuned using DPO tend to overfit, this overfitting does not hurt the models’ performance, and they therefore picked the final checkpoint for evaluation. However, to better understand the effects of the DPO training objective on the final model performance, we instead select the checkpoints based on their loss on the validation set of the UltraFeedback dataset.
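The two mechanical pieces of this recipe, a linear learning-rate schedule with warmup and checkpoint selection by validation loss, can be sketched as follows. This is a generic illustration, not the released training script: the peak learning rate, warmup length, checkpoint names, and loss values below are all hypothetical placeholders, and the actual hyperparameters are those given in Appendix A.2.

```python
def linear_schedule_with_warmup(step: int, total_steps: int,
                                warmup_steps: int, peak_lr: float) -> float:
    """Learning rate that rises linearly to peak_lr, then decays linearly to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


def select_checkpoint(val_losses: dict) -> str:
    """Return the checkpoint name with the lowest validation loss."""
    return min(val_losses, key=val_losses.get)


# Hypothetical values for illustration only.
for step in (0, 5, 10, 55, 100):
    print(step, linear_schedule_with_warmup(step, total_steps=100,
                                            warmup_steps=10, peak_lr=1.0))
print(select_checkpoint({"epoch-1": 0.60, "epoch-2": 0.50, "epoch-3": 0.55}))
```

Selecting by validation loss rather than taking the final checkpoint ties the evaluated model more directly to the DPO objective, which is the motivation stated above.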

Evaluations

We mainly focus on evaluating the LLMs’ instruction-following capabilities. To this end, we use the AlpacaEval 2 (Li et al., 2023) benchmark for the main evaluation, which consists of 805 test examples for evaluating instruction-following and uses GPT-4 to evaluate the system outputs by comparing them with the outputs generated by GPT-4. (gpt-4-1106-preview is used to generate the outputs and to perform the pairwise output comparison.) Notably, its length-controlled version (Dubois et al., 2024a), which mitigates the length bias of GPT-4’s evaluations by predicting and canceling it, achieves a 0.98 Spearman correlation with LMSYS’ Chatbot Arena (Chiang et al., 2024) evaluation results collected from crowd-sourced human participants. Therefore, we mainly report the length-controlled AlpacaEval 2 scores in the following sections.

4 RQ1: What Is the Optimal KL Constraint Strength for DPO?

| $\beta$ | mistral-7b | tulu-2-7b |
|---|---|---|
| $\infty$ (original) | 7.57 | 8.50 |
| 0.1 (previous work*) | 13.42 | 9.20 |
| 0.1 | 14.03 | 9.38 |
| 0.05 | 13.29 | 9.96 |
| 0.02 | 16.06 | 10.46 |
| 0.01 | 16.25 | 7.86 |
| 0.005 | 12.36 | [degenerate] |

Table 1: Model performance on AlpacaEval 2 with varying values of $\beta$ (regularization strength with respect to the reference model). The DPO fine-tuning is started from mistral-7b and tulu-2-7b respectively. *: the DPO fine-tuned models in previous work, Zephyr (Tunstall et al., 2023) and Tulu 2 (Ivison et al., 2023).

The regularization from the reference policy in DPO is introduced by the KL-divergence constraint (Eq. 2). Therefore, we start our analysis of reference policies by investigating the impact of this KL constraint. Specifically, in KL-constrained RL and DPO, the weighting coefficient $\beta$ controls the trade-off between maximizing the reward $r$ and minimizing the deviation from the reference policy $p_{\mathrm{ref}}$ (Eq. 1). Therefore, we vary the value of $\beta$ to understand the impact of this KL constraint.

4.1 Main Results

Table 1 shows the AlpacaEval 2 performance of the models fine-tuned using DPO from mistral-7b and tulu-2-7b. Following the original setting of DPO, we also use these two SFT models as the reference models. We find a similar trend for the two series of fine-tuned models: a smaller KL constraint generally improves performance, until the constraint becomes too small and leads to performance degradation.

The above observation can be intuitively explained by the need for the fine-tuned model to deviate from the reference model, which is the SFT model targeted for improvement, while maintaining necessary regularization. However, we note the algorithm’s sensitivity to the strength of the KL constraint: while Zephyr (Tunstall et al., 2023) and Tulu 2 (Ivison et al., 2023) both set the value of $\beta$ to 0.1, the results in Table 1 suggest that a smaller value of $\beta$ may be more effective, yet with an increasing risk of performance degradation. Next, we present further analysis of this KL-constraint strength sensitivity.

4.2 Ranking Accuracy

Recent work (Chen et al., 2024) noted a discrepancy between the learned (implicit) reward model and the learned policy model in DPO in terms of their capabilities of correctly ranking candidate outputs according to their quality. In particular, the ranking accuracy of the learned reward model (Eq. 6) is usually high, indicating that the better output $y_w$ is assigned a higher reward, i.e., $r_{\theta}(x,y_w)>r_{\theta}(x,y_l)$. However, Chen et al. (2024) found that the ranking accuracy of the learned policy $p_{\theta}$ remains low even after DPO fine-tuning. They partly attributed this to the generally low ranking accuracy of the reference policy $p_{\mathrm{ref}}$, noting that DPO fine-tuning rarely reverses the preference (i.e., $p(y_w|x)$ vs. $p(y_l|x)$) due to the KL constraint.

We noticed a similar trend in our experiments. Specifically, Table 2 shows the ranking accuracy of the learned reward model $r_{\theta}$ and the policy model $p_{\theta}$ on the validation set of UltraFeedback. We observe a negative correlation between the value of $\beta$ (the strength of the KL constraint) and the ranking accuracy of $p_{\theta}$: a sufficiently small $\beta$ is necessary to exceed the 50% ranking accuracy of the random oracle baseline. Furthermore, although a smaller $\beta$ can lead to a higher policy ranking accuracy, it can also cause model degradation, as noted in §4.1. (Chen et al. (2024) have also noted that the ranking accuracy and the generation performance of the learned policy are not always positively correlated.)

| $\beta$ | mistral-7b $r_{\theta}$ | mistral-7b $p_{\theta}$ | tulu-2-7b $r_{\theta}$ | tulu-2-7b $p_{\theta}$ |
|---|---|---|---|---|
| $\infty$ (original) | 0.500 | 0.435 | 0.500 | 0.439 |
| 0.1 | 0.718 | 0.495 | 0.773 | 0.462 |
| 0.05 | 0.744 | 0.513 | 0.778 | 0.483 |
| 0.02 | 0.746 | 0.544 | 0.766 | 0.536 |
| 0.01 | 0.751 | 0.605 | 0.758 | 0.605 |
| 0.005 | 0.766 | 0.704 | 0.750 | 0.680 |

Table 2: Ranking accuracy of the learned reward model $r_{\theta}(x,y)$ and the policy model $p_{\theta}(y|x)$ on the UltraFeedback validation set.
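The two ranking accuracies compared in this section can be computed from sequence log-probabilities as sketched below. This is a minimal illustration, not the evaluation code behind Table 2: the function name `ranking_accuracies` and the example values are hypothetical, and counting ties as incorrect is an assumption of this sketch.

```python
def ranking_accuracies(pairs):
    """Return (reward_accuracy, policy_accuracy) over preference pairs.

    Each pair is (logp_w, logp_l, ref_logp_w, ref_logp_l): sequence
    log-probabilities of the preferred (w) and dispreferred (l) outputs
    under the fine-tuned policy and the reference policy.
    """
    n = reward_hits = policy_hits = 0
    for logp_w, logp_l, ref_logp_w, ref_logp_l in pairs:
        n += 1
        # Implicit reward (Eq. 6) without beta: a positive beta rescales both
        # rewards equally and therefore does not change their ordering.
        reward_hits += (logp_w - ref_logp_w) > (logp_l - ref_logp_l)
        # Policy ranking: compare the policy's own log-probabilities.
        policy_hits += logp_w > logp_l
    return reward_hits / n, policy_hits / n


# Illustrative log-probabilities only.
example_pairs = [
    (-20.0, -18.0, -24.0, -19.0),  # reward correct, policy incorrect
    (-15.0, -17.0, -16.0, -17.5),  # both correct
    (-30.0, -25.0, -29.0, -26.0),  # both incorrect
    (-12.0, -11.0, -14.0, -11.5),  # reward correct, policy incorrect
]
print(ranking_accuracies(example_pairs))  # (0.75, 0.25)
```

The first and last pairs show the pattern discussed above: the policy moves toward the preferred output relative to the reference, which is enough for the implicit reward to rank the pair correctly, but not enough to reverse the ordering of the policy's own probabilities.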

4.3 Token-Level Difference

Figure 1: The distribution of the average token log-probability difference between the DPO-fine-tuned and reference models (Eq. 8). The models are fine-tuned from mistral-7b with different values of $\beta$.

Rafailov et al. (2024) have recently shown DPO implicitly parameterizes a dense reward function:

$$r_\theta(y^{(i)}, s^{(i)}) = \beta \log \frac{p_\theta(y^{(i)}|x, y^{(<i)})}{p_{\mathrm{ref}}(y^{(i)}|x, y^{(<i)})}, \tag{7}$$

where $y^{(i)}$ is the $i$-th token (action step) in the output $y$, $y^{(<i)}$ is the prefix of $y$ before $y^{(i)}$, and $s^{(i)}$ is the current “state”, which is determined by the input $x$ and the prefix $y^{(<i)}$. Intuitively, as noted by Rafailov et al. (2024), this formulation implies that DPO can learn token-level credit assignment, which is proportional to the difference between the log probabilities assigned by the trained model and the reference model. Therefore, we use this statistic, the token log-probability difference, to study the token-level difference between the trained and reference models:

$$\hat{r}_\theta(y^{(i)}, s^{(i)}) = \log \frac{p_\theta(y^{(i)}|x, y^{(<i)})}{p_{\mathrm{ref}}(y^{(i)}|x, y^{(<i)})}. \tag{8}$$
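The statistic in Eq. 8 can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the code used in this work: it assumes that next-token logits from both models are already available for every position of one output, and the helper names and toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def token_logprob_differences(policy_logits, ref_logits, token_ids):
    """Eq. 8 for every position i of one output y.

    policy_logits[i] and ref_logits[i] are next-token logits given (x, y^(<i));
    token_ids[i] is the token y^(i) that actually appears in y.
    """
    diffs = []
    for pol, ref, tok in zip(policy_logits, ref_logits, token_ids):
        diffs.append(log_softmax(pol)[tok] - log_softmax(ref)[tok])
    return diffs


# Toy example: vocabulary of 3 tokens, output of 2 tokens (values are made up).
policy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, -3.0]]
ref_logits = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
token_ids = [0, 2]

diffs = token_logprob_differences(policy_logits, ref_logits, token_ids)
dense_rewards = [0.01 * d for d in diffs]  # Eq. 7 with beta = 0.01
```

Here the first token is upweighted relative to the reference (positive difference) and the second is downweighted (negative difference); multiplying by $\beta$ recovers the dense reward of Eq. 7.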

Figure 1 shows the distribution of the average log-probability difference of each token in the model’s vocabulary on the UltraFeedback validation set, with models fine-tuned from mistral-7b. We note the following:

(1) Similar to the findings in Rafailov et al. (2024), we found that DPO fine-tuning generally leads to a decrease in token probabilities.

(2) The distribution of token-level log-probability differences is imbalanced, with a small portion of tokens receiving very different probabilities.

(3) A weaker KL constraint (a smaller $\beta$) leads to a larger portion of tokens with large probability differences.

Figure 2 displays the top 10 tokens with the largest probability differences that appear at least 100 times. Notably, the end-of-sequence token (EOS) is the most downweighted among them. Additionally, a smaller KL constraint results in a larger difference for the EOS token. Consequently, we observe that a smaller KL constraint generally leads to longer outputs, suggesting that the models have learned a preference for output length.

Figure 2: The tokens most downweighted on average by the models DPO-fine-tuned from mistral-7b with different values of $\beta$.
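The per-token aggregation behind an analysis like Figure 2 can be sketched as follows. This is a hypothetical reconstruction, not the analysis script used in this work: it assumes one (token, log-probability difference) record per token occurrence, averages the differences per vocabulary token, keeps tokens above a minimum frequency, and sorts so that the most downweighted tokens come first.

```python
from collections import defaultdict


def most_downweighted_tokens(records, min_count=100, top_k=10):
    """records: iterable of (token, log-prob difference) pairs, one per occurrence.

    Returns up to top_k (token, mean difference, count) tuples,
    ordered from the most negative mean difference upward.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for token, diff in records:
        sums[token] += diff
        counts[token] += 1
    means = [
        (token, sums[token] / counts[token], counts[token])
        for token in sums
        if counts[token] >= min_count  # drop rare tokens with noisy averages
    ]
    means.sort(key=lambda item: item[1])
    return means[:top_k]


# Toy records (made up); "rare" is filtered out because it appears only once.
records = [("</s>", -9.0)] * 3 + [("the", -0.5)] * 3 + [("rare", -20.0)]
top = most_downweighted_tokens(records, min_count=3, top_k=2)
```

The frequency threshold matters: without it, a single extreme occurrence of a rare token would dominate the ranking.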

Discussion

Our token-level analysis reveals that as the strength of the KL constraint decreases, the DPO-fine-tuned model begins to assign significantly different probabilities to a small subset of specific tokens compared to the reference model. The extremeness of these log-probability differences is notable: a difference of $-10$ indicates that the token is downweighted by a factor of $e^{10} \approx 22000$ on average. This partially explains the sensitivity to $\beta$, as the increasing extremity of log-probability differences can destabilize the model. Moreover, it highlights a limitation of the sequence-level KL constraint: it does not effectively prevent extreme token-level differences.
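The arithmetic behind the downweighting factor quoted above can be checked directly. The helper name below is hypothetical and the snippet is only a worked example of converting a log-probability difference (Eq. 8) into a probability ratio.

```python
import math


def probability_ratio(log_prob_difference):
    """p_theta / p_ref implied by a token log-probability difference (Eq. 8)."""
    return math.exp(log_prob_difference)


ratio = probability_ratio(-10.0)     # p_theta / p_ref, roughly 4.5e-5
downweight_factor = 1.0 / ratio      # e^10, roughly 22026
```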

5 RQ2: Is a Reference Policy Necessary for Effective Preference Learning?

We now take a step back to examine the necessity of using the SFT model as the reference policy for preference learning algorithms. As discussed in §2.3, recent studies have already proposed reference-free algorithms. We are, however, more interested in a closer comparison with DPO in terms of the training objective similarity and the connection with the RL objective, to better understand the role of the reference policies in DPO.

5.1 Training Objectives

To this end, we investigate two related training objectives and provide theoretical analyses in the framework established by Rafailov et al. (2023): optimizing the model using (1) the posterior probability ($p_\theta(y|x)$) and (2) the likelihood function ($p_\theta(x|y)$) as the reward function in the BT model.

Posterior probability as the reward

Recall that Rafailov et al. (2023) have shown that the DPO training objective learns a reward model: $r_\theta(x,y) = \beta \log \frac{p_\theta(y|x)}{p_{\mathrm{ref}}(y|x)}$. An alternative option is to use the posterior probability with a scaling hyperparameter $\beta$ as the reward function, which does not require a reference model:

$$\tilde{r}_\theta(x, y) = \beta \log p_\theta(y|x). \tag{9}$$

The training objective under the BT model (Eq. 4) then becomes

$$\mathcal{L}_{\mathrm{Prob}}(p_\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[\log \sigma\left(\beta \log p_\theta(y_w|x) - \beta \log p_\theta(y_l|x)\right)\right]. \tag{10}$$

We note the connection between Eq. 10 and Expected Risk Minimization (Edunov et al., 2018). Specifically, $\mathcal{L}_{\mathrm{Prob}}(p_\theta)$ can be rewritten as

$$\mathcal{L}_{\mathrm{Prob}}(p_\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[\log \frac{p_\theta(y_w|x)^\beta}{p_\theta(y_w|x)^\beta + p_\theta(y_l|x)^\beta}\right], \tag{11}$$
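The equality of the sigmoid form (Eq. 10) and the ratio form (Eq. 11) can be verified numerically for a single preference pair. The sketch below is illustrative only; the function names are hypothetical and the inputs are assumed to be sequence-level log-probabilities under $p_\theta$.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def prob_loss_sigmoid_form(logp_w, logp_l, beta):
    """Per-pair loss of Eq. 10, computed in log space."""
    return -log_sigmoid(beta * logp_w - beta * logp_l)


def prob_loss_ratio_form(logp_w, logp_l, beta):
    """Per-pair loss of Eq. 11, computed directly from probabilities."""
    p_w = math.exp(logp_w) ** beta
    p_l = math.exp(logp_l) ** beta
    return -math.log(p_w / (p_w + p_l))


# Toy log-probabilities for one pair (made up).
loss_a = prob_loss_sigmoid_form(-2.0, -3.0, beta=1.0)
loss_b = prob_loss_ratio_form(-2.0, -3.0, beta=1.0)
```

Both forms give the same value on small inputs. For realistic sequence log-probabilities the ratio form can underflow to zero, so the log-space form of Eq. 10 is the safer one to implement.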

while the Expected Risk Minimization objective in a similar setting is

$$\mathcal{L}_{\mathrm{Risk}}(p_\theta) = \mathbb{E}_{(x, y_w, y_l) \sim D}\left[\sum_{y \in \{y_w, y_l\}} \mathrm{cost}(y, x) \frac{p_\theta(y_w|x)}{p_\theta(y_w|x) + p_\theta(y_l|x)}\right]. \tag{12}$$

Thus, when the cost function is binary, i.e., $\mathrm{cost}(y_w, x) = 0$ and $\mathrm{cost}(y_l, x) = 1$, and the scaling factor $\beta$ is set to 1, the only difference between Eq. 11 and Eq. 12 is the logarithm function in Eq. 11. This objective has also been investigated in recent work (Xu et al., 2024) for fine-tuning LLMs for machine translation.

Furthermore, following the analytical framework in Rafailov et al. (2023), we can observe that optimizing $\mathcal{L}_{\mathrm{Prob}}(p_\theta)$ is equivalent to optimizing the policy model under the following RL objective:

$$\max_{p_\theta} \mathbb{E}_{x \sim D, y \sim p_\theta(y|x)}[r(x, y)] + \beta \mathrm{H}[p_\theta(y|x)], \tag{13}$$

where $\mathrm{H}$ is the entropy function:

$$\mathrm{H}[p_\theta(y|x)] = \mathbb{E}_{y \sim p_\theta(y|x)}[-\log p_\theta(y|x)]. \tag{14}$$

The proof of the equivalence of Eq. 10 and Eq. 13 is in Appendix B, with further analyses showing that Eq. 10 shares similar properties with DPO.

We note that Eq. 13 resembles the Maximum Entropy (MaxEnt) RL setting (Ziebart et al., 2008; Eysenbach and Levine, 2021): apart from optimizing for a higher reward, this objective discourages the model from becoming overly deterministic.
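The entropy-regularized objective in Eq. 13 has a well-known closed-form maximizer over a finite output set, $p^*(y|x) \propto \exp(r(x,y)/\beta)$. The following toy check is only a numeric sanity check on a three-output problem with made-up rewards, not a substitute for the proof in Appendix B; all names are hypothetical. It also shows that $\beta \log p^*$ recovers reward differences, which is why the normalization constant cancels in the BT model.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    z = sum(exps)
    return [e / z for e in exps]


def maxent_objective(probs, rewards, beta):
    """E_{y~p}[r(y)] + beta * H[p] for a single input x and finite outputs."""
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return expected_reward + beta * entropy


rewards = [1.0, 0.2, -0.5]   # made-up rewards for three candidate outputs
beta = 0.5
p_star = softmax([r / beta for r in rewards])
best_value = maxent_objective(p_star, rewards, beta)

# Randomly drawn distributions should never beat the closed-form solution.
rng = random.Random(0)
best_random = -math.inf
for _ in range(1000):
    raw = [rng.random() + 1e-9 for _ in rewards]
    total = sum(raw)
    candidate = [v / total for v in raw]
    best_random = max(best_random, maxent_objective(candidate, rewards, beta))

# beta * (log p*(y_0) - log p*(y_1)) equals r(y_0) - r(y_1).
gap = beta * (math.log(p_star[0]) - math.log(p_star[1]))
```

A larger $\beta$ flattens $p^*$ toward the uniform distribution, which matches the reading of the entropy term as a guard against overly deterministic policies.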

Likelihood function as the reward

Apart from the posterior probability, another candidate reward function is the likelihood function $p(x|y)$. Specifically, by Bayes’ theorem, we have

$$p(y|x) \propto p(x|y) p(y). \tag{15}$$

That is, the posterior distribution $p(y|x)$ that the model aims to learn can be factorized into the likelihood function $p(x|y)$ and the prior distribution $p(y)$. The prior distribution $p(y)$ is what language model pre-training learns. Meanwhile, under the task of instruction-following, $p(x|y)$ captures how likely the instruction $x$ is given the model response $y$.

With the above factorization, we derive the learning objective with $p(x|y)$ as the reward function. Specifically, using Bayes’ theorem again, we have

$$p(x|y) = \frac{p(y|x) p(x)}{p(y)}. \tag{16}$$

To model $p(x|y)$, we use the LLM under fine-tuning to parameterize $p(y|x)$, i.e., $p_\theta(y|x)$, and use a fixed LLM to parameterize $p(y)$, which we label $p_{\mathrm{LM}}(y)$ (we use the SFT model in this work). We will show later that $p(x)$ does not need an explicit parameterization. We have

$$p_\theta(x|y) = \frac{p_\theta(y|x) p(x)}{p_{\mathrm{LM}}(y)}. \tag{17}$$

Using $p_\theta(x|y)$ as the reward function with a scaling factor $\beta$, the objective under the BT model (Eq. 4) becomes

$$\begin{aligned}
&\mathcal{L}_{\mathrm{Likelihood}}(p_\theta) \\
&= -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[\log \sigma\left(\beta \log p_\theta(x|y_w) - \beta \log p_\theta(x|y_l)\right)\right] \\
&= -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[\log \frac{p_\theta(x|y_w)^\beta}{p_\theta(x|y_w)^\beta + p_\theta(x|y_l)^\beta}\right] \\
&= -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[\log \frac{\left(\frac{p_\theta(y_w|x)}{p_{\mathrm{LM}}(y_w)}\right)^\beta}{\left(\frac{p_\theta(y_w|x)}{p_{\mathrm{LM}}(y_w)}\right)^\beta + \left(\frac{p_\theta(y_l|x)}{p_{\mathrm{LM}}(y_l)}\right)^\beta}\right]
\end{aligned} \tag{18}$$

Eq. 18 is equivalent to using $\frac{p_\theta(y|x)}{p_{\mathrm{LM}}(y)}$ as the reward function, which has an intuitive interpretation: the condition-dependent quality score of $y$ is its conditional probability given $x$ normalized by its unconditional probability.
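The last step of Eq. 18 relies on $p(x)$ appearing in both rewards and cancelling in their difference. A minimal per-pair sketch makes this explicit; the function and argument names are hypothetical, and the inputs are assumed to be sequence-level log-probabilities.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def likelihood_loss(logp_w, logp_l, lm_logp_w, lm_logp_l, beta, log_p_x=0.0):
    """Per-pair loss of Eq. 18.

    logp_*    : log p_theta(y|x) from the model being fine-tuned
    lm_logp_* : log p_LM(y) from the fixed unconditional model
    log_p_x   : log p(x); added to both rewards, so it cancels in the difference
    """
    reward_w = beta * (logp_w + log_p_x - lm_logp_w)
    reward_l = beta * (logp_l + log_p_x - lm_logp_l)
    return -log_sigmoid(reward_w - reward_l)


# Toy values (made up): the loss is unchanged whatever value log p(x) takes.
loss_without_px = likelihood_loss(-20.0, -25.0, -40.0, -42.0, beta=0.1)
loss_with_px = likelihood_loss(-20.0, -25.0, -40.0, -42.0, beta=0.1, log_p_x=-7.3)
```

Compared with DPO, the only change in this sketch is that the subtracted term is the unconditional $\log p_{\mathrm{LM}}(y)$ rather than a conditional reference log-probability.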

Furthermore, under the analytical framework of DPO, we note that Eq. 18 is equivalent to the KL-constrained RL setting (Eq. 1) with the prior probability $p_{\mathrm{LM}}(y)$ as the reference policy. That is,

$$\max_{p_\theta} \mathbb{E}_{x \sim D, y \sim p_\theta(y|x)}[r(x, y)] - \beta \mathbb{D}_{\mathrm{KL}}[p_\theta(y|x) \,\|\, p_{\mathrm{LM}}(y)]. \tag{19}$$

5.2 Result Analysis

Method Score Accuracy Length $\beta$
original (SFT) 7.57 0.435 205.0 -
DPO (Eq. 5) 16.25 0.605 359.4 0.01
probability (Eq. 10) 12.84 0.697 744.5 100.0
likelihood (Eq. 18) 13.63 0.602 389.3 0.01
Table 3: Performance comparison of different training objectives for fine-tuning mistral-7b. The best performance with the optimal $\beta$ is reported. The AlpacaEval2 score, the ranking accuracy of the model’s probability, and the output length are reported.

We use mistral-7b as the base model for fine-tuning to compare DPO with the other two reward parameterizations discussed above. Table 3 summarizes the performance of different algorithms with the optimal value of $\beta$. We found that

(1) DPO outperforms other fine-tuning methods, and using the unconditional probability as the reference policy (Eq. 18) outperforms the reference-free training objective (Eq. 10).

(2) All fine-tuning methods improve the performance of the original SFT model.

(3) Using posterior probability as the reward function (Eq. 10) achieves the highest ranking accuracy based on the predicted probability and results in significantly longer outputs. This objective also results in a large optimal value of $\beta$, likely because increasing the entropy helps prevent degeneration.

Case Study

To further understand the difference in these training objectives, we present a case study with the EOS token. Specifically, as discussed in §4.3, the model trained using DPO learns to assign significantly different probabilities to the EOS token compared with the SFT model. We found a similar pattern with the other two training algorithms compared in this section. Therefore, here we provide a further investigation.

Specifically, we calculate two statistics: (1) the average log-probability assigned to the EOS token; (2) the average difference in the log-probability assigned to the EOS token between positive ($y_w$) and negative ($y_l$) examples, illustrating how the model behaves differently in these two scenarios with the EOS token. Figure 3 presents these two statistics of models fine-tuned with different reward function parameterizations (averaged over different values of $\beta$). We note the difference between DPO and the fine-tuning method that uses posterior probability as the reward function: the former achieves a larger distinction between positive and negative examples in terms of the probability assignment to the EOS token, while the latter shows a smaller distinction, despite generally assigning more extreme probabilities to the EOS token, which leads to significantly longer outputs. This suggests that the reference policy in preference learning algorithms helps to stabilize the model behavior. We present a qualitative case study of the outputs generated with different training objectives in Appendix C.

Figure 3: Average log-probability and the average log-probability difference between positive and negative examples of the EOS token, as assigned by models fine-tuned with different training objectives.
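One plausible way to compute the two EOS statistics described in the case study is sketched below. The exact procedure is not specified in the text, so this sketch makes assumptions: each preference pair contributes one EOS log-probability for $y_w$ and one for $y_l$ (for example, measured at the end of each output), and the first statistic averages over both. The function name and the toy values are hypothetical.

```python
def eos_statistics(eos_logp_w, eos_logp_l):
    """eos_logp_w[k] / eos_logp_l[k]: log-probability the model assigns to EOS
    for the preferred / dispreferred output of preference pair k.

    Returns (average EOS log-probability, average positive-minus-negative gap).
    """
    n = len(eos_logp_w)
    if n == 0 or n != len(eos_logp_l):
        raise ValueError("need two non-empty lists of equal length")
    mean_logp = (sum(eos_logp_w) + sum(eos_logp_l)) / (2 * n)
    mean_gap = sum(w - l for w, l in zip(eos_logp_w, eos_logp_l)) / n
    return mean_logp, mean_gap


# Toy values for two preference pairs (made up).
mean_logp, mean_gap = eos_statistics([-1.0, -2.0], [-3.0, -6.0])
```

A very negative average with a small gap would correspond to a model that suppresses EOS everywhere, while a larger gap indicates that the model separates preferred from dispreferred outputs when deciding where to stop.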

6 RQ3: Does DPO Benefit from Stronger Reference Policies?

In DPO, the reference policy is typically instantiated as the SFT model that is to be further fine-tuned (Rafailov et al., 2023). In the previous sections, we have demonstrated the importance of the reference policies in DPO. Now we investigate the next question: does DPO benefit from stronger reference policies?

6.1 Experimental Settings

We choose two LLMs as the stronger reference policies in DPO for the fine-tuning of mistral-7b and tulu-2-7b: (1) mistral-7b-instruct-v0.2 (Jiang et al., 2023; https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) and (2) llama-3-70b-instruct (https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct). We chose these two models because they have shown better instruction-following capabilities than the two base models we used in DPO fine-tuning, achieving length-controlled AlpacaEval2 scores of 20.55 and 34.40 respectively. We refer to them as mistral-v0.2 and llama3 for simplicity.

6.2 Result Analysis

Table 4 shows the performance of models after DPO fine-tuning with various values of $\beta$ and different reference models. For easier comparison, we also show the model performance with optimal values of $\beta$ using different reference models in Table 5. We note the following:

(1) A stronger reference policy in DPO fine-tuning can improve DPO’s effectiveness. Specifically, the model fine-tuned from mistral-7b performs best with mistral-v0.2 as the reference, and the model fine-tuned from tulu-2-7b performs best with llama3 as the reference.

(2) However, a stronger reference policy in DPO fine-tuning does not always lead to better performance. Notably, DPO fine-tuning of tulu-2-7b with mistral-v0.2 as the reference policy did not improve tulu-2-7b’s original performance. We provide further discussion about this in §6.3.

(3) The optimal KL constraint strength is larger with a stronger, suitable reference policy. In particular, as shown in Table 4, the optimal value of $\beta$ for mistral-7b and tulu-2-7b is 1.0 with their respective suitable reference policies. Conversely, when the SFT model is used as the reference policy, the optimal $\beta$ is 0.01 or 0.02, as shown in Table 1.

$\beta$ mistral-7b tulu-2-7b
mistral-v0.2 llama3 mistral-v0.2 llama3
10.0 18.74 13.29 7.61 9.79
1.00 20.25 9.59 7.85 11.17
0.10 19.58 10.99 [degenerate] 10.31
0.01 17.18 15.37 [degenerate] 9.16
0.005 15.34 11.70 [degenerate] 3.29
Table 4: Model performance on AlpacaEval2 for models fine-tuned from mistral-7b and tulu-2-7b with varying values of $\beta$. Two reference models, mistral-v0.2 and llama3, are used.

original self mistral-v0.2 llama3
mistral-7b 7.57 16.25 20.25 15.37
tulu-2-7b 8.50 10.46 7.85 11.17
Table 5: Model performance on AlpacaEval2. The best performance across different values of β𝛽\betaitalic_β with different reference models is compared, including the SFT model itself (self), mistral-v0.2, and llama3.
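The mistral-v0.2 and llama3 columns of Table 5 follow from Table 4 by taking the best score over $\beta$ for each (base model, reference model) pair. The short sketch below reproduces that selection from the numbers in Table 4; the data structure and function name are hypothetical, and runs reported as degenerate are represented as `None` and skipped.

```python
# AlpacaEval2 scores copied from Table 4; None marks runs reported as degenerate.
table4 = {
    ("mistral-7b", "mistral-v0.2"): {10.0: 18.74, 1.0: 20.25, 0.1: 19.58, 0.01: 17.18, 0.005: 15.34},
    ("mistral-7b", "llama3"): {10.0: 13.29, 1.0: 9.59, 0.1: 10.99, 0.01: 15.37, 0.005: 11.70},
    ("tulu-2-7b", "mistral-v0.2"): {10.0: 7.61, 1.0: 7.85, 0.1: None, 0.01: None, 0.005: None},
    ("tulu-2-7b", "llama3"): {10.0: 9.79, 1.0: 11.17, 0.1: 10.31, 0.01: 9.16, 0.005: 3.29},
}


def best_beta(scores_by_beta):
    """Return (beta, score) with the highest score, ignoring degenerate runs."""
    valid = {b: s for b, s in scores_by_beta.items() if s is not None}
    beta = max(valid, key=valid.get)
    return beta, valid[beta]


best = {pair: best_beta(scores) for pair, scores in table4.items()}
```

The selected scores match the corresponding entries of Table 5, and the selected $\beta$ is 1.0 for the two pairs identified above as suitable.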

6.3 Discussion

Our investigation reveals that while a stronger reference model can be helpful in DPO, this is not always the case. One possible explanation is that the reference model should be similar enough to the model under training to be compatible. This could explain why mistral-7b performs better with mistral-v0.2 as a reference, while tulu-2-7b performs better with llama3: the first pair are fine-tuned from the same base model. Meanwhile, tulu-2-7b is fine-tuned from llama-2-7b, which shares a similar model architecture with llama3 and likely uses similar pre-training corpora as well.

Regarding model similarity and the choice of the reference policy, we acknowledge the need for analytical tools and call for further research in this area. The KL divergence is a potential method, but estimating it is challenging due to the large sampling space, leading to high variance.
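The variance issue mentioned above can be illustrated with a toy Monte Carlo estimator of $\mathrm{KL}(p \,\|\, q)$. This sketch uses two small categorical distributions with made-up probabilities in place of LLM output distributions, where exact enumeration is impossible; the function names are hypothetical.

```python
import math
import random


def exact_kl(p, q):
    """KL(p || q) for two categorical distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def monte_carlo_kl(p, q, num_samples, rng):
    """Estimate KL(p || q) by sampling y ~ p and averaging log p(y) - log q(y)."""
    outcomes = list(range(len(p)))
    samples = rng.choices(outcomes, weights=p, k=num_samples)
    return sum(math.log(p[y] / q[y]) for y in samples) / num_samples


p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]
rng = random.Random(0)

true_kl = exact_kl(p, q)
small_sample_estimates = [monte_carlo_kl(p, q, 10, rng) for _ in range(5)]  # noisy
large_sample_estimate = monte_carlo_kl(p, q, 100_000, rng)  # close to true_kl
```

With only a handful of samples the estimates scatter widely around the true value. For sequence-level distributions the per-sample log-ratio can have a much larger spread, so far more samples would be needed for a stable estimate.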

We note an interesting aspect in the fine-tuning of tulu-2-7b using llama3 as the reference policy. Specifically, these models use different tokenizers, meaning that they do not share the same token-level action space. The fact that this combination yields further improvement suggests that action spaces do not always need to be aligned in DPO, despite related work proving that DPO can implicitly parameterize a dense reward function, which requires a consistent action space Rafailov et al. (2024).
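One way to see why mismatched tokenizers do not block training is that the standard sequence-level DPO loss only needs each model's total log-probability of an output, which each model can compute with its own tokenization. The sketch below is illustrative, with hypothetical function names and made-up per-token log-probabilities; it is not the training code used in this work.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def sequence_logp(token_logps):
    """Sequence log-probability as the sum of per-token log-probabilities."""
    return sum(token_logps)


def dpo_loss(policy_w, policy_l, ref_w, ref_l, beta):
    """Per-pair DPO loss from sequence-level log-probabilities."""
    margin = beta * ((policy_w - ref_w) - (policy_l - ref_l))
    return -log_sigmoid(margin)


# The same texts are split into different numbers of tokens by the two models.
policy_w = sequence_logp([-1.0, -0.5, -2.0])        # y_w: 3 tokens under the policy tokenizer
ref_w = sequence_logp([-1.2, -2.6])                 # y_w: 2 tokens under the reference tokenizer
policy_l = sequence_logp([-2.0, -1.5, -1.0, -0.5])  # y_l: 4 tokens under the policy tokenizer
ref_l = sequence_logp([-2.5, -2.0])                 # y_l: 2 tokens under the reference tokenizer

loss = dpo_loss(policy_w, policy_l, ref_w, ref_l, beta=1.0)
```

Only the four sequence-level sums enter the loss, so no token-by-token alignment between the two models is required, although the token-level reward interpretation of Eq. 7 no longer applies directly in that case.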

7 Conclusion

In this work, we studied the effects of reference policies in DPO, which can stabilize the training while introducing a potential performance upper bound. We found that DPO’s performance is sensitive to the strength of constraints from the reference policy and demonstrated the benefits of using reference policies in preference learning by comparing DPO to related training objectives. We also showed that a stronger reference policy can enhance DPO performance, but only when it is compatible with the model being fine-tuned. Our findings highlight the confounding role of reference policies in DPO, providing empirical insights and emphasizing the need for more in-depth theoretical analysis. We call for future work to provide more theoretical and empirical guidelines for choosing the constraint strength from the reference policy, as well as understanding the similarity and compatibility between the model to be trained and the reference model.

Limitations

Evaluations

We mainly use the AlpacaEval2 benchmark to evaluate the instruction-following capabilities of LLMs in this work, on which the LLM-based evaluation achieves a high correlation with human evaluations (Dubois et al., 2024a). The test examples on this benchmark cover a wide range of instruction types (Dubois et al., 2024b), which enhances the generalizability of the evaluation results. However, we acknowledge that LLM-based automatic evaluation has limitations, such as low self-consistency rates in the evaluators’ predictions and a preference for the LLMs’ own outputs (Liu et al., 2023; Wang et al., 2023; Panickssery et al., 2024). Therefore, while we did not perform human evaluations because the large number of analyses we conducted makes them cost-ineffective, we acknowledge their importance for achieving more faithful evaluation results.

Model Scale

The LLMs we studied in this work primarily consist of around 7 billion parameters. Due to computing resource constraints, we did not investigate LLMs with more parameters. Therefore, whether our observations are applicable on a larger scale remains to be investigated.

Acknowledgements

We thank Hamish Ivison for helpful suggestions regarding the training details of DPO. We are grateful for the TPU compute support provided by the Google TRC program.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
  • Azar et al. (2024) Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. 2024. A general theoretical paradigm to understand learning from human preferences. In International Conference on Artificial Intelligence and Statistics, pages 4447–4455. PMLR.
  • Bradley and Terry (1952) Ralph Allan Bradley and Milton E Terry. 1952. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345.
  • Chen et al. (2024) Angelica Chen, Sadhika Malladi, Lily H Zhang, Xinyi Chen, Qiuyi Zhang, Rajesh Ranganath, and Kyunghyun Cho. 2024. Preference learning algorithms do not learn preference rankings. arXiv preprint arXiv:2405.19534.
  • Chiang et al. (2024) Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. 2024. Chatbot arena: An open platform for evaluating llms by human preference. arXiv preprint arXiv:2403.04132.
  • Cui et al. (2023) Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. 2023. Ultrafeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377.
  • Ding et al. (2023) Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. 2023. Enhancing chat language models by scaling high-quality instructional conversations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3029–3051, Singapore. Association for Computational Linguistics.
  • Dubois et al. (2024a) Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. 2024a. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475.
  • Dubois et al. (2024b) Yann Dubois, Chen Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy S Liang, and Tatsunori B Hashimoto. 2024b. Alpacafarm: A simulation framework for methods that learn from human feedback. Advances in Neural Information Processing Systems, 36.
  • Edunov et al. (2018) Sergey Edunov, Myle Ott, Michael Auli, David Grangier, and Marc’Aurelio Ranzato. 2018. Classical structured prediction losses for sequence to sequence learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 355–364, New Orleans, Louisiana. Association for Computational Linguistics.
  • Eysenbach and Levine (2021) Benjamin Eysenbach and Sergey Levine. 2021. Maximum entropy RL (provably) solves some robust RL problems. In International Conference on Learning Representations.
  • Gorbatovski et al. (2024) Alexey Gorbatovski, Boris Shaposhnikov, Alexey Malakhov, Nikita Surnachev, Yaroslav Aksenov, Ian Maksimov, Nikita Balagansky, and Daniil Gavrilov. 2024. Learn your reference model for real good alignment. arXiv preprint arXiv:2404.09656.
  • Hong et al. (2024) Jiwoo Hong, Noah Lee, and James Thorne. 2024. Reference-free monolithic preference optimization with odds ratio. arXiv preprint arXiv:2403.07691.
  • Ivison et al. (2023) Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A. Smith, Iz Beltagy, and Hannaneh Hajishirzi. 2023. Camels in a changing climate: Enhancing lm adaptation with tulu 2. Preprint, arXiv:2311.10702.
  • Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.
  • Li et al. (2023) Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval.
  • Liu et al. (2023) Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-eval: NLG evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore. Association for Computational Linguistics.
  • Liu et al. (2022) Yixin Liu, Pengfei Liu, Dragomir Radev, and Graham Neubig. 2022. BRIO: Bringing order to abstractive summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2890–2903, Dublin, Ireland. Association for Computational Linguistics.
  • Meng et al. (2024) Yu Meng, Mengzhou Xia, and Danqi Chen. 2024. SimPO: Simple preference optimization with a reference-free reward. arXiv preprint arXiv:2405.14734.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pages 27730–27744. Curran Associates, Inc.
  • Panickssery et al. (2024) Arjun Panickssery, Samuel R Bowman, and Shi Feng. 2024. LLM evaluators recognize and favor their own generations. arXiv preprint arXiv:2404.13076.
  • Rafailov et al. (2024) Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea Finn. 2024. From $r$ to $Q^{*}$: Your language model is secretly a Q-function. arXiv preprint arXiv:2404.12358.
  • Rafailov et al. (2023) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • Stiennon et al. (2020) Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback. In Advances in Neural Information Processing Systems, volume 33, pages 3008–3021. Curran Associates, Inc.
  • Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  • Tunstall et al. (2023) Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, et al. 2023. Zephyr: Direct distillation of LM alignment. arXiv preprint arXiv:2310.16944.
  • Wang et al. (2023) Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. 2023. Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926.
  • Xu et al. (2024) Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, and Young Jin Kim. 2024. Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation. arXiv preprint arXiv:2401.08417.
  • Xu et al. (2023) Jing Xu, Andrew Lee, Sainbayar Sukhbaatar, and Jason Weston. 2023. Some things are more cringe than others: Preference optimization with the pairwise cringe loss. arXiv preprint arXiv:2312.16682.
  • Yuan et al. (2023) Hongyi Yuan, Zheng Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. 2023. RRHF: Rank responses to align language models with human feedback. In Thirty-seventh Conference on Neural Information Processing Systems.
  • Zhao et al. (2023a) Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J Liu. 2023a. SLiC-HF: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425.
  • Zhao et al. (2023b) Yao Zhao, Mikhail Khalman, Rishabh Joshi, Shashi Narayan, Mohammad Saleh, and Peter J Liu. 2023b. Calibrating sequence likelihood improves conditional language generation. In The Eleventh International Conference on Learning Representations.
  • Ziebart et al. (2008) Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. 2008. Maximum entropy inverse reinforcement learning. In Proc. AAAI, pages 1433–1438.

Appendix A Additional Experimental Details

A.1 Datasets Details

Here we outline additional details regarding the datasets we used (§3). Following the setup of Zephyr (Tunstall et al., 2023), we use the UltraFeedback (Cui et al., 2023) dataset, specifically its binarized version (available at https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized), for DPO fine-tuning. The original UltraFeedback contains 64K data examples, each consisting of one user instruction and 4 candidate outputs generated by different LLMs, which are scored by GPT-4. Tunstall et al. (2023) binarized UltraFeedback to adapt the data format for DPO fine-tuning by constructing a positive-negative example pair out of the 4 candidate outputs. Specifically, the output with the highest GPT-4 score is selected as the positive example $y_{w}$ used in DPO fine-tuning (Eq. 5), while the negative example $y_{l}$ is randomly sampled from the remaining 3 candidate outputs. We note that since UltraFeedback contains LLM-generated, GPT-4-graded outputs, instruction fine-tuning on UltraFeedback is a distillation setting from more capable LLMs rather than learning directly from human feedback. UltraFeedback is released under the MIT license.
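
The binarization step described above can be sketched as follows. This is a minimal illustration and not the code used by Tunstall et al. (2023): the function name `binarize`, the field names (`instruction`, `completions`, `text`, `score`), and the tie-breaking rule are assumptions made here for the example.

```python
import random


def binarize(example, rng):
    """Build one (chosen, rejected) pair from a scored example.

    `example` is assumed to look like
        {"instruction": str, "completions": [{"text": str, "score": float}, ...]}
    These field names are hypothetical and chosen only for illustration.
    """
    completions = example["completions"]
    # Positive example y_w: the candidate with the highest score.
    # Ties go to the earliest candidate (an assumption of this sketch).
    best_idx = max(range(len(completions)), key=lambda i: completions[i]["score"])
    # Negative example y_l: sampled uniformly from the remaining candidates.
    rest = [c for i, c in enumerate(completions) if i != best_idx]
    rejected = rng.choice(rest)
    return {
        "prompt": example["instruction"],
        "chosen": completions[best_idx]["text"],
        "rejected": rejected["text"],
    }


example = {
    "instruction": "Name a prime number.",
    "completions": [
        {"text": "4", "score": 2.0},
        {"text": "7", "score": 9.5},
        {"text": "9", "score": 3.0},
        {"text": "1", "score": 4.0},
    ],
}
pair = binarize(example, random.Random(0))
```

Because the negative example is sampled at random, the score gap between the chosen and rejected outputs varies across pairs.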

A.2 Additional Training Details

Here we discuss additional training details in §3. We use a linear learning rate scheduler during model training, with 10% of the steps allocated for learning rate warmup and the remainder for linearly decreasing the rate. We perform checkpoint selection based on the model validation loss on the UltraFeedback validation set, with a checkpointing interval of 500 steps. We conduct model training on 8 NVIDIA RTX 6000 Ada cards with 40GB memory each. The 3-epoch training takes around 9 hours to finish.
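
The learning rate schedule described above can be sketched as a plain function of the training step. This is a minimal sketch under stated assumptions: the function name `linear_schedule_lr` is hypothetical, and the peak learning rate and total step count are left as parameters because they are not specified in this appendix.

```python
def linear_schedule_lr(step, total_steps, peak_lr, warmup_ratio=0.1):
    """Learning rate at `step` (0-indexed): linear warmup, then linear decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: ramp linearly from peak_lr down to 0 at total_steps.
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


# Example with 1000 steps and a unit peak rate (illustrative values only).
schedule = [linear_schedule_lr(s, 1000, 1.0) for s in (0, 50, 100, 550, 1000)]
```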

Appendix B Extending DPO: Posterior Probability as Reward Function

Following the analytical framework in Rafailov et al. (2023), we now prove the equivalence of Eq. 10 and Eq. 13 as previously discussed in §5.1.

B.1 Optimum of RL with Maximum Entropy

We will first derive the optimal policy, $p^{*}$, for the RL with Maximum Entropy setting (Eq. 13):

$$
\max_{p}\mathbb{E}_{x\sim D,y\sim p(y|x)}[r(x,y)]+\beta\mathrm{H}[p(y|x)].
\tag{20}
$$

Given any reward function $r$, we have

$$
\begin{aligned}
&\max_{p}\mathbb{E}_{x\sim D,y\sim p(y|x)}[r(x,y)]+\beta\mathrm{H}[p(y|x)]\\
=&\max_{p}\mathbb{E}_{x\sim D,y\sim p(y|x)}\left[r(x,y)-\beta\log p(y|x)\right]\\
=&\min_{p}\mathbb{E}_{x\sim D,y\sim p(y|x)}\left[\log p(y|x)-\frac{1}{\beta}r(x,y)\right]\\
=&\min_{p}\mathbb{E}_{x\sim D,y\sim p(y|x)}\left[\log\frac{p(y|x)}{\frac{1}{Z(x)}\exp(\frac{1}{\beta}r(x,y))}-\log Z(x)\right],
\end{aligned}
\tag{21}
$$

where $Z(x)$ is the partition function:

$$
Z(x)=\sum_{y}\exp\left(\frac{1}{\beta}r(x,y)\right).
\tag{22}
$$

As $Z(x)$ is only a function of $x$ and does not depend on $p$, we can define the following probability distribution:

$$
p^{*}(y|x)=\frac{1}{Z(x)}\exp\left(\frac{1}{\beta}r(x,y)\right).
\tag{23}
$$

Since $Z(x)$ is not a function of $y$, we can re-organize the final objective of Eq. 21 as:

$$
\begin{aligned}
&\min_{p}\mathbb{E}_{x\sim D}\left[\mathbb{E}_{y\sim p(y|x)}\left[\log\frac{p(y|x)}{p^{*}(y|x)}\right]-\log Z(x)\right]\\
=&\min_{p}\mathbb{E}_{x\sim D}\left[\mathbb{D}_{\mathrm{KL}}[p(y|x)\,\|\,p^{*}(y|x)]-\log Z(x)\right]
\end{aligned}
\tag{24}
$$

Since $Z(x)$ does not depend on $p(y|x)$, the final objective of Eq. 24 is equivalent to

$$
\min_{p}\mathbb{E}_{x\sim D}\left[\mathbb{D}_{\mathrm{KL}}[p(y|x)\,\|\,p^{*}(y|x)]\right]
\tag{25}
$$

By Gibbs’ inequality, the KL-divergence achieves the minimum of 0 if and only if the two distributions are identical. Therefore, we have the optimal solution of Eq. 25 (and Eq. 13) as

$$
p(y|x)=p^{*}(y|x)=\frac{1}{Z(x)}\exp\left(\frac{1}{\beta}r(x,y)\right).
\tag{26}
$$
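
The closed-form optimum in Eq. 26 can be checked numerically on a small finite output set. The following is a minimal sketch, assuming a single input $x$ with four candidate outputs and made-up reward values; the function names are hypothetical. It compares the objective of Eq. 20 at $p^{*}$ against randomly drawn policies, and also against $\beta\log Z(x)$, which is the value obtained from Eq. 24 when the KL term is zero.

```python
import math
import random


def softmax_policy(rewards, beta):
    """Eq. 26 on a finite output set: p*(y|x) = exp(r(x,y)/beta) / Z(x)."""
    m = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp((r - m) / beta) for r in rewards]
    z = sum(weights)
    return [w / z for w in weights]


def maxent_objective(policy, rewards, beta):
    """E_{y~p}[r(x,y)] + beta * H[p(.|x)] for a single input x (Eq. 20)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    entropy = -sum(p * math.log(p) for p in policy if p > 0)
    return expected_reward + beta * entropy


rewards = [1.0, 0.2, -0.5, 2.0]  # toy rewards for four candidate outputs
beta = 0.5
p_star = softmax_policy(rewards, beta)
best = maxent_objective(p_star, rewards, beta)
log_z_value = beta * math.log(sum(math.exp(r / beta) for r in rewards))

# Objective values of random policies on the same support, for comparison.
rng = random.Random(0)
others = []
for _ in range(1000):
    raw = [rng.random() + 1e-9 for _ in rewards]
    total = sum(raw)
    others.append(maxent_objective([v / total for v in raw], rewards, beta))
```

A random search of this kind is only a sanity check; the derivation above is what establishes optimality.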

B.2 Extending DPO Objective for RL with Maximum Entropy

We can now express the reward function $r(x,y)$ under the RL with maximum entropy setting using the optimal policy $p^{*}(y|x)$ by rearranging Eq. 26:

$$
r^{*}(x,y)=\beta\log p^{*}(y|x)+\beta\log Z(x)
\tag{27}
$$

Therefore, we know that the optimal policy $p^{*}$ under the BT model (Eq. 3) satisfies the following preference model:

$$
\begin{aligned}
&p^{*}(y_{1}\succ y_{2}|x)\\
=&\frac{\exp(\beta\log p^{*}(y_{1}|x)+\beta\log Z(x))}{\exp(\beta\log p^{*}(y_{1}|x)+\beta\log Z(x))+\exp(\beta\log p^{*}(y_{2}|x)+\beta\log Z(x))}\\
=&\frac{1}{1+\exp(\beta\log p^{*}(y_{2}|x)-\beta\log p^{*}(y_{1}|x))}\\
=&\sigma(\beta\log p^{*}(y_{1}|x)-\beta\log p^{*}(y_{2}|x)).
\end{aligned}
\tag{28}
$$
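
The cancellation of $Z(x)$ in Eq. 28 can be verified numerically. The sketch below assumes a toy finite output set with made-up rewards; the helper names are hypothetical. It computes the BT preference probability directly from the rewards and compares it with the sigmoid of the scaled log-probability difference under $p^{*}$.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bt_preference(r1, r2):
    """Bradley-Terry probability that output 1 is preferred, given rewards."""
    return math.exp(r1) / (math.exp(r1) + math.exp(r2))


rewards = [1.0, 0.2, -0.5, 2.0]  # toy r(x, y) for one fixed input x
beta = 0.5
z = sum(math.exp(r / beta) for r in rewards)            # partition function Z(x)
log_p_star = [r / beta - math.log(z) for r in rewards]  # log p*(y|x) from Eq. 26

# Left-hand side: BT model applied to the rewards themselves.
lhs = bt_preference(rewards[0], rewards[1])
# Right-hand side: final line of Eq. 28, which needs only log p*.
rhs = sigmoid(beta * log_p_star[0] - beta * log_p_star[1])
```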

We can now derive the maximum likelihood estimation objective based on Eq. 28 for the model $p_{\theta}$:

$$
\min_{p_{\theta}}-\mathbb{E}_{(x,y_{w},y_{l})\sim D}\left[\log\sigma(\beta\log p_{\theta}(y_{w}|x)-\beta\log p_{\theta}(y_{l}|x))\right]
\tag{29}
$$

This is exactly the objective function we defined in §5.1 that uses the posterior probability as the reward function (Eq. 10).
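
As an illustration of Eq. 29, the loss can be computed from sequence log-probabilities alone, without any reference-model term. This is a minimal sketch, not the training code used in our experiments: the function names are hypothetical, the log-probabilities are made-up numbers, and a real implementation would operate on batched tensors with automatic differentiation.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def posterior_reward_loss(logp_w, logp_l, beta):
    """Batch mean of -log sigmoid(beta * (log p(y_w|x) - log p(y_l|x))) (Eq. 29)."""
    losses = [-log_sigmoid(beta * (w - l)) for w, l in zip(logp_w, logp_l)]
    return sum(losses) / len(losses)


# Toy sequence log-probabilities under p_theta (illustrative values only).
logp_w = [-12.0, -30.5, -8.0]
logp_l = [-15.0, -29.0, -8.0]
loss = posterior_reward_loss(logp_w, logp_l, beta=0.1)
```

When the two log-probabilities are equal the per-example loss is $\log 2$, and it decreases as the preferred output becomes more likely than the dispreferred one.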

B.3 Further Theoretical Analysis

We continue to follow the analytical framework of DPO to discuss the theoretical properties of the training objective defined in Eq. 29 (and Eq. 10). Specifically, as we discussed in §5.1, the training objective of Eq. 29 is equivalent to a BT model with the reparameterization:

$$
r_{\theta}(x,y)=\beta\log p_{\theta}(y|x).
\tag{30}
$$

Following the proof framework in Rafailov et al. (2023), we will now prove:

All reward classes under the BT model can be represented with the reparameterization $r(x,y)=\beta\log p(y|x)$ for some model $p(y|x)$.

Preparation

Our proof uses the definition of the equivalence class of reward functions given by Rafailov et al. (2023):

"Two reward functions $r(x,y)$ and $r^{\prime}(x,y)$ are equivalent iff $r(x,y)-r^{\prime}(x,y)=f(x)$ for some function $f$."

We also note a lemma given by Rafailov et al. (2023).

Under the Bradley-Terry preference framework, two reward functions from the same class induce the same preference distribution.

We now provide another lemma as an extension of a similar lemma in Rafailov et al. (2023):

Two reward functions from the same equivalence class induce the same optimal policy under the RL with Maximum Entropy setting.

Proof. Given two reward functions from the same class, such that $r^{\prime}(x,y)=r(x,y)+f(x)$, from Eq. 26 we know that $r^{\prime}$ induces an optimal policy $p_{r^{\prime}}$:

$$
\begin{aligned}
p_{r^{\prime}}(y|x)&=\frac{1}{\sum_{y}\exp(\frac{1}{\beta}r^{\prime}(x,y))}\exp\left(\frac{1}{\beta}r^{\prime}(x,y)\right)\\
&=\frac{1}{\sum_{y}\exp(\frac{1}{\beta}(r(x,y)+f(x)))}\exp\left(\frac{1}{\beta}(r(x,y)+f(x))\right)\\
&=\frac{1}{\exp(\frac{1}{\beta}f(x))\sum_{y}\exp(\frac{1}{\beta}r(x,y))}\exp\left(\frac{1}{\beta}r(x,y)\right)\exp\left(\frac{1}{\beta}f(x)\right)\\
&=\frac{1}{\sum_{y}\exp(\frac{1}{\beta}r(x,y))}\exp\left(\frac{1}{\beta}r(x,y)\right)\\
&=p_{r}(y|x).
\end{aligned}
\tag{31}
$$

This completes the proof of the lemma. As Rafailov et al. (2023) noted, the lemma indicates that all reward functions within a given equivalence class induce the same optimal policy.
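
The invariance shown in Eq. 31 is easy to confirm numerically. The sketch below assumes a toy finite output set for one fixed input $x$; the reward values, the shift value, and the function name are illustrative assumptions.

```python
import math


def optimal_policy(rewards, beta):
    """Optimal maximum-entropy policy of Eq. 26 over a finite output set."""
    z = sum(math.exp(r / beta) for r in rewards)
    return [math.exp(r / beta) / z for r in rewards]


beta = 0.5
rewards = [1.0, 0.2, -0.5, 2.0]         # r(x, y) for one fixed input x
shift = 3.7                             # f(x): the same for every output y
shifted = [r + shift for r in rewards]  # r'(x, y) = r(x, y) + f(x)

p_r = optimal_policy(rewards, beta)
p_r_shifted = optimal_policy(shifted, beta)
# Largest elementwise difference between the two induced policies.
max_gap = max(abs(a - b) for a, b in zip(p_r, p_r_shifted))
```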

Proof

For any reward function $r(x,y)$, from Appendix B.1, we know it induces the following optimal policy according to Eq. 26:

$$
p_{r}(y|x)=\frac{1}{Z(x)}\exp\left(\frac{1}{\beta}r(x,y)\right).
\tag{32}
$$

Therefore, we have

$$
r(x,y)=\beta\log p_{r}(y|x)+\beta\log Z(x),
\tag{33}
$$

where $Z(x)=\sum_{y}\exp(\frac{1}{\beta}r(x,y))$. We now define a projection $f$:

$$
f(r;\beta)(x,y)=r(x,y)-\beta\log\sum_{y}\exp\left(\frac{1}{\beta}r(x,y)\right),
\tag{34}
$$

which projects $r$ into a new reward function:

$$
\begin{aligned}
r^{\prime}(x,y)&=f(r;\beta)(x,y)\\
&=r(x,y)-\beta\log\sum_{y}\exp\left(\frac{1}{\beta}r(x,y)\right)\\
&=\beta\log p_{r}(y|x).
\end{aligned}
\tag{35}
$$

We can observe that $r^{\prime}$ is within the equivalence class of $r$, since the term subtracted by the projection $f$ depends only on $x$. Moreover, by Eq. 35, $r^{\prime}$ has the form $\beta\log p(y|x)$ with $p=p_{r}$. Our proof is therefore complete.
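
The projection in Eq. 34 can also be checked on a toy example. The sketch below assumes a finite output set for one input $x$ with made-up rewards; the function name `project` is hypothetical. It confirms two properties used in the proof: the projected reward differs from the original by a quantity that is constant across outputs, and exponentiating the projected reward divided by $\beta$ yields a normalized distribution.

```python
import math


def project(rewards, beta):
    """Eq. 34 on a finite output set: r'(x,y) = r(x,y) - beta * log sum_y exp(r(x,y)/beta)."""
    log_z = math.log(sum(math.exp(r / beta) for r in rewards))
    return [r - beta * log_z for r in rewards]


beta = 0.5
rewards = [1.0, 0.2, -0.5, 2.0]  # toy r(x, y) for one fixed input x
projected = project(rewards, beta)

# If r' = beta * log p for a distribution p, then exp(r'/beta) must sum to 1.
implied_p = [math.exp(rp / beta) for rp in projected]
total_mass = sum(implied_p)
# The offset r - r' should be identical for every y, i.e. depend on x only.
offsets = [r - rp for r, rp in zip(rewards, projected)]
```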

Appendix C Output Examples from Different Training Objectives

We show a specific output example for each of the training objectives investigated in §5. Compared with the output generated from the DPO objective in Table 6, the posterior probability objective (Eq. 10) produces longer outputs with more complex structures that appear overly verbose in Table 7. Conversely, the likelihood function objective (Eq. 18) results in less detailed outputs, as shown in Table 8.

Instruction
I like to host guests at my home from time to time, and I am gathering recipes of different dishes and drinks to keep things interesting. I am interested in trying some Somali dishes. Can you give me a recipe for Canjeero?
Output
Certainly! Here’s a traditional recipe for Canjeero (also known as Cannoli or Xaaj maanta in Somali), a popular Somali flatbread made with teff flour or a mix of teff and wheat flour.

Ingredients:
- 2 cups teff flour (you can find this at specialty stores or online)
- 1 cup wheat flour (or more teff flour if you prefer a stronger teff flavor)
- 2 tsp salt
- 1 package or 2 1/4 tsp (7g) active dry yeast
- 3 cups warm water (around 110°F)

Instructions:

1. In a large mixing bowl, combine the teff flour, wheat flour, and salt. Mix well.

2. In a separate bowl, dissolve the yeast in the warm water. Let it sit for 5 minutes until it becomes frothy.

3. Pour the yeast mixture into the flour mixture. Mix well until a smooth, sticky dough forms.

4. Let the dough rest in a warm place for 1-2 hours until it doubles in size.

5. When the dough has risen, divide it into six equal portions. Shape each portion into a ball.

6. Preheat a flat griddle or a non-stick pan over medium heat.

7. Take one ball of the dough and flatten it into a circle, about 1/4 inch thick.

8. Place the flattened dough on the hot griddle or pan. Cook for 2-3 minutes until the bottom is golden brown. Flip the Canjeero and cook for another 1-2 minutes until the other side is also golden brown.

9. Remove the Canjeero from the griddle or pan and keep it warm and moist by covering it with a clean towel or aluminum foil.

10. Repeat steps 7-9 with the remaining dough portions.

Canjeero is often served with a stew or soup. Enjoy!
Table 6: Output example with the DPO objective (Eq. 5).
Instruction
I like to host guests at my home from time to time, and I am gathering recipes of different dishes and drinks to keep things interesting. I am interested in trying some Somali dishes. Can you give me a recipe for Canjeero?
Output
Here’s a traditional and delicious Somali recipe for Canjeero (also known as Canjelo or Canjif) - a popular flatbread that’s often enjoyed for breakfast or as a snack. This dish is versatile and can be customized with various toppings to suit your preferences. Here’s how to make it:

Ingredients:
1. 2 cups all-purpose or bread flour
2. 1/2 cup fine semolina (sooji or rava)
3. 1 tsp salt
4. 1 1/2 cups water (or as needed)
5. 2 tbsp vegetable oil (or ghee/clarified butter)
6. Toppings of your choice:
    a. Yogurt
    b. Fresh or dried fruits (bananas, mangoes, dates, raisins)
    c. Honey or sugar
    d. Cardamom powder
    e. Peanuts or other nuts
    f. Qaat (a popular Somali leaf) - optional and not commonly found outside of Somalia

Instructions:

1. In a large mixing bowl, combine the all-purpose or bread flour, semolina, and salt. Mix well to combine.

2. Gradually pour in the water while stirring with a wooden or plastic spoon. Be cautious while adding water, as the dough can become too sticky if excess water is added. Add water until the dough comes together as a smooth ball.

3. Knead the dough for 3-4 minutes until it becomes soft and elastic. Cover the bowl with a damp cloth or plastic wrap and let the dough rest for 10-15 minutes. This allows the gluten to develop, resulting in a fluffier bread.

4. Heat a non-stick or cast-iron skillet on medium-high heat. Add the vegetable oil (or ghee/clarified butter) and let it heat up.

5. Divide the dough into equal-sized balls (usually about 6-8 pieces). Flatten each ball into a disc using your palms. Make sure the disc is not too thin, as this will result in a crispier bread.

6. Once the skillet is hot, carefully place the flattened discs into the skillet. Do not overcrowd the skillet; cook the bread in batches if necessary. Cook the bread for 1-2 minutes on each side, or until golden brown spots appear. Press down lightly on the bread with a spatula to fluff it up.

7. Remove the Canjeero from the skillet and place it on a plate. Repeat the process with the remaining dough balls.

8. Serve the hot Canjeero with your choice of toppings. Here are some popular combinations:
    a. Yogurt and honey or sugar
    b. Fresh fruit and cardamom powder
    c. Qaat and peanuts
    d. Peanut butter and banana
    e. Honey and sliced almonds
    f. Cinnamon and raisins

Note: Canjeero is a versatile dish that can be enjoyed any time of day. It’s often served with tea or coffee as a snack or as part of a larger meal. You can also experiment with different types of flour, such as whole-wheat or spelt, to create unique flavors.

Enjoy your homemade Canjeero! This dish is a staple in Somali cuisine and is a great way to explore the rich flavors and traditions of this vibrant culture. If you have any questions or need further assistance, please don’t hesitate to ask. Happy cooking!
Table 7: Output example with the posterior probability objective (Eq. 10).
Instruction
I like to host guests at my home from time to time, and I am gathering recipes of different dishes and drinks to keep things interesting. I am interested in trying some Somali dishes. Can you give me a recipe for Canjeero?
Output
Certainly! Here’s a simple recipe for Canjeero (Somali flatbread) that serves 4-6 people:

Ingredients:
- 2 cups (240 grams) of Somali flour (also known as “soor” or “sorghum flour”)
- 2 cups (480 ml) of water
- 1/2 teaspoon of salt

Instructions:

1. In a large mixing bowl, combine the somali flour and salt.

2. Gradually add the water to the flour mixture while stirring with a wooden spoon or spatula until the mixture is smooth, free of lumps, and has a thick consistency.

3. Cover the bowl with a clean kitchen towel or plastic wrap and let it rest at room temperature for at least 1 hour (or up to 24 hours) to allow the flour to hydrate and the gluten to develop.

4. When ready to make the canjeero, preheat a non-stick pan or griddle over medium heat.

5. Using a measuring cup or ladle, pour about 1/3 to 1/2 cup of batter into the center of the pan.

6. Immediately, use the back of the ladle or a spatula to gently spread the batter into a thin, round shape (about 6-8 inches in diameter).

7. Let the canjeero cook for 1-2 minutes or until the bottom is golden brown and the top is beginning to set.

8. Flip the canjeero over and let it cook for another 30 seconds to 1 minute on the other side, or until golden brown.

9. Repeat with the remaining batter, using more oil or butter to lubricate the pan as needed.

10. Serve the canjeero warm with a side of ghee (clarified butter) or other Somali stews or sauces like “Wat” or “Xalwo.”

Enjoy your traditional Somali Canjeero!
Table 8: Output example with the likelihood function objective (Eq. 18).