# Direct Preference Optimization (DPO)

- The [[RLHF - Reinforcement Learning with Human Feedback]] setup for aligning LLMs is very cumbersome:
    - requires training multiple copies of the LLM for reward and value models
    - requires sampling from the LM policy in the training loop (expensive!)
- RL is generally a "last resort" when the reward is completely black-box or non-differentiable, but pairwise preference optimization is not black-box at all and can be framed as a differentiable binary decision, say by assuming the [[Bradley-Terry Model]].
- [DPO](https://arxiv.org/abs/2305.18290) thus proposes a simple binary [[Cross entropy]] loss for fine-tuning directly on a preference dataset, avoiding the need to perform RL with a reward model.
- Results are pretty good: DPO matches or exceeds [[PPO - Proximal Policy Optimization]]-based RLHF on alignment tasks.
- Theoretically bulletproof too: maximum likelihood estimation of the Bradley-Terry model learns the same policy as maximizing the reward function learnt from pairwise preferences.

## The DPO Objective

DPO follows the [[Bradley-Terry Model]]'s assumption that the sigmoid of the reward difference predicts the pairwise outcome:

$
p^*\left(y_1 \succ y_2 \mid x\right)=\frac{\exp \left(r^*\left(x, y_1\right)\right)}{\exp \left(r^*\left(x, y_1\right)\right)+\exp \left(r^*\left(x, y_2\right)\right)} .
$

We can then use [[Maximum Likelihood Estimation]] straightforwardly to estimate the parameters of this reward model $r$:

$
\mathcal{L}_R\left(r_\phi, \mathcal{D}\right)=-\mathbb{E}_{\left(x, y_w, y_l\right) \sim \mathcal{D}}\left[\log \sigma\left(r_\phi\left(x, y_w\right)-r_\phi\left(x, y_l\right)\right)\right]
$

Now, following the convention of RL objectives that explicitly prevent diverging too far from a "reference" policy by minimizing a [[KL Divergence]] term (introduced in [[TRPO - Trust-Region Policy Optimization]]), the BT model can incorporate this constraint directly:

$
p^*\left(y_1 \succ y_2 \mid x\right)=\frac{1}{1+\exp \left(\beta \log \frac{\pi^*\left(y_2 \mid x\right)}{\pi_{\mathrm{ref}}\left(y_2 \mid x\right)}-\beta \log \frac{\pi^*\left(y_1 \mid x\right)}{\pi_{\mathrm{ref}}\left(y_1 \mid x\right)}\right)}
$

where larger deviations from the reference policy are discouraged using the idea of importance weights (see [[Importance Sampling]]). Correspondingly, the loss function is given as:

$
\mathcal{L}_{\mathrm{DPO}}\left(\pi_\theta ; \pi_{\mathrm{ref}}\right)=-\mathbb{E}_{\left(x, y_w, y_l\right) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta\left(y_w \mid x\right)}{\pi_{\mathrm{ref}}\left(y_w \mid x\right)}-\beta \log \frac{\pi_\theta\left(y_l \mid x\right)}{\pi_{\mathrm{ref}}\left(y_l \mid x\right)}\right)\right]
$

Note that $\beta$ is a hyperparameter controlling the strength of the "KL penalty".

## The Training Pipeline

The general DPO pipeline is:

1. Create a preference dataset: sample completions $y_1, y_2 \sim \pi_{\text{ref}}(\cdot \mid x)$ for every prompt $x$, then label them with human preferences to construct the offline dataset of preferences $\mathcal{D}=\left\{x^{(i)}, y_w^{(i)}, y_l^{(i)}\right\}_{i=1}^N$.
2. To help mitigate issues from [[Distribution Shift]], first maximize the likelihood of preferred completions via next-token prediction (SFT).
3. Then minimize the DPO loss. The default setting for $\beta$ is 0.1.
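A minimal sketch of step 3 in PyTorch, assuming per-completion log-probabilities (token log-probs summed over each completion) have already been computed under the trainable policy and the frozen reference model; the function and argument names below are placeholders, not from the paper:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logps_w, policy_logps_l,
             ref_logps_w, ref_logps_l, beta=0.1):
    """DPO loss for a batch of (preferred, dispreferred) completions.

    Each argument is a (batch,) tensor of summed log-probabilities of a
    completion under the trainable policy or the frozen reference model.
    """
    # Implicit rewards: beta times the log-ratio of policy to reference.
    chosen_reward = beta * (policy_logps_w - ref_logps_w)
    rejected_reward = beta * (policy_logps_l - ref_logps_l)
    # Binary cross-entropy on the reward margin (Bradley-Terry likelihood).
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# e.g. with random stand-in log-probabilities for a batch of 4 preference pairs:
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```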
## DPO vs BT Reward Model

The [IPO](https://arxiv.org/abs/2310.12036) paper does a deeper analysis of the behavior of DPO. They show that:

- The more deterministic the preferences are, the weaker the effective KL-regularization becomes, to the point where the value of $\beta$ used becomes irrelevant.
- In empirical settings, this leads to substantial overfitting, and setting a higher $\beta$ value does nothing.

A learned reward model (under BT), on the other hand, rarely models deterministic preferences due to regularization (under-fitting) and thus does not veer too far from the reference policy. However, with the BT model it's only the reward difference that matters, which creates two properties:

- Rewards are unbounded: to push $p(a \succ b)$ closer to 1, the model just needs to make the difference larger, and there is no ceiling on the individual reward values.
- Rewards are shift-invariant: the absolute scale is arbitrary, as adding a constant doesn't affect the probabilities.

This unboundedness causes:

- Reward hacking
    - The model finds inputs that produce extreme reward values even if they don't reflect true quality.
    - Can add a penalty term to the BT loss that penalizes the squared difference, i.e. $\mathcal{L}=-\log \sigma\left(r_A-r_B\right)+\lambda\left(r_A-r_B\right)^2$ (which is a zero-centered Gaussian prior on rewards).
    - IPO bakes the regularization directly into the objective.
- Instability
    - Unbounded rewards cause large gradients and unstable training.
    - Clamp the difference to some range like $[-10, 10]$ before applying the sigmoid. Simple, but can create flat gradients at the boundaries.
    - Compute advantages that don't care about the magnitude of the reward, just the ranking among samples.

## IPO

The [IPO](https://arxiv.org/abs/2310.12036) objective ensures the regularization towards the reference policy is always maintained and thus avoids over-fitting to the preference dataset.

- Drop BT's sigmoid-of-difference assumption; just learn to separate winners by a margin:
- $\mathcal{L}_{IPO}=\left(\log \frac{\pi\left(y_w \mid x\right)}{\pi_{ref}\left(y_w \mid x\right)}-\log \frac{\pi\left(y_l \mid x\right)}{\pi_{ref}\left(y_l \mid x\right)}-\frac{1}{2 \beta}\right)^2$
- Make the difference equal to a target margin $(1 / 2 \beta)$, then stop. Since there's no sigmoid asymptote to chase, the objective naturally saturates once the margin is achieved (see the sketch at the end of this note).
- But it loses the probabilistic interpretation.

## IPO/DPO vs RLHF

IPO/DPO are offline: they learn from a fixed preference dataset. RLHF is online: it generates new samples, scores them, and updates. So exploration beyond the original dataset is a huge win for RLHF, *if* there is something coherent to explore towards. IPO and DPO might be better when:

- A simple pipeline is preferred
- Less compute is available; it's just straightforward supervised learning
- Preference data is high quality and has good coverage
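To make the contrast with the DPO sketch above concrete, a minimal sketch of the IPO objective on the same (hypothetical) summed log-probability inputs:

```python
import torch

def ipo_loss(policy_logps_w, policy_logps_l,
             ref_logps_w, ref_logps_l, beta=0.1):
    """IPO loss: drive the log-ratio margin to 1/(2*beta), then stop."""
    # Same implicit log-ratio "rewards" as DPO, without the sigmoid/BT assumption.
    margin = (policy_logps_w - ref_logps_w) - (policy_logps_l - ref_logps_l)
    # Squared distance to the target margin; gradients vanish once the target
    # is reached, so deterministic preferences can't push the policy
    # arbitrarily far from the reference model.
    return ((margin - 1.0 / (2.0 * beta)) ** 2).mean()
```

Note that, unlike DPO, pushing the margin past the target actively increases the loss, which is the built-in regularization towards the reference policy discussed above.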