# Direct Preference Optimization (DPO)
- The [[RLHF - Reinforcement Learning with Human Feedback]] setup for aligning LLMs is very cumbersome:
- requires training multiple copies of the LLM for reward and value models
- requires sampling from the LM policy in the training loop (expensive!)
- RL is generally a "last resort" for when the reward is completely black-box or non-differentiable, but pairwise preference optimization is neither: it can be cast as a differentiable binary classification problem, e.g. by assuming the [[Bradley-Terry Model]].
- [DPO](https://arxiv.org/abs/2305.18290) thus proposes a simple binary [[Cross entropy]] loss for fine-tuning directly on a preference dataset, avoiding the need to fit a separate reward model and run RL against it.
- Results are pretty good: DPO matches or exceeds [[PPO - Proximal Policy Optimization]] based RLHF on alignment tasks.
- Theoretically bulletproof too: maximum likelihood estimation of the Bradley-Terry model recovers the same optimal policy as KL-constrained maximization of a reward function learnt from the same pairwise preferences.
## The DPO Objective
DPO follows the [[Bradley-Terry Model]]'s assumption of sigmoid of reward difference being the predictor of pairwise outcome:
$
p^*\left(y_1 \succ y_2 \mid x\right)=\frac{\exp \left(r^*\left(x, y_1\right)\right)}{\exp \left(r^*\left(x, y_1\right)\right)+\exp \left(r^*\left(x, y_2\right)\right)} .
$
We can then straightforwardly use [[Maximum Likelihood Estimation]] to fit the parameters $\phi$ of a reward model $r_\phi$:
$
\mathcal{L}_R\left(r_\phi, \mathcal{D}\right)=-\mathbb{E}_{\left(x, y_w, y_l\right) \sim \mathcal{D}}\left[\log \sigma\left(r_\phi\left(x, y_w\right)-r_\phi\left(x, y_l\right)\right)\right]
$
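As a concrete reference, here is a minimal PyTorch sketch of this reward-modelling loss, assuming the reward scores for the preferred and dispreferred completions have already been computed (function and argument names are mine, not from the paper):

```python
import torch
import torch.nn.functional as F

def bt_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the Bradley-Terry model.

    r_chosen / r_rejected: reward-model scores r_phi(x, y_w) and r_phi(x, y_l),
    each of shape (batch,).
    """
    # -log sigma(r_w - r_l), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```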
Now, RL objectives conventionally prevent the policy from diverging too far from a "reference" policy by penalizing a [[KL Divergence]] term (introduced in [[TRPO - Trust-Region Policy Optimization]]). The KL-constrained reward maximization problem has a closed-form optimal policy, which lets the reward be rewritten as $r(x, y)=\beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}+\beta \log Z(x)$. Substituting this into the BT model (the partition function $Z(x)$ cancels in the difference) gives:
$
p^*\left(y_1 \succ y_2 \mid x\right)=\frac{1}{1+\exp \left(\beta \log \frac{\pi^*\left(y_2 \mid x\right)}{\pi_{\mathrm{ref}}\left(y_2 \mid x\right)}-\beta \log \frac{\pi^*\left(y_1 \mid x\right)}{\pi_{\mathrm{ref}}\left(y_1 \mid x\right)}\right)}
$
where larger deviations from the reference policy are discouraged via the policy/reference log-ratios (reminiscent of importance weights, see [[Importance Sampling]]). Correspondingly, the loss function is given as:
$
\mathcal{L}_{\mathrm{DPO}}\left(\pi_\theta ; \pi_{\mathrm{ref}}\right)=-\mathbb{E}_{\left(x, y_w, y_l\right) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta\left(y_w \mid x\right)}{\pi_{\mathrm{ref}}\left(y_w \mid x\right)}-\beta \log \frac{\pi_\theta\left(y_l \mid x\right)}{\pi_{\mathrm{ref}}\left(y_l \mid x\right)}\right)\right]
$
Note that $\beta$ is a hyperparameter controlling the strength of the "KL penalty".
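A minimal PyTorch sketch of this loss, assuming per-sequence log-probabilities under the policy and the frozen reference model are already available (names are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_logp_chosen: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
    policy_logp_rejected: torch.Tensor,  # log pi_theta(y_l | x)
    ref_logp_chosen: torch.Tensor,       # log pi_ref(y_w | x)
    ref_logp_rejected: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,
) -> torch.Tensor:
    # Implicit reward of each completion: beta * log(pi_theta / pi_ref)
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    # -log sigma(difference of implicit rewards), averaged over the batch
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

The quantity $\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ plays the role of an implicit reward, so no separate reward model is ever materialized.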
## The Training Pipeline
The general DPO pipeline is given as:
1. Create preference dataset: Sample completions $y_1, y_2 \sim \pi_{\text {ref }}(\cdot \mid x)$ for every prompt $x$, label with human preferences to construct the offline dataset of preferences $\mathcal{D}=\left\{x^{(i)}, y_w^{(i)}, y_l^{(i)}\right\}_{i=1}^N$
2. To help mitigate issues from [[Distribution Shift]], first maximize the likelihood of preferred completions by next-token prediction (SFT).
3. Then minimize the DPO loss (default setting for $\beta$ is 0.1); a sketch of the per-sequence log-probs this requires follows below.
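For step 3, the per-sequence log-probs fed into the DPO loss are just sums of per-token log-probs over the completion tokens. A rough sketch, assuming the causal LM's logits have already been shifted to align with the labels (helper name and mask convention are assumptions):

```python
import torch

def sequence_logprob(logits: torch.Tensor, labels: torch.Tensor,
                     completion_mask: torch.Tensor) -> torch.Tensor:
    """Sum of token log-probs over the completion tokens of each sequence.

    logits: (batch, seq_len, vocab), shifted so logits[:, t] predicts labels[:, t]
    labels: (batch, seq_len) token ids of prompt + completion
    completion_mask: (batch, seq_len), 1.0 on completion tokens, 0.0 on prompt/padding
    """
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = torch.gather(logp, dim=-1, index=labels.unsqueeze(-1)).squeeze(-1)
    return (token_logp * completion_mask).sum(dim=-1)
```

Run this once with the trainable policy and once with the frozen reference model (under `torch.no_grad()`), for both the preferred and dispreferred completions, then feed the four resulting tensors into the DPO loss above.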
## DPO vs BT Reward Model
The [IPO](https://arxiv.org/abs/2310.12036) paper does a deeper analysis of the behavior of DPO. They show that:
- The more deterministic the preferences are, the weaker the effective KL-regularization becomes, to the point that the value of $\beta$ used becomes irrelevant.
- In empirical settings, this leads to substantial overfitting, and setting a higher $\beta$ value does nothing to prevent it.
A learned reward model (under BT), on the other hand, rarely produces fully deterministic preferences due to regularization (under-fitting), and thus the resulting policy does not veer too far from the reference policy.
However, with the BT model only the reward difference matters, which creates two properties:
- Rewards are unbounded: To push p(a > b) closer to 1, the model just needs to make the difference larger and there is no ceiling on the individual reward values.
- Rewards are shift-invariant: the absolute values are arbitrary, since adding the same constant to all rewards doesn't affect the probabilities.
This unboundedness causes:
- Reward Hacking
- Model finds inputs that produce extreme reward values even if they don't reflect true quality.
- Can add a penalty term to the BT loss that penalizes the squared difference, i.e. $\mathcal{L}=-\log \sigma\left(r_A-r_B\right)+\lambda\left(r_A-r_B\right)^2$ (equivalent to a zero-centered Gaussian prior on the reward difference); see the sketch after this list.
- IPO bakes in regularization directly into the objective.
- Instability
- Unbounded rewards cause large gradients and unstable training.
- Clamp the difference to a range like $[-10, 10]$ before applying the sigmoid. Simple, but can create flat gradients at the boundaries.
- Compute advantages that don't depend on the magnitude of the reward, just the ranking among samples.
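Both the squared-difference penalty and the clamping fix above are one-line changes to the BT loss. A hedged sketch, with $\lambda$ and the clamp bound chosen arbitrarily for illustration:

```python
import torch
import torch.nn.functional as F

def bt_loss_with_penalty(r_chosen, r_rejected, lam: float = 0.01) -> torch.Tensor:
    # Squared-difference penalty: keeps the reward gap (and hence the rewards
    # the model is pushed towards) from growing without bound.
    diff = r_chosen - r_rejected
    return (-F.logsigmoid(diff) + lam * diff.pow(2)).mean()

def bt_loss_clamped(r_chosen, r_rejected, bound: float = 10.0) -> torch.Tensor:
    # Clamping the difference bounds the gradients, at the cost of a flat
    # (zero-gradient) region once the difference hits the boundary.
    diff = torch.clamp(r_chosen - r_rejected, -bound, bound)
    return -F.logsigmoid(diff).mean()
```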
## IPO
The [IPO](https://arxiv.org/abs/2310.12036) objective ensures the regularization towards the reference policy is always maintained, thus avoiding over-fitting to the preference dataset.
- Drop BT's sigmoid-of-difference assumption; just learn to separate winners from losers by a margin.
- $\mathcal{L}_{I P O}=\left(\log \frac{\pi\left(y_w \mid x\right)}{\pi_{r e f}\left(y_w \mid x\right)}-\log \frac{\pi\left(y_l \mid x\right)}{\pi_{r e f}\left(y_l \mid x\right)}-\frac{1}{2 \beta}\right)^2$
- Make the difference equal to a target margin $\frac{1}{2 \beta}$, then stop. Since there's no sigmoid asymptote to chase, the objective naturally saturates once the margin is achieved.
- But it loses the probabilistic interpretation (a rough sketch of the objective follows below).
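A minimal PyTorch sketch of the IPO objective, using the same per-sequence log-prob inputs as the DPO sketch above (names are illustrative):

```python
import torch

def ipo_loss(
    policy_logp_chosen: torch.Tensor,
    policy_logp_rejected: torch.Tensor,
    ref_logp_chosen: torch.Tensor,
    ref_logp_rejected: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    # Log-ratio gap between the preferred and dispreferred completions
    h = (policy_logp_chosen - ref_logp_chosen) - (policy_logp_rejected - ref_logp_rejected)
    # Plain squared regression onto the fixed target margin 1/(2*beta):
    # no sigmoid asymptote to chase, so the loss saturates once the margin is hit.
    return (h - 1.0 / (2.0 * beta)).pow(2).mean()
```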
## IPO/DPO vs RLHF
IPO/DPO are offline: they learn from a fixed preference dataset. RLHF is online: it generates new samples, scores them, and updates. So exploration beyond the original dataset is a big win for RLHF, *if* there is something coherent to explore towards.
IPO and DPO might be better when:
- Simple pipeline is preferred
- Less compute; it's just straightforward supervised learning
- Preference data is high quality and has good coverage