# Direct Preference Optimization (DPO)
- The [[RLHF - Reinforcement Learning with Human Feedback]] setup for aligning LLMs is very cumbersome:
- requires training multiple copies of the LLM for reward and value models
- requires sampling from the LM policy in the training loop (expensive!)
- RL is generally a "last resort" used when the reward is completely black-box or non-differentiable, but pairwise preference optimization is not black-box at all and can be cast as a differentiable binary classification problem, e.g. by assuming a [[Bradley-Terry Model]].
- [DPO](https://arxiv.org/abs/2305.18290) therefore proposes a simple binary [[Cross entropy]] loss for fine-tuning directly on a preference dataset, avoiding the need to perform RL with a separate reward model.
- Results are strong: DPO matches or exceeds [[PPO - Proximal Policy Optimization]]-based RLHF on alignment tasks.
## The DPO Objective
DPO starts from the [[Bradley-Terry Model]]'s assumption that the probability of a pairwise preference is the sigmoid of the reward difference:
$
p^*\left(y_1 \succ y_2 \mid x\right)=\frac{\exp \left(r^*\left(x, y_1\right)\right)}{\exp \left(r^*\left(x, y_1\right)\right)+\exp \left(r^*\left(x, y_2\right)\right)} .
$
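A tiny numerical sketch of this assumption (the reward values below are made up for illustration): since $\frac{\exp(r_1)}{\exp(r_1)+\exp(r_2)} = \sigma(r_1 - r_2)$, the pairwise probability is just a sigmoid of the reward difference.
```python
import math

def bt_preference_prob(r_y1: float, r_y2: float) -> float:
    """P(y1 preferred over y2) under Bradley-Terry: a sigmoid of the reward difference."""
    return 1.0 / (1.0 + math.exp(-(r_y1 - r_y2)))

# Hypothetical rewards for two completions of the same prompt.
print(bt_preference_prob(2.0, 0.5))  # ~0.82: y1 is usually preferred
print(bt_preference_prob(1.0, 1.0))  # 0.5: equal rewards -> coin flip
```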
We can then use [[Maximum Likelihood Estimation]] straightforwardly to estimate the parameters of this reward model $r_\phi$:
$
\mathcal{L}_R\left(r_\phi, \mathcal{D}\right)=-\mathbb{E}_{\left(x, y_w, y_l\right) \sim \mathcal{D}}\left[\log \sigma\left(r_\phi\left(x, y_w\right)-r_\phi\left(x, y_l\right)\right)\right]
$
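A minimal PyTorch sketch of this reward-modeling loss, assuming scalar rewards $r_\phi(x, y_w)$ and $r_\phi(x, y_l)$ have already been computed for a batch of preference pairs (all names below are illustrative):
```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Negative Bradley-Terry log-likelihood: -log sigma(r(x, y_w) - r(x, y_l)), batch-averaged."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Hypothetical batch of scalar rewards produced by a reward model r_phi.
chosen = torch.tensor([1.2, 0.3, 2.1])    # r_phi(x, y_w)
rejected = torch.tensor([0.4, 0.9, 1.5])  # r_phi(x, y_l)
print(reward_model_loss(chosen, rejected))  # smaller when chosen rewards exceed rejected ones
```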
Now follow the convention of RL objectives that explicitly prevent the policy from diverging too far from a "reference" policy by penalizing a [[KL Divergence]] term (introduced in [[TRPO - Trust-Region Policy Optimization]]). The optimal policy of this KL-constrained reward-maximization problem has the closed form $\pi^*(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x) \exp\left(\tfrac{1}{\beta} r^*(x, y)\right)$, so the reward can be rewritten as $r^*(x, y)=\beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}+\beta \log Z(x)$; substituting this into the BT model cancels the partition function $Z(x)$ and expresses the preference probability purely in terms of policies:
$
p^*\left(y_1 \succ y_2 \mid x\right)=\frac{1}{1+\exp \left(\beta \log \frac{\pi^*\left(y_2 \mid x\right)}{\pi_{\mathrm{ref}}\left(y_2 \mid x\right)}-\beta \log \frac{\pi^*\left(y_1 \mid x\right)}{\pi_{\mathrm{ref}}\left(y_1 \mid x\right)}\right)}
$
where larger deviations from the reference policy are discouraged via the log importance weights (see [[Importance Sampling]]). Correspondingly, the loss function is given as:
$
\mathcal{L}_{\mathrm{DPO}}\left(\pi_\theta ; \pi_{\mathrm{ref}}\right)=-\mathbb{E}_{\left(x, y_w, y_l\right) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta\left(y_w \mid x\right)}{\pi_{\mathrm{ref}}\left(y_w \mid x\right)}-\beta \log \frac{\pi_\theta\left(y_l \mid x\right)}{\pi_{\mathrm{ref}}\left(y_l \mid x\right)}\right)\right]
$
Note that $\beta$ is a hyperparameter controlling the strength of the implicit "KL penalty".
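A minimal PyTorch sketch of this objective, assuming the per-sequence log-probabilities $\log \pi(y \mid x)$ (summed over completion tokens only) have already been computed for the policy and the frozen reference model; function and variable names are illustrative, not from the paper's codebase.
```python
import torch
import torch.nn.functional as F

def sequence_logprob(logits: torch.Tensor, labels: torch.Tensor, completion_mask: torch.Tensor) -> torch.Tensor:
    """Sum of per-token log-probs of `labels`, masked to the completion tokens.
    Assumes logits[t] is already aligned to predict labels[t] (i.e. shifted)."""
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = torch.gather(logps, dim=-1, index=labels.unsqueeze(-1)).squeeze(-1)
    return (token_logps * completion_mask).sum(dim=-1)

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), computed under torch.no_grad()
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,
) -> torch.Tensor:
    """Binary cross-entropy on the implicit reward margin between chosen and rejected completions."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```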
## The Training Pipeline
The general DPO pipeline is given as:
1. Create preference dataset: Sample completions $y_1, y_2 \sim \pi_{\mathrm{ref}}(\cdot \mid x)$ for every prompt $x$, label with human preferences to construct the offline dataset of preferences $\mathcal{D}=\left\{\left(x^{(i)}, y_w^{(i)}, y_l^{(i)}\right)\right\}_{i=1}^N$
2. To help mitigate issues from [[Distribution Shift]], first maximize the likelihood of preferred completions by next-token prediction (SFT).
3. Then minimize the DPO loss against $\pi_{\mathrm{ref}}$; the default setting for $\beta$ is 0.1 (see the end-to-end sketch below).
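A minimal end-to-end sketch of steps 2–3 using a toy causal LM and random token ids standing in for a tokenized preference dataset (step 1 is assumed already done); everything here is illustrative rather than the paper's implementation.
```python
import copy
import torch
import torch.nn.functional as F

VOCAB, DIM = 100, 32

class TinyLM(torch.nn.Module):
    """Stand-in for an LLM policy: embed tokens, predict the next token."""
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(VOCAB, DIM)
        self.head = torch.nn.Linear(DIM, VOCAB)
    def forward(self, ids):
        return self.head(self.emb(ids))  # (batch, seq, vocab) logits

def seq_logprob(model, ids):
    """Sum of next-token log-probs of each sequence under the model."""
    logps = torch.log_softmax(model(ids[:, :-1]), dim=-1)
    return torch.gather(logps, -1, ids[:, 1:].unsqueeze(-1)).squeeze(-1).sum(-1)

# Toy "preference dataset": (prompt + chosen) and (prompt + rejected) token ids.
chosen_ids = torch.randint(0, VOCAB, (8, 16))
rejected_ids = torch.randint(0, VOCAB, (8, 16))

policy = TinyLM()

# Step 2: SFT -- maximize likelihood of preferred completions via next-token prediction.
sft_opt = torch.optim.AdamW(policy.parameters(), lr=1e-3)
for _ in range(10):
    sft_opt.zero_grad()
    (-seq_logprob(policy, chosen_ids).mean()).backward()
    sft_opt.step()

# Step 3: freeze a copy of the SFT model as pi_ref, then minimize the DPO loss.
ref = copy.deepcopy(policy).eval()
beta = 0.1
dpo_opt = torch.optim.AdamW(policy.parameters(), lr=1e-4)
for _ in range(10):
    dpo_opt.zero_grad()
    with torch.no_grad():
        ref_w, ref_l = seq_logprob(ref, chosen_ids), seq_logprob(ref, rejected_ids)
    pol_w, pol_l = seq_logprob(policy, chosen_ids), seq_logprob(policy, rejected_ids)
    loss = -F.logsigmoid(beta * ((pol_w - ref_w) - (pol_l - ref_l))).mean()
    loss.backward()
    dpo_opt.step()
```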