# Direct Preference Optimization (DPO)

- The [[RLHF - Reinforcement Learning with Human Feedback]] setup for aligning LLMs is very cumbersome:
    - it requires training multiple copies of the LLM for the reward and value models, and
    - it requires sampling from the LM policy inside the training loop (expensive!).
- RL is generally a "last resort" for when the reward is completely black-box or non-differentiable, but pairwise preference optimization is not black-box at all: it can be framed as a differentiable binary decision, e.g. by assuming the [[Bradley-Terry Model]].
- [DPO](https://arxiv.org/abs/2305.18290) thus proposes a simple binary [[Cross entropy]] loss for fine-tuning directly on a preference dataset, avoiding the need to fit a reward model and perform RL.
- Results are pretty good: DPO matches or exceeds [[PPO - Proximal Policy Optimization]]-based RLHF on alignment tasks.

## The DPO Objective

DPO follows the [[Bradley-Terry Model]]'s assumption that the sigmoid of the reward difference predicts the pairwise outcome:

$$
p^*\left(y_1 \succ y_2 \mid x\right)=\frac{\exp \left(r^*\left(x, y_1\right)\right)}{\exp \left(r^*\left(x, y_1\right)\right)+\exp \left(r^*\left(x, y_2\right)\right)} .
$$

We can then use [[Maximum Likelihood Estimation]] directly to estimate the parameters of a reward model $r_\phi$:

$$
\mathcal{L}_R\left(r_\phi, \mathcal{D}\right)=-\mathbb{E}_{\left(x, y_w, y_l\right) \sim \mathcal{D}}\left[\log \sigma\left(r_\phi\left(x, y_w\right)-r_\phi\left(x, y_l\right)\right)\right]
$$

RL objectives for language models conventionally keep the policy from diverging too far from a "reference" policy by penalizing a [[KL Divergence]] term (an idea introduced in [[TRPO - Trust-Region Policy Optimization]]). The KL-constrained reward-maximization objective has a closed-form optimal policy, which lets the reward be rewritten as $\beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ plus a prompt-only term that cancels in the reward difference, so the BT model can incorporate the constraint directly:

$$
p^*\left(y_1 \succ y_2 \mid x\right)=\frac{1}{1+\exp \left(\beta \log \frac{\pi^*\left(y_2 \mid x\right)}{\pi_{\mathrm{ref}}\left(y_2 \mid x\right)}-\beta \log \frac{\pi^*\left(y_1 \mid x\right)}{\pi_{\mathrm{ref}}\left(y_1 \mid x\right)}\right)}
$$

where larger deviations from the reference policy are discouraged through the log-ratios $\log \frac{\pi^*}{\pi_{\mathrm{ref}}}$, reminiscent of importance weights (see [[Importance Sampling]]). The corresponding maximum-likelihood loss over the policy is:

$$
\mathcal{L}_{\mathrm{DPO}}\left(\pi_\theta ; \pi_{\mathrm{ref}}\right)=-\mathbb{E}_{\left(x, y_w, y_l\right) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta\left(y_w \mid x\right)}{\pi_{\mathrm{ref}}\left(y_w \mid x\right)}-\beta \log \frac{\pi_\theta\left(y_l \mid x\right)}{\pi_{\mathrm{ref}}\left(y_l \mid x\right)}\right)\right]
$$

Note that $\beta$ is a hyperparameter controlling the strength of the "KL penalty".

## The Training Pipeline

The general DPO pipeline is:

1. Create a preference dataset: sample completions $y_1, y_2 \sim \pi_{\text{ref}}(\cdot \mid x)$ for every prompt $x$ and label them with human preferences to construct the offline preference dataset $\mathcal{D}=\left\{\left(x^{(i)}, y_w^{(i)}, y_l^{(i)}\right)\right\}_{i=1}^N$.
2. To help mitigate issues from [[Distribution Shift]], first run supervised fine-tuning (SFT), i.e. maximize the likelihood of the preferred completions via next-token prediction.
3. Then minimize the DPO loss with respect to $\pi_\theta$, keeping $\pi_{\mathrm{ref}}$ frozen (a code sketch follows this list). The default setting for $\beta$ is 0.1.
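
To make step 3 concrete, here is a minimal PyTorch sketch of $\mathcal{L}_{\mathrm{DPO}}$. It assumes you have already computed the summed per-sequence log-probabilities of each chosen/rejected completion under the trainable policy $\pi_\theta$ and the frozen reference $\pi_{\mathrm{ref}}$; the function and argument names are illustrative, not taken from the paper's code.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Binary cross-entropy form of the DPO loss.

    Each argument is a (batch,)-shaped tensor holding the summed
    log-probability log pi(y | x) of the full completion under the
    respective model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log pi_theta(y_w|x) / pi_ref(y_w|x)
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi_theta(y_l|x) / pi_ref(y_l|x)
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log sigma(logits), computed in a numerically stable way
    return -F.logsigmoid(logits).mean()
```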
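
The per-sequence log-probabilities fed into the sketch above can be gathered roughly as follows, assuming a HuggingFace-style causal LM whose forward pass returns `.logits`; the helper and its mask convention are illustrative, not part of the original recipe.

```python
import torch

def sequence_logprob(model, input_ids, completion_mask):
    """Summed log-probability of the completion tokens under `model`.

    input_ids:       (batch, seq_len) prompt + completion token ids.
    completion_mask: (batch, seq_len) 1 for completion tokens, 0 for prompt/padding.
    """
    logits = model(input_ids).logits[:, :-1, :]   # position t predicts token t+1
    targets = input_ids[:, 1:]
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (batch, seq_len-1)
    return (token_logps * completion_mask[:, 1:]).sum(dim=-1)          # log pi(y|x) per example
```

Since $\pi_{\mathrm{ref}}$ stays frozen during DPO training, its log-probabilities would be computed under `torch.no_grad()`, while the policy's log-probabilities keep gradients for the optimizer step.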