# RLHF - Reinforcement Learning from Human Feedback

RLHF is a method of using a scalar reward value as a fine-tuning signal. Since the reward can take any form, for example a human ranking or the final score in a game, it is usually not differentiable and thus can't be used directly as a loss function. Therefore, we use [[Reinforcement Learning]], specifically a stable [[Policy Gradient]] method such as [[PPO - Proximal Policy Optimization]], since reinforcement learning can be notoriously unstable.

RLHF is a variation on the idea of using [[Inverse Reinforcement Learning]] to learn a reward model from human preferences/rankings and then, as in [[Apprenticeship Learning]], tuning the original model against the learned reward model.

The term RLHF is generally used in the context of [[Transformers]] language models such as [[BERT]]-like models or GPTs.

The general steps for RLHF fine-tuning are as follows:

1. Fine-tune a pretrained model using supervised learning. This is our supervised fine-tuned model (SFT).
2. Initialize a new *reward model* (RM) from SFT with a new regression head (linear layer). This model takes in the input to the SFT model plus its output, and predicts a scalar reward. Fine-tune this model on a reward function/dataset.
3. Use the RM to score outputs sampled from SFT, and use a stable policy gradient algorithm such as [[PPO - Proximal Policy Optimization]] or [[TRPO - Trust-Region Policy Optimization]] to update the parameters of SFT.

![[Learning to summarize from human feedback.png]]

## Training the reward model

The reward model (RM) is simply a copy of the fine-tuned model (SFT), but with a new linear layer added on top to predict the reward. The input to the reward model is the original input plus the output generated by the SFT model (see the code sketches below).

### Modeling the reward

- **Reward dataset** - If we have access to scalar rewards as a dataset, we can simply train the reward model as a regression model to predict these rewards with an MSE loss.
- **Pairwise dataset** - If we have a dataset in which each input has two outputs and a human preference between them (similar to the dataset used in paper [1]), we can formulate the loss as the negative log-sigmoid of the difference between the rewards the model assigns to the two outputs.
- **Reward function** - If we can use a deterministic function, for example a negative reward whenever the output contains profanity, we can compute the ground-truth reward on the fly and train the reward model with an MSE loss.

## Using reinforcement learning for fine-tuning

[[Reinforcement Learning]], and especially [[Policy Gradient]] methods, are inherently noisy, especially early in training. To deal with this instability, a couple of techniques are useful:

### Reward normalization

Normalize the rewards, e.g. by subtracting the ground-truth reward from the rewards predicted by the trained reward model.

### KL divergence

A common problem with policy gradient methods is that we do not want the policy to change drastically on each update; we want it to stay inside the "trust region", as specified in the [[TRPO - Trust-Region Policy Optimization]] paper. To do this, we can add a KL term to the reward as a regularizer, which here keeps the RL policy close to the original SFT policy. The exact formulation, as stated in [2] and implemented by [trlx](https://github.com/CarperAI/trlx), is:

$$
R(x, y) = r(x, y) - \beta \log \left[ \frac{\pi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \right]
$$
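## Code sketches

As a concrete illustration of step 2 and the pairwise loss above, here is a minimal PyTorch sketch of a reward model (a transformer backbone with a new linear regression head) trained with the negative log-sigmoid pairwise loss. The `backbone` interface, tensor shapes, and function names are illustrative assumptions, not the exact trlx implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """Reward model: a copy of the SFT transformer plus a scalar regression head.

    `backbone` is assumed to map token ids of shape (batch, seq_len) to hidden
    states of shape (batch, seq_len, hidden_size); this interface is an
    illustrative assumption.
    """

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone
        self.reward_head = nn.Linear(hidden_size, 1)  # the new linear layer

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids)              # (batch, seq_len, hidden_size)
        last_hidden = hidden[:, -1, :]                 # read the reward off the final token
        return self.reward_head(last_hidden).squeeze(-1)  # (batch,) scalar rewards


def pairwise_reward_loss(rm: RewardModel,
                         preferred_ids: torch.Tensor,
                         rejected_ids: torch.Tensor) -> torch.Tensor:
    """Negative log-sigmoid of the reward gap between preferred and rejected outputs.

    Both id tensors contain the prompt concatenated with one of the two candidates.
    """
    r_preferred = rm(preferred_ids)
    r_rejected = rm(rejected_ids)
    return -F.logsigmoid(r_preferred - r_rejected).mean()
```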
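Likewise, a minimal sketch of the KL-penalized reward from the formula above, assuming the per-sequence log-probabilities under both policies are already available; the value of `beta` is purely illustrative.

```python
import torch


def kl_penalized_reward(reward: torch.Tensor,
                        logprob_rl: torch.Tensor,
                        logprob_sft: torch.Tensor,
                        beta: float = 0.02) -> torch.Tensor:
    """R(x, y) = r(x, y) - beta * log[pi_RL(y|x) / pi_SFT(y|x)].

    `reward` is the reward model's score for each sampled output; the two
    log-probability tensors are the summed token log-probs of that output under
    the current RL policy and the frozen SFT model. `beta` is an arbitrary
    illustrative coefficient, not a recommended value.
    """
    # log(pi_RL / pi_SFT) = logprob_rl - logprob_sft
    return reward - beta * (logprob_rl - logprob_sft)


# Example with a batch of three sampled outputs:
shaped = kl_penalized_reward(torch.tensor([1.0, 0.5, -0.2]),
                             torch.tensor([-12.3, -8.1, -20.4]),
                             torch.tensor([-11.9, -9.0, -19.7]))
```

Implementations often apply this penalty per token rather than once per sequence; the per-sequence form above is just the most direct reading of the formula.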
---

## References

1. Fine-Tuning Language Models from Human Preferences, Ziegler et al. https://arxiv.org/abs/1909.08593
2. RLHF tutorial: Implementing RLHF - Learning to Summarize with trlX. https://wandb.ai/carperai/summarize_RLHF/reports/Implementing-RLHF-Learning-to-Summarize-with-trlX--VmlldzozMzAwODM2#dataset