# PPO - Proximal Policy Optimization

Paper: https://arxiv.org/pdf/1707.06347.pdf
Code: https://github.com/higgsfield/RL-Adventure-2/blob/master/3.ppo.ipynb

RL suffers from high sensitivity to hyperparameters and noisy estimates that may lead the agent to learn a bad policy from which it may never recover. PPO is a popular, stable deep RL algorithm. It improves upon [[TRPO - Trust-Region Policy Optimization]].

PPO benefits:

- Simpler to implement
- Sample efficient (empirically)
- Easier to tune

Policy gradient methods are typically less sample efficient than Q-learning methods because they learn online and do not use a replay buffer to store past experiences.

The regular policy gradient loss is given as

$$
L^{PG}(\theta)=\hat{\mathbb{E}}_{t}\left[\log \pi_{\theta}\left(a_{t} \mid s_{t}\right) \hat{A}_{t}\right]
$$

The advantage function tries to estimate the relative value of the selected action in the selected state. It requires:

1. Discounted rewards - weighted sum of all the rewards $\sum_{k=0}^{\infty} \gamma^{k} r_{t+k}$
2. Baseline estimate (value function) - estimate of the discounted return from this point onward
	- state -> value function neural net -> return estimate

- Advantage estimate = discounted rewards - baseline estimate (a small sketch of this computation appears after this section)
- Positive advantage (better than average returns) -> gradient is positive, increase these action probabilities
- Negative advantage -> gradient is negative -> reduce the likelihood of those actions

The problem with this regular objective is that if you keep running gradient descent on a single batch, the advantage estimate eventually goes wrong and you end up destroying your policy.

A constrained optimization solution was provided by [[TRPO - Trust-Region Policy Optimization]] with "trust regions" - don't move too far from the old policy. This is enforced by a [[KL Divergence]] constraint. The policy gradient objective then becomes:

$$
\underset{\theta}{\operatorname{maximize}}\; \hat{\mathbb{E}}_{t}\left[\frac{\pi_{\theta}\left(a_{t} \mid s_{t}\right)}{\pi_{\theta_{\text{old}}}\left(a_{t} \mid s_{t}\right)} \hat{A}_{t}-\beta \operatorname{KL}\left[\pi_{\theta_{\text{old}}}\left(\cdot \mid s_{t}\right), \pi_{\theta}\left(\cdot \mid s_{t}\right)\right]\right]
$$

This constraint adds overhead and sometimes leads to undesirable training behaviour.

PPO cleverly modifies the TRPO objective. First, the TRPO surrogate loss can be written as

$$
L^{CPI}(\theta)=\hat{\mathbb{E}}_{t}\left[\frac{\pi_{\theta}\left(a_{t} \mid s_{t}\right)}{\pi_{\theta_{\text{old}}}\left(a_{t} \mid s_{t}\right)} \hat{A}_{t}\right]=\hat{\mathbb{E}}_{t}\left[r_{t}(\theta) \hat{A}_{t}\right]
$$

which leads to the PPO objective:

$$
L^{CLIP}(\theta)= \hat{\mathbb{E}}_{t}\left[\min \left(r_{t}(\theta) \hat{A}_{t}, \operatorname{clip}\left(r_{t}(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_{t}\right)\right]
$$

- $\theta$ is the policy parameter
- $\hat{\mathbb{E}}_{t}$ denotes the empirical expectation over timesteps
- $r_{t}$ is the ratio of the probability under the new and old policies, respectively
- $\hat{A}_{t}$ is the estimated advantage at time $t$
- $\epsilon$ is a hyperparameter, usually 0.1 or 0.2

This effectively takes the minimum of the normal policy gradient objective and the clipped version of the policy gradient objective.
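As a concrete illustration of the advantage estimate described above, here is a minimal NumPy sketch that computes discounted returns and subtracts the baseline value estimate. The `rewards` and `values` arrays are hypothetical per-timestep quantities from a finished episode; the PPO paper itself uses a truncated GAE estimator, so this is a simplification.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute the discounted return sum_k gamma^k * r_{t+k} for every timestep t."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical rollout data: per-timestep rewards and the critic's value estimates V(s_t).
rewards = np.array([1.0, 0.0, 0.0, 1.0])
values = np.array([0.9, 0.8, 0.7, 0.6])

returns = discounted_returns(rewards)
advantages = returns - values  # advantage estimate = discounted rewards - baseline estimate
print(advantages)
```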
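A minimal PyTorch sketch of the clipped surrogate objective above. The tensor names (`log_probs`, `old_log_probs`, `advantages`) are assumptions standing in for quantities collected during a rollout; the objective is negated because optimizers minimize.

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped surrogate objective L^CLIP, returned as a loss (negated for minimization)."""
    ratio = torch.exp(log_probs - old_log_probs)           # r_t(theta) = pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()           # maximize L^CLIP -> minimize its negative

# Hypothetical batch: log-probabilities of the taken actions under the new and old policies.
log_probs = torch.tensor([-0.9, -1.2, -0.4], requires_grad=True)
old_log_probs = torch.tensor([-1.0, -1.0, -0.5])
advantages = torch.tensor([0.5, -0.3, 1.2])

loss = ppo_clip_loss(log_probs, old_log_probs, advantages)
loss.backward()
```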
Things to understand:

- The motivation is that the advantage estimate is noisy, so we don't want to destroy our policy based on a single estimate.
- The advantage can be negative or positive, so the min behaves differently in each case, as shown in the diagram.

![[ppo-min.jpg]]

- When the action was good, i.e. A > 0, the return is flattened out so as not to overdo the gradient update.
- When the action was bad, i.e. A < 0, we don't keep reducing its likelihood too much.
- The difference with TRPO is that PPO shows there is no need for a KL divergence constraint; its simpler objective function is enough.

![[ppo-algo.jpg]]

The final loss function that is (approximately) maximized each iteration is:

$$
L_{t}^{CLIP+VF+S}(\theta)=\hat{\mathbb{E}}_{t}\left[L_{t}^{CLIP}(\theta)-c_{1} L_{t}^{VF}(\theta)+c_{2} S\left[\pi_{\theta}\right]\left(s_{t}\right)\right]
$$

The second term updates the baseline (value function) network, and the last term is an entropy bonus to encourage exploration. A small sketch of this combined loss appears after the references.

---

## References

1. Arxiv Insights video: https://www.youtube.com/watch?v=5P7I-xPq8u8
2. OpenAI's PPO blog post: https://openai.com/blog/openai-baselines-ppo/
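A minimal PyTorch sketch of how the combined objective $L^{CLIP+VF+S}$ above is typically assembled. The coefficients `c1`, `c2` and all tensor inputs are illustrative placeholders rather than the paper's exact setup, and the squared-error value loss is one common choice for $L^{VF}$.

```python
import torch

def ppo_total_loss(ratio, advantages, values, returns, entropy, eps=0.2, c1=0.5, c2=0.01):
    """Combined PPO loss: negation of L^CLIP - c1*L^VF + c2*S, so it can be minimized."""
    clip_term = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages).mean()
    value_loss = (returns - values).pow(2).mean()   # L^VF: squared error against discounted returns
    return -(clip_term - c1 * value_loss + c2 * entropy)

# Hypothetical batch quantities produced during one PPO update step.
ratio = torch.tensor([1.05, 0.92, 1.20])   # pi_theta / pi_theta_old for the taken actions
advantages = torch.tensor([0.4, -0.2, 0.9])
values = torch.tensor([0.8, 0.6, 0.7])     # critic outputs V(s_t)
returns = torch.tensor([1.0, 0.5, 0.9])    # empirical discounted returns
entropy = torch.tensor(1.3)                # mean policy entropy over the batch

print(ppo_total_loss(ratio, advantages, values, returns, entropy))
```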