# PPO - Proximal Policy Optimization
Paper: https://arxiv.org/pdf/1707.06347.pdf
Code: https://github.com/higgsfield/RL-Adventure-2/blob/master/3.ppo.ipynb
RL suffers from high sensitivity to hyperparameters and from noisy estimates, which may lead the agent to learn a bad policy from which it may never recover.
PPO is a popular stable deep RL algorithm. Improves upon [[TRPO - Trust-Region Policy Optimization]].
PPO benefits:
- Simpler to implement
- Sample efficient (empirically)
- Ease of tuning
Policy gradient methods are typically less sample efficient than Q-learning methods because they learn online and do not use a replay buffer to store past experiences.
The regular policy gradient loss is given as,
$
L^{P G}(\theta)=\hat{\mathbb{E}}_{t}\left[\log \pi_{\theta}\left(a_{t} \mid s_{t}\right) \hat{A}_{t}\right]
$
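A minimal sketch of this loss in PyTorch (written as a loss to minimize, hence the minus sign; `log_probs` and `advantages` are assumed to be precomputed batch tensors, and the names are illustrative, not from the linked notebook):

```python
import torch

def vanilla_pg_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Negative of E_t[log pi(a_t|s_t) * A_t]; minimizing this ascends the objective."""
    # advantages are treated as fixed targets, so no gradient flows through them
    return -(log_probs * advantages.detach()).mean()
```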
The advantage function tries to estimate the relative value of the selected action in the selected state. It requires:
1. Discounted rewards
- weighted (discounted) sum of all future rewards $\sum_{k=0}^{\infty} \gamma^{k} r_{t+k}$
2. Baseline estimate (value function)
- estimate of the discounted return from this point onward
- state -> value function neural net -> return estimate
- Advantage estimate = discounted rewards - baseline estimate (sketched below)
- Positive advantage (action was better than the average return) -> gradient is positive -> increase these action probabilities
- Negative advantage (action was worse than the average return) -> gradient is negative -> reduce the likelihood of those actions
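A rough sketch of this simple (non-GAE) advantage estimate, assuming per-episode arrays of rewards and value-network predictions (the helper names are made up for illustration):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Discounted reward-to-go sum_k gamma^k * r_{t+k} for every timestep t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantage_estimates(rewards, value_estimates, gamma=0.99):
    """Advantage = discounted rewards - baseline (value function) estimate."""
    return discounted_returns(rewards, gamma) - np.asarray(value_estimates)
```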
The problem with this regular objective is that if you keep running gradient steps on a single batch, the advantage estimates eventually go wrong and you end up destroying your policy.
A constrained optimization solution was provided by [[TRPO - Trust-Region Policy Optimization]] with "trust regions": don't move too far from the old policy. This is enforced by a [[KL Divergence]] constraint (shown below in its penalized form). The policy gradient loss then becomes:
$
\underset{\theta}{\operatorname{maximize}} \hat{\mathbb{E}}_{t}\left[\frac{\pi_{\theta}\left(a_{t} \mid s_{t}\right)}{\pi_{\theta_{\text {old }}}\left(a_{t} \mid s_{t}\right)} \hat{A}_{t}-\beta \operatorname{KL}\left[\pi_{\theta_{\text {old }}}\left(\cdot \mid s_{t}\right), \pi_{\theta}\left(\cdot \mid s_{t}\right)\right]\right]
$
This constraint adds computational overhead and sometimes leads to undesirable training behaviour.
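A sketch of the KL-penalized objective above for a discrete action space in PyTorch (written as a loss to minimize; `dist_old` / `dist_new` are `torch.distributions.Categorical` objects built from the old and current policy outputs, `beta` is the penalty coefficient, and all names are illustrative):

```python
import torch
from torch.distributions import Categorical, kl_divergence

def kl_penalized_loss(dist_new: Categorical, dist_old: Categorical,
                      actions: torch.Tensor, advantages: torch.Tensor,
                      beta: float = 1.0) -> torch.Tensor:
    # probability ratio pi_theta(a|s) / pi_theta_old(a|s), computed in log space
    ratio = torch.exp(dist_new.log_prob(actions) - dist_old.log_prob(actions).detach())
    surrogate = ratio * advantages.detach()
    # KL[pi_old || pi_new] per state; dist_old is assumed to be built from detached logits
    kl = kl_divergence(dist_old, dist_new)
    return -(surrogate - beta * kl).mean()
```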
PPO cleverly modifies the TRPO objective. First, the TRPO surrogate loss (without the KL term) can be written as
$
L^{C P I}(\theta)=\hat{\mathbb{E}}_{t}\left[\frac{\pi_{\theta}\left(a_{t} \mid s_{t}\right)}{\pi_{\theta_{\mathrm{old}}}\left(a_{t} \mid s_{t}\right)} \hat{A}_{t}\right]=\hat{\mathbb{E}}_{t}\left[r_{t}(\theta) \hat{A}_{t}\right]
$
which leads to the PPO objective:
$
L^{C L I P}(\theta)= \hat{\mathbb{E}}_{t}\left[\min \left(r_{t}(\theta) \hat{A}_{t}, \operatorname{clip}\left(r_{t}(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_{t}\right)\right]
$
- $\theta$ is the policy parameter
- $\hat{\mathbb{E}}_{t}$ denotes the empirical expectation over timesteps
- $r_{t}(\theta)$ is the ratio of the action probability under the new policy to that under the old policy
- $\hat{A}_{t}$ is the estimated advantage at time $t$
- $\epsilon$ is a hyperparameter, usually 0.1 or 0.2
This effectively takes the minimum of the normal policy gradient objective and its clipped version.
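A minimal PyTorch sketch of the clipped loss (again negated so that minimizing it maximizes $L^{CLIP}$; the log-prob inputs and the default `eps` are assumptions, not taken from the linked notebook):

```python
import torch

def ppo_clip_loss(new_log_probs: torch.Tensor, old_log_probs: torch.Tensor,
                  advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    # r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s), computed in log space
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    adv = advantages.detach()
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    # take the pessimistic (smaller) surrogate, then negate for gradient descent
    return -torch.min(unclipped, clipped).mean()
```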
Things to understand:
- motivation is that the advantage estimate is noisy, so we don't want to destroy our policy based on a single estimate
- the advantage can be negative or positive, so the min behaves differently depending on its sign, as shown in the diagram (and the numeric check after this list)
![[ppo-min.jpg]]
- when the action was good, i.e. A > 0, the objective flattens out once the ratio exceeds $1+\epsilon$, so we don't overdo the gradient update
- when the action was bad, i.e. A < 0, the objective flattens out once the ratio drops below $1-\epsilon$, so we don't keep reducing its likelihood too much
- the difference with TRPO: PPO shows there is no need for the KL divergence constraint; the simpler clipped objective is enough
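A tiny numeric check of the two flattening cases above, with $\epsilon = 0.2$ and made-up numbers:

```python
def clipped_term(ratio, adv, eps=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single timestep."""
    clipped_ratio = max(1 - eps, min(ratio, 1 + eps))
    return min(ratio * adv, clipped_ratio * adv)

# A > 0, ratio already above 1 + eps: capped at (1 + eps) * A,
# so there is no incentive to push the probability up any further.
print(clipped_term(ratio=1.5, adv=2.0))   # 2.4 (= 1.2 * 2.0), not 3.0

# A < 0, ratio already below 1 - eps: flattens at (1 - eps) * A,
# so the likelihood is not reduced indefinitely.
print(clipped_term(ratio=0.5, adv=-2.0))  # -1.6 (= 0.8 * -2.0), not -1.0

# A < 0 but ratio moved the wrong way (above 1 + eps): the min keeps the
# unclipped, more pessimistic term, so the gradient still pushes it back down.
print(clipped_term(ratio=1.5, adv=-2.0))  # -3.0 (= 1.5 * -2.0)
```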
![[ppo-algo.jpg]]
The final loss function that is (approximately) maximized each iteration is:
$
L_{t}^{C L I P+V F+S}(\theta)=\hat{\mathbb{E}}_{t}\left[L_{t}^{C L I P}(\theta)-c_{1} L_{t}^{V F}(\theta)+c_{2} S\left[\pi_{\theta}\right]\left(s_{t}\right)\right]
$
The second term is the value function loss, which trains the baseline (critic) network.
The last term is an entropy bonus that encourages exploration.
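A sketch of the combined loss in PyTorch, assuming `clip_loss` is the (already negated) clipped surrogate from above, `values` are the value-head outputs, `returns` are the discounted-return targets, and `entropy` is the per-state policy entropy; `c1 = 0.5` and `c2 = 0.01` are commonly used defaults, not values taken from the paper:

```python
import torch

def ppo_total_loss(clip_loss: torch.Tensor, values: torch.Tensor,
                   returns: torch.Tensor, entropy: torch.Tensor,
                   c1: float = 0.5, c2: float = 0.01) -> torch.Tensor:
    # squared-error loss for the value (baseline) network
    value_loss = (returns.detach() - values).pow(2).mean()
    # subtracting the entropy bonus means minimizing the loss encourages exploration
    return clip_loss + c1 * value_loss - c2 * entropy.mean()
```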
---
## References
1. Arxiv Insights video: https://www.youtube.com/watch?v=5P7I-xPq8u8
2. OpenAI's PPO blog post: https://openai.com/blog/openai-baselines-ppo/