# TRPO - Trust-Region Policy Optimization
In [[Policy Gradient]] methods it is important to be careful with updates: a bad policy leads to bad data. This is different from supervised learning, where the learning algorithm and the data are independent.
One solution: regularise the policy so that it does not change too much, which prevents instability. A popular approach is to limit the difference between subsequent policies.
For instance, use the Kullback-Leibler divergence:
$
\mathrm{KL}\left(\pi_{\mathrm{old}} \| \pi_{\theta}\right)=\mathbb{E}\left[\int \pi_{\mathrm{old}}(a \mid S) \log \frac{\pi_{\mathrm{old}}(a \mid S)}{\pi_{\theta}(a \mid S)} \,\mathrm{d}a\right]
$
Then maximise $J(\theta)-\eta\, \mathrm{KL}\left(\pi_{\text{old}} \| \pi_{\theta}\right)$ for some small penalty coefficient $\eta$.
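As a minimal illustration (not from the paper), the KL penalty for a discrete policy could be computed as below; `pi_old`, `pi_new`, and the objective value are toy placeholders:

```python
# Sketch: KL(pi_old || pi_theta) and the penalised objective for a discrete policy.
import numpy as np

def kl_divergence(pi_old, pi_new):
    """KL(pi_old || pi_new) for discrete action distributions at one state."""
    return np.sum(pi_old * np.log(pi_old / pi_new))

def penalised_objective(j_new, pi_old, pi_new, eta=0.01):
    """J(theta) minus a KL penalty that discourages large policy changes."""
    return j_new - eta * kl_divergence(pi_old, pi_new)

# Toy example: the new policy shifts a little probability mass between actions.
pi_old = np.array([0.5, 0.3, 0.2])
pi_new = np.array([0.45, 0.35, 0.2])
print(kl_divergence(pi_old, pi_new))            # small, since the policies are close
print(penalised_objective(1.0, pi_old, pi_new)) # objective barely reduced by the penalty
```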
TRPO is guaranteed (in theory) to give monotonic improvement; in practice it also helps to use large batches.
Regular policy gradient objective:
$
L^{P G}(\theta)=\hat{E}_{t}\left[\log \pi_{\theta}\left(a_{t} \mid s_{t}\right) \hat{A}_{t}\right]
$
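A minimal PyTorch sketch of this surrogate, assuming discrete actions; the tensors below are stand-ins for quantities estimated from rollouts:

```python
# Sketch of the L^PG surrogate: mean of log pi(a_t|s_t) * A_hat_t over a batch.
import torch
from torch.distributions import Categorical

logits = torch.randn(4, 3, requires_grad=True)    # policy outputs for 4 states, 3 actions
actions = torch.tensor([0, 2, 1, 0])              # actions actually taken
advantages = torch.tensor([0.5, -0.2, 1.0, 0.1])  # advantage estimates A_hat_t

dist = Categorical(logits=logits)
log_probs = dist.log_prob(actions)

# Negate so that a gradient-descent optimiser ascends the objective.
loss = -(log_probs * advantages).mean()
loss.backward()
```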
TRPO objective:
$
\begin{array}{l}
\underset{\theta}{\operatorname{maximize}}\; \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}},\, a \sim q}\left[\frac{\pi_{\theta}(a \mid s)}{q(a \mid s)} Q_{\theta_{\text{old}}}(s, a)\right] \\
\text{subject to } \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}}\left[D_{\mathrm{KL}}\left(\pi_{\theta_{\text{old}}}(\cdot \mid s) \| \pi_{\theta}(\cdot \mid s)\right)\right] \leq \delta
\end{array}
$
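A hedged sketch of evaluating this surrogate and its KL constraint for a discrete policy; all arrays below are toy stand-ins, and the sampling distribution $q$ is taken to be $\pi_{\theta_{\text{old}}}$ as in the single-path setting:

```python
# Sketch: importance-sampled surrogate and the mean-KL trust-region constraint.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
probs_old = rng.dirichlet(np.ones(n_actions), size=n_states)  # pi_old(.|s) for sampled states
probs_new = rng.dirichlet(np.ones(n_actions), size=n_states)  # candidate pi_theta(.|s)
actions = rng.integers(n_actions, size=n_states)              # actions sampled from pi_old
q_values = rng.normal(size=n_states)                          # Q_old(s, a) estimates

def surrogate(p_old, p_new, q, a):
    """E[(pi_theta(a|s) / pi_old(a|s)) * Q_old(s, a)], estimated by a batch mean."""
    idx = np.arange(len(a))
    ratio = p_new[idx, a] / p_old[idx, a]
    return np.mean(ratio * q)

def mean_kl(p_old, p_new):
    """Mean over states of KL(pi_old(.|s) || pi_theta(.|s))."""
    return np.mean(np.sum(p_old * np.log(p_old / p_new), axis=1))

delta = 0.01
feasible = mean_kl(probs_old, probs_new) <= delta   # the trust-region constraint
print(surrogate(probs_old, probs_new, q_values, actions), feasible)
```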
- Add a constraint so the new policy does not deviate too much from the current policy and stays in the "trust region".
- It is hard to choose a penalty coefficient $\eta$ that works across different tasks, so TRPO implements this as a hard constraint instead (see the sketch after this list).
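One way to enforce such a hard constraint is a backtracking line search that only accepts an update if the mean KL stays below $\delta$ and the surrogate improves (TRPO does something similar along its natural-gradient direction). The sketch below assumes hypothetical `eval_surrogate` and `eval_kl` callbacks:

```python
# Sketch: shrink the step until the KL constraint holds and the surrogate improves.
import numpy as np

def line_search(theta_old, step, eval_surrogate, eval_kl,
                delta=0.01, backtrack=0.5, max_iters=10):
    f_old = eval_surrogate(theta_old)
    alpha = 1.0
    for _ in range(max_iters):
        theta_new = theta_old + alpha * step
        if eval_kl(theta_old, theta_new) <= delta and eval_surrogate(theta_new) > f_old:
            return theta_new          # accept the largest feasible, improving step
        alpha *= backtrack            # otherwise shrink the step and retry
    return theta_old                  # fall back to the old policy

# Toy usage with quadratic stand-ins for the surrogate and the KL.
theta = np.zeros(2)
direction = np.array([1.0, 0.5])
theta_new = line_search(
    theta, direction,
    eval_surrogate=lambda th: -np.sum((th - 1.0) ** 2),
    eval_kl=lambda a, b: float(np.sum((a - b) ** 2)),
)
```

Rather than tuning a penalty coefficient $\eta$ per task, the step size is simply shrunk until the trust-region condition holds.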
---
## References
1. Trust Region Policy Optimization https://arxiv.org/pdf/1502.05477.pdf