# TRPO - Trust-Region Policy Optimization

In [[Policy Gradient]] methods, it is important to be careful with updates: a bad policy leads to bad data. This is different from supervised learning, where the learning and the data are independent. One solution is to regularise the policy so that it does not change too much, which prevents instability. A popular method is to limit the difference between subsequent policies, for instance with the Kullback-Leibler divergence:

$$
\mathrm{KL}\left(\pi_{\mathrm{old}} \,\|\, \pi_{\theta}\right)=\mathbb{E}\left[\int \pi_{\mathrm{old}}(a \mid S) \log \frac{\pi_{\mathrm{old}}(a \mid S)}{\pi_{\theta}(a \mid S)} \,\mathrm{d}a\right]
$$

Then maximise $J(\theta)-\eta\, \mathrm{KL}\left(\pi_{\text{old}} \| \pi_{\theta}\right)$ for some small $\eta$. TRPO is guaranteed to give monotonic improvements. It can also help to use large batches.

Regular policy gradient objective:

$$
L^{PG}(\theta)=\hat{\mathbb{E}}_{t}\left[\log \pi_{\theta}\left(a_{t} \mid s_{t}\right) \hat{A}_{t}\right]
$$

TRPO objective:

$$
\begin{array}{l}
\underset{\theta}{\operatorname{maximize}} \; \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}},\, a \sim q}\left[\dfrac{\pi_{\theta}(a \mid s)}{q(a \mid s)} Q_{\theta_{\text{old}}}(s, a)\right] \\
\text{subject to } \; \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}}\left[D_{\mathrm{KL}}\left(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s)\right)\right] \leq \delta
\end{array}
$$

- Add a constraint so the new policy does not deviate too far from the current policy and stays in the "trust region".
- It is hard to choose a penalty coefficient that works across different tasks, so the KL term is implemented as a hard constraint instead.

---

## References

1. Schulman et al., Trust Region Policy Optimization. https://arxiv.org/pdf/1502.05477.pdf
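
---

A minimal PyTorch sketch of the two ideas above: the importance-sampled surrogate objective $\mathbb{E}[\frac{\pi_\theta}{\pi_{\text{old}}} \hat{A}]$ and a KL term used as a penalty ($J - \eta\,\mathrm{KL}$), with the trust-region budget $\delta$ only checked afterwards. This is not the full TRPO update (no conjugate gradient, no line search); the network sizes, batch, and the values of `eta` and `delta` are illustrative assumptions.

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2          # assumed dimensions
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))

def surrogate_and_kl(obs, actions, advantages, old_logits):
    """Return the surrogate E[ratio * A] and the mean KL(pi_old || pi_theta)."""
    dist = torch.distributions.Categorical(logits=policy(obs))
    old_dist = torch.distributions.Categorical(logits=old_logits)

    # Importance ratio pi_theta(a|s) / pi_old(a|s); actions were sampled from pi_old.
    ratio = torch.exp(dist.log_prob(actions) - old_dist.log_prob(actions))
    surrogate = (ratio * advantages).mean()

    # KL(pi_old || pi_theta), averaged over the sampled states.
    kl = torch.distributions.kl_divergence(old_dist, dist).mean()
    return surrogate, kl

eta, delta = 0.01, 0.01            # assumed penalty coefficient and trust-region radius
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

obs = torch.randn(32, obs_dim)                 # placeholder batch of states
actions = torch.randint(n_actions, (32,))      # actions sampled under pi_old
advantages = torch.randn(32)                   # placeholder advantage (or Q) estimates
with torch.no_grad():
    old_logits = policy(obs)                   # snapshot of pi_old before the update

# Penalised version: maximise J(theta) - eta * KL, i.e. minimise the negative.
surrogate, kl = surrogate_and_kl(obs, actions, advantages, old_logits)
loss = -(surrogate - eta * kl)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Trust-region check; TRPO proper enforces KL <= delta as a hard constraint.
with torch.no_grad():
    _, kl_after = surrogate_and_kl(obs, actions, advantages, old_logits)
    print(f"KL after update: {kl_after.item():.5f} (budget delta = {delta})")
```

In the actual TRPO algorithm the constraint is enforced rather than penalised: the step direction is computed with conjugate gradients using Fisher-vector products, and a backtracking line search scales the step so the sampled KL stays below $\delta$ while the surrogate improves.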