# Natural Policy Gradient

Small changes in the network parameters $\boldsymbol{\theta}$ might result in large changes of the resulting distribution $p_{\boldsymbol{\theta}}(\boldsymbol{y})$. As the gradient estimate typically depends on $p_{\boldsymbol{\theta}}(\boldsymbol{y})$ due to the sampling process, the next gradient estimate can also change dramatically. To achieve stable learning, it is desirable to enforce that the distribution $p_{\boldsymbol{\theta}}(\boldsymbol{y})$ does not change too much in a single update step. This intuition is the key concept behind the natural gradient, which limits the distance between the distribution $p_{\boldsymbol{\theta}}(\boldsymbol{y})$ before and $p_{\boldsymbol{\theta}+\delta \boldsymbol{\theta}}(\boldsymbol{y})$ after the update.

To measure the distance between $p_{\boldsymbol{\theta}}(\boldsymbol{y})$ and $p_{\boldsymbol{\theta}+\delta \boldsymbol{\theta}}(\boldsymbol{y})$, the natural gradient uses an approximation of the [[KL Divergence]] based on the Fisher information matrix

$$
\boldsymbol{F}_{\boldsymbol{\theta}}=\mathbb{E}_{p_{\boldsymbol{\theta}}(\boldsymbol{y})}\left[\nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\boldsymbol{y}) \nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\boldsymbol{y})^{T}\right],
$$

which approximates the KL divergence between $p_{\boldsymbol{\theta}+\delta \boldsymbol{\theta}}(\boldsymbol{y})$ and $p_{\boldsymbol{\theta}}(\boldsymbol{y})$ for sufficiently small $\delta \boldsymbol{\theta}$, i.e.,

$$
\operatorname{KL}\left(p_{\boldsymbol{\theta}+\delta \boldsymbol{\theta}}(\boldsymbol{y}) \,\|\, p_{\boldsymbol{\theta}}(\boldsymbol{y})\right) \approx \delta \boldsymbol{\theta}^{T} \boldsymbol{F}_{\boldsymbol{\theta}}\, \delta \boldsymbol{\theta}.
$$

The natural gradient update $\delta \boldsymbol{\theta}^{\mathrm{NG}}$ is defined as the update $\delta \boldsymbol{\theta}$ that is most similar to the traditional "vanilla" gradient update $\delta \boldsymbol{\theta}^{\mathrm{VG}}$ while keeping a bounded distance

$$
\operatorname{KL}\left(p_{\boldsymbol{\theta}+\delta \boldsymbol{\theta}}(\boldsymbol{y}) \,\|\, p_{\boldsymbol{\theta}}(\boldsymbol{y})\right) \leq \epsilon
$$

in distribution space. Hence, we can formulate the following optimization problem:

$$
\delta \boldsymbol{\theta}^{\mathrm{NG}}=\operatorname{argmax}_{\delta \boldsymbol{\theta}}\; \delta \boldsymbol{\theta}^{T} \delta \boldsymbol{\theta}^{\mathrm{VG}} \quad \text {s.t.} \quad \delta \boldsymbol{\theta}^{T} \boldsymbol{F}_{\boldsymbol{\theta}}\, \delta \boldsymbol{\theta} \leq \epsilon.
$$

The solution of this problem linearly transforms the traditional gradient by the inverse Fisher matrix, $\delta \boldsymbol{\theta}^{\mathrm{NG}} \propto \boldsymbol{F}_{\boldsymbol{\theta}}^{-1} \delta \boldsymbol{\theta}^{\mathrm{VG}}$, which renders the parameter update invariant to linear transformations of the parameters of the distribution, i.e., if two parametrizations have the same representative power, the natural gradient update will be identical. As the Fisher information matrix is always positive definite, the natural gradient always rotates the traditional gradient by less than 90 degrees, and hence all convergence guarantees from standard gradient-based optimization remain.

For policy gradient methods, the Fisher information matrix

$$
\boldsymbol{F}_{\boldsymbol{\theta}}=\mathbb{E}_{p_{\boldsymbol{\theta}}(\boldsymbol{\tau})}\left[\nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\boldsymbol{\tau}) \nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\boldsymbol{\tau})^{T}\right]
$$

is now computed for the trajectory distribution $p_{\boldsymbol{\theta}}(\boldsymbol{\tau})$.
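As a minimal sketch of these definitions (the Gaussian "policy" and all names such as `score` are illustrative assumptions, not part of the original text), the Fisher matrix can be estimated from Monte Carlo samples of the score function, and the natural gradient direction can then be obtained by solving a linear system with it rather than inverting it explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: a 1-D Gaussian p_theta(y) = N(y | mu, sigma^2)
# with parameters theta = (mu, log_sigma).
theta = np.array([0.5, np.log(1.0)])

def score(theta, y):
    """Score function grad_theta log p_theta(y) for the Gaussian above."""
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    d_mu = (y - mu) / sigma**2
    d_log_sigma = (y - mu) ** 2 / sigma**2 - 1.0
    return np.array([d_mu, d_log_sigma])

# Monte Carlo estimate of F = E[ score(y) score(y)^T ] with y ~ p_theta(y).
samples = rng.normal(theta[0], np.exp(theta[1]), size=10_000)
scores = np.stack([score(theta, y) for y in samples])
F = scores.T @ scores / len(samples)

# Given some "vanilla" gradient estimate g, the natural gradient direction
# solves F * delta = g; a small damping term keeps the solve well conditioned.
g = np.array([1.0, -0.5])  # placeholder vanilla gradient
delta_ng = np.linalg.solve(F + 1e-6 * np.eye(2), g)

print("Fisher estimate:\n", F)
print("natural gradient direction:", delta_ng)
```

For this particular Gaussian with $\sigma = 1$, the analytic Fisher matrix is $\operatorname{diag}(1, 2)$, which gives a quick sanity check for the Monte Carlo estimate.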
The natural policy gradient $\nabla_{\boldsymbol{\theta}}^{\mathrm{NG}} J_{\boldsymbol{\theta}}$ is therefore given by

$$
\nabla_{\boldsymbol{\theta}}^{\mathrm{NG}} J_{\boldsymbol{\theta}}=\boldsymbol{F}_{\boldsymbol{\theta}}^{-1} \nabla_{\boldsymbol{\theta}} J_{\boldsymbol{\theta}},
$$

where $\nabla_{\boldsymbol{\theta}} J_{\boldsymbol{\theta}}$ can be estimated by any traditional policy gradient method.
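A rough sketch of the resulting update rule, assuming the Fisher matrix $F$ and vanilla gradient $g$ have already been estimated (the function name, the damping term, and the example values of `F`, `g`, and `epsilon` are hypothetical): the direction $\boldsymbol{F}_{\boldsymbol{\theta}}^{-1} \nabla_{\boldsymbol{\theta}} J_{\boldsymbol{\theta}}$ is rescaled so that the quadratic KL estimate $\delta\boldsymbol{\theta}^{T}\boldsymbol{F}_{\boldsymbol{\theta}}\,\delta\boldsymbol{\theta}$ from above meets the bound $\epsilon$.

```python
import numpy as np

def natural_gradient_step(F, g, epsilon=0.01, damping=1e-6):
    """Turn a vanilla policy gradient g into a natural gradient update.

    F: estimated Fisher information matrix of the trajectory distribution
    g: estimated vanilla policy gradient grad_theta J
    epsilon: bound on the quadratic KL estimate delta^T F delta <= epsilon
    """
    # Natural gradient direction F^{-1} g via a damped linear solve.
    direction = np.linalg.solve(F + damping * np.eye(len(g)), g)

    # Rescale so that delta^T F delta matches the KL bound epsilon,
    # following the quadratic approximation used above.
    quad = float(direction @ F @ direction)
    step_size = np.sqrt(epsilon / max(quad, 1e-12))
    return step_size * direction

# Hypothetical usage with externally estimated F and g:
F = np.array([[1.0, 0.1], [0.1, 2.0]])
g = np.array([0.3, -0.7])
delta_theta = natural_gradient_step(F, g)
print("update:", delta_theta, "KL estimate:", delta_theta @ F @ delta_theta)
```

For high-dimensional policies, $\boldsymbol{F}_{\boldsymbol{\theta}}$ is usually not formed explicitly; instead, the linear system is typically solved iteratively (e.g., with conjugate gradients using Fisher-vector products), but the rescaling logic stays the same.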