# Generalized Advantage Estimation

## The problem

- Data efficiency is poor: a large number of samples is required because the variance of the policy gradient estimate is high.
- Stable and steady training is difficult because of the non-stationarity of the incoming data.

## The solution

- Value functions can be used as baselines, at the cost of some bias.
- The instability can be dealt with to some degree by preventing the policy from changing drastically based on a small number of samples, using [[TRPO - Trust-Region Policy Optimization]].

## The details

This paper considers algorithms that optimize a parameterized policy and use value functions to help estimate how the policy should be improved. Policy gradient methods maximize the expected total reward by repeatedly estimating the gradient $g:=\nabla_{\theta} \mathbb{E}\left[\sum_{t=0}^{\infty} r_{t}\right]$. There are several different related expressions for the policy gradient, which have the form

$$
g=\mathbb{E}\left[\sum_{t=0}^{\infty} \Psi_{t} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)\right]
$$

where $\Psi_{t}$ may be one of the following:

1. $\sum_{t=0}^{\infty} r_{t}$: total reward of the trajectory.
2. $Q^{\pi}\left(s_{t}, a_{t}\right)$: state-action value function.
3. $\sum_{t^{\prime}=t}^{\infty} r_{t^{\prime}}$: reward following action $a_{t}$.
4. $A^{\pi}\left(s_{t}, a_{t}\right)$: advantage function.
5. $\sum_{t^{\prime}=t}^{\infty} r_{t^{\prime}}-b\left(s_{t}\right)$: baselined version of the previous formula.
6. $r_{t}+V^{\pi}\left(s_{t+1}\right)-V^{\pi}\left(s_{t}\right)$: TD residual.

The latter formulas use the definitions

$$
\begin{aligned}
V^{\pi}\left(s_{t}\right) &:=\mathbb{E}_{s_{t+1: \infty},\, a_{t: \infty}}\left[\sum_{l=0}^{\infty} r_{t+l}\right] \quad Q^{\pi}\left(s_{t}, a_{t}\right):=\mathbb{E}_{s_{t+1: \infty},\, a_{t+1: \infty}}\left[\sum_{l=0}^{\infty} r_{t+l}\right] \\
A^{\pi}\left(s_{t}, a_{t}\right) &:=Q^{\pi}\left(s_{t}, a_{t}\right)-V^{\pi}\left(s_{t}\right) \quad \text{(advantage function)}
\end{aligned}
$$

Here, the subscript of $\mathbb{E}$ enumerates the variables being integrated over, where states and actions are sampled sequentially from the dynamics model $P\left(s_{t+1} \mid s_{t}, a_{t}\right)$ and policy $\pi\left(a_{t} \mid s_{t}\right)$, respectively. The advantage function, by its definition $A^{\pi}(s, a)=Q^{\pi}(s, a)-V^{\pi}(s)$, measures whether the action is better or worse than the policy's default behavior. Hence, we should choose $\Psi_{t}$ to be the advantage function $A^{\pi}\left(s_{t}, a_{t}\right)$, so that the gradient term $\Psi_{t} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)$ points in the direction of increased $\pi_{\theta}\left(a_{t} \mid s_{t}\right)$ if and only if $A^{\pi}\left(s_{t}, a_{t}\right)>0$. See Greensmith et al.

The paper introduces a parameter $\gamma$ that allows us to reduce variance by downweighting rewards corresponding to delayed effects, at the cost of introducing bias. This parameter corresponds to the discount factor used in discounted formulations of MDPs, but here it is treated as a variance-reduction parameter in an undiscounted problem.

The main experimental validation of generalized advantage estimation is in the domain of simulated robotic locomotion. As the experiments show, choosing an appropriate intermediate value of $\lambda$ in the range $[0.9, 0.99]$ usually results in the best performance. A possible topic for future work is how to adjust the estimator parameters $\gamma, \lambda$ in an adaptive or automatic way.
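To make the $\Psi_{t}$ options above concrete, here is a minimal NumPy sketch of how several of them could be computed from a single recorded trajectory. The function name `psi_candidates`, the `values` baseline input, and the use of the same discount $\gamma$ for the reward-to-go are illustrative assumptions of mine, not part of the paper.

```python
import numpy as np

def psi_candidates(rewards, values, gamma=0.99):
    """Illustrative Psi_t choices for one trajectory (hypothetical helper).

    rewards: array of r_t, shape (T,)
    values:  array of V(s_t), shape (T + 1,); last entry bootstraps the tail
    Returns a dict mapping each Psi_t variant to an array of shape (T,).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    T = len(rewards)

    # Discounted reward-to-go: sum_{t' >= t} gamma^(t'-t) * r_{t'}
    reward_to_go = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        reward_to_go[t] = running

    return {
        # 1. total reward of the trajectory (same value at every step)
        "total_reward": np.full(T, reward_to_go[0]),
        # 3. reward following action a_t
        "reward_to_go": reward_to_go,
        # 5. baselined version: reward-to-go minus a state-dependent baseline
        "baselined": reward_to_go - values[:-1],
        # 6. TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        "td_residual": rewards + gamma * values[1:] - values[:-1],
    }
```

Any of these arrays can then stand in for $\Psi_{t}$ in the score-function estimator $\sum_{t} \Psi_{t} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)$.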
With the discount $\gamma$ introduced above, the discounted value functions are given by

$$
\begin{aligned}
V^{\pi, \gamma}\left(s_{t}\right) &:=\mathbb{E}_{s_{t+1: \infty},\, a_{t: \infty}}\left[\sum_{l=0}^{\infty} \gamma^{l} r_{t+l}\right] \quad Q^{\pi, \gamma}\left(s_{t}, a_{t}\right):=\mathbb{E}_{s_{t+1: \infty},\, a_{t+1: \infty}}\left[\sum_{l=0}^{\infty} \gamma^{l} r_{t+l}\right] \\
A^{\pi, \gamma}\left(s_{t}, a_{t}\right) &:=Q^{\pi, \gamma}\left(s_{t}, a_{t}\right)-V^{\pi, \gamma}\left(s_{t}\right)
\end{aligned}
$$

The discounted approximation to the policy gradient is defined as

$$
g^{\gamma}:=\mathbb{E}_{s_{0: \infty},\, a_{0: \infty}}\left[\sum_{t=0}^{\infty} A^{\pi, \gamma}\left(s_{t}, a_{t}\right) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)\right]
$$

Let $\delta_{t}^{V}:=r_{t}+\gamma V\left(s_{t+1}\right)-V\left(s_{t}\right)$ denote the TD residual of $V$ with discount $\gamma$, which can itself be used as an estimator of the advantage. Next, let us consider taking the sum of $k$ of these $\delta$ terms, which we will denote by $\hat{A}_{t}^{(k)}$:

$$
\begin{aligned}
&\hat{A}_{t}^{(1)}:=\delta_{t}^{V} \quad=-V\left(s_{t}\right)+r_{t}+\gamma V\left(s_{t+1}\right) \\
&\hat{A}_{t}^{(2)}:=\delta_{t}^{V}+\gamma \delta_{t+1}^{V}=-V\left(s_{t}\right)+r_{t}+\gamma r_{t+1}+\gamma^{2} V\left(s_{t+2}\right) \\
&\hat{A}_{t}^{(3)}:=\delta_{t}^{V}+\gamma \delta_{t+1}^{V}+\gamma^{2} \delta_{t+2}^{V}=-V\left(s_{t}\right)+r_{t}+\gamma r_{t+1}+\gamma^{2} r_{t+2}+\gamma^{3} V\left(s_{t+3}\right) \\
&\hat{A}_{t}^{(k)}:=\sum_{l=0}^{k-1} \gamma^{l} \delta_{t+l}^{V}=-V\left(s_{t}\right)+r_{t}+\gamma r_{t+1}+\cdots+\gamma^{k-1} r_{t+k-1}+\gamma^{k} V\left(s_{t+k}\right)
\end{aligned}
$$

These equations result from a telescoping sum, and we see that $\hat{A}_{t}^{(k)}$ involves a $k$-step estimate of the returns, minus the baseline term $V\left(s_{t}\right)$. Analogously to the case of $\delta_{t}^{V}=\hat{A}_{t}^{(1)}$, we can consider $\hat{A}_{t}^{(k)}$ to be an estimator of the advantage function, which is only $\gamma$-just when $V=V^{\pi, \gamma}$. However, note that the bias generally becomes smaller as $k \rightarrow \infty$, since the term $\gamma^{k} V\left(s_{t+k}\right)$ becomes more heavily discounted, and the term $-V\left(s_{t}\right)$ does not affect the bias. Taking $k \rightarrow \infty$, we get

$$
\hat{A}_{t}^{(\infty)}=\sum_{l=0}^{\infty} \gamma^{l} \delta_{t+l}^{V}=-V\left(s_{t}\right)+\sum_{l=0}^{\infty} \gamma^{l} r_{t+l}
$$

which is simply the empirical return minus the value function baseline.

The generalized advantage estimator $\operatorname{GAE}(\gamma, \lambda)$ is defined as the exponentially-weighted average of these $k$-step estimators:

$$
\begin{aligned}
\hat{A}_{t}^{\mathrm{GAE}(\gamma, \lambda)}:=&\ (1-\lambda)\left(\hat{A}_{t}^{(1)}+\lambda \hat{A}_{t}^{(2)}+\lambda^{2} \hat{A}_{t}^{(3)}+\ldots\right) \\
=&\ (1-\lambda)\left(\delta_{t}^{V}+\lambda\left(\delta_{t}^{V}+\gamma \delta_{t+1}^{V}\right)+\lambda^{2}\left(\delta_{t}^{V}+\gamma \delta_{t+1}^{V}+\gamma^{2} \delta_{t+2}^{V}\right)+\ldots\right) \\
=&\ (1-\lambda)\left(\delta_{t}^{V}\left(1+\lambda+\lambda^{2}+\ldots\right)+\gamma \delta_{t+1}^{V}\left(\lambda+\lambda^{2}+\lambda^{3}+\ldots\right)\right. \\
&\left.\quad+\gamma^{2} \delta_{t+2}^{V}\left(\lambda^{2}+\lambda^{3}+\lambda^{4}+\ldots\right)+\ldots\right) \\
=&\ (1-\lambda)\left(\delta_{t}^{V}\left(\frac{1}{1-\lambda}\right)+\gamma \delta_{t+1}^{V}\left(\frac{\lambda}{1-\lambda}\right)+\gamma^{2} \delta_{t+2}^{V}\left(\frac{\lambda^{2}}{1-\lambda}\right)+\ldots\right) \\
=&\ \sum_{l=0}^{\infty}(\gamma \lambda)^{l} \delta_{t+l}^{V}
\end{aligned}
$$

From the equation above, we see that the advantage estimator has a remarkably simple formula involving a discounted sum of Bellman residual terms.
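In practice the infinite sum is truncated at the end of a finite rollout, and $\hat{A}_{t}^{\mathrm{GAE}(\gamma, \lambda)}$ is usually computed with the backward recursion $\hat{A}_{t}=\delta_{t}^{V}+\gamma \lambda \hat{A}_{t+1}$. Below is a minimal NumPy sketch assuming a length-$T$ rollout with a bootstrap value $V(s_{T})$ appended to `values`; the function name `compute_gae` is mine, not the paper's.

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE(gamma, lambda) advantages for a single finite rollout.

    rewards: array of r_t, shape (T,)
    values:  array of V(s_t), shape (T + 1,); last entry bootstraps the tail
    Returns advantages of shape (T,), i.e. the truncated sums
    A_t = sum_l (gamma * lam)^l * delta_{t+l}^V.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    T = len(rewards)

    # delta_t^V = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * values[1:] - values[:-1]

    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # A_t = delta_t^V + gamma * lam * A_{t+1}
        gae = deltas[t] + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

Setting `lam=0` recovers the one-step TD residual, while `lam=1` recovers the (truncated, bootstrapped) empirical return minus the value baseline, matching the two special cases discussed next.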
Section 4 of the paper discusses an interpretation of this formula as the returns in an MDP with a modified reward function. The construction above is closely analogous to the one used to define $\mathrm{TD}(\lambda)$ (Sutton & Barto, 1998); however, $\mathrm{TD}(\lambda)$ is an estimator of the value function, whereas here we are estimating the advantage function.

There are two notable special cases of this formula, obtained by setting $\lambda=0$ and $\lambda=1$:

$$
\begin{array}{lll}
\operatorname{GAE}(\gamma, 0): & \hat{A}_{t}:=\delta_{t}^{V} & =r_{t}+\gamma V\left(s_{t+1}\right)-V\left(s_{t}\right) \\
\operatorname{GAE}(\gamma, 1): & \hat{A}_{t}:=\sum_{l=0}^{\infty} \gamma^{l} \delta_{t+l}^{V} & =\sum_{l=0}^{\infty} \gamma^{l} r_{t+l}-V\left(s_{t}\right)
\end{array}
$$

$\operatorname{GAE}(\gamma, 1)$ has high variance due to the sum of reward terms, whereas $\operatorname{GAE}(\gamma, 0)$ has much lower variance but is biased whenever $V \neq V^{\pi, \gamma}$; intermediate values of $\lambda$ trade off between the two.

Using the generalized advantage estimator, we can construct a biased estimator of $g^{\gamma}$, the discounted policy gradient defined above:

$$
g^{\gamma} \approx \mathbb{E}\left[\sum_{t=0}^{\infty} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right) \hat{A}_{t}^{\mathrm{GAE}(\gamma, \lambda)}\right]=\mathbb{E}\left[\sum_{t=0}^{\infty} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right) \sum_{l=0}^{\infty}(\gamma \lambda)^{l} \delta_{t+l}^{V}\right]
$$

where equality holds when $\lambda=1$.

## The Results

---

## References

1. Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation." https://arxiv.org/abs/1506.02438