# PGT Actor-Critic

As seen in [[Temporal Difference Learning]], the one-step return is often superior to the actual return in terms of its variance and computational congeniality, even though it introduces bias. When the state-value function is used to assess actions, it is called a critic, and the overall policy-gradient method is termed an actor–critic method.

One-step actor–critic methods replace the full return of [[REINFORCE - Monte Carlo Policy Gradient]] with the one-step return (and use a learned state-value function as the baseline) as follows:

$$
\begin{aligned}
\boldsymbol{\theta}_{t+1} & \doteq \boldsymbol{\theta}_{t}+\alpha\left(G_{t: t+1}-\hat{v}\left(S_{t}, \mathbf{w}\right)\right) \frac{\nabla \pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}_{t}\right)}{\pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}_{t}\right)} \\
&=\boldsymbol{\theta}_{t}+\alpha\left(R_{t+1}+\gamma \hat{v}\left(S_{t+1}, \mathbf{w}\right)-\hat{v}\left(S_{t}, \mathbf{w}\right)\right) \frac{\nabla \pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}_{t}\right)}{\pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}_{t}\right)} \\
&=\boldsymbol{\theta}_{t}+\alpha \delta_{t} \frac{\nabla \pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}_{t}\right)}{\pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}_{t}\right)}
\end{aligned}
$$

![[Actor Critic One-step.png]]

## Policy parameterization for continuous actions

In continuous action spaces, a Gaussian policy is common: the mean is some function of the state, $\mu(s)$. For simplicity, let's consider a fixed variance $\sigma^{2}$ (the variance can be parameterized as well).

The policy is Gaussian, $a \sim \mathcal{N}\left(\mu(s), \sigma^{2}\right)$, so $\log \pi_{\theta}(s, a)=-\frac{(a-\mu(s))^{2}}{2 \sigma^{2}}+\text{const}$, and the gradient of the log of the policy is

$$
\nabla_{\theta} \log \pi_{\theta}(s, a)=\frac{a-\mu(s)}{\sigma^{2}} \nabla_{\theta} \mu(s)
$$

This can be used, for instance, in REINFORCE or an advantage actor–critic.
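
As a quick illustration of the Gaussian log-policy gradient, here is a minimal sketch assuming the mean is linear in state features, $\mu(s)=\boldsymbol{\theta}^{\top}\mathbf{x}(s)$, so that $\nabla_{\boldsymbol{\theta}}\mu(s)=\mathbf{x}(s)$. The function name `gaussian_logpi_grad` and the finite-difference check are illustrative, not from the source note.

```python
import numpy as np

def gaussian_logpi_grad(theta, x_s, a, sigma=0.5):
    """Gradient of log pi(a|s) for a Gaussian policy with mean mu(s) = theta @ x(s)
    and fixed variance sigma^2; here grad_theta mu(s) = x(s)."""
    mu = theta @ x_s
    return (a - mu) / sigma**2 * x_s

# Finite-difference check of the analytic gradient (illustrative toy data).
rng = np.random.default_rng(0)
theta = rng.normal(size=4)
x_s = rng.normal(size=4)                 # state features x(s)
sigma = 0.5
a = theta @ x_s + sigma * rng.normal()   # sample a ~ N(mu(s), sigma^2)

def log_pi(th):
    mu = th @ x_s
    return -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

eps = 1e-6
numeric = np.array([(log_pi(theta + eps * e) - log_pi(theta - eps * e)) / (2 * eps)
                    for e in np.eye(4)])
print(np.allclose(numeric, gaussian_logpi_grad(theta, x_s, a, sigma), atol=1e-5))  # expect True
```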
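
Putting the two parts together, here is a minimal sketch of the one-step actor–critic update above with the Gaussian policy, using linear function approximation for both actor and critic. The `ToyEnv` task, its `reset()`/`step()` interface, and all hyperparameter values are assumptions made for illustration, not part of the source note.

```python
import numpy as np

class ToyEnv:
    """Hypothetical 1-D task: state feature x, reward -(a - 2*x)^2,
    fixed-length episodes. For illustration only."""
    def __init__(self, horizon=20):
        self.horizon = horizon
    def reset(self):
        self.t = 0
        self.x = np.array([np.random.uniform(-1, 1)])
        return self.x
    def step(self, a):
        r = -(a - 2.0 * self.x[0]) ** 2          # best action on this toy task is a = 2*x
        self.t += 1
        self.x = np.array([np.random.uniform(-1, 1)])
        return self.x, r, self.t >= self.horizon

def one_step_actor_critic(env, n_features, alpha_theta=5e-4, alpha_w=1e-2,
                          gamma=0.99, sigma=0.5, n_episodes=2000):
    """Actor: Gaussian policy with mean mu(s) = theta @ x(s) and fixed variance sigma^2.
    Critic: linear state-value estimate v_hat(s, w) = w @ x(s)."""
    theta = np.zeros(n_features)
    w = np.zeros(n_features)
    for _ in range(n_episodes):
        x, done = env.reset(), False
        while not done:
            a = np.random.normal(theta @ x, sigma)             # A_t ~ pi(.|S_t, theta)
            x_next, r, done = env.step(a)                      # observe R_{t+1}, S_{t+1}
            v_next = 0.0 if done else w @ x_next
            delta = r + gamma * v_next - w @ x                 # delta_t = R + gamma*v(S') - v(S)
            w = w + alpha_w * delta * x                        # critic: semi-gradient TD(0)
            grad_log_pi = (a - theta @ x) / sigma**2 * x       # Gaussian log-policy gradient
            theta = theta + alpha_theta * delta * grad_log_pi  # actor: theta += alpha*delta*grad ln pi
            x = x_next
    return theta, w

theta, w = one_step_actor_critic(ToyEnv(), n_features=1)
print(theta)  # on this toy task, theta[0] should drift toward roughly 2.0
```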