# PGT Actor-Critic
As seen in [[Temporal Difference Learning]], the one-step return is often superior to the actual return in terms of its variance and computational congeniality, even though it introduces bias.
When the state-value function is used to assess actions, it is called a critic, and the overall policy-gradient method is termed an actor–critic method.
One-step actor–critic methods replace the full return of [[REINFORCE - Monte Carlo Policy Gradient]] with the one-step return (and use a learned state-value function as the baseline) as follows:
$$
\begin{aligned}
\boldsymbol{\theta}_{t+1} & \doteq \boldsymbol{\theta}_{t}+\alpha\left(G_{t: t+1}-\hat{v}\left(S_{t}, \mathbf{w}\right)\right) \frac{\nabla \pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}_{t}\right)}{\pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}_{t}\right)} \\
&=\boldsymbol{\theta}_{t}+\alpha\left(R_{t+1}+\gamma \hat{v}\left(S_{t+1}, \mathbf{w}\right)-\hat{v}\left(S_{t}, \mathbf{w}\right)\right) \frac{\nabla \pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}_{t}\right)}{\pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}_{t}\right)} \\
&=\boldsymbol{\theta}_{t}+\alpha \delta_{t} \frac{\nabla \pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}_{t}\right)}{\pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}_{t}\right)}
\end{aligned}
$$
![[Actor Critic One-step.png]]
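Below is a minimal sketch of this per-transition update in NumPy, assuming a linear critic $\hat{v}(s,\mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$ and a softmax policy over linear action preferences (discrete actions). The feature map `x_s`, step sizes, and function names are illustrative assumptions, not part of the original note; the episode loop that wraps this update follows the pseudocode in the figure above.

```python
import numpy as np

def softmax_policy(theta, x_s):
    """Action probabilities pi(.|s, theta) for linear preferences theta[a] @ x(s)."""
    prefs = theta @ x_s                      # one preference per action
    prefs -= prefs.max()                     # numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def one_step_actor_critic_update(theta, w, x_s, a, r, x_s_next, done,
                                 alpha_theta=0.01, alpha_w=0.1, gamma=0.99):
    """One actor-critic update for a transition (s, a, r, s'), as in the equations above."""
    v_s = w @ x_s
    v_s_next = 0.0 if done else w @ x_s_next
    delta = r + gamma * v_s_next - v_s       # TD error delta_t

    # Critic: semi-gradient TD(0) update; grad of v_hat w.r.t. w is x_s for a linear critic.
    w = w + alpha_w * delta * x_s

    # Actor: grad log pi(a|s) for a softmax policy is x_s for the taken action
    # minus the probability-weighted features of all actions.
    pi = softmax_policy(theta, x_s)
    grad_log_pi = -np.outer(pi, x_s)
    grad_log_pi[a] += x_s
    theta = theta + alpha_theta * delta * grad_log_pi

    return theta, w
```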
## Policy parameterization for continuous actions
In continuous action spaces, a Gaussian policy is common: the mean is some function of the state, $\mu(s)$. For simplicity, let's consider a fixed variance $\sigma^{2}$ (it can be parameterized as well). The policy is then Gaussian, $a \sim \mathcal{N}\left(\mu(s), \sigma^{2}\right)$.
The gradient of the log of the policy is then
$$
\nabla_{\theta} \log \pi_{\theta}(s, a)=\frac{a-\mu(s)}{\sigma^{2}} \nabla_{\theta} \mu(s)
$$
This can be used, for instance, in REINFORCE or in an advantage actor-critic, as sketched below.
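Here is a small sketch of this score function, assuming a linear mean $\mu(s) = \boldsymbol{\theta}^\top \mathbf{x}(s)$ (so $\nabla_{\theta}\mu(s) = \mathbf{x}(s)$) and fixed $\sigma$. The feature vector, step size, and placeholder return are illustrative assumptions.

```python
import numpy as np

def gaussian_log_pi_grad(theta, x_s, a, sigma=1.0):
    """grad_theta log pi(a|s) = (a - mu(s)) / sigma^2 * grad_theta mu(s)."""
    mu = theta @ x_s                 # linear mean, so grad_theta mu(s) = x_s
    return (a - mu) / sigma**2 * x_s

# Usage: sample an action and form a REINFORCE-style update with return G
# (or substitute the TD error / advantage estimate in an actor-critic).
rng = np.random.default_rng(0)
theta = np.zeros(4)
x_s = rng.normal(size=4)             # illustrative state features
a = rng.normal(theta @ x_s, 1.0)     # a ~ N(mu(s), sigma^2)
G = 1.0                              # placeholder return / advantage
theta = theta + 0.01 * G * gaussian_log_pi_grad(theta, x_s, a)
```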
---
## References