# Deterministic Policy Gradient
In the methods described above, the policy function $\pi(\cdot \mid s)$ is always modeled as a probability distribution over actions $\mathcal{A}$ given the current state, and is thus stochastic. Deterministic policy gradient (DPG) instead models the policy as a deterministic decision: $a=\mu(s)$. A few notations used below:
- $\rho_{0}(s)$ : the initial distribution over states.
- $\rho^{\mu}\left(s \rightarrow s^{\prime}, k\right)$ : starting from state $s$, the visitation probability density at state $s^{\prime}$ after moving $k$ steps under policy $\mu$.
- $\rho^{\mu}\left(s^{\prime}\right)$ : the discounted state distribution, defined as $\rho^{\mu}\left(s^{\prime}\right)=\int_{S} \sum_{k=1}^{\infty} \gamma^{k-1} \rho_{0}(s) \rho^{\mu}\left(s \rightarrow s^{\prime}, k\right) d s$.
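To make the discounted state distribution concrete, here is a minimal numpy sketch for a hypothetical 3-state MDP with a fixed deterministic policy: in the finite-state case the sum over $k$ collapses into a matrix geometric series, $\rho^{\mu}=\rho_{0}^{\top} P_{\mu}\left(I-\gamma P_{\mu}\right)^{-1}$, where $P_{\mu}$ is the transition matrix induced by $\mu$ (the numbers below are made up for illustration).

```python
import numpy as np

# Hypothetical 3-state MDP; P_mu[s, s'] = P(s' | s, mu(s)) is the transition
# matrix induced by a fixed deterministic policy mu.
gamma = 0.9
rho_0 = np.array([1.0, 0.0, 0.0])        # initial state distribution rho_0(s)
P_mu = np.array([[0.0, 1.0, 0.0],        # s0 -> s1
                 [0.0, 0.0, 1.0],        # s1 -> s2
                 [0.5, 0.0, 0.5]])       # s2 -> s0, or stays at s2

# rho^mu = sum_{k>=1} gamma^{k-1} rho_0^T P_mu^k = rho_0^T P_mu (I - gamma P_mu)^{-1}
rho_mu = rho_0 @ P_mu @ np.linalg.inv(np.eye(3) - gamma * P_mu)

# Sanity check: a truncated sum over k matches the closed form.
approx = sum(gamma ** (k - 1) * rho_0 @ np.linalg.matrix_power(P_mu, k)
             for k in range(1, 300))
print(rho_mu)
print(approx)
```

Note that $\rho^{\mu}$ as defined here is unnormalized: its entries sum to $1 /(1-\gamma)$.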
The objective function to optimize is:
$
J(\theta)=\int_{S} \rho^{\mu}(s) Q^{\mu}\left(s, \mu_{\theta}(s)\right) d s
$
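In the finite-state case the integral is just a weighted sum over the discounted state distribution, which a few lines of numpy make explicit (the state weights, Q-table, and policy below are hypothetical placeholders):

```python
import numpy as np

rho_mu = np.array([2.5, 4.0, 3.5])       # discounted state weights (unnormalized)
Q_table = np.array([[0.1, 0.4],          # Q(s, a) for 3 states x 2 actions
                    [0.7, 0.2],
                    [0.3, 0.9]])
mu = np.array([1, 0, 1])                 # deterministic policy: action index per state

# J = sum_s rho^mu(s) * Q(s, mu(s)) -- the integral becomes a weighted sum.
J = rho_mu @ Q_table[np.arange(3), mu]
print(J)
```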
Deterministic policy gradient theorem: now it is time to compute the gradient. By the chain rule, we first take the gradient of $Q$ w.r.t. the action $a$ and then the gradient of the deterministic policy function $\mu$ w.r.t. $\theta$:
$
\begin{aligned}
\nabla_{\theta} J(\theta) &=\left.\int_{S} \rho^{\mu}(s) \nabla_{a} Q^{\mu}(s, a) \nabla_{\theta} \mu_{\theta}(s)\right|_{a=\mu_{\theta}(s)} d s \\
&=\mathbb{E}_{s \sim \rho^{\mu}}\left[\left.\nabla_{a} Q^{\mu}(s, a) \nabla_{\theta} \mu_{\theta}(s)\right|_{a=\mu_{\theta}(s)}\right]
\end{aligned}
$
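A quick way to see this chain rule in action is to let automatic differentiation do it: backpropagating through $Q(s, \mu_{\theta}(s))$ yields exactly the product $\nabla_{a} Q \, \nabla_{\theta} \mu_{\theta}(s)$ above. Below is a minimal PyTorch sketch, assuming a hypothetical linear policy and a small MLP critic, that checks the automatic gradient against the manual chain-rule product.

```python
import torch

torch.manual_seed(0)
state_dim, action_dim = 4, 2

# Hypothetical linear deterministic policy mu_theta(s) = W_mu @ s (theta = W_mu)
# and a small MLP critic Q(s, a); both are placeholders for this sketch.
W_mu = torch.randn(action_dim, state_dim, requires_grad=True)
q_net = torch.nn.Sequential(
    torch.nn.Linear(state_dim + action_dim, 32),
    torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)
s = torch.randn(state_dim)

# (1) Autograd applies the chain rule for us: d Q(s, mu_theta(s)) / d theta.
q = q_net(torch.cat([s, W_mu @ s])).squeeze()
grad_auto, = torch.autograd.grad(q, W_mu)

# (2) Manual chain rule: grad_a Q(s, a)|_{a = mu_theta(s)} times grad_theta mu_theta(s).
a = (W_mu @ s).detach().requires_grad_(True)
grad_a_Q, = torch.autograd.grad(q_net(torch.cat([s, a])).squeeze(), a)
# For a linear policy, d a_i / d W_mu[i, j] = s_j, so the product is an outer product.
grad_manual = torch.outer(grad_a_Q, s)

print(torch.allclose(grad_auto, grad_manual, atol=1e-6))   # expected: True
```

This is also why, in practice, the actor update is usually implemented by simply backpropagating the critic's output with respect to the policy parameters rather than forming the two gradients explicitly.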
Let's consider an on-policy actor-critic algorithm as an example to showcase the procedure. In each iteration, two consecutive actions are taken deterministically, $a_{t}=\mu_{\theta}\left(s_{t}\right)$ and $a_{t+1}=\mu_{\theta}\left(s_{t+1}\right)$; the critic parameters $w$ are updated with a SARSA-style TD error, while the policy parameters $\theta$ follow the deterministic policy gradient that we just computed above:
$
\begin{aligned}
\delta_{t} &=R_{t}+\gamma Q_{w}\left(s_{t+1}, a_{t+1}\right)-Q_{w}\left(s_{t}, a_{t}\right) \\
w_{t+1} &=w_{t}+\alpha_{w} \delta_{t} \nabla_{w} Q_{w}\left(s_{t}, a_{t}\right) \\
\theta_{t+1} &=\theta_{t}+\left.\alpha_{\theta} \nabla_{a} Q_{w}\left(s_{t}, a_{t}\right) \nabla_{\theta} \mu_{\theta}\left(s_{t}\right)\right|_{a=\mu_{\theta}\left(s_{t}\right)}
\end{aligned}
$
The first line is the TD error in SARSA; the last line is the policy update given by the deterministic policy gradient theorem.
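Below is a minimal numpy sketch of a single iteration of these updates, assuming a linear deterministic policy $\mu_{\theta}(s)=\theta^{\top} s$ with a scalar action and a linear-in-features critic $Q_{w}(s, a)=w^{\top} \phi(s, a)$ (both hypothetical choices made only for illustration); it shows the mechanics of the three equations, not a tuned training loop.

```python
import numpy as np

state_dim = 3

def mu(theta, s):
    # Deterministic policy mu_theta(s) = theta . s (scalar action).
    return theta @ s

def phi(s, a):
    # Critic features; grad_w Q_w(s, a) = phi(s, a).
    return np.concatenate([s, a * s, [a, a ** 2]])

def Q(w, s, a):
    return w @ phi(s, a)

def grad_a_Q(w, s, a):
    # Analytic d/da of w . phi(s, a) for the features above.
    return w[state_dim:2 * state_dim] @ s + w[-2] + 2.0 * w[-1] * a

def sarsa_dpg_update(theta, w, s, r, s_next, gamma, alpha_w, alpha_theta):
    """One on-policy actor-critic iteration implementing the three updates above."""
    a = mu(theta, s)                        # a_t = mu_theta(s_t)
    a_next = mu(theta, s_next)              # a_{t+1} = mu_theta(s_{t+1}), on-policy
    delta = r + gamma * Q(w, s_next, a_next) - Q(w, s, a)   # SARSA TD error
    dQ_da = grad_a_Q(w, s, a)               # grad_a Q_w(s_t, a_t)
    w = w + alpha_w * delta * phi(s, a)     # critic step: grad_w Q_w(s_t, a_t) = phi
    # Actor step: for the linear policy, grad_theta mu_theta(s_t) = s_t.
    theta = theta + alpha_theta * dQ_da * s
    return theta, w

# One illustrative update on random data.
rng = np.random.default_rng(0)
theta = np.zeros(state_dim)
w = rng.normal(size=2 * state_dim + 2)
s, s_next, r = rng.normal(size=state_dim), rng.normal(size=state_dim), 1.0
theta, w = sarsa_dpg_update(theta, w, s, r, s_next,
                            gamma=0.99, alpha_w=1e-2, alpha_theta=1e-3)
print(theta)
print(w)
```

One caveat: a purely deterministic on-policy rollout provides little action exploration, so practical variants add behavior noise or work off-policy; the sketch above only demonstrates the update mechanics.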
---
## References
1. David Silver et al. "Deterministic Policy Gradient Algorithms." ICML 2014. http://proceedings.mlr.press/v32/silver14.pdf
2. Lilian Weng. "Policy Gradient Algorithms." https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#dpg