# Deterministic Policy Gradient

In the methods described above, the policy function $\pi(\cdot \mid s)$ is always modeled as a probability distribution over actions $\mathcal{A}$ given the current state, and is thus stochastic. Deterministic policy gradient (DPG) instead models the policy as a deterministic decision: $a = \mu(s)$. A few pieces of notation first:

- $\rho_{0}(s)$: the initial distribution over states.
- $\rho^{\mu}(s \rightarrow s', k)$: starting from state $s$, the visitation probability density at state $s'$ after moving $k$ steps under policy $\mu$.
- $\rho^{\mu}(s')$: the discounted state distribution, defined as $\rho^{\mu}(s') = \int_{S} \sum_{k=1}^{\infty} \gamma^{k-1} \rho_{0}(s) \rho^{\mu}(s \rightarrow s', k) \, ds$.

The objective function to optimize is:

$$
J(\theta) = \int_{S} \rho^{\mu}(s) Q\left(s, \mu_{\theta}(s)\right) ds
$$

**Deterministic policy gradient theorem.** Now it is time to compute the gradient. According to the chain rule, we first take the gradient of $Q$ w.r.t. the action $a$ and then take the gradient of the deterministic policy function $\mu$ w.r.t. $\theta$:

$$
\begin{aligned}
\nabla_{\theta} J(\theta)
&= \left. \int_{S} \rho^{\mu}(s) \nabla_{a} Q^{\mu}(s, a) \nabla_{\theta} \mu_{\theta}(s) \right|_{a=\mu_{\theta}(s)} ds \\
&= \mathbb{E}_{s \sim \rho^{\mu}}\left[\left. \nabla_{a} Q^{\mu}(s, a) \nabla_{\theta} \mu_{\theta}(s) \right|_{a=\mu_{\theta}(s)}\right]
\end{aligned}
$$

Let's consider an example of an on-policy actor-critic algorithm to showcase the procedure. In each iteration of on-policy actor-critic, two actions are taken deterministically (both $a_t$ and $a_{t+1}$ follow $a = \mu_{\theta}(s)$), and the SARSA update on the policy parameters relies on the gradient we just computed above:

$$
\begin{aligned}
\delta_{t} &= R_{t} + \gamma Q_{w}\left(s_{t+1}, a_{t+1}\right) - Q_{w}\left(s_{t}, a_{t}\right) \\
w_{t+1} &= w_{t} + \alpha_{w} \delta_{t} \nabla_{w} Q_{w}\left(s_{t}, a_{t}\right) \\
\theta_{t+1} &= \theta_{t} + \alpha_{\theta} \left. \nabla_{a} Q_{w}\left(s_{t}, a_{t}\right) \nabla_{\theta} \mu_{\theta}(s) \right|_{a=\mu_{\theta}(s)}
\end{aligned}
$$

Here $\delta_t$ is the TD error in SARSA, and the update of $\theta$ applies the deterministic policy gradient theorem derived above. A minimal code sketch of this loop is given at the end of the section.

---

## References

1. Silver et al. "Deterministic Policy Gradient Algorithms." ICML 2014. http://proceedings.mlr.press/v32/silver14.pdf
2. Lilian Weng. "Policy Gradient Algorithms." https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#dpg
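To make the on-policy deterministic actor-critic updates above concrete, here is a minimal NumPy sketch. The one-dimensional action, the hand-crafted critic features, the linear policy, and the `env` object with `reset()` and `step(a)` returning `(next_state, reward, done)` are all illustrative assumptions, not part of the source; the update rules themselves follow the SARSA critic and DPG actor steps shown above.

```python
import numpy as np

# Toy sketch of on-policy deterministic actor-critic with a SARSA critic.
# Assumptions (not from the source): 1-D continuous action, linear function
# approximation, and a hypothetical `env` with reset()/step(a).

def features(s, a):
    # Critic features phi(s, a); quadratic in the action so grad_a Q is non-trivial.
    return np.concatenate([s, [a], [a * a]])

def grad_a_features(s, a):
    # d phi / d a for the features above.
    return np.concatenate([np.zeros_like(s), [1.0], [2.0 * a]])

def mu(theta, s):
    # Deterministic policy a = mu_theta(s): linear in the state.
    return float(theta @ s)

def dpg_sarsa_actor_critic(env, state_dim, alpha_w=1e-3, alpha_theta=1e-4,
                           gamma=0.99, episodes=100):
    w = np.zeros(state_dim + 2)   # critic weights for Q_w(s, a) = w @ phi(s, a)
    theta = np.zeros(state_dim)   # actor weights for mu_theta(s) = theta @ s

    for _ in range(episodes):
        s = env.reset()
        a = mu(theta, s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = mu(theta, s_next)

            # SARSA TD error: delta_t = R_t + gamma * Q(s', a') - Q(s, a),
            # with the bootstrap term dropped at terminal states.
            q_sa = w @ features(s, a)
            q_next = 0.0 if done else w @ features(s_next, a_next)
            delta = r + gamma * q_next - q_sa

            # Critic update: w <- w + alpha_w * delta * grad_w Q_w(s, a).
            w += alpha_w * delta * features(s, a)

            # Actor update (deterministic policy gradient):
            # theta <- theta + alpha_theta * grad_a Q_w(s, a)|_{a=mu(s)} * grad_theta mu_theta(s).
            grad_a_q = w @ grad_a_features(s, mu(theta, s))
            grad_theta_mu = s          # since mu_theta(s) = theta @ s
            theta += alpha_theta * grad_a_q * grad_theta_mu

            s, a = s_next, a_next
    return theta, w
```

A linear policy keeps the actor step readable because $\nabla_{\theta} \mu_{\theta}(s) = s$; in practice both the actor and the critic would typically be neural networks, with the same two updates applied to their parameters.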