# Policy Gradient

## Advantages of Policy-Based RL

Advantages:

- Good convergence properties
- Easily extended to high-dimensional or continuous action spaces (big advantage)
    - Value-based methods struggle with continuous action spaces because they need to evaluate Q-values across all possible actions to find the maximum, which becomes computationally intractable when the action space is infinite.
    - Policy gradient methods bypass this by directly outputting continuous actions through parameterized policies, enabling finer control and more nuanced action selection.
- Can learn stochastic policies, which can help overcome local minima (an advantage that is not as widely appreciated)
    - Policy gradient methods are more suitable in this regard because they can learn stochastic policies that naturally handle environment uncertainty.
    - Value-based methods typically produce deterministic policies, which may struggle to adapt to the inherent randomness in stochastic environments where the same action can lead to different outcomes. Note: value-based methods can also yield a stochastic policy (e.g., using a softmax over Q-values), but not natively.
- Sometimes policies are simple while values and models are complex
    - E.g., a rich domain where the optimal policy is simply "always go left"
- Policy parameterization is a good way of injecting priors

Disadvantages:

- Susceptible to local optima (especially with non-linear function approximation)
- Obtained knowledge is specific and does not always generalize well
- Ignores a lot of information in the data (when used in isolation)

## Policy Objective Functions

Goal: given a policy $\pi_{\theta}(s, a)$ with parameters $\theta$, find the best $\theta$.

But how do we measure the quality of a policy $\pi_{\theta}$?

In episodic environments we can use the _start value_

$
J_{1}(\theta)=v_{\pi_{\theta}}\left(s_{1}\right)
$

In continuing environments we can use the _average value_

$
J_{avV}(\theta)=\sum_{s} \mu_{\pi_{\theta}}(s) v_{\pi_{\theta}}(s)
$

where $\mu_{\pi}(s)=p\left(S_{t}=s \mid \pi\right)$ is the probability of being in state $s$ in the long run. Think of it as the fraction of time spent in $s$ under policy $\pi$. (Also written as $d$ instead of $\mu$.)

Or the _average reward per time-step_

$
J_{avR}(\theta)=\sum_{s} \mu_{\pi_{\theta}}(s) \sum_{a} \pi_{\theta}(s, a) \sum_{r} p(r \mid s, a) r
$

## Policy Optimization

Policy-based reinforcement learning is an _optimization_ problem: find $\theta$ that maximises $J(\theta)$.

Some approaches do not use the gradient:

- Hill climbing
- Genetic algorithms

We will focus on stochastic gradient ascent, which is often quite efficient (and easy to use with deep nets).

## Policy Gradient

Let $J(\theta)$ be any policy objective function. Policy gradient algorithms search for a local maximum in $J(\theta)$ by ascending the gradient of the policy objective w.r.t. the parameters $\theta$:

$
\Delta \theta=\alpha \nabla_{\theta} J(\theta)
$

where $\nabla_{\theta} J(\theta)$ is the policy gradient

$
\nabla_{\theta} J(\theta)=\left(\begin{array}{c}
\frac{\partial J(\theta)}{\partial \theta_{1}} \\
\vdots \\
\frac{\partial J(\theta)}{\partial \theta_{n}}
\end{array}\right)
$

and $\alpha$ is a step-size parameter.
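Before deriving a sampled estimator for $\nabla_{\theta} J(\theta)$, it may help to see the plain ascent rule $\Delta \theta = \alpha \nabla_{\theta} J(\theta)$ in code. The sketch below is not from the lecture; the helper names and toy objective are illustrative assumptions. It ascends a numerical finite-difference estimate of the gradient of a black-box objective:

```python
import numpy as np

def finite_difference_gradient(J, theta, eps=1e-4):
    """Central finite-difference estimate of grad J(theta),
    one pair of evaluations of J per parameter."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return grad

def gradient_ascent(J, theta0, alpha=0.1, num_steps=200):
    """Plain gradient ascent: theta <- theta + alpha * grad J(theta)."""
    theta = theta0.astype(float).copy()
    for _ in range(num_steps):
        theta += alpha * finite_difference_gradient(J, theta)
    return theta

# Toy usage: maximise a concave quadratic with optimum at theta = (1, 1, 1).
theta_star = gradient_ascent(lambda th: -np.sum((th - 1.0) ** 2), np.zeros(3))
print(theta_star)  # close to [1. 1. 1.]
```

Finite differences need $2n$ evaluations of $J$ per step, which is impractical for noisy, expensive reinforcement-learning objectives; the rest of the section derives the score-function (likelihood-ratio) estimator, which only needs samples gathered by the policy itself.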
We need to compute an estimate of the policy gradient. We assume the policy $\pi_{\theta}$ is differentiable almost everywhere.

- E.g., $\pi_{\theta}$ is a linear function of the agent state, or a neural network
- Or we could have a parameterized class of controllers

The goal is to compute

$
\nabla_{\theta} J(\theta)=\nabla_{\theta} \mathbb{E}_{\mu}\left[v_{\pi_{\theta}}(S)\right]
$

We will use Monte Carlo samples to compute this gradient. So, how does $\mathbb{E}_{\mu}\left[v_{\pi_{\theta}}(S)\right]$ depend on $\theta$?

Consider a one-step case (a contextual bandit) such that $J(\theta)=\mathbb{E}[R(S, A)]$. (The expectation is over $d$ for states and $\pi$ for actions.)

We cannot sample $R_{t+1}$ and then take a gradient: $R_{t+1}$ is just a number that does not depend on $\theta$. Instead, we use the score function trick:

$
\begin{aligned}
\nabla_{\theta} \mathbb{E}[R(S, A)] &=\nabla_{\theta} \sum_{s} d(s) \sum_{a} \pi_{\theta}(a \mid s) R(s, a) \\
&=\sum_{s} d(s) \sum_{a} \nabla_{\theta} \pi_{\theta}(a \mid s) R(s, a) \\
&=\sum_{s} d(s) \sum_{a} \pi_{\theta}(a \mid s) \frac{\nabla_{\theta} \pi_{\theta}(a \mid s)}{\pi_{\theta}(a \mid s)} R(s, a) \\
&=\sum_{s} d(s) \sum_{a} \pi_{\theta}(a \mid s) \nabla_{\theta} \log \pi_{\theta}(a \mid s) R(s, a) \\
&=\mathbb{E}\left[\nabla_{\theta} \log \pi_{\theta}(A \mid S) R(S, A)\right]
\end{aligned}
$

The right-hand side is an expectation of a gradient that we can sample. Our stochastic policy-gradient update is then

$
\theta_{t+1}=\theta_{t}+\alpha R_{t+1} \nabla_{\theta} \log \pi_{\theta_{t}}\left(A_{t} \mid S_{t}\right)
$

In expectation, this follows the actual gradient, so it is a pure stochastic gradient algorithm. The intuition behind this gradient: increase the probability of actions that yield high rewards.

### Example: Softmax Policy

Consider a softmax policy on action preferences $h(s, a)$ as an example. The probability of an action is proportional to its exponentiated preference:

$
\pi_{\theta}(a \mid s)=\frac{e^{h(s, a)}}{\sum_{b} e^{h(s, b)}}
$

The gradient of the log probability is

$
\nabla_{\theta} \log \pi_{\theta}(a \mid s)=\nabla_{\theta} h(s, a)-\sum_{b} \pi_{\theta}(b \mid s) \nabla_{\theta} h(s, b)
$
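As a concrete, self-contained sketch of the one-step (contextual bandit) case, the snippet below implements the update $\theta_{t+1}=\theta_{t}+\alpha R_{t+1} \nabla_{\theta} \log \pi_{\theta_{t}}(A_{t} \mid S_{t})$ for a tabular softmax policy with preferences $h(s, a)=\theta[s, a]$, so the softmax log-gradient above reduces to $\mathbf{1}\{b=a\}-\pi_{\theta}(b \mid s)$ for the preferences of the sampled state. The environment, reward function, and all names are made-up assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 3, 4
theta = np.zeros((n_states, n_actions))   # action preferences h(s, a) = theta[s, a]
alpha = 0.1

def softmax_policy(s):
    prefs = theta[s]
    e = np.exp(prefs - prefs.max())       # subtract max for numerical stability
    return e / e.sum()

def reward(s, a):
    # Hypothetical bandit: the best action in state s is s % n_actions, plus noise.
    return float(a == s % n_actions) + 0.1 * rng.standard_normal()

for t in range(5000):
    s = rng.integers(n_states)            # state sampled from d
    probs = softmax_policy(s)
    a = rng.choice(n_actions, p=probs)    # action sampled from pi_theta(. | s)
    r = reward(s, a)

    # grad_{theta[s, b]} log pi(a | s) = 1{b == a} - pi(b | s)
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0

    # theta_{t+1} = theta_t + alpha * R * grad log pi(A | S)
    theta[s] += alpha * r * grad_log_pi

print(np.round(softmax_policy(0), 2))     # most probability mass on action 0
```

Because the reward is higher for the "correct" action in each state, the update pushes probability mass toward those actions, matching the intuition above.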
## Policy Gradient Theorem

The policy gradient theorem gives the gradient of the expected cumulative reward with respect to the policy parameters:

$
\nabla J(\theta) \propto \mathbb{E}\left[\sum_{t=0}^{\infty} \nabla \log \pi_\theta(a_t \mid s_t) \cdot Q(s_t, a_t)\right]
$

where:

- $\nabla J(\theta)$ is the gradient of the expected return
- $\pi_\theta(a_t \mid s_t)$ is the policy (the probability of action $a_t$ in state $s_t$)
- $Q(s_t, a_t)$ is the action-value function

**Key insight:** Instead of using actual returns $G_t$ (as in [[REINFORCE - Score Function Estimator]]), we can substitute Q-values, since they represent expected returns. This can dramatically reduce variance and enables actor-critic methods, where we learn Q-values separately and use them to update the policy more efficiently than with noisy episode returns.

## Policy gradients on trajectories

- Policy gradients _do not_ need to know the dynamics.
    - Kind of surprising; shouldn't we need to know how the policy influences the states?

Consider a trajectory $\zeta=S_{0}, A_{0}, R_{1}, S_{1}, A_{1}, R_{2}, S_{2}, \ldots$ with return $G(\zeta)$. By the score function trick,

$
\nabla_{\theta} J_{\theta}(\pi)=\nabla_{\theta} \mathbb{E}[G(\zeta)]=\mathbb{E}\left[G(\zeta) \nabla_{\theta} \log p(\zeta)\right]
$

where

$
\begin{aligned}
\nabla_{\theta} \log p(\zeta) &=\nabla_{\theta} \log \left[p\left(S_{0}\right) \pi\left(A_{0} \mid S_{0}\right) p\left(S_{1} \mid S_{0}, A_{0}\right) \pi\left(A_{1} \mid S_{1}\right) \cdots\right] \\
&=\nabla_{\theta}\left[\log p\left(S_{0}\right)+\log \pi\left(A_{0} \mid S_{0}\right)+\log p\left(S_{1} \mid S_{0}, A_{0}\right)+\log \pi\left(A_{1} \mid S_{1}\right)+\cdots\right]
\end{aligned}
$

The gradient of the transition probabilities with respect to the policy parameters is zero: the transitions are conditioned on the actions, not on the policy parameters. Therefore,

$
\nabla_{\theta} \log p(\zeta) =\nabla_{\theta}\left[\log \pi\left(A_{0} \mid S_{0}\right)+\log \pi\left(A_{1} \mid S_{1}\right)+\cdots\right]
$

So:

$
\nabla_{\theta} J_{\theta}(\pi)=\mathbb{E}\left[G(\zeta) \nabla_{\theta} \sum_{t=0} \log \pi\left(A_{t} \mid S_{t}\right)\right]=\mathbb{E}\left[\left(\sum_{t=0} \gamma^{t} R_{t+1}\right)\left(\nabla_{\theta} \sum_{t=0} \log \pi\left(A_{t} \mid S_{t}\right)\right)\right]
$

The parameters only affect the decisions, not the dynamics; that is why the dynamics do not show up in the gradient of the objective.

Consider again the trajectory $\zeta=S_{0}, A_{0}, R_{1}, S_{1}, A_{1}, R_{2}, S_{2}, \ldots$ with return $G(\zeta)$ and, for simplicity, no discounting ($\gamma=1$):

$
\nabla_{\theta} J_{\theta}(\pi)=\mathbb{E}\left[\left(\sum_{t=0} R_{t+1}\right)\left(\nabla_{\theta} \sum_{t=0} \log \pi\left(A_{t} \mid S_{t}\right)\right)\right]
$

but some of these rewards do not depend on some of these actions. In particular, $\sum_{t=0}^{k} R_{t+1}$ does not depend on the actions $A_{k+1}, A_{k+2}, \ldots$, so

$
\begin{aligned}
\nabla_{\theta} J_{\theta}(\pi) &=\mathbb{E}\left[\sum_{t=0} \nabla_{\theta} \log \pi\left(A_{t} \mid S_{t}\right) \sum_{i=0} R_{i+1}\right] \\
&=\mathbb{E}\left[\sum_{t=0} \nabla_{\theta} \log \pi\left(A_{t} \mid S_{t}\right) \sum_{i=t} R_{i+1}\right] \\
&=\mathbb{E}\left[\sum_{t=0} \nabla_{\theta} \log \pi\left(A_{t} \mid S_{t}\right) q_{\pi}\left(S_{t}, A_{t}\right)\right]
\end{aligned}
$

## Baselines: reduce variance

Consider a baseline $b$:

$
\begin{aligned}
\mathbb{E}\left[b \nabla_{\theta} \log \pi\left(A_{t} \mid S_{t}\right)\right] &=\mathbb{E}\left[\sum_{a} \pi\left(a \mid S_{t}\right) b \nabla_{\theta} \log \pi\left(a \mid S_{t}\right)\right] \\
&=\mathbb{E}\left[b \nabla_{\theta} \sum_{a} \pi\left(a \mid S_{t}\right)\right] \\
&=\mathbb{E}\left[b \nabla_{\theta} 1\right] \\
&=0
\end{aligned}
$

This holds only if $b$ does not depend on the action, though it can be a function of the state. It implies that we can subtract a baseline to reduce variance without biasing the gradient. A good baseline is $v_{\pi}\left(S_{t}\right)$. So,

$
\nabla_{\theta} J_{\theta}(\pi)=\mathbb{E}\left[\sum_{t=0} \nabla_{\theta} \log \pi\left(A_{t} \mid S_{t}\right)\left(q_{\pi}\left(S_{t}, A_{t}\right)-v_{\pi}\left(S_{t}\right)\right)\right]
$

Typically, we estimate $v_{w}(s)$ explicitly and sample

$
q_{\pi}\left(S_{t}, A_{t}\right) \approx G_{t}^{(n)}
$

For instance, $G_{t}^{(1)}=R_{t+1}+\gamma v_{w}\left(S_{t+1}\right)$.
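As a minimal executable sketch of this baselined estimator (not from the lecture), the snippet below runs REINFORCE with a learned state-value baseline, assuming a tabular softmax policy, a tabular Monte Carlo estimate of $v_{w}$, and a made-up five-state corridor task; it samples $q_{\pi}(S_t, A_t)$ with the full reward-to-go return rather than an $n$-step target, and every name and constant is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy episodic corridor: states 0..4, actions 0 (left) / 1 (right),
# reward +1 on reaching state 4, 0 otherwise. The task is made up
# purely to exercise the update.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4
gamma, alpha_pi, alpha_v = 0.99, 0.1, 0.1

theta = np.zeros((N_STATES, N_ACTIONS))   # softmax action preferences
v = np.zeros(N_STATES)                    # learned baseline v_w(s)

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else s + 1
    return s2, float(s2 == GOAL), s2 == GOAL

for episode in range(2000):
    # Generate one episode with the current policy.
    s, traj, done = 0, [], False
    while not done:
        a = rng.choice(N_ACTIONS, p=policy(s))
        s2, r, done = step(s, a)
        traj.append((s, a, r))
        s = s2

    # Walk backwards to compute reward-to-go returns G_t, then apply
    # the baselined policy-gradient update with advantage G_t - v(S_t).
    G = 0.0
    for s, a, r in reversed(traj):
        G = r + gamma * G
        advantage = G - v[s]              # sampled q minus baseline
        v[s] += alpha_v * advantage       # move baseline toward the return
        grad_log_pi = -policy(s)
        grad_log_pi[a] += 1.0             # grad log softmax for chosen action
        theta[s] += alpha_pi * advantage * grad_log_pi

print([np.round(policy(s)[1], 2) for s in range(GOAL)])  # P(right) per state
```

Swapping the reward-to-go sample for the one-step target $G_{t}^{(1)}=R_{t+1}+\gamma v_{w}\left(S_{t+1}\right)$ would turn this into a simple actor-critic-style update.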