# REINFORCE - Monte Carlo Policy Gradient

Recall the overall strategy of stochastic gradient ascent in [[Policy Gradient]]: we need a way to obtain samples whose expected gradient is proportional to the actual gradient of the performance measure as a function of the policy parameter. The sample gradients need only be *proportional* to the gradient because any constant of proportionality can be absorbed into the step size, which is otherwise arbitrary. The policy gradient theorem gives an exact expression:

$$
\begin{aligned}
\nabla J(\boldsymbol{\theta}) & \propto \sum_{s} \mu(s) \sum_{a} q_{\pi}(s, a) \nabla \pi(a \mid s, \boldsymbol{\theta}) \\
&= \mathbb{E}_{\pi}\left[\sum_{a} q_{\pi}\left(S_{t}, a\right) \nabla \pi\left(a \mid S_{t}, \boldsymbol{\theta}\right)\right]
\end{aligned}
$$

We want an expression that involves only $A_{t}$, the one action actually taken at time $t$. Multiplying and dividing the summand by $\pi\left(a \mid S_{t}, \boldsymbol{\theta}\right)$ turns the sum over actions into an expectation over $A_{t} \sim \pi$:

$$
\begin{aligned}
\nabla J(\boldsymbol{\theta}) & \propto \mathbb{E}_{\pi}\left[\sum_{a} \pi\left(a \mid S_{t}, \boldsymbol{\theta}\right) q_{\pi}\left(S_{t}, a\right) \frac{\nabla \pi\left(a \mid S_{t}, \boldsymbol{\theta}\right)}{\pi\left(a \mid S_{t}, \boldsymbol{\theta}\right)}\right] \\
&= \mathbb{E}_{\pi}\left[q_{\pi}\left(S_{t}, A_{t}\right) \frac{\nabla \pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}\right)}{\pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}\right)}\right] \quad \text{(replacing } a \text{ by the sample } A_{t} \sim \pi\text{)} \\
&= \mathbb{E}_{\pi}\left[G_{t} \frac{\nabla \pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}\right)}{\pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}\right)}\right] \quad \text{(because } \mathbb{E}_{\pi}\left[G_{t} \mid S_{t}, A_{t}\right]=q_{\pi}\left(S_{t}, A_{t}\right)\text{)}
\end{aligned}
$$

The final expression in brackets is exactly what is needed: a quantity that can be sampled on each time step and whose expectation is proportional to the gradient. The REINFORCE update is therefore

$$
\boldsymbol{\theta}_{t+1} \doteq \boldsymbol{\theta}_{t}+\alpha G_{t} \frac{\nabla \pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}_{t}\right)}{\pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}_{t}\right)}
$$

Each update moves the parameter vector in the direction $\nabla \pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}_{t}\right)$ by an amount proportional to the return and inversely proportional to the action probability. The former makes sense because it moves the parameters most in the directions that favor actions yielding the highest return. The latter makes sense because otherwise frequently selected actions would be at an advantage (they would be updated more often) and might win out even if they did not yield the highest return. Note that $\frac{\nabla \pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}_{t}\right)}{\pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}_{t}\right)}=\nabla \ln \pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}_{t}\right)$, the *eligibility vector*, which is the form usually used in implementations.

![[REINFORCE.png]]

As a [[Stochastic gradients]] method, REINFORCE has good theoretical convergence properties. However, as a Monte Carlo method it can have high variance and therefore learn slowly.

## Reducing Variance with a Baseline

A baseline is a [[Control variates]] technique: subtracting it leaves the expected value of the update unchanged but can substantially reduce its variance. In [[Multi-Armed Bandits]] the baseline is just a number, but for MDPs the baseline should vary with state. One natural choice is a learned estimate of the state value, $\hat{v}\left(S_{t}, \mathbf{w}\right)$. The generalized update is

$$
\boldsymbol{\theta}_{t+1} \doteq \boldsymbol{\theta}_{t}+\alpha\left(G_{t}-b\left(S_{t}\right)\right) \frac{\nabla \pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}_{t}\right)}{\pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}_{t}\right)}
$$

Because $b\left(S_{t}\right)$ does not depend on the action, the subtracted term has zero expectation, so the update remains unbiased while its variance can be much smaller. A minimal code sketch of both the plain and the baseline update is given at the end of this note.

![[REINFORCE with Baseline.png]]

---

## References

1. Chapter 13, *Reinforcement Learning: An Introduction* (2nd Edition), Sutton and Barto
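
---

## Appendix: Code Sketch

To make the updates above concrete, here is a minimal sketch of episodic REINFORCE with an optional linear state-value baseline for a linear-softmax policy. This is an illustrative sketch under assumptions, not the book's reference implementation: the feature map `features`, the environment interface (`env.reset()` returning a state, `env.step(a)` returning `(state, reward, done)`), and the hyperparameter values are all placeholders.

```python
import numpy as np


def softmax(prefs):
    """Numerically stable softmax over action preferences."""
    z = prefs - prefs.max()
    e = np.exp(z)
    return e / e.sum()


def action_probs(theta, x):
    """pi(. | s, theta) for a linear-softmax policy; x is the feature vector of s."""
    return softmax(theta @ x)                        # theta: (num_actions, num_features)


def grad_log_pi(theta, x, a):
    """Eligibility vector grad ln pi(a | s, theta); for linear-softmax it is
    (one_hot(a) - pi(. | s, theta)) outer x."""
    probs = action_probs(theta, x)
    one_hot = np.zeros(len(probs))
    one_hot[a] = 1.0
    return np.outer(one_hot - probs, x)


def reinforce(env, features, num_actions, num_features,
              alpha_theta=2e-3, alpha_w=2e-2, gamma=1.0,
              use_baseline=True, episodes=1000, seed=0):
    """Episodic REINFORCE, optionally with a learned state-value baseline."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((num_actions, num_features))    # policy parameters
    w = np.zeros(num_features)                       # baseline weights: v_hat(s, w) = w . x(s)

    for _ in range(episodes):
        # 1. Generate an episode S_0, A_0, R_1, ... following pi(. | ., theta).
        xs, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            x = features(s)
            a = rng.choice(num_actions, p=action_probs(theta, x))
            s, r, done = env.step(a)
            xs.append(x)
            actions.append(a)
            rewards.append(r)

        # 2. Walk backwards through the episode, accumulating the return G_t.
        G = 0.0
        for t in reversed(range(len(rewards))):
            G = rewards[t] + gamma * G
            x, a = xs[t], actions[t]
            delta = G - (w @ x if use_baseline else 0.0)   # G_t - b(S_t)
            if use_baseline:
                w += alpha_w * delta * x                   # improve the value estimate
            # theta <- theta + alpha * gamma^t * (G_t - b(S_t)) * grad ln pi(A_t | S_t, theta)
            theta += alpha_theta * (gamma ** t) * delta * grad_log_pi(theta, x, a)

    return theta, w
```

With `use_baseline=False` and `gamma=1.0` the inner update reduces exactly to the plain REINFORCE update $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta}+\alpha G_{t} \nabla \ln \pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}\right)$; the $\gamma^{t}$ factor follows the boxed algorithm for the discounted episodic case.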