# REINFORCE - Score Function Estimator
REINFORCE exploits the following property of the logarithm:
$$
\frac{d}{d x} \log f(x)=\frac{1}{f(x)} \cdot \frac{d f(x)}{d x}
$$
When $f$ is a probability density $p_{\varphi}(x)$ with parameters $\varphi$, the same identity gives
$$
\nabla_{\varphi} \log p_{\varphi}(x)=\frac{1}{p_{\varphi}(x)} \nabla_{\varphi} p_{\varphi}(x)
\quad \Rightarrow \quad
\nabla_{\varphi} p_{\varphi}(x)=p_{\varphi}(x)\, \nabla_{\varphi} \log p_{\varphi}(x)
$$
where $\nabla_{\varphi} \log p_{\varphi}(x)$ is known as the score function.
This gives a neat trick: the gradient of a density is the density itself times its score, so the result can again be read as (part of) an expectation under that density.
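A quick numerical sanity check of the identity (a sketch, not from the note), assuming a toy Gaussian $p_\varphi(x) = \mathcal{N}(x; \varphi, 1)$ parameterized by its mean:
```python
import numpy as np

# Check: finite-difference gradient of the density w.r.t. phi should match
# p_phi(x) * grad_phi log p_phi(x), for p_phi(x) = N(x; phi, 1).
def p(x, phi):
    return np.exp(-0.5 * (x - phi) ** 2) / np.sqrt(2 * np.pi)

def grad_log_p(x, phi):
    # grad_phi log N(x; phi, 1) = (x - phi)
    return x - phi

x, phi, eps = 0.3, 1.2, 1e-6
grad_fd = (p(x, phi + eps) - p(x, phi - eps)) / (2 * eps)   # finite differences
grad_trick = p(x, phi) * grad_log_p(x, phi)                 # log-derivative trick
print(grad_fd, grad_trick)  # the two numbers should agree closely
```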
**The key insight:** This is powerful for estimating gradients of expectations $\nabla_\varphi E_{x \sim p_\varphi}[f(x)]$ when we can't backpropagate through the sampling process—particularly with discrete distributions or stochastic nodes in computation graphs. Note that we're not taking gradients of the reward itself. Instead, we're taking gradients of the log-probability of actions that led to those rewards: $\nabla_{\varphi} \log p_{\varphi}(x)$. This lets us do gradient-based policy optimization even when the environment gives us non-differentiable, sparse, or delayed rewards.
**Example:** In a game, the reward might be +1 for winning, 0 for losing (non-differentiable). But as long as our policy network $p_\varphi(\text{action}|\text{state})$ is differentiable, we can still optimize it using REINFORCE.
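A minimal sketch of this in PyTorch, for a hypothetical one-step game: a 3-armed bandit whose win probabilities the policy never sees, so the reward is a non-differentiable black box (the environment, learning rate, and step count are assumptions for illustration):
```python
import torch

# Policy p_phi(action): a Categorical over 3 arms with learnable logits phi.
torch.manual_seed(0)
logits = torch.zeros(3, requires_grad=True)          # policy parameters phi
opt = torch.optim.SGD([logits], lr=0.1)
win_prob = torch.tensor([0.2, 0.5, 0.8])             # hidden, hypothetical environment

for step in range(2000):
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                           # non-differentiable sampling step
    reward = torch.bernoulli(win_prob[action])       # +1 win / 0 loss from the black box
    # Surrogate loss whose gradient is -reward * grad_phi log p_phi(action),
    # i.e. the (negative) score-function / REINFORCE estimate of the policy gradient.
    loss = -reward * dist.log_prob(action)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.softmax(logits, dim=0))  # mass should have shifted toward arm 2 (win prob 0.8)
```
Note that the reward enters only as a scalar weight on the differentiable term $\log p_\varphi(\text{action})$, which is exactly the point of the trick.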
## Deriving the score-function estimator for VAE
As a use case, consider the following expectation from the VAE objective (the reconstruction term of the ELBO): $\nabla_{\varphi} \mathbb{E}_{\mathbf{z} \sim q_{\varphi}(z \mid x)}[\log p(x \mid z)]$
$$
\begin{aligned}
\nabla_{\varphi} \mathbb{E}_{\mathbf{z} \sim q_{\varphi}(z \mid x)}[\log p(x \mid z)]
&=\nabla_{\varphi} \int_{z} \log p(x \mid z)\, q_{\varphi}(z \mid x)\, d z \\
&=\int_{z} \log p(x \mid z)\, \nabla_{\varphi} q_{\varphi}(z \mid x)\, d z \\
&=\int_{z} \log p(x \mid z)\, q_{\varphi}(z \mid x)\, \nabla_{\varphi} \log q_{\varphi}(z \mid x)\, d z \\
&=\mathbb{E}_{z \sim q_{\varphi}(z \mid x)}\left[\log p(x \mid z)\, \nabla_{\varphi} \log q_{\varphi}(z \mid x)\right] \\
&\approx \frac{1}{n} \sum_{i=1}^{n} \log p\left(x \mid z^{(i)}\right) \nabla_{\varphi} \log q_{\varphi}\left(z^{(i)} \mid x\right), \quad z^{(i)} \sim q_{\varphi}(z \mid x)
\end{aligned}
$$
Thus with REINFORCE we rewrote $\nabla_{\varphi} q_{\varphi}(z \mid x)$, which is not a density, as the density $q_{\varphi}(z \mid x)$ times its score, turning the gradient back into an expectation that we can approximate with Monte Carlo sampling. (Moving the gradient inside the integral in the second step assumes the usual regularity conditions for exchanging differentiation and integration.)
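Because $\log p(x \mid z)$ is model-specific, here is a hedged sketch with a stand-in integrand: $f(z) = z^2$ and $q_\varphi(z) = \mathcal{N}(z; \varphi, 1)$ (both assumptions for illustration), for which the true gradient $\nabla_\varphi \mathbb{E}[z^2] = 2\varphi$ is known, so the estimator can be checked:
```python
import numpy as np

# Monte Carlo score-function estimate of grad_phi E_{z ~ N(phi, 1)}[f(z)], f(z) = z**2.
rng = np.random.default_rng(0)
phi, n = 1.5, 100_000

z = rng.normal(loc=phi, scale=1.0, size=n)   # z^(i) ~ q_phi(z)
f = z ** 2                                   # f(z^(i)), stand-in for log p(x | z^(i))
score = z - phi                              # grad_phi log N(z; phi, 1)

grad_estimate = np.mean(f * score)           # (1/n) sum_i f(z^(i)) grad_phi log q_phi(z^(i))
print(grad_estimate, 2 * phi)                # MC estimate vs. exact gradient 2*phi
```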
## Score-function estimator properties
Works with any function $f(x)$, even non-differentiable ones
- Good for simulators or black-box functions (e.g. RL rewards)
The density $p_{\varphi}(x)$ must be differentiable w.r.t. its parameters $\varphi$
It must be easy to sample from $p_{\varphi}(x)$
Unbiased estimator
High variance estimator
- The gradient estimate deviates a lot from sample to sample, but in the limit of many samples it is accurate
- The variance grows with the dimensionality of $x$
- If you sample only once per update, this can slow down or even stop learning
- Variance reduction methods like [[Control Variates]] are usually needed (see the sketch below)
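A rough illustration of the variance problem and of a simple baseline (a control-variate-style correction), reusing the toy setup above; the baseline choice $b = \mathbb{E}[f(z)]$ is an assumption for illustration, not a recommended recipe:
```python
import numpy as np

# Same toy setup: f(z) = z**2, q_phi(z) = N(z; phi, 1), true gradient 2*phi.
# Subtracting a constant baseline b from f keeps the estimator unbiased, because
# E[b * grad_phi log q_phi(z)] = b * E[z - phi] = 0, but it can shrink the variance.
rng = np.random.default_rng(0)
phi, n, trials = 1.5, 10, 5_000
b = phi ** 2 + 1                                  # here the true E[f(z)], for illustration only

plain, baselined = [], []
for _ in range(trials):
    z = rng.normal(loc=phi, scale=1.0, size=n)    # n samples per gradient estimate
    f, score = z ** 2, z - phi
    plain.append(np.mean(f * score))
    baselined.append(np.mean((f - b) * score))

print(np.mean(plain), np.mean(baselined), 2 * phi)   # both means are close to 2*phi (unbiased)
print(np.var(plain), np.var(baselined))              # baselined estimator has smaller variance
```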
---
## References