# Pathwise Gradient Estimator

Also known as the 'reparameterization trick'. Often a complex probability density can be rewritten as a deterministic transformation of a simpler one. Pathwise estimators work by transforming simple random samples (like standard normal) into samples from complex distributions using a deterministic function.

$$
\widehat{x} \sim p_{\varphi}(x) \Leftrightarrow \widehat{x}=g(\hat{\varepsilon}, \varphi), \quad \hat{\varepsilon} \sim p(\varepsilon)
$$

Because of this, the stochasticity now flows through a simple probability density, while the complexity comes from the deterministic transformation. For a neural network this means backpropagation is possible, since backprop only works through deterministic functions.

At the heart of this method is the change of variables formula

$$
p_{\varphi}(\boldsymbol{x})=p(\boldsymbol{\varepsilon})\left|\operatorname{det} \nabla_{\boldsymbol{\varepsilon}} g(\boldsymbol{\varepsilon}, \varphi)\right|^{-1}
$$

We have seen [[Normalizing Flows]] use the same property.

## Deriving the pathwise gradient estimator

As a use case, take the following expectation from the VAE:

$$
\nabla_{\varphi} \mathbb{E}_{\mathbf{z} \sim q_{\varphi}(\mathbf{z} \mid x)}[\log p(x \mid \mathbf{z})]
$$

We also have

$$
\mathbf{z}=g(\varepsilon, \varphi \mid x)=\mu_{x}+\varepsilon \cdot \sigma_{x}, \quad \text{where } \varphi=\left(\mu_{x}, \sigma_{x}\right) \Rightarrow d \mathbf{z}=\sigma_{x} \, d \varepsilon \text{ and } \left|\operatorname{det} \nabla_{\varepsilon} g(\varepsilon, \varphi \mid x)\right|=\sigma_{x}
$$

Now, using pathwise gradient estimation,

$$
\begin{aligned}
\nabla_{\varphi} \mathbb{E}_{\mathbf{z} \sim q_{\varphi}(\mathbf{z} \mid x)}[\log p(x \mid \mathbf{z})]
&=\nabla_{\varphi} \int_{\mathbf{z}} q_{\varphi}(\mathbf{z} \mid x) \log p(x \mid \mathbf{z}) \, d \mathbf{z} \\
&=\nabla_{\varphi} \int_{\varepsilon} \frac{1}{\sigma_{x}} p(\varepsilon) \log p(x \mid g(\varepsilon, \varphi \mid x)) \, \sigma_{x} \, d \varepsilon \\
&=\int_{\varepsilon} p(\varepsilon) \nabla_{\varphi} \log p(x \mid g(\varepsilon, \varphi \mid x)) \, d \varepsilon \\
&=\mathbb{E}_{\varepsilon \sim p(\varepsilon)}\left[\nabla_{\varphi} \log p(x \mid g(\varepsilon, \varphi \mid x))\right] \\
&\approx \frac{1}{n} \sum_{i} \nabla_{\varphi} \log p\left(x \mid g\left(\varepsilon^{(i)}, \varphi \mid x\right)\right), \quad \varepsilon^{(i)} \sim p(\varepsilon)
\end{aligned}
$$
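As a concrete illustration, here is a minimal PyTorch sketch of the final Monte Carlo estimator above, assuming a diagonal-Gaussian $q_{\varphi}(\mathbf{z} \mid x)$ and a hypothetical unit-variance Gaussian decoder for $\log p(x \mid \mathbf{z})$; the names, shapes, and decoder are illustrative assumptions, not part of the original derivation.

```python
import torch

# Variational parameters phi = (mu_x, sigma_x) of q_phi(z|x); sigma_x > 0 via exp.
mu_x = torch.zeros(2, requires_grad=True)
log_sigma_x = torch.zeros(2, requires_grad=True)
x = torch.tensor([1.0, -0.5])

def log_p_x_given_z(x, z):
    # Hypothetical decoder likelihood: p(x|z) = N(x; z, I), up to an additive constant.
    return -0.5 * ((x - z) ** 2).sum(-1)

n = 16
eps = torch.randn(n, 2)               # eps^(i) ~ p(eps) = N(0, I)
z = mu_x + eps * log_sigma_x.exp()    # z^(i) = g(eps^(i), phi) = mu_x + eps^(i) * sigma_x

# Pathwise estimator: (1/n) * sum_i grad_phi log p(x | g(eps^(i), phi)).
# Autograd backpropagates through the deterministic g, not through the sampling step.
estimate = log_p_x_given_z(x, z).mean()
estimate.backward()
print(mu_x.grad, log_sigma_x.grad)
```

Parameterizing $\sigma_{x}$ through its logarithm keeps it positive without explicit constraints; the gradient itself flows only through the deterministic map $g$, exactly as the derivation requires.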
## Properties

**No need to know the pdf explicitly**: only the deterministic transformation and the base sampling distribution are needed.

**They require differentiable cost functions**: pathwise estimators work by reparameterizing $x = g(\varepsilon, \theta)$, where $\varepsilon$ is parameter-free noise, then computing

$$
\nabla_{\theta} \mathbb{E}[f(x)] = \mathbb{E}\left[\nabla_{\theta} f(g(\varepsilon, \theta))\right] = \mathbb{E}\left[\nabla_{x} f(x) \cdot \nabla_{\theta} g(\varepsilon, \theta)\right]
$$

This requires $\nabla_{x} f(x)$ to exist, so $f(x)$ must be differentiable. [[REINFORCE - Score Function Estimator]] circumvents this by using the identity $\nabla_{\theta} \mathbb{E}[f(x)] = \mathbb{E}[f(x) \nabla_{\theta} \log p(x \mid \theta)]$. There, gradients are only taken of the log probability $p(x \mid \theta)$, while the cost function $f(x)$ appears as a multiplicative weight in the expectation. Since $f(x)$ is never differentiated, it can be any function: discontinuous, discrete-valued, or non-differentiable.

**Low variance in general**:

- Lower than the [[REINFORCE - Score Function Estimator]].
- Example: comparing the VAE score-function (left) and pathwise (right) gradient estimates, the score-function estimator has an extra multiplicative term $\log p(x \mid \mathbf{z}^{(i)})$, which increases variance.

$$
\frac{1}{n} \sum_{i} \log p\left(x \mid \mathbf{z}^{(i)}\right) \nabla_{\varphi} \log q_{\varphi}\left(\mathbf{z}^{(i)} \mid x\right) \qquad \frac{1}{n} \sum_{i} \nabla_{\varphi} \log p\left(x \mid g\left(\varepsilon^{(i)}, \varphi\right)\right)
$$

**Very efficient**: this is the reason they were proposed for VAEs. Even a single sample suffices, no matter the dimensionality.
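To make the variance and single-sample claims concrete, here is a toy comparison of my own (an illustrative assumption, not from the note): estimating $\nabla_{\mu} \mathbb{E}_{z \sim \mathcal{N}(\mu, \sigma^{2})}[z^{2}]$, whose true value is $2\mu$, with single-sample score-function and pathwise estimates.

```python
import torch

# Toy problem (illustrative assumption): f(z) = z^2 with z ~ N(mu, sigma^2).
# True gradient: d/dmu E[z^2] = d/dmu (mu^2 + sigma^2) = 2 * mu.
mu, sigma = 1.5, 2.0

n = 100_000                      # n independent single-sample gradient estimates
eps = torch.randn(n)             # eps ~ N(0, 1)
z = mu + sigma * eps             # reparameterized samples, z ~ N(mu, sigma^2)

# Score-function (REINFORCE): f(z) * d/dmu log N(z; mu, sigma^2) = z^2 * (z - mu) / sigma^2
score_fn = z**2 * (z - mu) / sigma**2

# Pathwise: d/dmu f(mu + sigma * eps) = 2 * (mu + sigma * eps)
pathwise = 2 * (mu + sigma * eps)

print(f"true gradient : {2 * mu:.2f}")
print(f"score-function: mean {score_fn.mean().item():.2f}  variance {score_fn.var().item():.1f}")
print(f"pathwise      : mean {pathwise.mean().item():.2f}  variance {pathwise.var().item():.1f}")
```

Both estimators are unbiased, but the pathwise variance is only $4\sigma^{2}$ here, several times smaller than the score-function variance, which is why a single pathwise sample is often enough in practice.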