# Pathwise gradient estimator
Also known as the 'reparameterization trick'.
Often a sample from a complex probability density can be written as a deterministic function of a sample from a simpler density.
Instead of sampling from the complex pdf, we can sample from the simpler one and then transform the sample deterministically:
$
\widehat{x} \sim p_{\varphi}(x) \Leftrightarrow \widehat{x}=g(\hat{\varepsilon}, \varphi), \hat{\varepsilon} \sim p(\varepsilon)
$
Because of this, the stochasticity now flows through a simple base density, while the complexity sits entirely in the deterministic transformation. For neural networks this means backpropagation, which works only through deterministic functions, becomes possible.
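A minimal sketch of this, assuming PyTorch (the parameter values and the cost are illustrative): the only randomness is the base sample $\varepsilon$, and gradients reach the parameters through the deterministic transform by ordinary backprop.

```python
import torch

# Variational parameters phi = (mu, sigma); requires_grad so backprop reaches them.
mu = torch.tensor([0.5, -1.0], requires_grad=True)
log_sigma = torch.tensor([0.0, 0.2], requires_grad=True)  # parameterize sigma > 0

# Base sample eps ~ p(eps) = N(0, I): the only source of stochasticity.
eps = torch.randn(2)

# Deterministic transform g(eps, phi) = mu + eps * sigma.
z = mu + eps * torch.exp(log_sigma)

# Any differentiable cost of z now backpropagates into mu and log_sigma.
cost = (z ** 2).sum()
cost.backward()
print(mu.grad, log_sigma.grad)
```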
At the heart of this method is the change of variables formula
$
p_{\varphi}(\boldsymbol{x})=p(\boldsymbol{\varepsilon})\left|\operatorname{det} \nabla_{\boldsymbol{\varepsilon}} g(\boldsymbol{\varepsilon}, \varphi)\right|^{-1}
$
We have seen [[Normalizing Flows]] using the same property.
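For example, in the scalar Gaussian case used below, the formula recovers the familiar density:
$
g(\varepsilon, \varphi)=\mu+\sigma \varepsilon, \quad \varepsilon \sim \mathcal{N}(0,1) \quad \Rightarrow \quad p_{\varphi}(x)=\mathcal{N}\left(\tfrac{x-\mu}{\sigma} ; 0,1\right) \cdot \frac{1}{\sigma}=\mathcal{N}\left(x ; \mu, \sigma^{2}\right)
$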
## Deriving the pathwise gradient estimator
As a use case, consider the following expectation from the VAE objective: $\nabla_{\varphi} \mathbb{E}_{\mathbf{z} \sim q_{\varphi}(\mathbf{z} \mid x)}[\log p(x \mid \mathbf{z})]$
We also have $z=g(\varepsilon, \varphi \mid x)=\mu_{x}+\varepsilon \cdot \sigma_{x}$, where $\varphi=\left(\mu_{x}, \sigma_{x}\right)$ (written here for the scalar case), so $d z=\sigma_{x}\, d \varepsilon$
and $\left|\operatorname{det} \nabla_{\varepsilon} g(\varepsilon, \varphi \mid x)\right|=\sigma_{x}$.
Now using pathwise gradient estimation,
$
\begin{array}{l}
\nabla_{\varphi} \mathbb{E}_{\mathbf{z} \sim q_{\varphi}(\mathbf{z} \mid x)}[\log p(x \mid \mathbf{z})] \\
=\nabla_{\varphi} \int_{\mathbf{z}} q_{\varphi}(\mathbf{z} \mid x) \log p(x \mid \mathbf{z})\, d \mathbf{z} \\
=\nabla_{\varphi} \int_{\varepsilon} \frac{1}{\sigma_{x}}\, p(\varepsilon) \log p(x \mid g(\varepsilon, \varphi \mid x))\, \sigma_{x}\, d \varepsilon \\
=\int_{\varepsilon} p(\varepsilon)\, \nabla_{\varphi} \log p(x \mid g(\varepsilon, \varphi \mid x))\, d \varepsilon \\
=\mathbb{E}_{\varepsilon \sim p(\varepsilon)}\left[\nabla_{\varphi} \log p(x \mid g(\varepsilon, \varphi \mid x))\right] \\
\approx \frac{1}{n} \sum_{i} \nabla_{\varphi} \log p\left(x \mid g\left(\varepsilon^{(i)}, \varphi \mid x\right)\right), \quad \varepsilon^{(i)} \sim p(\varepsilon)
\end{array}
$
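A sketch of the final Monte Carlo line, assuming PyTorch; the `decode` map, the unit-variance Gaussian likelihood, and the shapes are illustrative stand-ins rather than anything from the source.

```python
import torch

def decode(z):
    # Hypothetical decoder: mean of p(x|z) as a fixed linear map (stand-in for a NN).
    W = torch.tensor([[1.0, 0.0], [0.5, -1.0]])
    return z @ W.T

# Variational parameters phi = (mu_x, sigma_x) of q_phi(z|x).
x = torch.tensor([1.0, 2.0])
mu = torch.zeros(2, requires_grad=True)
log_sigma = torch.zeros(2, requires_grad=True)

n = 10
eps = torch.randn(n, 2)              # eps^(i) ~ p(eps) = N(0, I)
z = mu + eps * torch.exp(log_sigma)  # z^(i) = g(eps^(i), phi | x)

# log p(x|z) under a unit-variance Gaussian likelihood (up to an additive constant).
log_px_given_z = -0.5 * ((x - decode(z)) ** 2).sum(dim=1)

# (1/n) sum_i log p(x | g(eps^(i), phi | x)); its gradients w.r.t. mu and
# log_sigma are the pathwise Monte Carlo estimate derived above.
estimate = log_px_given_z.mean()
estimate.backward()
print(mu.grad, log_sigma.grad)
```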
## Properties
- Works only with differentiable cost functions; otherwise we cannot compute $\nabla_{\varphi} f(x, g(\varepsilon, \varphi))$. Score-function estimators, in contrast, work with any cost function.
- No need to know the pdf explicitly; only the deterministic transformation and the base sampling distribution are required.
- Low variance in general:
    - Lower than the [[REINFORCE - score function estimator]].
    - Example: comparing the VAE score-function and pathwise gradients (below), the score-function estimator carries an extra multiplicative term $\log p(x \mid \mathbf{z}^{(i)})$, which increases variance; a toy numerical check follows the equation.
$
\underbrace{\frac{1}{n} \sum_{i} \log p\left(x \mid \mathbf{z}^{(i)}\right) \nabla_{\varphi} \log q_{\varphi}\left(\mathbf{z}^{(i)} \mid x\right)}_{\text{score function}} \qquad \underbrace{\frac{1}{n} \sum_{i} \nabla_{\varphi} \log p\left(x \mid g\left(\varepsilon^{(i)}, \varphi\right)\right)}_{\text{pathwise}}
$
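A toy numerical check of this comparison, assuming PyTorch; the cost $f(z)=z^{2}$ and the scalar Gaussian are illustrative choices, not from the source. Both single-sample estimators are unbiased for $\nabla_{\mu} \mathbb{E}\left[z^{2}\right]=2 \mu$; the printed variances show how much each spreads around it.

```python
import torch

mu, sigma, n = 1.0, 2.0, 100_000

# Pathwise: z = mu + sigma * eps, so grad_mu f(z) = f'(z) = 2 z for f(z) = z^2.
eps = torch.randn(n)
pathwise = 2.0 * (mu + sigma * eps)

# Score function: f(z) * grad_mu log N(z; mu, sigma^2) = z^2 * (z - mu) / sigma^2.
z = mu + sigma * torch.randn(n)
score_fn = z ** 2 * (z - mu) / sigma ** 2

# Both means estimate 2 * mu; the variances compare the per-sample spread.
for name, est in [("pathwise", pathwise), ("score function", score_fn)]:
    print(f"{name:>14}: mean={est.mean().item():.3f}  var={est.var().item():.1f}")
```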
Very efficient (one of the reasons it was proposed with the VAE): even a single sample often suffices, regardless of dimensionality.
---
## References