# Pathwise Gradient Estimator
Also known as the 'reparameterization trick'.
Often a sample from a complex distribution can be rewritten as a deterministic function of a sample from a simpler distribution. Pathwise estimators exploit this: they transform simple random samples (e.g. standard normal) into samples from the target distribution through a deterministic function.
$
\widehat{x} \sim p_{\varphi}(x) \Leftrightarrow \widehat{x}=g(\hat{\varepsilon}, \varphi), \hat{\varepsilon} \sim p(\varepsilon)
$
As a result, all stochasticity flows through the simple base density, while all the complexity comes from the deterministic transformation. For neural networks this means backpropagation, which only works through deterministic functions, becomes possible.
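A minimal sketch (PyTorch and the scalar Gaussian parameterization are assumptions for illustration) of how the reparameterized sample keeps the parameters inside the computation graph, so backprop reaches them:

```python
# Sketch: z ~ N(mu, sigma^2) via the deterministic transform g(eps, phi) = mu + sigma * eps.
# Gradients of any differentiable cost of z flow back to mu and log_sigma.
import torch

mu = torch.tensor(0.5, requires_grad=True)
log_sigma = torch.tensor(-1.0, requires_grad=True)  # parameterize sigma > 0 via its log

eps = torch.randn(1000)               # parameter-free base noise, eps ~ N(0, 1)
z = mu + torch.exp(log_sigma) * eps   # deterministic transform g(eps, phi)

loss = (z ** 2).mean()                # any differentiable cost of the samples
loss.backward()                       # backprop through the deterministic path
print(mu.grad, log_sigma.grad)        # gradients w.r.t. the distribution parameters
```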
At the heart of this method is the change of variables formula
$
p_{\varphi}(\boldsymbol{x})=p(\boldsymbol{\varepsilon})\left|\operatorname{det} \nabla_{\boldsymbol{\varepsilon}} g(\boldsymbol{\varepsilon}, \varphi)\right|^{-1}
$
We have seen [[Normalizing Flows]] using the same property.
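A quick numeric sanity check of this identity for the scalar Gaussian case $g(\varepsilon, \varphi)=\mu+\sigma \varepsilon$, where $\left|\operatorname{det} \nabla_{\varepsilon} g\right|=\sigma$ (the concrete values below are made up for illustration):

```python
# Check: N(x; mu, sigma^2) == N(eps; 0, 1) / sigma  at  eps = (x - mu) / sigma
import math

def normal_pdf(t, mu=0.0, sigma=1.0):
    return math.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

mu, sigma, x = 0.5, 2.0, 1.3
eps = (x - mu) / sigma
lhs = normal_pdf(x, mu, sigma)   # p_phi(x)
rhs = normal_pdf(eps) / sigma    # p(eps) * |det grad_eps g|^{-1}
print(lhs, rhs)                  # equal up to floating-point error
```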
## Deriving the pathwise gradient estimator
As a use case, take the following gradient of the VAE reconstruction term: $\nabla_{\varphi} \mathbb{E}_{\mathbf{z} \sim q_{\varphi}(\mathbf{z} \mid x)}[\log p(x \mid \mathbf{z})]$.
We also have, $z=g(\varepsilon, \varphi \mid x)=\mu_{x}+\varepsilon \cdot \sigma_{x},$ where $\varphi=\left(\mu_{x}, \sigma_{x}\right) \Rightarrow d z=\sigma_{x} d \varepsilon$
and $\left|\operatorname{det} \nabla_{\varepsilon} g(\varepsilon, \varphi \mid x)\right|=\sigma_{x}$
Now using pathwise gradient estimation,
$
\begin{array}{l}
\nabla_{\varphi} \mathbb{E}_{\mathbf{z} \sim q_{\varphi}(\mathbf{z} \mid x)}[\log p(x \mid \mathbf{z})] \\
=\nabla_{\varphi} \int_{\mathbf{z}} q_{\varphi}(\mathbf{z} \mid x) \log p(x \mid \mathbf{z}) \, d \mathbf{z} \\
=\nabla_{\varphi} \int_{\varepsilon} \frac{1}{\sigma_{x}} p(\varepsilon) \log p(x \mid g(\varepsilon, \varphi \mid x)) \, \sigma_{x} \, d \varepsilon \\
=\int_{\varepsilon} p(\varepsilon) \nabla_{\varphi} \log p(x \mid g(\varepsilon, \varphi \mid x)) \, d \varepsilon \\
=\mathbb{E}_{\varepsilon \sim p(\varepsilon)}\left[\nabla_{\varphi} \log p(x \mid g(\varepsilon, \varphi \mid x))\right] \\
\approx \frac{1}{n} \sum_{i} \nabla_{\varphi} \log p\left(x \mid g\left(\varepsilon^{(i)}, \varphi \mid x\right)\right), \quad \varepsilon^{(i)} \sim p(\varepsilon)
\end{array}
$
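A hedged sketch of the resulting Monte Carlo estimator, using an assumed toy Gaussian decoder $\log p(x \mid z)=-\frac{1}{2}(x-z)^{2}+\text{const}$ so the estimates can be checked against the analytic gradients $(x-\mu_{x})$ and $-\sigma_{x}$:

```python
# Pathwise Monte Carlo estimate of grad_phi E_{z ~ N(mu_x, sigma_x^2)}[log p(x | z)]
# for a toy Gaussian decoder log p(x | z) = -0.5 * (x - z)^2 + const.
import torch

x = torch.tensor(1.0)
mu = torch.tensor(0.3, requires_grad=True)     # mu_x
sigma = torch.tensor(0.8, requires_grad=True)  # sigma_x

n = 10_000
eps = torch.randn(n)                 # eps^(i) ~ p(eps) = N(0, 1)
z = mu + sigma * eps                 # z^(i) = g(eps^(i), phi | x)
log_p = -0.5 * (x - z) ** 2          # log p(x | z^(i)) up to a constant
log_p.mean().backward()              # (1/n) sum_i grad_phi log p(x | g(eps^(i), phi | x))

print(mu.grad, x - mu.detach())      # estimate vs. analytic gradient (x - mu_x)
print(sigma.grad, -sigma.detach())   # estimate vs. analytic gradient -sigma_x
```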
## Properties
**No need to know the pdf explicitly**: only the deterministic transformation and the base sampling distribution are needed.
**They require differentiable cost functions**: pathwise estimators work by reparameterizing $x=g(\varepsilon, \theta)$, where $\varepsilon$ is parameter-free noise, and then computing
$
\nabla_{\theta} \mathbb{E}[f(x)]=\mathbb{E}\left[\nabla_{\theta} f(g(\varepsilon, \theta))\right]=\mathbb{E}\left[\nabla_{x} f(x) \cdot \nabla_{\theta} g(\varepsilon, \theta)\right]
$
This requires $\nabla_{x} f(x)$ to exist, so $f(x)$ must be differentiable.
The [[REINFORCE - Score Function Estimator]] circumvents this by using the identity $\nabla_{\theta} \mathbb{E}[f(x)]=\mathbb{E}\left[f(x) \nabla_{\theta} \log p(x \mid \theta)\right]$. Here, gradients are only taken of the log probability $p(x \mid \theta)$, while the cost function $f(x)$ appears as a multiplicative weight in the expectation. Since $f(x)$ is never differentiated (only the log probability is), $f(x)$ can be any function: discontinuous, discrete-valued, or non-differentiable.
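An illustrative toy contrast (the step-function cost and the concrete numbers are assumptions, not from the original text): the pathwise estimator is unusable here because $\nabla_{x} f(x)=0$ almost everywhere, while the score-function estimator still recovers the correct gradient.

```python
# Score-function estimate of d/dmu E_{x ~ N(mu, sigma^2)}[f(x)] for a
# non-differentiable cost f(x) = 1[x > 0]; a pathwise estimator would return 0.
import torch

mu, sigma, n = 0.0, 1.0, 100_000
x = mu + sigma * torch.randn(n)   # x ~ N(mu, sigma^2)
f = (x > 0).float()               # step-function cost, gradient 0 almost everywhere

score = (x - mu) / sigma ** 2     # grad_mu log N(x; mu, sigma^2)
grad_mu = (f * score).mean()      # REINFORCE estimate of d/dmu E[f(x)]
print(grad_mu)                    # true value at mu = 0 is the standard normal pdf ~ 0.3989
```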
**Low variance in general**
- Lower than the [[REINFORCE - Score Function Estimator]]
- Example: comparing the score-function (left) and pathwise (right) gradient estimates of the VAE reconstruction term below, the score-function estimator carries $\log p\left(x \mid \mathbf{z}^{(i)}\right)$ as an extra multiplicative weight, which inflates the variance; see the empirical sketch after the equations.
$
\underbrace{\frac{1}{n} \sum_{i} \log p\left(x \mid \mathbf{z}^{(i)}\right) \nabla_{\varphi} \log q_{\varphi}\left(\mathbf{z}^{(i)} \mid x\right)}_{\text{score function}} \qquad \underbrace{\frac{1}{n} \sum_{i} \nabla_{\varphi} \log p\left(x \mid g\left(\varepsilon^{(i)}, \varphi\right)\right)}_{\text{pathwise}}
$
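A hedged empirical check of the variance claim, reusing the assumed toy Gaussian decoder $\log p(x \mid z)=-\frac{1}{2}(x-z)^{2}$ from the derivation above and comparing per-sample score-function vs. pathwise gradients w.r.t. $\mu_{x}$:

```python
# Both estimators target grad_mu E_{z ~ N(mu, sigma^2)}[-0.5 * (x - z)^2] = (x - mu),
# but their per-sample variances differ by roughly an order of magnitude in this toy setup.
import torch

x, mu, sigma, n = 1.0, 0.3, 0.8, 100_000
eps = torch.randn(n)
z = mu + sigma * eps                 # z^(i) ~ q(z | x) = N(mu, sigma^2)

log_p = -0.5 * (x - z) ** 2          # log p(x | z^(i)) up to a constant
score_mu = (z - mu) / sigma ** 2     # grad_mu log q(z^(i) | x)

score_samples = log_p * score_mu     # per-sample score-function estimates
pathwise_samples = x - z             # per-sample pathwise estimates (grad_mu taken by hand)

print(score_samples.mean(), pathwise_samples.mean())  # both ~ (x - mu) = 0.7
print(score_samples.var(), pathwise_samples.var())    # score-function variance is much larger
```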
**Very efficient**: this is why they were proposed for the VAE in the first place; in practice even a single sample per data point suffices, regardless of the dimensionality of $\mathbf{z}$.