# Stochastic gradients
Awesome paper: Monte Carlo Gradient Estimation in Machine Learning, Shakir Mohamed et al. 2020 https://arxiv.org/pdf/1906.10652.pdf
In stochastic learning, we have a general probabilistic objective
$
\mathcal{F}(\varphi)=\int p_{\varphi}(\boldsymbol{x}) f_{\theta}(\boldsymbol{x}) d \boldsymbol{x}
$
where
- $p_{\varphi}(x)$ is a continuous probability distribution, differentiable w.r.t. $\varphi$
- $f_{\theta}(x)$ is the structured cost with structural parameters $\theta$
Learning means we want to optimize $\mathcal{F}$ w.r.t. $\varphi$ (and $\theta$). This learning objective already looks like an [[Monte-Carlo Estimation|MC estimator]].
To optimize the learning objective we must take gradients $\frac{d}{d \varphi} \mathcal{F}(\varphi)$. The learning objective is stochastic $\rightarrow$ the gradients are stochastic:
$
\frac{d}{d \varphi} \mathcal{F}(\varphi)=\nabla_{\varphi} \mathbb{E}_{x \sim p_{\varphi}(x)}\left[f_{\theta}(x)\right]
$
Except for simple cases, the stochastic gradients cannot be computed analytically.
We must resort to MC estimation instead. But now we do not prescribe a specific pdf $p(x)$; instead, we prescribe a family of distributions $p_{\varphi}(x)$ and learn the best possible $\varphi$ in the process.
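As a quick sketch of what MC estimation of this objective looks like (assuming, purely for illustration, a Gaussian $p_{\varphi}(x)=\mathcal{N}(x;\varphi,1)$ and a toy cost $f_{\theta}(x)=(x-\theta)^{2}$; neither choice comes from the notes above):
```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, theta=2.0):
    # toy structured cost f_theta(x) = (x - theta)^2 (illustrative choice)
    return (x - theta) ** 2

def F_hat(phi, n_samples=10_000):
    # plain MC estimate of F(phi) = E_{x ~ N(phi, 1)}[f_theta(x)]
    x = rng.normal(loc=phi, scale=1.0, size=n_samples)
    return f(x).mean()

print(F_hat(0.0))  # analytic value: (phi - theta)^2 + 1 = 5
```
The catch is that $\varphi$ enters through the sampling distribution, so we cannot simply differentiate this estimate through the call to `rng.normal`; this is exactly what the gradient estimators below address.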
## Challenges
$
\eta=\nabla_{\varphi} \mathcal{F}(\varphi)=\nabla_{\varphi} \mathbb{E}_{\boldsymbol{x} \sim p_{\varphi}(\boldsymbol{x})}\left[f_{\theta}(\boldsymbol{x})\right]
$
- $x$ is typically high dimensional
- The number of parameters $\varphi$ is often on the order of thousands
- The cost function is often not differentiable or even unknown
- As a result, the expectation (integral) is often intractable; we must estimate it with MC integration instead.
## Desired properties of MC estimators for gradients
- Consistency
- As we draw more samples, the estimator $\hat{y}$ should get closer to the true $y$
- Unbiasedness:
- Guarantees convergence of stochastic optimization
- Low variance
- Few samples should suffice
- Less jitter, i.e. gradient updates point in a consistent direction, which results in more efficient learning (see the sketch after this list for an empirical check)
- Computational efficiency
- Should be easy to sample and estimate
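A minimal harness (reusing the toy example above; `grad_estimator` is a placeholder for any of the estimators discussed below) for checking these properties empirically: draw many independent gradient estimates and compare their mean and spread to the analytic gradient.
```python
def check_estimator(grad_estimator, phi, true_grad, n_repeats=1_000):
    # grad_estimator(phi) should return a single stochastic gradient estimate
    estimates = np.array([grad_estimator(phi) for _ in range(n_repeats)])
    bias = estimates.mean() - true_grad   # approx. 0 for an unbiased estimator
    std = estimates.std()                 # lower variance -> fewer samples needed
    return bias, std
```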
## Stochastic optimization loop
![[stochastic gradient pipeline.jpg]]
## Applications
[[Variational Inference]]
$
\nabla_{\varphi} \mathbb{E}_{q_{\varphi}(z \mid x)}\left[\log p(x \mid z)-\log \frac{q_{\varphi}(z \mid x)}{p(z)}\right]
$
[[Reinforcement Learning]]
$
\nabla_{\varphi} \mathbb{E}_{p_{\varphi}(\tau)}\left[\sum_{t} \gamma_{t} r\left(\boldsymbol{s}_{t}, \boldsymbol{a}_{t}\right)\right]
$
where $\boldsymbol{\tau}=\left(\boldsymbol{s}_{1}, \boldsymbol{a}_{1}, \boldsymbol{s}_{2}, \ldots\right)$ are trajectories over time $t$, $\gamma_{t}$ are discount factors, and $r$ is the reward
Outside ML and DL: sensitivity analysis, discrete event systems and queuing theory, experimental design
## Score function estimator
[[REINFORCE - score function estimator]]
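For the toy Gaussian example above, a minimal sketch of the score-function (REINFORCE) estimator, which rewrites the gradient as $\nabla_{\varphi} \mathbb{E}_{p_{\varphi}(x)}\left[f_{\theta}(x)\right]=\mathbb{E}_{p_{\varphi}(x)}\left[f_{\theta}(x) \nabla_{\varphi} \log p_{\varphi}(x)\right]$ and therefore only needs cost evaluations plus the score of $p_{\varphi}$:
```python
def score_function_grad(phi, n_samples=1_000, theta=2.0):
    # REINFORCE: grad F(phi) = E[ f_theta(x) * d/dphi log N(x; phi, 1) ]
    x = rng.normal(loc=phi, scale=1.0, size=n_samples)
    score = x - phi                       # d/dphi log N(x; phi, 1) = (x - phi) / 1^2
    return (f(x, theta) * score).mean()

print(check_estimator(score_function_grad, phi=0.0, true_grad=-4.0))
# bias is close to 0, but the spread is relatively large
```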
## Pathwise gradient estimator
[[Pathwise gradient estimator]]
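For the same toy example, a minimal sketch of the pathwise (reparameterization) estimator: write $x = \varphi + \epsilon$ with $\epsilon \sim \mathcal{N}(0,1)$ and push the gradient through the differentiable cost.
```python
def pathwise_grad(phi, n_samples=1_000, theta=2.0):
    # reparameterize x = phi + eps, eps ~ N(0, 1); then
    # d/dphi f_theta(phi + eps) = f_theta'(x) = 2 * (x - theta)
    eps = rng.normal(size=n_samples)
    x = phi + eps
    return (2.0 * (x - theta)).mean()

print(check_estimator(pathwise_grad, phi=0.0, true_grad=-4.0))
# bias is again close to 0, with a noticeably smaller spread than REINFORCE
```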
## Qualitative comparison between estimators
- Pathwise gradients have consistently lower variance
![[comparision.jpg]]
For complex functions, the pathwise gradient might have higher variance
![[pathwise-comparision.jpg]]
## Straight-through gradients
Often, gradients are hard or impossible to compute. For instance, suppose we have binary stochastic variables $z \sim f(x)$, $z \in\{0,1\}$.
If we compute the derivative on the sample we get $\frac{d z}{d x}=0$, because the sampled $z$ is a constant value (not a function of $x$).
A popular alternative is straight-through gradients:
- We set the gradient to $\frac{d z}{d x}=1$
- Another alternative is to set the gradient to $\frac{d z}{d x}=\frac{d f}{d x}$
However, straight-through gradients introduce bias as our estimated gradient is different from the true gradient.
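A minimal sketch of the straight-through trick in PyTorch (the downstream cost is an arbitrary choice for illustration): the forward pass uses the hard binary sample, while the backward pass acts as if the sampling step had derivative 1 with respect to the probabilities, so the gradient flows through the sigmoid as in the $\frac{d f}{d x}$ variant.
```python
import torch

logits = torch.randn(4, requires_grad=True)
probs = torch.sigmoid(logits)
z_hard = torch.bernoulli(probs)            # binary sample; no gradient flows through this
# straight-through: forward value is z_hard, backward pretends dz/dprobs = 1
z = probs + (z_hard - probs).detach()
cost = (z * torch.arange(1.0, 5.0)).sum()  # arbitrary downstream cost for illustration
cost.backward()
print(logits.grad)                         # gradients reach the logits, but they are biased
```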
## Variance reduction in deep networks
[[Control variates]]
REBAR (Tucker et al.) - https://arxiv.org/abs/1703.07370
- Low variance, unbiased gradient estimates for discrete latent variables
- Inspired by REINFORCE and continuous relaxations
- Removing the bias from the continuous relaxation
RELAX (Grathwohl et al.) - https://arxiv.org/pdf/1711.00123.pdf
- Low variance, unbiased gradient estimates for black box functions
- Applicable to discrete and continuous settings
## Low bias low variance gradients
Paper: Pervez, Cohen and Gavves, Low Bias Low Variance Gradient Estimates for Hierarchical Boolean Stochastic Networks
Existing methods have trouble with deep Boolean stochastic nets
Successive straight-through in multiple layers fails
- Efficient but the bias accumulates over multiple layers
- Optimization quickly gets stuck and learning stops
Using unbiased estimates (REBAR, RELAX) is too inefficient
Expand Boolean networks with harmonic analysis (Fourier)
- Bias and variance are caused by higher-order coefficients
- Manipulating those coefficients reduces bias and variance
Can train up to 80 layers instead of 2
---
## References
1. Monte Carlo Gradient Estimation in Machine Learning https://arxiv.org/pdf/1906.10652.pdf