# Stochastic gradients

Awesome paper: Monte Carlo Gradient Estimation in Machine Learning, Shakir Mohamed et al. 2020 https://arxiv.org/pdf/1906.10652.pdf

In stochastic learning, we have a general probabilistic objective

$$
\mathcal{F}(\varphi)=\int p_{\varphi}(\boldsymbol{x}) f_{\theta}(\boldsymbol{x}) d \boldsymbol{x}
$$

where $p_{\varphi}(\boldsymbol{x})$ is a continuous probability distribution, differentiable w.r.t. $\varphi$, and $f_{\theta}(\boldsymbol{x})$ is the structured cost with structural parameters $\theta$.

Learning means we want to optimize $\mathcal{F}$ w.r.t. $\varphi$ (and $\theta$). This learning objective already looks like an [[Monte-Carlo Estimation|MC estimator]].

To optimize the learning objective we must take gradients $\frac{d}{d \varphi} \mathcal{F}(\varphi)$. The learning objective is stochastic $\rightarrow$ the gradients are stochastic:

$$
\frac{d}{d \varphi} \mathcal{F}(\varphi)=\nabla_{\varphi} \mathbb{E}_{\boldsymbol{x} \sim p_{\varphi}(\boldsymbol{x})}\left[f_{\theta}(\boldsymbol{x})\right]
$$

Except for simple cases, these stochastic gradients cannot be computed analytically; we must resort to MC estimation instead. But now we do not prescribe a specific pdf $p(\boldsymbol{x})$; instead we prescribe a family $p_{\varphi}(\boldsymbol{x})$ and learn the best possible $\varphi$ in the process.

## Challenges

$$
\eta=\nabla_{\varphi} \mathcal{F}(\varphi)=\nabla_{\varphi} \mathbb{E}_{\boldsymbol{x} \sim p_{\varphi}(\boldsymbol{x})}\left[f_{\theta}(\boldsymbol{x})\right]
$$

- $\boldsymbol{x}$ is typically high dimensional
- The parameters $\varphi$ often number in the thousands
- The cost function is often not differentiable or even unknown
- As a result, the expectation (integral) is usually intractable; we must estimate it with MC integration instead

## Desired properties of MC estimators for gradients

- Consistency
    - With more samples the estimator $\hat{y}$ should get closer to the true $y$
- Unbiasedness
    - Guarantees convergence of stochastic optimization
- Low variance
    - Few samples should suffice
    - Less jiggling, i.e. gradient updates point in a consistent direction, which results in more efficient learning
- Computational efficiency
    - Should be easy to sample and estimate

## Stochastic optimization loop

![[stochastic gradient pipeline.jpg]]

## Applications

[[Variational Inference]]

$$
\nabla_{\varphi} \mathbb{E}_{q_{\varphi}(z \mid x)}\left[\log p(x \mid z)-\log \frac{q_{\varphi}(z \mid x)}{p(z)}\right]
$$

[[Reinforcement Learning]]

$$
\nabla_{\varphi} \mathbb{E}_{p_{\varphi}(\tau)}\left[\sum_{t} \gamma_{t} r\left(\boldsymbol{s}_{t}, \boldsymbol{a}_{t}\right)\right]
$$

where $\boldsymbol{\tau}=\left(\boldsymbol{s}_{1}, \boldsymbol{a}_{1}, \boldsymbol{s}_{2}, \ldots\right)$ are trajectories over time $t$, $\gamma_{t}$ are discount factors and $r$ is the reward.

Outside ML and DL: sensitivity analysis, discrete event systems and queuing theory, experimental design.

## Score function estimator

[[REINFORCE - score function estimator]]

## Pathwise gradient estimator

[[Pathwise gradient estimator]]

## Qualitative comparison between estimators

- Pathwise gradients have consistently lower variance

![[comparision.jpg]]

For complex cost functions the pathwise gradient might have higher variance:

![[pathwise-comparision.jpg]]

## Straight-through gradients

Often, gradients are hard or impossible to compute. For instance, with binary stochastic variables $z \sim f(x)$, $z \in\{0,1\}$, differentiating through the sample gives $\frac{d z}{d x}=0$: once drawn, $z$ is a constant value (not a function of $x$).

A popular alternative is straight-through gradients (see the sketch below):

- Set the gradient to $\frac{d z}{d x}=1$
- Another alternative is to set the gradient to $\frac{d z}{d x}=\frac{d f}{d x}$

However, straight-through gradients introduce bias, as our estimated gradient differs from the true gradient.
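A minimal sketch of both variants, assuming a PyTorch-style autograd setup with a Bernoulli unit and a toy downstream loss (the tensor names and the loss are illustrative, not taken from any of the referenced papers):

```python
import torch

torch.manual_seed(0)
x = torch.randn(16, requires_grad=True)   # pre-activations
p = torch.sigmoid(x)                      # relaxation f(x), here a sigmoid
z_hard = torch.bernoulli(p)               # binary sample; dz/dx = 0 almost everywhere

# Variant 1: identity straight-through, pretend dz/dx = 1
z_id = x + (z_hard - x).detach()          # forward value = z_hard, backward gradient = 1

# Variant 2: pass the gradient of the relaxation, pretend dz/dx = df/dx
z_st = p + (z_hard - p).detach()          # forward value = z_hard, backward gradient = f'(x)

loss = (z_st - 0.5).pow(2).mean()         # toy downstream cost on the binary sample
loss.backward()
print(x.grad)                             # nonzero, but a biased estimate of the true gradient
```

Either `z_id` or `z_st` can feed the downstream cost; both keep the hard binary value in the forward pass while a surrogate gradient flows in the backward pass, which is exactly where the bias comes from.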
## Variance reduction in deep networks

[[Control variates]], e.g. subtracting a baseline from the score-function estimator (a minimal sketch follows the references below)

REBAR (Tucker et al.) - https://arxiv.org/abs/1703.07370
- Low variance, unbiased gradient estimates for discrete latent variables
- Inspired by REINFORCE and continuous relaxations
- Removes the bias from the continuous relaxation

RELAX (Grathwohl et al.) - https://arxiv.org/pdf/1711.00123.pdf
- Low variance, unbiased gradient estimates for black-box functions
- Applicable to discrete and continuous settings

## Low bias low variance gradients

Paper: Pervez, Cohen and Gavves, Low Bias Low Variance Gradient Estimates for Hierarchical Boolean Stochastic Networks

Existing methods struggle with deep Boolean stochastic networks:
- Successive straight-through estimation over multiple layers fails: it is efficient, but the bias accumulates across layers, so optimization quickly gets stuck and learning stops
- Unbiased estimators (REBAR, RELAX) are too inefficient

The paper expands Boolean networks with harmonic analysis (Fourier):
- Bias and variance are caused by higher-order coefficients
- Manipulating those coefficients reduces bias and variance
- This makes it possible to train up to 80 layers instead of 2

---

## References

1. Monte Carlo Gradient Estimation in Machine Learning, Shakir Mohamed et al., 2020. https://arxiv.org/pdf/1906.10652.pdf
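---

To make the variance-reduction idea concrete, here is a minimal sketch of the simplest control variate, a constant baseline subtracted from the score-function estimator, on a toy Gaussian example (the distribution, cost and baseline choice are illustrative assumptions, not taken from REBAR or RELAX):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 0.0, 100_000

def f(x):                                    # toy cost f_theta(x)
    return (x - 2.0) ** 2

x = rng.normal(mu, 1.0, size=n)              # samples x ~ p_mu(x) = N(mu, 1)
score = x - mu                               # grad_mu log N(x; mu, 1)

g_plain = f(x) * score                       # plain score-function (REINFORCE) samples
b = f(rng.normal(mu, 1.0, size=n)).mean()    # constant baseline from an independent batch
g_base = (f(x) - b) * score                  # still unbiased, since E[score] = 0

print("estimate (plain, baseline):", g_plain.mean(), g_base.mean())  # both approx. 2*(mu-2) = -4
print("variance (plain, baseline):", g_plain.var(), g_base.var())    # baseline version is lower
```

Subtracting the baseline leaves the estimate unbiased because $\mathbb{E}\left[\nabla_{\mu} \log p_{\mu}(x)\right]=0$, but it noticeably reduces the variance (roughly halving it in this toy setting). REBAR and RELAX build far more elaborate control variates on the same principle.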