# Vanishing and Exploding Gradients
Very deep networks often stop improving after a point: once a certain accuracy is reached, the network saturates and performance may even degrade. The signal gets lost as it passes through so many layers, and the model starts failing.
## Gradient behavior
To train very deep networks, we expect consistent behavior from every module.
Let's check the backpropagation gradients:
$
\frac{\partial \mathcal{L}}{\partial w_{l}}=\frac{\partial \mathcal{L}}{\partial a_{L}} \cdot \frac{\partial a_{L}}{\partial a_{L-1}} \cdot \frac{\partial a_{L-1}}{\partial a_{L-2}} \cdot \ldots \cdot \frac{\partial a_{l}}{\partial w_{l}}
$
The gradient depends on a product of $L-l$ Jacobian matrices/tensors
$
\prod_{j=l+1}^{L} \frac{\partial a_{j}}{\partial a_{j-1}}
$
What is the relation between gradient norm (magnitude) and depth $L$?
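Before answering formally, a quick empirical illustration (a minimal NumPy sketch of my own; the layer width, Jacobian scale, and depths are arbitrary choices, not from the lecture): backpropagate a random output gradient through a stack of random per-layer Jacobians and watch its norm shrink or blow up with depth.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                       # hypothetical layer width
g = rng.normal(size=d)       # stand-in for dL/da_L at the output

for scale in (0.8, 1.2):     # per-layer "gain" below vs. above 1
    v = g.copy()
    for depth in range(1, 101):
        J = rng.normal(scale=scale / np.sqrt(d), size=(d, d))  # stand-in for da_j/da_{j-1}
        v = J.T @ v          # backpropagate through one more layer
        if depth in (10, 50, 100):
            print(f"scale={scale}, depth={depth:3d}: ||grad|| ~ {np.linalg.norm(v):.2e}")
```

With a per-layer gain below 1 the gradient norm collapses; above 1 it explodes.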
### Spectral (matrix) norm of the Jacobian
The spectral norm of the Jacobian helps us answer:
- After the module, does our input vector get larger in magnitude or smaller?
- To answer this, we look at the matrix (spectral) norm
- How to compute spectral norm?
The spectral norm is the largest singular value, computed with [[Singular Value Decomposition (SVD)]]. The singular values of a matrix $w$ are the square roots of the eigenvalues of $w^{T} w$.
The spectral norm captures the largest possible stretching that any input can undergo.
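As a small check of how the spectral norm is computed and what it bounds (my own NumPy sketch with a random matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 64))

# Spectral norm = largest singular value = sqrt of the largest eigenvalue of W^T W.
sigma_max = np.linalg.svd(W, compute_uv=False)[0]
assert np.isclose(sigma_max, np.linalg.norm(W, 2))
assert np.isclose(sigma_max, np.sqrt(np.linalg.eigvalsh(W.T @ W).max()))

# It bounds how much W can stretch any input vector.
x = rng.normal(size=64)
print(np.linalg.norm(W @ x) / np.linalg.norm(x), "<=", sigma_max)
```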
For simplicity, assume each module is
- a linear operator $w: x \rightarrow y$ (our linear transformation module)
- followed by a nonlinearity $h$
$
a_{j}=h\left(w_{j} \cdot a_{j-1}\right)
$
The spectral norm of the Jacobian is bounded by
- the spectral norm of the linear operator
- multiplied by the spectral norm of the nonlinear operator gradient $h^{\prime}$
$
\left\|\frac{\partial \boldsymbol{a}_{j}}{\partial \boldsymbol{a}_{j-1}}\right\| \leq\left\|\boldsymbol{w}_{\boldsymbol{j}}^{T}\right\| \cdot\left\|\operatorname{diag}\left(h^{\prime}\left(\boldsymbol{a}_{j-1}\right)\right)\right\|
$
- assuming an element-wise nonlinearity, so that the Jacobian of $h$ is diagonal (off-diagonal entries are 0) and its spectral norm is simply the largest entry of $\left|h^{\prime}\left(a_{j-1}\right)\right|$ (a numerical check follows below)
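Here is the promised check of the per-module bound (my own sketch, using tanh as the element-wise nonlinearity and a random weight matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
W = rng.normal(size=(d, d))
a_prev = rng.normal(size=d)

z = W @ a_prev
h_prime = 1.0 - np.tanh(z) ** 2     # derivative of tanh, element-wise
J = np.diag(h_prime) @ W            # Jacobian of a_j = tanh(W a_prev) w.r.t. a_prev

lhs = np.linalg.norm(J, 2)                           # spectral norm of the module Jacobian
rhs = np.linalg.norm(W, 2) * np.abs(h_prime).max()   # ||w^T|| * ||diag(h')||
print(lhs, "<=", rhs)
```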
### Combining per module spectral norms
Our final spectral norm is bounded by
$
\begin{aligned}
\left\|\frac{\partial \mathcal{L}}{\partial w_{l}}\right\| \propto\left\|\prod_{j=l+1}^{L} \frac{\partial a_{j}}{\partial a_{j-1}}\right\| & \leq \prod_{j=l+1}^{L}\left\|\boldsymbol{w}_{j}^{T}\right\| \prod_{j=l+1}^{L}\left\|\operatorname{diag}\left(h^{\prime}\left(\boldsymbol{a}_{j-1}\right)\right)\right\| \\
&=\prod_{j=l+1}^{L} \sigma_{j}^{a} \cdot \sigma_{j}^{h^{\prime}}
\end{aligned}
$
where $\sigma_{j}^{a}$ and $\sigma_{j}^{h^{\prime}}$ are the maximum singular values of $\boldsymbol{w}_{j}$ and of $\operatorname{diag}\left(h^{\prime}\left(\boldsymbol{a}_{j-1}\right)\right)$ for module $j$. This quantity is interesting because it connects the depth of our network with the spectral norms of the Jacobians.
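The first inequality above is submultiplicativity of the spectral norm: the norm of a product of Jacobians is at most the product of their norms. A small sketch (mine, with random matrices standing in for the per-layer Jacobians) illustrating it:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 32, 20
Js = [rng.normal(scale=1.0 / np.sqrt(d), size=(d, d)) for _ in range(L)]

prod = np.eye(d)
for J in Js:
    prod = J @ prod                                    # accumulate the chain of Jacobians

norm_of_product = np.linalg.norm(prod, 2)
product_of_norms = np.prod([np.linalg.norm(J, 2) for J in Js])
print(norm_of_product, "<=", product_of_norms)         # submultiplicativity
```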
### Implications
As depth $L$ becomes larger
$
\left\|\frac{\partial \mathcal{L}}{\partial w_{l}}\right\| \leq \prod_{j=l+1}^{L} \sigma_{j}^{a} \cdot \sigma_{j}^{h^{\prime}}
$
For singular values $\sigma_{j}<1$ we can get very small, i.e. vanishing, gradients.
- E.g., with $\sigma_{j}=0.5$ the bound is $0.5^{10} \approx 9.8 \cdot 10^{-4}$ after 10 layers and $0.5^{100} \approx 7.9 \cdot 10^{-31}$ after 100 layers
- Very small gradients, learning is slowed down significantly
For singular values $\sigma_{j}>1$ we can get ever-growing, i.e. exploding, gradients.
- E.g., with $\sigma_{j}=1.5$ the bound is $1.5^{10} \approx 57.7$ after 10 layers and $1.5^{100} \approx 4.1 \cdot 10^{17}$ after 100 layers
- Unstable optimization, oscillations, divergence
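The numbers above are just powers of a single per-layer factor; a few lines reproduce them:

```python
for sigma in (0.5, 1.5):
    for L in (10, 100):
        print(f"sigma={sigma}, L={L:3d}: {sigma ** L:.2e}")
# expected: 9.77e-04, 7.89e-31, 5.77e+01, 4.07e+17
```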
The effect is exponential in the distance from the loss.
Layers closer to the loss
- fewer multiplications -> less compounding -> little effect
Layers further away from the loss
- more multiplications -> exponential growth or decay -> dramatic effects
This also explains why simply adapting the learning rate isn't enough: modules at different depths behave differently, so we would have to adapt the learning rate per location. Moreover, a small gradient might also mean we have converged, not necessarily that gradients are vanishing.
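To see the location dependence directly, here is a small PyTorch sketch (my own; the depth, width, and choice of sigmoid are arbitrary, not from the lecture) that prints the gradient norm of each linear layer in a deep sigmoid MLP. Layers far from the loss typically end up with much smaller gradients:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, width = 30, 64
layers = []
for _ in range(depth):
    layers += [nn.Linear(width, width), nn.Sigmoid()]   # sigmoid: |h'| <= 0.25, prone to vanishing
net = nn.Sequential(*layers)

x = torch.randn(16, width)
loss = net(x).pow(2).mean()                             # dummy loss
loss.backward()

# Gradient norm per linear layer: typically tiny near the input, larger near the loss.
for i, m in enumerate(net):
    if isinstance(m, nn.Linear):
        print(f"layer {i:3d}: ||dL/dw|| = {m.weight.grad.norm().item():.3e}")
```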
---
## References
1. Lecture 5.2, UvA DL course 2020