# Challenges of optimizing deep neural networks

It is not easy to optimize a deep neural network, because:

1. Ill-conditioning -> even a strong gradient might not be good enough
2. Local optimization is susceptible to local minima
3. Plateaus, cliffs and pathological curvatures
4. Vanishing and exploding gradients
5. Long-term dependencies

### Ill-conditioning

We can analyze the possible behaviors of the neural network loss function by examining its [[Taylor Expansion]]. Resort to the $2^{\text{nd}}$-order Taylor expansion around the current weights $\boldsymbol{w}'$:

$$
\mathcal{L}(\boldsymbol{w}) \approx \mathcal{L}(\boldsymbol{w}') + \boldsymbol{g}^{\mathrm{T}}(\boldsymbol{w}-\boldsymbol{w}') + \frac{1}{2}(\boldsymbol{w}-\boldsymbol{w}')^{\mathrm{T}} \mathbf{H}\,(\boldsymbol{w}-\boldsymbol{w}')
$$

where $\boldsymbol{g}=\frac{d\mathcal{L}}{d\boldsymbol{w}}\big|_{\boldsymbol{w}'}$ is the gradient and $\mathbf{H}$ is the Hessian at $\boldsymbol{w}'$.

If we analyze the loss after taking a small gradient step $\boldsymbol{w} \leftarrow \boldsymbol{w}' - \varepsilon \boldsymbol{g}$, we get:

$$
\mathcal{L}(\boldsymbol{w}'-\varepsilon \boldsymbol{g}) \approx \mathcal{L}(\boldsymbol{w}') - \varepsilon\, \boldsymbol{g}^{\mathrm{T}} \boldsymbol{g} + \frac{1}{2}\varepsilon^{2}\, \boldsymbol{g}^{\mathrm{T}} \mathbf{H}\, \boldsymbol{g}
$$

There are cases where $\boldsymbol{g}$ is "strong" but $\frac{1}{2}\varepsilon^{2}\, \boldsymbol{g}^{\mathrm{T}} \mathbf{H}\, \boldsymbol{g} > \varepsilon\, \boldsymbol{g}^{\mathrm{T}} \boldsymbol{g}$. The curvature term then dominates, so the loss still goes up after the gradient step, meaning we are unlearning (see the first numerical sketch at the end of this note).

### Local Minima

Stochasticity alone is not always enough to escape local minima. Keep in mind that the nice 2-D/3-D loss-landscape visualizations are our own imagination: in practice, we and the network are blind to what the landscape really looks like. Our best hope is simply to get the optimization procedure right.

### Pathological curvatures

One of the difficulties is pathological curvature in the loss landscape.

#### Ravines

Ravines are regions that are very steep in one direction and almost flat in another, so the gradient is large along one axis and small along the other. They are quite common in loss landscapes (see the second sketch at the end of this note).

![[loss-ravines.jpg]]
[Image Source](https://medium.com/paperspace/intro-to-optimization-in-deep-learning-momentum-rmsprop-and-adam-8335f15fdee2)

#### Plateaus/Flat areas

In flat areas the gradients are almost zero, so updates are tiny and learning becomes very slow. However, flat areas that are minima tend to generalize well.

![[minimas.jpg]]
[Image Source](https://www.inference.vc/everything-that-works-works-because-its-bayesian-2/)

#### Flat areas, steep minima

Combining flat areas with very steep minima is especially challenging: how do we even get to the region where the steep minimum starts, when the surrounding plateau provides almost no gradient signal?

![[steep-minima.jpg]]

---
## References

1. Lecture 3.1, UvA DL course 2020
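
---

As a concrete illustration of the ill-conditioning argument above, here is a minimal numerical sketch (not from the lecture; the Hessian, weights and step sizes are made up) on a toy quadratic loss $\mathcal{L}(\boldsymbol{w})=\frac{1}{2}\boldsymbol{w}^{\mathrm{T}}\mathbf{H}\boldsymbol{w}$, for which the second-order expansion is exact:

```python
# Toy quadratic loss L(w) = 0.5 * w^T H w with an ill-conditioned Hessian.
# Compares the first-order decrease eps * g^T g with the curvature term
# 0.5 * eps^2 * g^T H g for two step sizes. Values are illustrative only.
import numpy as np

H = np.diag([1.0, 100.0])        # ill-conditioned: curvature 1 vs. 100
w = np.array([1.0, 1.0])         # current weights w'

def loss(w):
    return 0.5 * w @ H @ w

g = H @ w                        # gradient at w'; it is "strong" (norm ~ 100)

for eps in [0.005, 0.05]:
    decrease  = eps * (g @ g)                 # eps * g^T g
    curvature = 0.5 * eps**2 * (g @ H @ g)    # 0.5 * eps^2 * g^T H g
    actual    = loss(w - eps * g) - loss(w)   # true change after the step
    print(f"eps={eps}: -{decrease:.1f} + {curvature:.1f} = {actual:+.1f}")
```

With `eps=0.005` the gradient term wins and the loss drops; with `eps=0.05` the curvature term dominates and the loss increases by roughly 750 even though the gradient itself is large, which is exactly the "unlearning" case described in the ill-conditioning section.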
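
The same toy quadratic also illustrates the ravine problem: plain gradient descent oscillates across the steep direction while crawling along the flat one. Again a minimal sketch with made-up values, not the lecture's code:

```python
# Gradient descent in a "ravine": the loss is steep along w2 and flat along w1.
# The iterate zig-zags across the steep direction and barely moves in the flat one.
import numpy as np

H = np.diag([1.0, 100.0])            # flat along w1, steep along w2
w = np.array([10.0, 1.0])
eps = 0.019                          # just below the stability limit 2/100 for w2

for step in range(5):
    w = w - eps * (H @ w)            # plain gradient descent step
    print(f"step {step + 1}: w1 = {w[0]:.3f}, w2 = {w[1]:+.3f}")
```

After five steps `w1` has only moved from 10 to about 9.1, while `w2` keeps flipping sign (zig-zagging between the ravine walls). Shrinking `eps` removes the oscillation but makes progress along `w1` even slower, which is the usual motivation for momentum and adaptive step sizes.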