# Depth and Trainability
Trainability depends on model design choices:
1. Neural network architecture
2. [[Adaptive Learning Rate Optimizers]]
3. [[Weight Initialization]]
4. [[Hyperparameters in Deep Neural Networks|Hyperparams]]
## Smoothing the loss surface
Based on the paper: Li, Xu, Taylor, Studer, Goldstein, Visualizing the Loss Landscape of Neural Nets, NeurIPS, 2018
Why do residual connections make neural networks more trainable?
- Adding skip connections makes the loss surface less rough
- Gradients more representative of the direction to good local minima
![[loss-surfaces.jpg]]
Note: take these visualizations with a grain of salt: they involve a dramatic dimensionality reduction (a 2D slice through a very high-dimensional parameter space)! A sketch of how such a slice is computed follows below.
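For reference, the sketch below is my own simplified rendition of the paper's filter-normalized random-direction plots (function names and defaults are mine, and batch-norm statistics are ignored): pick two random directions with the same shapes as the trained weights, rescale them per filter to match the weight norms, and evaluate the loss on a grid around the minimizer. It only assumes a trained PyTorch `model`, a `loss_fn`, and one batch of `inputs`/`targets`.

```python
import torch

def filter_normalized_direction(weights):
    """A random direction with the same shapes as the trained weights, rescaled per
    output filter/row to match the weight norms (simplified version of the filter
    normalization in Li et al., 2018; biases are left as plain Gaussians)."""
    direction = []
    for w in weights:
        d = torch.randn_like(w)
        if w.dim() > 1:
            d_rows = d.view(d.size(0), -1)
            w_rows = w.view(w.size(0), -1)
            d_rows *= w_rows.norm(dim=1, keepdim=True) / (d_rows.norm(dim=1, keepdim=True) + 1e-10)
        direction.append(d)
    return direction

@torch.no_grad()
def loss_surface_2d(model, loss_fn, inputs, targets, steps=25, span=1.0):
    """Evaluate L(theta* + a*delta + b*eta) on a 2D grid around the trained weights."""
    theta_star = [p.detach().clone() for p in model.parameters()]
    delta = filter_normalized_direction(theta_star)
    eta = filter_normalized_direction(theta_star)
    coords = torch.linspace(-span, span, steps)
    surface = torch.zeros(steps, steps)
    for i, a in enumerate(coords):
        for j, b in enumerate(coords):
            for p, t, d, e in zip(model.parameters(), theta_star, delta, eta):
                p.copy_(t + a * d + b * e)
            surface[i, j] = loss_fn(model(inputs), targets)
    for p, t in zip(model.parameters(), theta_star):  # restore the trained weights
        p.copy_(t)
    return surface
```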
## The effect of depth
- Deeper architectures have more uneven, chaotic surfaces and many minima
- Removing skip connections fragments and elongates the loss surface
- A fragmented surface makes good initialization essential: a poor start can land in a chaotic region
- Flatter minima are accompanied by lower test errors
![[effect-on-depth.jpg]]
## The effect of depth in wider architectures
- Similar conclusions hold when increasing width
- Width makes the loss surface even smoother and flatter!
![[effect-on-widt.jpg]]
## The effect of weight decay on optimizer trajectory
- Weight decay encourages an optimization trajectory that is perpendicular to the loss isocurves
- With weight decay turned off, the optimizer often moves parallel to the isocurves (see the update-decomposition sketch after the figure)
![[effect-on-optimizers.jpg]]
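For intuition, the sketch below (my own toy decomposition, not from the paper) splits a plain, non-decoupled SGD-with-weight-decay step into a gradient part, which is locally perpendicular to the loss isocurves, and a decay part, which pulls the weights toward the origin. The parallel-vs-perpendicular behaviour above is the paper's empirical observation about full training trajectories, not something this snippet proves.

```python
import torch

def sgd_step_with_decay(w, grad, lr=0.1, weight_decay=1e-4):
    """Decompose one (non-decoupled) SGD + weight decay update into two parts:
    - grad_part: follows -grad, locally perpendicular to the loss isocurves,
    - decay_part: shrinks the weights toward the origin."""
    grad_part = -lr * grad
    decay_part = -lr * weight_decay * w
    return w + grad_part + decay_part, grad_part, decay_part

# Hypothetical 2D example: the decay term redirects the step toward the origin.
w = torch.tensor([3.0, -1.0])
grad = torch.tensor([0.2, 0.5])
new_w, grad_part, decay_part = sgd_step_with_decay(w, grad, lr=0.1, weight_decay=0.5)
print(grad_part, decay_part)
```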
## Why do skip connections make loss surfaces smoother?
For a residual block $\boldsymbol{h}=\boldsymbol{F}(\boldsymbol{x})+\boldsymbol{x}$, the gradient with a skip connection becomes
$$
\frac{\partial \mathcal{L}}{\partial \boldsymbol{x}}=\frac{\partial \mathcal{L}}{\partial \boldsymbol{h}} \cdot \frac{\partial \boldsymbol{h}}{\partial \boldsymbol{x}}=\frac{\partial \mathcal{L}}{\partial \boldsymbol{h}} \cdot\left(\frac{\partial \boldsymbol{F}}{\partial \boldsymbol{x}}+\frac{\partial \boldsymbol{x}}{\partial \boldsymbol{x}}\right)=\frac{\partial \mathcal{L}}{\partial \boldsymbol{h}} \cdot \frac{\partial \boldsymbol{F}}{\partial \boldsymbol{x}}+\frac{\partial \mathcal{L}}{\partial \boldsymbol{h}}
$$
This means the upstream gradient $\frac{\partial \mathcal{L}}{\partial \boldsymbol{h}}$ reaches the earlier module untouched, so [[Vanishing and Exploding Gradients]] become less of a problem. Seen another way, the loss surface has stronger, better-behaved gradients, i.e., it is smoother.
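A quick way to see this numerically is to compare the Jacobian $\partial \boldsymbol{h}/\partial \boldsymbol{x}$ of a block with and without the skip connection. The toy module below is my own minimal sketch (not from the lecture) and only assumes PyTorch.

```python
import torch
import torch.nn as nn

# Hypothetical tiny residual branch F; h = F(x) + x is the residual block.
branch = nn.Sequential(nn.Linear(4, 4), nn.Tanh(), nn.Linear(4, 4))
x = torch.randn(4)

# Jacobians dh/dx with and without the skip connection.
jac_residual = torch.autograd.functional.jacobian(lambda z: branch(z) + z, x)
jac_plain = torch.autograd.functional.jacobian(lambda z: branch(z), x)

# The skip connection adds the identity: dh/dx = dF/dx + I, so the upstream
# gradient dL/dh reaches x untouched in addition to the dF/dx term.
print(torch.allclose(jac_residual, jac_plain + torch.eye(4)))  # True
```

With plain chaining ($\boldsymbol{h}=\boldsymbol{F}(\boldsymbol{x})$), the identity term is missing, so the gradient must pass through $\partial \boldsymbol{F}/\partial \boldsymbol{x}$ at every layer and can shrink multiplicatively with depth.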
---
## References
1. Lecture 5.5, UvA DL course 2020
2. Li, Xu, Taylor, Studer, Goldstein, Visualizing the Loss Landscape of Neural Nets, NeurIPS, 2018