# Depth and Trainability

Trainability depends on model design choices:

1. Neural network architecture
2. [[Adaptive Learning Rate Optimizers]]
3. [[Weight Initialization]]
4. [[Hyperparameters in Deep Neural Networks|Hyperparams]]

## Smoothing the loss surface

Based on the paper: Li, Xu, Taylor, Studer, Goldstein, "Visualizing the Loss Landscape of Neural Nets", NeurIPS 2018.

Why do residual connections make neural networks more trainable?

- Adding skip connections makes the loss surface less rough
- Gradients are more representative of the direction towards good local minima

![[loss-surfaces.jpg]]

Note: take these visualizations with a grain of salt: they rely on a dramatic dimensionality reduction! (A minimal sketch of the slicing idea behind them is included at the end of these notes.)

## The effect of depth

- Deeper architectures have more uneven, chaotic loss surfaces with many minima
- Removing skip connections fragments and elongates the loss surface
- A fragmented surface makes good initialization more important
- Flatter minima are accompanied by lower test errors

![[effect-on-depth.jpg]]

## The effect of depth in wider architectures

- Similar conclusions hold when increasing the width
- Width makes the loss surface even smoother and flatter!

![[effect-on-widt.jpg]]

## The effect of weight decay on the optimizer trajectory

- Weight decay encourages an optimization trajectory perpendicular to the loss isocurves
- With weight decay turned off, the optimizer often moves parallel to the isocurves

![[effect-on-optimizers.jpg]]

## Why do skip connections make loss surfaces smoother?

With a skip connection $\boldsymbol{h} = \boldsymbol{F}(\boldsymbol{x}) + \boldsymbol{x}$, the gradient of the loss with respect to the input becomes

$$
\frac{\partial \mathcal{L}}{\partial \boldsymbol{x}}=\frac{\partial \mathcal{L}}{\partial \boldsymbol{h}} \cdot \frac{\partial \boldsymbol{h}}{\partial \boldsymbol{x}}=\frac{\partial \mathcal{L}}{\partial \boldsymbol{h}} \cdot\left(\frac{\partial \boldsymbol{F}}{\partial \boldsymbol{x}}+\frac{\partial \boldsymbol{x}}{\partial \boldsymbol{x}}\right)=\frac{\partial \mathcal{L}}{\partial \boldsymbol{h}} \cdot \frac{\partial \boldsymbol{F}}{\partial \boldsymbol{x}}+\frac{\partial \mathcal{L}}{\partial \boldsymbol{h}}
$$

This means the gradient $\partial \mathcal{L} / \partial \boldsymbol{h}$ at the block's output is carried back to its input untouched, so [[Vanishing and Exploding Gradients]] become less of a problem. Seen differently, the loss surface has stronger, better-behaved gradients, i.e., it is smoother. (A small numerical sketch of this effect appears at the end of these notes.)

---

## References

1. Lecture 5.5, UvA DL course 2020
2. Li, Xu, Taylor, Studer, Goldstein, "Visualizing the Loss Landscape of Neural Nets", NeurIPS 2018
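
## Sketch: loss along a random direction

A minimal PyTorch sketch (not the authors' code) of the slicing idea behind the visualizations above: perturb trained weights $\theta$ along a random direction $d$ and record the loss at $\theta + \alpha d$. The tiny MLP, the random data, and the per-tensor normalization are simplifying assumptions; Li et al. normalize per filter and use trained networks, and they combine two directions to draw 2-D contour plots.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder "trained" model and data; stand-ins for a real network/dataset.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
x = torch.randn(256, 10)
y = torch.randint(0, 2, (256,))
criterion = nn.CrossEntropyLoss()

# Random direction with the same shapes as the parameters, rescaled so that
# ||d_i|| == ||theta_i|| (a simplified, per-tensor version of the paper's
# filter-wise normalization).
theta = [p.detach().clone() for p in model.parameters()]
direction = [torch.randn_like(p) for p in theta]
direction = [d * (p.norm() / (d.norm() + 1e-10)) for d, p in zip(direction, theta)]

def loss_at(alpha: float) -> float:
    """Loss evaluated at theta + alpha * direction."""
    probe = copy.deepcopy(model)
    with torch.no_grad():
        for p, p0, d in zip(probe.parameters(), theta, direction):
            p.copy_(p0 + alpha * d)
        return criterion(probe(x), y).item()

# One 1-D slice of the loss surface along the chosen direction.
for alpha in torch.linspace(-1.0, 1.0, 21):
    a = alpha.item()
    print(f"alpha={a:+.2f}  loss={loss_at(a):.4f}")
```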
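
## Sketch: gradient flow with and without skip connections

A minimal sketch, assuming a plain tanh MLP stack (not an architecture from the lecture), comparing the gradient norm at the input of a deep network with and without skip connections. It illustrates the identity above: the skip path carries $\partial \mathcal{L} / \partial \boldsymbol{h}$ back untouched, so the input gradient does not vanish with depth.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def input_grad_norm(use_skip: bool, depth: int = 50, width: int = 64) -> float:
    """Gradient norm at the input of a deep tanh stack, with or without skips."""
    layers = [nn.Sequential(nn.Linear(width, width), nn.Tanh()) for _ in range(depth)]
    x = torch.randn(1, width, requires_grad=True)
    h = x
    for layer in layers:
        # With a skip, the identity path adds dL/dh to the gradient untouched.
        h = layer(h) + h if use_skip else layer(h)
    h.sum().backward()
    return x.grad.norm().item()

print("plain:", input_grad_norm(use_skip=False))  # tiny: gradients vanish with depth
print("skip: ", input_grad_norm(use_skip=True))   # order 1: identity path keeps them alive
```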