# Dropout
During training, randomly set activations to 0. Each neuron is dropped independently according to a Bernoulli distribution with dropout rate hyperparameter $p$, e.g. $p=0.5$.
During testing, all neurons are used and their activations are reweighted by $(1-p)$. Why?
During testing and validation all neurons are active. Since each neuron then receives input from all upstream neurons instead of only a fraction $(1-p)$ of them, its input would grow by a factor of $1/(1-p)$ when switching from training to testing. The weights, however, were adapted to the smaller training-time inputs, so the activation might saturate (tanh) or produce significantly larger outputs (ReLU). To avoid this, we can instead rescale the activations already during training, amplifying the activations which do not get dropped by $1/(1-p)$ ("inverted dropout"), so that the input magnitude stays consistent between training and testing.
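A minimal NumPy sketch of this inverted-dropout scheme (function name and interface are illustrative, not from the lecture):

```python
import numpy as np

def dropout_forward(x, p=0.5, training=True):
    """Inverted dropout: rescale surviving activations by 1/(1-p) during
    training, so no reweighting is needed at test time."""
    if not training or p == 0.0:
        return x  # test time: use all activations unchanged
    # Bernoulli mask: keep each activation with probability (1 - p)
    mask = (np.random.rand(*x.shape) >= p).astype(x.dtype)
    return x * mask / (1.0 - p)
```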
Benefits
1. Reduces complex co-adaptations or co-dependencies between neurons.
2. Every neuron becomes more robust, since it cannot rely on specific other neurons always being present.
3. Decreases overfitting.
Effectively, dropout samples a different sub-network for every forward pass during training, so it can be seen as training an implicit ensemble of models that share weights.
Usually dropout is only applied to the fully connected layers, not to convolutional layers.
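A sketch of this placement in PyTorch, assuming 3×32×32 inputs and 10 classes (the architecture itself is illustrative):

```python
import torch.nn as nn

# Dropout only after the fully connected layer, not inside the convolutional part.
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),        # applied only to fully connected activations
    nn.Linear(256, 10),
)
# model.train() enables dropout; model.eval() disables it.
# PyTorch's nn.Dropout uses the inverted-dropout scaling described above.
```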
## Dropout rate
Start with a relatively small rate, like 20-50%.
If too high, your network will underfit.
With dropout you can also try larger neural networks.
## Variational Dropout
![[Variational Dropout.png]]
Gal and Ghahramani (2016) introduce variational RNN dropout:
- uses the same dropout mask at every time step (see the sketch after this list)
- applied both between time steps (recurrent, horizontal connections) and between layers (vertical connections)
- same colors indicate same dropout mask (dashed=no dropout)
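
A rough sketch of the core idea for the recurrent connections, assuming a `torch.nn.RNNCell`-style cell; the helper function is hypothetical, not the paper's code:

```python
import torch

def variational_dropout_rnn(x, rnn_cell, p=0.5):
    """Run an RNN cell over a sequence, reusing ONE dropout mask for the hidden
    state at every time step (variational dropout) instead of resampling per step.
    x: (seq_len, batch, input_size)."""
    batch, hidden_size = x.size(1), rnn_cell.hidden_size
    h = x.new_zeros(batch, hidden_size)
    # Sample the recurrent dropout mask once per sequence (inverted scaling);
    # at test time you would simply skip the mask.
    mask = (torch.rand(batch, hidden_size, device=x.device) >= p).float() / (1.0 - p)
    outputs = []
    for t in range(x.size(0)):
        h = rnn_cell(x[t], h * mask)   # same mask applied to h at every step
        outputs.append(h)
    return torch.stack(outputs)
```

For example, with `cell = torch.nn.RNNCell(16, 32)` and `x = torch.randn(10, 4, 16)`, the call `variational_dropout_rnn(x, cell)` returns a `(10, 4, 32)` tensor of hidden states.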
---
## References
1. Chapter 3.4, UvA DL 2020