# Normalization

Activation functions are usually "centered" around 0. As activations propagate to the next layer, the mean value stays roughly 0, so there is no shift with depth. Centering around 0 also matters for training, since the strongest gradients are often around $x=0$.

Assume the input variables follow a (roughly) Gaussian distribution:

- Subtract the mean from the input
- Optionally, divide by the standard deviation

$$
N\left(\mu, \sigma^{2}\right) \rightarrow N(0,1)
$$

But what do we do for intermediate layers?

![[normalization.png]]
[Image Source](https://mlexplained.com/2018/11/30/an-overview-of-normalization-methods-in-deep-learning/)

## Batch normalization

Input distributions change per layer, especially during training.

- Normalize the layer inputs with batch normalization
- Normalize $a_{l} \sim N(0,1)$
- Follow with an affine transformation $a_{l} \leftarrow \gamma a_{l}+\beta$, where the parameters $\gamma$ and $\beta$ are trainable

With $i$ running over the mini-batch samples and $j$ over the feature dimensions:

$$
\begin{aligned}
\mu_{j} &\leftarrow \frac{1}{m} \sum_{i=1}^{m} x_{i j} \\
\sigma_{j}^{2} &\leftarrow \frac{1}{m} \sum_{i=1}^{m}\left(x_{i j}-\mu_{j}\right)^{2} \\
\hat{x}_{i j} &\leftarrow \frac{x_{i j}-\mu_{j}}{\sqrt{\sigma_{j}^{2}+\varepsilon}} \\
\hat{x}_{i j} &\leftarrow \gamma \hat{x}_{i j}+\beta
\end{aligned}
$$

### Why does it work?

Covariate shift: per gradient update, a module must adapt its weights to better fit the data, but also adapt to the change in its input distribution. Remember, each module's inputs depend on other parameterized modules.

![[batch-norm.jpg]]

This interpretation doesn't explain two practical observations:

1. Why does batch norm work better after the nonlinearity?
2. Why have $\gamma$ and $\beta$ parameters to reshape our Gaussian if the problem is the covariate shift? The original reasoning is that this gives the choice to the model itself.

There is another interpretation: batch norm simplifies the learning dynamics.

- Neural network outputs are determined by higher-order interactions between layers
- These higher-order interactions complicate the gradient update
- The mean of the BatchNorm output is $\beta$ and its std is $\gamma$, independent of the activation values themselves
- With the higher-order interactions suppressed, training becomes easier

### The benefits

1. Higher learning rates, which means faster training
2. Neurons of all layers are activated in a near-optimal "regime"
3. Model regularization
    1. The per-mini-batch mean and variance add some noise
    2. The added noise reduces overfitting

### Test inference

How do we ship the BatchNorm layer after training? We might not have batches at test time.

Usually: keep a moving average of the mean and variance during training and plug them in at test time. In the limit, the moving average of the mini-batch statistics approaches the statistics of the full training set.
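As a concrete illustration, here is a minimal NumPy sketch of the batch-norm computation above, including the moving averages used at test time. This is not the lecture's code: the function name `batchnorm_forward`, the `momentum` value, and the toy shapes are illustrative assumptions.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      training=True, momentum=0.1, eps=1e-5):
    """Batch norm for x of shape (m, d): i runs over samples, j over features."""
    if training:
        mu = x.mean(axis=0)            # mu_j: per-feature mean over the mini-batch
        var = x.var(axis=0)            # sigma_j^2: per-feature variance
        # Moving averages, kept for test-time inference (momentum is a design choice)
        running_mean = (1 - momentum) * running_mean + momentum * mu
        running_var = (1 - momentum) * running_var + momentum * var
    else:
        mu, var = running_mean, running_var   # plug in the stored statistics
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalize to roughly N(0, 1)
    out = gamma * x_hat + beta                # trainable affine transformation
    return out, running_mean, running_var

# Toy usage: a mini-batch of m = 32 samples with d = 8 features
d = 8
gamma, beta = np.ones(d), np.zeros(d)
running_mean, running_var = np.zeros(d), np.ones(d)
x = 3.0 * np.random.randn(32, d) + 2.0        # deliberately not N(0, 1)
out, running_mean, running_var = batchnorm_forward(
    x, gamma, beta, running_mean, running_var, training=True)
print(out.mean(axis=0).round(2), out.std(axis=0).round(2))  # per-feature mean ~0, std ~1
```

Calling it with `training=False` reuses the stored moving averages instead of the mini-batch statistics, which is the test-time recipe described above.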
### Disadvantages

- Requires large mini-batches
- Cannot work with a mini-batch of size $1$ ($\sigma=0$)
- And for small mini-batches we don't get very accurate gradients anyway
- Awkward to use with recurrent neural networks
    - Must interleave it between recurrent layers
    - Also, must store statistics per time step
- Can cause gradient explosion as well: https://arxiv.org/abs/2304.11692

## Layer Normalization

[[Layer Normalization]]

## Instance Normalization

- Similar to layer normalization, but per channel and per training example
- Basic idea: the network should be agnostic to the contrast of the original image
- Originally proposed for style transfer
- Not as good for image classification

## Group Normalization

- Same as instance norm, but over groups of channels
- Sits between layer normalization and instance normalization
- Better than batch normalization for small batches (e.g. < 32); competitive for larger batches
- Useful for object detection/segmentation networks, which rely on high-resolution images and cannot have big mini-batches

## Weight Normalization

Instead of normalizing activations, normalize the weights by re-parameterizing them:

$$
\boldsymbol{w}=g \frac{\boldsymbol{v}}{\|\boldsymbol{v}\|}
$$

This separates the norm from the direction, similar to dividing by the standard deviation in batch normalization.

Can be combined with mean-only batch normalization: subtract the mean (but do not divide by the standard deviation), then apply weight normalization.

---

## References

1. Lecture 3.3, UvA DL 2020