# Weight Initialization

Weight initialization is tricky because there are a few contradictory requirements:

1. Weights need to be small enough, otherwise the output values explode.
2. Weights need to be large enough, otherwise the signal is too weak to propagate.

Initialize around the origin (0) for symmetric activation functions (tanh, sigmoid):

- In early training this keeps the activations in the near-linear regime of these functions.
- The larger gradients there result in faster training.

![[weight-init.jpg]]

Weights should be initialized to preserve the variance of the activations/gradients, to ensure consistent behaviour across layers. During the forward and backward computations, we want similar input and output variance because of modularity.

Weights must also be initialized to be different from one another. If all weights get the same value (e.g. all 0), all neurons compute the same output and receive the same gradient, so the symmetry is never broken and nothing is learned.

In general, the right weight initialization depends on the non-linearities and on the data normalization. Bad initialization causes real problems, not just a slow initial phase of training.

Initializing the weights of every layer with the same constant variance can make the variance of the activations shrink from layer to layer.

![[weight-init-constant-variance.jpg]]

Initializing the weights of every layer with increasing variance can make the variance of the activations explode.

![[weight-init-increasing-variance.jpg]]

## Initializing weights by preserving variance

For $x$ and $y$ independent:

$$
\operatorname{var}(xy) = \mathbb{E}[x]^{2}\operatorname{var}(y) + \mathbb{E}[y]^{2}\operatorname{var}(x) + \operatorname{var}(x)\operatorname{var}(y)
$$

For a pre-activation $a = w^{\top}x = \sum_{i} w_{i}x_{i}$ with independent terms:

$$
\operatorname{var}(a) = \operatorname{var}\Big(\sum_{i} w_{i}x_{i}\Big) = \sum_{i}\operatorname{var}(w_{i}x_{i}) \approx d \cdot \operatorname{var}(w_{i}x_{i})
$$

$$
\begin{aligned}
\operatorname{var}(w_{i}x_{i}) &= \mathbb{E}[x_{i}]^{2}\operatorname{var}(w_{i}) + \mathbb{E}[w_{i}]^{2}\operatorname{var}(x_{i}) + \operatorname{var}(x_{i})\operatorname{var}(w_{i}) \\
&= \operatorname{var}(x_{i})\operatorname{var}(w_{i})
\end{aligned}
$$

because we assume that $x_{i}$ and $w_{i}$ are zero-mean (e.g. unit Gaussians), so $\mathbb{E}[x_{i}] = \mathbb{E}[w_{i}] = 0$.

So the variance of the activation is $\operatorname{var}(a) \approx d \cdot \operatorname{var}(x_{i})\operatorname{var}(w_{i})$.

Since we want the same input and output variance, we set $\operatorname{var}(a) = \operatorname{var}(x_{i})$:

$$
\operatorname{var}(x_{i}) = d \cdot \operatorname{var}(x_{i})\operatorname{var}(w_{i}) \Rightarrow \operatorname{var}(w_{i}) = \frac{1}{d}
$$

This means we draw the random weights from $w \sim \mathcal{N}(0, 1/d)$, where $d$ is the number of input variables to the layer [[#^4f7184|2]]. This is known as Xavier (Glorot) initialization (Glorot's formulation also comes in a uniform variant with the same variance). It works well for sigmoidal activations.
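This variance argument is easy to check numerically. The sketch below is only a minimal illustration, not from the lecture: it stacks plain linear layers (the near-linear regime the derivation assumes), and the width `d = 512`, the depth, the batch size, and the two badly scaled standard deviations are arbitrary choices. With $\operatorname{var}(w_{i}) = 1/d$ the activation variance stays roughly constant across layers, while a fixed standard deviation that is too small or too large makes it vanish or explode, as in the figures above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth, n = 512, 20, 10_000        # layer width, depth, batch size -- arbitrary illustration values
x0 = rng.standard_normal((n, d))     # unit-Gaussian inputs, var(x) = 1

def activation_variances(weight_std):
    """Stack `depth` linear layers with W ~ N(0, weight_std^2) and return
    the variance of the activations after each layer."""
    x, variances = x0, []
    for _ in range(depth):
        W = rng.standard_normal((d, d)) * weight_std
        x = x @ W
        variances.append(float(x.var()))
    return variances

for label, std in [("1/sqrt(d) (Xavier)", 1 / np.sqrt(d)),
                   ("0.01 (too small)", 0.01),
                   ("0.10 (too large)", 0.10)]:
    v = activation_variances(std)
    print(f"std = {label:20s} layer 1 var: {v[0]:9.3g}   layer {depth} var: {v[-1]:9.3g}")
```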
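A second sketch, again only an illustration with arbitrary sizes, shows why the $1/d$ rule is not enough once the non-linearity is a ReLU: each ReLU zeroes out roughly half of the pre-activations, so the activation variance is roughly halved at every layer, and the $2/d$ variant derived in the next section compensates for exactly that factor.

```python
import numpy as np

rng = np.random.default_rng(1)
d, depth, n = 512, 20, 10_000                     # arbitrary illustration values
x0 = np.maximum(0, rng.standard_normal((n, d)))   # pretend the input comes from a previous ReLU layer

def relu_variances(weight_var):
    """Variance of the ReLU activations after each of `depth` layers,
    with weights drawn from N(0, weight_var)."""
    x, variances = x0, []
    for _ in range(depth):
        W = rng.standard_normal((d, d)) * np.sqrt(weight_var)
        x = np.maximum(0, x @ W)                  # ReLU
        variances.append(float(x.var()))
    return variances

for label, wv in [("1/d (Xavier) ", 1 / d), ("2/d (Kaiming)", 2 / d)]:
    v = relu_variances(wv)
    print(f"var(w) = {label}  layer 1 var: {v[0]:9.3g}   layer {depth} var: {v[-1]:9.3g}")
```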
## Initialization for ReLUs

Unlike sigmoidals, ReLUs return 0 half of the time, so we still have $\mathbb{E}[w_{i}] = 0$, but $\mathbb{E}[x_{i}] \neq 0$.

Redoing the computation, using $\operatorname{var}(X) = \mathbb{E}[X^{2}] - \mathbb{E}[X]^{2}$:

$$
\operatorname{var}(w_{i}x_{i}) = \operatorname{var}(w_{i})\left(\mathbb{E}[x_{i}]^{2} + \operatorname{var}(x_{i})\right) = \operatorname{var}(w_{i})\,\mathbb{E}[x_{i}^{2}]
$$

Let's assume that the input is the output of a previous ReLU layer, $x_{i} = \max(0, a_{i})$, with the pre-activation $a_{i}$ zero-mean and symmetric around 0. Then:

$$
\begin{aligned}
\mathbb{E}[x_{i}^{2}] &= \int_{-\infty}^{\infty} x_{i}^{2}\, p(x_{i})\, dx_{i} = \int_{-\infty}^{\infty} \max(0, a_{i})^{2}\, p(a_{i})\, da_{i} = \int_{0}^{\infty} a_{i}^{2}\, p(a_{i})\, da_{i} \\
&= 0.5 \int_{-\infty}^{\infty} a_{i}^{2}\, p(a_{i})\, da_{i} = 0.5 \cdot \mathbb{E}[a_{i}^{2}] = 0.5 \cdot \operatorname{var}(a_{i})
\end{aligned}
$$

Plugging this back in, $\operatorname{var}(a) \approx d \cdot \operatorname{var}(w_{i}) \cdot 0.5 \cdot \operatorname{var}(a_{i})$, so preserving the variance now requires $\operatorname{var}(w_{i}) = 2/d$.

Therefore, we draw the random weights from $W \sim \mathcal{N}(0, 2/d)$, where $d$ is the number of input variables to the layer [[#^8f4e88|3]]. This is known as Kaiming (He) initialization.

---

## References

1. Lecture 3.2, UvA DL course, 2020
2. Understanding the difficulty of training deep feedforward neural networks, Glorot, Bengio, 2010 ^4f7184
3. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, He, Zhang, Ren, Sun, 2015 ^8f4e88
4. Helpful post: https://towardsdatascience.com/weight-initialization-in-neural-networks-a-journey-from-the-basics-to-kaiming-954fb9b47c79