# High-Dimensional Dot Product Normalization

In high-dimensional spaces, dot products between random vectors naturally grow in magnitude proportionally to $\sqrt{d_k}$, where $d_k$ is the dimension. This can destabilize training in attention mechanisms and other neural networks.

## Mathematical Explanation

Here's the step-by-step breakdown of the variance calculation, assuming the components of $\mathbf{q}$ and $\mathbf{k}$ are independent with zero mean and variance $\sigma^2$:

**The dot product**: $\mathbf{q} \cdot \mathbf{k} = \sum_{i=1}^{d_k} q_i k_i$

**Expected value**:

$E[\mathbf{q} \cdot \mathbf{k}] = E\left[\sum_{i=1}^{d_k} q_i k_i\right] = \sum_{i=1}^{d_k} E[q_i k_i] = \sum_{i=1}^{d_k} E[q_i]E[k_i] = 0$

(since $E[q_i] = E[k_i] = 0$ and the components are independent)

**Variance**:

$\text{Var}(\mathbf{q} \cdot \mathbf{k}) = \text{Var}\left(\sum_{i=1}^{d_k} q_i k_i\right)$

Since the components are independent:

$= \sum_{i=1}^{d_k} \text{Var}(q_i k_i)$

For **independent random variables with zero mean** (similar to the reasoning in [[Weight Initialization#Initializing weights by preserving variance]]):

$\text{Var}(q_i k_i) = E[q_i^2 k_i^2] - (E[q_i k_i])^2 = E[q_i^2]E[k_i^2] - 0$

Since $\text{Var}(q_i) = E[q_i^2] - (E[q_i])^2 = E[q_i^2] = \sigma^2$ (and likewise for $k_i$):

$\text{Var}(q_i k_i) = \sigma^2 \times \sigma^2 = \sigma^4$

Therefore:

$\text{Var}(\mathbf{q} \cdot \mathbf{k}) = \sum_{i=1}^{d_k} \sigma^4 = d_k \sigma^4$

**Standard deviation**: $\text{SD}(\mathbf{q} \cdot \mathbf{k}) = \sqrt{d_k \sigma^4} = \sigma^2\sqrt{d_k}$

## The Solution: Scale by $\frac{1}{\sqrt{d_k}}$

Dividing dot products by $\sqrt{d_k}$ normalizes the variance:

$\text{Scaled dot product} = \frac{\mathbf{q} \cdot \mathbf{k}}{\sqrt{d_k}}$

Because dividing by $\sqrt{d_k}$ divides the variance by $d_k$, the variance stays constant regardless of dimension:

$\text{Var}\left(\frac{\mathbf{q} \cdot \mathbf{k}}{\sqrt{d_k}}\right) = \frac{d_k \sigma^4}{d_k} = \sigma^4$

(equal to $1$ when the components have unit variance)

## Why This Specific Scaling?

- **Scaling too aggressively** (e.g., $\frac{1}{d_k}$): the logits shrink toward zero and attention becomes too uniform
- **Scaling too little** (e.g., a constant factor, or no scaling at all): the logits grow with dimension, the softmax saturates, and gradients vanish
- **Just right** ($\frac{1}{\sqrt{d_k}}$): the logit variance is independent of dimension, which maintains the softmax's sensitivity

## Application in Attention

$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$

This scaling ensures stable training across different attention head dimensions.
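
As a quick sanity check of both the derivation and the attention formula, here is a minimal NumPy sketch. It is illustrative only: the choice of $\sigma$, the sample counts, and the function name `scaled_dot_product_attention` are assumptions for this example, not taken from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0          # assumed component standard deviation
n_samples = 100_000  # number of random (q, k) pairs per dimension

# Empirically verify Var(q.k) ~ d_k * sigma^4 and that dividing by sqrt(d_k)
# keeps the variance constant across dimensions.
for d_k in (16, 64, 256):
    q = rng.normal(0.0, sigma, size=(n_samples, d_k))
    k = rng.normal(0.0, sigma, size=(n_samples, d_k))
    dots = np.einsum("nd,nd->n", q, k)      # raw dot products
    scaled = dots / np.sqrt(d_k)            # scaled dot products
    print(f"d_k={d_k:4d}  Var(q.k)={dots.var():8.2f} (theory {d_k * sigma**4:.0f})"
          f"  Var(q.k / sqrt(d_k))={scaled.var():.2f} (theory {sigma**4:.0f})")


def scaled_dot_product_attention(Q, K, V):
    """Minimal sketch of softmax(Q K^T / sqrt(d_k)) V for 2-D inputs."""
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)                # scaled attention logits
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ V
```

Running the loop should print unscaled variances that track $d_k \sigma^4$ (growing with dimension) and scaled variances that stay near $\sigma^4$, matching the calculation above.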