# High-Dimensional Dot Product Normalization
In high-dimensional spaces, the typical magnitude (standard deviation) of the dot product between two random vectors grows proportionally to $\sqrt{d_k}$, where $d_k$ is the dimension. Left uncorrected, these large values can destabilize training in attention mechanisms and other neural networks.
## Mathematical Explanation
Here's the step-by-step breakdown of the variance calculation:
**The dot product**: $\mathbf{q} \cdot \mathbf{k} = \sum_{i=1}^{d_k} q_i \times k_i$
**Expected value**:
$E[\mathbf{q} \cdot \mathbf{k}] = E\left[\sum_{i=1}^{d_k} q_i k_i\right] = \sum_{i=1}^{d_k} E[q_i k_i] = \sum_{i=1}^{d_k} E[q_i]E[k_i] = 0$
(since $E[q_i] = E[k_i] = 0$ and components are independent)
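This step can be checked numerically. The following is a minimal NumPy sketch (the dimension, sample count, and standard normal components are arbitrary illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n_samples = 64, 200_000  # illustrative choices

# i.i.d. components with zero mean (standard normal here)
q = rng.standard_normal((n_samples, d_k))
k = rng.standard_normal((n_samples, d_k))
dots = (q * k).sum(axis=1)  # one dot product per row

print(f"empirical E[q·k] = {dots.mean():+.4f}   (theory: 0)")
```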
**Variance**:
$\text{Var}(\mathbf{q} \cdot \mathbf{k}) = \text{Var}\left(\sum_{i=1}^{d_k} q_i k_i\right)$
Since components are independent:
$= \sum_{i=1}^{d_k} \text{Var}(q_i k_i)$
For **independent random variables with zero mean** (similar to reasoning in [[Weight Initialization#Initializing weights by preserving variance]]):
$\text{Var}(q_i k_i) = E[q_i^2 k_i^2] - (E[q_i k_i])^2 = E[q_i^2]E[k_i^2] - 0$
Since $\text{Var}(q_i) = E[q_i^2] - (E[q_i])^2 = E[q_i^2] = \sigma^2$:
$\text{Var}(q_i k_i) = \sigma^2 \times \sigma^2 = \sigma^4$
Therefore:
$\text{Var}(\mathbf{q} \cdot \mathbf{k}) = \sum_{i=1}^{d_k} \sigma^4 = d_k \sigma^4$
**Standard deviation**:
$\text{SD}(\mathbf{q} \cdot \mathbf{k}) = \sqrt{d_k \sigma^4} = \sigma^2\sqrt{d_k}$
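The variance and standard deviation results can be verified the same way. The sketch below is again a NumPy simulation, with arbitrary choices of $\sigma$, $d_k$, and sample count:

```python
import numpy as np

rng = np.random.default_rng(1)
d_k, sigma, n_samples = 128, 0.5, 200_000  # illustrative choices

# i.i.d. components with mean 0 and standard deviation sigma
q = sigma * rng.standard_normal((n_samples, d_k))
k = sigma * rng.standard_normal((n_samples, d_k))
dots = (q * k).sum(axis=1)

print(f"empirical Var = {dots.var():7.3f}   theory d_k*sigma^4       = {d_k * sigma**4:7.3f}")
print(f"empirical SD  = {dots.std():7.3f}   theory sigma^2*sqrt(d_k) = {sigma**2 * np.sqrt(d_k):7.3f}")
```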
## The Solution: Scale by $\frac{1}{\sqrt{d_k}}$
Dividing dot products by $\sqrt{d_k}$ normalizes the variance:
$\text{Scaled dot product} = \frac{\mathbf{q} \cdot \mathbf{k}}{\sqrt{d_k}}$
This keeps the variance constant regardless of dimension:
$\text{Var}\left(\frac{\mathbf{q} \cdot \mathbf{k}}{\sqrt{d_k}}\right) = \frac{d_k \sigma^4}{d_k} = \sigma^4$
For unit-variance components ($\sigma = 1$), the scaled dot product has variance 1 for any $d_k$.
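To see the effect across dimensions, the NumPy sketch below (dimensions and unit-variance components chosen purely for illustration) compares the unscaled variance, which grows linearly with $d_k$, against the scaled variance, which stays near $\sigma^4 = 1$:

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples = 20_000  # illustrative choice; sigma = 1 below

for d_k in (16, 64, 256, 1024):
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = (q * k).sum(axis=1)
    scaled = dots / np.sqrt(d_k)  # the proposed normalization
    print(f"d_k={d_k:5d}  Var(q·k)={dots.var():8.2f}  Var(q·k/sqrt(d_k))={scaled.var():.3f}")
```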
## Why This Specific Scaling?
- **Scaling too aggressively** (e.g., dividing by $d_k$): the logits shrink as the dimension grows, so the softmax output becomes too uniform and attention loses its ability to focus
- **Not scaling with dimension** (e.g., dividing by a constant): the logits grow with $d_k$, the softmax saturates toward a one-hot distribution, and its gradients vanish
- **Just right** ($\frac{1}{\sqrt{d_k}}$): the logit variance is independent of $d_k$, keeping the softmax in a sensitive, well-conditioned regime (see the sketch after this list)
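The softmax behavior behind these points can be illustrated directly. In the NumPy sketch below, the dimension, sequence length, and standard normal inputs are arbitrary assumptions, and the softmax is a plain hand-rolled version rather than any particular library's:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
d_k, seq_len = 512, 10  # illustrative choices
q = rng.standard_normal(d_k)
K = rng.standard_normal((seq_len, d_k))
logits = K @ q  # raw dot products with variance ~ d_k

for name, scale in [("1/d_k   (too uniform)", 1.0 / d_k),
                    ("1       (saturates)  ", 1.0),
                    ("1/sqrt(d_k)          ", 1.0 / np.sqrt(d_k))]:
    w = softmax(logits * scale)
    print(f"scale {name} -> max weight {w.max():.3f}, min weight {w.min():.2e}")
```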
## Application in Attention
$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$
This scaling ensures stable training across different attention head dimensions.
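As a final illustration, here is a minimal single-head NumPy sketch of the attention formula above (no batching or masking, which the formula here does not include):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output of shape (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # scaled logits, variance independent of d_k
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

# Tiny usage example with random inputs
rng = np.random.default_rng(4)
Q = rng.standard_normal((3, 64))
K = rng.standard_normal((5, 64))
V = rng.standard_normal((5, 32))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 32)
```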