# RMSNorm

[Zhang et al. 2019](https://arxiv.org/abs/1910.07467) proposed RMSNorm as a more efficient alternative to LayerNorm by focusing only on re-scaling invariance and regularizing the summed inputs simply according to the root mean square (RMS) statistic:

$$\bar{a}_i=\frac{a_i}{\operatorname{RMS}(\mathbf{a})} g_i, \quad \text{where } \operatorname{RMS}(\mathbf{a})=\sqrt{\frac{1}{n} \sum_{i=1}^n a_i^2}.$$

For comparison, [[Layer Normalization]] is defined as

$$\bar{a}_i=\frac{a_i-\mu}{\sigma} g_i, \quad y_i=f\left(\bar{a}_i+b_i\right), \quad \text{where } \mu=\frac{1}{n} \sum_{i=1}^n a_i, \quad \sigma=\sqrt{\frac{1}{n} \sum_{i=1}^n\left(a_i-\mu\right)^2}.$$

Intuitively, RMSNorm simplifies LayerNorm by removing the mean statistic entirely, at the cost of sacrificing the invariance that mean normalization affords. When the mean of the summed inputs is zero, RMSNorm is exactly equal to LayerNorm. Experiments on several NLP tasks show that RMSNorm is comparable to LayerNorm in quality, but **accelerates the running speed**. Actual speed improvements depend on the framework, hardware, neural network architecture, and the relative computational cost of other components; the paper reports empirical speedups of $7\% \sim 64\%$ across different models and implementations. The efficiency improvement comes from simplifying the computation.
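To make the definition concrete, here is a minimal PyTorch sketch of an RMSNorm layer. The class name `RMSNorm` and the `eps` stabilizer are illustrative assumptions, not taken from the paper; adding a small epsilon before the square root is a common implementation detail to avoid division by zero.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Sketch of RMSNorm: rescale by the root mean square, then apply a learned gain g."""

    def __init__(self, dim: int, eps: float = 1e-8):
        super().__init__()
        self.eps = eps                           # small constant for numerical stability (assumed detail)
        self.g = nn.Parameter(torch.ones(dim))   # learnable gain g_i, initialized to 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # RMS(a) = sqrt(mean(a_i^2)) over the feature (last) dimension
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return x / rms * self.g                  # a_i / RMS(a) * g_i; no mean subtraction, no bias


# Usage: normalize a batch of 512-dimensional activations
x = torch.randn(4, 512)
y = RMSNorm(512)(x)
print(y.shape)  # torch.Size([4, 512])
```

Note that, unlike LayerNorm, there is no mean computation, no mean subtraction, and no bias term in this sketch, which is exactly the simplification the speedup comes from.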