# Rotary Position Embeddings (RoPE)
Instead of adding [[Positional Encoding]] vectors to the input to encode token position, RoPE encodes position by rotating the Q and K vectors. Each position $i$ has a rotation matrix $R_i$:
$Q_i = (X_i W_Q) R_i, \qquad K_j = (X_j W_K) R_j$
When computing attention between positions $i$ and $j$:
$Q_i K_j^T = (X_i W_Q) R_i R_j^T (X_j W_K)^T$
The key property: $R_i R_j^T$ depends only on the relative position $(i - j)$, not absolute positions. This comes from trig identities—terms like $\cos(i\theta)\cos(j\theta) + \sin(i\theta)\sin(j\theta)$ simplify to $\cos((i-j)\theta)$.
**What's the rotation?**
RoPE rotates pairs of dimensions. For a 2D case at position $i$:
$R_i = \begin{pmatrix} \cos(i\theta) & -\sin(i\theta) \\ \sin(i\theta) & \cos(i\theta) \end{pmatrix}$
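Multiplying out the 2D case with the trig identities mentioned above makes the relative-position property explicit:
$R_i R_j^T = \begin{pmatrix} \cos(i\theta) & -\sin(i\theta) \\ \sin(i\theta) & \cos(i\theta) \end{pmatrix} \begin{pmatrix} \cos(j\theta) & \sin(j\theta) \\ -\sin(j\theta) & \cos(j\theta) \end{pmatrix} = \begin{pmatrix} \cos((i-j)\theta) & -\sin((i-j)\theta) \\ \sin((i-j)\theta) & \cos((i-j)\theta) \end{pmatrix} = R_{i-j}$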
For higher dimensions, the vector is split into 2D pairs, each rotated with a different frequency $\theta_k$, allowing the model to capture both fine-grained (high-frequency) and long-range (low-frequency) positional information.
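As a concrete illustration, here is a minimal NumPy sketch of this per-pair rotation. The frequency schedule $\theta_k = 10000^{-2k/d}$ is the convention from the original RoPE paper; the function name, shapes, and pairing of adjacent dimensions are illustrative assumptions, not any particular library's API.

```python
import numpy as np

def rope_rotate(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embeddings to x of shape (seq_len, d), with d even.

    Each consecutive pair of dimensions (2k, 2k+1) at position i is rotated
    by angle i * theta_k, where theta_k = base ** (-2k / d).
    """
    seq_len, d = x.shape
    assert d % 2 == 0, "embedding dimension must be even"

    # One frequency per 2D pair: theta_k = base^(-2k/d), k = 0 .. d/2 - 1
    theta = base ** (-np.arange(0, d, 2) / d)            # (d/2,)
    angles = np.arange(seq_len)[:, None] * theta[None]   # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)

    x_even, x_odd = x[:, 0::2], x[:, 1::2]               # each (seq_len, d/2)
    out = np.empty_like(x)
    # 2D rotation of each (even, odd) pair by its position-dependent angle
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out
```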
RoPE applies an absolute rotation per position, but the dot product captures relative distance—unifying absolute and relative approaches without explicitly computing or storing relative biases.
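A quick numerical check of this property, continuing from the `rope_rotate` sketch above (the `rotate_at` helper and the specific positions are hypothetical, purely for illustration): rotating the same content vectors at different absolute positions but the same relative offset gives the same dot product.

```python
rng = np.random.default_rng(0)

def rotate_at(vec: np.ndarray, pos: int) -> np.ndarray:
    """Rotate a single vector as if it sat at position `pos` in a sequence."""
    padded = np.zeros((pos + 1, vec.shape[0]))
    padded[pos] = vec
    return rope_rotate(padded)[pos]

q_vec = rng.normal(size=64)  # stands in for X_i W_Q
k_vec = rng.normal(size=64)  # stands in for X_j W_K

# Same relative offset (i - j = 3) at different absolute positions:
s1 = rotate_at(q_vec, 7)  @ rotate_at(k_vec, 4)
s2 = rotate_at(q_vec, 20) @ rotate_at(k_vec, 17)
print(np.isclose(s1, s2))  # True: the score depends only on i - j
```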