# Positional Encoding

Positional encoding adds a time signal to the input. This is necessary because, unlike RNNs, transformers have no recurrence built into the network architecture itself to carry positional information. Position embeddings could be learned, but learned embeddings don't generalize well to positions not seen in the training data.

A first idea would be to simply assign a number to each time-step, i.e. the first element gets 1, the second 2, and so on. The problem is that these values are unbounded and grow with sequence length, so the model struggles with lengths it hasn't seen during training. A second idea could be to add a feature dimension to the embedding that encodes the time-step as a value in the range [0, 1], 0 being the first word and 1 being the last. The problem here is that these time-step deltas don't have a consistent meaning across sentences of different lengths.

The ideal criteria for encoding positional information are:

1. It should output a unique encoding for each time-step.
2. The distance between any two time-steps should be consistent across sequences of different lengths.
3. The model should generalize to longer sentences without any extra effort; the encoding's values should be bounded.
4. It must be deterministic.

The technique proposed in the original paper is to generate, for each element of the sequence, a $d$-dimensional vector $PE$ with $d_{\text{word embedding}} = d_{\text{positional embedding}}$ and add it to the input vectors. It is defined as

$$
PE_{(pos,\, i)} =
\begin{cases}
\sin\left(\dfrac{pos}{10000^{\, i / d_{\text{model}}}}\right) & \text{if } i \bmod 2 = 0 \\[2ex]
\cos\left(\dfrac{pos}{10000^{\, (i-1) / d_{\text{model}}}}\right) & \text{otherwise}
\end{cases}
$$

where $pos$ is the position in the sequence and $i$ is the dimension index. Note that the encoding vectors are not learned as part of the model; they are a fixed function of position.

```python
import math
import torch

# Example sizes; in practice these come from the model configuration
max_len, d_model = 5000, 512

# Create matrix of [SeqLen, HiddenDim] representing the positional encoding for max_len inputs
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions get sine
pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions get cosine
pe = pe.unsqueeze(0)                          # add batch dimension: [1, SeqLen, HiddenDim]

# Add the (fixed) encoding to the input embeddings x of shape [Batch, SeqLen, HiddenDim]
x = x + pe[:, :x.size(1)]
```

This sinusoidal positional encoding, quite amazingly, satisfies all of the ideal criteria. But how do sine and cosine functions represent positions? A nice intuition comes from visualizing the positional vector as the hands of a clock rotating at different frequencies, i.e. the first element is the hour hand, the second the minute hand, the third the second hand, and so on.

![[positional-encoding.svg]]

We can visualize PE vectors by plotting them as below:

![[positional_encoding128.png]]

Another characteristic of sinusoidal position encoding is that it allows the model to attend to relative positions effortlessly, since for any fixed offset $\phi$, $PE_{t+\phi}$ can be represented as a linear function of $PE_t$: for each sine/cosine pair with frequency $\omega_k$, there is a matrix $M$ that depends only on $\phi$ such that

$$
M \cdot
\begin{bmatrix}
\sin(\omega_k \cdot t) \\
\cos(\omega_k \cdot t)
\end{bmatrix}
=
\begin{bmatrix}
\sin(\omega_k \cdot (t + \phi)) \\
\cos(\omega_k \cdot (t + \phi))
\end{bmatrix}
$$

It turns out that $M$ is a rotation matrix.

Yet another nice property of sinusoidal positional encoding is that the distance between neighboring time-steps is symmetric and decays with time, as can be seen below (dot products of positional embeddings).

![[time-steps_dot_product.png]]
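To make these last two properties concrete, here is a minimal numerical sketch (not from the original paper or the referenced post; the sizes, `t`, `phi`, and the explicit rotation-matrix construction are illustrative assumptions). It rebuilds the encoding matrix from the snippet above, checks that a 2×2 rotation by angle $\omega_k \cdot \phi$ maps each $(\sin, \cos)$ pair at position $t$ to the pair at position $t + \phi$, and looks at a few dot products between positions.

```python
import math
import torch

# Rebuild the positional encoding matrix with small, arbitrary sizes
max_len, d_model = 128, 64
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

t, phi = 10, 7  # a position t and a fixed offset phi (illustrative values)
for w in div_term:  # w = omega_k, the frequency of the k-th sine/cosine pair
    # 2x2 rotation by angle w * phi; it depends on phi but not on t
    c, s = math.cos(w * phi), math.sin(w * phi)
    M = torch.tensor([[c, s], [-s, c]])
    pair_t = torch.tensor([math.sin(w * t), math.cos(w * t)])
    pair_t_phi = torch.tensor([math.sin(w * (t + phi)), math.cos(w * (t + phi))])
    assert torch.allclose(M @ pair_t, pair_t_phi, atol=1e-5)

# Dot products of PE vectors depend only on the distance between positions:
# symmetric around each position, and smaller for larger offsets
print((pe[10] @ pe[12]).item(), (pe[10] @ pe[8]).item())   # equal up to float error (symmetry)
print((pe[10] @ pe[11]).item(), (pe[10] @ pe[40]).item())  # nearby > far away (decay)
```

The assertion passes for every frequency because $M$ is exactly the rotation by angle $\omega_k \cdot \phi$, which is also where the "clock hands" intuition comes from: moving forward by $\phi$ positions just rotates each pair of hands by a fixed amount.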
---

## References

1. Positional Encoding by Amirhossein Kazemnejad: https://kazemnejad.com/blog/transformer_architecture_positional_encoding/ - Explains the math and intuition behind positional encoding.