# Transformers

[[Scaling Transformers]]

Sequence-to-sequence models rely on an [[Recurrent Neural Networks (RNN)]] encoder to generate a fixed-size latent vector and a decoder to generate the corresponding output sequence one timestep at a time. This works well, but has the following problems:

1. RNNs are slow to train, since they have to use [[Backpropagation Through Time]] ([[LSTM]] cells are even slower)
2. Difficulty with long sequences ([[Vanishing and Exploding Gradients]], long-term dependencies, not truly bidirectional)

In the Transformer there is no concept of timesteps for the input, as it gets rid of the recurrence mechanism altogether. The whole sequence is passed in parallel, and the notion of sequence order is provided by [[Positional Encoding]]. This allows for massive parallelization, and long-term dependencies can be captured with no problem.

## The Transformer Block

The transformer block applies, in sequence: an [[Attention Mechanism#Scaled Dot Product Attention]] layer, [[Layer Normalization]], a feed-forward layer (a single MLP applied independently to each vector/element of the sequence), and another layer normalization. Residual connections are added before normalization. Normalization and residual connections are the standard methods for scaling deep neural networks by stabilizing the gradients. Why they help: https://arxiv.org/abs/2305.02582

Layer normalization is used instead of batch normalization because in NLP batches can vary in size, as sequences can be of different lengths.

![[transformer-block.png]]

Transformer blocks can be stacked to create depth. (A minimal code sketch of a block is included at the end of this note.)

## Transformer Decoder

- The transformer decoder architecture is very similar to the encoder, but
    - it needs to integrate decoder-encoder attention
    - self-attention has to be limited to previous time steps using masking (see the mask sketch at the end of this note)

![[Transformer Decoder.png]]

## Training Transformers

- Training transformer models is non-trivial
    - training easily diverges, leading to non-recoverable loss explosions

Tips and tricks:

- Batch size: use large batch sizes of at least 4,000 tokens ($\approx 160$ sentences); bigger is better, e.g., 25,000 tokens ($\approx 1000$ sentences)
- Learning rate: use a learning rate schedule with an increasing and then decreasing rate (a code sketch is included at the end of this note):
    - $\text{lr} = f_{\text{base}} \times f_{\text{warmup}}(t, w) \times f_{\text{rsqrt-decay}}(t, w) \times f_{\text{model-size}}(m)$
    - $f_{\text{base}}$ = pre-defined constant, e.g., $2.0$
    - $f_{\text{warmup}}(t, w) = \min(1.0, t / w)$, where $t$ is the current update step and $w$ is the number of (pre-defined) warm-up steps
    - $f_{\text{rsqrt-decay}}(t, w) = \max(t, w)^{-0.5}$
    - $f_{\text{model-size}}(m) = m^{-0.5}$, where $m$ is the size of the hidden layers

![[Transformer LR schedule.png]]

## Challenges

1. Quadratic cost in input length
    1. Transformer-XL: divide sequences hierarchically
    2. Sparse Transformers: self-attention on specific pairs only
2. Memory intensive
    1. Half-precision variables (except for the loss)
    2. Gradient accumulation
    3. Gradient checkpointing

---

## References

1. Tutorial 6, UvA DL course 2020 - https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html - Code snippets for attention, positional encoding
2. Transformer Neural Networks video by CodeEmporium - https://www.youtube.com/watch?v=TQQlZhbC5ps - High-level overview of the main ideas of the Transformer
3. The Illustrated Transformer - https://jalammar.github.io/illustrated-transformer/ - A great introduction to the Transformer with helpful illustrations to build intuition
4. Transformers from scratch - http://www.peterbloem.nl/blog/transformers - Extremely good post that goes into much more depth than [3]
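
---

A minimal sketch of the encoder block described above, assuming PyTorch. The class name `TransformerBlock` and the hyperparameter values are illustrative, not taken from a specific implementation; the layer order follows the block description (attention, add & norm, feed-forward, add & norm).

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Self-attention -> add & norm -> position-wise feed-forward -> add & norm."""

    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        # Multi-head scaled dot-product attention over the whole sequence.
        self.attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        # Feed-forward layer applied independently to each position of the sequence.
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + self.dropout(attn_out))    # residual added before normalization
        x = self.norm2(x + self.dropout(self.ff(x)))  # residual added before normalization
        return x

# Blocks can be stacked to create depth; the input would normally be token
# embeddings plus positional encodings.
encoder = nn.Sequential(*[TransformerBlock(d_model=512, num_heads=8, d_ff=2048) for _ in range(6)])
x = torch.randn(2, 10, 512)   # (batch, seq_len, d_model)
y = encoder(x)                # same shape: (2, 10, 512)
```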
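
The decoder's restriction of self-attention to previous time steps can be implemented with a causal (look-ahead) mask. A sketch, again assuming PyTorch, where `True` marks positions that must not be attended to:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True above the diagonal: position i may only attend to positions <= i.
    return torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

print(causal_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
# Passed as `attn_mask` to nn.MultiheadAttention, the True positions receive -inf
# before the softmax, so each step only sees previous time steps.
```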
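
A sketch of the learning-rate schedule above (linear warm-up followed by inverse-square-root decay, scaled by model size). The factor names mirror the formulas in the note; the default values for `warmup_steps`, `hidden_size`, and `base` are only example settings:

```python
def transformer_lr(step: int, warmup_steps: int = 4000, hidden_size: int = 512, base: float = 2.0) -> float:
    """lr = f_base * f_warmup * f_rsqrt_decay * f_model_size"""
    step = max(step, 1)                              # avoid a zero rate at step 0
    f_warmup = min(1.0, step / warmup_steps)         # linear warm-up to 1.0
    f_rsqrt_decay = max(step, warmup_steps) ** -0.5  # constant during warm-up, then 1/sqrt(step)
    f_model_size = hidden_size ** -0.5               # larger models get a smaller learning rate
    return base * f_warmup * f_rsqrt_decay * f_model_size

# The rate rises linearly for `warmup_steps` updates and then decays:
for step in (100, 4000, 16000, 64000):
    print(step, round(transformer_lr(step), 6))
```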