# Grouped Query Attention (GQA)

GQA trades a small amount of model quality for a much smaller [[Autoregressive Generation and KV Caching in Transformers|KV cache]]. In [[Self-Attention Mechanism#Multi Head Attention (MHA)]], each attention head has its own K and V. In GQA, multiple query heads share the same K and V.

```
MHA (8 heads):
Q1→K1,V1  Q2→K2,V2  Q3→K3,V3  Q4→K4,V4
Q5→K5,V5  Q6→K6,V6  Q7→K7,V7  Q8→K8,V8

GQA (8 heads, 2 KV groups):
Q1→K1,V1  Q2→K1,V1  Q3→K1,V1  Q4→K1,V1
Q5→K2,V2  Q6→K2,V2  Q7→K2,V2  Q8→K2,V2
```

Why does it work? The justification is mostly empirical: ablations show no significant quality loss relative to MHA. A common hypothesis is that K and V carry more redundancy across heads than Q. Different query heads can ask different questions ("is this a noun?", "is this related to the subject?"), but they are all querying the same underlying content, so you don't need as many ways to describe that content.

## KV Cache Compression

|Method|KV heads|Cache per token per layer|
|---|---|---|
|MHA|$h$|$h \times d_{head} \times 2$|
|GQA|$g$|$g \times d_{head} \times 2$|

The factor of 2 counts one K and one V vector per head. Compression ratio: $h / g$.

**Popular model configurations:**

|Model|Q heads|KV heads|Ratio|Cache reduction|
|---|---|---|---|---|
|Llama 3 8B|32|8|4:1|75%|
|Llama 3 70B|64|8|8:1|87.5%|
|Llama 3 405B|128|8|16:1|93.75%|

GQA delivers these large cache reductions while ablation studies show quality close to full MHA, which is why it has become the standard for modern LLMs.
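
A minimal PyTorch sketch of the head-sharing diagram above (the function name and tensor shapes are illustrative, not taken from any particular implementation): only the KV heads are computed and cached, and each one is broadcast to its group of query heads before ordinary scaled dot-product attention.

```python
import torch

def grouped_query_attention(q, k, v):
    """Minimal GQA: n_q_heads query heads share n_kv_heads K/V heads."""
    # q: (batch, n_q_heads, seq, d_head); k, v: (batch, n_kv_heads, seq, d_head)
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    group_size = n_q_heads // n_kv_heads  # query heads per KV group

    # Expand each KV head so every query head in its group sees the same K/V.
    k = k.repeat_interleave(group_size, dim=1)  # -> (batch, n_q_heads, seq, d_head)
    v = v.repeat_interleave(group_size, dim=1)

    # From here on it is standard scaled dot-product attention per head.
    d_head = q.shape[-1]
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) / d_head ** 0.5
    weights = scores.softmax(dim=-1)
    return torch.einsum("bhqk,bhkd->bhqd", weights, v)

# The 8-head, 2-group layout from the diagram above.
q = torch.randn(1, 8, 16, 64)   # 8 query heads
k = torch.randn(1, 2, 16, 64)   # only 2 KV heads are computed and cached
v = torch.randn(1, 2, 16, 64)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 8, 16, 64])
```

Setting the number of KV heads equal to the number of query heads recovers MHA; setting it to 1 gives multi-query attention (MQA).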
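
A quick arithmetic check of the cache formula and the 75% figure for a Llama 3 8B-like configuration. The helper is hypothetical; $d_{head} = 128$ and a 2-byte fp16/bf16 cache are assumptions, not stated in this note.

```python
def kv_cache_per_token_per_layer(n_kv_heads: int, d_head: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token per layer: one K and one V vector per KV head."""
    return 2 * n_kv_heads * d_head * bytes_per_value

# Assumed Llama 3 8B-style dims: 32 Q heads, 8 KV heads, d_head = 128, 2-byte values.
mha = kv_cache_per_token_per_layer(n_kv_heads=32, d_head=128)  # 16384 bytes
gqa = kv_cache_per_token_per_layer(n_kv_heads=8, d_head=128)   #  4096 bytes
print(f"reduction: {1 - gqa / mha:.0%}")                       # reduction: 75%
```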