# Grouped Query Attention (GQA)
GQA trades a small amount of model quality for a much smaller [[Autoregressive Generation and KV Caching in Transformers|KV cache]]. In [[Self-Attention Mechanism#Multi Head Attention (MHA)]], each attention head has its own K and V; in GQA, a group of query heads shares a single K and V.
```
MHA (8 heads):
Q1→K1,V1 Q2→K2,V2 Q3→K3,V3 Q4→K4,V4 Q5→K5,V5 Q6→K6,V6 Q7→K7,V7 Q8→K8,V8
GQA (8 heads, 2 KV groups):
Q1→K1,V1 Q2→K1,V1 Q3→K1,V1 Q4→K1,V1 Q5→K2,V2 Q6→K2,V2 Q7→K2,V2 Q8→K2,V2
```
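To make the head mapping concrete, here is a minimal sketch of a GQA forward pass in PyTorch (function name and shapes are illustrative, not tied to any particular implementation): each KV head is repeated so a whole group of query heads attends to the same keys and values, while only the unrepeated K and V would ever need to be cached.
```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v, n_heads, n_kv_heads):
    """q: (batch, seq, n_heads, d_head); k, v: (batch, seq, n_kv_heads, d_head)."""
    group_size = n_heads // n_kv_heads          # query heads per KV group
    # Replicate each KV head so every query head in a group sees the same K/V.
    k = k.repeat_interleave(group_size, dim=2)  # -> (batch, seq, n_heads, d_head)
    v = v.repeat_interleave(group_size, dim=2)

    # Standard scaled dot-product attention over the expanded heads.
    q = q.transpose(1, 2)                       # (batch, n_heads, seq, d_head)
    k = k.transpose(1, 2)
    v = v.transpose(1, 2)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2)                  # back to (batch, seq, n_heads, d_head)

# Illustrative shapes: 8 query heads sharing 2 KV groups, as in the diagram above.
batch, seq, d_head = 1, 16, 64
q = torch.randn(batch, seq, 8, d_head)
k = torch.randn(batch, seq, 2, d_head)
v = torch.randn(batch, seq, 2, d_head)
print(gqa_attention(q, k, v, n_heads=8, n_kv_heads=2).shape)  # torch.Size([1, 16, 8, 64])
```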
Why does it work? There is no solid theory; the justification is mostly empirical, with ablations showing little quality loss.
A common hypothesis goes like this: K and V have more redundancy across heads than Q. Different query heads can ask different questions ("is this a noun?", "is this related to the subject?"), but they all query the same underlying content, so you don't need as many ways to describe that content.
## KV Cache Compression
|Method|KV heads|Cache per token per layer (elements)|
|---|---|---|
|MHA|$h$|$h \times d_{head} \times 2$|
|GQA|$g$|$g \times d_{head} \times 2$|
Compression ratio: $h / g$ (the factor of 2 accounts for storing both K and V).
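As a quick sanity check, here is the table's formula as a tiny calculation (head counts and head dimension are illustrative; assumes an fp16 cache at 2 bytes per element):
```python
def kv_cache_per_token_per_layer(n_kv_heads: int, d_head: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache per token per layer (x2 for K and V)."""
    return n_kv_heads * d_head * 2 * bytes_per_elem

h, g, d_head = 8, 2, 64                  # MHA heads vs. GQA KV groups (illustrative)
mha = kv_cache_per_token_per_layer(h, d_head)
gqa = kv_cache_per_token_per_layer(g, d_head)
print(mha, gqa, mha / gqa)               # 2048 512 4.0 -> compression ratio h / g
```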
**Popular model configurations:**
|Model|Q heads|KV heads|Ratio|Cache reduction|
|---|---|---|---|---|
|Llama 3 8B|32|8|4:1|75%|
|Llama 3 70B|64|8|8:1|87.5%|
|Llama 3 405B|128|8|16:1|93.75%|
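To see what these ratios mean in practice, here is a rough estimate of the full-model KV cache for the Llama 3 8B row (32 layers and $d_{head} = 128$ are assumed from the published config; fp16 cache and 8192-token context are illustrative choices):
```python
layers, d_head, bytes_per_elem, context = 32, 128, 2, 8192

def full_cache_gib(n_kv_heads: int) -> float:
    # Per-token cache (K and V, all layers), scaled to the whole context window.
    per_token = n_kv_heads * d_head * 2 * bytes_per_elem * layers
    return per_token * context / 2**30

print(f"MHA (32 KV heads): {full_cache_gib(32):.1f} GiB")  # 4.0 GiB
print(f"GQA ( 8 KV heads): {full_cache_gib(8):.1f} GiB")   # 1.0 GiB -> 75% smaller
```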
GQA delivers a large cache reduction while ablation studies show quality close to full MHA, which is why it has become the standard for modern LLMs.