# Beam Decoding
Using the model's own predictions from earlier time steps to predict the next output requires exploring a search space
- This is required when using any sequence-to-sequence model (image captioning, NMT, summarization, ...)
- Greedy decoding suffers from sub-optimal decisions leading to even more sub-optimal decisions later on
- Beam Decoding: explore multiple alternatives in parallel
Example:
![[Beam Decoding process.png]]
- When do we stop? Limit length to $\lfloor r \cdot m\rfloor$, where $m$ is the length of the source sentence and $r$ is some length ratio, e.g., $r=1.2$
- All translation candidates $C(Z)$ are compared and the most likely one is returned
- $\hat{Y}_{\leq t} \in C(Z)$ if $\hat{y}_{t}=\text{</s>}$ or $t=\lfloor r \cdot m\rfloor$
- candidate score $s\left(\hat{Y}_{\leq t} \mid Z\right)=\frac{1}{t} \log p\left(\hat{Y}_{\leq t} \mid Z\right)$, normalized by length so that candidates of different lengths can be compared
- return $\hat{Y}=\underset{\hat{Y}_{\leq t} \in C(Z)}{\operatorname{argmax}} s\left(\hat{Y}_{\leq t} \mid Z\right)$
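A minimal sketch of this candidate bookkeeping in Python; all names are hypothetical, and `step_log_probs` stands in for one decoder step that returns $\log p\left(\hat{y}_{t+1} \mid \hat{Y}_{\leq t}, Z\right)$ for every vocabulary item:

```python
import math

def beam_decode(step_log_probs, bos_id, eos_id, src_len, beam_size=5, length_ratio=1.2):
    """Minimal beam-search sketch. `step_log_probs(prefix)` is a hypothetical hook
    returning {token_id: log p(token | prefix, Z)} from the model for one step."""
    max_len = math.floor(length_ratio * src_len)      # stop at floor(r * m)
    beams = [([bos_id], 0.0)]                         # (prefix, cumulative log prob)
    candidates = []                                   # finished hypotheses C(Z)

    for t in range(1, max_len + 1):
        expansions = []
        for prefix, logp in beams:
            for tok, tok_logp in step_log_probs(prefix).items():
                expansions.append((prefix + [tok], logp + tok_logp))
        expansions.sort(key=lambda x: x[1], reverse=True)
        beams = []
        for prefix, logp in expansions[:beam_size]:
            if prefix[-1] == eos_id or t == max_len:  # hypothesis is complete
                candidates.append((prefix, logp / t)) # s = (1/t) log p, length-normalized
            else:
                beams.append((prefix, logp))
        if not beams:                                 # every surviving beam has finished
            break

    best_prefix, _ = max(candidates, key=lambda x: x[1])   # argmax over C(Z)
    return best_prefix
```

For readability this sketch lets finished hypotheses leave the beam, so the beam can shrink; practical decoders usually refill it with the next-best continuations.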
Fast training is important, but fast beam decoding is critical for actual applications
- Computing $\log p\left(\hat{Y}_{\leq t+1} \mid Z\right)$ directly for each $t$ is inefficient
- but does not require access to relevant network activations
- Better: $\log p\left(\hat{Y}_{\leq t+1} \mid Z\right)=\log p\left(\hat{y}_{t+1} \mid \hat{Y}_{\leq t}, Z\right)+\log p\left(\hat{Y}_{\leq t} \mid Z\right)$
- requires access to the relevant layer activations from $\mathrm{net}_{t}$ to compute $\mathrm{net}_{t+1}$
- e.g., to compute $\mathrm{LSTM}_{t+1}$ the activations $\left(\mathbf{h}_{t}, \mathbf{c}_{t}\right)$ have to be exposed (not all packages directly expose LSTM cells)
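A sketch of this incremental scoring with PyTorch's `nn.LSTMCell`, which does expose $(\mathbf{h}_{t}, \mathbf{c}_{t})$; the toy decoder (embedding, single cell, output projection) and all sizes and token ids are assumptions for illustration:

```python
import torch
import torch.nn as nn

# Assumed toy decoder: embedding -> single LSTMCell -> output projection
vocab_size, emb_dim, hid_dim = 1000, 32, 64
embed = nn.Embedding(vocab_size, emb_dim)
cell = nn.LSTMCell(emb_dim, hid_dim)
proj = nn.Linear(hid_dim, vocab_size)

def decode_step(prev_token, state):
    """One incremental step: returns log p(. | Y_<=t, Z) and the new exposed (h, c)."""
    h, c = cell(embed(prev_token), state)
    return torch.log_softmax(proj(h), dim=-1), (h, c)

# log p(Y_<=t+1 | Z) = log p(y_t+1 | Y_<=t, Z) + log p(Y_<=t | Z):
# carry (h_t, c_t) between steps so earlier time steps are never recomputed.
state = (torch.zeros(1, hid_dim), torch.zeros(1, hid_dim))  # (h_0, c_0), e.g. from the encoder
prefix_logp, prev = 0.0, torch.tensor([1])                  # 1 = assumed <s> token id
for tok in [5, 17, 42]:                                     # arbitrary continuation token ids
    logp, state = decode_step(prev, state)
    prefix_logp += logp[0, tok].item()                      # add log p(y_t+1 | Y_<=t, Z)
    prev = torch.tensor([tok])
```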
- Beam decoding is always slower than training due to frequent interactions with the CPU
- Beam decoding performs better than greedy decoding
- beam sizes beyond 25 often lead to decreases in quality
- commonly used beam sizes vary between 5 and 12
- Beams can be represented as entries (rows) in a batch, so one decoder forward pass advances and scores all beams at once (see the sketch below)
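A sketch of the batching idea; the toy sizes and single `nn.LSTMCell` decoder are assumptions. Each row of the batch holds one beam's last token, hidden state, and running score; after one batched step, the surviving rows are re-gathered with `beam_idx`:

```python
import torch
import torch.nn as nn

# Assumed toy setup: each of the k beams occupies one row of the batch,
# so a single decoder step advances and scores all beams at once.
beam_size, emb_dim, hid_dim, vocab_size = 5, 32, 64, 1000
embed = nn.Embedding(vocab_size, emb_dim)
cell = nn.LSTMCell(emb_dim, hid_dim)
proj = nn.Linear(hid_dim, vocab_size)

prev_tokens = torch.zeros(beam_size, dtype=torch.long)  # last token of each beam (one row each)
h = torch.zeros(beam_size, hid_dim)                     # one (h, c) row per beam
c = torch.zeros(beam_size, hid_dim)
beam_scores = torch.zeros(beam_size)                    # running log p of each beam

# one batched step for all beams
h, c = cell(embed(prev_tokens), (h, c))
logp = torch.log_softmax(proj(h), dim=-1)               # (beam_size, vocab_size)
total = beam_scores.unsqueeze(1) + logp                 # expand every beam by every token
top_scores, flat_idx = total.view(-1).topk(beam_size)   # keep the k best expansions overall
beam_idx = flat_idx // vocab_size                       # which beam each survivor came from
token_idx = flat_idx % vocab_size                       # which token it was extended with
h, c = h[beam_idx], c[beam_idx]                         # re-gather state rows to match survivors
prev_tokens, beam_scores = token_idx, top_scores
```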