# BLEU
- Bilingual Evaluation Understudy (BLEU), introduced by Papineni et al. (2002)
- BLEU is a precision-oriented metric
- How many of the n-grams occurring in the translation occur in any of the reference translations?
- Considers n-grams of several lengths, typically 1 to 4-grams
- N-gram precision: $p(i)=\frac{\text { correct }_{i}}{\text { total }_{i}}$
- For each n-gram occurrence $\bar{w}_{i}$ of order $i$ in the translation:
	- $\text{correct}_{i}$++ and $\text{total}_{i}$++ if $\bar{w}_{i}$ occurs in any of the reference translations
	- $\text{total}_{i}$++ otherwise
- Brevity penalty (BP): Compensates for the tendency of shorter translations having higher precisions
$
\operatorname{BP}\left(l_{t}, l_{r}\right)=\begin{cases}
1 & \text { if } l_{t} \geq l_{r} \\
\exp \left(1-\frac{l_{r}}{l_{t}}\right) & \text { if } l_{t}<l_{r}
\end{cases}
$
where $l_{t}$ is the length of the translation and $l_{r}$ the length of the reference.
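The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a full BLEU implementation; the tokenized sentences at the bottom are hypothetical examples, not the references from the slides.

```python
import math

def ngram_precision(candidate, references, n):
    """Unclipped n-gram precision: the fraction of candidate n-grams
    that occur in at least one reference translation."""
    cand_ngrams = [tuple(candidate[j:j + n]) for j in range(len(candidate) - n + 1)]
    ref_ngrams = set()
    for ref in references:
        ref_ngrams.update(tuple(ref[j:j + n]) for j in range(len(ref) - n + 1))
    correct = sum(1 for ng in cand_ngrams if ng in ref_ngrams)
    return correct / len(cand_ngrams)

def brevity_penalty(l_t, l_r):
    """No penalty when the translation is at least as long as the reference."""
    if l_t >= l_r:
        return 1.0
    return math.exp(1 - l_r / l_t)

cand = "the cat sat on a mat .".split()
refs = ["the cat sat on the mat .".split()]  # hypothetical reference
print(ngram_precision(cand, refs, 1))  # 6/7, only "a" is unmatched
print(brevity_penalty(5, 10))          # exp(1 - 10/5) = exp(-1)
```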
## Example
Translation pair data:
![[BLEU-example.png]]
Candidate translation: the cat sat on a mat.
N-gram Precisions:
- $p(1)=\frac{6}{7}$ : the; cat; sat; on; a; mat; .
- $p(2)=\frac{4}{6}$ : the cat; cat sat; sat on; on a; a mat; mat .
- $p(3)=\frac{3}{5}$ : the cat sat; cat sat on; sat on a; on a mat; a mat .
- $p(4)=\frac{1}{4}$ : the cat sat on; cat sat on a; sat on a mat; on a mat .
So BLEU score is calculated as:
$
\begin{aligned}
\operatorname{BLEU}\left(t, R_{f}\right) &=\operatorname{BP}(7,7) \cdot \prod_{i=1}^{4} p(i)^{\frac{1}{4}}=\operatorname{BP}(7,7) \cdot \left(\frac{6}{7}\right)^{\frac{1}{4}} \cdot \left(\frac{4}{6}\right)^{\frac{1}{4}} \cdot \left(\frac{3}{5}\right)^{\frac{1}{4}} \cdot \left(\frac{1}{4}\right)^{\frac{1}{4}} \\
&=1 \cdot 0.5411=0.5411
\end{aligned}
$
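The arithmetic of this worked example can be reproduced directly, plugging in the four precisions from above:

```python
import math

# n-gram precisions p(1)..p(4) from the example above
precisions = [6/7, 4/6, 3/5, 1/4]
bp = 1.0  # l_t = l_r = 7, so no brevity penalty
bleu = bp * math.prod(p ** (1/4) for p in precisions)
print(round(bleu, 4))  # 0.5411
```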
## Adjustments
- N-gram precision: $p(i)=\frac{\text { correct }_{i}}{\text { total }_{i}}$
- Candidate translation $t$ : the the the the the the the
- $p(1)=\frac{\text { correct }_{1}}{\text { total }_{1}}=\frac{7}{7}=1$, even though the translation is useless
- Use clipped-counts: Let $\text{ref\_count}\left(\bar{w}_{i}\right)$ be the maximum number of times $\bar{w}_{i}$ occurs in any individual reference and $\text{trans\_count}\left(\bar{w}_{i}\right)$ the number of times $\bar{w}_{i}$ occurs in the translation.
- For each n-gram $\bar{w}_{i}$ of order $i$ in the translation, $\text{correct\_clipped}_{i} \mathrel{+}= \min \left(\text{trans\_count}\left(\bar{w}_{i}\right), \text{ref\_count}\left(\bar{w}_{i}\right)\right)$
- $p(i)=\frac{\text { correct\_clipped }_{i}}{\text { total}_i}$
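Clipped precision is a small change to the counting logic. The sketch below uses `collections.Counter`; the two reference sentences are hypothetical examples chosen so that "the" occurs at most twice in any single reference:

```python
from collections import Counter

def clipped_precision(candidate, references, n):
    """Modified n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in any single reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[j:j + n]) for j in range(len(tokens) - n + 1))
    cand_counts = ngrams(candidate)
    correct = 0
    for ng, trans_count in cand_counts.items():
        ref_count = max(ngrams(ref)[ng] for ref in references)
        correct += min(trans_count, ref_count)
    return correct / sum(cand_counts.values())

cand = "the the the the the the the".split()
refs = ["the cat is on the mat".split(),
        "there is a cat on the mat".split()]
print(clipped_precision(cand, refs, 1))  # 2/7: "the" is clipped to 2
```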
The brevity penalty compares the length of the translation candidate with the length of the reference translation.
If we have multiple reference translations, there are multiple ways to define $l_{r}$:
- Shortest reference
- Average reference length
- Closest reference: The length of the reference which is closest in length to the translation
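The closest-reference option can be written as a one-line selection. The tie-breaking rule here (prefer the shorter reference when two are equally close) is one common convention, not the only possible one:

```python
def closest_ref_length(l_t, ref_lengths):
    """Length of the reference closest to the translation length l_t;
    ties are broken toward the shorter reference (one common convention)."""
    return min(ref_lengths, key=lambda l: (abs(l - l_t), l))

print(closest_ref_length(9, [7, 10, 14]))  # 10, since |10 - 9| = 1 is minimal
```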
## Granularity of BLEU
BLEU scores can be computed on the
- sentence-level
- document-level
- corpus-level
Sentence-level BLEU is rather unstable: translation candidates with minor differences can receive very different sentence-level BLEU scores
- It is better to apply BLEU at the document or corpus level
- BLEU formulation remains unchanged, but all counts and lengths are computed over the entire document or corpus
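Corpus-level aggregation can be sketched as follows: clipped counts, totals, and lengths are summed over all sentence pairs before the precisions and brevity penalty are computed. This sketch uses the shortest reference for $l_{r}$, which is just one of the conventions listed above:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[j:j + n]) for j in range(len(tokens) - n + 1))

def corpus_bleu(candidates, references, max_n=4):
    """Corpus-level BLEU: counts and lengths are accumulated over all
    sentence pairs before combining. references[k] is the list of
    reference translations for candidates[k]."""
    correct = [0] * max_n
    total = [0] * max_n
    l_t = l_r = 0
    for cand, refs in zip(candidates, references):
        l_t += len(cand)
        l_r += min(len(r) for r in refs)  # shortest-reference convention (assumption)
        for n in range(1, max_n + 1):
            cand_counts = ngrams(cand, n)
            for ng, c in cand_counts.items():
                correct[n - 1] += min(c, max(ngrams(r, n)[ng] for r in refs))
            total[n - 1] += sum(cand_counts.values())
    bp = 1.0 if l_t >= l_r else math.exp(1 - l_r / l_t)
    return bp * math.prod((c / t) ** (1 / max_n) for c, t in zip(correct, total))
```

Note that if any n-gram order has zero clipped matches, the geometric mean (and thus the score) is zero; production implementations often add smoothing for this case.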
## Advantages and disadvantages of BLEU
- BLEU generally correlates relatively well with human judgments
- By far the most commonly used MT evaluation metric
- Absolute BLEU scores in isolation are not very meaningful
- Comparison of BLEU scores across language pairs not meaningful
- BLEU is less useful for translating into morphologically rich languages
- Good translations that are dissimilar to all reference translations are penalized
- BLEU is not differentiable, so it cannot be used directly as a training objective
---
## References
- Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. Proceedings of ACL 2002.