# BLEU

- Bilingual Evaluation Understudy (BLEU), introduced by Papineni et al. (2002)
- BLEU is a precision-oriented metric
    - How many of the n-grams occurring in the translation occur in any of the reference translations?
    - Considers n-grams of several lengths, typically 1-grams to 4-grams
- N-gram precision: $p(i)=\frac{\text{correct}_i}{\text{total}_i}$
    - For each n-gram occurrence $\bar{w}_{i}$ of order $i$ in the translation:
        - $\text{correct}_i{+}{+}$ and $\text{total}_i{+}{+}$ if $\bar{w}_{i}$ occurs in any of the reference translations
        - $\text{total}_i{+}{+}$ otherwise
- Brevity penalty (BP): compensates for the tendency of shorter translations to have higher precision

$$
\operatorname{BP}(l_t, l_r)=\begin{cases}
1 & \text{if } l_t \geq l_r \\
\exp\!\left(1-\frac{l_r}{l_t}\right) & \text{if } l_t < l_r
\end{cases}
\qquad (l_t = \text{translation length},\; l_r = \text{reference length})
$$

## Example

Translation pair data:

![[BLEU-example.png]]

Candidate translation: *the cat sat on a mat .*

N-gram precisions:

- $p(1)=\frac{6}{7}$: the; cat; sat; on; a; mat; .
- $p(2)=\frac{4}{6}$: the cat; cat sat; sat on; on a; a mat; mat .
- $p(3)=\frac{3}{5}$: the cat sat; cat sat on; sat on a; on a mat; a mat .
- $p(4)=\frac{1}{4}$: the cat sat on; cat sat on a; sat on a mat; on a mat .

So the BLEU score is calculated as:

$$
\begin{aligned}
\operatorname{BLEU}(t, R_f) &= \operatorname{BP}(7,7) \cdot \prod_{i=1}^{4} p(i)^{\frac{1}{4}}
= \operatorname{BP}(7,7) \cdot \left(\tfrac{6}{7}\right)^{\frac{1}{4}} \cdot \left(\tfrac{4}{6}\right)^{\frac{1}{4}} \cdot \left(\tfrac{3}{5}\right)^{\frac{1}{4}} \cdot \left(\tfrac{1}{4}\right)^{\frac{1}{4}} \\
&= 1 \cdot 0.5411 = 0.5411
\end{aligned}
$$

## Adjustments

- The n-gram precision $p(i)=\frac{\text{correct}_i}{\text{total}_i}$ can be gamed by repeating high-frequency words:
    - Candidate translation $t$: the the the the the the the
    - $p(1)=\frac{\text{correct}_1}{\text{total}_1}=\frac{7}{7}=1$
- Use clipped counts: let $\text{ref\_count}(\bar{w}_i)$ be the maximum number of times $\bar{w}_i$ occurs in any individual reference and $\text{trans\_count}(\bar{w}_i)$ the number of times $\bar{w}_i$ occurs in the translation.
    - For each distinct n-gram $\bar{w}_i$ of order $i$ in the translation: $\text{correct\_clipped}_i \mathrel{+}= \min\!\left(\text{trans\_count}(\bar{w}_i),\, \text{ref\_count}(\bar{w}_i)\right)$
    - $p(i)=\frac{\text{correct\_clipped}_i}{\text{total}_i}$

The brevity penalty compares the length of the translation candidate with the length of the reference translation. If we have multiple reference translations, there are multiple ways to define $l_r$:

- Shortest reference
- Average reference length
- Closest reference: the length of the reference that is closest in length to the translation

## Granularity of BLEU

BLEU scores can be computed at the

- sentence level
- document level
- corpus level

Sentence-level BLEU is rather unstable: translation candidates with minor differences can receive very different sentence-level BLEU scores.

- Better to apply BLEU at the document or corpus level
- The BLEU formulation remains unchanged, but all counts and lengths are computed over the entire document or corpus

## Advantages and disadvantages of BLEU
- BLEU generally correlates relatively well with human judgments
- It is by far the most commonly used MT evaluation metric
- Absolute BLEU scores in isolation are not very meaningful
- Comparing BLEU scores across language pairs is not meaningful
- BLEU is less useful when translating into morphologically rich languages
- Good translations that are dissimilar to all reference translations are penalized
- BLEU is not differentiable!
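The full computation described above (clipped n-gram precisions, uniform geometric mean of $p(1)\ldots p(4)$, and the brevity penalty with the closest-reference length) can be sketched in Python. The reference sentence below is hypothetical, since the slide's actual references are only visible in the image, so the resulting score differs from the worked example:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of the given order, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """BLEU for one candidate: clipped n-gram precisions p(1)..p(max_n),
    uniform geometric mean, brevity penalty using the closest-reference length."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        # ref_count(w): max occurrences of w in any individual reference
        ref_counts = Counter()
        for ref in refs:
            for gram, c in Counter(ngrams(ref, n)).items():
                ref_counts[gram] = max(ref_counts[gram], c)
        correct = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if correct == 0:
            return 0.0  # one zero precision zeroes the geometric mean
        log_prec += math.log(correct / total) / max_n
    l_t = len(cand)
    l_r = min((len(r) for r in refs), key=lambda l: abs(l - l_t))  # closest reference
    bp = 1.0 if l_t >= l_r else math.exp(1 - l_r / l_t)
    return bp * math.exp(log_prec)

# Hypothetical single reference (the slide's real references are in the image):
print(bleu("the cat sat on a mat .", ["the cat sat on the mat ."]))

# Clipping at work: "the" occurs at most twice in the reference, so the
# clipped unigram precision is 2/7 rather than 7/7.
print(bleu("the the the the the the the", ["the cat sat on the mat ."], max_n=1))
```

With a single reference, `ref_counts` reduces to that reference's own n-gram counts; the max over references only matters in the multi-reference setting described under "Adjustments".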