# Machine Translation
Machine translation (MT) has been an active area of AI research since the field's beginnings, and it nicely illustrates the successive paradigm shifts in AI:
- 1950s-1990s: rule-based, symbolic approaches
- 1990s-2016: statistical, data-driven approaches
- 2014-now: neural, deep learning, data-driven approaches
Given a sentence $f$ in a source language, we want to find the target-language sentence $\hat{e}$ such that
$$
\hat{e} = \underset{e}{\operatorname{argmax}}\; p(e \mid f)
$$
- Normally re-formulated by applying Bayes' theorem:
$$
\hat{e} = \underset{e}{\operatorname{argmax}}\; \underbrace{p(f \mid e)}_{\text{TransModel}} \cdot \underbrace{p(e)}_{\text{LangModel}}
$$
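The Bayes step works because $p(f)$ does not depend on $e$ and can therefore be dropped from the argmax:
$$
\underset{e}{\operatorname{argmax}}\; p(e \mid f)
= \underset{e}{\operatorname{argmax}}\; \frac{p(f \mid e)\, p(e)}{p(f)}
= \underset{e}{\operatorname{argmax}}\; p(f \mid e)\, p(e)
$$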
![[NMT General Architecture.png]]
## Human Evaluation of MT
Typically measured along two dimensions:
- Adequacy: How much of the meaning of the original sentence is preserved?
- Fluency: How fluent (and grammatical) is the translation?
- Both are typically measured on Likert scales
![[MT Human Evaluation.png]]
For automatic evaluation, [[BLEU]] is commonly used.
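As a small illustration (not part of the notes' original tooling), corpus-level BLEU can be computed with the `sacrebleu` package; the hypothesis and reference sentences below are toy examples:

```python
# Corpus-level BLEU with sacrebleu (pip install sacrebleu); toy data only.
import sacrebleu

# Hypothetical system outputs, one per source segment.
hypotheses = ["the cat sat on the mat", "he read the book"]
# One reference stream: the i-th entry is the reference for the i-th hypothesis.
references = [["the cat sat on the mat", "he was reading the book"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")  # score on a 0-100 scale
```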
## Neural Machine Translation vs Statistical Machine Translation
Statistical machine translation (SMT)
- collects co-occurrence statistics of phrase translations
- collects statistics of target n-gram (language modeling)
- collects reordering statistics between pairs of phrase translations
- explores a vast search space over the order in which matching phrase translations should be applied to generate a target sentence (decoding)
- there is no global representation of the foreign sentence!
Neural machine translation (NMT)
- builds a continuous representation of the foreign sentence (encoder)
- generates a target sentence from that representation (decoder), as sketched below
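A minimal sketch of this encoder-decoder idea, assuming PyTorch; the toy model, vocabulary sizes, and random inputs are made up for illustration, and real NMT systems use attention or Transformer layers:

```python
# Minimal encoder-decoder sketch in PyTorch (illustrative only).
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encoder: compress the source sentence into a continuous representation.
        _, h = self.encoder(self.src_emb(src_ids))
        # Decoder: generate target tokens conditioned on that representation.
        dec_states, _ = self.decoder(self.tgt_emb(tgt_ids), h)
        return self.out(dec_states)  # logits over the target vocabulary

# Toy usage with random token ids (batch of 2 sentences).
model = TinySeq2Seq(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (2, 5))
tgt = torch.randint(0, 1200, (2, 6))
logits = model(src, tgt)
print(logits.shape)  # torch.Size([2, 6, 1200])
```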
## Low resource NMT
The quality of an NMT system depends to a large degree on
- the amount of data
- the amount of variation within the data
- the relevance of the data for the actual task
In low-resource NMT, one or several of these conditions are not met. Typical low-resource problems include
- Domain adaptation in NMT
- NMT for low-resource language pairs
The problem of universal MT can be approached from two angles:
Data angle
- increase amount of parallel data (web crawling, crowd-sourcing)
- better utilization of existing parallel data
- automatically generate synthetic parallel data
Model angle
- better optimization of existing low-resource systems
- transfer knowledge between high- and low-resource languages
- joint training of all language pairs (multilingual NMT)
### Back-Translation
- Given a parallel corpus $(S, T)$ and additional monolingual data in the target language $\left(T^{\prime}\right)$
- Two settings:
- dummy source: pair each target sentence with a dummy (empty) source sentence $\left(D, T^{\prime}\right)$
- back-translate: train a system on $T \rightarrow S$ and back-translate each target sentence in $T^{\prime}$, creating a synthetic parallel corpus $\left(S^{*}, T^{\prime}\right)$ (see the sketch after this list)
- Both forward- and back-translation yield improvements
- The model benefits more from fluent target sentences, i.e., from back-translation
- Does the quality of $S^{*}$, i.e., the quality of the $T \rightarrow S$ NMT system, matter?
- NMT translation only slightly worse than human translation
- But a rather poor NMT system leads to degradations
- When selecting which monolingual sentences to back-translate, taking the prediction loss into account consistently outperforms random selection and favoring low-frequency words
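A minimal sketch of the back-translation pipeline described above; `train_nmt` and `translate` are hypothetical callables standing in for a toolkit's training and decoding routines, not a specific API:

```python
# Back-translation pipeline sketch. train_nmt and translate are hypothetical
# placeholders for any NMT training / decoding routine, passed in as callables.

def build_with_back_translation(parallel_S, parallel_T, mono_T, train_nmt, translate):
    """parallel_S, parallel_T: aligned source/target sentence lists.
    mono_T: additional monolingual target-language sentences (T')."""
    # 1. Train a reverse (T -> S) system on the existing parallel corpus.
    reverse_model = train_nmt(src_sents=parallel_T, tgt_sents=parallel_S)

    # 2. Back-translate every monolingual target sentence, giving synthetic
    #    source sentences S* paired with the genuine target sentences T'.
    synthetic_S = [translate(reverse_model, t) for t in mono_T]

    # 3. Train the final forward (S -> T) system on genuine plus synthetic data.
    return train_nmt(src_sents=parallel_S + synthetic_S,
                     tgt_sents=parallel_T + mono_T)
```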
### Transfer Learning
- essentially the same idea as fine-tuning
- train an NMT model on a language pair with large amounts of data (parent)
- continue training this model on the low-resource language pair (child), as sketched below
Transfer learning for NMT works best if
- target language of parent and child are identical
- source language of parent and child are related
- decoder parameters are frozen
- child and parent languages are made to share the same vocabulary
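A rough sketch of this parent-child procedure, reusing the toy `TinySeq2Seq` model from the earlier sketch and assuming a shared parent/child vocabulary; the actual training loops are omitted:

```python
# Transfer-learning sketch: initialise the child from the parent's weights,
# optionally freeze the decoder, then continue training on the child data.
import torch

# 1. Train the parent model on the high-resource pair, then save its weights.
parent = TinySeq2Seq(src_vocab=8000, tgt_vocab=8000)  # shared vocabulary
# ... train parent on the high-resource corpus ...
torch.save(parent.state_dict(), "parent.pt")

# 2. Initialise the child model with the parent's parameters.
child = TinySeq2Seq(src_vocab=8000, tgt_vocab=8000)
child.load_state_dict(torch.load("parent.pt"))

# 3. Optionally freeze the decoder side, so only the encoder adapts to the
#    low-resource source language.
for module in (child.decoder, child.tgt_emb, child.out):
    for p in module.parameters():
        p.requires_grad = False

# 4. Continue training on the low-resource (child) parallel data, optimising
#    only the remaining trainable parameters.
optimizer = torch.optim.Adam(
    (p for p in child.parameters() if p.requires_grad), lr=1e-4)
```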
### Word Segments
- One of the main bottlenecks of training an NMT system is the vocabulary
- The extent to which a word is correctly represented depends on
- its frequency
- the number of different contexts in which it occurs
- Many word occurrences are the result of word formations.
- inflections
- compounding
- Instead of whole words, use subwords, where a subword is
- a character n-gram
- a morphologically meaningful unit (language dependent)
Two common approaches to splitting words into subword segments:
- Sennrich et al. (2016): Byte Pair Encoding (BPE)
- Schuster et al. (2012): Wordpieces
- Both approaches split words into segments based on frequencies (no linguistic knowledge)
In BPE:
1. Split words into characters (keep word boundaries) and collect frequencies
2. Count the frequencies of all pairs of neighboring symbols
3. Merge the most frequent pair into a new symbol and repeat from (2)
Stop when a maximum number of merge operations has been reached (a minimal implementation sketch follows below)
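A minimal sketch of learning BPE merges, closely following the pseudo-code in Sennrich et al. (2016); the toy vocabulary and the small number of merges are chosen for illustration:

```python
# Minimal BPE merge learning, following Sennrich et al. (2016).
import collections
import re

def get_stats(vocab):
    """Count frequencies of all neighboring symbol pairs in the vocabulary."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge the given symbol pair into a single symbol everywhere it occurs."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy vocabulary: words split into characters, '</w>' marks the word boundary.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

num_merges = 10  # in practice: tens of thousands of merge operations
for _ in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent neighboring pair
    vocab = merge_vocab(best, vocab)
    print(best)
```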
BPE has three major advantages
- significantly reduces the vocabulary size (less memory, better speed)
- results in better translation quality
- significantly reduces the number of out-of-vocabulary (OOV) items
- These benefits apply to both high- and low-resource language pairs
Nowadays, BPE is typically used with 30,000 merge operations
BPE with a radically reduced vocabulary size (e.g., 2,000) in combination with careful hyper-parameter tuning can lead to large improvements for low-resource translation
---
## References