# Machine Translation

MT has been an active research area of AI since its beginnings. Machine translation (MT) is a nice example of the different paradigm shifts in AI:

- 1950s-1990s: rule-based, symbolic approaches
- 1990s-2016: statistical, data-driven approaches
- 2014-now: neural, deep-learning, data-driven approaches

Given a sentence $f$ in a source language, we want to find the target-language sentence $e$ such that

$$
\underset{e}{\operatorname{argmax}}\; p(e \mid f)
$$

- Normally re-formulated by applying Bayes' theorem (the denominator $p(f)$ is constant and can be dropped under the argmax):

$$
\underset{e}{\operatorname{argmax}}\; \underbrace{p(f \mid e)}_{\text{translation model}} \cdot \underbrace{p(e)}_{\text{language model}}
$$

![[NMT General Architecture.png]]

## Human Evaluation of MT

Typically measured along two dimensions:

- Adequacy: How much of the meaning of the original sentence is preserved?
- Fluency: How fluent (and grammatical) is the translation?
- Both dimensions are rated on Likert scales

![[MT Human Evaluation.png]]

For automatic evaluation, [[BLEU]] is used.

## Neural Machine Translation vs Statistical Machine Translation

Statistical machine translation (SMT)

- collects co-occurrence statistics of phrase translations
- collects statistics of target n-grams (language modeling)
- collects reordering statistics between pairs of phrase translations
- explores a vast search space: the order in which matching phrase translations should be applied to generate a target sentence (decoding)
- there is no global representation of the foreign sentence!

Neural machine translation (NMT)

- builds a continuous representation of the foreign sentence (encoder)
- given that representation, generates a target sentence (decoder)

## Low-resource NMT

The quality of an NMT system depends to a large degree on

- the amount of data
- the amount of variation within the data
- the relevance of the data for the actual task

In low-resource NMT, one or several of these conditions are not met. Typical low-resource problems include

- domain adaptation in NMT
- NMT for low-resource language pairs

The problem of universal MT can be approached from two angles:

Data angle

- increase the amount of parallel data (web crawling, crowd-sourcing)
- better utilization of existing parallel data
- automatically generate synthetic parallel data

Model angle

- better optimization of existing low-resource systems
- transfer knowledge between high- and low-resource languages
- joint training of all language pairs (multilingual NMT)

### Back-Translation

- Given a parallel corpus $(S, T)$ and additional monolingual data in the target language $(T')$
- Two settings:
    - dummy source: pair each target sentence with a dummy (empty) source sentence $(D, T')$
    - back-translate: train a system on $T \rightarrow S$ and back-translate each target sentence in $T'$, creating a synthetic parallel corpus $(S^{*}, T')$ (see the sketch after this list)
- Both forward- and back-translation yield improvements
- The model benefits more from fluent target sentences, i.e., back-translation
- Does the quality of $S^{*}$, i.e., of the $T \rightarrow S$ NMT system, matter?
    - NMT translations are only slightly worse than human translations
    - But a rather poor NMT system leads to degradations
    - Taking prediction loss into account consistently outperforms random selection and favoring low-frequency words
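
To make the back-translate setting concrete, here is a minimal sketch of the data-augmentation step. It assumes a trained $T \rightarrow S$ system is available behind a simple `translate_t2s` callable; that name and the toy stand-in model below are illustrative, not a real library API.

```python
from typing import Callable, Iterable, List, Tuple

def build_backtranslation_corpus(
    monolingual_target: Iterable[str],
    translate_t2s: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Create a synthetic parallel corpus (S*, T') from monolingual target data.

    `translate_t2s` stands in for a trained T->S NMT system: any function
    that maps a target-language sentence to a source-language string.
    """
    synthetic_pairs = []
    for target_sentence in monolingual_target:
        synthetic_source = translate_t2s(target_sentence)  # S* (possibly noisy)
        # Synthetic sentence on the source side, genuine sentence on the target side,
        # so the decoder is still trained on fluent, human-written target text.
        synthetic_pairs.append((synthetic_source, target_sentence))
    return synthetic_pairs

if __name__ == "__main__":
    def fake_t2s(sentence: str) -> str:
        # Toy stand-in for the backward T->S model; in practice this is a trained NMT system.
        return "<synthetic source for: " + sentence + ">"

    mono_t = ["Das ist ein Test.", "Maschinelle Übersetzung ist spannend."]
    for src, tgt in build_backtranslation_corpus(mono_t, fake_t2s):
        print(src, "|||", tgt)
```

The synthetic pairs $(S^{*}, T')$ are then mixed with the genuine parallel corpus $(S, T)$ when training the forward $S \rightarrow T$ model.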
### Transfer Learning

- basically the same as fine-tuning
- train an NMT model on a language pair with large amounts of data (the parent)
- continue training this model on the low-resource language pair (the child)

Transfer learning for NMT works best if

- the target language of parent and child is identical
- the source languages of parent and child are related
- the decoder parameters are frozen
- child and parent are made to share the same vocabulary

### Word segments

- One of the main bottlenecks in training an NMT system is the vocabulary

The extent to which a word is correctly represented depends on

- its frequency
- the number of different contexts in which it occurs

- Many word occurrences are the result of word formation:
    - inflections
    - compounding
- Instead of whole words, use subwords, where a subword is
    - a character n-gram
    - a morphologically meaningful unit (language-dependent)

Two common approaches split words into subsegments:

- Sennrich et al. (2016): Byte Pair Encoding (BPE)
- Schuster et al. (2012): WordPiece
- Both approaches split words into segments based purely on frequencies (no linguistic knowledge)

In BPE (a toy sketch of the merge loop is given at the end of this section):

1. Split words into characters (keep word boundaries) and collect frequencies
2. Consider all neighboring symbol pairs and collect their frequencies
3. Merge the most frequent pair and repeat from (2)

Stop when a maximum number of merge operations has been reached.

BPE has three major advantages:

- it significantly reduces the vocabulary size (less memory, better speed)
- it results in better translation quality
- it significantly reduces the number of out-of-vocabulary (OOV) items

These benefits apply to both high- and low-resource language pairs.

Nowadays, BPE is typically used with around 30,000 merge operations.

BPE with a radically reduced vocabulary size (e.g., 2,000), in combination with careful hyper-parameter tuning, can lead to large improvements for low-resource translation.
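
The merge loop described above is small enough to sketch directly. The following is a toy Python implementation of BPE learning on a word-frequency dictionary, close in spirit to the frequency-based procedure of Sennrich et al. (2016); the function names, the `</w>` end-of-word marker, and the toy corpus are illustrative choices, not a library API.

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count neighboring symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of `pair` into a single new symbol in all words."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merge operations from a word-frequency dictionary."""
    # 1. Split words into characters; '</w>' marks the word boundary.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_stats(vocab)      # 2. collect frequencies of neighboring pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # 3. merge the most frequent pair and repeat
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

if __name__ == "__main__":
    corpus_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    merges, vocab = learn_bpe(corpus_counts, num_merges=10)
    print(merges)  # learned merge operations, most frequent first
    print(vocab)   # words represented as sequences of subword symbols
```

In a real system the learned merge operations are stored and later applied, in the same order, to segment the training and test data; the number of merges (and hence the subword vocabulary size) is the main hyper-parameter, cf. the ~30,000 merges mentioned above versus a radically smaller vocabulary in the low-resource setting.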