# SentencePiece - Unigram LM Encoding

The unigram LM method, in contrast to the bottom-up construction process of [[Byte Pair Encoding]], begins with a superset of the final vocabulary and prunes it down to the desired size.

![[Unigram LM Encoding.png]]

Unigram LM tokenization takes the vocabulary $V$ and unigram LM parameters $\theta$ and performs Viterbi inference to decode the segmentation with maximum likelihood under $\theta$.

---

## References

1. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. https://arxiv.org/abs/1804.10959
2. Byte Pair Encoding is Suboptimal for Language Model Pretraining. https://arxiv.org/abs/2004.03720
3. SentencePiece Tokenizer Demystified. https://towardsdatascience.com/sentencepiece-tokenizer-demystified-d0a3aac19b15
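The Viterbi decoding step described above can be sketched as follows. This is a minimal illustration, assuming a toy vocabulary with hand-picked log-probabilities; the real SentencePiece implementation decodes over a lattice built from a trained unigram LM on the normalized input, and `viterbi_segment` is a hypothetical helper, not part of the library's API.

```python
import math

# Hypothetical unigram LM parameters theta: log-probabilities for a
# toy subword vocabulary V. These values are illustrative only.
LOG_P = {
    "un": math.log(0.10),
    "i": math.log(0.05),
    "gram": math.log(0.08),
    "unigram": math.log(0.02),
    "u": math.log(0.01),
    "n": math.log(0.01),
}

def viterbi_segment(text, log_p):
    """Return the segmentation of `text` maximizing the sum of
    subword log-probabilities under a unigram LM (Viterbi DP)."""
    n = len(text)
    max_len = max(map(len, log_p))       # longest subword in V
    # best[i] = best log-likelihood of segmenting text[:i]
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)                 # start index of the last subword
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            if piece in log_p and best[j] + log_p[piece] > best[i]:
                best[i] = best[j] + log_p[piece]
                back[i] = j
    # Follow back-pointers to reconstruct the best segmentation.
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1], best[n]

pieces, ll = viterbi_segment("unigram", LOG_P)
# Here the whole-word piece wins: log 0.02 beats
# log 0.10 + log 0.05 + log 0.08 for "un" + "i" + "gram".
```

The dynamic program runs in O(n · max_len) time because, at each end position `i`, only subwords no longer than the longest vocabulary entry can terminate there.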