# SentencePiece - Unigram LM Encoding
The unigram LM method, in contrast to the bottom-up merge process of [[Byte Pair Encoding]], begins with a large seed vocabulary that is a superset of the final vocabulary and iteratively prunes it to the desired size, at each step discarding the subwords whose removal least reduces the likelihood of the training corpus.
![[Unigram LM Encoding.png]]
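For orientation, here is a minimal sketch of training and applying a unigram model with the `sentencepiece` Python package; the corpus path, model prefix, and vocabulary size are placeholder values:

```python
import sentencepiece as spm

# Train a unigram LM tokenizer; the seed vocabulary is built internally
# and pruned down to `vocab_size` pieces.
spm.SentencePieceTrainer.train(
    input="corpus.txt",      # placeholder: plain text, one sentence per line
    model_prefix="unigram",  # writes unigram.model / unigram.vocab
    vocab_size=8000,         # placeholder target vocabulary size
    model_type="unigram",    # the library default; "bpe" selects the merge-based alternative
)

sp = spm.SentencePieceProcessor(model_file="unigram.model")
print(sp.encode("Unigram segmentation example", out_type=str))
```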
Unigram LM tokenization takes the vocabulary $V$ and the unigram LM parameters $\theta$ and performs Viterbi inference to decode the maximum-likelihood segmentation of the input under $\theta$.
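Formally, for an input sentence $X$ with candidate segmentations $\mathcal{S}(X)$, the decoded output is

$$\mathbf{x}^{*} = \arg\max_{\mathbf{x} \in \mathcal{S}(X)} P(\mathbf{x}; \theta), \qquad P(\mathbf{x}; \theta) = \prod_{i=1}^{M} p(x_i), \quad \mathbf{x} = (x_1, \dots, x_M),$$

with each subword $x_i \in V$. Because a segmentation's score is a product of per-token probabilities, the best segmentation ending at each character position depends only on the best prefix scores, which is what makes the Viterbi dynamic program work. Below is a minimal plain-Python sketch of that dynamic program; the toy vocabulary and its log probabilities are made up for illustration, and the actual SentencePiece implementation performs this search over a lattice in C++:

```python
import math

def viterbi_segment(text, log_probs):
    """Maximum-likelihood segmentation of `text` under a unigram LM.

    `log_probs` maps each subword in the vocabulary to its log
    probability. Assumes every single character of `text` is in the
    vocabulary, so a full segmentation always exists.
    """
    max_len = max(len(piece) for piece in log_probs)
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log prob of text[:i]
    best[0] = 0.0
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            score = best[start] + log_probs.get(piece, -math.inf)
            if score > best[end]:
                best[end] = score
                back[end] = start
    pieces, i = [], n
    while i > 0:                  # walk the back-pointers to recover the path
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1], best[n]

# Toy vocabulary; log probabilities are illustrative, not learned.
vocab = {"h": -5.0, "u": -5.0, "g": -5.0, "s": -4.0, "hug": -3.0, "hugs": -3.5}
print(viterbi_segment("hugs", vocab))  # (['hugs'], -3.5)
```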