# Tokenization
Tokenization is the process of converting raw text into a sequence of discrete tokens that a model can consume.
## Subword Tokenization
[[Transformers]] models like [[BERT]] incorporate an important design decision that lets them gracefully handle the open vocabulary problem: subword tokenization.
Subword tokenization produces tokens at multiple levels of granularity, from individual characters to full words. As a result, rare words are broken down into collections of subword units, bottoming out in individual characters in the worst case.
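One way to see this spectrum of granularity is to probe a pretrained tokenizer. A minimal sketch, assuming the Hugging Face `transformers` package is installed (BERT uses WordPiece, a close relative of BPE; the exact splits depend on the learned vocabulary):

```python
# Inspect how a pretrained subword tokenizer splits words of varying frequency.
# Assumes: pip install transformers (downloads the bert-base-uncased vocab on first use).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

for word in ["the", "running", "unhappiness", "zyzzyva"]:
    print(f"{word:12s} -> {tok.tokenize(word)}")

# Illustrative output shape (exact pieces depend on the learned vocabulary):
#   the          -> a single token
#   unhappiness  -> a handful of subword pieces, e.g. starting with 'un'
#   zyzzyva      -> many short, near-character pieces
```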
The two most widely used subword tokenization methods are:
- [[Byte Pair Encoding]]
- [[SentencePiece - Unigram LM Encoding]]
Note: Both BPE and Unigram LM tokenization follow the same two-phase structure: a vocabulary construction step over a training corpus, followed by an inference step that segments new text using the learned vocabulary.
![[BPE vs Unigram LM Encoding.png]]
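A minimal, self-contained sketch of those two phases for BPE on a toy corpus (it skips end-of-word markers and other practical details; `learn_bpe` and `apply_bpe` are illustrative names, not a library API):

```python
# Toy BPE: (1) vocabulary construction learns a ranked list of merges from
# corpus statistics, (2) inference replays those merges on new words.
from collections import Counter

def learn_bpe(words, num_merges):
    """Vocabulary construction: repeatedly merge the most frequent symbol pair."""
    # Each word starts as a tuple of characters, weighted by its corpus count.
    corpus = Counter({tuple(w): c for w, c in Counter(words).items()})
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += count
        corpus = new_corpus
    return merges

def apply_bpe(word, merges):
    """Inference: greedily replay the learned merges, in the order they were learned."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

words = "low lower lowest new newer newest widow window".split()
merges = learn_bpe(words, num_merges=10)
print(merges)                       # e.g. [('o', 'w'), ('l', 'ow'), ('n', 'e'), ...]
print(apply_bpe("lowest", merges))  # e.g. ['lowe', 'st']: pieces learned from the corpus
print(apply_bpe("slower", merges))  # e.g. ['s', 'lower']: unseen word, still covered
```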
## Benefits of subword tokenization
- Graceful OOV Handling
- Subword tokenization eliminates the out-of-vocabulary problem entirely. Under word-level tokenization, any unseen word maps to a single UNK token, losing all information. With BPE, unseen words are decomposed into known subword pieces—"unhappiness" becomes "un" + "happiness", letting the model leverage its existing knowledge of these components. In the worst case, a completely novel word falls back to individual characters or bytes, which are always in the vocabulary. This compositional structure means the model can generalize to novel words by combining the semantics of familiar pieces (see the sketch after this list).
- Tail Compression
- Word frequency follows a power law distribution with a long tail of millions of rare words, each appearing only a handful of times in training data. **Learning good representations for these is essentially impossible.** BPE compresses this tail by dissolving rare words into subword pieces that are shared across many words—"zyzzyva" may be rare, but its pieces "zy" and "va" appear in other contexts. This redistributes the frequency mass from millions of rare words into a bounded vocabulary where even the least common tokens have been seen thousands of times. The result is a tractable tail: every token in the vocabulary has enough training signal to learn a meaningful representation.
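A toy sketch of both benefits above, using a made-up subword vocabulary and greedy longest-match segmentation with character fallback as a simplified stand-in for real BPE/WordPiece inference (all words, names, and counts below are illustrative):

```python
# Part 1 shows graceful OOV handling; part 2 shows frequency mass moving from
# singleton word types onto subword pieces shared across words.
from collections import Counter

SUBWORDS = {"un", "happi", "ness", "kind", "dark", "sad", "ly",
            "the", "cat", "run", "s", "ing", "zy", "va", "z"}
WORD_VOCAB = {"the", "cat", "runs", "running"}           # word-level baseline

def word_level(word):
    return [word] if word in WORD_VOCAB else ["<UNK>"]   # unseen word: all information lost

def subword_level(word):
    """Greedy longest-match against SUBWORDS, falling back to single characters."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):                # try the longest remaining span first
            if word[i:j] in SUBWORDS:
                pieces.append(word[i:j])
                i = j
                break
        else:                                            # no piece matched: emit one character
            pieces.append(word[i])
            i += 1
    return pieces

# 1) Graceful OOV handling: unseen words decompose instead of collapsing to <UNK>.
for w in ["running", "unhappiness", "zyzzyva"]:
    print(w, word_level(w), subword_level(w))
# running      ['running']   ['run', 'n', 'ing']
# unhappiness  ['<UNK>']     ['un', 'happi', 'ness']
# zyzzyva      ['<UNK>']     ['zy', 'z', 'zy', 'va']     # bottoms out near characters

# 2) Tail compression: a word seen once still yields pieces shared with other words.
corpus = ["the", "cat", "runs"] * 50 + [
    "unhappiness", "unkindness", "happily", "darkness", "sadness", "zyzzyva"]
word_counts = Counter(corpus)
piece_counts = Counter(p for w in corpus for p in subword_level(w))
print(word_counts["unhappiness"])       # 1: a singleton at the word level, no training signal
for p in subword_level("unhappiness"):
    print(p, piece_counts[p])           # un 2, happi 2, ness 4: every piece recurs elsewhere
```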