# Tokenization
Tokenization is the process of splitting text into tokens (or subword units) and mapping each one to a discrete form, i.e. a number.
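A minimal sketch of that mapping, assuming a made-up toy vocabulary (real tokenizers learn theirs from data and handle unknowns more carefully):

```python
# Toy illustration with a hypothetical vocabulary: token strings -> integer IDs.
vocab = {"[UNK]": 0, "the": 1, "cat": 2, "sat": 3, "token": 4, "##ization": 5}

def encode(tokens):
    """Look up each token's ID, falling back to [UNK] for anything out of vocabulary."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

print(encode(["the", "cat", "sat"]))      # [1, 2, 3]
print(encode(["token", "##ization"]))     # [4, 5]
print(encode(["dog"]))                    # [0]  -> unknown token
```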
## Subword Tokenizations
[[Transformers]] models like [[BERT]] incorporate an important design decision that lets them gracefully handle the open-vocabulary problem: subword tokenization.
Subword tokenization produces tokens at multiple levels of granularity, from individual characters to full words. As a result, rare words are broken down into a collection of subword units, bottoming out in characters in the worst case.
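A sketch of that fallback behaviour, assuming a small made-up subword vocabulary and a simple greedy longest-match segmenter (WordPiece-style); actual tokenizers differ in the details:

```python
# Hypothetical subword vocabulary; real ones are learned from a corpus.
subwords = {"un", "happi", "ness", "h", "a", "p", "i", "n", "e", "s", "u"}

def segment(word):
    """Greedy longest-match segmentation: prefer long subwords, bottom out in characters."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest remaining prefix first
            if word[i:j] in subwords:
                pieces.append(word[i:j])
                i = j
                break
        else:                               # nothing matched -> emit a single character
            pieces.append(word[i])
            i += 1
    return pieces

print(segment("unhappiness"))   # ['un', 'happi', 'ness']
print(segment("xyz"))           # ['x', 'y', 'z']  -> character fallback for unseen strings
```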
The two most widely used subword tokenization methods are:
- [[Byte Pair Encoding]]
- [[SentencePiece - Unigram LM Encoding]]
Note: Both BPE and Unigram LM tokenization consist of a vocabulary-construction step and an inference (segmentation) step, but they differ in how each step is carried out.
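To make the vocabulary-construction side concrete, here is a minimal BPE-style sketch on a made-up word list: the most frequent adjacent symbol pair is merged repeatedly. (Unigram LM instead starts from a large candidate vocabulary and prunes it; that step is not shown here.)

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent symbol pair."""
    corpus = [list(w) for w in words]   # start with each word as a character sequence
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        for symbols in corpus:          # apply the chosen merge everywhere
            i = 0
            while i < len(symbols) - 1:
                if (symbols[i], symbols[i + 1]) == best:
                    symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
                else:
                    i += 1
    return merges

# Hypothetical toy corpus.
print(learn_bpe_merges(["low", "lower", "lowest", "low"], num_merges=3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```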
![[BPE vs Unigram LM Encoding.png]]
---
## References
1. Byte Pair Encoding is Suboptimal for Language Model Pretraining https://arxiv.org/abs/2004.03720