# Tokenization
Tokenization is the process of converting raw text into a sequence of discrete tokens that a model can consume.
## Subword Tokenization
[[Transformers]] models like [[BERT]] incorporate an important design decision that lets them gracefully handle the open vocabulary problem: subword tokenization.
Subword tokenization produces tokens at multiple levels of granularity, from individual characters to full words. As a result, rare words are broken down into collections of subword units, bottoming out in individual characters in the worst case.
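One way to see this spectrum of granularity is to probe a pretrained tokenizer. A minimal sketch, assuming the Hugging Face `transformers` package is installed (BERT uses WordPiece, a close relative of BPE; the exact splits depend on the learned vocabulary):

```python
# Inspect how a pretrained subword tokenizer splits words of varying frequency.
# Assumes: pip install transformers (downloads the bert-base-uncased vocab on first use).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

for word in ["the", "running", "unhappiness", "zyzzyva"]:
    print(f"{word:12s} -> {tok.tokenize(word)}")

# Illustrative output shape (exact pieces depend on the learned vocabulary):
#   the          -> a single token
#   unhappiness  -> a handful of subword pieces, e.g. starting with 'un'
#   zyzzyva      -> many short, near-character pieces
```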
The two most widely used subword tokenization methods are:
- [[Byte Pair Encoding]]
- [[SentencePiece - Unigram LM Encoding]]
Note: Both BPE and Unigram LM tokenization follow the same two-phase structure: a vocabulary construction step over a training corpus, followed by an inference step that segments new text using the learned vocabulary.
![[BPE vs Unigram LM Encoding.png]]
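A minimal, self-contained sketch of those two phases for BPE on a toy corpus (it skips end-of-word markers and other practical details; `learn_bpe` and `apply_bpe` are illustrative names, not a library API):

```python
# Toy BPE: (1) vocabulary construction learns a ranked list of merges from
# corpus statistics, (2) inference replays those merges on new words.
from collections import Counter

def learn_bpe(words, num_merges):
    """Vocabulary construction: repeatedly merge the most frequent symbol pair."""
    # Each word starts as a tuple of characters, weighted by its corpus count.
    corpus = Counter({tuple(w): c for w, c in Counter(words).items()})
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += count
        corpus = new_corpus
    return merges

def apply_bpe(word, merges):
    """Inference: greedily replay the learned merges, in the order they were learned."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

words = "low lower lowest new newer newest widow window".split()
merges = learn_bpe(words, num_merges=10)
print(merges)                       # e.g. [('o', 'w'), ('l', 'ow'), ('n', 'e'), ...]
print(apply_bpe("lowest", merges))  # e.g. ['lowe', 'st']: pieces learned from the corpus
print(apply_bpe("slower", merges))  # e.g. ['s', 'lower']: unseen word, still covered
```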
## Benefits of subword tokenization
- Graceful OOV Handling
- Subword tokenization eliminates the out-of-vocabulary problem entirely. Under word-level tokenization, any unseen word maps to a single UNK token, losing all information. With BPE, unseen words are decomposed into known subword pieces—"unhappiness" becomes "un" + "happiness", letting the model leverage its existing knowledge of these components. In the worst case, a completely novel word falls back to individual characters or bytes, which are always in the vocabulary. This compositional structure means the model can generalize to novel words by combining the semantics of familiar pieces (see the sketch after this list).
- Tail Compression
- Word frequency follows a power law distribution with a long tail of millions of rare words, each appearing only a handful of times in training data. **Learning good representations for these is essentially impossible.** BPE compresses this tail by dissolving rare words into subword pieces that are shared across many words—"zyzzyva" may be rare, but its pieces "zy" and "va" appear in other contexts. This redistributes the frequency mass from millions of rare words into a bounded vocabulary where even the least common tokens have been seen thousands of times. The result is a tractable tail: every token in the vocabulary has enough training signal to learn a meaningful representation.
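A toy sketch of both benefits above, using a made-up subword vocabulary and greedy longest-match segmentation with character fallback as a simplified stand-in for real BPE/WordPiece inference (all words, names, and counts below are illustrative):

```python
# Part 1 shows graceful OOV handling; part 2 shows frequency mass moving from
# singleton word types onto subword pieces shared across words.
from collections import Counter

SUBWORDS = {"un", "happi", "ness", "kind", "dark", "sad", "ly",
            "the", "cat", "run", "s", "ing", "zy", "va", "z"}
WORD_VOCAB = {"the", "cat", "runs", "running"}           # word-level baseline

def word_level(word):
    return [word] if word in WORD_VOCAB else ["<UNK>"]   # unseen word: all information lost

def subword_level(word):
    """Greedy longest-match against SUBWORDS, falling back to single characters."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):                # try the longest remaining span first
            if word[i:j] in SUBWORDS:
                pieces.append(word[i:j])
                i = j
                break
        else:                                            # no piece matched: emit one character
            pieces.append(word[i])
            i += 1
    return pieces

# 1) Graceful OOV handling: unseen words decompose instead of collapsing to <UNK>.
for w in ["running", "unhappiness", "zyzzyva"]:
    print(w, word_level(w), subword_level(w))
# running      ['running']   ['run', 'n', 'ing']
# unhappiness  ['<UNK>']     ['un', 'happi', 'ness']
# zyzzyva      ['<UNK>']     ['zy', 'z', 'zy', 'va']     # bottoms out near characters

# 2) Tail compression: a word seen once still yields pieces shared with other words.
corpus = ["the", "cat", "runs"] * 50 + [
    "unhappiness", "unkindness", "happily", "darkness", "sadness", "zyzzyva"]
word_counts = Counter(corpus)
piece_counts = Counter(p for w in corpus for p in subword_level(w))
print(word_counts["unhappiness"])       # 1: a singleton at the word level, no training signal
for p in subword_level("unhappiness"):
    print(p, piece_counts[p])           # un 2, happi 2, ness 4: every piece recurs elsewhere
```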