# Byte Pair Encoding

BPE tokenization takes the vocabulary V of ordered merges and applies them to new text in the same order in which they were learned during vocabulary construction; a sketch of this application step appears below.

![[BPE Algorithm.png]]

## WordPiece

[[BERT]]'s vocabulary is constructed with the WordPiece algorithm, a close variant of BPE. However, instead of merging the most frequent token bigram, each candidate merge is scored by the likelihood of an n-gram language model trained on a version of the corpus that incorporates the merge.
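As a minimal sketch of the BPE application step, the function below replays an ordered merge list over a single word. The toy merge list and the character-level starting point are illustrative assumptions; real tokenizers typically add word-boundary markers and may operate on bytes rather than characters.

```python
def bpe_encode(word, merges):
    """Apply an ordered list of learned BPE merges to one word.

    `merges` is the ordered list of pairs produced during vocabulary
    construction; applying them in that same order reproduces the
    training-time segmentation on new text.
    """
    # Start from individual characters (a simplifying assumption).
    symbols = list(word)
    for left, right in merges:  # same order as during training
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]  # merge the pair in place
            else:
                i += 1
    return symbols

# Hypothetical merge list learned from some corpus.
merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]
print(bpe_encode("lowest", merges))  # ['low', 'est']
```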
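For WordPiece, only the merge-selection criterion changes. Evaluating a full n-gram language model per candidate merge is expensive, so a commonly cited approximation scores each adjacent pair as freq(pair) / (freq(first) × freq(second)), favoring pairs whose parts rarely occur apart over pairs that are merely frequent. The sketch below uses that approximation on an assumed toy corpus; it is not BERT's actual implementation.

```python
from collections import Counter

def wordpiece_scores(tokenized_corpus):
    """Score candidate merges with the likelihood-style WordPiece criterion.

    Each adjacent pair is scored as freq(pair) / (freq(first) * freq(second)),
    an approximation to the language-model likelihood gain of the merge.
    """
    unigrams = Counter()
    bigrams = Counter()
    for symbols in tokenized_corpus:
        unigrams.update(symbols)
        bigrams.update(zip(symbols, symbols[1:]))
    return {
        pair: count / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, count in bigrams.items()
    }

# Toy corpus, character-tokenized.
corpus = [list("hugging"), list("hug"), list("bug")]
scores = wordpiece_scores(corpus)
best = max(scores, key=scores.get)
print(best, scores[best])  # ('i', 'n') 1.0
```

On this toy corpus the two criteria diverge: BPE's frequency rule would merge ("u", "g"), the most common bigram, while the likelihood-style score prefers ("i", "n"), whose parts never occur apart.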