# BERT

- BERT (Bidirectional Encoder Representations from Transformers) is a large [[Transformers|Transformer]] model trained on two unsupervised tasks:
    - Masked language modeling
    - Next sentence prediction
- General NLP model that can be used for:
    - Fine-tuning task-specific models
    - Creating contextualized word embeddings like [[ELMo|ELMo]], or sentence embeddings (see the first sketch at the end of this note)

## Architecture

- Includes only the encoder stack of the originally proposed Transformer
- Accepts input sequences of up to 512 tokens

![[BERT Architecture.png]]

### BERT Base

- Comparable in size to the OpenAI Transformer so that performance can be compared directly
- 12 Transformer layers, 12 self-attention heads, and a hidden size of 768
- 110 million parameters

### BERT Large

- The model that achieved the state-of-the-art results reported in the paper
- 24 Transformer layers, 16 self-attention heads, and a hidden size of 1024
- 340 million parameters

## Training

### Pre-training

- Fairly expensive (4 days on 16 TPUs), but a one-time procedure for each language
- Masked language modeling (see the second sketch at the end of this note)
    - Select 15% of the input tokens; of those, replace 80% with `[MASK]`, 10% with a random token, and leave 10% unchanged, then predict the original tokens
- Next sentence prediction
    - Given two sentences A and B, is B likely to be the sentence that follows A or not?

### Fine-tuning

- Inexpensive: all results in the paper can be replicated in at most 1 hour on a single TPU
- Can be used in multiple ways to train task-specific models

![[Task Specific BERT.png]]

---

## References

1. The Illustrated BERT: http://jalammar.github.io/illustrated-bert/
2. Original TensorFlow implementation: https://github.com/google-research/bert
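---

## Example: Contextualized embeddings

A minimal sketch of the embedding use case mentioned above: pull per-token contextualized vectors (and a mean-pooled sentence embedding) out of a pre-trained BERT. It assumes the Hugging Face `transformers` and `torch` packages and the public `bert-base-uncased` checkpoint, not the original TensorFlow implementation referenced above.

```python
# Sketch: contextualized token embeddings and a simple sentence embedding
# from a pre-trained BERT Base (12 layers, hidden size 768).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")  # encoder stack only
model.eval()

sentence = "BERT builds contextualized word embeddings."
inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# One 768-dim vector per (sub)word token, conditioned on the whole sentence.
token_embeddings = outputs.last_hidden_state            # shape: (1, seq_len, 768)

# A simple sentence embedding: mean-pool the token vectors over the attention mask.
mask = inputs["attention_mask"].unsqueeze(-1)            # shape: (1, seq_len, 1)
sentence_embedding = (token_embeddings * mask).sum(1) / mask.sum(1)  # shape: (1, 768)

print(token_embeddings.shape, sentence_embedding.shape)
```

Unlike static word vectors, the same word gets a different vector in different sentences, since every token attends to its full left and right context.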
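## Example: Masked language modeling

A minimal sketch of the masked-language-modeling objective at inference time: feed a sentence containing `[MASK]` and read off the tokens BERT considers most likely for that position. Same assumptions as the previous sketch (`transformers`, `torch`, `bert-base-uncased`).

```python
# Sketch: querying BERT's masked-language-modeling head for a masked position.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Locate the [MASK] token and take the top-5 vocabulary predictions for it.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_pos].topk(5, dim=-1).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```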