# BERT
- BERT (Bidirectional Encoder Representations from Transformers) is a large [[Transformers|Transformer]] model pre-trained on two unsupervised tasks:
- Masked language modeling
- Next sentence prediction
- General-purpose NLP model that can be used for
- fine-tuning task-specific models
- creating contextualized word embeddings like [[ELMo|ELMo]] or sentence embeddings (see the sketch below)
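- For the embedding use case, a minimal sketch using the Hugging Face `transformers` library (an assumption; the note does not name a library) to pull contextualized token vectors out of a pre-trained BERT Base:
```python
# Sketch: contextualized token embeddings from a pre-trained BERT Base.
# Assumes the Hugging Face `transformers` and `torch` packages.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("The bank raised interest rates.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-token contextualized embeddings: shape (batch, seq_len, hidden_size=768)
token_embeddings = outputs.last_hidden_state
# One simple way to get a sentence embedding: mean-pool the token vectors
sentence_embedding = token_embeddings.mean(dim=1)
print(token_embeddings.shape, sentence_embedding.shape)
```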
## Architecture
- Includes only the encoder stack of the originally proposed Transformer
- Accepts input sequences of up to 512 tokens
![[BERT Architecture.png]]
### BERT Base
- Comparable in size to the OpenAI Transformer so that their performance can be compared directly
- 12 Transformer layers, 12 self-attention heads, and a hidden size of 768
- 110 million parameters
### BERT Large
- The model that achieved the state-of-the-art results reported in the paper
- 24 Transformer layers, 16 self-attention heads, and a hidden size of 1024
- 340 million parameters
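- As a rough sanity check on these sizes, the sketch below instantiates both configurations with Hugging Face's `BertConfig`/`BertModel` (an assumed dependency; weights are randomly initialized) and counts parameters, which should land close to the 110M and 340M figures above:
```python
# Sketch: build BERT Base / Large shaped models and count their parameters.
# Assumes Hugging Face `transformers`; weights are randomly initialized.
from transformers import BertConfig, BertModel

configs = {
    "base": BertConfig(num_hidden_layers=12, num_attention_heads=12,
                       hidden_size=768, intermediate_size=3072),
    "large": BertConfig(num_hidden_layers=24, num_attention_heads=16,
                        hidden_size=1024, intermediate_size=4096),
}

for name, cfg in configs.items():
    model = BertModel(cfg)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"BERT {name}: {n_params / 1e6:.0f}M parameters")
```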
## Training
### Pre-training
- Fairly expensive (about 4 days on 4 to 16 Cloud TPUs), but a one-time procedure for each language
- Masked language modelling
- Randomly select 15% of the input tokens: 80% of those are replaced with [MASK], 10% with a random token, and 10% are left unchanged; the model must predict the original tokens (sketch after this list)
- Next sentence prediction
- Given two sentences A and B, predict whether B is the sentence that actually follows A (B is the true next sentence half the time and a randomly chosen sentence otherwise)
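- Both objectives are easiest to see as labelling steps applied to the input; below is a rough sketch of the masked-LM corruption rule (the 15% / 80-10-10 split is from the paper, while the token ids are placeholders):
```python
# Sketch of BERT's masked-LM corruption: pick 15% of the tokens; of those,
# 80% become [MASK], 10% become a random token, 10% are left unchanged.
# MASK_ID and VOCAB_SIZE are placeholder values, not taken from a real vocab.
import random

MASK_ID = 103        # placeholder id for the [MASK] token
VOCAB_SIZE = 30522   # placeholder vocabulary size

def mask_tokens(token_ids, mask_prob=0.15):
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)   # -100 = "ignore this position" in the loss
    for i, tok in enumerate(token_ids):
        if random.random() >= mask_prob:
            continue
        labels[i] = tok                # model must predict the original token here
        roll = random.random()
        if roll < 0.8:
            inputs[i] = MASK_ID        # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = random.randrange(VOCAB_SIZE)  # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels
```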
### Fine-tuning
- Inexpensive; all results in the paper can be replicated in at most 1 hour on a single Cloud TPU (or a few hours on a GPU)
- Can be used in multiple ways to train task-specific models (see the classification sketch below)
- ![[Task Specific BERT.png]]
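- A minimal fine-tuning sketch for one such task, sentence classification through the [CLS] position, again using Hugging Face `transformers` (an assumption; the texts, labels, and hyperparameters are toy placeholders):
```python
# Sketch: one fine-tuning step of BERT for binary sentence classification.
# Assumes Hugging Face `transformers` and `torch`; the data is a toy placeholder.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = ["a great movie", "a terrible plot"]   # toy examples
labels = torch.tensor([1, 0])                  # toy binary labels

batch = tokenizer(texts, padding=True, truncation=True, max_length=512,
                  return_tensors="pt")
model.train()
outputs = model(**batch, labels=labels)  # classification head sits on top of [CLS]
outputs.loss.backward()                  # cross-entropy loss computed internally
optimizer.step()
print(f"loss: {outputs.loss.item():.3f}")
```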
---
## References
1. The Illustrated BERT http://jalammar.github.io/illustrated-bert/
2. Original TensorFlow implementation https://github.com/google-research/bert