# Language Models
Linguists vs Statistics:
> But it must be recognized that the notion 'probability of a sentence' is an entirely useless one, under any known interpretation of this term. (Chomsky 1969)
> Whenever I fire a linguist our system performance improves. (Jelinek, 1988: reported)
Language modeling is not an end-to-end task, but
- language models are important components of many NLP end-to-end tasks
- many end-to-end tasks can be considered an extension of a language model, e.g., machine translation
The probability of a word sequence is a statistical approximation of its
- syntactic fluency
- semantic appropriateness
- general well-formedness
In speech recognition:
- $p(\text{We won't I scream}) < p(\text{We want ice cream})$
In machine translation:
- $p(\text{Get we ice cream}) < p(\text{We want ice cream})$
Most language models use the chain rule to decompose the joint probability into a product of conditional probabilities:
- $p\left(w_{1} \ldots w_{n}\right)=p\left(w_{1}\right) p\left(w_{2} \mid w_{1}\right) p\left(w_{3} \mid w_{1} w_{2}\right) \ldots p\left(w_{n} \mid w_{1} \ldots w_{n-1}\right)$
- Decomposition via the chain rule is exact
- Conditioning on long histories can become problematic due to zero counts
- Markov assumption: $p(w \mid h)$ only depends on the last $n-1$ words in $h$
- n indicates the Markov order and gives rise to n-gram models (see the sketch below)
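As a minimal sketch of the Markov assumption (the helper name and example are illustrative, not from the notes), the full history is truncated to its last n-1 words before the conditional probability is looked up:

```python
def markov_context(history, n):
    """Truncate a full history h to its last n-1 words (Markov assumption)."""
    return tuple(history[-(n - 1):]) if n > 1 else ()

# For a trigram model (n = 3), only the last two words of the history matter:
history = ["<s>", "the", "man", "has", "a"]
print(markov_context(history, 3))  # ('has', 'a')
```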
## n-grams
- An n-gram language model computes the probability of a string as the product of the conditional probabilities of its consecutive n-grams:
- $p(<\mathrm{s}>$ the man has a dog bought . $</\mathrm{s}>)$
$= p(\text{the} \mid <\mathrm{s}>) \cdot p(\text{man} \mid <\mathrm{s}> \text{ the}) \cdot p(\text{has} \mid \text{the man}) \cdot p(\text{a} \mid \text{man has}) \cdot p(\text{dog} \mid \text{has a}) \cdots$
- Generally: $p\left(w_{1}^{N}\right)=\prod_{i=1}^{N} p\left(w_{i} \mid w_{i-n+1}^{i-1}\right)$, for order $n$
- Problem: if a single n-gram probability is zero, e.g.
$p(. \mid \text{dog bought}) = 0$, then the entire product is zero
- Trade-off: the more context we use (by increasing n), the more accurate the model becomes, but the harder its parameters are to estimate, since reliable counts require larger corpora.
- Given a real utterance $w_{1}^{N}$ in a language, a good language model will assign a high probability to $w_{1}^{N}$
- Commonly measured with cross entropy (a worked sketch follows this list):
$
H\left(w_{1}^{N}\right)=-\frac{1}{N} \log _{2} p\left(w_{1}^{N}\right)
$
- Intuitively, cross entropy measures how many bits are needed to encode this text in our model
- Alternatively, one can use perplexity: $\operatorname{perplexity}\left(w_{1}^{N}\right)=2^{H\left(w_{1}^{N}\right)}$
- Intuitively, perplexity measures how surprised our model is when observing each word.
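A small worked sketch of both quantities, assuming we already have the per-word log2-probabilities from some trained model (the inputs here are made up for illustration):

```python
import math

def cross_entropy_and_perplexity(word_log2_probs):
    """H(w_1^N) = -(1/N) * sum(log2 p(w_i | history)); perplexity = 2**H."""
    n = len(word_log2_probs)
    h = -sum(word_log2_probs) / n
    return h, 2 ** h

# A 4-word sentence in which the model assigns each word probability 1/8:
log_probs = [math.log2(1 / 8)] * 4
h, ppl = cross_entropy_and_perplexity(log_probs)
print(h, ppl)  # 3.0 bits per word, perplexity 8.0
```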
### Bigram
An n-gram model with n=2: the assumption is that the next word depends only on the previous word.
A probability is assigned to a word based on the previous word:
$
P\left(w_{n} \mid w_{n-1}\right)
$
where $w_{n}$ is the nth word in a sentence.
Probability of a sequence of words (under the Markov assumption):
$
P\left(w_{1}^{n}\right) \approx \prod_{k=1}^{n} P\left(w_{k} \mid w_{k-1}\right)
$
Probabilities are estimated with [[Maximum Likelihood Estimation#MLE of Binomial Distribution]], i.e. from counts in a training corpus:
$
P\left(w_{n} \mid w_{n-1}\right)=\frac{C\left(w_{n-1} w_{n}\right)}{\sum_{w} C\left(w_{n-1} w\right)} \approx \frac{C\left(w_{n-1} w_{n}\right)}{C\left(w_{n-1}\right)}
$
i.e. the count of a particular bigram in the corpus divided by the count of all bigrams starting with the preceding word (a sketch of this estimation follows below).
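A minimal sketch of this MLE estimation (the toy corpus and function name are illustrative, not from the notes):

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Estimate p(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}) from a corpus."""
    unigram_counts, bigram_counts = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigram_counts.update(tokens[:-1])             # denominators C(w_{n-1})
        bigram_counts.update(zip(tokens, tokens[1:]))  # numerators  C(w_{n-1} w_n)
    return {
        (prev, cur): count / unigram_counts[prev]
        for (prev, cur), count in bigram_counts.items()
    }

probs = train_bigram_mle(["the man has a dog", "the dog has a ball"])
print(probs[("the", "man")])  # 0.5: 'the' occurs twice, once followed by 'man'
```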
### Trigram
A probability is assigned to a word based on two previous words:
$
P\left(w_{n} \mid w_{n-2} w_{n-1}\right)
$
where $w_{n}$ is the nth word in a sentence.
Probability of a sequence of words (under the Markov assumption):
$
P\left(w_{1}^{n}\right) \approx \prod_{k=1}^{n} P\left(w_{k} \mid w_{k-2} w_{k-1}\right)
$
Trigram (and higher-order) models are the ones most commonly used in practice.
## Smoothing
A zero probability for a single n-gram makes the probability of the entire sentence zero (because the probabilities are multiplied). This arises because we cannot observe all possible n-grams in the training data, so unseen n-grams receive zero counts.
Solutions:
- smoothing: distribute 'extra' probability between rare and unseen events
- backoff and interpolation: approximate unseen probabilities by a more general probability, e.g. unigrams
### Laplace (add 1) smoothing
$
P\left(w_{n} \mid w_{n-1}\right)=\frac{C\left(w_{n-1} w_{n}\right)+1}{C\left(w_{n-1}\right)+|V|}
$
Simple to implement, but only suitable for problems with few unseen events; in reality there are many unseen n-grams.
Add-1 is still used to smooth other NLP models, e.g. for text classification, and in domains where the number of zeros is not so large (a sketch follows below).
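A sketch of the add-1 estimate, reusing unigram and bigram count tables like the Counters built inside the earlier bigram sketch (function name is illustrative):

```python
def laplace_bigram_prob(bigram_counts, unigram_counts, vocab_size, prev, cur):
    """Add-1 smoothed estimate: (C(prev cur) + 1) / (C(prev) + |V|)."""
    return (bigram_counts.get((prev, cur), 0) + 1) / (
        unigram_counts.get(prev, 0) + vocab_size
    )

# An unseen bigram such as ('dog', 'bought') now gets a small non-zero
# probability instead of collapsing the whole sentence probability to 0.
```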
Other smoothing:
- [[Dirichlet smoothing]]
- Kneser-Ney smoothing
- Weight lower-order probabilities by the number of contexts in which they occur
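A sketch of the continuation-count idea behind Kneser-Ney (only the lower-order part; the discounting of higher-order counts is omitted, and the function name is illustrative):

```python
from collections import defaultdict

def continuation_probability(bigram_counts):
    """P_cont(w): number of distinct left contexts w appears in,
    normalised by the total number of distinct bigram types."""
    left_contexts = defaultdict(set)
    for (prev, cur) in bigram_counts:
        left_contexts[cur].add(prev)
    total_bigram_types = len(bigram_counts)
    return {w: len(ctx) / total_bigram_types for w, ctx in left_contexts.items()}

# A word like 'Francisco' may be frequent overall but follow very few distinct
# words (mostly 'San'), so its continuation probability stays low.
```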
### Backoff and Interpolation
Sometimes it helps to use less context. The idea is to condition on less context for contexts you haven't learned much about. Two approaches:
Backoff: use the trigram if there is good evidence for it, otherwise back off to the bigram, otherwise to the unigram (Katz smoothing).
Interpolation: mix unigram, bigram, and trigram estimates. Interpolation works better than backoff in practice.
Linear interpolation: [[Jelinek-Mercer smoothing]] (a sketch follows below)
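A sketch of linearly interpolating the three MLE estimates (the probability tables and lambda weights are assumed inputs; the lambdas should sum to 1 and are typically tuned on held-out data):

```python
def interpolated_trigram_prob(p_uni, p_bi, p_tri, lambdas, w1, w2, w3):
    """p(w3 | w1 w2) = l3*p_tri + l2*p_bi + l1*p_uni (Jelinek-Mercer style)."""
    l1, l2, l3 = lambdas
    return (l3 * p_tri.get((w1, w2, w3), 0.0)
            + l2 * p_bi.get((w2, w3), 0.0)
            + l1 * p_uni.get(w3, 0.0))
```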
Other advanced smoothing methods include absolute discounting, Good-Turing smoothing, etc.
## Handling unknown words
Most tasks in NLP are open vocabulary tasks. Test data will contain out of vocabulary (OOV) words. A simple solution widely used is:
1. Create an unknown word token UNK
2. Train UNK probabilities
- Create a fixed lexicon L of size V
- In the corpus, replace all words not in L with UNK
- Train the UNK probabilities like those of a normal word
- At test time, use the UNK probabilities for any OOV word (sketched below)
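A minimal sketch of this recipe, taking the most frequent words as the fixed lexicon L (a common choice; names are illustrative):

```python
from collections import Counter

def replace_oov(sentences, lexicon_size):
    """Keep the most frequent words as lexicon L, map the rest to <UNK>."""
    counts = Counter(w for sent in sentences for w in sent.split())
    lexicon = {w for w, _ in counts.most_common(lexicon_size)}
    return [[w if w in lexicon else "<UNK>" for w in sent.split()]
            for sent in sentences]

# The rewritten corpus is then used to train the n-gram model as usual;
# at test time every OOV word is mapped to <UNK> before querying the model.
```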
## Quantization
- How precise do language model probabilities have to be?
- We are not so much interested in the actual probability itself as in its ability to discriminate between degrees of fluency
- Log probabilities are typically stored as floats (4 bytes)
- Quantization approximates the actual probabilities using a lower-bit representation
- Two approaches to quantization:
- Lloyd's algorithm: Cluster values using [[K-Means]], distance is absolute numerical difference, keep only k centroids
- Binning: Partition values into uniformly populated intervals (sketched after this list)
- Numerically sort all unique values and partition into k non-overlapping intervals
- The k intervals are represented by respective mean
- Pointers to approximated probabilities can be stored in $\left\lceil\log _{2} k\right\rceil$ bits
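A sketch of the binning approach (the Lloyd's/k-means variant is not shown; the function name is illustrative):

```python
import math

def bin_quantize(log_probs, k):
    """Sort unique values, split into k uniformly populated intervals,
    represent each interval by its mean, store values as bin indices."""
    values = sorted(set(log_probs))
    bin_size = math.ceil(len(values) / k)
    bins = [values[i:i + bin_size] for i in range(0, len(values), bin_size)]
    centroids = [sum(b) / len(b) for b in bins]
    index = {v: i for i, b in enumerate(bins) for v in b}
    codes = [index[v] for v in log_probs]  # each code fits in ceil(log2 k) bits
    return centroids, codes
```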
## Strengths of n-gram models
- Count-based language model strengths:
- Extremely scalable to huge training data sets
- a larger LM results in better end-task performance, e.g., in SMT
- Good at memorizing
- exact probabilities do not matter that much (quantization)
- Very fast/constant querying time
- basically hash table lookup
## Limitations of n-gram models
- Requires large amount of memory
- State-of-the-art LMs easily need 50-100 GB of memory
- N-gram models use limited context
- cannot model long-distance dependencies in language
- higher-order LMs are pathologically sparse and only benefit from larger data to a limited extent
- Not very good at generalizing beyond what it has observed
- Count-based LMs do not capture relationships between:
- semantically related words: 'green' vs. 'blue'
- morphologically related words: 'bought' vs. 'buys'
- Good alternative: [[Probabilistic Neural Network Language Models]]
---
## References
1. Chapter 3, Jurafsky and Martin, 2019
2. DL4NLP Course, UvA 2021