# Part of Speech Tagging
POS tagging is useful because:
1. It is the first step of syntactic analysis (which in turn is often useful for semantic analysis).
2. POS taggers are simpler and usually faster than full syntactic parsers, and their output alone is sometimes enough to be useful:
- POS tags can be useful features, e.g. in text classification.
- POS tags can be useful for applications such as speech synthesis (e.g. whether *record* is a noun or a verb changes its pronunciation).
## Tagsets
A tagset is a standardized set of codes for fine-grained parts of speech. Two common tagsets:
1. CLAWS 5: over 60 tags
2. Penn Treebank part-of-speech tagset: 45 tags
![[Screenshot 2020-11-12 at 3.16.00 PM.jpg]]
## Ambiguity in POS tagging
POS tagging is a disambiguation task: many words are ambiguous between several parts of speech, and the goal is to find the correct tag for the given context.
Some of the most ambiguous frequent words are that, back, down, put and set; here are some examples of the 6 different parts of speech for the word back:
- earnings growth took a _back/JJ_ seat
- a small building in the _back/NN_
- a clear majority of senators _back/VBP_ the bill
- Dave began to _back/VB_ toward the door
- enable the country to buy _back/RP_ about debt
- I was twenty-one _back/RB_ then
The Brown corpus (1,000,000 word tokens) has 39,440 different word types.
- 35,340 word types (89.6%) have only one POS tag anywhere in the corpus
- 4,100 word types (10.4%) have between 2 and 7 POS tags
So why does just 10.4% POS-tag ambiguity by word type make tagging difficult? Because the ambiguous types include many of the most frequent words, so they account for roughly half of the tokens in running text.
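These figures can be roughly sanity-checked in a few lines of Python. A minimal sketch, assuming NLTK is installed and `nltk.download("brown")` has been run; NLTK's copy of the Brown corpus uses its own larger tagset, and the case-folding below is a simplification, so the exact percentages will not match the numbers above:
```python
from collections import defaultdict

from nltk.corpus import brown  # assumes nltk is installed and "brown" downloaded

tagged = brown.tagged_words()  # sequence of (word, tag) pairs
tags_per_type = defaultdict(set)
for word, tag in tagged:
    tags_per_type[word.lower()].add(tag)

# Types with more than one observed tag, and the tokens belonging to them
ambiguous_types = {w for w, t in tags_per_type.items() if len(t) > 1}
ambiguous_tokens = sum(1 for word, _ in tagged if word.lower() in ambiguous_types)

print(f"{len(ambiguous_types) / len(tags_per_type):.1%} of word types are ambiguous")
print(f"{ambiguous_tokens / len(tagged):.1%} of tokens are ambiguous")
```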
## Unigram POS Tagging
Just assign each word its most common tag from a training corpus. Surprisingly, this crude approach gives around 90% accuracy.
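A minimal sketch of such a unigram tagger, assuming training data is available as a list of (word, tag) pairs; the names and the fallback tag are illustrative, not from any particular library:
```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_words, default_tag="NN"):
    """Map each word to the tag it occurs with most often in training."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    most_common = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    # Words never seen in training fall back to a default tag.
    return lambda word: most_common.get(word, default_tag)


tag = train_unigram_tagger([("the", "DT"), ("back", "NN"), ("back", "RB"),
                            ("back", "NN"), ("seat", "NN")])
print(tag("back"), tag("the"), tag("selfie"))  # NN DT NN
```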
## HMM POS tagging
1. Start with untagged text.
2. Assign all possible tags to each word in the text on the basis of a lexicon that associates words and tags.
3. Find the most probable sequences (n-best sequences) of tags, based on probabilities from the training data.
- lexical probability: e.g., is _can_ most likely to be VM0, VVB, VVI or NN1?
- and tag sequence probabilities: e.g., is VM0 or NN1 more likely after PNP?
Estimate the tag sequence: the sequence of $n$ tags with the maximum probability, given the $n$ words:
$
\hat{t}_{1}^{n}=\underset{t_{1}^{n}}{\operatorname{argmax}} P\left(t_{1}^{n} \mid w_{1}^{n}\right)
$
By Bayes' theorem,
$
P\left(t_{1}^{n} \mid w_{1}^{n}\right)=\frac{P\left(w_{1}^{n} \mid t_{1}^{n}\right) P\left(t_{1}^{n}\right)}{P\left(w_{1}^{n}\right)}
$
but $P\left(w_{1}^{n}\right)$ is constant for any given word sequence, so it can be dropped from the argmax:
$
\hat{t}_{1}^{n}=\underset{t_{1}^{n}}{\operatorname{argmax}} P\left(w_{1}^{n} \mid t_{1}^{n}\right) P\left(t_{1}^{n}\right)
$
Bigram assumption: the probability of a tag depends only on the previous tag, so the tag-sequence probability is a product of bigram probabilities:
$
P\left(t_{1}^{n}\right) \approx \prod_{i=1}^{n} P\left(t_{i} \mid t_{i-1}\right)
$
The probability of a word is estimated on the basis of its own tag alone (independently of the surrounding words and tags):
$
P\left(w_{1}^{n} \mid t_{1}^{n}\right) \approx \prod_{i=1}^{n} P\left(w_{i} \mid t_{i}\right)
$
Hence:
$
\hat{t}_{1}^{n}=\underset{t_{1}^{n}}{\operatorname{argmax}} \prod_{i=1}^{n} P\left(w_{i} \mid t_{i}\right) P\left(t_{i} \mid t_{i-1}\right)
$
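This argmax is computed efficiently with the Viterbi algorithm. Below is a minimal sketch; the tagset and the probability tables are toy, illustrative values rather than anything trained on a corpus, and a real tagger would use the full tagset with properly smoothed probabilities:
```python
import math

# Toy tables: trans[prev][t] ~ P(t | prev), emit[t][w] ~ P(w | t);
# "<s>" is a start-of-sentence pseudo-tag. Values are illustrative only.
trans = {"<s>": {"DT": 0.6, "NN": 0.3, "VB": 0.1},
         "DT":  {"NN": 0.9, "VB": 0.05, "DT": 0.05},
         "NN":  {"VB": 0.5, "NN": 0.3, "DT": 0.2},
         "VB":  {"DT": 0.6, "NN": 0.3, "VB": 0.1}}
emit = {"DT": {"the": 0.7, "a": 0.3},
        "NN": {"dog": 0.4, "walk": 0.2},
        "VB": {"walk": 0.5, "dog": 0.1}}
tags = ["DT", "NN", "VB"]
TINY = 1e-12  # stand-in for unseen events; real systems smooth properly


def viterbi(words):
    """Return the most probable tag sequence under the bigram HMM."""
    # best[t] = (log-probability of the best path ending in tag t, that path)
    best = {t: (math.log(trans["<s>"].get(t, TINY))
                + math.log(emit[t].get(words[0], TINY)), [t]) for t in tags}
    for w in words[1:]:
        new_best = {}
        for t in tags:
            e = math.log(emit[t].get(w, TINY))
            score, prev = max((best[p][0] + math.log(trans[p].get(t, TINY)) + e, p)
                              for p in tags)
            new_best[t] = (score, best[prev][1] + [t])
        best = new_best
    return max(best.values())[1]


print(viterbi(["the", "dog", "walk"]))  # -> ['DT', 'NN', 'VB']
```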
Training the POS tagger:
Count tag bigrams and word–tag pairs in a tagged corpus, then convert the counts into the tag-sequence and lexical probabilities above.
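Written out, these are just maximum-likelihood estimates, i.e. counts normalised by counts over the tagged training corpus:
$
P\left(t_{i} \mid t_{i-1}\right)=\frac{C\left(t_{i-1}, t_{i}\right)}{C\left(t_{i-1}\right)}, \qquad P\left(w_{i} \mid t_{i}\right)=\frac{C\left(t_{i}, w_{i}\right)}{C\left(t_{i}\right)}
$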
![[Screenshot 2020-11-12 at 3.42.57 PM.jpg]]
In practice:
1. Maximise the overall tag-sequence probability (e.g. with Viterbi decoding, sketched above) rather than choosing each tag greedily.
2. Actual systems use trigrams; smoothing and backoff are critical (a simple add-one estimate is sketched after this list).
3. Unseen words are not in the lexicon, so allow all possible *open-class* tags, possibly restricted by morphology, e.g. for new words like *selfie* or *tweet*.
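As a simple illustration of the smoothing mentioned in point 2 (real systems typically use something stronger, such as backoff or interpolation), add-one (Laplace) smoothing of the transition estimates would be, with $T$ the number of distinct tags:
$
P\left(t_{i} \mid t_{i-1}\right)=\frac{C\left(t_{i-1}, t_{i}\right)+1}{C\left(t_{i-1}\right)+T}
$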
Evaluation of POS tagging:
1. The standard metric is per-token accuracy.
2. One tag per word is assumed (some systems give multiple tags when uncertain).
3. Accuracy is around 97-98% for English, making this one of the few things NLP can do with high confidence.
4. The baseline is about 90% accuracy (from unigram tagging).
## References
1. Jurafsky, D. and Martin, J. H., *Speech and Language Processing*, 3rd edition, Chapter 8.