# Part of Speech Tagging
POS tagging is useful because:
1. It is the first step of syntactic analysis (which in turn is often useful for semantic analysis).
2. POS taggers are simpler and usually faster than full syntactic parsers, and their output alone is sometimes enough to be useful:
- POS tags can be useful features, e.g. in text classification.
- POS tags can be useful for applications such as speech synthesis (e.g. whether *record* is a noun or a verb changes its pronunciation).
## Tagsets
A tagset is a standardized set of codes for fine-grained parts of speech. Two common tagsets:
1. CLAWS 5: over 60 tags
2. Penn Treebank part-of-speech tagset: 45 tags
![[Screenshot 2020-11-12 at 3.16.00 PM.jpg]]
## Ambiguity in POS tagging
POS tagging is a disambiguation task: many words are ambiguous between several parts of speech, and the goal is to find the correct tag for the given context.
Some of the most ambiguous frequent words are that, back, down, put and set; here are some examples of the 6 different parts of speech for the word back:
- earnings growth took a _back/JJ_ seat
- a small building in the _back/NN_
- a clear majority of senators _back/VBP_ the bill
- Dave began to _back/VB_ toward the door
- enable the country to buy _back/RP_ about debt
- I was twenty-one _back/RB_ then
The Brown corpus (1,000,000 word tokens) has 39,440 different word types.
- 35,340 word types (89.6%) have only one POS tag anywhere in the corpus
- 4,100 word types (10.4%) have between 2 and 7 POS tags
So why does just 10.4% POS-tag ambiguity by word type make tagging difficult? Because the ambiguous types include many of the most frequent words, so they account for roughly half of the tokens in running text.
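These figures can be roughly sanity-checked in a few lines of Python. A minimal sketch, assuming NLTK is installed and `nltk.download("brown")` has been run; NLTK's copy of the Brown corpus uses its own larger tagset, and the case-folding below is a simplification, so the exact percentages will not match the numbers above:
```python
from collections import defaultdict

from nltk.corpus import brown  # assumes nltk is installed and "brown" downloaded

tagged = brown.tagged_words()  # sequence of (word, tag) pairs
tags_per_type = defaultdict(set)
for word, tag in tagged:
    tags_per_type[word.lower()].add(tag)

# Types with more than one observed tag, and the tokens belonging to them
ambiguous_types = {w for w, t in tags_per_type.items() if len(t) > 1}
ambiguous_tokens = sum(1 for word, _ in tagged if word.lower() in ambiguous_types)

print(f"{len(ambiguous_types) / len(tags_per_type):.1%} of word types are ambiguous")
print(f"{ambiguous_tokens / len(tagged):.1%} of tokens are ambiguous")
```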
## Unigram POS Tagging
Just assign each word its most common tag from a training corpus. Surprisingly, this crude approach gives around 90% accuracy.
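A minimal sketch of such a unigram tagger, assuming training data is available as a list of (word, tag) pairs; the names and the fallback tag are illustrative, not from any particular library:
```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_words, default_tag="NN"):
    """Map each word to the tag it occurs with most often in training."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    most_common = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    # Words never seen in training fall back to a default tag.
    return lambda word: most_common.get(word, default_tag)


tag = train_unigram_tagger([("the", "DT"), ("back", "NN"), ("back", "RB"),
                            ("back", "NN"), ("seat", "NN")])
print(tag("back"), tag("the"), tag("selfie"))  # NN DT NN
```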
## HMM POS tagging
1. Start with untagged text.
2. Assign all possible tags to each word in the text on the basis of a lexicon that associates words and tags.
3. Find the most probable sequences (n-best sequences) of tags, based on probabilities from the training data.
- lexical probability: e.g., is _can_ most likely to be VM0, VVB, VVI or NN1?
- and tag sequence probabilities: e.g., is VM0 or NN1 more likely after PNP?
Estimate the tag sequence: the sequence of $n$ tags with the maximum probability, given the $n$ words:
$
\hat{t}_{1}^{n}=\underset{t_{1}^{n}}{\operatorname{argmax}} P\left(t_{1}^{n} \mid w_{1}^{n}\right)
$
By Bayes' theorem,
$
P\left(t_{1}^{n} \mid w_{1}^{n}\right)=\frac{P\left(w_{1}^{n} \mid t_{1}^{n}\right) P\left(t_{1}^{n}\right)}{P\left(w_{1}^{n}\right)}
$
but $P\left(w_{1}^{n}\right)$ is constant for any given word sequence, so it can be dropped from the argmax:
$
\hat{t}_{1}^{n}=\underset{t_{1}^{n}}{\operatorname{argmax}} P\left(w_{1}^{n} \mid t_{1}^{n}\right) P\left(t_{1}^{n}\right)
$
Bigram assumption: the probability of a tag depends only on the previous tag, so the tag-sequence probability is a product of bigram probabilities:
$
P\left(t_{1}^{n}\right) \approx \prod_{i=1}^{n} P\left(t_{i} \mid t_{i-1}\right)
$
The probability of a word is estimated on the basis of its own tag alone (independently of the surrounding words and tags):
$
P\left(w_{1}^{n} \mid t_{1}^{n}\right) \approx \prod_{i=1}^{n} P\left(w_{i} \mid t_{i}\right)
$
Hence:
$
\hat{t}_{1}^{n}=\underset{t_{1}^{n}}{\operatorname{argmax}} \prod_{i=1}^{n} P\left(w_{i} \mid t_{i}\right) P\left(t_{i} \mid t_{i-1}\right)
$
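This argmax is computed efficiently with the Viterbi algorithm. Below is a minimal sketch; the tagset and the probability tables are toy, illustrative values rather than anything trained on a corpus, and a real tagger would use the full tagset with properly smoothed probabilities:
```python
import math

# Toy tables: trans[prev][t] ~ P(t | prev), emit[t][w] ~ P(w | t);
# "<s>" is a start-of-sentence pseudo-tag. Values are illustrative only.
trans = {"<s>": {"DT": 0.6, "NN": 0.3, "VB": 0.1},
         "DT":  {"NN": 0.9, "VB": 0.05, "DT": 0.05},
         "NN":  {"VB": 0.5, "NN": 0.3, "DT": 0.2},
         "VB":  {"DT": 0.6, "NN": 0.3, "VB": 0.1}}
emit = {"DT": {"the": 0.7, "a": 0.3},
        "NN": {"dog": 0.4, "walk": 0.2},
        "VB": {"walk": 0.5, "dog": 0.1}}
tags = ["DT", "NN", "VB"]
TINY = 1e-12  # stand-in for unseen events; real systems smooth properly


def viterbi(words):
    """Return the most probable tag sequence under the bigram HMM."""
    # best[t] = (log-probability of the best path ending in tag t, that path)
    best = {t: (math.log(trans["<s>"].get(t, TINY))
                + math.log(emit[t].get(words[0], TINY)), [t]) for t in tags}
    for w in words[1:]:
        new_best = {}
        for t in tags:
            e = math.log(emit[t].get(w, TINY))
            score, prev = max((best[p][0] + math.log(trans[p].get(t, TINY)) + e, p)
                              for p in tags)
            new_best[t] = (score, best[prev][1] + [t])
        best = new_best
    return max(best.values())[1]


print(viterbi(["the", "dog", "walk"]))  # -> ['DT', 'NN', 'VB']
```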
Training the POS tagger:
Count tag bigrams and word–tag pairs in a tagged corpus, then convert the counts into the tag-sequence and lexical probabilities above.
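Written out, these are just maximum-likelihood estimates, i.e. counts normalised by counts over the tagged training corpus:
$
P\left(t_{i} \mid t_{i-1}\right)=\frac{C\left(t_{i-1}, t_{i}\right)}{C\left(t_{i-1}\right)}, \qquad P\left(w_{i} \mid t_{i}\right)=\frac{C\left(t_{i}, w_{i}\right)}{C\left(t_{i}\right)}
$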
![[Screenshot 2020-11-12 at 3.42.57 PM.jpg]]
In practice:
1. Maximise the overall tag-sequence probability (e.g. with Viterbi decoding, sketched above) rather than choosing each tag greedily.
2. Actual systems use trigrams; smoothing and backoff are critical (a simple add-one estimate is sketched after this list).
3. Unseen words are not in the lexicon, so allow all possible *open-class* tags, possibly restricted by morphology, e.g. for new words like *selfie* or *tweet*.
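As a simple illustration of the smoothing mentioned in point 2 (real systems typically use something stronger, such as backoff or interpolation), add-one (Laplace) smoothing of the transition estimates would be, with $T$ the number of distinct tags:
$
P\left(t_{i} \mid t_{i-1}\right)=\frac{C\left(t_{i-1}, t_{i}\right)+1}{C\left(t_{i-1}\right)+T}
$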
Evaluation of POS tagging:
1. The standard metric is per-token accuracy.
2. One tag per word is assumed (some systems give multiple tags when uncertain).
3. Accuracy is around 97-98% for English, making this one of the few things NLP can do with high confidence.
4. The baseline is about 90% accuracy (from unigram tagging).
## References
1. Jurafsky, D. and Martin, J. H., *Speech and Language Processing*, 3rd edition, Chapter 8.