# Count-based Distributional Models
[[Distributional Semantics]] models based on counts of word co-occurrences in a corpus.
## Defining context
1. Word windows (unfiltered): n words on either side of the lexical item. Example: n=2 (5-word window): The prime minister acknowledged the question. Context for minister: [ the 2, prime 1, acknowledged 1, question 0 ]
2. Word windows (filtered): n words on either side of the lexical item, removing some words (e.g. function words, some very frequent content words) via a stop-list or by POS-tag. Example: n=2 (5-word window), stop-list: The prime minister acknowledged the question. Context for minister: [ prime 1, acknowledged 1, question 0 ] (a small count-extraction sketch follows this list).
3. Lexeme windows (filtered or unfiltered): as above, but using stems. Example: n=2 (5-word window), stop-list: The prime minister acknowledged the question. Context for minister: [ prime 1, acknowledge 1, question 0 ]. Helps with sparsity.
4. Dependencies (directed links between heads and dependents): the context for a lexical item is the dependency structure it belongs to (various definitions). Example: The prime minister acknowledged the question.
    - minister [ prime_a 1, acknowledge_v 1 ]
    - minister [ prime_a_mod 1, acknowledge_v_subj 1 ]
    - minister [ prime_a 1, acknowledge_v+question_n 1 ]
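A minimal sketch of the filtered word-window count (definition 2 above), assuming a toy stop-list; the function name, window size, and stop-list are illustrative:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to"}  # toy stop-list (assumption, not from the notes)

def window_contexts(tokens, target, n=2, stop_words=STOP_WORDS):
    """Count words within n positions of each occurrence of `target`,
    skipping stop-listed words (filtered word-window model)."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        window = tokens[max(0, i - n):i] + tokens[i + 1:i + n + 1]
        counts.update(w for w in window if w not in stop_words)
    return counts

tokens = "the prime minister acknowledged the question".split()
print(window_contexts(tokens, "minister"))
# Counter({'prime': 1, 'acknowledged': 1})  -- 'question' lies outside the window, so its count stays 0
```

Dropping the stop-list (definition 1) would add `the 2` to the counts; stemming the tokens first gives the lexeme-window variant (definition 3).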
## Weighting context
1. Binary model: if context c co-occurs with word w, the value of vector $\vec{w}$ for dimension c is 1, 0 otherwise.
2. Basic frequency model: the value of vector $\vec{w}$ for dimension $c$ is the number of times that c co-occurs with w.
3. Characteristic model: weights given to the vector components express how characteristic a given context c is for word w.
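A common characteristic weighting is (positive) pointwise mutual information, which compares the observed co-occurrence probability with what independence would predict. A minimal sketch over a dense count matrix; the toy counts are invented for illustration:

```python
import numpy as np

def ppmi(counts):
    """Positive PMI: max(0, log2( P(w,c) / (P(w) * P(c)) )) per cell."""
    total = counts.sum()
    p_wc = counts / total                               # joint probabilities
    p_w = counts.sum(axis=1, keepdims=True) / total     # target-word marginals
    p_c = counts.sum(axis=0, keepdims=True) / total     # context marginals
    with np.errstate(divide="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))
    return np.maximum(pmi, 0.0)                         # clip negative associations to 0

# rows = target words, columns = contexts (toy counts)
counts = np.array([[10.0, 0.0, 3.0],
                   [ 2.0, 8.0, 1.0]])
print(ppmi(counts))
```

Frequent but uninformative contexts get low weights, while contexts that co-occur with w more often than chance predicts get high weights.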
## Defining semantic space
Entire vocabulary.
- (+) All information included - even rare contexts
- (-) Inefficient (100,000s dimensions). Noisy and sparse.
Top n words with highest frequencies.
- (+) More efficient (2000-10000 dimensions). Only 'real' words included.
- (-) May miss out on infrequent but relevant contexts.
Word frequencies are Zipfian-distributed.
![[zipfian.jpg]]
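Concretely, Zipf's law says the frequency of the word with rank $r$ decays roughly as a power law,
$$f(r) \propto \frac{1}{r^{\alpha}}, \quad \alpha \approx 1,$$
so a handful of very frequent (mostly function) words accounts for most tokens, while the long tail of content-word contexts is rare.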
[[Singular Value Decomposition (SVD)]]: the number of dimensions is reduced by exploiting redundancies in the data.
- (+) Very efficient (200-500 dimensions). Captures generalisations in the data.
- (-) SVD matrices are not interpretable.
[[Non-negative matrix factorization]] (NMF)
- Similar to SVD in spirit, but performs factorization differently.
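A minimal sketch of SVD-based dimensionality reduction with plain NumPy; the matrix, the choice of k, and the helper name are illustrative (libraries such as scikit-learn provide TruncatedSVD and NMF directly):

```python
import numpy as np

def svd_embeddings(matrix, k=2):
    """Reduce a |V_w| x |V_c| (weighted) co-occurrence matrix to k dimensions.
    Each row of the result is a low-dimensional word vector."""
    U, S, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k] * S[:k]      # keep the k largest singular components

rng = np.random.default_rng(0)
M = rng.random((5, 8))           # stand-in for a PPMI-weighted co-occurrence matrix
print(svd_embeddings(M, k=2).shape)   # (5, 2)
```

Each retained dimension mixes many original contexts, which is why the reduced space captures generalisations but its individual dimensions are hard to interpret.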
## Corpus Choice
As much data as possible?
- British National Corpus (BNC): 100M words (but balanced corpus)
- Wikipedia: 897M words
- UKWac: 2B words
In general preferable, but:
- More data is not necessarily the data you want.
- More data is not necessarily realistic from a psycholinguistic point of view: we perhaps encounter 50,000 words a day, so the BNC amounts to roughly 5 years' text exposure.
## Problems
Context vectors in distributional semantics
- tend to be high dimensional (depending on $\left|V_{c}\right|$ )
- tend to be sparse (many words do not occur in the context of a given target word)
- do not distinguish between left and right context
- do not distinguish between syntactic roles of context words
- Sparsity can be somewhat alleviated by (toy illustration below):
    - increasing size of training corpus
    - increasing window size
    - normalizing inflections (stemming, lemmatizing)
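A toy illustration of the last point: mapping inflected forms to a shared stem merges context dimensions, so counts accumulate instead of being split. The crude suffix-stripper below is purely illustrative, not a real stemmer:

```python
from collections import Counter

def toy_stem(word):
    """Strip a few common suffixes; a real system would use a proper stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

raw = Counter({"acknowledged": 1, "acknowledges": 1, "questions": 1})
stemmed = Counter()
for word, count in raw.items():
    stemmed[toy_stem(word)] += count

print(stemmed)   # Counter({'acknowledg': 2, 'question': 1}) -- fewer, denser dimensions
```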
---
## References
1. Chapter 6: Vector semantics and embeddings, Jurafsky and Martin (3rd edition).