# Count-based Distributional Models

[[Distributional Semantics]] models based on counts of words in a corpus.

## Defining context

1. Word windows (unfiltered): n words on either side of the lexical item. Example: n=2 (5-word window): *The prime minister acknowledged the question.* Context for *minister*: [ the 2, prime 1, acknowledged 1, question 0 ]
2. Word windows (filtered): n words on either side, removing some words (e.g. function words, some very frequent content words) via a stop-list or by POS-tag. Example: n=2 (5-word window), stop-list: *The prime minister acknowledged the question.* Context for *minister*: [ prime 1, acknowledged 1, question 0 ]
3. Lexeme windows (filtered or unfiltered): as above, but using stems; helps with sparsity. Example: n=2 (5-word window), stop-list: *The prime minister acknowledged the question.* Context for *minister*: [ prime 1, acknowledge 1, question 0 ]
4. Dependencies (directed links between heads and dependents): the context for a lexical item is the dependency structure it belongs to (various definitions). Example: *The prime minister acknowledged the question.*
	- minister [ prime_a 1, acknowledge_v 1 ]
	- minister [ prime_a_mod 1, acknowledge_v_subj 1 ]
	- minister [ prime_a 1, acknowledge_v+question_n 1 ]

A window-based context extractor is sketched at the end of this note.

## Weighting context

1. Binary model: if context $c$ co-occurs with word $w$, the value of vector $\vec{w}$ for dimension $c$ is 1, and 0 otherwise.
2. Basic frequency model: the value of vector $\vec{w}$ for dimension $c$ is the number of times $c$ co-occurs with $w$.
3. Characteristic model: weights given to the vector components express how characteristic a given context $c$ is for word $w$ (e.g. pointwise mutual information; see the PPMI sketch at the end of this note).

## Defining semantic space

Entire vocabulary:
- (+) All information included, even rare contexts.
- (-) Inefficient (100,000s of dimensions); noisy and sparse.

Top n words with the highest frequencies:
- (+) More efficient (2,000-10,000 dimensions); only 'real' words included.
- (-) May miss infrequent but relevant contexts.

Word frequencies are Zipfian distributed.
![[zipfian.jpg]]

[[Singular Value Decomposition (SVD)]]: the number of dimensions is reduced by exploiting redundancies in the data (see the truncated-SVD sketch at the end of this note).
- (+) Very efficient (200-500 dimensions); captures generalisations in the data.
- (-) The SVD dimensions are not interpretable.

[[Non-negative matrix factorization]] (NMF): similar to SVD in spirit, but performs the factorization differently.

## Corpus Choice

As much data as possible?
- British National Corpus (BNC): 100M words (but a balanced corpus)
- Wikipedia: 897M words
- UKWaC: 2B words

More data is generally preferable, but:
- More data is not necessarily the data you want.
- More data is not necessarily realistic from a psycholinguistic point of view: we perhaps encounter 50,000 words a day, so the BNC corresponds to roughly 5 years' text exposure.

## Problems

Context vectors in distributional semantics:
- tend to be high-dimensional (depending on $\left|V_{c}\right|$)
- tend to be sparse (many words do not occur in the context of a given target word)
- do not distinguish between left and right context
- do not distinguish between syntactic roles of context words

Sparsity can be somewhat alleviated by:
- increasing the size of the training corpus
- increasing the window size
- normalizing inflections (stemming, lemmatizing)

---

## References

1. Jurafsky and Martin (3rd edition), Chapter 6: Vector Semantics and Embeddings.
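## Code sketches

A minimal sketch of the word-window context definitions above (unfiltered vs. stop-list filtered). The stop-list and the helper name `window_contexts` are illustrative assumptions, not something defined in the note.

```python
from collections import Counter

# Toy stop-list for the filtered variant; a real model would use a fuller list or POS tags.
STOP_WORDS = {"the", "a", "an", "of"}

def window_contexts(tokens, target, n=2, filtered=False):
    """Count context words within n tokens on either side of each occurrence of target."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        window = tokens[max(0, i - n):i] + tokens[i + 1:i + 1 + n]
        for c in window:
            if filtered and c in STOP_WORDS:
                continue
            counts[c] += 1
    return counts

tokens = "the prime minister acknowledged the question".lower().split()
print(window_contexts(tokens, "minister"))                 # Counter({'the': 2, 'prime': 1, 'acknowledged': 1})
print(window_contexts(tokens, "minister", filtered=True))  # Counter({'prime': 1, 'acknowledged': 1})
```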
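The characteristic weighting model is commonly realised with positive pointwise mutual information (PPMI), as discussed in Jurafsky and Martin, Chapter 6. A sketch over a small word-by-context count matrix (the numbers are made up):

```python
import numpy as np

def ppmi(counts):
    """PPMI weighting of a word-by-context count matrix (rows = words, cols = contexts)."""
    total = counts.sum()
    p_wc = counts / total                              # joint probabilities P(w, c)
    p_w = counts.sum(axis=1, keepdims=True) / total    # word marginals P(w)
    p_c = counts.sum(axis=0, keepdims=True) / total    # context marginals P(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0                       # zero counts get weight 0
    return np.maximum(pmi, 0.0)                        # keep only positive associations

counts = np.array([[10.0, 0.0, 3.0],
                   [ 2.0, 8.0, 1.0]])
print(ppmi(counts))
```

PPMI keeps only contexts that co-occur with a word more often than chance would predict, which is one way of expressing how "characteristic" a context is.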
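Dimensionality reduction with SVD can be sketched directly in NumPy: keep only the top k singular dimensions of the (weighted) word-by-context matrix. The matrix below is random stand-in data; k=200 matches the range mentioned above.

```python
import numpy as np

def reduce_with_svd(matrix, k=200):
    """Truncated SVD: k-dimensional word vectors from a word-by-context matrix."""
    U, S, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k] * S[:k]          # scale the left singular vectors by the singular values

rng = np.random.default_rng(0)
weighted = rng.random((1000, 5000))  # stand-in for a PPMI-weighted count matrix
word_vectors = reduce_with_svd(weighted, k=200)
print(word_vectors.shape)            # (1000, 200)
```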