# Word2Vec
Paper: Mikolov et al. 2013. Efficient Estimation of Word Representations in Vector Space.
## CBOW model
- Task: Given a position $t$ in a sentence, the $n$ words occurring to its left $\left(\left\{w_{t-n}, \ldots, w_{t-1}\right\}\right)$ and the $m$ words to its right $\left(\left\{w_{t+1}, \ldots, w_{t+m}\right\}\right)$, predict the word in position $t$
- E.g.: "the man X the road", where X = ?
- Seemingly similar to $n$-gram language modeling, with $n =$ (LM order $- 1$) and $m = 0$
- Use feed-forward neural network
- Focus on learning embeddings themselves
- Simpler network (compared to [[Probabilistic Neural Network Language Models]])
- Bring embedding/projection layer closer to output
- Typically $n=m$, and $n \in\{2,5,10\}$
## Architecture
![[CBOW W2V.png]]
- No non-linearities
- One hidden layer: $h=\frac{1}{2 n} W w_{C}$, where
- W is a $|h| \times|V|$ matrix
- $w_{C}=\sum_{i=t-n, i \neq t}^{t+n} w_{i}$
- $w_{i}$ is a 1-hot vector for the word occurring in position $i$
- Output layer:
- $\hat{y}=\operatorname{softmax}\left(W^{\prime} h\right)$
- $W^{\prime}$ is a $|V| \times|h|$ matrix
- $W^{\prime}$ and $W$ are not (necessarily) shared, i.e., $W^{\prime} \neq W^{T}$
- Loss function: cross entropy (see PNLM)
- Trained with SGD
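A minimal numpy sketch of this forward pass, with illustrative sizes for $|V|$ and $|h|$ (all names and values below are assumptions for illustration, not a reference implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

V, H = 10_000, 300                              # illustrative |V| and |h|
W = rng.normal(scale=0.01, size=(H, V))         # projection matrix, |h| x |V|
W_prime = rng.normal(scale=0.01, size=(V, H))   # output matrix,     |V| x |h|

def cbow_forward(context_ids, n):
    """Predict the word at position t from its 2n context words (no non-linearity)."""
    # multiplying W by the sum of one-hot vectors == summing the selected columns of W
    h = W[:, context_ids].sum(axis=1) / (2 * n)   # hidden layer
    scores = W_prime @ h                          # unnormalized scores over the vocabulary
    scores -= scores.max()                        # for numerical stability
    return np.exp(scores) / np.exp(scores).sum()  # softmax -> y_hat

y_hat = cbow_forward(context_ids=[12, 7, 45, 3], n=2)   # e.g. n = m = 2
predicted_word_id = int(y_hat.argmax())
```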
### Embeddings
Where do the embeddings live?
- Column $i$ in $W(|h| \times|V|$ matrix) represents the embedding for word $i$
- Row $i$ in $W^{\prime}(|V| \times|h|$ matrix $)$ represents the embedding for word $i$
Which one of the two?
- Typically $W$ or
- $W_{s}=W^{T}+W^{\prime}$ (combining both into one)
## Skip-gram model
- Inspired by work on neural language models
- Train a neural network to predict neighbouring words
- Learn dense embeddings for the words in the training corpus in the process
![[word2vec-skipgram.jpg]]
Intuition: words with similar meanings often occur near each other in texts
Given a word $w_{t}$: predict each neighbouring word in a context window of $2L$ words around the current word.
For $L=2$, we predict its 4 neighbouring words:
$
\left[w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}\right]
$
We need to learn 2 parameter matrices, so that there are two embeddings (vectors) for each word $w_{j} \in V_{w}$:
- Word embedding $v$, in the word matrix $W$
- Context embedding $c$, in the context matrix $C$
![[word2vec-embeddings.jpg]]
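A small sketch of how (target, context) training pairs can be extracted from a tokenized corpus with $L=2$ (the toy corpus and function name are illustrative):

```python
def skipgram_pairs(tokens, L=2):
    """Collect (target, context) pairs within a window of L words on each side."""
    pairs = []
    for t, target in enumerate(tokens):
        for i in range(max(0, t - L), min(len(tokens), t + L + 1)):
            if i != t:
                pairs.append((target, tokens[i]))
    return pairs

corpus = "the quick brown fox jumps over the lazy dog".split()
print(skipgram_pairs(corpus, L=2)[:5])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ('quick', 'fox')]
```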
Walk through the corpus pointing at word $w(t)$, whose index in the vocabulary is $j$; we will call it $w_{j}$. Our goal is to predict $w(t+1)$, whose index in the vocabulary is $k$; we will call it $w_{k}$.
To do this, we need to compute
$
p\left(w_{k} \mid w_{j}\right)
$
Note that this is not a bigram probability: we do not care about the order of the words, just that they occur inside the context window.
To compute this probability we need the similarity between $w_{j}$ and $w_{k}$. Skip-gram measures similarity as the dot product between the target vector and the context vector (cosine similarity is just the normalized dot product). The idea is that similar vectors have a high dot product.
$
\text { Similarity }\left(c_{k}, v_{j}\right) \propto c_{k} \cdot v_{j}
$
![[word2vec-similarity.jpg]]
To turn the similarity into a probability, skip-gram uses the softmax:
$
p\left(w_{k} \mid w_{j}\right)=\frac{e^{c_{k} \cdot v_{j}}}{\sum_{i \in V} e^{c_{i} \cdot v_{j}}}
$
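A small numpy sketch of this computation, assuming $W$ and $C$ are $|V| \times d$ matrices whose rows are the target vectors $v_{j}$ and context vectors $c_{i}$ (sizes and indices below are illustrative):

```python
import numpy as np

def skipgram_prob(j, k, W, C):
    """p(w_k | w_j): softmax over the dot products c_i . v_j for all i in V."""
    scores = C @ W[j]                  # one score per vocabulary word
    scores -= scores.max()             # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[k]

rng = np.random.default_rng(0)
W = rng.normal(size=(5_000, 100))      # target vectors v_j as rows
C = rng.normal(size=(5_000, 100))      # context vectors c_i as rows
print(skipgram_prob(j=42, k=7, W=W, C=C))
```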
### Learning
Start with some initial embeddings (usually random). At training time, walk through the corpus and iteratively make the embeddings for each word more like the embeddings of its neighbours and less like the embeddings of other words.
Learn parameters $C$ and $W$ that maximize the overall corpus probability:
$
\begin{array}{c}
\arg \max_{C, W} \prod_{\left(w_{j}, w_{k}\right) \in D} p\left(w_{k} \mid w_{j}\right) \\
\text {where } p\left(w_{k} \mid w_{j}\right)=\frac{e^{c_{k} \cdot v_{j}}}{\sum_{i \in V} e^{c_{i} \cdot v_{j}}} \\
\text {i.e. } \arg \max_{C, W} \prod_{\left(w_{j}, w_{k}\right) \in D} \frac{e^{c_{k} \cdot v_{j}}}{\sum_{i \in V} e^{c_{i} \cdot v_{j}}}
\end{array}
$
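For illustration, the log of this corpus objective can be computed as below (assuming `pairs` holds the vocabulary indices $(j, k)$ of the pairs in $D$, and $W$, $C$ are as in the sketch above):

```python
import numpy as np

def corpus_log_likelihood(pairs, W, C):
    """Sum of log p(w_k | w_j) over all (j, k) pairs in D (log of the product above)."""
    total = 0.0
    for j, k in pairs:
        scores = C @ W[j]
        scores -= scores.max()                             # for numerical stability
        log_probs = scores - np.log(np.exp(scores).sum())  # log-softmax over the vocabulary
        total += log_probs[k]
    return total

# e.g. corpus_log_likelihood([(42, 7), (42, 13)], W, C)
```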
### Visualizing skip-gram as a network
![[word2vec-network.jpg]]
Computing the softmax over the entire vocabulary is expensive: the vocabulary is very large, which makes the model hard to train and update.
The proposed solution is negative sampling.
### Skip-gram with negative sampling
Problem with softmax: expensive to compute the denominator for the whole vocabulary
$
p\left(w_{k} \mid w_{j}\right)=\frac{e^{c_{k} \cdot v_{j}}}{\sum_{i \in V} e^{c_{i} \cdot v_{j}}}
$
Approximate the denominator: negative sampling
- At training time, walk through the corpus
- for each target word and positive (observed) context word
- sample $k$ noise words, i.e. negative samples
Then the objective becomes:
- Make the word similar to its context words
- And dissimilar to the $k$ negative samples
Convert the dataset into word pairs, i.e. positive (e.g. apricot-jam, apricot-tablespoon) and negative (e.g. apricot-cement, apricot-dear).
Then, instead of treating it as a multi-class problem (and returning a probability distribution over the whole vocabulary), return the probability that word $w_{k}$ is a valid context for word $w_{j}$:
$
\begin{array}{c}
P\left(+\mid w_{j}, w_{k}\right) \\
P\left(-\mid w_{j}, w_{k}\right)=1-P\left(+\mid w_{j}, w_{k}\right)
\end{array}
$
Model similarity as dot product
$
\text { Similarity }\left(c_{k}, v_{j}\right) \propto c_{k} \cdot v_{j}
$
Turn this into a probability using the sigmoid function:
$
\sigma(x)=\frac{1}{1+e^{-x}}
$
Then, the model boils down to
$
\begin{array}{c}
P\left(+\mid w_{j}, w_{k}\right)=\frac{1}{1+e^{-c_{k} \cdot v_{j}}} \\
P\left(-\mid w_{j}, w_{k}\right)=1-P\left(+\mid w_{j}, w_{k}\right)=1-\frac{1}{1+e^{-c_{k} \cdot v_{j}}}=\frac{1}{1+e^{c_{k} \cdot v_{j}}}
\end{array}
$
And the objective boils down to
$
\begin{array}{c}
\arg \max \prod_{\left(w_{j}, w_{k}\right) \in D_{+}} P\left(+\mid w_{j}, w_{k}\right) \prod_{\left(w_{j}, w_{k}\right) \in D_{-}} P\left(-\mid w_{j}, w_{k}\right)= \\
\arg \max \sum_{\left(w_{j}, w_{k}\right) \in D_{+}} \log P\left(+\mid w_{j}, w_{k}\right)+\sum_{\left(w_{j}, w_{k}\right) \in D_{-}} \log P\left(-\mid w_{j}, w_{k}\right)= \\
\arg \max \sum_{\left(w_{j}, w_{k}\right) \in D_{+}} \log \frac{1}{1+e^{-c_{k} \cdot v_{j}}}+\sum_{\left(w_{j}, w_{k}\right) \in D_{-}} \log \frac{1}{1+e^{c_{k} \cdot v_{j}}}
\end{array}
$
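A minimal sketch of one SGD step on this objective for a single target word $w_{j}$, one positive context and a set of pre-drawn negative contexts, following the standard gradient derivation (e.g. Jurafsky and Martin, ch. 6); `W` and `C` are assumed to be $|V| \times d$ numpy arrays:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(j, k_pos, k_negs, W, C, lr=0.025):
    """One SGD update for target word j, positive context k_pos, negative contexts k_negs."""
    v = W[j].copy()
    grad_v = np.zeros_like(v)

    # positive pair: push sigma(c_pos . v) towards 1
    g = sigmoid(np.dot(C[k_pos], v)) - 1.0
    grad_v += g * C[k_pos]
    C[k_pos] -= lr * g * v

    # negative pairs: push sigma(c_neg . v) towards 0
    for k in k_negs:
        g = sigmoid(np.dot(C[k], v))
        grad_v += g * C[k]
        C[k] -= lr * g * v

    W[j] -= lr * grad_v
```

Here `k_negs` is assumed to have been drawn from a noise distribution (see Practical Considerations below).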
The matrix $W$ contains our final word embeddings.
## Practical Considerations
- Skip-Gram:
- For each observed occurrence $(w, c)$ add 5-20 negative samples to data
- Draw negative samples from the unigram distribution $P(w)$
- Scale the unigram distribution to $P(w)^{0.75}$ to give rarer words more weight (see the sketch after this list)
- Context size typically around $2-5$
- The more data, the smaller the context and the negative sample set can be
- Exclude very rare words (fewer than 10 occurrences)
- Removing stop words: better topical modeling, less sensitive to syntactic patterns
- Context size has a significant effect on what information is encoded in embeddings
- Small context size: more sensitive to syntactic role of target word
- Large context size: more sensitive to semantic, topical role of target word
- Similar to [[Count-based Distributional Models]], context words can be annotated with syntactic information
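A small sketch of the smoothed unigram distribution used to draw negative samples, as described above (the counts are illustrative):

```python
import numpy as np

def negative_sampling_dist(counts, alpha=0.75):
    """Smoothed unigram distribution P(w)^alpha / Z for drawing negative samples."""
    probs = np.asarray(counts, dtype=float) ** alpha
    return probs / probs.sum()

counts = [1000, 100, 10]                           # illustrative word frequencies
p = negative_sampling_dist(counts)
print(p)                                           # rarer words get relatively more mass

rng = np.random.default_rng(0)
negatives = rng.choice(len(counts), size=5, p=p)   # draw 5 negative sample ids
```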
## Properties of embeddings
1. They capture similarity.
![[word2vec-examples.jpg]]
2. They capture analogy.
Analogy task: $a$ is to $b$ as $c$ is to $d$
The system is given words $a, b, c,$ and it needs to find $d$.
Examples:
- "apple" is to "apples" as "car" is to ?
- "man" is to "woman" as "king" is to ?
Solution: capture analogies via vector offsets, which can encode relations like gender, number, etc.
$
\begin{array}{c}
a-b \approx c-d \\
\text {man}-\text {woman} \approx \text {king}-\text {queen} \\
d=\underset{d^{\prime} \in V}{\operatorname{argmax}} \cos \left(a-b, c-d^{\prime}\right)
\end{array}
$
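A small sketch of solving the analogy with the offset formulation above; the `embeddings` dict (word to vector) is assumed, and the query words are excluded from the search, as is common practice:

```python
import numpy as np

def cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

def solve_analogy(a, b, c, embeddings):
    """'a is to b as c is to d': pick d maximizing cos(a - b, c - d)."""
    offset = embeddings[a] - embeddings[b]
    best, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if word in (a, b, c):                      # exclude the query words themselves
            continue
        sim = cos(offset, embeddings[c] - vec)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# e.g. solve_analogy("man", "woman", "king", embeddings) should ideally return "queen"
```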
3. They capture a range of semantic relations.
![[word2vec-semantic.jpg]]
## Evaluation of Word embeddings
- Intrinsic
- Visualization, but take it with a grain of salt
- Similarity tasks
- Rank a list of word pairs by similarity, e.g. (car, bicycle)
- Spearman correlation with human judgements (see the sketch after this list)
- Analogy task
- Paris is to France as Berlin is to X
- Evaluated by accuracy
- Arithmetic magic
- Extrinsic
- Use in end-to-end tasks
- Use in [[Information Retrieval]] tasks
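A sketch of the similarity-based intrinsic evaluation mentioned above, assuming `scipy` is available and `pairs`/`human_scores` come from a similarity dataset:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

def similarity_eval(pairs, human_scores, embeddings):
    """Spearman correlation between model cosine similarities and human judgements."""
    model_scores = [cosine(embeddings[w1], embeddings[w2]) for w1, w2 in pairs]
    rho, _ = spearmanr(model_scores, human_scores)
    return rho

# pairs = [("car", "bicycle"), ("cup", "coffee"), ...]   # e.g. a dataset like WordSim-353
# print(similarity_eval(pairs, human_scores, embeddings))
```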
## Limitations of Word Embeddings
- Sensitive to morphological differences (buy/buys)
- Not necessarily consistent across languages
- Embedding dimensions are not interpretable
- Frequency-biased, e.g., gender bias
- Can mix up antonyms (hot/cold)
- Words have different meanings in different contexts, so a one-to-one word-to-vector mapping doesn't always make sense
### Word embeddings in practice
Word2vec is often used for pretraining in other tasks.
- It will help your models start from an informed position
- Requires only plain text - which we have a lot of
- Is very fast and easy to use
- Pretrained vectors are also available (trained on 100B words)
However, for best performance it is important to continue training, fine-tuning the embeddings for a specific task.
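For example, one common way to use pretrained vectors is via `gensim` (assuming it is installed and the pretrained vectors have been downloaded; the file path is illustrative):

```python
from gensim.models import KeyedVectors

# Path is illustrative; the pretrained GoogleNews vectors must be downloaded separately.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```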
### Count-based models vs. skip-gram word embeddings
[[Count-based Distributional Models]] and Word2Vec type models are two approaches to [[Distributional Semantics]].
Baroni et al. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors.
The paper compares count-based and neural word vectors on 5 types of tasks and 14 different datasets:
1. Semantic relatedness
2. Synonym detection
3. Concept categorization
4. Selectional preferences
5. Analogy recovery
Results:
![[count-vs-skipgram.jpg]]
Some of these findings were later disputed by Levy et al. 2015. Improving Distributional Similarity with Lessons Learned from Word Embeddings.
---
## References
1. Chapter 6: Vector semantics and embeddings, Jurafsky and Martin (3rd edition).