# Word embeddings

## Semantics with dense vectors

Two approaches to distributional semantics:

1. [[Count-based Distributional Models]]
    - Explicit vectors: dimensions are elements in the context
    - Long, sparse vectors with interpretable dimensions
2. Prediction-based models like [[Word2Vec]]
    - Train a model to predict the context of a word, learning a word representation in the process.
    - Produces short, dense vectors with latent dimensions, so they are not interpretable.
    - low dimensional (e.g., 100 vs. $\left|V_{c}\right|$)
    - dense (no zeros)
    - continuous ($c_{w} \in \mathbb{R}^{m}$)
    - learned by performing a task (predict; see the toy sketch at the end of this note)

Why dense and continuous vectors?

1. Easier to use as features in machine learning (fewer weights to tune)
2. May generalize better than storing explicit counts
3. May do better at capturing [[Distributional Semantics#Synonymy]]:
    - e.g. car and automobile are distinct dimensions in count-based models
    - so such models will not capture the similarity between a word with car as a neighbour and a word with automobile as a neighbour

---

## References

1. Chapter 6: Vector semantics and embeddings, Jurafsky and Martin, *Speech and Language Processing* (3rd edition).
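---

A minimal sketch of the "learn representations by predicting context" idea, using skip-gram with negative sampling in plain NumPy. The toy corpus, vector dimension, window size, number of negative samples, learning rate, and epoch count are all assumptions made for illustration, not values from the chapter; a real setup would train [[Word2Vec]] on a large corpus.

```python
# Toy skip-gram with negative sampling (illustrative sketch, not a reference implementation).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy corpus: "car" and "automobile" appear in similar contexts.
corpus = ("the car drove down the road "
          "the automobile drove down the street").split()

vocab = sorted(set(corpus))
word2id = {w: i for i, w in enumerate(vocab)}
ids = [word2id[w] for w in corpus]

V, dim, window, neg_k, lr = len(vocab), 10, 2, 3, 0.05

# Separate dense embedding matrices for target and context words.
W_target = rng.normal(scale=0.1, size=(V, dim))
W_context = rng.normal(scale=0.1, size=(V, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(200):
    for pos, target in enumerate(ids):
        lo, hi = max(0, pos - window), min(len(ids), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos == pos:
                continue
            # One observed (positive) context word plus a few random negatives
            # (negatives are drawn uniformly here just to keep the sketch short).
            samples = [(ids[ctx_pos], 1.0)] + [
                (int(rng.integers(V)), 0.0) for _ in range(neg_k)
            ]
            for c, label in samples:
                score = sigmoid(W_target[target] @ W_context[c])
                grad = score - label          # d(log-loss)/d(dot product)
                g_target = grad * W_context[c]
                g_context = grad * W_target[target]
                W_target[target] -= lr * g_target
                W_context[c] -= lr * g_context

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Words sharing contexts should end up with nearby dense vectors.
print(cosine(W_target[word2id["car"]], W_target[word2id["automobile"]]))
```

The point of the sketch: nothing is counted explicitly; the vectors are only adjusted so that a word scores higher with its observed context words than with random negatives, which is how the dense, latent dimensions end up encoding distributional similarity.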