# Probabilistic Neural Network Language Models

- Count-based language models use discrete parameters as elements of the event space
    - The current word+history n-gram has or has not been seen during training (binary decision)
    - Smoothing results in a more relaxed matching criterion
- Probabilistic Neural Network LMs (Bengio et al., JMLR 2003) use a distributed real-valued representation of words and contexts
    - Each word in the vocabulary is mapped to an $m$-dimensional real-valued vector $C(w) \in \mathbb{R}^{m}$; typical values for $m$ are $50, 100, 150$
    - A hidden layer captures the contextual dependencies between words in an n-gram
    - The output layer is a $|V|$-dimensional vector describing the probability distribution $p\left(w_{i} \mid w_{i-n+1}^{i-1}\right)$

## Architecture

![[PNNLM.png]]

**Layer 1 (projection layer):** $C\left(w_{t-i}\right)=C\,w_{t-i}$, where
- $w_{t-i}$ is a $|V|$-dimensional one-hot vector, i.e., a zero vector in which only the index corresponding to the word occurring at position $t-i$ is 1
- $C$ is an $m \times |V|$ matrix

**Layer 2 (context layer):** $h=\tanh(d+Hx)$, where
- $x=\left[C\left(w_{t-n+1}\right); \ldots; C\left(w_{t-1}\right)\right]$  ($[\cdot\,;\cdot]$ = vector concatenation)
- $H$ is a $d_h \times (n-1)m$ matrix, with $d_h$ the size of the hidden layer $h$

**Layer 3 (output layer):** $\hat{y}=\operatorname{softmax}(b+Uh)$, where
- $U$ is a $|V| \times d_h$ matrix
- $\operatorname{softmax}(v)_i=\frac{\exp\left(v_{i}\right)}{\sum_{j} \exp\left(v_{j}\right)}$ (turns activations into probabilities)

**Optional: skip-layer connections:** $\hat{y}=\operatorname{softmax}(b+Wx+Uh)$, where
- $W$ is a $|V| \times (n-1)m$ matrix (skipping the non-linear context layer)

**Training:** use the cross-entropy loss and update all parameters, including the projections $C$.

## Projection matrix $C$

- $C$ maps discrete words to continuous, low-dimensional vectors
- $C$ is shared across all contexts
- $C$ is position-independent
- If $C(\text{white}) \approx C(\text{red})$, then $p(\text{drives} \mid \text{a white car}) \approx p(\text{drives} \mid \text{a red car})$

## Advantages

- NPLMs outperform n-gram based language models (in terms of perplexity)
- They use a limited amount of memory
    - NPLM: $\sim 100\mathrm{M}$ floats $\approx 400$ MB RAM
    - n-gram model: $\sim 50$-$100$ GB RAM

## Disadvantages

- Computationally expensive
    - Mostly due to the large output layer (size of the vocabulary): $Uh$ can involve hundreds of millions of operations!
    - We want to know $p(w \mid c)$ for a specific $w$, but to do so we need a softmax over the entire output layer
    - Quick fix: use short-lists, i.e., ignore rare words and map them to UNK tokens
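Before moving on to self-normalization, here is a minimal NumPy sketch of the forward pass described under Architecture; the full softmax over $|V|$ scores in the last step is exactly the expensive operation discussed above. All sizes (`V`, `m`, `d_h`, `n`), the random initialization, and the example word indices are placeholder assumptions, and training is omitted.

```python
import numpy as np

# Illustrative sizes: vocab, embedding dim, hidden size, n-gram order (not from the notes).
V, m, d_h, n = 10_000, 100, 60, 4

rng = np.random.default_rng(0)
C = rng.normal(scale=0.01, size=(m, V))              # projection matrix (shared, position-independent)
H = rng.normal(scale=0.01, size=(d_h, (n - 1) * m))  # context-layer weights
d = np.zeros(d_h)                                    # context-layer bias
U = rng.normal(scale=0.01, size=(V, d_h))            # output-layer weights
b = np.zeros(V)                                      # output-layer bias
W = rng.normal(scale=0.01, size=(V, (n - 1) * m))    # optional skip-layer weights

def softmax(v):
    v = v - v.max()                  # subtract max for numerical stability
    e = np.exp(v)
    return e / e.sum()

def forward(context_ids, skip_layer=False):
    """Return p(w | w_{t-n+1}, ..., w_{t-1}) for all w in V.

    context_ids: list of n-1 word indices (the history).
    """
    # Layer 1: column lookup C[:, w] equals multiplying C with the one-hot vector of w.
    x = np.concatenate([C[:, w] for w in context_ids])   # shape ((n-1)*m,)
    # Layer 2: non-linear context layer.
    h = np.tanh(d + H @ x)
    # Layer 3: scores for every word in the vocabulary, then softmax.
    scores = b + U @ h
    if skip_layer:
        scores = scores + W @ x      # optional skip-layer connection
    return softmax(scores)

probs = forward([12, 7, 301])        # arbitrary example word indices
print(probs.shape, probs.sum())      # (10000,) 1.0
```

Note that the lookup `C[:, w]` is exactly the product of $C$ with a one-hot vector, which is why the projection layer is in practice just an embedding table.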
## Self-normalization

- During inference (i.e., when applying a trained model to unseen data) we are interested in $p(w \mid c)$ and not in $p\left(w^{\prime} \mid c\right)$ for $w^{\prime} \neq w$
- Unfortunately $b+Uh$ does not yield probabilities, and the softmax requires a summation over the entire output layer
- Idea: 'encourage' the neural network to produce probability-like values without applying the softmax (Devlin et al., ACL 2014)
- Softmax log-likelihood:

$$\log P(x)=\log \left(\frac{\exp \left(U_{r}(x)\right)}{Z(x)}\right)$$

where
- $U_{r}(x)$ is the output-layer score for $x$
- $Z(x)=\sum_{r^{\prime}=1}^{|V|} \exp \left(U_{r^{\prime}}(x)\right)$

so that

$$\log P(x)=U_{r}(x)-\log Z(x)$$

- If we could ensure that $\log Z(x)=0$, then we could use $U_{r}(x)$ directly
- Strictly speaking this is not possible, but we can encourage the model by augmenting the loss function:

$$L=\sum_{i}\left[-\log P\left(x_{i}\right)+\alpha\left(\log Z\left(x_{i}\right)\right)^{2}\right]$$

- Self-normalization is only included during training; at inference time, $\log P(x) \approx U_{r}(x)$
- $\alpha$ regulates the importance of normalization (hyper-parameter)
- Initialize the output-layer bias to $\log (1 /|V|)$
- Devlin et al. report speed-ups of around 15x during inference
- No speed-up during training
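As a worked illustration of the augmented objective, the sketch below computes $L=\sum_{i}\left[-\log P(x_i)+\alpha\left(\log Z(x_i)\right)^{2}\right]$ from raw output-layer scores and shows the inference-time shortcut $\log P(x) \approx U_r(x)$. The batch data, the function names, and the value of `alpha` are made up for the example; this is not the original Devlin et al. implementation.

```python
import numpy as np

def self_normalized_loss(scores, targets, alpha=0.1):
    """Cross-entropy augmented with the (log Z)^2 penalty.

    scores:  (batch, |V|) raw output-layer activations b + U h
    targets: (batch,) indices of the correct next words
    alpha:   weight of the normalization penalty (illustrative value)
    """
    # log Z(x) for every example, computed stably via the max trick.
    max_s = scores.max(axis=1, keepdims=True)
    log_Z = (max_s + np.log(np.exp(scores - max_s).sum(axis=1, keepdims=True))).squeeze(1)
    # log P(x) = U_r(x) - log Z(x) for the target word r.
    target_scores = scores[np.arange(len(targets)), targets]
    log_p = target_scores - log_Z
    # L = sum_i [ -log P(x_i) + alpha * (log Z(x_i))^2 ]
    return np.sum(-log_p + alpha * log_Z ** 2)

def inference_log_prob(scores, target):
    """At inference time the penalty has pushed log Z(x) towards 0,
    so the raw target score approximates log P(x) without a softmax."""
    # scores: (|V|,) raw output-layer activations for a single context
    return scores[target]

# Toy usage with random scores for a vocabulary of 1000 words.
rng = np.random.default_rng(0)
scores = rng.normal(size=(8, 1000))
targets = rng.integers(0, 1000, size=8)
print(self_normalized_loss(scores, targets, alpha=0.1))
print(inference_log_prob(scores[0], targets[0]))
```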