# Probabilistic Neural Network Language Models
- Count-based language models operate over a discrete event space: every word+history combination is a separate parameter
- The current word+history n-gram either has or has not been seen during training (a binary decision)
- Smoothing results in a more relaxed matching criterion
- Probabilistic Neural Network LMs (Bengio et al. JMLR 2003) use a distributed real-valued representation of words and contexts
- Each word in the vocabulary is mapped to an $m$-dimensional real-valued vector
- $C(w) \in \mathbb{R}^{m}$, typical values for $m$ are $50,100,150$
- A hidden layer captures the contextual dependencies between the words in an n-gram
- The output layer is a $|V|$-dimensional vector describing the probability distribution of $p\left(w_{i} \mid w_{i-n+1}^{i-1}\right)$
## Architecture
![[PNNLM.png]]
Layer-1 (projection layer)
$C\left(w_{t-i}\right)=C w_{t-i}$
where
- $w_{t-i}$ is a $V$-dimensional 1-hot vector, i.e., a zero vector where only the index corresponding to the word occurring at position $t-i$ is 1
- $C$ is an $m \times V$ matrix
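A minimal numpy sketch of the projection step, with made-up toy sizes ($V=5$, $m=3$): multiplying $C$ by a 1-hot vector just selects a column of $C$, which is why the projection is implemented as an embedding lookup in practice.

```python
import numpy as np

V, m = 5, 3                              # toy sizes, not from the slides
rng = np.random.default_rng(0)
C = rng.standard_normal((m, V))          # projection matrix, m x V

word_index = 2                           # index of w_{t-i} in the vocabulary
one_hot = np.zeros(V)
one_hot[word_index] = 1.0

projection = C @ one_hot                 # C w_{t-i}
assert np.allclose(projection, C[:, word_index])   # identical to selecting column `word_index`
```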
Layer-2 (context layer)
$h=\tanh (d+H x)$
where
- $x=\left[C\left(w_{t-n+1}\right) ; \ldots ; C\left(w_{t-1}\right)\right] \quad([\cdot ; \cdot]=$ vector concatenation)
- $H$ is a $d_h \times (n-1)m$ matrix, where $d_h$ is the number of hidden units (the dimensionality of $h$)
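Continuing the numpy sketch for the context layer (again with made-up sizes; $d_h$ denotes the number of hidden units): the projections of the $n-1$ context words are concatenated and passed through the tanh non-linearity.

```python
import numpy as np

V, m, n, d_h = 5, 3, 3, 4                       # toy sizes: n-gram order 3, i.e. 2 context words
rng = np.random.default_rng(0)
C = rng.standard_normal((m, V))                 # projection matrix

context = [1, 4]                                # indices of w_{t-n+1}, ..., w_{t-1}
x = np.concatenate([C[:, w] for w in context])  # [C(w_{t-n+1}); ...; C(w_{t-1})], length (n-1)*m

d = np.zeros(d_h)                               # hidden bias
H = rng.standard_normal((d_h, (n - 1) * m))     # d_h x (n-1)m weight matrix
h = np.tanh(d + H @ x)                          # context-layer activations
```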
Layer-3 (output layer)
$\hat{y}=\operatorname{softmax}(b+U h)$
where
- $U$ is a $V \times d_h$ matrix
- $\operatorname{softmax}(v)_i=\frac{\exp \left(v_{i}\right)}{\sum_{j} \exp \left(v_{j}\right)}$ (turns activations into probabilities)
Optional: skip-layer connections
$\hat{y}=\operatorname{softmax}(b+W x+U h)$ where
- $W$ is a $V \times(n-1) m$ matrix (skipping the non-linear context layer)
Training: minimize the cross-entropy loss and update all parameters, including the projection matrix $C$
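Putting the three layers together: a minimal PyTorch sketch of the architecture, including the optional skip-layer connection and one cross-entropy training step. All sizes and names (`NPLM`, `vocab_size`, `context_words`, ...) are illustrative assumptions, not taken from the original.

```python
import torch
import torch.nn as nn

class NPLM(nn.Module):
    """Bengio-style neural probabilistic LM (all sizes below are illustrative)."""
    def __init__(self, vocab_size, m=100, context_words=3, d_h=500, skip=True):
        super().__init__()
        self.C = nn.Embedding(vocab_size, m)                   # projection layer C
        self.hidden = nn.Linear(context_words * m, d_h)        # d + Hx
        self.out = nn.Linear(d_h, vocab_size)                  # b + Uh
        # optional skip-layer connection Wx
        self.skip = nn.Linear(context_words * m, vocab_size, bias=False) if skip else None

    def forward(self, context):                                # context: (batch, context_words) word ids
        x = self.C(context).flatten(1)                         # concatenated projections
        h = torch.tanh(self.hidden(x))                         # context layer
        logits = self.out(h)
        if self.skip is not None:
            logits = logits + self.skip(x)
        return logits                                          # softmax is folded into the loss

# One illustrative training step with random data.
model = NPLM(vocab_size=10_000)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
context = torch.randint(0, 10_000, (32, 3))                    # fake batch of 3-word histories
target = torch.randint(0, 10_000, (32,))                       # fake next words
loss = nn.functional.cross_entropy(model(context), target)     # = -log p(w_t | history)
loss.backward()
opt.step()
```

The backward pass updates `model.C` together with all other weights, so the word projections are learned jointly with the rest of the network.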
## Projection
The projection matrix $C$:
- maps discrete words to continuous, low-dimensional vectors
- is shared across all contexts
- is position-independent
- if $C(\text{white}) \approx C(\text{red})$ then $p(\text{drives} \mid \text{a white car}) \approx p(\text{drives} \mid \text{a red car})$ (see the sketch below)
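A tiny numpy illustration of the last point (random weights, made-up vocabulary): if the columns of $C$ for "white" and "red" nearly coincide, the two contexts produce nearly identical output distributions, because every layer is a continuous function of the concatenated projections.

```python
import numpy as np

rng = np.random.default_rng(0)
V, m, d_h = 6, 3, 4                                 # toy sizes (assumptions)
vocab = {"a": 0, "white": 1, "red": 2, "car": 3, "drives": 4, "<unk>": 5}

C = rng.standard_normal((m, V))
C[:, vocab["red"]] = C[:, vocab["white"]] + 1e-3    # force C(white) ~= C(red)
H = rng.standard_normal((d_h, 3 * m)); d = np.zeros(d_h)
U = rng.standard_normal((V, d_h)); b = np.zeros(V)

def p_next(words):
    x = np.concatenate([C[:, vocab[w]] for w in words])
    h = np.tanh(d + H @ x)
    z = b + U @ h
    return np.exp(z) / np.exp(z).sum()

p_white = p_next(["a", "white", "car"])
p_red = p_next(["a", "red", "car"])
print(abs(p_white - p_red).max())                   # tiny: the two distributions nearly coincide
```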
## Advantages
- PNLMs outperform n-gram based language models (in terms of perplexity)
- Use a limited amount of memory (rough parameter count sketched below)
- PNLM: $\sim$100M floats $\approx$ 400 MB RAM
- n-gram model: $\sim$50-100 GB RAM
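A rough back-of-the-envelope count behind these numbers, under assumed sizes ($|V|$ = 100k, $m$ = 100, 3-word history, 500 hidden units, skip connections); at 4 bytes per float this lands near the quoted $\sim$100M floats / 400 MB.

```python
# All sizes are illustrative assumptions, not from the original slides.
V, m, ctx, d_h = 100_000, 100, 3, 500

params = (
    V * m              # projection matrix C
    + d_h * ctx * m    # hidden weights H (biases are negligible)
    + V * d_h          # output weights U
    + V * ctx * m      # optional skip weights W
)
print(f"~{params/1e6:.0f}M floats, ~{params*4/1e6:.0f} MB at 4 bytes per float")
```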
## Disadvantages
- Computationally expensive
- Mostly due to large output layer (size of vocabulary): $U h$ can involve hundreds of millions of operations!
- We want to know $p(w \mid c)$ for a specific $w$, but computing it requires a softmax over the entire output layer
- Quick fix: use short-lists, i.e., ignore rare words and map them to unk tokens (sketched below)
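A minimal sketch of the short-list quick fix (the threshold and names are made up): keep only the most frequent words in the output vocabulary and replace everything else with an unk token.

```python
from collections import Counter

def apply_shortlist(tokens, size=20_000):
    """Keep the `size` most frequent words; map all other (rare) words to <unk>."""
    shortlist = {w for w, _ in Counter(tokens).most_common(size)}
    return [w if w in shortlist else "<unk>" for w in tokens]

# e.g. apply_shortlist(corpus_tokens) shrinks the output layer to size + 1 classes
```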
## Self-normalization
- During inference (i.e., when applying a trained model to unseen data) we are interested in $p(w \mid c)$ and not $p\left(w^{\prime} \mid c\right)$, where $w^{\prime} \neq w$
- Unfortunately $b+U h$ does not yield probabilities and softmax requires summation over the entire output layer
- 'Encourage' the neural network to produce probability-like values (Devlin et al., ACL-2014) without applying softmax
- Softmax log likelihood:
$
\log (P(x))=\log \left(\frac{\exp \left(U_{r}(x)\right)}{Z(x)}\right)
$
where
- $U_{r}(x)$ is the output-layer score of the correct word $r$ for input $x$
- $Z(x)=\sum_{r^{\prime}=1}^{|V|} \exp \left(U_{r^{\prime}}(x)\right)$
$\log (P(x))=U_{r}(x)-\log (Z(x))$
- If we could ensure that $\log (Z(x))=0$ then we could use $U_{r}(x)$ directly
- Strictly speaking this is not possible, but we can encourage the model to keep $\log (Z(x))$ close to $0$ by augmenting the loss function:
$
L=\sum_{i}\left[-\log \left(P\left(x_{i}\right)\right)+\alpha\left(\log Z\left(x_{i}\right)\right)^{2}\right]
$
- Self-normalization is only enforced during training; at inference we use $\log (P(x)) \approx U_{r}(x)$ directly (see the sketch at the end of this section)
- $\alpha$ regulates the importance of the normalization term (a hyper-parameter)
- Initializing the output-layer bias to $\log (1 /|V|)$ makes the network approximately self-normalized at the start of training
- Devlin et al. report speed-ups of around $15 x$ during inference
- No speed-up during training
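A minimal PyTorch sketch of the self-normalized objective described above: cross-entropy plus the $\alpha(\log Z(x))^{2}$ penalty, with the output bias initialized to $\log(1/|V|)$; at inference the raw score $U_r(x)$ is read off without a softmax. Sizes and names are illustrative, and this follows the idea in the notes rather than Devlin et al.'s exact implementation.

```python
import math
import torch
import torch.nn as nn

V, d_h, alpha = 10_000, 500, 0.1                      # illustrative sizes; alpha is a hyper-parameter

out = nn.Linear(d_h, V)                               # output layer: b + Uh
nn.init.constant_(out.bias, math.log(1.0 / V))        # start roughly self-normalized: Z(x) ~= 1

h = torch.randn(32, d_h)                              # fake context-layer activations
target = torch.randint(0, V, (32,))                   # fake next words

logits = out(h)                                       # U_r(x) for every r
log_Z = torch.logsumexp(logits, dim=-1)               # log Z(x)
nll = nn.functional.cross_entropy(logits, target)     # -log P(x), averaged over the batch
loss = nll + alpha * (log_Z ** 2).mean()              # encourage log Z(x) ~= 0 during training

# Inference: skip the softmax entirely and read off the raw score for the word of interest.
log_p_approx = logits[torch.arange(32), target]       # ~ log P(x) once the model self-normalizes
```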
---
## References
1. DL4NLP Course, UvA 2021