# Recurrent Neural Networks (RNN)
## Characteristics of sequential data
Sequences are everywhere - videos, stock exchange data, climate measurements, market analysis, speech, music, user behaviour, computer games, etc.
Sequence -> data are no longer i.i.d.
- Each data point depends on the previous ones, and the data distribution changes over time
- How to model temporal dependencies, especially in high-dim complex data?
Sequences might be of arbitrary length
- How to model dependencies at different temporal scales?
- How to make a model that works for 10 sec and 10 min sequences?
Cycles in our computational graph
- How to learn with cycles?
Unclear how to assign credit to past time steps
- Which of the infinite past time steps is responsible for this observed present?
Chaotic behavior
- How do we account for all possible effects of small changes in past sequence?
Temporal redundancy
- In temporally dense series there is lots of redundancy. Curse or blessing?
## Inductive bias for sequences
$
\begin{array}{ll}
\hline \text { Challenges } & \text { Inductive bias } \\
\hline \text { Non-i.i.d. data } & \text { State models, chain rule of probabilities } \\
\hline \text { Arbitrary lengths } & \text { Sharing weights } \\
\hline \text { Credit assignment problem } & \text { Backpropagation through time } \\
\hline \text { Chaotic behavior } & \text { LSTM, GRU } \\
\hline \text { Temporal redundancy } & \text { Spiking neural nets, slow feature learning } \\
\hline
\end{array}
$
## A sequence of probabilities
Sequences are by definition non-i.i.d. How do we then compute the likelihood? By decomposing it with the [[Probability Theory#Chain rule]]:
$
p(x)=\prod_{i} p\left(x_{i} \mid x_{1}, \ldots, x_{i-1}\right)
$
And then model each term $p\left(x_{i} \mid x_{1}, \ldots, x_{i-1}\right)$ separately.
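As a toy illustration (a sketch, not from the lecture): if each conditional $p\left(x_{i} \mid x_{1}, \ldots, x_{i-1}\right)$ is approximated by a bigram model that only looks at the previous symbol, the log-likelihood of a sequence is just a sum of per-step log-probabilities. The vocabulary and probability tables below are made up.
```python
import numpy as np

# Sketch of the chain-rule factorisation with a hypothetical bigram model:
# p(x_i | x_1, ..., x_{i-1}) is approximated by p(x_i | x_{i-1}).
vocab = ["a", "b", "c"]
p_first = np.array([0.5, 0.3, 0.2])            # p(x_1)
p_next = np.array([[0.1, 0.6, 0.3],            # rows: previous symbol
                   [0.5, 0.2, 0.3],            # cols: next symbol
                   [0.3, 0.3, 0.4]])

def log_likelihood(seq):
    """log p(x) = sum_i log p(x_i | x_1, ..., x_{i-1})."""
    idx = [vocab.index(s) for s in seq]
    logp = np.log(p_first[idx[0]])
    for prev, cur in zip(idx[:-1], idx[1:]):
        logp += np.log(p_next[prev, cur])
    return logp

print(log_likelihood(["a", "b", "c", "a"]))    # log(0.5) + log(0.6) + log(0.3) + log(0.3)
```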
## Memory
To have the past influence present we must keep a summary of the past: A state, a memory, a representation of the past.
Memory can take any shape or form as long as it encodes the past. Otherwise, it is hard to keep track of distributional shifts.
At timestep $t$, project all previous information $1, \ldots, t$ onto a state space $s_{t}$. Memory is controlled by a neural network $h$ with shared parameters $\theta$. Then, at timestep $t+1$, re-use the shared parameters $\theta$ and the previous state $s_{t}$ such that
$
s_{t+1}=h_{t+1}\left(x_{t+1}, s_{t}\right)
$
Then apply this recursive operation, often with the [[Markov Chain]] assumption that any new state depends on the previous time step only
$
\begin{array}{l}
s_{t}=h_{t}\left(x_{t}, s_{t-1}\right) \\
s_{t-1}=h_{t-1}\left(x_{t-1}, s_{t-2}\right)
\end{array}
$
In the simplest case with fully connected layers: $\theta=\{U, V, W\}$
- The matrix $U$ transforms a vector from the input space to the state space
- The matrix $V$ transforms a vector from the state space to the output space
- The matrix $W$ transforms the past state and input to the new state
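A minimal numpy sketch of this recursion with shared parameters (dimensions are illustrative; the tanh nonlinearity anticipates the formulation further below, and the output matrix $V$ is used there):
```python
import numpy as np

# Minimal sketch of the recursion s_t = h(x_t, s_{t-1}) with shared parameters.
input_dim, state_dim = 4, 8
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(state_dim, input_dim))   # input -> state
W = rng.normal(scale=0.1, size=(state_dim, state_dim))   # past state -> new state

def h(x_t, s_prev):
    # one recursive step: project the input and the previous state into state space
    return np.tanh(U @ x_t + W @ s_prev)

s = np.zeros(state_dim)                       # s_0: no past yet
for x_t in rng.normal(size=(10, input_dim)):  # unroll over a length-10 toy sequence
    s = h(x_t, s)                             # s_t summarises x_1, ..., x_t
```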
![[memory-cycles.png]]
One of the problems is the presence of cycles in the computation graph. How do we handle them?
Write down each time step explicitly -> no cycles anymore
- In theory, the unrolled graph reaches infinitely many time steps back
- Can we do backpropagation over infinitely many time steps?
In practice, we cut the sequence short after a while. After all, does the 'very distant' past really influence the present?
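In code, "cutting the sequence short" amounts to truncated backpropagation through time. A minimal PyTorch sketch (the model, data, and truncation length `k` are hypothetical):
```python
import torch
import torch.nn as nn

# Truncated backpropagation through time: process the long sequence in
# chunks of k steps and detach the hidden state between chunks, so the
# backward pass never reaches further than k steps into the past.
rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
head = nn.Linear(8, 3)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.1)

x = torch.randn(1, 100, 4)               # one toy sequence of 100 steps
targets = torch.randint(0, 3, (1, 100))  # one toy label per step
k = 20                                   # truncation length

h = torch.zeros(1, 1, 8)                 # initial hidden state
for start in range(0, x.size(1), k):
    chunk = x[:, start:start + k]
    out, h = rnn(chunk, h)
    loss = nn.functional.cross_entropy(
        head(out).reshape(-1, 3),
        targets[:, start:start + k].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    h = h.detach()                       # cut the graph here: the 'very distant' past is dropped
```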
## Sequences of arbitrary lengths
We cannot, and ideally should not, have a constraint on sequence length.
The model should have no problem with
- varying
- unknown
- or even infinite
sequence lengths
One logical solution: sharing parameters through time $\theta_{t}=\theta_{t-1}=\cdots=\theta_{0}$
## Formulation of Recurrent Neural Networks
Putting things together, we have a simple RNN as follows:
$
\begin{array}{l}
s_{t}=\tanh \left(U \cdot x_{t}+W \cdot s_{t-1}\right) \\
y_{t}=\operatorname{softmax}\left(V \cdot s_{t}\right) \\
\mathcal{L}=\sum_{t} \mathcal{L}_{t}\left(y_{t}, l_{t}\right)
\end{array}
$
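A minimal numpy sketch of this forward pass (sizes and data are illustrative; gradients and BPTT are omitted):
```python
import numpy as np

# Forward pass of the simple RNN above: shared U, W, V, per-step softmax
# outputs y_t, and a loss summed over time steps (cross-entropy vs labels l_t).
input_dim, state_dim, num_classes, T = 4, 8, 3, 10
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(state_dim, input_dim))
W = rng.normal(scale=0.1, size=(state_dim, state_dim))
V = rng.normal(scale=0.1, size=(num_classes, state_dim))

xs = rng.normal(size=(T, input_dim))        # toy input sequence x_1 .. x_T
ls = rng.integers(0, num_classes, size=T)   # toy labels l_1 .. l_T

s, loss = np.zeros(state_dim), 0.0
for x_t, l_t in zip(xs, ls):
    s = np.tanh(U @ x_t + W @ s)                    # s_t
    logits = V @ s
    y_t = np.exp(logits) / np.exp(logits).sum()     # y_t = softmax(V s_t)
    loss += -np.log(y_t[l_t])                       # L_t(y_t, l_t), accumulated over t
```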
## Architectures
![[rnn_arch.jpeg]]
### Computational Graph
The same weights are shared across all time steps. In the many-to-many architecture, the loss is accumulated from the output at each time step.
![[rnn_weight.jpg]]
## Sequence Classification with RNNs
- How should we connect the RNN sentence representation to a classification layer?
- The sentence is represented by a sequence of hidden states $\left\langle\mathbf{h}_{1}, \ldots, \mathbf{h}_{T}\right\rangle$
- There is not a single hidden state representing the entire sequence
Approach 1: Use the last hidden state of the unrolled RNN (both approaches are sketched in code at the end of this section)
- Connect $\mathbf{h}_{T}$ with classification layer: $\mathbf{o}=W_{o} \mathbf{h}_{T}+\mathbf{b}_{o}$
- How well does $\mathbf{h}_{T}$ capture information coming from input $x_{t}$ if $t \ll T$ ?
- in particular, if the correct classification depends on the actual input word $x_{t}(t \ll T)$
- how can we increase the potential impact of $x_{t}(t \ll T) ?$
Approach 2: Average all hidden states of the unrolled RNN
- Average all $\mathbf{h}_{t}$ in $\left\langle\mathbf{h}_{1}, \ldots, \mathbf{h}_{T}\right\rangle$: $\mathbf{c}=\frac{1}{T}\sum_{t=1}^{T} \mathbf{h}_{t}, \quad \mathbf{o}=W_{o} \mathbf{c}+\mathbf{b}_{o}$
- All $\mathbf{h}_{t}$ are scaled by the same value $\frac{1}{T}$
- but represent input prefixes of different lengths: $X_{\leq t}$
- maybe scale by $\alpha_{t}$, but how to define $\alpha_{t}$ ?
- what's the added value of averaging $\mathbf{h}_{t}$ instead of embeddings? (answer: contextual disambiguation of ambiguous input words)
- Can one fixed-sized vector really represent the full meaning of an entire sentence?
- No.
- ... but there are methods that allow for more fine-grained, flexible, on-demand meaning representations of entire sequences
- Do we really need to capture the full meaning for sequence classification?
- No. ... well, at least not for most cases
- but in some NLP applications this is needed: machine translation, summarization, ...
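A minimal PyTorch sketch of the two classification heads above (Approach 1: last hidden state $\mathbf{h}_T$; Approach 2: mean pooling over all $\mathbf{h}_t$); the sizes are hypothetical:
```python
import torch
import torch.nn as nn

# Two ways to connect an RNN sentence representation to a classification layer.
hidden_size, num_classes = 8, 2
rnn = nn.RNN(input_size=4, hidden_size=hidden_size, batch_first=True)
W_o = nn.Linear(hidden_size, num_classes)          # o = W_o (.) + b_o

x = torch.randn(1, 15, 4)          # one sentence of 15 token embeddings
h_all, h_last = rnn(x)             # h_all: <h_1, ..., h_T>, h_last: h_T

# Approach 1: classify from the last hidden state h_T
o_last = W_o(h_last.squeeze(0))

# Approach 2: classify from the average of all hidden states (alpha_t = 1/T)
c = h_all.mean(dim=1)
o_mean = W_o(c)
```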
## Backpropagation
RNNs are trained with [[Backpropagation Through Time]].
## Variants
[[LSTM]]
[[GRU]]
---
## References
1. CS231N, Stanford 2015
2. Lecture 6, UvA Deep Learning 2020