# Recurrent Neural Networks (RNN)

## Characteristics of sequential data

Sequences are everywhere: videos, stock exchanges, climate measurements, market analysis, speech, music, user behaviour, computer games, etc.

Sequence -> data are no longer i.i.d.
- Each data point depends on the previous ones, and their distribution changes over time.
- How to model temporal dependencies, especially in high-dimensional, complex data?

Sequences might be of arbitrary length
- How to model dependencies at different temporal scales?
- How to make a model that works for 10-second and 10-minute sequences alike?

Cycles in our computational graph
- How to learn with cycles?

Unclear how to assign credit to past time steps
- Which of the infinitely many past time steps is responsible for the observed present?

Chaotic behavior
- How do we account for all possible effects of small changes in the past sequence?

Temporal redundancy
- In temporally dense series there is lots of redundancy. Curse or blessing?

## Inductive bias for sequences

| Challenge | Inductive bias |
| --- | --- |
| Non-i.i.d. data | State models, chain rule of probability |
| Arbitrary lengths | Sharing weights through time |
| Credit assignment problem | Backpropagation through time |
| Chaotic behavior | LSTM, GRU |
| Temporal redundancy | Spiking neural nets, slow feature learning |

## A sequence of probabilities

Sequences are by definition non-i.i.d. How do we then compute the likelihood? By decomposing it with the [[Probability Theory#Chain rule]]:

$$
p(x)=\prod_{i} p\left(x_{i} \mid x_{1}, \ldots, x_{i-1}\right)
$$

and then modelling each term $p\left(x_{i} \mid x_{1}, \ldots, x_{i-1}\right)$ separately.

## Memory

To have the past influence the present, we must keep a summary of the past: a state, a memory, a representation of the past. The memory can take many shapes, as long as it encodes the past; otherwise it is hard to keep track of distributional shifts.

At timestep $t$, project all previous information $1, \ldots, t$ onto a state $s_{t}$. The memory is controlled by a neural network $h$ with shared parameters $\theta$. Then, at timestep $t+1$, re-use the parameters $\theta$ and the previous state $s_{t}$ such that

$$
s_{t+1}=h_{t+1}\left(x_{t+1}, s_{t}\right)
$$

Then apply the operation recursively, often with the [[Markov Chain]] assumption that any new state depends on the previous time step only:

$$
\begin{aligned}
s_{t} &= h_{t}\left(x_{t}, s_{t-1}\right) \\
s_{t-1} &= h_{t-1}\left(x_{t-1}, s_{t-2}\right)
\end{aligned}
$$

In the simplest case, with fully connected layers, $\theta=\{U, V, W\}$:
- The matrix $U$ transforms a vector from the input space to the state space.
- The matrix $V$ transforms a vector from the state space to the output space.
- The matrix $W$ transforms the previous state, which together with the transformed input yields the new state.

![[memory-cycles.png]]

One of the problems is cycles in the computation graph. How do we handle them? Write down each time step explicitly -> no cycles anymore.
- In theory, this leaves infinitely many time steps into the past.
- Can we do backpropagation through infinitely many time steps?

In practice, we cut the sequence short after a while (truncated backpropagation through time). After all, does the 'very distant' past really have influence on the present?
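As a minimal sketch of what writing the time steps down explicitly looks like, here is an illustrative NumPy unrolling of the recursion with shared parameters $\theta=\{U, V, W\}$. The tanh non-linearity and softmax output follow the simple RNN formulation given further below; the function name, dimensions, and random inputs are arbitrary assumptions for this example, not part of the lecture.

```python
import numpy as np

def rnn_unroll(xs, U, W, V, s0=None):
    """Unroll s_t = tanh(U x_t + W s_{t-1}), y_t = softmax(V s_t) over a finite sequence.

    xs : (T, input_dim) -- one input vector per time step
    U  : (state_dim, input_dim)   input space  -> state space
    W  : (state_dim, state_dim)   previous state -> state contribution
    V  : (output_dim, state_dim)  state space  -> output space
    The same U, W, V are re-used at every time step (shared parameters).
    """
    s = np.zeros(W.shape[0]) if s0 is None else s0
    states, outputs = [], []
    for x in xs:                        # writing each step explicitly removes the cycle
        s = np.tanh(U @ x + W @ s)      # new state summarises the past
        z = V @ s
        y = np.exp(z - z.max())
        y /= y.sum()                    # softmax over outputs
        states.append(s)
        outputs.append(y)
    return np.array(states), np.array(outputs)

# Illustrative dimensions only (assumed for the example).
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(16, 8))
W = rng.normal(scale=0.1, size=(16, 16))
V = rng.normal(scale=0.1, size=(4, 16))
xs = rng.normal(size=(10, 8))           # a sequence of 10 time steps
states, outputs = rnn_unroll(xs, U, W, V)
print(states.shape, outputs.shape)      # (10, 16) (10, 4)
```

Because the same $U$, $W$, $V$ are applied at every step, the same loop handles a sequence of 10 or 10,000 steps, which is exactly the weight sharing discussed in the next section.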
## Sequences of arbitrary lengths

We cannot, and ideally should not, place a constraint on the sequence length. The model should have no problem with
- varying,
- unknown,
- or even infinite sequence lengths.

One logical solution: sharing parameters through time, $\theta_{t}=\theta_{t-1}=\cdots=\theta_{0}$.

## Formulation of Recurrent Neural Networks

Putting things together, we have a simple RNN as follows:

$$
\begin{aligned}
s_{t} &= \tanh \left(U x_{t}+W s_{t-1}\right) \\
y_{t} &= \operatorname{softmax}\left(V s_{t}\right) \\
\mathcal{L} &= \sum_{t} \mathcal{L}_{t}\left(y_{t}, l_{t}\right)
\end{aligned}
$$

## Architectures

![[rnn_arch.jpeg]]

### Computational Graph

The same weights are shared across all time steps. In the many-to-many architecture, the loss is accumulated from the output at each time step.

![[rnn_weight.jpg]]

## Sequence Classification with RNNs

- How should we connect the RNN sentence representation to a classification layer?
- The sentence is represented by a sequence of hidden states $\left\langle\mathbf{h}_{1}, \ldots, \mathbf{h}_{T}\right\rangle$.
- There is not a single hidden state representing the entire sequence.

Approach 1: Use the last hidden state of the unrolled RNN
- Connect $\mathbf{h}_{T}$ to the classification layer: $\mathbf{o}=W_{o} \mathbf{h}_{T}+\mathbf{b}_{o}$
- How well does $\mathbf{h}_{T}$ capture information coming from input $x_{t}$ if $t \ll T$?
    - In particular, if the correct classification depends on the actual input word $x_{t}$ ($t \ll T$).
    - How can we increase the potential impact of $x_{t}$ ($t \ll T$)?

Approach 2: Average all hidden states of the unrolled RNN
- Average all $\mathbf{h}_{t}$ in $\left\langle\mathbf{h}_{1}, \ldots, \mathbf{h}_{T}\right\rangle$: $\mathbf{c}=\frac{1}{T}\sum_{t=1}^{T} \mathbf{h}_{t}$, then $\mathbf{o}=W_{o} \mathbf{c}+\mathbf{b}_{o}$
- All $\mathbf{h}_{t}$ are scaled by the same value $\frac{1}{T}$
    - but they represent input prefixes of different lengths, $X_{\leq t}$
    - maybe scale by $\alpha_{t}$ instead, but how to define $\alpha_{t}$?
- What is the added value of averaging the $\mathbf{h}_{t}$ instead of the embeddings? (Answer: contextual disambiguation of ambiguous input words.)
- Can one fixed-size vector really represent the full meaning of an entire sentence?
    - No.
    - ... but there are methods that allow for more fine-grained, flexible, on-demand meaning representations of entire sequences.
- Do we really need to capture the full meaning for sequence classification?
    - No ... well, at least not in most cases,
    - but in some NLP applications this is needed: machine translation, summarization, ...

## Backpropagation

RNNs are trained with [[Backpropagation Through Time]].

## Variants

- [[LSTM]]
- [[GRU]]

---

## References

1. CS231n, Stanford 2015
2. Lecture 6, UvA Deep Learning 2020