# Backpropagation Through Time
How are RNNs and MLPs different?
- Main difference: layer-shared parameters vs. layer-specific parameters. Mentally, one just has to switch from 'layers' to 'time steps'.
- Question: how to train with shared parameters? Backpropagation... through time.
![[unfolding-rnns.jpg]]
BPTT is basically the chain rule again, for $\frac{d \mathcal{L}}{d V}, \frac{d \mathcal{L}}{d U}, \frac{d \mathcal{L}}{d W}$, which is the same algorithm as ordinary [[Backpropagation]]. But there is one caveat: shared computations complicate the chain rule:
$
\begin{array}{l}
y_{t}=\operatorname{softmax}\left(V \cdot \boldsymbol{s}_{t}\right) \\
\boldsymbol{s}_{t}=\tanh \left(U \cdot \boldsymbol{x}_{t}+W \cdot \boldsymbol{s}_{t-1}\right)
\end{array}
$
$V, U, W$ are the same for all time steps $t, t+1, \ldots$
- Gradients do not flow along a single path from the previous layer, as in an MLP
- The recurrence 'hides' multiple dependencies inside the chain rule
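As a concrete picture of the unfolded graph, here is a minimal NumPy sketch of the forward recurrence above (the dimensions, random initialization, and zero initial state are illustrative assumptions). The cached states are exactly what BPTT reuses on the way back:
```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out, T = 3, 4, 2, 5           # input/state/output sizes, sequence length

U = rng.normal(0, 0.1, (d_hid, d_in))        # input-to-state weights
W = rng.normal(0, 0.1, (d_hid, d_hid))       # state-to-state (memory) weights
V = rng.normal(0, 0.1, (d_out, d_hid))       # state-to-output weights

def softmax(z):
    e = np.exp(z - z.max())                  # shift for numerical stability
    return e / e.sum()

x = rng.normal(size=(T, d_in))               # input sequence
s = np.zeros((T + 1, d_hid))                 # s[0] is the initial state s_0
y = np.zeros((T, d_out))
for t in range(T):
    s[t + 1] = np.tanh(U @ x[t] + W @ s[t])  # the *same* U, W at every step
    y[t] = softmax(V @ s[t + 1])             # the *same* V at every step
```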
## Chain rule for dS/dW and dS/dU in the unfolded graph
Write the dependencies on $s_{t}, s_{t-1}, \ldots, s_{0}$ as separate variables:
$
s=h\left(s_{t}, s_{t-1}, \ldots, s_{0}\right)
$
All gradient paths from $s$ to $w$ flow via some $s_{i}$ (via $s_{t}$, via $s_{t-1}$, and so on):
$
\frac{d s}{d w}=\sum_{i=0}^{t} \frac{d s}{d s_{i}} \cdot \frac{d s_{i}}{d w}
$
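For example, with $t=2$ the sum simply enumerates one path per copy of the shared weight in the unfolded graph:
$
\frac{d s}{d w}=\frac{d s}{d s_{2}} \frac{d s_{2}}{d w}+\frac{d s}{d s_{1}} \frac{d s_{1}}{d w}+\frac{d s}{d s_{0}} \frac{d s_{0}}{d w}
$
where each $\frac{d s_{i}}{d w}$ differentiates only the copy of $w$ used at step $i$.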
## Chain rule by change of variable differentiation
- Same result as above but more involved
- One must keep in mind that the nonlinearity $h$ and its derivative act within 'one layer': there is no recursion through $h$, only via $s_{i}$
## BPTT for memory and input parameters dL/dW, dL/dU
Putting everything together,
$
\frac{\partial \mathcal{L}_{t}}{\partial W}=\sum_{i=0}^{t} \frac{\partial \mathcal{L}_{t}}{\partial y_{t}} \frac{\partial y_{t}}{\partial s_{t}} \frac{\partial s_{t}}{\partial s_{i}} \frac{\partial \boldsymbol{s}_{i}}{\partial \boldsymbol{W}}=\sum_{i=0}^{t} \frac{\partial \mathcal{L}_{t}}{\partial y_{t}} \frac{\partial y_{t}}{\partial \boldsymbol{s}_{t}}\left(\prod_{j=i+1}^{t} \frac{\partial s_{j}}{\partial s_{j-1}}\right) \frac{\partial \boldsymbol{s}_{i}}{\partial \boldsymbol{W}}
$
If there is a separate loss per time step, sum over the time steps:
$
\frac{\partial \mathcal{L}}{\partial \boldsymbol{W}}=\sum_{t} \frac{\partial \mathcal{L}_{t}}{\partial \boldsymbol{W}}
$
Similarly for the input parameters $\boldsymbol{U}$:
$
\frac{\partial \mathcal{L}_{t}}{\partial \boldsymbol{U}}=\sum_{i=0}^{t} \frac{\partial \mathcal{L}_{t}}{\partial y_{t}} \frac{\partial y_{t}}{\partial \boldsymbol{s}_{t}} \frac{\partial \boldsymbol{s}_{t}}{\partial \boldsymbol{s}_{i}} \frac{\partial \boldsymbol{s}_{i}}{\partial \boldsymbol{U}}
$
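Putting the two sums into code: a sketch of the backward pass for the toy RNN above. It assumes a per-step cross-entropy loss on class labels `targets` (an illustrative assumption), so that the gradient at the softmax input is simply $y_t$ minus the one-hot target; the carried vector `ds_next` realizes the product $\prod_{j} \partial s_{j} / \partial s_{j-1}$ one factor at a time instead of recomputing it for every $i$:
```python
def bptt(x, targets, s, y, U, W, V):
    """Full BPTT: returns dL/dU, dL/dW, dL/dV for L = sum_t L_t,
    with L_t the cross-entropy of y_t against targets[t]."""
    T = x.shape[0]
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    ds_next = np.zeros(W.shape[0])            # gradient flowing back from step t+1
    for t in reversed(range(T)):
        do = y[t].copy()                      # softmax + cross-entropy:
        do[targets[t]] -= 1.0                 # dL_t/d(V s_t) = y_t - onehot(target_t)
        dV += np.outer(do, s[t + 1])
        ds = V.T @ do + ds_next               # local path via y_t + carried path via s_{t+1}
        da = ds * (1.0 - s[t + 1] ** 2)       # through tanh: derivative is 1 - tanh^2
        dU += np.outer(da, x[t])              # the copy of U used at step t
        dW += np.outer(da, s[t])              # the copy of W used at step t
        ds_next = W.T @ da                    # one factor of prod_j ds_j/ds_{j-1}
    return dU, dW, dV

targets = rng.integers(0, d_out, size=T)     # dummy per-step labels
dU, dW, dV = bptt(x, targets, s, y, U, W, V)
```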
## Truncating BPTT
- Unrolling over arbitrarily long sequences is neither practical nor, at some point, feasible
- The usual strategy is to truncate to $t_{\text {trunc}}$ steps: replace all $t$ in the equations above with $t_{\text {trunc}}$ (see the sketch after this list)
- This puts more focus on short-term terms
- That is not undesirable, as long-term terms may be irrelevant anyway
- It 'biases' the model towards simpler solutions with shorter-term dependencies
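A sketch of one common way to implement the truncation, reusing `softmax` and `bptt` from the sketches above: the sequence is processed in windows of $t_{\text {trunc}}$ steps, the state is carried across windows, but gradients never cross a window boundary. The windowing scheme itself is an assumption; other truncation variants exist.
```python
def truncated_bptt(x, targets, U, W, V, t_trunc):
    """Truncated BPTT: the forward state is carried across windows of
    t_trunc steps, but the backward pass never looks past a window."""
    T = x.shape[0]
    s0 = np.zeros(W.shape[0])                 # carried (but detached) state
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    for start in range(0, T, t_trunc):
        xs, ts = x[start:start + t_trunc], targets[start:start + t_trunc]
        # forward over this window only, starting from the carried state
        s = np.zeros((len(xs) + 1, W.shape[0])); s[0] = s0
        y = np.zeros((len(xs), V.shape[0]))
        for t in range(len(xs)):
            s[t + 1] = np.tanh(U @ xs[t] + W @ s[t])
            y[t] = softmax(V @ s[t + 1])
        gU, gW, gV = bptt(xs, ts, s, y, U, W, V)  # backward stops at s0
        dU += gU; dW += gW; dV += gV
        s0 = s[-1]                            # carry the state, not the gradient
    return dU, dW, dV

dU_tr, dW_tr, dV_tr = truncated_bptt(x, targets, U, W, V, t_trunc=2)
```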
Note: Good slides for recursive gradient computation [[#^6d978f| [2] ]].
## Challenges
[[Vanishing and Exploding Gradients]]
- This is the main problem in RNNs: they are effectively MLPs with arbitrarily many layers and shared parameters
$
\frac{\partial \mathcal{L}_{t}}{\partial \boldsymbol{W}}=\sum_{i=0}^{t} \frac{\partial \mathcal{L}_{t}}{\partial y_{t}} \frac{\partial y_{t}}{\partial \boldsymbol{s}_{t}} \frac{\partial s_{t}}{\partial s_{i}} \frac{\partial \boldsymbol{s}_{i}}{\partial \boldsymbol{W}}=\sum_{i=0}^{t} \frac{\partial \mathcal{L}_{t}}{\partial y_{t}} \frac{\partial y_{t}}{\partial \boldsymbol{s}_{t}}\left(\prod_{j=i+1}^{t} \frac{\partial s_{j}}{\partial s_{j-1}}\right) \frac{\partial \boldsymbol{s}_{i}}{\partial \boldsymbol{W}}
$
- If $\frac{\partial s_{j}}{\partial s_{j-1}}<1 \Rightarrow \frac{\partial \mathcal{L}_{t}}{\partial W} \ll 1$ → vanishing gradient
- If $\frac{\partial s_{j}}{\partial s_{j-1}}>1 \Rightarrow \frac{\partial \mathcal{L}_{t}}{\partial W} \gg 1$ → exploding gradient
- Longer-term terms contribute exponentially smaller gradients, so the model emphasizes shorter-term terms, which have larger gradients.
- A possible solution is to clip exploding gradients, since they are an optimization issue; we cannot rescale vanishing gradients, because there the issue is gradient accuracy. In any case, rescaling still only focuses on the short term (see the sketch below).
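A scalar caricature of the Jacobian product, plus a sketch of one common clipping scheme (clip by global norm; the threshold `5.0` is an illustrative assumption, and the gradients come from the BPTT sketch above):
```python
# Scalar caricature of prod_j ds_j/ds_{j-1}: 50 steps are enough
# to collapse or blow up the gradient.
print(0.9 ** 50, 1.1 ** 50)   # ~0.005 vs ~117

def clip_by_global_norm(grads, max_norm=5.0):
    """Jointly rescale gradients so their global L2 norm is at most max_norm."""
    total = np.sqrt(sum((g ** 2).sum() for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads

dU, dW, dV = clip_by_global_norm([dU, dW, dV])
```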
**Misalignment between gradient and weights**
- At every step we use 'different versions over time' of the various variables
- The gradients are therefore only an estimate, and the estimate gets worse the further back we propagate
- Doing fewer updates helps, but may slow down training
**Bias due to truncation**
- Instead of computing the real gradient for all time steps $0,1,2, \ldots, t$, we compute a gradient approximation up to $t_{\text {trunc}}$
- In practice, and for many applications, it does not matter very much.
- However, it would still be nice to have the model choose what to ignore.
---
## References
1. Lecture 4.2, UvA DL course 2020
2. Computing the Gradient in an RNN, Sargur Srihari https://cedar.buffalo.edu/~srihari/CSE676/10.2.2%20RNN-Gradients.pdf ^6d978f