# Backpropagation Through Time

How are RNNs and MLPs different?

- Main difference: layer-shared parameters vs. layer-specific parameters. One just has to mentally switch from 'layers' to 'time steps'.
- Question: how do we train with shared parameters? Backpropagation... through time.

![[unfolding-rnns.jpg]]

BPTT is basically the chain rule again, for $\frac{d \mathcal{L}}{d V}, \frac{d \mathcal{L}}{d U}, \frac{d \mathcal{L}}{d W}$, which is the same algorithm as normal [[Backpropagation]]. But there is one caveat: shared computations complicate the chain rule:

$$
\begin{array}{l}
y_{t}=\operatorname{softmax}\left(V \cdot \boldsymbol{s}_{t}\right) \\
\boldsymbol{s}_{t}=\tanh \left(U \cdot \boldsymbol{x}_{t}+W \cdot \boldsymbol{s}_{t-1}\right)
\end{array}
$$

$V, U, W$ are the same for $t, t+1, \ldots$

- Gradients do not flow along a 'single path' from the previous layer as in an MLP
- The recurrence in the chain rule 'hides' multiple dependencies

## Chain rule for dS/dW and dS/dV in the unfolded graph

Write the dependencies on $s_{t}, s_{t-1}, \ldots$ as separate variables:

$$
s=h\left(s_{t}, s_{t-1}, \ldots, s_{0}\right)
$$

All gradient paths flow from $s$ to $w$: via $s_{t}$, via $s_{t-1}$, and so on:

$$
\frac{d s}{d w}=\sum_{i=0}^{t} \frac{d s}{d s_{i}} \cdot \frac{d s_{i}}{d w}
$$

## Chain rule by change-of-variable differentiation

- Same result as above, but more involved
- One must keep in mind that the nonlinearity $h$ and its derivative act within 'one layer': no recursion through $h$, only via $s_{i}$

## BPTT for memory and input parameters dL/dW, dL/dU

Putting everything together:

$$
\frac{\partial \mathcal{L}_{t}}{\partial \boldsymbol{W}}=\sum_{i=0}^{t} \frac{\partial \mathcal{L}_{t}}{\partial y_{t}} \frac{\partial y_{t}}{\partial \boldsymbol{s}_{t}} \frac{\partial \boldsymbol{s}_{t}}{\partial \boldsymbol{s}_{i}} \frac{\partial \boldsymbol{s}_{i}}{\partial \boldsymbol{W}}=\sum_{i=0}^{t} \frac{\partial \mathcal{L}_{t}}{\partial y_{t}} \frac{\partial y_{t}}{\partial \boldsymbol{s}_{t}}\left(\prod_{j=i+1}^{t} \frac{\partial \boldsymbol{s}_{j}}{\partial \boldsymbol{s}_{j-1}}\right) \frac{\partial \boldsymbol{s}_{i}}{\partial \boldsymbol{W}}
$$

If there is a loss at every time step, sum over time steps:

$$
\frac{\partial \mathcal{L}}{\partial \boldsymbol{W}}=\sum_{t} \frac{\partial \mathcal{L}_{t}}{\partial \boldsymbol{W}}
$$

Similarly for the input parameters $\boldsymbol{U}$:

$$
\frac{\partial \mathcal{L}_{t}}{\partial \boldsymbol{U}}=\sum_{i=0}^{t} \frac{\partial \mathcal{L}_{t}}{\partial y_{t}} \frac{\partial y_{t}}{\partial \boldsymbol{s}_{t}} \frac{\partial \boldsymbol{s}_{t}}{\partial \boldsymbol{s}_{i}} \frac{\partial \boldsymbol{s}_{i}}{\partial \boldsymbol{U}}
$$

## Truncating BPTT

- Unrolling forever is not practical, or even feasible
- The usual strategy is to truncate to $t_{\text{trunc}}$; then replace all $t$ in the equations above with $t_{\text{trunc}}$ (see the NumPy sketch below)
- This puts more focus on short-term terms
- Not necessarily undesirable, as long-term terms may be irrelevant anyway
- 'Biases' towards simpler models with shorter-term dependencies

Note: Good slides for recursive gradient computation [[#^6d978f| [2] ]].
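To make the equations above concrete, here is a minimal NumPy sketch of (truncated) BPTT for exactly this model. It is not from the lecture; the function name `bptt`, the per-step cross-entropy loss, and the `t_trunc` window are my own assumptions.

```python
import numpy as np

def bptt(xs, targets, U, W, V, s0, t_trunc=None):
    """Truncated BPTT for s_t = tanh(U x_t + W s_{t-1}), y_t = softmax(V s_t).

    xs: list of input vectors x_t; targets: list of target class indices.
    Returns the summed cross-entropy loss and the gradients dU, dW, dV.
    """
    T = len(xs)
    ss, ys = {-1: s0}, {}
    loss = 0.0
    # Forward pass: unfold the recurrence over all T time steps.
    for t in range(T):
        ss[t] = np.tanh(U @ xs[t] + W @ ss[t - 1])
        z = V @ ss[t]
        e = np.exp(z - z.max())          # numerically stable softmax
        ys[t] = e / e.sum()
        loss -= np.log(ys[t][targets[t]])

    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    # Backward pass: outer sum over time steps t (one loss term per step).
    for t in range(T):
        # dL_t/d(V s_t) for softmax + cross-entropy is (y_t - onehot(target))
        dy = ys[t].copy()
        dy[targets[t]] -= 1.0
        dV += np.outer(dy, ss[t])
        # delta = dL_t/da_i, where s_i = tanh(a_i); it carries the running
        # product of ds_j/ds_{j-1} factors back through time.
        delta = (V.T @ dy) * (1.0 - ss[t] ** 2)
        lo = 0 if t_trunc is None else max(0, t - t_trunc)
        for i in range(t, lo - 1, -1):   # inner sum over i = t, t-1, ..., lo
            dU += np.outer(delta, xs[i])
            dW += np.outer(delta, ss[i - 1])
            delta = (W.T @ delta) * (1.0 - ss[i - 1] ** 2)
    return loss, dU, dW, dV
```

Note how `delta` accumulates the product $\prod_{j=i+1}^{t} \frac{\partial s_{j}}{\partial s_{j-1}}$ incrementally: each term of the sum over $i$ costs one extra matrix product instead of recomputing the whole chain.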
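A quick way to convince yourself that the summed chain rule is correct: compare the analytic gradient from the sketch above against a finite-difference approximation. Again, this is my own sanity check, using the hypothetical `bptt` from above.

```python
rng = np.random.default_rng(0)
D, H, C, T = 3, 4, 2, 5  # input size, state size, classes, time steps
U = rng.normal(0, 0.1, (H, D))
W = rng.normal(0, 0.1, (H, H))
V = rng.normal(0, 0.1, (C, H))
xs = [rng.normal(size=D) for _ in range(T)]
targets = [int(rng.integers(C)) for _ in range(T)]
s0 = np.zeros(H)

# Analytic gradient (no truncation, so the full sum over i is used)
loss, dU, dW, dV = bptt(xs, targets, U, W, V, s0)

# Finite-difference estimate for one entry of W
eps = 1e-6
W_pert = W.copy()
W_pert[0, 0] += eps
loss_pert, *_ = bptt(xs, targets, U, W_pert, V, s0)
print((loss_pert - loss) / eps, dW[0, 0])  # the two numbers should be close
```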
## Challenges

[[Vanishing and Exploding Gradients]]

- This is a big problem in RNNs: an RNN is effectively an MLP with infinitely many layers and shared parameters, which is the root of the problem

$$
\frac{\partial \mathcal{L}_{t}}{\partial \boldsymbol{W}}=\sum_{i=0}^{t} \frac{\partial \mathcal{L}_{t}}{\partial y_{t}} \frac{\partial y_{t}}{\partial \boldsymbol{s}_{t}} \frac{\partial \boldsymbol{s}_{t}}{\partial \boldsymbol{s}_{i}} \frac{\partial \boldsymbol{s}_{i}}{\partial \boldsymbol{W}}=\sum_{i=0}^{t} \frac{\partial \mathcal{L}_{t}}{\partial y_{t}} \frac{\partial y_{t}}{\partial \boldsymbol{s}_{t}}\left(\prod_{j=i+1}^{t} \frac{\partial \boldsymbol{s}_{j}}{\partial \boldsymbol{s}_{j-1}}\right) \frac{\partial \boldsymbol{s}_{i}}{\partial \boldsymbol{W}}
$$

- If $\frac{\partial s_{j}}{\partial s_{j-1}}<1 \Rightarrow \frac{\partial \mathcal{L}_{t}}{\partial W} \ll 1$: vanishing gradient
- If $\frac{\partial s_{j}}{\partial s_{j-1}}>1 \Rightarrow \frac{\partial \mathcal{L}_{t}}{\partial W} \gg 1$: exploding gradient
- The contribution of longer-term terms shrinks exponentially, so the model emphasizes shorter-term terms, as they have larger gradients
- A possible solution is to clip exploding gradients, since exploding is an optimization issue; vanishing gradients cannot be rescaled, because vanishing is a gradient-accuracy issue. In any case, rescaling only keeps the focus on the short term (a minimal clipping sketch appears after the references below)

Misalignment between gradients and weights

- For every step we use 'different versions over time' of the various variables
- The new gradients are therefore only an estimate, and the estimate gets worse with more backpropagation
- Doing fewer updates helps, but might slow down training

Bias due to truncation

- Instead of computing the real gradient for all time steps $0,1,2,\ldots,t$, we compute a gradient approximation up to $t_{\text{trunc}}$
- In practice, and for many applications, this does not matter very much
- However, it would still be nice to have the model choose what to ignore

---

## References

1. Lecture 4.2, UvA DL course 2020
2. Computing the Gradient in an RNN, Sargur Srihari. https://cedar.buffalo.edu/~srihari/CSE676/10.2.2%20RNN-Gradients.pdf ^6d978f
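The clipping sketch referenced in the Challenges section: a minimal example of rescaling exploding gradients by their global norm. This is my own illustration; `clip_by_global_norm` and the `max_norm=5.0` threshold are assumptions, not from the lecture.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays if their combined norm exceeds max_norm.

    This only addresses exploding gradients; vanishing gradients cannot be
    fixed by rescaling, because the long-term information is already lost.
    """
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

# Usage together with the hypothetical bptt sketch above:
# loss, dU, dW, dV = bptt(xs, targets, U, W, V, s0, t_trunc=20)
# dU, dW, dV = clip_by_global_norm([dU, dW, dV])
```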