# Scaling Deep Neural Networks
- State-of-the-art models
  - have many parameters
  - are trained on large data sets
- This is challenging in terms of
  - fitting models into GPU memory
  - the time it takes to train a model
- How can we exploit parallelization across multiple GPUs to address both issues?
- Two general types of parallelism:
  - data parallelism: split the data into smaller chunks and have each GPU process a chunk in parallel
  - model parallelism: split the model into sub-models, where each GPU computes only the forward/backward passes of its sub-model
## Data Parallelism
Data parallelism can help address two issues:
- increase speed by performing computations in parallel across multiple GPUs
- reduce the per-GPU memory usage by processing smaller chunks on each GPU
Let's look at a typical language modeling task, where
- all sentences in the training data $D$ are grouped into batches of sentences
- $B \in \mathbb{N}_{0}^{b \times (n+1)}$, where $b$ = batch size and $n+1$ = length of the longest sequence in $B$; each word is already mapped to a unique (positive) integer, and $B_{i,j}=0$ is used for padding
- the input and target output are defined as $X = B[\cdot\,; 1, n]$ and $Y = B[\cdot\,; 2, n+1]$, respectively (here $B[\cdot\,; 2, n+1]$ denotes the slice of the second dimension from index $2$ to $n+1$; hence $X, Y \in \mathbb{N}_{0}^{b \times n}$)
- a (recurrent) neural network $\hat{Y}=\operatorname{net}_{\theta}(X)$ with a loss criterion $\operatorname{crit}(\operatorname{net}_{\theta}(X), Y)$, where $\theta$ are the learnable parameters of the network
- Instead of using $X, Y \in \mathbb{N}_{0}^{b \times n}$ to compute $\operatorname{crit}\left(\right.$ net $\left._{\theta}(X), Y\right)$
  - split $X, Y$ into $m$ smaller sub-batches, where $m$ is the number of available GPUs and $X_{i}, Y_{i} \in \mathbb{N}_{0}^{b_{i} \times n}$, s.t. $b_{1} \approx b_{2} \approx \cdots \approx b_{m}$ and $b=\sum_{i=1}^{m} b_{i}$
  - for $1 \leq i \leq m$: copy $\operatorname{net}_{\theta}$, $\operatorname{crit}$, and $(X_{i}, Y_{i})$ to $\mathrm{GPU}_{i}$
  - in parallel compute: first $\overrightarrow{\operatorname{crit}}_{i}(\underbrace{\overrightarrow{\operatorname{net}}_{i,\theta}(X_{i})}_{=\hat{Y}_{i}}, Y_{i})$ (forward pass)
    and then $\overleftarrow{\operatorname{net}}_{i,\theta}(X_{i}, \overleftarrow{\operatorname{crit}}_{i}(\hat{Y}_{i}, Y_{i}))$ (backward pass),
    which results in the parameter gradients $\nabla_{i} \theta$ (living on $\mathrm{GPU}_{i}$)
  - copy $\nabla_{i} \theta$ to $\mathrm{GPU}_{1}$: $\nabla \theta=\sum_{i=1}^{m} \nabla_{i} \theta$ (optionally also averaged)
  - on $\mathrm{GPU}_{1}$: $\theta \leftarrow \theta+\eta(-\nabla \theta)$
  - pick the next $B$ from $D$ and repeat (a code sketch of this update step follows below)
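A minimal PyTorch sketch of this update step, assuming `net` ($\operatorname{net}_{\theta}$, living on `cuda:0` $= \mathrm{GPU}_{1}$), a criterion `crit` that returns a scalar loss, a learning rate `eta`, and the GPU count `m` as above; in practice one would use `torch.nn.DataParallel` or `DistributedDataParallel`, which implement the same idea more efficiently:

```python
import copy
import torch

def data_parallel_step(net, crit, X, Y, eta, m):
    # net (net_theta) is assumed to live on cuda:0 (= GPU_1);
    # crit is assumed to map (Y_hat_i, Y_i) to a scalar loss
    devices = [torch.device(f"cuda:{i}") for i in range(m)]
    # split X, Y into m roughly equal sub-batches, so that b = sum_i b_i
    Xs, Ys = torch.chunk(X, m, dim=0), torch.chunk(Y, m, dim=0)
    # copy net_theta to every GPU_i (each replica holds its own gradients)
    replicas = [copy.deepcopy(net).to(d) for d in devices]
    grad_sum = None
    for rep, d, Xi, Yi in zip(replicas, devices, Xs, Ys):
        # forward and backward pass on GPU_i; CUDA kernels are launched
        # asynchronously, so work on different GPUs can overlap even though
        # this Python loop is sequential
        loss = crit(rep(Xi.to(d)), Yi.to(d))
        loss.backward()                                # -> grad_i(theta) on GPU_i
        # copy grad_i(theta) to GPU_1 and accumulate the sum
        g = [p.grad.to(devices[0]) for p in rep.parameters()]
        grad_sum = g if grad_sum is None else [s + gi for s, gi in zip(grad_sum, g)]
    # on GPU_1: theta <- theta + eta * (-grad theta)
    with torch.no_grad():
        for p, g in zip(net.parameters(), grad_sum):
            p -= eta * g
```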
How much speed-up do we get from using $m$ GPUs?
- depends mostly on $|\theta|$, which needs to be copied to and from each GPU for each batch (update)
- common observation: 4x the GPUs gives roughly a 3x speed-up
Side-effect of data parallelism: batch sizes
- since $X_{i}, Y_{i}$ are $\frac{1}{m}$ of the size of $X, Y$, they use less memory on $\mathrm{GPU}_{i}$ than $X, Y$ would on a single GPU
- this allows increasing $b$ by a factor of $m$
Larger batch sizes
- exploit efficient matrix computations to a larger degree
- result in more stable batch gradients
- reduce the frequency of parameter updates (the number of updates per epoch is proportional to $\frac{|D|}{b}$) and may require a larger learning rate
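One common heuristic for the last point (an assumption here, not prescribed above) is to scale the learning rate roughly linearly with the effective batch size:

```python
# hypothetical reference setting; scale the learning rate with the batch size
base_lr, base_batch = 0.1, 32
m = 4                                           # number of GPUs
eta = base_lr * (m * base_batch) / base_batch   # e.g. 0.1 -> 0.4 for 4x the batch
```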
## Model Parallelism
Models can use too much memory to run on a single GPU
- GPU memory typically 8-12 GB (24 GB if you're lucky)
- memory usage depends on
  - the number of parameters (see the rough estimate below)
  - the size of the computational graph
    - e.g., when training an LSTM, temporary results for all time steps are stored to compute the backward pass (more efficiently), leading to substantial memory usage
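A rough back-of-the-envelope estimate of the parameter-related part of that memory (assuming float32 parameters and plain SGD; activations and optimizer state come on top):

```python
num_params = 100_000_000                  # e.g. a hypothetical 100M-parameter model
bytes_per_param = 4                       # float32
weights_mb = num_params * bytes_per_param / 1024**2  # ~381 MB for the weights
grads_mb = weights_mb                     # gradients have the same size as the parameters
print(f"weights: {weights_mb:.0f} MB + gradients: {grads_mb:.0f} MB")
```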
Solutions:
- model parallelism: if the network has sub-networks that share no input/output dependencies with each other, compute them in parallel on different GPUs (speed-up and less memory usage; a sketch follows below)
- model distribution: if the network can be split into sub-networks that are computed in strict sequence, compute each sub-network on a different GPU, one after another (no speed-up, but less memory usage)
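A minimal PyTorch sketch of model parallelism, using a hypothetical two-branch architecture (an assumption, not a specific model from above) and assuming at least 2 GPUs:

```python
import torch
import torch.nn as nn

dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")

class TwoBranchNet(nn.Module):
    def __init__(self, d_in=256, d_hidden=512):
        super().__init__()
        # the two branches share no input/output dependencies with each other,
        # so their forward/backward passes can run concurrently
        self.branch_a = nn.Linear(d_in, d_hidden).to(dev0)
        self.branch_b = nn.Linear(d_in, d_hidden).to(dev1)
        self.head = nn.Linear(2 * d_hidden, d_in).to(dev0)

    def forward(self, X):                           # X: (b, d_in) on cuda:0
        # CUDA kernels are launched asynchronously, so the two branches overlap
        A = torch.relu(self.branch_a(X))            # runs on cuda:0
        B = torch.relu(self.branch_b(X.to(dev1)))   # runs on cuda:1
        return self.head(torch.cat([A, B.to(dev0)], dim=-1))
```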
### Model Distribution
- $\operatorname{net}_{\theta}(X)=\operatorname{net}_{m, \theta_{m}}\left(\ldots \operatorname{net}_{2, \theta_{2}}\left(\operatorname{net}_{1, \theta_{1}}(X)\right) \ldots\right)$
- Each $\operatorname{net}_{i, \theta_{i}}$ has its own parameters, where $\theta=\bigcup_{i=1}^{m} \theta_{i}$
- If $\bigcap_{i=1}^{m} \theta_{i}=\emptyset$: each $\mathrm{GPU}_{i}$ can update its own parameters $\theta_{i} \leftarrow \theta_{i}+\eta(-\nabla \theta_{i})$ immediately, i.e., there is no need to accumulate gradients or copy any $\nabla \theta_{i}$ between GPUs (see the sketch at the end of this subsection)
- If $\bigcap_{i=1}^{m} \theta_{i} \neq \emptyset$: copy all $\nabla \theta_{i}$ to $\mathrm{GPU}_{1}$, compute $\nabla \theta=\sum_{i=1}^{m} \nabla \theta_{i}$, update $\theta \leftarrow \theta+\eta(-\nabla \theta)$, and copy the new $\theta_{i}$ values back to $\mathrm{GPU}_{i}$
Slow-down depends on the number of
- parameter gradients that need to be copied between GPUs
- activation values $\left(X_{i}\right)$ and gradients $\left(G_{i}\right)$ that need to be copied between GPUs
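A minimal PyTorch sketch of model distribution with $m=2$, using a hypothetical two-layer LSTM language model (`vocab_size`, `d`, and the chosen split are assumptions, not the setup above):

```python
import torch
import torch.nn as nn

dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")

class DistributedLM(nn.Module):
    def __init__(self, vocab_size=10000, d=512):
        super().__init__()
        # net_1 (embedding + first LSTM layer), parameters theta_1, lives on GPU_1
        self.emb = nn.Embedding(vocab_size, d, padding_idx=0).to(dev0)
        self.rnn1 = nn.LSTM(d, d, batch_first=True).to(dev0)
        # net_2 (second LSTM layer + output projection), parameters theta_2, lives on GPU_2
        self.rnn2 = nn.LSTM(d, d, batch_first=True).to(dev1)
        self.out = nn.Linear(d, vocab_size).to(dev1)

    def forward(self, X):                  # X: (b, n) integer tensor on cuda:0
        H1, _ = self.rnn1(self.emb(X))     # computed on cuda:0
        H1 = H1.to(dev1)                   # copy the activations (an X_i above) to cuda:1
        H2, _ = self.rnn2(H1)              # computed on cuda:1
        return self.out(H2)                # logits of shape (b, n, vocab_size), on cuda:1
```

During the backward pass, autograd copies the gradient of `H1` (a $G_i$ above) back from `cuda:1` to `cuda:0`. Since $\theta_1 \cap \theta_2 = \emptyset$ here, no parameter gradients have to be copied between GPUs: a single optimizer over `net.parameters()` updates each $\theta_i$ in place on its own device.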