# Scaling Deep Neural Networks

- State-of-the-art models
  - have many parameters
  - are trained on large data sets
- This is challenging in terms of
  - fitting models into GPU memory
  - the time it takes to train a model
- How can we exploit parallelization across multiple GPUs to address both issues?
- Two general types of parallelism:
  - data parallelism: split the data into smaller chunks and have each GPU process a chunk in parallel
  - model parallelism: split the model into sub-models where each GPU computes only the forward/backward passes of its sub-model

## Data Parallelism

Data parallelism can help address two issues:

- increase speed by performing computations in parallel across multiple GPUs
- reduce per-GPU memory usage by processing smaller chunks on each GPU

Let's look at a typical language modeling task, where

- all sentences in the training data $D$ are grouped into batches of sentences
- $B \in \mathbb{N}_{0}^{b \times (n+1)}$, where $b$ = batch size and $n+1$ = length of the longest sequence in $B$; each word is already mapped to a unique (positive) integer, and $B_{i,j} = 0$ is used for padding
- the input and target output are defined as $X = B[\cdot\,;\,1,n]$ and $Y = B[\cdot\,;\,2,n+1]$, respectively, where $B[\cdot\,;\,2,n+1]$ denotes the slice of the second dimension from $2$ to $n+1$, hence $X, Y \in \mathbb{N}_{0}^{b \times n}$
- a neural (RNN) network $\hat{Y} = \operatorname{net}_{\theta}(X)$ with a loss criterion $\operatorname{crit}(\operatorname{net}_{\theta}(X), Y)$, where $\theta$ are the learnable parameters of the network

Instead of using $X, Y \in \mathbb{N}_{0}^{b \times n}$ to compute $\operatorname{crit}(\operatorname{net}_{\theta}(X), Y)$:

- split $X, Y$ into $m$ smaller sub-batches, where $m$ is the number of available GPUs and $X_{i}, Y_{i} \in \mathbb{N}_{0}^{b_{i} \times n}$, s.t. $b_{1} \approx b_{2} \approx \cdots \approx b_{m}$ and $b = \sum_{i=1}^{m} b_{i}$
- for $1 \leq i \leq m$: copy $\operatorname{net}_{\theta}$, $\operatorname{crit}$, and $(X_{i}, Y_{i})$ to $\mathrm{GPU}_{i}$
- in parallel compute: first $\overrightarrow{\operatorname{crit}}_{i}(\underbrace{\overrightarrow{\operatorname{net}}_{i,\theta}(X_{i})}_{=\hat{Y}_{i}}, Y_{i})$ (forward pass) and then $\overleftarrow{\operatorname{net}}_{i,\theta}(X_{i}, \overleftarrow{\operatorname{crit}}_{i}(\hat{Y}_{i}, Y_{i}))$ (backward pass), which results in the parameter gradients $\nabla_{i}\theta$ (living on $\mathrm{GPU}_{i}$)
- copy each $\nabla_{i}\theta$ to $\mathrm{GPU}_{1}$: $\nabla\theta = \sum_{i=1}^{m} \nabla_{i}\theta$ (maybe also average)
- on $\mathrm{GPU}_{1}$: $\theta \leftarrow \theta + \eta(-\nabla\theta)$
- pick the next $B$ from $D$ and repeat
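This scheme is what PyTorch's `nn.DataParallel` wrapper implements inside a single process. Below is a minimal sketch; the toy model `LSTMLanguageModel` and all dimensions are placeholder assumptions, and in current PyTorch `DistributedDataParallel` is usually preferred, but `DataParallel` maps most directly onto the steps above.

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Toy RNN language model: embed -> LSTM -> project back to the vocabulary."""
    def __init__(self, vocab_size=10000, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)  # B_ij = 0 is padding
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):                # x: (b, n) integer tokens
        h, _ = self.lstm(self.embed(x))  # h: (b, n, hidden_dim)
        return self.proj(h)              # logits: (b, n, vocab_size)

net = LSTMLanguageModel().cuda()
crit = nn.CrossEntropyLoss(ignore_index=0)       # ignore padded positions in the loss
opt = torch.optim.SGD(net.parameters(), lr=0.1)

# Replicates net onto all visible GPUs, splits X along the batch dimension into
# m sub-batches, runs the forward passes in parallel, and gathers the outputs.
parallel_net = nn.DataParallel(net)

X = torch.randint(1, 10000, (64, 35)).cuda()     # input  X = B[:, 1:n]   (placeholder data)
Y = torch.randint(1, 10000, (64, 35)).cuda()     # target Y = B[:, 2:n+1] (placeholder data)

logits = parallel_net(X)                         # parallel forward passes
loss = crit(logits.reshape(-1, logits.size(-1)), Y.reshape(-1))
opt.zero_grad()
loss.backward()   # parallel backward passes; the summed gradients end up on the first GPU
opt.step()        # theta <- theta + eta * (-grad theta); then repeat with the next batch
```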
How much speed-up do we get from using $m$ GPUs?

- it depends mostly on $|\theta|$, which needs to be copied to and from each GPU for each batch (update)
- common observation: 4x the GPUs yield roughly a 3x speed-up

Side-effect of data parallelism: batch sizes

- since $X_{i}, Y_{i}$ are $\frac{1}{m}$ of the size of $X, Y$, they use less memory on $\mathrm{GPU}_{i}$ than $X, Y$ would on a single GPU
- this allows us to increase $b$ by a factor of $m$

Larger batch sizes

- exploit efficient matrix computations to a larger degree
- result in more stable batch gradients
- reduce the frequency of updates (proportional to data size) and may require a larger learning rate

## Model Parallelism

Models can use too much memory to run on a single GPU:

- GPU memory is typically 8-12 GB (24 GB if you're lucky)
- memory usage depends on
  - the number of parameters
  - the size of the computational graph
  - e.g., when training an LSTM, temporary results for all time steps are stored to compute the backward pass (more efficiently), leading to substantial memory usage

Solutions:

- model parallelism: if the network has sub-networks that share no input/output dependencies with each other, compute them in parallel on different GPUs (speed-up and less memory usage)
- model distribution: if the network has sub-networks that can only be computed in strict sequence, those sub-networks can be computed sequentially on different GPUs (no speed-up, but less memory usage)

### Model Distribution

- $\operatorname{net}_{\theta}(X) = \operatorname{net}_{m,\theta_{m}}(\ldots \operatorname{net}_{2,\theta_{2}}(\operatorname{net}_{1,\theta_{1}}(X)) \ldots)$
- Each $\operatorname{net}_{i,\theta_{i}}$ has its own parameters, where $\theta = \bigcup_{i=1}^{m} \theta_{i}$
- If $\bigcap_{i=1}^{m} \theta_{i} = \emptyset$: we can update $\theta_{i} \leftarrow \theta_{i} + \eta(-\nabla\theta_{i})$ immediately, i.e., there is no need to accumulate gradients and copy any $\nabla\theta_{i}$ between GPUs
- If $\bigcap_{i=1}^{m} \theta_{i} \neq \emptyset$: copy all $\nabla\theta_{i}$ to $\mathrm{GPU}_{1}$, compute $\nabla\theta = \sum_{i=1}^{m} \nabla\theta_{i}$, update $\theta \leftarrow \theta + \eta(-\nabla\theta)$, and copy the new $\theta_{i}$ values back to $\mathrm{GPU}_{i}$

The slow-down depends on the number of

- parameter gradients that need to be copied between GPUs
- activation values ($X_{i}$) and gradients ($G_{i}$) that need to be copied between GPUs
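In PyTorch, model distribution amounts to placing each sub-network on its own device and moving the activations between devices in `forward`; autograd then routes the activation gradients back across the device boundary during `backward`. Below is a minimal sketch assuming two GPUs (`cuda:0`, `cuda:1`); the two-part model `DistributedLM` and all dimensions are placeholder assumptions.

```python
import torch
import torch.nn as nn

class DistributedLM(nn.Module):
    """net_2(net_1(X)): embedding + first LSTM on GPU 0, second LSTM + output layer on GPU 1."""
    def __init__(self, vocab_size=10000, emb_dim=256, hidden_dim=512):
        super().__init__()
        # sub-network 1 (theta_1) lives on the first GPU
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0).to("cuda:0")
        self.lstm1 = nn.LSTM(emb_dim, hidden_dim, batch_first=True).to("cuda:0")
        # sub-network 2 (theta_2) lives on the second GPU
        self.lstm2 = nn.LSTM(hidden_dim, hidden_dim, batch_first=True).to("cuda:1")
        self.proj = nn.Linear(hidden_dim, vocab_size).to("cuda:1")

    def forward(self, x):
        h, _ = self.lstm1(self.embed(x.to("cuda:0")))  # forward pass of net_1 on GPU 0
        h, _ = self.lstm2(h.to("cuda:1"))              # copy activations X_1, run net_2 on GPU 1
        return self.proj(h)

net = DistributedLM()
crit = nn.CrossEntropyLoss(ignore_index=0)
# theta_1 and theta_2 are disjoint, so each parameter is updated in place on its own
# GPU and no parameter gradients have to be copied between devices
opt = torch.optim.SGD(net.parameters(), lr=0.1)

X = torch.randint(1, 10000, (64, 35))                # input  X = B[:, 1:n]   (placeholder data)
Y = torch.randint(1, 10000, (64, 35)).to("cuda:1")   # target Y = B[:, 2:n+1], on the output device

logits = net(X)                                      # sequential forward across both GPUs
loss = crit(logits.reshape(-1, logits.size(-1)), Y.reshape(-1))
opt.zero_grad()
loss.backward()   # autograd copies the activation gradients G_i back across the device boundary
opt.step()
```

Only the activation tensor $X_{1}$ and its gradient cross the GPU boundary per batch, which is exactly the copy cost listed in the slow-down discussion above.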