# Adaptive Learning Rate Optimizers

Adam and similar adaptive methods (AdaGrad, RMSprop) are so successful because they provide a cheap approximation to accounting for local curvature. They generally rely on the fact that directions with consistently large gradients often correspond to directions of high curvature:

- High curvature $\rightarrow$ large gradients $\rightarrow$ smaller effective step size
- Low curvature $\rightarrow$ small gradients $\rightarrow$ larger effective step size

By reducing the effective learning rate along high-curvature directions, adaptive LR optimizers help prevent problems in complex loss landscapes - without ever computing a Hessian!

## SGD with momentum

We don't switch the update direction all the time. Instead we maintain "momentum" from previous updates, which helps dampen oscillations.

$
\begin{array}{l}
u_{t+1}=\gamma u_{t}-\eta_{t} g_{t} \\
w_{t+1}=w_{t}+u_{t+1}
\end{array}
$

The exponential averaging keeps a steady direction by cancelling out oscillating gradients, and it gives more weight to recent updates, as can be seen in the example below:

$
\begin{array}{l}
\text { Example: } \gamma=0.9 \text { and } u_{0}=0 \\
u_{1} \propto-g_{1} \\
u_{2} \propto-0.9 g_{1}-g_{2} \\
u_{3} \propto-0.81 g_{1}-0.9 g_{2}-g_{3}
\end{array}
$

This leads to more robust gradients and learning, and hence to faster convergence.

## RMSprop

In RMSprop each component of the gradient is rescaled by an exponentially decaying sum of its magnitude history. This means large gradients get dampened and small ones amplified, relative to each other.

$
\begin{array}{l}
r_{t}=\alpha r_{t-1}+(1-\alpha) g_{t}^{2} \\
u_{t}=-\frac{\eta}{\sqrt{r_{t}}+\varepsilon} g_{t} \\
w_{t+1}=w_{t}+u_{t}
\end{array}
$

Here, $\alpha$ is a decay hyperparameter (usually 0.9). Its effects can be summarized as:

1. Large gradients, e.g. on a "noisy" loss surface: updates are tamed
2. Small gradients, e.g. when stuck in a plateau of the loss surface: updates become more aggressive
3. In effect, it performs a kind of simulated annealing

![[rmsprop-gradients.jpg]]

RMSprop helps in situations with heavy ravines/plateaus.

## Adam

Adam is one of the most popular learning algorithms. It uses a first-moment estimate of the gradient as the direction of its update, like SGD with momentum, and then rescales the components of that estimate by an estimate of the second moment, like RMSprop.

$
\begin{array}{c}
m_{t}=\beta_{1} m_{t-1}+\left(1-\beta_{1}\right) g_{t} \\
v_{t}=\beta_{2} v_{t-1}+\left(1-\beta_{2}\right) g_{t}^{2} \\
\hat{m}_{t}=\frac{m_{t}}{1-\beta_{1}^{t}}, \quad \hat{v}_{t}=\frac{v_{t}}{1-\beta_{2}^{t}} \\
u_{t}=-\frac{\eta}{\sqrt{\hat{v}_{t}}+\varepsilon} \hat{m}_{t} \\
w_{t+1}=w_{t}+u_{t}
\end{array}
$

In addition, for both moments, Adam corrects the estimates ($\hat{m}_{t}$ and $\hat{v}_{t}$) to account for the bias introduced by the zero initialization, when only a few samples have been seen at the beginning of the procedure.

Adam works really well and is the go-to choice in complex models. However, Adam tends to "over-optimize": if you expect the data to be noisy, Adam might converge to a suboptimal solution, and you may be better off with SGD with momentum.

## Adagrad

Update rule:

$
r_{t}=\sum_{\tau \leq t} g_{\tau}^{2} \Rightarrow w_{t+1}=w_{t}-\eta \frac{g_{t}}{\sqrt{r_{t}}+\varepsilon}
$

Because the accumulated sum $r_{t}$ only grows, the effective step sizes become gradually smaller and smaller.

## Nesterov Momentum

Use the future gradient instead of the current gradient:

$
\begin{aligned}
w_{t+0.5} &=w_{t}+\gamma u_{t} \\
u_{t+1} &=\gamma u_{t}-\eta_{t} \nabla_{w_{t+0.5}} \mathcal{L} \\
w_{t+1} &=w_{t}+u_{t+1}
\end{aligned}
$

It has better theoretical convergence guarantees, and it generally works well with convolutional neural networks.
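
To make the update rules above concrete, here is a minimal NumPy sketch of a single update step for each optimizer. The state each function carries ($u$, $r$, $m$, $v$) mirrors the symbols in the equations; the function names, hyperparameter defaults, and the toy `grad_fn` example at the bottom are illustrative assumptions rather than part of the notes.

```python
# Minimal sketch of the per-step update rules above (not a full training loop).
import numpy as np

def sgd_momentum_step(w, u, g, lr=1e-2, gamma=0.9):
    # u_{t+1} = gamma * u_t - eta * g_t ;  w_{t+1} = w_t + u_{t+1}
    u = gamma * u - lr * g
    return w + u, u

def rmsprop_step(w, r, g, lr=1e-3, alpha=0.9, eps=1e-8):
    # Decaying average of squared gradients rescales each component.
    r = alpha * r + (1 - alpha) * g ** 2
    u = -lr * g / (np.sqrt(r) + eps)
    return w + u, r

def adam_step(w, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Momentum-style first moment as direction, RMSprop-style second moment
    # for rescaling, plus bias correction for the zero initialization (t >= 1).
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    u = -lr * m_hat / (np.sqrt(v_hat) + eps)
    return w + u, m, v

def adagrad_step(w, r, g, lr=1e-2, eps=1e-8):
    # Non-decaying sum of squared gradients: effective steps shrink over time.
    r = r + g ** 2
    u = -lr * g / (np.sqrt(r) + eps)
    return w + u, r

def nesterov_step(w, u, grad_fn, lr=1e-2, gamma=0.9):
    # Evaluate the gradient at the look-ahead point w_{t+0.5} = w_t + gamma * u_t.
    g_ahead = grad_fn(w + gamma * u)
    u = gamma * u - lr * g_ahead
    return w + u, u

# Toy usage (hypothetical): minimize f(w) = ||w||^2 with Adam.
grad_fn = lambda w: 2.0 * w
w = np.array([5.0, -3.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 501):
    w, m, v = adam_step(w, m, v, grad_fn(w), t, lr=1e-1)
print(w)  # ends up close to the minimum at [0, 0]
```

In practice you would use the built-in optimizers of your framework; the point of the sketch is only to show how little per-parameter state each method keeps and how the equations translate to code.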