# Second-order optimization
With first-order derivatives, i.e. gradients, all weights are updated with the same "aggressiveness". Often some parameters could use more "teaching" while others are already nearly there.
Second-order derivatives, i.e. the curvature of the loss surface, can help us adapt the update to each dimension. We can do this by adapting the learning rate per parameter:
$
w_{t+1}=w_{t}-H_{\mathcal{L}}^{-1} \eta_{t} g_{t}
$
$H_{\mathcal{L}}$ is the Hessian matrix of $\mathcal{L}$, i.e. the matrix of second-order partial derivatives
$
H_{\mathcal{L}}^{i j}=\frac{\partial^{2} \mathcal{L}}{\partial w_{i} \partial w_{j}}
$
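As a toy illustration (not from the lecture), here is one exact Newton step in PyTorch; the two-parameter `loss_fn`, the starting point, and the step size are made-up assumptions for this sketch:

```python
import torch

# Toy loss with very different curvature per dimension (made up for illustration).
def loss_fn(w):
    return 5 * w[0] ** 2 + 0.5 * w[1] ** 2

w = torch.tensor([1.0, -2.0], requires_grad=True)
eta = 1.0  # with an exact Hessian, a single step with eta=1 minimizes a quadratic exactly

loss = loss_fn(w)
g = torch.autograd.grad(loss, w)[0]                 # gradient g_t
H = torch.autograd.functional.hessian(loss_fn, w)   # Hessian H_L
step = torch.linalg.solve(H, g)                     # H^{-1} g_t, without forming the inverse
with torch.no_grad():
    w -= eta * step                                 # w_{t+1} = w_t - eta_t * H^{-1} g_t
```

Solving $H x = g$ is preferable to explicitly inverting $H$; for the toy quadratic above, the single step lands exactly on the minimum.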
Computing the Hessian is very expensive because deep models have so many parameters (the Hessian has one entry per pair of parameters), and inverting it has cubic complexity in the number of parameters.
We can approximate the Hessian instead, e.g. with the L-BFGS algorithm, which keeps a history of past gradients to approximate the inverse Hessian. L-BFGS works well with full-batch Gradient Descent, but not so well with [[Stochastic Gradient Descent]], since noisy minibatch gradients corrupt the curvature estimate. In practice, [[Optimizers#SGD with momentum]] often works just fine. A minimal usage sketch is below.
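PyTorch ships an L-BFGS implementation, `torch.optim.LBFGS`; here is a minimal sketch on the same toy loss as above (note that this optimizer requires a closure that re-evaluates the loss on every step):

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
# history_size controls how many past gradients are kept for the inverse-Hessian approximation
optimizer = torch.optim.LBFGS([w], lr=1.0, history_size=10, max_iter=20)

def closure():
    optimizer.zero_grad()
    loss = 5 * w[0] ** 2 + 0.5 * w[1] ** 2   # same toy loss as above
    loss.backward()
    return loss

for _ in range(5):
    optimizer.step(closure)
```

The limited gradient history is what makes L-BFGS "limited-memory": it never stores or inverts the full Hessian.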
---
## References
1. Lecture 3.1, UvA DL course 2020