# Stochastic Gradient Descent
Batch techniques, such as [[Linear Regression via Maximum Likelihood]], involve processing the entire training set in one go, which can be computationally infeasible.
Stochastic gradient descent is a type of sequential learning algorithm in which data points are considered one at a time, and the model parameters are updated after each such presentation.
If the error function comprises a sum over data points, $E=\sum_{n} E_{n}$, then the optimization is a good candidate for SGD.
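For example, the sum-of-squares error used in [[Linear Regression via Maximum Likelihood]] decomposes in exactly this way; writing $\boldsymbol{\phi}(\mathbf{x}_{n})$ for the basis-function vector of the $n$-th data point (notation assumed here for illustration):
$E_{D}(\mathbf{w})=\sum_{n=1}^{N} E_{n}, \quad E_{n}=\frac{1}{2}\left(t_{n}-\mathbf{w}^{\top} \boldsymbol{\phi}(\mathbf{x}_{n})\right)^{2}$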
The parameter vector $\mathbf{w}$ is updated at iteration $\tau$ with step size $\eta$ as follows:
$\mathbf{w}^{(\tau+1)}=\mathbf{w}^{(\tau)}-\eta \nabla E_{n}$
The gradient of the error function, $\nabla E_{n}$, encodes all the directional derivatives. It is always perpendicular to the contours of the error function and points in the direction of steepest ascent, so we update the parameters in the opposite direction, scaled by the step size $\eta$.
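As an illustration, here is a minimal Python sketch of these per-point updates for linear regression with a sum-of-squares error; the synthetic data, step size, and number of passes are assumptions chosen only for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: targets generated from a known weight vector plus noise (assumed setup).
true_w = np.array([2.0, -1.0])
X = rng.normal(size=(100, 2))          # design matrix, one row per data point
t = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(2)                        # initial parameter vector w^(0)
eta = 0.05                             # step size

for epoch in range(20):                # repeated passes over the data
    for n in rng.permutation(len(X)):  # present data points one at a time
        # Per-point gradient of E_n = 1/2 * (t_n - w . x_n)^2 with respect to w
        grad_En = -(t[n] - w @ X[n]) * X[n]
        w = w - eta * grad_En          # w^(tau+1) = w^(tau) - eta * grad E_n

print(w)                               # should be close to true_w
```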
## Convergence guarantee
If $E_{D}(\mathbf{w})$ is convex in $\mathbf{w}$ and $\eta$ is small enough, stochastic gradient descent is guaranteed to converge to the optimum.
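More precisely, a fixed step size only brings the iterates into a neighbourhood of the optimum; convergence in the strict sense requires a decreasing step size sequence $\eta_{\tau}$ satisfying the Robbins-Monro conditions, for example $\eta_{\tau}=\eta_{0} /(1+\tau)$:
$\sum_{\tau=1}^{\infty} \eta_{\tau}=\infty, \quad \sum_{\tau=1}^{\infty} \eta_{\tau}^{2}<\infty$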
---