# Regularized Least Squares
When the learned parameter values become very large, the model tends to overfit the training data. To prevent this, we can introduce a heuristically motivated term in the error function whose goal is to penalize large values in the weight vector.
$
\tilde{E}(\mathbf{w})=\frac{1}{2} \sum_{i=1}^{N}\left\{t_{i}-y\left(\mathbf{x}_{i}, \mathbf{w}\right)\right\}^{2}+\frac{1}{2} \lambda \mathbf{w}^{T} \mathbf{w}
$
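For a model that is linear in its parameters, $y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T \phi(\mathbf{x})$, this regularized error has the closed-form minimizer $\mathbf{w} = (\lambda \mathbf{I} + \Phi^T \Phi)^{-1} \Phi^T \mathbf{t}$, where $\Phi$ is the design matrix. Below is a minimal NumPy sketch of that solution; the sinusoidal toy data and polynomial basis are illustrative assumptions, not part of the text.

```python
import numpy as np

# Toy data: noisy samples of sin(2*pi*x) (illustrative assumption)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Design matrix of polynomial basis functions phi_j(x) = x^j
degree = 9
Phi = np.vander(x, degree + 1, increasing=True)   # shape (N, M)

# Closed-form regularized least-squares solution:
# w = (lambda * I + Phi^T Phi)^{-1} Phi^T t
lam = 1e-3
M = Phi.shape[1]
w = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

print(w)  # larger lam shrinks the weights toward zero
```

Note that the matrix $\lambda \mathbf{I} + \Phi^T \Phi$ is always invertible for $\lambda > 0$, which is one practical benefit of the regularizer.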
Note that this is equivalent to [[Maximum A Posteriori (MAP)]] estimation of $\mathbf{w}$ with a Gaussian prior. We can also observe that
$
\lambda = \frac{\alpha}{\beta}
$
This can be interpreted as $\alpha$ (the precision of the prior) representing how much confidence we have in the prior belief that the weights are small, and $\beta$ (the noise precision) representing how much confidence we have in the observed data fitting the model.
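To see where $\lambda = \alpha/\beta$ comes from, here is a short sketch assuming the usual zero-mean isotropic Gaussian prior $p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$ and a Gaussian likelihood with noise precision $\beta$: the negative log posterior is
$
-\ln p(\mathbf{w} \mid \mathbf{t}) = \frac{\beta}{2} \sum_{i=1}^{N}\left\{t_{i}-y\left(\mathbf{x}_{i}, \mathbf{w}\right)\right\}^{2} + \frac{\alpha}{2} \mathbf{w}^{T} \mathbf{w} + \text{const}
$
Dividing through by $\beta$ recovers $\tilde{E}(\mathbf{w})$ above with $\lambda = \alpha/\beta$, so maximizing the posterior is the same as minimizing the regularized error.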
In more general terms, the regularized error function can be written as:
$
\hat{E}(\mathbf{w})=\frac{1}{2} \sum_{i=1}^{N}\left(t_{i}-\mathbf{w}^{T} \phi\left(\mathbf{x}_{i}\right)\right)^{2}+\frac{\lambda}{2} \sum_{j=1}^{M}\left|w_{j}\right|^{q}
$
When $q=1$, the regularization term is known as _lasso_. It encourages sparsity in $\mathbf{w}$.
When $q=2$, the term is called _ridge_ in the statistics literature or _weight decay_ in machine learning. It penalizes large weights, shrinking them toward zero without setting them exactly to zero.
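A quick way to see the difference is to fit both penalties on the same data and compare the learned coefficients: lasso drives many of them exactly to zero, while ridge only shrinks them. The sketch below uses scikit-learn, whose `alpha` parameter plays the role of $\lambda$ (up to library-specific scaling of the data term); the sparse toy problem is an illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Toy regression problem where only the first two features matter (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_w = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_w + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # q = 2 penalty: shrinks all weights
lasso = Lasso(alpha=0.1).fit(X, y)   # q = 1 penalty: sets many weights to exactly zero

print("ridge:", np.round(ridge.coef_, 3))
print("lasso:", np.round(lasso.coef_, 3))
```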
---