# Regularized Least Squares
When the learned parameter values become very large, the model tends to overfit the training data. To prevent this, we can introduce a heuristically motivated term in the error function whose goal is to penalize large values in the weight vector.
$
\tilde{E}(\mathbf{w})=\frac{1}{2} \sum_{i=1}^{N}\left\{t_{i}-y\left(\mathbf{x}_{i}, \mathbf{w}\right)\right\}^{2}+\frac{1}{2} \lambda \mathbf{w}^{T} \mathbf{w}
$
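For a model that is linear in its parameters, $y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T \phi(\mathbf{x})$, this regularized error has the closed-form minimizer $\mathbf{w} = (\lambda \mathbf{I} + \Phi^T \Phi)^{-1} \Phi^T \mathbf{t}$, where $\Phi$ is the design matrix. Below is a minimal NumPy sketch of that solution; the sinusoidal toy data and polynomial basis are illustrative assumptions, not part of the text.

```python
import numpy as np

# Toy data: noisy samples of sin(2*pi*x) (illustrative assumption)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Design matrix of polynomial basis functions phi_j(x) = x^j
degree = 9
Phi = np.vander(x, degree + 1, increasing=True)   # shape (N, M)

# Closed-form regularized least-squares solution:
# w = (lambda * I + Phi^T Phi)^{-1} Phi^T t
lam = 1e-3
M = Phi.shape[1]
w = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

print(w)  # larger lam shrinks the weights toward zero
```

Note that the matrix $\lambda \mathbf{I} + \Phi^T \Phi$ is always invertible for $\lambda > 0$, which is one practical benefit of the regularizer.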
Note that this is equivalent to [[Maximum A Posteriori (MAP)]] estimation of $\mathbf{w}$ with a Gaussian prior. We can also observe that
$
\lambda = \frac{\alpha}{\beta}
$
This can be interpreted as $\alpha$ (the precision of the prior) representing how much confidence we have in the prior belief that the weights are small, and $\beta$ (the noise precision) representing how much confidence we have in the observed data fitting the model.
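To see where $\lambda = \alpha/\beta$ comes from, here is a short sketch assuming the usual zero-mean isotropic Gaussian prior $p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$ and a Gaussian likelihood with noise precision $\beta$: the negative log posterior is
$
-\ln p(\mathbf{w} \mid \mathbf{t}) = \frac{\beta}{2} \sum_{i=1}^{N}\left\{t_{i}-y\left(\mathbf{x}_{i}, \mathbf{w}\right)\right\}^{2} + \frac{\alpha}{2} \mathbf{w}^{T} \mathbf{w} + \text{const}
$
Dividing through by $\beta$ recovers $\tilde{E}(\mathbf{w})$ above with $\lambda = \alpha/\beta$, so maximizing the posterior is the same as minimizing the regularized error.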
In more general terms, the regularized error function can be written as:
$
\hat{E}(\mathbf{w})=\frac{1}{2} \sum_{i=1}^{N}\left(t_{i}-\mathbf{w}^{T} \phi\left(\mathbf{x}_{i}\right)\right)^{2}+\frac{\lambda}{2} \sum_{j=1}^{M}\left|w_{j}\right|^{q}
$
When $q=1$, the regularization term is known as _lasso_. It encourages sparsity in $\mathbf{w}$.
When $q=2$, the term is called _ridge_ in the statistics literature or _weight decay_ in machine learning. It penalizes large weights, shrinking them toward zero without setting them exactly to zero.
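A quick way to see the difference is to fit both penalties on the same data and compare the learned coefficients: lasso drives many of them exactly to zero, while ridge only shrinks them. The sketch below uses scikit-learn, whose `alpha` parameter plays the role of $\lambda$ (up to library-specific scaling of the data term); the sparse toy problem is an illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Toy regression problem where only the first two features matter (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_w = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_w + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # q = 2 penalty: shrinks all weights
lasso = Lasso(alpha=0.1).fit(X, y)   # q = 1 penalty: sets many weights to exactly zero

print("ridge:", np.round(ridge.coef_, 3))
print("lasso:", np.round(lasso.coef_, 3))
```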
---