# Linear Regression via Maximum Likelihood
Assume the target variable $t$ is given by a deterministic function $y(\mathbf{x}, \mathbf{w})$ plus additive Gaussian noise, so that
$
t=y(\mathbf{x}, \mathbf{w})+\epsilon
$
where $\epsilon$ is zero-mean Gaussian noise. We can therefore write
$
p(t \mid \mathbf{x}, \mathbf{w}, \beta)=\mathcal{N}\left(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1}\right)
$
where $\beta$ is the precision (inverse variance) parameter.
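Under a squared loss function, the optimal prediction for a new input $\mathbf{x}$ is the conditional mean of the target, which for this Gaussian model is simply the deterministic part of the model (cf. Bishop 3.1.1):
$
\mathbb{E}[t \mid \mathbf{x}]=\int t \, p(t \mid \mathbf{x}) \, \mathrm{d} t=y(\mathbf{x}, \mathbf{w})
$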
The data matrix is given by $\mathbf{X}=\left\{\mathbf{x}_{1}, \ldots, \mathbf{x}_{N}\right\}$ and the target vector by $\mathbf{t}=\left(t_{1}, \ldots, t_{N}\right)^{\mathrm{T}}$.
Assuming the data samples are drawn independently, and taking the linear basis function model $y(\mathbf{x}, \mathbf{w})=\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x})$, the likelihood is given by:
$
p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)=\prod_{n=1}^{N} \mathcal{N}\left(t_{n} \mid \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right), \beta^{-1}\right)
$
The log likelihood is:
$
\begin{aligned}
\ln p(\mathbf{t} \mid \mathbf{w}, \beta) &=\sum_{n=1}^{N} \ln \mathcal{N}\left(t_{n} \mid \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right), \beta^{-1}\right) \\
&=\frac{N}{2} \ln \beta-\frac{N}{2} \ln (2 \pi)-\beta E_{D}(\mathbf{w})
\end{aligned}
$
where the sum-of-squares error is given by:
$
E_{D}(\mathbf{w})=\frac{1}{2} \sum_{n=1}^{N}\left\{t_{n}-\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right)\right\}^{2}
$
To prevent the error from growing with the dataset size, and to keep it in the same units as the target variable, we often report the root-mean-square error:
$
E_{D}^{\mathrm{RMSE}}(\mathbf{w})=\sqrt{\frac{1}{N} \sum_{n=1}^{N}\left\{t_{n}-\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right)\right\}^{2}}
$
We can now use maximum likelihood to determine $\mathbf{w}$ and $\beta$.
![[Pasted image 1.png]]
![[Pasted image 2.png]]
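For reference, maximizing this log likelihood with respect to $\mathbf{w}$ and $\beta$ gives the standard closed-form results (Bishop 3.1.1), where $\mathbf{\Phi}$ is the $N \times M$ design matrix with elements $\Phi_{n j}=\phi_{j}\left(\mathbf{x}_{n}\right)$:
$
\begin{aligned}
\mathbf{w}_{\mathrm{ML}} &=\left(\mathbf{\Phi}^{\mathrm{T}} \mathbf{\Phi}\right)^{-1} \mathbf{\Phi}^{\mathrm{T}} \mathbf{t} \\
\frac{1}{\beta_{\mathrm{ML}}} &=\frac{1}{N} \sum_{n=1}^{N}\left\{t_{n}-\mathbf{w}_{\mathrm{ML}}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right)\right\}^{2}
\end{aligned}
$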
The same result can also be derived directly in matrix form:
![[Pasted image 3.png]]
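A sketch of the matrix-form argument (standard least-squares algebra, using the same notation as above):
$
E_{D}(\mathbf{w})=\frac{1}{2}(\mathbf{t}-\mathbf{\Phi} \mathbf{w})^{\mathrm{T}}(\mathbf{t}-\mathbf{\Phi} \mathbf{w}), \qquad \nabla_{\mathbf{w}} E_{D}=\mathbf{\Phi}^{\mathrm{T}} \mathbf{\Phi} \mathbf{w}-\mathbf{\Phi}^{\mathrm{T}} \mathbf{t}=\mathbf{0} \quad \Rightarrow \quad \mathbf{w}_{\mathrm{ML}}=\left(\mathbf{\Phi}^{\mathrm{T}} \mathbf{\Phi}\right)^{-1} \mathbf{\Phi}^{\mathrm{T}} \mathbf{t}
$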
Thus we find that maximizing the likelihood under a conditional Gaussian noise distribution for a linear model is equivalent to minimizing the sum-of-squares error function.
Note that the quantity $\mathbf{\Phi}^{\dagger} \equiv\left(\mathbf{\Phi}^{\mathrm{T}} \mathbf{\Phi}\right)^{-1} \mathbf{\Phi}^{\mathrm{T}}$ is known as the Moore–Penrose pseudo-inverse of the matrix $\mathbf{\Phi}$. It can be regarded as a generalization of the matrix inverse to non-square matrices.
The drawbacks of such a direct solution are:
1. In practice, solving the normal equations directly can lead to numerical difficulties when $\mathbf{\Phi}^{\mathrm{T}} \mathbf{\Phi}$ is close to singular (nearly zero determinant). These difficulties can be addressed using the singular value decomposition (SVD), as in the sketch below.
2. The matrix inversion scales cubically with the number of basis functions, so it is computationally expensive for large models.
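As a concrete illustration of both points, here is a minimal NumPy sketch comparing the direct normal-equations solve with an SVD-based least-squares solve; the polynomial basis, the synthetic data, and names such as `design_matrix` are assumptions for illustration, not from the text above:
```python
import numpy as np

def design_matrix(x, degree):
    # Polynomial basis functions: Phi[n, j] = x_n ** j (illustrative choice of basis)
    return np.vander(x, degree + 1, increasing=True)

# Synthetic 1-D data: t = sin(2*pi*x) + zero-mean Gaussian noise (std 0.2, i.e. beta = 25)
rng = np.random.default_rng(0)
N = 50
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, N)

Phi = design_matrix(x, degree=3)

# Route 1: solve the normal equations directly; ill-conditioned if Phi^T Phi is near-singular
w_direct = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Route 2: SVD-based least squares via np.linalg.lstsq, numerically more robust
w_svd, *_ = np.linalg.lstsq(Phi, t, rcond=None)

residuals = t - Phi @ w_svd
beta_ml = 1.0 / np.mean(residuals ** 2)  # ML estimate of the noise precision
rmse = np.sqrt(np.mean(residuals ** 2))  # RMSE as defined above

print("w (normal equations):", w_direct)
print("w (SVD / lstsq):", w_svd)
print("beta_ML:", beta_ml, "RMSE:", rmse)
```
`np.linalg.lstsq` delegates to an SVD-based LAPACK routine, which handles a near-singular $\mathbf{\Phi}^{\mathrm{T}} \mathbf{\Phi}$ more gracefully than forming and inverting it explicitly.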
---
## References
1. Bishop, *Pattern Recognition and Machine Learning*, Section 3.1.1.