# Linear Regression via Maximum Likelihood
Assume the target variable $t$ is given by a deterministic function $y(\mathbf{x}, \mathbf{w})$ plus additive Gaussian noise, so that
$
t=y(\mathbf{x}, \mathbf{w})+\epsilon
$
where $\epsilon$ is zero-mean Gaussian noise. We can therefore write
$
p(t \mid \mathbf{x}, \mathbf{w}, \beta)=\mathcal{N}\left(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1}\right)
$
where $\beta$ is the precision (inverse variance) parameter.
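Under a squared loss function, the optimal prediction for a new input $\mathbf{x}$ is the conditional mean of the target, which for this Gaussian model is simply the deterministic part of the model (cf. Bishop 3.1.1):
$
\mathbb{E}[t \mid \mathbf{x}]=\int t \, p(t \mid \mathbf{x}) \, \mathrm{d} t=y(\mathbf{x}, \mathbf{w})
$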
The data matrix is given by $\mathbf{X}=\left\{\mathbf{x}_{1}, \ldots, \mathbf{x}_{N}\right\}$ and the target vector by $\mathbf{t}=\left(t_{1}, \ldots, t_{N}\right)^{\mathrm{T}}$.
Assuming the data samples are drawn independently, and taking the linear basis function model $y(\mathbf{x}, \mathbf{w})=\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x})$, the likelihood is given by:
$
p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)=\prod_{n=1}^{N} \mathcal{N}\left(t_{n} \mid \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right), \beta^{-1}\right)
$
The log likelihood is:
$
\begin{aligned}
\ln p(\mathbf{t} \mid \mathbf{w}, \beta) &=\sum_{n=1}^{N} \ln \mathcal{N}\left(t_{n} \mid \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right), \beta^{-1}\right) \\
&=\frac{N}{2} \ln \beta-\frac{N}{2} \ln (2 \pi)-\beta E_{D}(\mathbf{w})
\end{aligned}
$
where the sum-of-squares error is given by:
$
E_{D}(\mathbf{w})=\frac{1}{2} \sum_{n=1}^{N}\left\{t_{n}-\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right)\right\}^{2}
$
To prevent the error from growing with the dataset size, and to keep it in the same units as the target variable, we often report the root-mean-square error:
$
E_{D}^{\mathrm{RMSE}}(\mathbf{w})=\sqrt{\frac{1}{N} \sum_{n=1}^{N}\left\{t_{n}-\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right)\right\}^{2}}
$
We can now use maximum likelihood to determine $\mathbf{w}$ and $\beta$.
![[Pasted image 1.png]]
![[Pasted image 2.png]]
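For reference, maximizing this log likelihood with respect to $\mathbf{w}$ and $\beta$ gives the standard closed-form results (Bishop 3.1.1), where $\mathbf{\Phi}$ is the $N \times M$ design matrix with elements $\Phi_{n j}=\phi_{j}\left(\mathbf{x}_{n}\right)$:
$
\begin{aligned}
\mathbf{w}_{\mathrm{ML}} &=\left(\mathbf{\Phi}^{\mathrm{T}} \mathbf{\Phi}\right)^{-1} \mathbf{\Phi}^{\mathrm{T}} \mathbf{t} \\
\frac{1}{\beta_{\mathrm{ML}}} &=\frac{1}{N} \sum_{n=1}^{N}\left\{t_{n}-\mathbf{w}_{\mathrm{ML}}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right)\right\}^{2}
\end{aligned}
$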
The same result can also be derived directly in matrix form:
![[Pasted image 3.png]]
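A sketch of the matrix-form argument (standard least-squares algebra, using the same notation as above):
$
E_{D}(\mathbf{w})=\frac{1}{2}(\mathbf{t}-\mathbf{\Phi} \mathbf{w})^{\mathrm{T}}(\mathbf{t}-\mathbf{\Phi} \mathbf{w}), \qquad \nabla_{\mathbf{w}} E_{D}=\mathbf{\Phi}^{\mathrm{T}} \mathbf{\Phi} \mathbf{w}-\mathbf{\Phi}^{\mathrm{T}} \mathbf{t}=\mathbf{0} \quad \Rightarrow \quad \mathbf{w}_{\mathrm{ML}}=\left(\mathbf{\Phi}^{\mathrm{T}} \mathbf{\Phi}\right)^{-1} \mathbf{\Phi}^{\mathrm{T}} \mathbf{t}
$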
Thus we find that maximizing the likelihood under a conditional Gaussian noise distribution for a linear model is equivalent to minimizing the sum-of-squares error function.
Note that the quantity $\mathbf{\Phi}^{\dagger} \equiv\left(\mathbf{\Phi}^{\mathrm{T}} \mathbf{\Phi}\right)^{-1} \mathbf{\Phi}^{\mathrm{T}}$ is known as the Moore–Penrose pseudo-inverse of the matrix $\mathbf{\Phi}$. It can be regarded as a generalization of the matrix inverse to non-square matrices.
The drawbacks of such a direct solution are:
1. In practice, solving the normal equations directly can lead to numerical difficulties when $\mathbf{\Phi}^{\mathrm{T}} \mathbf{\Phi}$ is close to singular (nearly zero determinant). These difficulties can be addressed using the singular value decomposition (SVD), as in the sketch below.
2. The matrix inversion scales cubically with the number of basis functions, so it is computationally expensive for large models.
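As a concrete illustration of both points, here is a minimal NumPy sketch comparing the direct normal-equations solve with an SVD-based least-squares solve; the polynomial basis, the synthetic data, and names such as `design_matrix` are assumptions for illustration, not from the text above:
```python
import numpy as np

def design_matrix(x, degree):
    # Polynomial basis functions: Phi[n, j] = x_n ** j (illustrative choice of basis)
    return np.vander(x, degree + 1, increasing=True)

# Synthetic 1-D data: t = sin(2*pi*x) + zero-mean Gaussian noise (std 0.2, i.e. beta = 25)
rng = np.random.default_rng(0)
N = 50
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, N)

Phi = design_matrix(x, degree=3)

# Route 1: solve the normal equations directly; ill-conditioned if Phi^T Phi is near-singular
w_direct = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Route 2: SVD-based least squares via np.linalg.lstsq, numerically more robust
w_svd, *_ = np.linalg.lstsq(Phi, t, rcond=None)

residuals = t - Phi @ w_svd
beta_ml = 1.0 / np.mean(residuals ** 2)  # ML estimate of the noise precision
rmse = np.sqrt(np.mean(residuals ** 2))  # RMSE as defined above

print("w (normal equations):", w_direct)
print("w (SVD / lstsq):", w_svd)
print("beta_ML:", beta_ml, "RMSE:", rmse)
```
`np.linalg.lstsq` delegates to an SVD-based LAPACK routine, which handles a near-singular $\mathbf{\Phi}^{\mathrm{T}} \mathbf{\Phi}$ more gracefully than forming and inverting it explicitly.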
---
## References
1. Bishop, *Pattern Recognition and Machine Learning*, Section 3.1.1.