# Bayesian Linear Regression
We have seen that [[Maximum Likelihood Estimation]] leads to excessively complex models and overfitting. We can use hold-out data to set a regularization coefficient that controls model complexity, but this is wasteful of valuable data.
The advantage of [[Bayesian Estimation]] of linear regression is that it avoids the overfitting problem of maximum likelihood and leads to automatic methods of determining model complexity using the training data alone.
## Parameter Distribution
Let's assume the targets $t_n$ are drawn independently, each Gaussian-distributed about the model prediction $\mathbf{w}^{T}\boldsymbol{\phi}(\boldsymbol{x}_n)$ with noise precision $\beta$. The likelihood is then
$
p(\mathbf{t} \mid \boldsymbol{\Phi}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\left(t_{n} \mid \mathbf{w}^{T} \boldsymbol{\phi}\left(\boldsymbol{x}_{n}\right), \beta^{-1}\right)
$
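As a concrete illustration, here is a minimal numpy sketch of this likelihood; the polynomial basis and the function names are illustrative choices, not fixed by the note:
```python
import numpy as np

# Illustrative polynomial basis: phi(x) = [1, x, x^2, ..., x^degree].
def design_matrix(x, degree=3):
    """N x M design matrix Phi whose n-th row is phi(x_n)^T."""
    return np.vander(x, N=degree + 1, increasing=True)

def log_likelihood(w, Phi, t, beta):
    """log p(t | Phi, w, beta) = sum_n log N(t_n | w^T phi(x_n), beta^{-1})."""
    residuals = t - Phi @ w
    n = len(t)
    return 0.5 * n * np.log(beta / (2 * np.pi)) - 0.5 * beta * np.sum(residuals ** 2)
```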
The prior over the weights is given by the conjugate Gaussian distribution,
$
p(\mathbf{w})=\mathcal{N}\left(\mathbf{w} \mid \mathbf{m}_{0}, \mathbf{S}_{0}\right)
$
The conjugate Gaussian prior ensures that the posterior, obtained by multiplying the likelihood by the prior and normalizing, is also Gaussian. We focus on the terms in the exponent of this product:
$
\begin{align}
&-\frac{\beta}{2} \sum_{n=1}^{N}\left(t_{n}-\mathbf{w}^{T} \boldsymbol{\phi}\left(\boldsymbol{x}_{n}\right)\right)^{2}-\frac{1}{2}\left(\mathbf{w}-\mathbf{m}_{0}\right)^{T} \mathbf{S}_{0}^{-1}\left(\mathbf{w}-\mathbf{m}_{0}\right) \\
&= -\frac{\beta}{2} \sum_{n=1}^{N}\left\{t_{n}^{2}-2 t_{n} \mathbf{w}^{T} \boldsymbol{\phi}\left(\boldsymbol{x}_{n}\right)+\mathbf{w}^{T} \boldsymbol{\phi}\left(\boldsymbol{x}_{n}\right) \boldsymbol{\phi}\left(\boldsymbol{x}_{n}\right)^{T} \mathbf{w}\right\}-\frac{1}{2}\left(\mathbf{w}-\mathbf{m}_{0}\right)^{T} \mathbf{S}_{0}^{-1}\left(\mathbf{w}-\mathbf{m}_{0}\right) \\
&= -\frac{1}{2} \mathbf{w}^{T}\left[\beta \sum_{n=1}^{N} \boldsymbol{\phi}\left(\boldsymbol{x}_{n}\right) \boldsymbol{\phi}\left(\boldsymbol{x}_{n}\right)^{T}+\mathbf{S}_{0}^{-1}\right] \mathbf{w}
-\frac{1}{2}\left[-2 \mathbf{m}_{0}^{T} \mathbf{S}_{0}^{-1}-2 \beta \sum_{n=1}^{N} t_{n} \boldsymbol{\phi}\left(\boldsymbol{x}_{n}\right)^{T}\right] \mathbf{w}
+\text { const }
\end{align}
$
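For comparison, the exponent of a general Gaussian $\mathcal{N}(\mathbf{w} \mid \mathbf{m}, \mathbf{S})$ expands as
$
-\frac{1}{2}(\mathbf{w}-\mathbf{m})^{T} \mathbf{S}^{-1}(\mathbf{w}-\mathbf{m}) = -\frac{1}{2} \mathbf{w}^{T} \mathbf{S}^{-1} \mathbf{w} + \mathbf{w}^{T} \mathbf{S}^{-1} \mathbf{m} + \text{const}
$
so the quadratic term in $\mathbf{w}$ identifies the posterior precision, and the linear term identifies precision times mean.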
By comparing the above quadratic term with the exponent of a standard Gaussian, we get
$
\mathbf{S}_{N}^{-1} = \mathbf{S}_{0}^{-1} + \beta \sum_{n=1}^{N} \boldsymbol{\phi}\left(\boldsymbol{x}_{n}\right) \boldsymbol{\phi}\left(\boldsymbol{x}_{n}\right)^{T} = \mathbf{S}_{0}^{-1} + \beta \boldsymbol{\Phi}^{T} \boldsymbol{\Phi}
$
where $\boldsymbol{\Phi}$ is the $N \times M$ design matrix whose $n$-th row is $\boldsymbol{\phi}(\boldsymbol{x}_{n})^{T}$.
And by comparing the linear term, we get
$
\begin{align}
-2 \mathbf{m}_{N}^{T} \mathbf{S}_{N}^{-1} &= -2 \mathbf{m}_{0}^{T} \mathbf{S}_{0}^{-1} - 2 \beta \sum_{n=1}^{N} t_{n} \boldsymbol{\phi}\left(\boldsymbol{x}_{n}\right)^{T}
\end{align}
$
Taking the transpose and rearranging,
$
\mathbf{m}_{N}=\mathbf{S}_{N}\left(\mathbf{S}_{0}^{-1} \mathbf{m}_{0}+\beta \boldsymbol{\Phi}^{T} \mathbf{t}\right)
$
Thus we can write the posterior in the form
$
p(\mathbf{w} \mid \mathbf{t})=\mathcal{N}\left(\mathbf{w} \mid \mathbf{m}_{N}, \mathbf{S}_{N}\right)
$
where
$
\begin{aligned}
\mathbf{m}_{N} &=\mathbf{S}_{N}\left(\mathbf{S}_{0}^{-1} \mathbf{m}_{0}+\beta \mathbf{\Phi}^{\mathrm{T}} \mathbf{t}\right) \\
\mathbf{S}_{N}^{-1} &=\mathbf{S}_{0}^{-1}+\beta \mathbf{\Phi}^{\mathrm{T}} \mathbf{\Phi}
\end{aligned}
$
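A minimal numpy sketch of these update equations (the function and variable names are illustrative):
```python
import numpy as np

def posterior(Phi, t, m0, S0, beta):
    """Posterior N(w | m_N, S_N) for a Gaussian prior N(w | m0, S0)."""
    S0_inv = np.linalg.inv(S0)
    S_N_inv = S0_inv + beta * Phi.T @ Phi          # S_N^{-1} = S_0^{-1} + beta * Phi^T Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = S_N @ (S0_inv @ m0 + beta * Phi.T @ t)   # m_N = S_N (S_0^{-1} m_0 + beta * Phi^T t)
    return m_N, S_N
```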
We can also simplify the treatment by considering a particular form of Gaussian prior, namely a zero-mean isotropic Gaussian governed by a single precision parameter $\alpha$, so that the equations reduce to
$
p(\mathbf{w} \mid \alpha)=\mathcal{N}\left(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}\right)
$
and
$
\begin{array}{l}
\mathbf{m}_{N}=\beta \mathbf{S}_{N} \mathbf{\Phi}^{\mathrm{T}} \mathbf{t} \\
\mathbf{S}_{N}^{-1}=\alpha \mathbf{I}+\beta \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{\Phi}
\end{array}
$
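Note that, with this prior, maximizing the log posterior over $\mathbf{w}$ amounts to minimizing a sum-of-squares error with a quadratic regularizer whose coefficient is $\lambda = \alpha / \beta$, connecting back to the regularization coefficient mentioned in the introduction:
$
\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\beta}{2} \sum_{n=1}^{N}\left(t_{n}-\mathbf{w}^{T} \boldsymbol{\phi}\left(\boldsymbol{x}_{n}\right)\right)^{2} - \frac{\alpha}{2} \mathbf{w}^{T} \mathbf{w} + \text{const}
$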
### Limiting Cases
When we have an infinitely broad prior, i.e. $\mathbf{S}_{0}=\alpha^{-1} \mathbf{I}$ with $\alpha \rightarrow 0$ (no restriction on $\mathbf{w}$), the mean $\mathbf{m}_N$ reduces to the maximum likelihood value $\mathbf{w}_{ML}$. Conversely, for an infinitely narrow prior, $\alpha \rightarrow \infty$, the data is effectively ignored and the posterior collapses back onto the prior.
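To see the first limit explicitly, setting $\alpha \rightarrow 0$ in the isotropic-prior equations gives
$
\mathbf{m}_{N} \rightarrow \beta\left(\beta \boldsymbol{\Phi}^{T} \boldsymbol{\Phi}\right)^{-1} \boldsymbol{\Phi}^{T} \mathbf{t} = \left(\boldsymbol{\Phi}^{T} \boldsymbol{\Phi}\right)^{-1} \boldsymbol{\Phi}^{T} \mathbf{t} = \mathbf{w}_{ML}
$
which is the familiar least-squares solution.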
### Sequential Bayesian Learning
If data points arrive sequentially, the posterior distribution at any stage acts as the prior distribution for the subsequent data point, and the new posterior is given by
$
p\left(\mathbf{w} \mid \Phi_{N+1}, \mathbf{t}_{N+1}, \mathbf{S}_{0}, \mathbf{m}_{0}, \beta\right)=\frac{p\left(\mathbf{t}_{N+1} \mid \Phi_{N+1}, \mathbf{w}, \beta\right) \cdot p\left(\mathbf{w} \mid \Phi_{N}, \mathbf{t}_{N}, \mathbf{S}_{0}, \mathbf{m}_{0}, \beta\right)}{\int p\left(\mathbf{t}_{N+1} \mid \Phi_{N+1}, \mathbf{w}, \beta\right) \cdot p\left(\mathbf{w} \mid \Phi_{N}, \mathbf{t}_{N}, \mathbf{S}_{0}, \mathbf{m}_{0}, \beta\right) d \mathbf{w}}
$
![[sequential bayesian learning.jpg]]
In the limit of infinitely many data points, the Bayesian posterior concentrates on the same parameter values given by the MAP and ML estimates.
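A minimal numpy sketch of this sequential update, starting from the isotropic prior and folding in one point at a time (the toy data and the linear basis are purely illustrative):
```python
import numpy as np

def sequential_update(m, S, phi_n, t_n, beta):
    """Treat N(w | m, S) as the prior and absorb a single new point (phi_n, t_n)."""
    S_inv_new = np.linalg.inv(S) + beta * np.outer(phi_n, phi_n)
    S_new = np.linalg.inv(S_inv_new)
    m_new = S_new @ (np.linalg.inv(S) @ m + beta * t_n * phi_n)
    return m_new, S_new

alpha, beta = 2.0, 25.0
m, S = np.zeros(2), (1.0 / alpha) * np.eye(2)              # zero-mean isotropic prior
for x_n, t_n in zip([0.3, -0.6, 0.9], [0.4, -0.1, 0.8]):   # toy data
    phi_n = np.array([1.0, x_n])                           # e.g. a linear basis [1, x]
    m, S = sequential_update(m, S, phi_n, t_n, beta)
```
Folding in all $N$ points this way reproduces the batch $\mathbf{m}_N$ and $\mathbf{S}_N$.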
## Predictive Distribution
In practice, we are usually not interested in the value of $\mathbf{w}$ itself but rather in making predictions of $t$ for new values of $\mathbf{x}'$. We do this by marginalizing out the weights, i.e. integrating the likelihood of the new point against the posterior over $\mathbf{w}$:
$
p\left(t^{\prime} \mid \mathbf{x}^{\prime}, \mathbf{X}, \mathbf{t}, \alpha, \beta\right) = \int p\left(t^{\prime} \mid \boldsymbol{\phi}(\mathbf{x}^{\prime})^{T} \mathbf{w}, \beta\right) \, p\left(\mathbf{w} \mid \mathbf{X}, \mathbf{t}, \alpha, \beta\right) \, d\mathbf{w}
$
This convolution of two Gaussians can be evaluated by completing the square in the exponent of the resulting Gaussian and comparing it with the standard Gaussian form. This gives us
$
p\left(t^{\prime} \mid \mathbf{x}^{\prime}, \mathbf{X}, \mathbf{t}, \alpha, \beta\right)=\mathcal{N}\left(t^{\prime} \mid \boldsymbol{\phi}\left(\mathbf{x}^{\prime}\right)^{T} \mathbf{m}_{N}, \sigma_{N}^{2}\left(\mathbf{x}^{\prime}\right)\right)
$
where,
$
\sigma_{N}^{2}\left(\mathbf{x}^{\prime}\right)=\frac{1}{\beta}+\boldsymbol{\phi}\left(\mathbf{x}^{\prime}\right)^{T} \mathbf{S}_{N} \boldsymbol{\phi}\left(\mathbf{x}^{\prime}\right)
$
Note that $\mathbf{m}_{N}=\beta \mathbf{S}_{N} \boldsymbol{\Phi}^{T} \mathbf{t}$ and $\mathbf{S}_{N}^{-1}=\alpha \mathbf{I}+\beta \boldsymbol{\Phi}^{T} \boldsymbol{\Phi}$, as before.
The first term in the variance represents the noise on the data, while the second term reflects the uncertainty associated with the parameters $\mathbf{w}$ and shrinks as more data points are observed.
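A minimal numpy sketch of the predictive mean and variance, assuming $\mathbf{m}_N$ and $\mathbf{S}_N$ have already been computed as above (names are illustrative):
```python
import numpy as np

def predictive(phi_new, m_N, S_N, beta):
    """Mean and variance of p(t' | x', ...) for a basis vector phi_new = phi(x')."""
    mean = phi_new @ m_N                         # phi(x')^T m_N
    var = 1.0 / beta + phi_new @ S_N @ phi_new   # 1/beta + phi(x')^T S_N phi(x')
    return mean, var
```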
![[predictive distribution 1.jpg]]
Samples from the posterior distribution:
![[samples of posterior.jpg]]
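Such plots can be generated by drawing weight vectors from the posterior and evaluating the corresponding functions $y(x, \mathbf{w}) = \mathbf{w}^{T}\boldsymbol{\phi}(x)$ on a grid; a minimal sketch (names are illustrative):
```python
import numpy as np

rng = np.random.default_rng(0)

def sample_functions(Phi_grid, m_N, S_N, n_samples=5):
    """Each row of the result is y(x, w) over the grid for one sample w ~ N(m_N, S_N)."""
    W = rng.multivariate_normal(m_N, S_N, size=n_samples)  # (n_samples, M) weight samples
    return W @ Phi_grid.T                                   # (n_samples, G) function values
```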