Tags: #notesonai #mnemonic
Topics: [[Machine Learning]]
ID: 20230107191638
---
# Bayesian Model Selection with Model Evidence
The Bayesian view of model comparison simply involves the use of probabilities to represent uncertainty in the choice of model, along with a consistent application of the sum and product rules of probability.
Suppose we wish to compare $L$ models $\left\{\mathcal{M}_{i}\right\}_{i=1}^{L}$. Here a model refers to a probability distribution over the observed data $\mathcal{D}$.
We suppose that the data is generated from one of these models, but we are uncertain which one. Our uncertainty is expressed through a prior probability distribution $p\left(\mathcal{M}_{i}\right)$. The prior allows us to express a preference for different models.
We wish to evaluate the posterior distribution
$
p\left(\mathcal{M}_{i} \mid \mathcal{D}\right) \propto p\left(\mathcal{M}_{i}\right) p\left(\mathcal{D} \mid \mathcal{M}_{i}\right)
$
The term $p\left(\mathcal{D} \mid \mathcal{M}_{i}\right)$ is known as *model evidence*, and expresses the preference shown by the data for different models.
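As a minimal sketch of this proportionality, the posterior over models can be computed by adding log prior and log evidence and normalizing. The log-evidence values and the function name `model_posterior` below are made up for illustration; they do not come from any real data set.

```python
import numpy as np

def model_posterior(log_evidence, prior=None):
    """Return p(M_i | D) given ln p(D | M_i) and an optional prior p(M_i)."""
    log_evidence = np.asarray(log_evidence, dtype=float)
    if prior is None:
        # Flat prior: every model is equally probable a priori.
        prior = np.full(log_evidence.shape, 1.0 / log_evidence.size)
    log_joint = log_evidence + np.log(np.asarray(prior, dtype=float))
    log_joint -= log_joint.max()          # subtract the max for numerical stability
    weights = np.exp(log_joint)
    return weights / weights.sum()        # normalize so the posterior sums to 1

# Hypothetical log evidences for three models (illustrative numbers only).
log_evidence = [-105.2, -101.7, -103.9]
posterior = model_posterior(log_evidence)
```

With a flat prior the second model receives the largest posterior mass. Passing a strongly non-flat prior such as `[0.98, 0.01, 0.01]` shifts the preference to the first model, which shows how the prior and the evidence trade off.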
Once we know the posterior distribution over the models, the predictive distribution is given as
$p(t' \mid \mathbf{x'}, \mathcal{D})=\sum_{i=1}^{L} p\left(t' \mid \mathbf{x'}, \mathcal{M}_{i}, \mathcal{D}\right) p\left(\mathcal{M}_{i} \mid \mathcal{D}\right)$
This predictive distribution is a *mixture distribution*, obtained by averaging the predictive distributions of the individual models, weighted by the posterior probabilities of those models. Model averaging quickly becomes computationally expensive, and often intractable, so we approximate it by selecting the single most probable model:
$
\mathcal{M}^{*}=\underset{\mathcal{M}_{i}}{\arg \max } \ p\left(\mathcal{M}_{i} \mid D\right) = \underset{\mathcal{M}_{i}}{\arg \max } p\left(D \mid \mathcal{M}_{i}\right) p\left(\mathcal{M}_{i}\right)
$
If the prior $p\left(\mathcal{M}_{i}\right)$ is flat, the most probable model is simply the model with the maximum model evidence. Two models can be compared through the ratio of their model evidences, $p\left(\mathcal{D} \mid \mathcal{M}_{i}\right) / p\left(\mathcal{D} \mid \mathcal{M}_{j}\right)$, which is called the *Bayes factor*.
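The difference between the mixture predictive distribution and its single-model approximation can be sketched numerically. The example assumes, purely for illustration, that each model's predictive distribution at the query point is Gaussian; the means, standard deviations and posterior weights are invented.

```python
import numpy as np

def gaussian_pdf(t, mean, std):
    """Density of N(mean, std^2) evaluated at t (broadcasts over arrays)."""
    return np.exp(-0.5 * ((t - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def bma_density(t, means, stds, weights):
    """Mixture density: sum_i p(t' | x', M_i, D) p(M_i | D), evaluated at t."""
    t = np.atleast_1d(t)[:, None]                  # shape (n_points, 1)
    return gaussian_pdf(t, means, stds) @ weights  # weighted sum over models

# Hypothetical per-model predictive distributions and model posteriors.
means = np.array([0.0, 1.0, 3.0])
stds = np.array([0.5, 0.8, 1.0])
weights = np.array([0.2, 0.7, 0.1])                # p(M_i | D), sums to 1

t_grid = np.linspace(-10.0, 15.0, 5001)
mixture = bma_density(t_grid, means, stds, weights)

# Model selection keeps only the predictive density of the most probable model.
best = int(weights.argmax())
selected = gaussian_pdf(t_grid, means[best], stds[best])
```

The mixture can be multimodal, whereas `selected` is a single Gaussian. The approximation is only good when one model dominates the posterior.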
For a model governed by a set of parameters $\mathbf{w}$, the model evidence is given, from the sum and product rules of probability, as
$
p\left(\mathcal{D} \mid \mathcal{M}_{i}\right)=\int p\left(\mathcal{D} \mid \mathbf{w}, \mathcal{M}_{i}\right) p\left(\mathbf{w} \mid \mathcal{M}_{i}\right) \mathrm{d} \mathbf{w}
$
The model evidence is sometimes also called the *marginal likelihood* because it can be viewed as a likelihood function over the space of models, in which the parameters are marginalized out.
It is interesting to note that the evidence is precisely the normalizing term in Bayes' theorem for the posterior over the parameters:
$
p\left(\mathbf{w} \mid \mathcal{D}, \mathcal{M}_{i}\right)=\frac{p\left(\mathcal{D} \mid \mathbf{w}, \mathcal{M}_{i}\right) p\left(\mathbf{w} \mid \mathcal{M}_{i}\right)}{p\left(\mathcal{D} \mid \mathcal{M}_{i}\right)}
$
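A small worked example may make the evidence integral and the Bayes factor concrete. The sketch below assumes a coin-flipping setting that is not part of the note: $\mathcal{M}_1$ is a fair coin with no free parameter, and $\mathcal{M}_2$ has an unknown bias $w$ with a uniform prior on $[0, 1]$. For a specific sequence with $h$ heads and $t$ tails, the evidence of $\mathcal{M}_2$ is the Beta function $B(h+1, t+1)$, which the code also checks by numerical integration. All function names are hypothetical.

```python
import math
import numpy as np

def log_evidence_fair(heads, tails):
    """ln p(D | M_1): every flip has probability 0.5, no parameter to integrate."""
    return (heads + tails) * math.log(0.5)

def log_evidence_uniform_bias(heads, tails):
    """ln p(D | M_2) = ln of the integral of w^h (1-w)^t dw = ln B(h+1, t+1)."""
    return (math.lgamma(heads + 1) + math.lgamma(tails + 1)
            - math.lgamma(heads + tails + 2))

def log_evidence_numeric(heads, tails, grid_size=200_001):
    """Same integral evaluated with the trapezoid rule, as a sanity check."""
    w = np.linspace(0.0, 1.0, grid_size)
    integrand = w**heads * (1.0 - w) ** tails      # likelihood times flat prior (= 1)
    dw = w[1] - w[0]
    return math.log(np.sum(0.5 * (integrand[1:] + integrand[:-1])) * dw)

def log_bayes_factor(heads, tails):
    """ln [ p(D | M_2) / p(D | M_1) ]; positive values favour the biased coin."""
    return log_evidence_uniform_bias(heads, tails) - log_evidence_fair(heads, tails)
```

For 9 heads and 1 tail the log Bayes factor is positive, so the data prefer the flexible model. For 5 heads and 5 tails it is negative: the flexible model fits no better and pays for spreading its prior over all biases.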
## Approximated Model Evidence
To investigate the behaviour of this model evidence, let's consider the case of a model having a single parameter $w$.
$
p\left(D \mid \mathcal{M}_{i}\right)=\int p\left(D \mid w, \mathcal{M}_{i}\right) p\left(w \mid \mathcal{M}_{i}\right) \mathrm{d} w
$
If the posterior $p\left(w \mid D, \mathcal{M}_{i}\right)$ is sharply peaked at $w_{\mathrm{MAP}}$ with width $\Delta w_{\text {posterior }}$, then we can approximate the integral by the value of the integrand at its maximum times the width of the peak. If we further assume that the prior is flat with width $\Delta w_{\text {prior }}$ so that $p\left(w \mid \mathcal{M}_{i}\right)=1 / \Delta w_{\text {prior }}$, then we have
$
p\left(D \mid \mathcal{M}_{i}\right)=\int p\left(D \mid w, \mathcal{M}_{i}\right) p\left(w \mid \mathcal{M}_{i}\right) \mathrm{d} w \approx p\left(\mathcal{D} \mid w_{\mathrm{MAP}},\mathcal{M}_{i}\right) \frac{\Delta w_{\text {posterior }}}{\Delta w_{\text {prior }}}
$
![[Screenshot 2020-09-25 at 11.18.47 AM.jpg]]
Taking the log,
$
\ln p\left(D \mid \mathcal{M}_{i}\right) \approx \ln p\left(D \mid w_{\mathrm{MAP}}, \mathcal{M}_{i}\right)+\ln \frac{\Delta w_{\text {posterior }}}{\Delta w_{\text {prior }}}
$
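The first term represents the fit to the data given by the most probable parameter values. The second term penalizes the model according to its complexity: because $\Delta w_{\text {posterior }}<\Delta w_{\text {prior }}$, it is negative, and it grows in magnitude as the parameter becomes more finely tuned to the data. For a model with $M$ parameters, i.e. $\mathbf{w} \in \mathbb{R}^{M}$, and assuming that all parameters have the same ratio of posterior to prior width, we obtain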
$
\ln p(\mathcal{D}|\mathcal{M}_{i}) \simeq \ln p\left(\mathcal{D} \mid \mathbf{w}_{\mathrm{MAP}},\mathcal{M}_{i}\right)+M \ln \left(\frac{\Delta w_{\text {posterior }}}{\Delta w_{\text {prior }}}\right)
$
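The single-parameter approximation can be checked numerically. The sketch below assumes a setting not taken from the note: Gaussian data with known noise level and unknown mean $w$, with a flat prior on $w$ over $[-5, 5]$. The posterior width is taken to be $\sqrt{2 \pi}$ times the posterior standard deviation, a choice that makes the "height times width" rule match the area under a Gaussian peak.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0                                   # known noise std (assumed)
data = rng.normal(0.3, sigma, size=20)        # synthetic observations
prior_width = 10.0                            # flat prior on w over [-5, 5]

def log_likelihood(w):
    """ln p(D | w) for each candidate mean in w, under Gaussian noise."""
    w = np.atleast_1d(w)
    resid = data[None, :] - w[:, None]
    return (-0.5 * np.sum(resid**2, axis=1) / sigma**2
            - data.size * np.log(sigma * np.sqrt(2.0 * np.pi)))

# Reference value: integrate likelihood * prior over w on a fine grid.
w_grid = np.linspace(-5.0, 5.0, 20_001)
log_lik = log_likelihood(w_grid)
shift = log_lik.max()                         # factor out the peak to avoid underflow
integrand = np.exp(log_lik - shift) / prior_width
dw = w_grid[1] - w_grid[0]
log_evidence_exact = shift + np.log(np.sum(0.5 * (integrand[1:] + integrand[:-1])) * dw)

# Approximation: best-fit log likelihood plus the log width ratio.
w_map = data.mean()                           # flat prior, so MAP equals maximum likelihood
posterior_width = np.sqrt(2.0 * np.pi) * sigma / np.sqrt(data.size)
occam_term = np.log(posterior_width / prior_width)
log_evidence_approx = log_likelihood(w_map)[0] + occam_term
```

Here the likelihood is exactly Gaussian in $w$, so the two values agree to numerical precision. `occam_term` is negative, and widening the prior makes it more negative without changing the fit term.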
## Model Evidence and Medium Complexity
Let's consider three models $\mathcal{M}_{1}, \mathcal{M}_{2}$ and $\mathcal{M}_{3}$ of successively increasing complexity. Let's imagine running these models in a generative setting to produce example data sets.
To generate a data set from a model, we first choose the values of the parameters from their prior distribution, i.e. $\mathbf{w} \sim p\left(\mathbf{w} \mid \mathcal{M}_{i}\right)$. Then with these parameters we sample the data, $\mathcal{D} \sim p\left(\mathcal{D} \mid \mathbf{w}, \mathcal{M}_{i}\right)$.
A simple model has little variability and so will generate data sets that are fairly similar to each other. A complex model can generate a great variety of data sets, so its distribution $p\left(\mathcal{D} \mid \mathcal{M}_{i}\right)$ is spread over a large region of the space of data sets and assigns relatively small probability to any one of them.
Because the distributions are normalized, i.e. $\int p\left(\mathcal{D} \mid \mathcal{M}_{i}\right) \mathrm{d} \mathcal{D}=1$, a model that covers more data sets must assign lower density to each of them. For the particular data set $\mathcal{D}_{0}$ in the figure, the evidence is highest for the model of intermediate complexity: the simplest model cannot fit the data well, while the most complex model spreads its probability over too many data sets.
![[model evidence.jpg]]
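The generative picture can be simulated under assumptions that are not in the note: polynomial regression models of increasing degree, a Gaussian prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \alpha^{-1} \mathbf{I})$ and Gaussian noise with precision $\beta$. In that case the distribution over target vectors is itself Gaussian with covariance $\beta^{-1} \mathbf{I}+\alpha^{-1} \boldsymbol{\Phi} \boldsymbol{\Phi}^{\top}$, so the evidence has a closed form and the log determinant of the covariance measures how widely each model spreads its mass. The values of `alpha`, `beta` and the test data are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 1.0, 100.0                  # prior precision and noise precision (assumed)
x = np.linspace(-1.0, 1.0, 30)

def design(x, degree):
    """Polynomial design matrix with columns 1, x, ..., x**degree."""
    return np.vander(x, degree + 1, increasing=True)

def sample_dataset(degree):
    """Draw w from the prior, then targets from p(t | w): one synthetic data set."""
    phi = design(x, degree)
    w = rng.normal(0.0, alpha**-0.5, size=degree + 1)
    return phi @ w + rng.normal(0.0, beta**-0.5, size=x.size)

def marginal_cov(degree):
    """Covariance of the targets after integrating out w."""
    phi = design(x, degree)
    return np.eye(x.size) / beta + phi @ phi.T / alpha

def log_evidence(t, degree):
    """ln p(t | M_degree), the log density of a zero-mean Gaussian."""
    cov = marginal_cov(degree)
    _, logdet = np.linalg.slogdet(cov)
    quad = t @ np.linalg.solve(cov, t)
    return -0.5 * (quad + logdet + t.size * np.log(2.0 * np.pi))

degrees = [0, 1, 3, 7]
# Larger log determinant means probability mass spread over a larger region.
log_volume = [np.linalg.slogdet(marginal_cov(d))[1] for d in degrees]

# A data set D_0 generated from a straight line plus noise.
t0 = 0.5 + 2.0 * x + rng.normal(0.0, beta**-0.5, size=x.size)
log_ev = {d: log_evidence(t0, d) for d in degrees}
```

`log_volume` grows with the degree, which is the "spread over a large region" effect. Comparing the entries of `log_ev` shows how the constant model is ruled out by its poor fit, while the higher-degree models are penalized for their spread.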
## The Evidence Approximation
In the Bayesian setting, all of the data can be used to determine the hyperparameters, unlike in cross-validation, where part of the data is held out. In a fully Bayesian treatment we would introduce prior distributions over the hyperparameters and make predictions by marginalizing with respect to these hyperparameters as well as with respect to the parameters $\mathbf{w}$. However, the complete marginalization over all of these variables quickly becomes intractable.
We can introduce an approximation in which we set the hyperparameters to specific values determined by maximizing the marginal likelihood function obtained by first integrating over the parameters $\mathbf{w}$. This framework is known as *empirical Bayes*, *generalized maximum likelihood* or, in the ML literature, as the *evidence approximation*.
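If we introduce hyperpriors over the hyperparameters $\alpha$ and $\beta$, then the predictive distribution for a new target $t'$ at input $\mathbf{x'}$ is obtained by marginalizing over $\mathbf{w}$, $\alpha$ and $\beta$, so that
$
p\left(t' \mid \mathbf{x'}, \mathbf{t}, \mathbf{X}, \mathcal{M}_{i}\right)=\iiint p\left(t' \mid \mathbf{x'}, \mathbf{w}, \beta, \mathcal{M}_{i}\right) p\left(\mathbf{w} \mid \mathbf{t}, \mathbf{X}, \alpha, \beta, \mathcal{M}_{i}\right) p\left(\alpha, \beta \mid \mathbf{t}, \mathbf{X}, \mathcal{M}_{i}\right) \mathrm{d} \mathbf{w} \, \mathrm{d} \alpha \, \mathrm{d} \beta
$
If the posterior over the hyperparameters is sharply peaked around $\alpha^{*}$ and $\beta^{*}$, this can be approximated as
$
p\left(t' \mid \mathbf{x'}, \mathbf{t}, \mathbf{X}, \mathcal{M}_{i}\right) \approx p\left(t' \mid \mathbf{x'}, \mathbf{t}, \mathbf{X}, \alpha^{*}, \beta^{*}, \mathcal{M}_{i}\right)=\int p\left(t' \mid \mathbf{x'}, \mathbf{w}, \beta^{*}, \mathcal{M}_{i}\right) p\left(\mathbf{w} \mid \mathbf{t}, \mathbf{X}, \alpha^{*}, \beta^{*}, \mathcal{M}_{i}\right) \mathrm{d} \mathbf{w}
$
where, for a relatively flat hyperprior,
$
\alpha^{*}, \beta^{*}=\underset{\alpha, \beta}{\arg \max } \ p\left(\alpha, \beta \mid \mathbf{t}, \mathbf{X}, \mathcal{M}_{i}\right) \approx \underset{\alpha, \beta}{\arg \max } \ p\left(\mathbf{t} \mid \mathbf{X}, \alpha, \beta, \mathcal{M}_{i}\right)
$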
Thus the evidence approximation is itself a form of model selection: each setting of the hyperparameters $(\alpha, \beta)$ defines a model, and the values chosen are those of the model with the highest evidence.
![[hyperparams from model selection.jpg]]
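For Bayesian linear regression the marginal likelihood over the targets has a closed form, so the evidence approximation can be sketched as a grid search over $(\alpha, \beta)$. The example assumes a zero-mean isotropic Gaussian prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \alpha^{-1} \mathbf{I})$, Gaussian noise with precision $\beta$, a degree-5 polynomial basis and synthetic data. None of these choices come from the note, and a grid search is used for simplicity in place of iterative re-estimation.

```python
import numpy as np

def log_marginal_likelihood(phi, t, alpha, beta):
    """ln p(t | X, alpha, beta) with the weights w integrated out analytically."""
    n, m = phi.shape
    a = alpha * np.eye(m) + beta * phi.T @ phi        # posterior precision of w
    m_n = beta * np.linalg.solve(a, phi.T @ t)        # posterior mean of w
    fit = 0.5 * beta * np.sum((t - phi @ m_n) ** 2) + 0.5 * alpha * m_n @ m_n
    _, logdet_a = np.linalg.slogdet(a)
    return (0.5 * m * np.log(alpha) + 0.5 * n * np.log(beta)
            - fit - 0.5 * logdet_a - 0.5 * n * np.log(2.0 * np.pi))

# Synthetic training data (illustrative only).
rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 40)
t = np.sin(np.pi * x) + rng.normal(0.0, 0.2, size=x.size)
phi = np.vander(x, 6, increasing=True)                # columns 1, x, ..., x^5

# Evaluate the evidence on a log-spaced grid and keep the maximizer.
alphas = np.logspace(-4, 2, 25)
betas = np.logspace(-1, 3, 25)
scores = np.array([[log_marginal_likelihood(phi, t, a, b) for b in betas]
                   for a in alphas])
i, j = np.unravel_index(scores.argmax(), scores.shape)
alpha_star, beta_star = alphas[i], betas[j]

# Plug-in prediction: posterior mean of w under (alpha*, beta*) at a new input.
a_star = alpha_star * np.eye(phi.shape[1]) + beta_star * phi.T @ phi
m_star = beta_star * np.linalg.solve(a_star, phi.T @ t)
x_new = 0.25
pred_mean = np.vander([x_new], 6, increasing=True) @ m_star
```

All 40 training points are used both to fit $\mathbf{w}$ and to choose $(\alpha, \beta)$; no validation split is needed. The grid resolution limits how precisely `alpha_star` and `beta_star` are located.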
---