Tags: #notesonai #mnemonic
Topics: [[Machine Learning]]
ID: 20230107191638

---

# Bayesian Model Selection with Model Evidence

The Bayesian view of model comparison simply involves the use of probabilities to represent uncertainty in the choice of model, along with a consistent application of the sum and product rules of probability.

Suppose we wish to compare $L$ models $\left\{\mathcal{M}_{i}\right\}_{i=1}^{L}$. Here a model refers to a probability distribution over the observed data $\mathcal{D}$. We suppose that the data is generated from one of these models, but we are uncertain which one. Our uncertainty is expressed through a prior probability distribution $p\left(\mathcal{M}_{i}\right)$, which allows us to express a preference for different models. We wish to evaluate the posterior distribution

$$
p\left(\mathcal{M}_{i} \mid \mathcal{D}\right) \propto p\left(\mathcal{M}_{i}\right) p\left(\mathcal{D} \mid \mathcal{M}_{i}\right)
$$

The term $p\left(\mathcal{D} \mid \mathcal{M}_{i}\right)$ is known as the *model evidence*, and it expresses the preference shown by the data for different models.

Once we know the posterior distribution over the models, the predictive distribution is given by

$$
p(t' \mid \mathbf{x'}, \mathcal{D})=\sum_{i=1}^{L} p\left(t' \mid \mathbf{x'}, \mathcal{M}_{i}, \mathcal{D}\right) p\left(\mathcal{M}_{i} \mid \mathcal{D}\right)
$$

This predictive distribution is a *mixture distribution*, obtained by averaging the predictive distributions of the individual models, weighted by their posterior probabilities. Evaluating this average over all models quickly becomes computationally expensive or intractable, so a common approximation is to select only the single most probable model:

$$
\mathcal{M}^{*}=\underset{\mathcal{M}_{i}}{\arg \max } \ p\left(\mathcal{M}_{i} \mid \mathcal{D}\right) = \underset{\mathcal{M}_{i}}{\arg \max } \ p\left(\mathcal{D} \mid \mathcal{M}_{i}\right) p\left(\mathcal{M}_{i}\right)
$$

Under a flat prior $p\left(\mathcal{M}_{i}\right)$, the most probable model is simply the model with the maximum model evidence.

For a model governed by a set of parameters $\mathbf{w}$, the model evidence is given, from the sum and product rules of probability, by

$$
p\left(\mathcal{D} \mid \mathcal{M}_{i}\right)=\int p\left(\mathcal{D} \mid \mathbf{w}, \mathcal{M}_{i}\right) p\left(\mathbf{w} \mid \mathcal{M}_{i}\right) \mathrm{d} \mathbf{w}
$$

The model evidence is sometimes also called the *marginal likelihood*, because it can be viewed as a likelihood function over the space of models in which the parameters have been marginalized out. It is interesting to note that the evidence is precisely the normalizing term in Bayes's theorem for the parameter posterior:

$$
p\left(\mathbf{w} \mid \mathcal{D}, \mathcal{M}_{i}\right)=\frac{p\left(\mathcal{D} \mid \mathbf{w}, \mathcal{M}_{i}\right) p\left(\mathbf{w} \mid \mathcal{M}_{i}\right)}{p\left(\mathcal{D} \mid \mathcal{M}_{i}\right)}
$$

Two models can be compared by taking the ratio of their posteriors:

$$
\frac{p\left(\mathcal{M}_1 \mid \mathcal{D}\right)}{p\left(\mathcal{M}_2 \mid \mathcal{D}\right)}=\frac{p\left(\mathcal{M}_1\right) p\left(\mathcal{D} \mid \mathcal{M}_1\right)}{p\left(\mathcal{M}_2\right) p\left(\mathcal{D} \mid \mathcal{M}_2\right)}
$$

where the ratio of evidences $K = \frac{p\left(\mathcal{D} \mid \mathcal{M}_1\right)}{p\left(\mathcal{D} \mid \mathcal{M}_2\right)}$ is called the *Bayes factor*. A value of $K > 1$ means that $\mathcal{M}_1$ is more strongly supported by the data under consideration than $\mathcal{M}_2$.
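To make this concrete, here is a minimal sketch (not from Bishop; the coin-flip data and the two Beta priors are assumptions chosen for illustration) that computes the model evidence in closed form for two Beta-Bernoulli models, the Bayes factor between them, and the posterior over models under a flat prior:

```python
import math

def log_evidence(h, n, a, b):
    """Log marginal likelihood of a Beta-Bernoulli model in closed form:
    p(D | M) = B(a + h, b + n - h) / B(a, b), where the data D is a
    specific sequence of n flips with h heads and Beta(a, b) is the prior."""
    log_beta = lambda x, y: math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)
    return log_beta(a + h, b + n - h) - log_beta(a, b)

h, n = 8, 10  # observed data: 8 heads in 10 flips

# Two candidate models that differ only in their prior over the coin bias.
models = {
    "M1 (near-fair coin, Beta(50, 50))": (50.0, 50.0),
    "M2 (unknown bias, Beta(1, 1))": (1.0, 1.0),
}
log_ev = {name: log_evidence(h, n, a, b) for name, (a, b) in models.items()}

# Bayes factor K = p(D | M1) / p(D | M2): K > 1 favours M1.
m1, m2 = log_ev.values()
print(f"Bayes factor K = {math.exp(m1 - m2):.3f}")

# Posterior over models under a flat prior p(M_i) = 1/2.
z = sum(math.exp(v) for v in log_ev.values())
for name, v in log_ev.items():
    print(f"p({name} | D) = {math.exp(v) / z:.3f}")
```

Note that the likelihood at a single best-fit bias would always favour the more flexible model; the evidence instead scores how well each model's *prior* predicted the data, which is exactly the built-in complexity penalty discussed below.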
Interpretation can be done using the following table from Kass and Raftery (1995):

| $K$ | Strength of evidence |
| :--- | :--- |
| 1 to 3.2 | Not worth more than a bare mention |
| 3.2 to 10 | Substantial |
| 10 to 100 | Strong |
| > 100 | Decisive |

## Approximated Model Evidence

To investigate the behaviour of the model evidence, consider a model with a single parameter $w$:

$$
p\left(\mathcal{D} \mid \mathcal{M}_{i}\right)=\int p\left(\mathcal{D} \mid w, \mathcal{M}_{i}\right) p\left(w \mid \mathcal{M}_{i}\right) \mathrm{d} w
$$

If the posterior $p\left(w \mid \mathcal{D}, \mathcal{M}_{i}\right)$ is sharply peaked at $w_{\mathrm{MAP}}$ with width $\Delta w_{\text{posterior}}$, then we can approximate the integral by the value of the integrand at its maximum times the width of the peak. If we further assume that the prior is flat with width $\Delta w_{\text{prior}}$, so that $p\left(w \mid \mathcal{M}_{i}\right)=1 / \Delta w_{\text{prior}}$, then we have

$$
p\left(\mathcal{D} \mid \mathcal{M}_{i}\right)=\int p\left(\mathcal{D} \mid w, \mathcal{M}_{i}\right) p\left(w \mid \mathcal{M}_{i}\right) \mathrm{d} w \approx p\left(\mathcal{D} \mid w_{\mathrm{MAP}}, \mathcal{M}_{i}\right) \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}
$$

![[Screenshot 2020-09-25 at 11.18.47 AM.jpg]]

Taking the log,

$$
\ln p\left(\mathcal{D} \mid \mathcal{M}_{i}\right) \approx \ln p\left(\mathcal{D} \mid w_{\mathrm{MAP}}, \mathcal{M}_{i}\right)+\ln \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}
$$

The first term represents the fit to the data with the most probable model parameters. The second term penalizes the model according to its complexity: since $\Delta w_{\text{posterior}} < \Delta w_{\text{prior}}$, this term is negative, and it grows in magnitude as the posterior becomes narrower relative to the prior. For a model with $M$ parameters, i.e. $\mathbf{w} \in \mathbb{R}^{M}$, assuming all parameters have the same ratio of widths, we obtain

$$
\ln p\left(\mathcal{D} \mid \mathcal{M}_{i}\right) \simeq \ln p\left(\mathcal{D} \mid \mathbf{w}_{\mathrm{MAP}}, \mathcal{M}_{i}\right)+M \ln \left(\frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}\right)
$$

The complexity penalty now scales linearly with the number of parameters; for example, if the posterior width is one tenth of the prior width for each parameter, the penalty is $-M \ln 10$.

## Model Evidence and Medium Complexity

Consider three models $\mathcal{M}_{1}, \mathcal{M}_{2}$ and $\mathcal{M}_{3}$ of successively increasing complexity, and imagine running these models in a generative setting to produce example data sets. To generate a data set from a model, we first draw the parameter values from their prior distribution, $\mathbf{w} \sim p\left(\mathbf{w} \mid \mathcal{M}_{i}\right)$, and then, with these parameters, sample the data from $\mathcal{D} \sim p\left(\mathcal{D} \mid \mathbf{w}, \mathcal{M}_{i}\right)$.

A simple model has little variability and so generates data sets that are fairly similar to each other. A complex model can generate a great variety of data sets, so its distribution $p\left(\mathcal{D} \mid \mathcal{M}_{i}\right)$ is spread over a large region of the space of data sets and assigns relatively small probability to any one of them. Because these distributions are normalized, i.e. $\int p\left(\mathcal{D} \mid \mathcal{M}_{i}\right) \mathrm{d} \mathcal{D}=1$, a particular data set $\mathcal{D}_{0}$ can have the highest evidence under the model of intermediate complexity.

![[model evidence.jpg]]

## The evidence approximation

In the fully Bayesian setting, all of the data is used to determine the hyperparameters, unlike in the frequentist approach of cross-validation. We would introduce prior distributions over the hyperparameters and make predictions by marginalizing with respect to these hyperparameters as well as with respect to the parameters $\mathbf{w}$. However, the complete marginalization over all of these variables quickly becomes intractable.

We can instead introduce an approximation in which the hyperparameters are set to specific values determined by maximizing the marginal likelihood function obtained by first integrating over the parameters $\mathbf{w}$. This framework is known as *empirical Bayes*, *generalized maximum likelihood*, or, in the machine learning literature, the *evidence approximation*. If we introduce hyperpriors over the hyperparameters $\alpha$ and $\beta$, the predictive distribution is obtained by marginalizing over $\mathbf{w}$, $\alpha$ and $\beta$, so that

$$
p\left(\mathbf{t} \mid \mathbf{X}, \mathcal{M}_{i}\right)=\iiint p\left(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta, \mathcal{M}_{i}\right) p\left(\mathbf{w} \mid \mathbf{t}, \mathbf{X}, \alpha, \mathcal{M}_{i}\right) p\left(\alpha, \beta \mid \mathbf{t}, \mathbf{X}, \mathcal{M}_{i}\right) \mathrm{d} \mathbf{w} \, \mathrm{d} \alpha \, \mathrm{d} \beta
$$

This can be approximated as

$$
p\left(t \mid \mathbf{X}, \mathcal{M}_{i}\right) \approx p\left(t \mid \mathbf{X}, \alpha^{*}, \beta^{*}, \mathcal{M}_{i}\right)
$$

where

$$
\alpha^{*}, \beta^{*}=\underset{\alpha, \beta}{\operatorname{argmax}} \ p\left(\alpha, \beta \mid \mathbf{t}, \mathbf{X}, \mathcal{M}_{i}\right) \approx \underset{\alpha, \beta}{\operatorname{argmax}} \ p\left(\mathbf{t} \mid \mathbf{X}, \alpha, \beta, \mathcal{M}_{i}\right)
$$

Thus the hyperparameters in the evidence approximation are the hyperparameters of the best model obtained from model selection.

![[hyperparams from model selection.jpg]]
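The following is a minimal numerical sketch of this procedure for Bayesian linear regression with prior precision $\alpha$ and noise precision $\beta$, using the closed-form log marginal likelihood from PRML (Eq. 3.86); the toy sine data, the polynomial basis, and the grid search are assumptions chosen for illustration:

```python
import numpy as np

def log_marginal_likelihood(Phi, t, alpha, beta):
    """ln p(t | X, alpha, beta) for Bayesian linear regression with prior
    p(w) = N(0, alpha^{-1} I) and noise precision beta (PRML, Eq. 3.86)."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi      # posterior precision
    m_N = beta * np.linalg.solve(A, Phi.T @ t)      # posterior mean
    E = beta / 2 * np.sum((t - Phi @ m_N) ** 2) + alpha / 2 * m_N @ m_N
    return (M / 2 * np.log(alpha) + N / 2 * np.log(beta) - E
            - np.linalg.slogdet(A)[1] / 2 - N / 2 * np.log(2 * np.pi))

# Toy data: a noisy sine with polynomial features (an assumed setup).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, x.size)
Phi = np.vander(x, 6, increasing=True)              # M = 6 basis functions

# Evidence approximation: instead of marginalizing over (alpha, beta),
# pick the values that maximize the marginal likelihood on a grid.
candidates = [(log_marginal_likelihood(Phi, t, a, b), a, b)
              for a in np.logspace(-4, 2, 40)
              for b in np.logspace(-1, 3, 40)]
best, alpha_star, beta_star = max(candidates)
print(f"alpha* = {alpha_star:.3g}, beta* = {beta_star:.3g}, ln p(t) = {best:.2f}")
```

In practice PRML derives fixed-point update equations for $\alpha$ and $\beta$ rather than a grid search; the grid is used here only to keep the sketch self-contained.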
---

## References

1. Bishop, C. M. *Pattern Recognition and Machine Learning*. Springer.
2. Kass, R. E., & Raftery, A. E. (1995). Bayes Factors. *Journal of the American Statistical Association*, 90(430), 773–795.