$>100$ | Decisive |

## Approximated Model Evidence

The model evidence can be approximated as

$ \ln p\left(\mathcal{D} \mid \mathcal{M}_i\right) \simeq \ln p\left(\mathcal{D} \mid \mathbf{w}_{\mathrm{MAP}}, \mathcal{M}_i\right)+M \ln \left(\frac{\Delta w_{\text {posterior }}}{\Delta w_{\text {prior }}}\right) $

The first term measures data fit using the MAP (maximum a posteriori) parameters. The second term penalizes model complexity: $M$ is the number of parameters, and the ratio compares how much the data narrows each parameter's distribution from prior to posterior. Since $\Delta w_{\text{posterior}} < \Delta w_{\text{prior}}$, each parameter contributes a negative term, so more complex models (larger $M$) face larger penalties. For example, if the data narrows each parameter's range tenfold, each parameter contributes $\ln(1/10) \approx -2.3$ to the log evidence.

The delta terms come from the Laplace approximation to the model evidence integral:

**Δw_prior:** the "width" or range of the prior distribution $p(\mathbf{w} \mid \mathcal{M}_i)$. For a uniform prior over $[-a, a]$ this is $2a$; for a Gaussian prior with standard deviation $\sigma$ it is proportional to $\sigma$.

**Δw_posterior:** the "width" of the posterior distribution $p(\mathbf{w} \mid \mathcal{D}, \mathcal{M}_i)$ around the MAP estimate. It comes from the curvature of the log-posterior at $\mathbf{w}_{\mathrm{MAP}}$ (the Hessian): higher curvature means a narrower posterior and a smaller Δw_posterior.

## Model Evidence and Medium Complexity

Consider three models $\mathcal{M}_{1}, \mathcal{M}_{2}$ and $\mathcal{M}_{3}$ of successively increasing complexity, and imagine running them in a generative setting to produce example data sets. To generate a dataset from a model, we first draw the parameter values from their prior distribution, i.e. $\mathbf{w} \sim p\left(\mathbf{w} \mid \mathcal{M}_{i}\right)$, and then with these parameters sample the data, $\mathcal{D} \sim p\left(\mathcal{D} \mid \mathbf{w}, \mathcal{M}_{i}\right)$. A simple model has little variability and so will generate data sets that are fairly similar to each other. A complex model can generate a great variety of data sets, so its distribution $p(\mathcal{D} \mid \mathcal{M}_i)$ is spread over a large region of the space of datasets and assigns relatively small probability to any one of them.
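This generative comparison can be sketched with polynomial models of increasing degree. This is a toy setup: the standard-normal prior on coefficients, the noise level, and the input grid are all assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)  # fixed input locations

def sample_dataset(degree, noise_std=0.1):
    """Generate one dataset: draw w ~ p(w | M_i), then D ~ p(D | w, M_i)."""
    w = rng.standard_normal(degree + 1)  # standard-normal prior (assumed)
    return np.polyval(w, x) + noise_std * rng.standard_normal(x.size)

# A simple model's datasets cluster together; a complex model's datasets
# are spread over a much larger region of dataset space.
spreads = {}
for degree in (0, 3, 8):
    datasets = np.stack([sample_dataset(degree) for _ in range(500)])
    spreads[degree] = datasets.std()
    print(f"degree {degree}: spread of generated datasets = {spreads[degree]:.2f}")
```

The spread grows with model complexity, so the same total probability mass is shared among many more possible datasets.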
Because the distributions are normalized, i.e. $\int p\left(\mathcal{D} \mid \mathcal{M}_{i}\right) \mathrm{d} \mathcal{D}=1$, we can see that the particular dataset $\mathcal{D}_{0}$ has the highest value of the evidence under the model of intermediate complexity: the simple model cannot fit it well, while the complex model spreads its probability mass too thinly.

![[model evidence.jpg]]

## The evidence approximation

In this Bayesian setting, all of the data is used to determine the hyperparameters, unlike in the frequentist approach of cross-validation, which holds out part of the data. A fully Bayesian treatment would introduce prior distributions over the hyperparameters and make predictions by marginalizing with respect to these hyperparameters as well as with respect to the parameters $\mathbf{w}$. However, these complete marginalizations over all the variables quickly become intractable. We can instead introduce an approximation in which we set the hyperparameters to specific values determined by maximizing the marginal likelihood function, obtained by first integrating over the parameters $\mathbf{w}$. This framework is known as *empirical Bayes*, *generalized maximum likelihood*, or, in the ML literature, the *evidence approximation*.
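For Bayesian linear regression with prior precision $\alpha$ and noise precision $\beta$ (PRML's running example), the log marginal likelihood has a closed form, and maximizing it over a grid gives a minimal sketch of this procedure. The toy data, basis functions, and grid ranges below are assumptions for illustration; in practice one would use gradient ascent or the fixed-point re-estimation equations rather than a grid.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: a noisy linear target (assumed for illustration; noise std 0.2).
N = 50
x = rng.uniform(-1, 1, N)
t = 0.5 * x - 0.3 + rng.normal(0, 0.2, N)
Phi = np.column_stack([np.ones(N), x])  # design matrix, M = 2 basis functions
M = Phi.shape[1]

def log_evidence(alpha, beta):
    """ln p(t | X, alpha, beta) for Bayesian linear regression (PRML eq. 3.86)."""
    A = alpha * np.eye(M) + beta * Phi.T @ Phi       # posterior precision
    m_N = beta * np.linalg.solve(A, Phi.T @ t)        # posterior mean
    E = beta / 2 * np.sum((t - Phi @ m_N) ** 2) + alpha / 2 * m_N @ m_N
    return (M / 2 * np.log(alpha) + N / 2 * np.log(beta) - E
            - np.linalg.slogdet(A)[1] / 2 - N / 2 * np.log(2 * np.pi))

# Grid search for (alpha*, beta*) that maximize the marginal likelihood.
alphas = np.logspace(-3, 2, 30)
betas = np.logspace(-1, 3, 30)
scores = [(log_evidence(a, b), a, b) for a in alphas for b in betas]
best_score, alpha_star, beta_star = max(scores)
print(f"alpha* = {alpha_star:.3g}, beta* = {beta_star:.3g}")
```

Since the data were generated with noise standard deviation 0.2, the maximizing $\beta^{*}$ should land near the true noise precision $1/0.2^2 = 25$.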
If we introduce hyperpriors over the hyperparameters $\alpha$ and $\beta$, then the predictive distribution is obtained by marginalizing over $\mathbf{w}$, $\alpha$ and $\beta$, so that

$ p\left(\mathbf{t} \mid \mathbf{X}, \mathcal{M}_{i}\right)=\iiint p\left(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta, \mathcal{M}_{i}\right) p\left(\mathbf{w} \mid \mathbf{t}, \mathbf{X}, \alpha, \mathcal{M}_{i}\right) p\left(\alpha, \beta \mid \mathbf{t}, \mathbf{X}, \mathcal{M}_{i}\right) \mathrm{d} \mathbf{w}\, \mathrm{d} \alpha\, \mathrm{d} \beta $

This can be approximated as

$ p\left(\mathbf{t} \mid \mathbf{X}, \mathcal{M}_{i}\right) \approx p\left(\mathbf{t} \mid \mathbf{X}, \alpha^{*}, \beta^{*}, \mathcal{M}_{i}\right) $

where

$ \alpha^{*}, \beta^{*}=\underset{\alpha, \beta}{\operatorname{argmax}}\; p\left(\alpha, \beta \mid \mathbf{t}, \mathbf{X}, \mathcal{M}_{i}\right) \approx \underset{\alpha, \beta}{\operatorname{argmax}}\; p\left(\mathbf{t} \mid \mathbf{X}, \alpha, \beta, \mathcal{M}_{i}\right) $

The second step holds when the hyperprior is relatively flat, since then the posterior over $\alpha$ and $\beta$ is proportional to the marginal likelihood. Thus the hyperparameters chosen by the evidence approximation are the hyperparameters of the best model obtained from model selection.

![[hyperparams from model selection.jpg]]

---
## References
1. Bishop, *Pattern Recognition and Machine Learning* (PRML)