$>100$ | Decisive |

## Approximated Model Evidence

The model evidence can be approximated as

$ \ln p\left(\mathcal{D} \mid \mathcal{M}_i\right) \simeq \ln p\left(\mathcal{D} \mid \mathbf{w}_{\mathrm{MAP}}, \mathcal{M}_i\right)+M \ln \left(\frac{\Delta w_{\text {posterior }}}{\Delta w_{\text {prior }}}\right) $

The first term measures data fit using the MAP (maximum a posteriori) parameters. The second term penalizes model complexity: $M$ is the number of parameters, and the ratio compares how much the data narrows each parameter's distribution from prior to posterior. Since $\Delta w_{\text{posterior}} < \Delta w_{\text{prior}}$, each parameter contributes a negative term, so more complex models (larger $M$) face larger penalties. For example, if the data narrows each parameter's range tenfold, each parameter contributes $\ln(1/10) \approx -2.3$ to the log evidence.

The delta terms come from the Laplace approximation to the model evidence integral:

**Δw_prior:** the "width" or range of the prior distribution $p(\mathbf{w} \mid \mathcal{M}_i)$. For a uniform prior over $[-a, a]$ this is $2a$; for a Gaussian prior with standard deviation $\sigma$ it is proportional to $\sigma$.

**Δw_posterior:** the "width" of the posterior distribution $p(\mathbf{w} \mid \mathcal{D}, \mathcal{M}_i)$ around the MAP estimate. It comes from the curvature of the log-posterior at $\mathbf{w}_{\mathrm{MAP}}$ (the Hessian): higher curvature means a narrower posterior and a smaller Δw_posterior.

## Model Evidence and Medium Complexity

Consider three models $\mathcal{M}_{1}, \mathcal{M}_{2}$ and $\mathcal{M}_{3}$ of successively increasing complexity, and imagine running them in a generative setting to produce example data sets. To generate a dataset from a model, we first draw the parameter values from their prior distribution, i.e. $\mathbf{w} \sim p\left(\mathbf{w} \mid \mathcal{M}_{i}\right)$, and then with these parameters sample the data, $\mathcal{D} \sim p\left(\mathcal{D} \mid \mathbf{w}, \mathcal{M}_{i}\right)$. A simple model has little variability and so will generate data sets that are fairly similar to each other. A complex model can generate a great variety of data sets, so its distribution $p(\mathcal{D} \mid \mathcal{M}_i)$ is spread over a large region of the space of datasets and assigns relatively small probability to any one of them.
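This generative comparison can be sketched with polynomial models of increasing degree. This is a toy setup: the standard-normal prior on coefficients, the noise level, and the input grid are all assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)  # fixed input locations

def sample_dataset(degree, noise_std=0.1):
    """Generate one dataset: draw w ~ p(w | M_i), then D ~ p(D | w, M_i)."""
    w = rng.standard_normal(degree + 1)  # standard-normal prior (assumed)
    return np.polyval(w, x) + noise_std * rng.standard_normal(x.size)

# A simple model's datasets cluster together; a complex model's datasets
# are spread over a much larger region of dataset space.
spreads = {}
for degree in (0, 3, 8):
    datasets = np.stack([sample_dataset(degree) for _ in range(500)])
    spreads[degree] = datasets.std()
    print(f"degree {degree}: spread of generated datasets = {spreads[degree]:.2f}")
```

The spread grows with model complexity, so the same total probability mass is shared among many more possible datasets.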
Because the distributions are normalized, i.e. $\int p\left(\mathcal{D} \mid \mathcal{M}_{i}\right) \mathrm{d} \mathcal{D}=1$, we can see that the particular dataset $\mathcal{D}_{0}$ has the highest value of the evidence under the model of intermediate complexity: the simple model cannot fit it well, while the complex model spreads its probability mass too thinly.

![[model evidence.jpg]]

## The evidence approximation

In this Bayesian setting, all of the data is used to determine the hyperparameters, unlike in the frequentist approach of cross-validation, which holds out part of the data. A fully Bayesian treatment would introduce prior distributions over the hyperparameters and make predictions by marginalizing with respect to these hyperparameters as well as with respect to the parameters $\mathbf{w}$. However, these complete marginalizations over all the variables quickly become intractable. We can instead introduce an approximation in which we set the hyperparameters to specific values determined by maximizing the marginal likelihood function, obtained by first integrating over the parameters $\mathbf{w}$. This framework is known as *empirical Bayes*, *generalized maximum likelihood*, or, in the ML literature, the *evidence approximation*.
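For Bayesian linear regression with prior precision $\alpha$ and noise precision $\beta$ (PRML's running example), the log marginal likelihood has a closed form, and maximizing it over a grid gives a minimal sketch of this procedure. The toy data, basis functions, and grid ranges below are assumptions for illustration; in practice one would use gradient ascent or the fixed-point re-estimation equations rather than a grid.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: a noisy linear target (assumed for illustration; noise std 0.2).
N = 50
x = rng.uniform(-1, 1, N)
t = 0.5 * x - 0.3 + rng.normal(0, 0.2, N)
Phi = np.column_stack([np.ones(N), x])  # design matrix, M = 2 basis functions
M = Phi.shape[1]

def log_evidence(alpha, beta):
    """ln p(t | X, alpha, beta) for Bayesian linear regression (PRML eq. 3.86)."""
    A = alpha * np.eye(M) + beta * Phi.T @ Phi       # posterior precision
    m_N = beta * np.linalg.solve(A, Phi.T @ t)        # posterior mean
    E = beta / 2 * np.sum((t - Phi @ m_N) ** 2) + alpha / 2 * m_N @ m_N
    return (M / 2 * np.log(alpha) + N / 2 * np.log(beta) - E
            - np.linalg.slogdet(A)[1] / 2 - N / 2 * np.log(2 * np.pi))

# Grid search for (alpha*, beta*) that maximize the marginal likelihood.
alphas = np.logspace(-3, 2, 30)
betas = np.logspace(-1, 3, 30)
scores = [(log_evidence(a, b), a, b) for a in alphas for b in betas]
best_score, alpha_star, beta_star = max(scores)
print(f"alpha* = {alpha_star:.3g}, beta* = {beta_star:.3g}")
```

Since the data were generated with noise standard deviation 0.2, the maximizing $\beta^{*}$ should land near the true noise precision $1/0.2^2 = 25$.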
If we introduce hyperpriors over the hyperparameters $\alpha$ and $\beta$, then the predictive distribution is obtained by marginalizing over $\mathbf{w}$, $\alpha$ and $\beta$, so that

$ p\left(\mathbf{t} \mid \mathbf{X}, \mathcal{M}_{i}\right)=\iiint p\left(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta, \mathcal{M}_{i}\right) p\left(\mathbf{w} \mid \mathbf{t}, \mathbf{X}, \alpha, \mathcal{M}_{i}\right) p\left(\alpha, \beta \mid \mathbf{t}, \mathbf{X}, \mathcal{M}_{i}\right) \mathrm{d} \mathbf{w}\, \mathrm{d} \alpha\, \mathrm{d} \beta $

This can be approximated as

$ p\left(\mathbf{t} \mid \mathbf{X}, \mathcal{M}_{i}\right) \approx p\left(\mathbf{t} \mid \mathbf{X}, \alpha^{*}, \beta^{*}, \mathcal{M}_{i}\right) $

where

$ \alpha^{*}, \beta^{*}=\underset{\alpha, \beta}{\operatorname{argmax}}\; p\left(\alpha, \beta \mid \mathbf{t}, \mathbf{X}, \mathcal{M}_{i}\right) \approx \underset{\alpha, \beta}{\operatorname{argmax}}\; p\left(\mathbf{t} \mid \mathbf{X}, \alpha, \beta, \mathcal{M}_{i}\right) $

The second step holds when the hyperprior is relatively flat, since then the posterior over $\alpha$ and $\beta$ is proportional to the marginal likelihood. Thus the hyperparameters chosen by the evidence approximation are the hyperparameters of the best model obtained from model selection.

![[hyperparams from model selection.jpg]]

---
## References
1. Bishop, *Pattern Recognition and Machine Learning* (PRML)