| $>100$ | Decisive |

## Approximated Model Evidence

To investigate the behaviour of the model evidence, consider a model with a single parameter $w$:

$$
p\left(\mathcal{D} \mid \mathcal{M}_{i}\right)=\int p\left(\mathcal{D} \mid w, \mathcal{M}_{i}\right) p\left(w \mid \mathcal{M}_{i}\right) \mathrm{d} w
$$

If the posterior $p\left(w \mid \mathcal{D}, \mathcal{M}_{i}\right)$ is sharply peaked at $w_{\mathrm{MAP}}$ with width $\Delta w_{\text{posterior}}$, we can approximate the integral by the value of the integrand at its maximum times the width of the peak. If we further assume that the prior is flat with width $\Delta w_{\text{prior}}$, so that $p\left(w \mid \mathcal{M}_{i}\right)=1 / \Delta w_{\text{prior}}$, then

$$
p\left(\mathcal{D} \mid \mathcal{M}_{i}\right)=\int p\left(\mathcal{D} \mid w, \mathcal{M}_{i}\right) p\left(w \mid \mathcal{M}_{i}\right) \mathrm{d} w \approx p\left(\mathcal{D} \mid w_{\mathrm{MAP}}, \mathcal{M}_{i}\right) \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}
$$

![[Screenshot 2020-09-25 at 11.18.47 AM.jpg]]

Taking the log,

$$
\ln p\left(\mathcal{D} \mid \mathcal{M}_{i}\right) \approx \ln p\left(\mathcal{D} \mid w_{\mathrm{MAP}}, \mathcal{M}_{i}\right)+\ln \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}
$$

The first term represents the fit to the data under the most probable parameter value. The second term penalizes the model according to its complexity: since $\Delta w_{\text{posterior}} < \Delta w_{\text{prior}}$, it is negative, and it becomes more negative as the posterior narrows relative to the prior. For a model with $M$ parameters, i.e. $\mathbf{w} \in \mathbb{R}^{M}$, assuming every parameter has the same ratio of posterior to prior widths, we obtain

$$
\ln p\left(\mathcal{D} \mid \mathcal{M}_{i}\right) \simeq \ln p\left(\mathcal{D} \mid \mathbf{w}_{\mathrm{MAP}}, \mathcal{M}_{i}\right)+M \ln \left(\frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}\right)
$$

so the complexity penalty grows linearly with the number of parameters.

## Model Evidence and Medium Complexity

Consider three models $\mathcal{M}_{1}, \mathcal{M}_{2}$ and $\mathcal{M}_{3}$ of successively increasing complexity, and imagine running them in a generative setting to produce example data sets.
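The width-based log-evidence approximation derived above can be sketched numerically. This is only an illustration: the log-likelihood values at the MAP, the widths, and the parameter counts below are made-up numbers, not taken from any real model.

```python
import numpy as np

def log_evidence_approx(log_lik_at_map, M, delta_posterior, delta_prior):
    """Width-based approximation to the log model evidence:
    ln p(D|M_i) ~ ln p(D|w_MAP, M_i) + M * ln(delta_post / delta_prior)."""
    return log_lik_at_map + M * np.log(delta_posterior / delta_prior)

# A more complex model typically fits better (higher log-likelihood at the
# MAP) but pays a negative Occam penalty per parameter, since the posterior
# width is smaller than the prior width.
simple = log_evidence_approx(log_lik_at_map=-50.0, M=2,
                             delta_posterior=0.1, delta_prior=10.0)
complex_ = log_evidence_approx(log_lik_at_map=-45.0, M=10,
                               delta_posterior=0.1, delta_prior=10.0)
print(simple, complex_)  # the simpler model can win despite the worse fit
```

Here each parameter costs $\ln(0.1/10) \approx -4.6$ nats, so the five-point advantage in fit of the 10-parameter model is swamped by its larger penalty.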
To generate a data set from model $\mathcal{M}_{i}$, we first draw parameter values from the prior, $\mathbf{w} \sim p\left(\mathbf{w} \mid \mathcal{M}_{i}\right)$, and then sample the data from $\mathcal{D} \sim p\left(\mathcal{D} \mid \mathbf{w}, \mathcal{M}_{i}\right)$. A simple model has little variability, so it generates data sets that are fairly similar to each other. A complex model can generate a great variety of data sets, so its distribution $p\left(\mathcal{D} \mid \mathcal{M}_{i}\right)$ is spread over a large region of the space of data sets and assigns relatively small probability to any one of them. Because the distributions are normalized, i.e. $\int p\left(\mathcal{D} \mid \mathcal{M}_{i}\right) \mathrm{d} \mathcal{D}=1$, a particular data set $\mathcal{D}_{0}$ can therefore receive the highest evidence from the model of intermediate complexity.

![[model evidence.jpg]]

## The evidence approximation

In the fully Bayesian setting, all of the data is used to determine the hyperparameters, unlike in the frequentist approach of cross-validation, which holds out part of the data. We would introduce prior distributions over the hyperparameters and make predictions by marginalizing with respect to these hyperparameters as well as with respect to the parameters $\mathbf{w}$. However, the complete marginalization over all of these variables quickly becomes analytically intractable. We can instead use an approximation in which the hyperparameters are set to specific values obtained by maximizing the marginal likelihood function, itself obtained by first integrating over the parameters $\mathbf{w}$. This framework is known as *empirical Bayes*, *generalized maximum likelihood*, or, in the ML literature, the *evidence approximation*.
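The preference for intermediate complexity can be demonstrated with Bayesian linear regression, where the evidence is available in closed form: with a zero-mean Gaussian prior of precision $\alpha$ on the weights and noise precision $\beta$, the marginal is $p(\mathbf{t} \mid \mathcal{M}_d) = \mathcal{N}(\mathbf{t} \mid \mathbf{0}, \beta^{-1}\mathbf{I} + \alpha^{-1}\boldsymbol{\Phi}\boldsymbol{\Phi}^{\top})$. This is only a sketch: the sinusoidal data set, the orthonormalized polynomial features, and the values of $\alpha$ and $\beta$ are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
t = np.sin(np.pi * x) + rng.normal(scale=0.2, size=x.size)  # observed D_0

# Orthonormal polynomial features, so each extra degree adds one new
# direction of variability to the model.
Q, _ = np.linalg.qr(np.vander(x, 10, increasing=True))

alpha, beta = 1.0, 25.0  # illustrative prior and noise precisions

def log_evidence(degree):
    """ln N(t | 0, C) with C = beta^{-1} I + alpha^{-1} Phi Phi^T."""
    Phi = Q[:, :degree + 1]
    C = np.eye(x.size) / beta + Phi @ Phi.T / alpha
    sign, logdet = np.linalg.slogdet(2 * np.pi * C)
    return -0.5 * (logdet + t @ np.linalg.solve(C, t))

scores = {d: log_evidence(d) for d in range(10)}
best = max(scores, key=scores.get)
print(best)  # an intermediate degree typically maximizes the evidence
```

Degree 0 cannot explain the sinusoid, while high degrees spread their probability mass over many possible data sets; the evidence for this particular $\mathcal{D}_0$ peaks at an intermediate degree.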
If we introduce hyperpriors over the hyperparameters $\alpha$ and $\beta$, the predictive distribution is obtained by marginalizing over $\mathbf{w}$, $\alpha$ and $\beta$:

$$
p\left(\mathbf{t} \mid \mathbf{X}, \mathcal{M}_{i}\right)=\iiint p\left(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta, \mathcal{M}_{i}\right) p\left(\mathbf{w} \mid \mathbf{t}, \mathbf{X}, \alpha, \beta, \mathcal{M}_{i}\right) p\left(\alpha, \beta \mid \mathbf{t}, \mathbf{X}, \mathcal{M}_{i}\right) \mathrm{d} \mathbf{w} \, \mathrm{d} \alpha \, \mathrm{d} \beta
$$

This can be approximated as

$$
p\left(\mathbf{t} \mid \mathbf{X}, \mathcal{M}_{i}\right) \approx p\left(\mathbf{t} \mid \mathbf{X}, \alpha^{*}, \beta^{*}, \mathcal{M}_{i}\right)
$$

where

$$
\alpha^{*}, \beta^{*}=\underset{\alpha, \beta}{\operatorname{argmax}}\; p\left(\alpha, \beta \mid \mathbf{t}, \mathbf{X}, \mathcal{M}_{i}\right) \approx \underset{\alpha, \beta}{\operatorname{argmax}}\; p\left(\mathbf{t} \mid \mathbf{X}, \alpha, \beta, \mathcal{M}_{i}\right)
$$

the second step holding when the hyperprior is relatively flat, so that the posterior over $\alpha$ and $\beta$ is dominated by the marginal likelihood. Thus the hyperparameters used in the evidence approximation are those of the best model obtained by model selection over the continuum of models indexed by $(\alpha, \beta)$.

![[hyperparams from model selection.jpg]]

---

## References

1. Bishop, *Pattern Recognition and Machine Learning*
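The maximization of $p(\mathbf{t} \mid \mathbf{X}, \alpha, \beta, \mathcal{M}_i)$ above can be sketched for Bayesian linear regression using the standard fixed-point re-estimation equations for $\alpha$ and $\beta$ from Bishop, ch. 3. The sinusoidal data set, the polynomial features, and the initial hyperparameter values are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
Phi = np.vander(x, 6, increasing=True)  # degree-5 polynomial features
N, M = Phi.shape

alpha, beta = 1.0, 1.0  # illustrative starting values
for _ in range(100):
    # Posterior over w given the current alpha, beta:
    # N(m, A^{-1}) with A = alpha*I + beta*Phi^T Phi.
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m = beta * np.linalg.solve(A, Phi.T @ t)
    # gamma: effective number of well-determined parameters.
    lam = np.linalg.eigvalsh(beta * Phi.T @ Phi)
    gamma = np.sum(lam / (alpha + lam))
    # Fixed-point updates that maximize the log marginal likelihood.
    alpha = gamma / (m @ m)
    beta = (N - gamma) / np.sum((t - Phi @ m) ** 2)

# beta should be of the same order as the true noise precision 1/0.2**2 = 25
print(alpha, beta)
```

Each pass alternates between computing the posterior over $\mathbf{w}$ and re-estimating $(\alpha, \beta)$, converging to the $(\alpha^{*}, \beta^{*})$ that maximize the evidence, with no data held out.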