# Bayesian Estimation
[[Maximum Likelihood Estimation]] and [[Maximum A Posteriori (MAP)]] are point-estimation approaches: each searches for a single optimal estimate of $\mathbf{w}$ rather than a distribution over it.
A modeling approach is considered fully Bayesian if it relies on a consistent application of the sum and product rules of probability, integrating over all values of $\mathbf{w}$. This allows us to account for uncertainty at every level of the modeling task.
Given a prior belief over $\mathbf{w}$, $p(\mathbf{w})$, and our data $D$, we are interested in the posterior distribution $p(\mathbf{w}|D)$ because it reflects the plausibility of each $\mathbf{w}$ after observing the data.
$
p(\mathbf{w}|D) = \frac{p(D|\mathbf{w})p(\mathbf{w})}{p(D)}
$
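As a concrete illustration (a minimal sketch, not from the original note), the posterior for a one-dimensional $\mathbf{w}$ can be evaluated numerically on a grid. All modeling choices below, a Gaussian likelihood with known noise, a standard Gaussian prior, and the toy dataset, are assumptions made for the example:

```python
import numpy as np

# Assumed setup: likelihood p(x|w) = N(x | w, sigma^2) with known sigma,
# prior p(w) = N(0, 1); the posterior p(w|D) is evaluated on a grid.
rng = np.random.default_rng(0)
sigma = 1.0
data = rng.normal(2.0, sigma, size=20)        # toy dataset D

w_grid = np.linspace(-5.0, 5.0, 1001)
log_prior = -0.5 * w_grid**2                  # log N(w | 0, 1), up to a constant

# log-likelihood of the whole dataset for each candidate w on the grid
log_lik = (-0.5 * ((data[:, None] - w_grid[None, :]) / sigma) ** 2).sum(axis=0)

log_unnorm = log_lik + log_prior
unnorm = np.exp(log_unnorm - log_unnorm.max())  # subtract max for numerical stability

dw = w_grid[1] - w_grid[0]
posterior = unnorm / (unnorm.sum() * dw)      # normalize so p(w|D) integrates to 1

w_map = w_grid[np.argmax(posterior)]          # grid-based MAP estimate
```

The normalizing constant $p(D)$ never has to be computed explicitly here; dividing by the grid sum plays its role.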
In the fully Bayesian approach, we want a predictive distribution that does not depend on $\mathbf{w}$. Rather than committing to a single choice of $\mathbf{w}$, we incorporate all possible choices. We obtain this via marginalization: recall from the sum rule of probability that a distribution over one random variable can be obtained by integrating out the other.
The predictive distribution is then formulated as:
$
\begin{align}
p\left(x^{\prime} \mid D\right)&=\int p\left(x^{\prime}, \mathbf{w} \mid D\right) \mathrm{d} \mathbf{w}\\
&= \int p(x^{\prime} | D, \mathbf{w} )\ p(\mathbf{w}|D) \mathrm{d} \mathbf{w}\\
&= \int p(x^{\prime} \mid \mathbf{w} )\ p(\mathbf{w}|D) \mathrm{d} \mathbf{w}
\end{align}
$
where the last step assumes $x^{\prime}$ is conditionally independent of $D$ given $\mathbf{w}$.
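When this integral is intractable, it is commonly approximated by Monte Carlo: draw samples $\mathbf{w}_s \sim p(\mathbf{w}|D)$ and average the likelihood terms. The sketch below assumes a conjugate Gaussian model (likelihood $\mathcal{N}(x\,|\,w,1)$, prior $\mathcal{N}(w\,|\,0,1)$) so that the posterior has a known closed form to sample from; all parameter choices are illustrative:

```python
import numpy as np

# Assumed conjugate setup: likelihood N(x | w, 1), prior N(w | 0, 1),
# so p(w|D) = N(post_mean, post_var) in closed form and we can sample it directly.
rng = np.random.default_rng(1)
data = rng.normal(2.0, 1.0, size=20)

n = len(data)
post_var = 1.0 / (n + 1.0)          # conjugate Gaussian posterior variance
post_mean = post_var * data.sum()   # conjugate Gaussian posterior mean

w_samples = rng.normal(post_mean, np.sqrt(post_var), size=5000)

def predictive_density(x_new, w_samples):
    """Monte Carlo estimate of p(x'|D) = (1/S) * sum_s p(x'|w_s)."""
    return np.mean(np.exp(-0.5 * (x_new - w_samples) ** 2) / np.sqrt(2.0 * np.pi))

# In this conjugate case the exact predictive is N(x' | post_mean, 1 + post_var),
# which gives a check on the Monte Carlo estimate.
```

This averaging of $p(x'|\mathbf{w}_s)$ over posterior samples is exactly the marginalization above, replaced by a finite sum.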
## Bayesian Estimation of Gaussian Distributions
Dataset: $D = \{\mathbf{x},\mathbf{t}\}$
Posterior distribution after observing the data:
$
\begin{align}
p(\mathbf{w}|\mathbf{x},\mathbf{t}) &= \frac{p(\mathbf{t}|\mathbf{x},\mathbf{w})p(\mathbf{w})}{p(\mathbf{t}|\mathbf{x})} \\
&= \frac{p(\mathbf{t}|\mathbf{x},\mathbf{w})p(\mathbf{w})}{\int p(\mathbf{t}|\mathbf{x},\mathbf{w})p(\mathbf{w})\mathrm{d}\mathbf{w}}
\end{align}
$
The predictive distribution is then given by,
$
\begin{align}
p\left(t^{\prime} \mid x^{\prime}, \mathbf{x}, \mathbf{t}\right)&=\int p\left(t^{\prime}, \mathbf{w} \mid x^{\prime}, \mathbf{x}, \mathbf{t}\right) \mathrm{d} \mathbf{w}\\
&=\int p\left(t^{\prime} \mid x^{\prime}, \mathbf{x}, \mathbf{t}, \mathbf{w}\right) \cdot p(\mathbf{w}|\mathbf{x},\mathbf{t}) \mathrm{d} \mathbf{w}\\
&=\int p\left(t^{\prime} \mid x^{\prime}, \mathbf{w}\right) \cdot p(\mathbf{w}|\mathbf{x},\mathbf{t}) \mathrm{d} \mathbf{w}\\
\end{align}
$
The posterior distribution reflects uncertainty: it can have multiple modes that are all highly probable. The predictive distribution then performs Bayesian model averaging, weighting the prediction of each $\mathbf{w}$ by its posterior probability.
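For a Gaussian likelihood and Gaussian prior over the weights, the predictive integral above is available in closed form (Bayesian linear regression). The sketch below assumes a polynomial basis and known precisions `alpha` (prior) and `beta` (noise); none of these choices come from this note:

```python
import numpy as np

# Assumed setup: t = w^T phi(x) + noise, noise precision beta,
# prior p(w) = N(0, alpha^{-1} I). Posterior and predictive are Gaussian.
rng = np.random.default_rng(2)
alpha, beta = 1.0, 25.0                          # assumed prior/noise precisions
x = rng.uniform(-1.0, 1.0, size=30)
t = 0.5 * x + rng.normal(0.0, 1.0 / np.sqrt(beta), size=30)

def phi(x):
    """Polynomial basis functions (assumed choice): [1, x, x^2]."""
    return np.stack([np.ones_like(x), x, x**2], axis=-1)

Phi = phi(x)
S_inv = alpha * np.eye(3) + beta * Phi.T @ Phi   # posterior precision of w
S = np.linalg.inv(S_inv)
m = beta * S @ Phi.T @ t                         # posterior mean of w

def predict(x_new):
    """Predictive mean and variance of t' at x_new."""
    ph = phi(np.atleast_1d(x_new))[0]
    mean = m @ ph
    var = 1.0 / beta + ph @ S @ ph               # target noise + weight uncertainty
    return mean, var

mean, var = predict(0.0)
```

Note that the predictive variance decomposes into the target-noise term $1/\beta$ plus a term from the remaining uncertainty over $\mathbf{w}$, which matches the uncertainty discussion above.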
## Advantages and disadvantages of Bayesian approach
Advantages:
1. Inclusion of prior knowledge.
2. Represents uncertainty in $t$ due both to target noise and to uncertainty over $\mathbf{w}$.
Disadvantages:
1. Posterior is hard to compute analytically, so approximations are used.
2. Prior is often chosen for mathematical convenience, not as representing knowledge.
The case for Bayesian Deep Learning: https://arxiv.org/pdf/2001.10995.pdf
---