# Maximum A Posteriori

With MAP, we fit a parametric distribution by maximizing the posterior probability of the model weights given the observed data. This is formalized as:

$$
\mathbf{w}_{MAP} = \underset{\mathbf{w}}{\arg \max }\ p(\mathbf{w}|D)
$$

where $p(\mathbf{w}|D)$ is the posterior distribution. In [[Maximum Likelihood Estimation]], we chose $\mathbf{w}$ such that the data likelihood $p(D|\mathbf{w})$ is maximized. In MAP, by contrast, we choose the most probable $\mathbf{w}$ given the data, i.e., the mode of the posterior.

Dataset: $D = \{x, t\}$

Model:

$$
p(t \mid x, \mathbf{w}, \beta)=\mathcal{N}\left(t \mid y(x, \mathbf{w}), \beta^{-1}\right)=\sqrt{\frac{\beta}{2 \pi}} \exp \left[-\frac{\beta}{2}(t-y(x, \mathbf{w}))^{2}\right]
$$

Given a prior $p(\mathbf{w}|\alpha)$, the posterior distribution is given by Bayes' theorem as:

$$
\begin{align}
p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \beta, \alpha)= \frac{p(\mathbf{t}|\mathbf{x},\mathbf{w}, \beta)\, p(\mathbf{w}|\alpha)}{p(\mathbf{t}|\mathbf{x},\beta,\alpha)}
\end{align}
$$

Note that the denominator does not depend on $\mathbf{w}$. MAP is now formulated as:

$$
\begin{align}
\mathbf{w}_{MAP}=\underset{\mathbf{w}}{\operatorname{argmax}}\ p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \beta, \alpha)&=\underset{\mathbf{w}}{\operatorname{argmax}}\ \log p\left(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \beta, \alpha\right)\\
&= \underset{\mathbf{w}}{\operatorname{argmax}}\ \log p(\mathbf{t}|\mathbf{x},\mathbf{w},\beta) + \log p(\mathbf{w}|\alpha) - \log p(\mathbf{t}|\mathbf{x},\beta,\alpha)
\end{align}
$$

The third term does not depend on $\mathbf{w}$ and therefore does not contribute to the solution.
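A quick numerical sanity check of that last step: subtracting the evidence term (a constant in $\mathbf{w}$) cannot move the argmax. The sketch below uses a hypothetical one-parameter model $y(x, w) = wx$ with assumed precisions $\beta$ and $\alpha$, and compares the grid argmax of the unnormalized log posterior with and without an arbitrary constant offset.

```python
import numpy as np

# Toy one-parameter linear model t = w*x + noise (hypothetical data).
rng = np.random.default_rng(0)
beta, alpha = 25.0, 2.0                      # assumed noise and prior precisions
x = rng.uniform(-1.0, 1.0, 20)
t = 0.7 * x + rng.normal(0.0, 1.0 / np.sqrt(beta), 20)

w_grid = np.linspace(-2.0, 2.0, 2001)
# log likelihood: sum_i log N(t_i | w x_i, beta^{-1}), dropping w-free constants
log_lik = np.array([-0.5 * beta * np.sum((t - w * x) ** 2) for w in w_grid])
# log prior: log N(w | 0, alpha^{-1}), again up to a constant
log_prior = -0.5 * alpha * w_grid ** 2
log_post_unnorm = log_lik + log_prior

# Subtracting any constant (e.g. the log evidence) leaves the argmax unchanged.
offset = 3.14
assert np.argmax(log_post_unnorm) == np.argmax(log_post_unnorm - offset)

w_map = w_grid[np.argmax(log_post_unnorm)]
```

For this scalar model the objective is quadratic in $w$, so the grid maximizer should agree with the closed-form solution $w_{MAP} = \beta \sum_i x_i t_i \,/\, (\beta \sum_i x_i^2 + \alpha)$ up to the grid spacing.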
Thus,

$$
\mathbf{w}_{MAP}= \underset{\mathbf{w}}{\operatorname{argmax}}\ \log p(\mathbf{t}|\mathbf{x},\mathbf{w},\beta) + \log p(\mathbf{w}|\alpha)
$$

## Maximum A Posteriori Estimation for Gaussian Distributions

Let's model the prior distribution over the $M$ weights as a Gaussian:

$$
\begin{align}
p(\mathbf{w} \mid \alpha)&=\prod_{i=1}^{M} \mathcal{N}\left(w_{i} \mid 0, \alpha^{-1}\right)\\
&=\left(\frac{\alpha}{2 \pi}\right)^{\frac{M}{2}} \prod_{i=1}^{M} e^{-\frac{\alpha}{2} w_{i} w_{i}}\\
&=\left(\frac{\alpha}{2 \pi}\right)^{\frac{M}{2}} e^{-\frac{\alpha}{2} \mathbf{w}^{\top} \mathbf{w}}
\end{align}
$$

We know from MAP,

$$
\begin{align}
\mathbf{w}_{MAP}&= \underset{\mathbf{w}}{\operatorname{argmax}}\ \log p(\mathbf{t}|\mathbf{x},\mathbf{w},\beta) + \log p(\mathbf{w}|\alpha)\\
&= \underset{\mathbf{w}}{\operatorname{argmin}}\ -\log p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)-\log p(\mathbf{w} \mid \alpha)\\
&= \underset{\mathbf{w}}{\operatorname{argmin}}\ -\log p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) + \frac{\alpha}{2}\mathbf{w}^T\mathbf{w}
\end{align}
$$

Modeling the data distribution as a Gaussian,

$$
p(t \mid x, \mathbf{w}, \beta)=\sqrt{\frac{\beta}{2 \pi}} \exp \left[-\frac{\beta}{2}(t-y(x, \mathbf{w}))^{2}\right]
$$

Thus,

$$
\begin{align}
\mathbf{w}_{MAP}&=\underset{\mathbf{w}}{\operatorname{argmin}}\ \frac{\beta}{2}\sum_{i=1}^N \left(t_i - y(x_i,\mathbf{w})\right)^2 + \frac{\alpha}{2}\mathbf{w}^T\mathbf{w}
\end{align}
$$

Therefore, MAP reduces to minimizing a quadratic loss plus a quadratic penalty on the weights.

The predictive distribution is given by

$$
p\left(t^{\prime} \mid x^{\prime}, \mathbf{w}_{MAP}, \beta\right)=\mathcal{N}\left(t^{\prime} \mid y\left(x^{\prime}, \mathbf{w}_{MAP}\right), \beta^{-1}\right)
$$

A point estimate is obtained from the predictive distribution by taking its expected value:

$$
\mathbb{E}\left[t^{\prime} \mid x^{\prime}, \mathbf{w}_{MAP}, \beta\right]=y\left(x^{\prime}, \mathbf{w}_{MAP}\right)
$$
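For a model that is linear in the weights, $y(x, \mathbf{w}) = \boldsymbol{\phi}(x)^T \mathbf{w}$, this objective is exactly ridge regression with regularization strength $\lambda = \alpha / \beta$, and the minimizer has the closed form $\mathbf{w}_{MAP} = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T \mathbf{t}$. A minimal sketch, assuming a hypothetical cubic-polynomial design matrix and assumed precision values:

```python
import numpy as np

# MAP with a Gaussian likelihood and a Gaussian prior is ridge regression
# with lambda = alpha / beta. Hypothetical data and precisions for illustration.
rng = np.random.default_rng(1)
beta, alpha = 10.0, 1.0                      # assumed noise and prior precisions
x = rng.uniform(-1.0, 1.0, 30)
t = np.sin(np.pi * x) + rng.normal(0.0, 1.0 / np.sqrt(beta), 30)

Phi = np.vander(x, 4, increasing=True)       # features: 1, x, x^2, x^3
lam = alpha / beta

# Closed-form minimizer of (beta/2)||t - Phi w||^2 + (alpha/2)||w||^2
w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(4), Phi.T @ t)

# The objective's gradient, -beta * Phi^T (t - Phi w) + alpha * w,
# should vanish at w_map.
grad = -beta * Phi.T @ (t - Phi @ w_map) + alpha * w_map

# Point prediction at a new input x' is just y(x', w_MAP), the mean of the
# predictive Gaussian.
x_new = 0.5
t_pred = np.vander(np.array([x_new]), 4, increasing=True) @ w_map
```

The same $\lambda = \alpha/\beta$ ratio shows why a stronger prior (larger $\alpha$) or noisier data (smaller $\beta$) both pull the weights harder toward zero.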