# Expectation Maximization
The EM algorithm is a two-stage iterative optimization technique for finding maximum likelihood solutions for probabilistic models with latent variables.
Consider a probabilistic model in which we collectively denote all of the observed variables by $\mathbf{X}$ and all of the hidden variables by $\mathbf{Z}$. The joint distribution $p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$ is governed by a set of parameters denoted $\boldsymbol{\theta} .$ Our goal is to maximize the likelihood function that is given by
$
p(\mathbf{X} \mid \boldsymbol{\theta})=\sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})
$
Here, let's assume that the direct optimization of $p(\mathbf{X} \mid \boldsymbol{\theta})$ is difficult, but the optimization using $p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$ is significantly easier.
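As a concrete illustration (a standard example, not part of the general derivation that follows), consider a Gaussian mixture model: each observation $\mathbf{x}_{n}$ has a one-of-$K$ binary indicator $\mathbf{z}_{n}$ marking which component generated it, and marginalizing over $\mathbf{z}_{n}$ gives
$
p\left(\mathbf{x}_{n} \mid \boldsymbol{\theta}\right)=\sum_{k=1}^{K} \pi_{k} \, \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right), \quad \boldsymbol{\theta}=\left\{\pi_{k}, \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right\}_{k=1}^{K}
$
The log of this marginal puts a sum inside the logarithm, which is what makes direct optimization awkward, whereas the complete-data likelihood $p\left(\mathbf{x}_{n}, \mathbf{z}_{n} \mid \boldsymbol{\theta}\right)=\prod_{k=1}^{K}\left[\pi_{k} \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right)\right]^{z_{n k}}$ factorizes conveniently.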
Introduce a distribution $q(\mathbf{Z})$ defined over the latent variables. From the product rule of probability,
$
\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})=\ln p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})+\ln p(\mathbf{X} \mid \boldsymbol{\theta})
$
Then the following decomposition always holds:
$
\ln p(\mathbf{X} \mid \boldsymbol{\theta})=\mathcal{L}(q, \boldsymbol{\theta})+\mathrm{KL}(q \| p)
$
where we have defined
$
\begin{aligned}
\mathcal{L}(q, \boldsymbol{\theta}) &=\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \left\{\frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})}\right\} \\
\mathrm{KL}(q \| p) &=-\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \left\{\frac{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})}{q(\mathbf{Z})}\right\}
\end{aligned}
$
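To verify the decomposition, multiply the product-rule identity above by $q(\mathbf{Z})$, sum over $\mathbf{Z}$ (the left-hand side does not depend on $\mathbf{Z}$), and add and subtract $\ln q(\mathbf{Z})$ inside the sum:
$
\begin{aligned}
\ln p(\mathbf{X} \mid \boldsymbol{\theta}) &=\sum_{\mathbf{Z}} q(\mathbf{Z})\left\{\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})-\ln p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})\right\} \\
&=\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \left\{\frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})}\right\}-\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \left\{\frac{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})}{q(\mathbf{Z})}\right\} \\
&=\mathcal{L}(q, \boldsymbol{\theta})+\mathrm{KL}(q \| p)
\end{aligned}
$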
Notice the difference in sign between the two terms, and that $\mathcal{L}$ contains the joint distribution while the KL term contains the conditional (posterior). Since $\mathrm{KL}(q \| p) \geqslant 0$, $\mathcal{L}(q, \boldsymbol{\theta})$ is a lower bound on the log likelihood $\ln p(\mathbf{X} \mid \boldsymbol{\theta})$.
![[Screenshot 2020-10-08 at 3.33.42 PM.jpg]]
In the E step, the lower bound $\mathcal{L}\left(q, \boldsymbol{\theta}^{\text {old }}\right)$ is maximized with respect to $q(\mathbf{Z})$ while holding $\boldsymbol{\theta}^{\text {old }}$ fixed. The solution to this maximization problem is easily seen by noting that the value of $\ln p\left(\mathbf{X} \mid \boldsymbol{\theta}^{\mathrm{old}}\right)$ does not depend on $q(\mathbf{Z})$, and so the largest value occurs when the KL divergence vanishes, i.e. when $q(\mathbf{Z})$ is equal to the posterior $p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\mathrm{old}}\right)$.
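Continuing the Gaussian mixture illustration, the E-step posterior is exactly the responsibility that component $k$ takes for data point $\mathbf{x}_{n}$, evaluated at the old parameters:
$
\gamma\left(z_{n k}\right)=p\left(z_{n k}=1 \mid \mathbf{x}_{n}, \boldsymbol{\theta}^{\text{old}}\right)=\frac{\pi_{k}^{\text{old}} \, \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{k}^{\text{old}}, \boldsymbol{\Sigma}_{k}^{\text{old}}\right)}{\sum_{j=1}^{K} \pi_{j}^{\text{old}} \, \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{j}^{\text{old}}, \boldsymbol{\Sigma}_{j}^{\text{old}}\right)}
$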
In the M step, $q(\mathbf{Z})$ is held fixed and the lower bound $\mathcal{L}(q, \boldsymbol{\theta})$ is maximized with respect to $\boldsymbol{\theta}$ to give some new value $\boldsymbol{\theta}^{\text {new }}$. This causes the lower bound to increase (unless it is already at a maximum), which in turn increases the log likelihood. Because $q$ was determined using the old parameters, it will in general no longer equal the new posterior $p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text {new }}\right)$, so the KL divergence becomes nonzero and the log likelihood increases by more than the lower bound does. Substituting $q(\mathbf{Z})=p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text {old }}\right)$, we see that after the E step the lower bound takes the form
$
\begin{aligned}
\mathcal{L}(q, \boldsymbol{\theta}) &=\sum_{\mathbf{Z}} p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text {old }}\right) \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})-\sum_{\mathbf{Z}} p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text {old }}\right) \ln p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text {old }}\right) \\
&=\mathcal{Q}\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text {old }}\right)+\text { const }
\end{aligned}
$
where the constant is simply the negative entropy of $q$ and is therefore independent of $\boldsymbol{\theta}$. Thus, in the M step, the quantity being maximized is the expectation of the complete-data log likelihood taken with respect to the posterior computed in the E step.
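In the Gaussian mixture illustration, this maximization of $\mathcal{Q}\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}\right)$ has a closed-form solution in terms of the responsibilities (stated here only as an example):
$
N_{k}=\sum_{n=1}^{N} \gamma\left(z_{n k}\right), \quad
\boldsymbol{\mu}_{k}^{\text{new}}=\frac{1}{N_{k}} \sum_{n=1}^{N} \gamma\left(z_{n k}\right) \mathbf{x}_{n}, \quad
\boldsymbol{\Sigma}_{k}^{\text{new}}=\frac{1}{N_{k}} \sum_{n=1}^{N} \gamma\left(z_{n k}\right)\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{k}^{\text{new}}\right)\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{k}^{\text{new}}\right)^{\mathrm{T}}, \quad
\pi_{k}^{\text{new}}=\frac{N_{k}}{N}
$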
For i.i.d. data, $\mathbf{X}$ comprises $N$ data points $\left\{\mathbf{x}_{n}\right\}$ while $\mathbf{Z}$ comprises the $N$ corresponding latent variables $\left\{\mathbf{z}_{n}\right\}$, where $n=1, \ldots, N$. The posterior used in the E step then factorizes over the data points:
$
p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})=\frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{\sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}=\frac{\prod_{n=1}^{N} p\left(\mathbf{x}_{n}, \mathbf{z}_{n} \mid \boldsymbol{\theta}\right)}{\sum_{\mathbf{Z}} \prod_{n=1}^{N} p\left(\mathbf{x}_{n}, \mathbf{z}_{n} \mid \boldsymbol{\theta}\right)}=\prod_{n=1}^{N} p\left(\mathbf{z}_{n} \mid \mathbf{x}_{n}, \boldsymbol{\theta}\right)
$
Thus the algorithm is given as follows:
1. Initialize the parameters $\boldsymbol{\theta}$.
2. Until convergence, repeat:
    1. Set $\boldsymbol{\theta}^{\text{old}} = \boldsymbol{\theta}$.
    2. Expectation step: fix the parameters and compute the posterior $p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}\right)$ (also called the responsibility), then form $\mathcal{Q}\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}\right)=\sum_{\mathbf{Z}} p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}\right) \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$, i.e. the expectation of the complete-data log likelihood under this posterior.
    3. Maximization step: fix the posterior and update the parameters, $\boldsymbol{\theta}=\underset{\boldsymbol{\theta}}{\operatorname{argmax}} \; \mathcal{Q}\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}\right)$.
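A minimal sketch of this loop in Python, for a one-dimensional two-component Gaussian mixture; the function name, synthetic data, and tolerance below are illustrative choices, not taken from the text above:

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at each element of x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def em_gmm_1d(x, K=2, max_iter=200, tol=1e-6, seed=0):
    """EM for a 1-D Gaussian mixture; returns weights, means, variances, log likelihood."""
    rng = np.random.default_rng(seed)
    N = x.shape[0]
    # Initialize parameters theta
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)
    var = np.full(K, x.var())
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E step: responsibilities gamma[n, k] = p(z_nk = 1 | x_n, theta_old)
        dens = np.stack([pi[k] * gaussian_pdf(x, mu[k], var[k]) for k in range(K)], axis=1)
        ll = np.log(dens.sum(axis=1)).sum()      # marginal log likelihood ln p(X | theta)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: maximize Q(theta, theta_old) in closed form
        Nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / N
        # Convergence check on the (non-decreasing) log likelihood
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, var, ll

# Toy usage: two well-separated Gaussian clusters
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
pi, mu, var, ll = em_gmm_1d(x, K=2)
print("weights:", pi, "means:", mu, "variances:", var, "log-likelihood:", ll)
```

The log likelihood is computed from the same densities used for the responsibilities, so tracking it costs nothing extra and gives a natural convergence check.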
## Properties of EM
From the above discussion, it follows that EM has the following properties:
- The marginal likelihood $p(\mathbf{X} \mid \boldsymbol{\theta})$ never decreases from one EM cycle to the next; it increases strictly unless the parameters are already at a stationary point.
- Since the marginal likelihood is bounded above by its global maximum and is non-decreasing across cycles, the sequence of likelihood values must converge.
However, since we are optimizing a non-convex objective, there is no guarantee of finding the global optimum. In practice, EM almost always converges to a local optimum, and that optimum depends heavily on the choice of initialization, which is the main downside of the method.
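A common mitigation (illustrative, reusing the hypothetical `em_gmm_1d` sketch above) is to restart EM from several random initializations and keep the run with the highest final log likelihood:

```python
# Run the sketch above from several seeds and keep the best local optimum.
results = [em_gmm_1d(x, K=2, seed=s) for s in range(10)]
best_pi, best_mu, best_var, best_ll = max(results, key=lambda r: r[-1])
print("best log-likelihood over 10 restarts:", best_ll)
```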
---
## References
1. Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*, Section 9.4.
2. https://people.cs.pitt.edu/~milos/courses/cs2750-Spring04/lectures/class22.pdf
3. Stanford CS228 notes on latent variable models: https://ermongroup.github.io/cs228-notes/learning/latent/