# Expectation Maximization
The EM algorithm is a two-stage iterative optimization technique for finding maximum likelihood solutions for probabilistic models with latent variables.
Consider a probabilistic model in which we collectively denote all of the observed variables by $\mathbf{X}$ and all of the hidden variables by $\mathbf{Z}$. The joint distribution $p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$ is governed by a set of parameters denoted $\boldsymbol{\theta} .$ Our goal is to maximize the likelihood function that is given by
$
p(\mathbf{X} \mid \boldsymbol{\theta})=\sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})
$
Here, let's assume that the direct optimization of $p(\mathbf{X} \mid \boldsymbol{\theta})$ is difficult, but the optimization using $p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$ is significantly easier.
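As a concrete illustration (a standard example, not part of the general derivation that follows), consider a Gaussian mixture model: each observation $\mathbf{x}_{n}$ has a one-of-$K$ binary indicator $\mathbf{z}_{n}$ marking which component generated it, and marginalizing over $\mathbf{z}_{n}$ gives
$
p\left(\mathbf{x}_{n} \mid \boldsymbol{\theta}\right)=\sum_{k=1}^{K} \pi_{k} \, \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right), \quad \boldsymbol{\theta}=\left\{\pi_{k}, \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right\}_{k=1}^{K}
$
The log of this marginal puts a sum inside the logarithm, which is what makes direct optimization awkward, whereas the complete-data likelihood $p\left(\mathbf{x}_{n}, \mathbf{z}_{n} \mid \boldsymbol{\theta}\right)=\prod_{k=1}^{K}\left[\pi_{k} \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right)\right]^{z_{n k}}$ factorizes conveniently.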
Introduce a distribution $q(\mathbf{Z})$ defined over the latent variables. From the product rule of probability,
$
\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})=\ln p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})+\ln p(\mathbf{X} \mid \boldsymbol{\theta})
$
Then the following decomposition always holds:
$
\ln p(\mathbf{X} \mid \boldsymbol{\theta})=\mathcal{L}(q, \boldsymbol{\theta})+\mathrm{KL}(q \| p)
$
where we have defined
$
\begin{aligned}
\mathcal{L}(q, \boldsymbol{\theta}) &=\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \left\{\frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})}\right\} \\
\mathrm{KL}(q \| p) &=-\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \left\{\frac{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})}{q(\mathbf{Z})}\right\}
\end{aligned}
$
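To verify the decomposition, multiply the product-rule identity above by $q(\mathbf{Z})$, sum over $\mathbf{Z}$ (the left-hand side does not depend on $\mathbf{Z}$), and add and subtract $\ln q(\mathbf{Z})$ inside the sum:
$
\begin{aligned}
\ln p(\mathbf{X} \mid \boldsymbol{\theta}) &=\sum_{\mathbf{Z}} q(\mathbf{Z})\left\{\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})-\ln p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})\right\} \\
&=\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \left\{\frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})}\right\}-\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \left\{\frac{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})}{q(\mathbf{Z})}\right\} \\
&=\mathcal{L}(q, \boldsymbol{\theta})+\mathrm{KL}(q \| p)
\end{aligned}
$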
Notice the difference in sign between the two terms, and that $\mathcal{L}$ contains the joint distribution while the KL term contains the conditional (posterior). Since $\mathrm{KL}(q \| p) \geqslant 0$, $\mathcal{L}(q, \boldsymbol{\theta})$ is a lower bound on the log likelihood $\ln p(\mathbf{X} \mid \boldsymbol{\theta})$.
![[Screenshot 2020-10-08 at 3.33.42 PM.jpg]]
In the E step, the lower bound $\mathcal{L}\left(q, \boldsymbol{\theta}^{\text {old }}\right)$ is maximized with respect to $q(\mathbf{Z})$ while holding $\boldsymbol{\theta}^{\text {old }}$ fixed. The solution to this maximization problem is easily seen by noting that the value of $\ln p\left(\mathbf{X} \mid \boldsymbol{\theta}^{\mathrm{old}}\right)$ does not depend on $q(\mathbf{Z})$, and so the largest value occurs when the KL divergence vanishes, i.e. when $q(\mathbf{Z})$ is equal to the posterior $p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\mathrm{old}}\right)$.
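Continuing the Gaussian mixture illustration, the E-step posterior is exactly the responsibility that component $k$ takes for data point $\mathbf{x}_{n}$, evaluated at the old parameters:
$
\gamma\left(z_{n k}\right)=p\left(z_{n k}=1 \mid \mathbf{x}_{n}, \boldsymbol{\theta}^{\text{old}}\right)=\frac{\pi_{k}^{\text{old}} \, \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{k}^{\text{old}}, \boldsymbol{\Sigma}_{k}^{\text{old}}\right)}{\sum_{j=1}^{K} \pi_{j}^{\text{old}} \, \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{j}^{\text{old}}, \boldsymbol{\Sigma}_{j}^{\text{old}}\right)}
$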
In the M step, $q(\mathbf{Z})$ is held fixed and the lower bound $\mathcal{L}(q, \boldsymbol{\theta})$ is maximized with respect to $\boldsymbol{\theta}$ to give some new value $\boldsymbol{\theta}^{\text {new }}$. This causes the lower bound to increase (unless it is already at a maximum), which in turn increases the log likelihood. Because $q$ was determined using the old parameters, it will in general no longer equal the new posterior $p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text {new }}\right)$, so the KL divergence becomes nonzero and the log likelihood increases by more than the lower bound does. Substituting $q(\mathbf{Z})=p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text {old }}\right)$, we see that after the E step the lower bound takes the form
$
\begin{aligned}
\mathcal{L}(q, \boldsymbol{\theta}) &=\sum_{\mathbf{Z}} p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text {old }}\right) \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})-\sum_{\mathbf{Z}} p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text {old }}\right) \ln p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text {old }}\right) \\
&=\mathcal{Q}\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text {old }}\right)+\text { const }
\end{aligned}
$
where the constant is simply the negative entropy of $q$ and is therefore independent of $\boldsymbol{\theta}$. Thus, in the M step, the quantity being maximized is the expectation of the complete-data log likelihood taken with respect to the posterior computed in the E step.
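In the Gaussian mixture illustration, this maximization of $\mathcal{Q}\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}\right)$ has a closed-form solution in terms of the responsibilities (stated here only as an example):
$
N_{k}=\sum_{n=1}^{N} \gamma\left(z_{n k}\right), \quad
\boldsymbol{\mu}_{k}^{\text{new}}=\frac{1}{N_{k}} \sum_{n=1}^{N} \gamma\left(z_{n k}\right) \mathbf{x}_{n}, \quad
\boldsymbol{\Sigma}_{k}^{\text{new}}=\frac{1}{N_{k}} \sum_{n=1}^{N} \gamma\left(z_{n k}\right)\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{k}^{\text{new}}\right)\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{k}^{\text{new}}\right)^{\mathrm{T}}, \quad
\pi_{k}^{\text{new}}=\frac{N_{k}}{N}
$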
For i.i.d. data, $\mathbf{X}$ comprises $N$ data points $\left\{\mathbf{x}_{n}\right\}$ while $\mathbf{Z}$ comprises the $N$ corresponding latent variables $\left\{\mathbf{z}_{n}\right\}$, where $n=1, \ldots, N$. The posterior used in the E step then factorizes over the data points:
$
p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})=\frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{\sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}=\frac{\prod_{n=1}^{N} p\left(\mathbf{x}_{n}, \mathbf{z}_{n} \mid \boldsymbol{\theta}\right)}{\sum_{\mathbf{Z}} \prod_{n=1}^{N} p\left(\mathbf{x}_{n}, \mathbf{z}_{n} \mid \boldsymbol{\theta}\right)}=\prod_{n=1}^{N} p\left(\mathbf{z}_{n} \mid \mathbf{x}_{n}, \boldsymbol{\theta}\right)
$
Thus the algorithm is given as follows:
1. Initialize the parameters $\boldsymbol{\theta}$.
2. Until convergence, repeat:
    1. Set $\boldsymbol{\theta}^{\text{old}} = \boldsymbol{\theta}$.
    2. Expectation step: fix the parameters and compute the posterior $p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}\right)$ (also called the responsibility), then form $\mathcal{Q}\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}\right)=\sum_{\mathbf{Z}} p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}\right) \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$, i.e. the expectation of the complete-data log likelihood under this posterior.
    3. Maximization step: fix the posterior and update the parameters, $\boldsymbol{\theta}=\underset{\boldsymbol{\theta}}{\operatorname{argmax}} \; \mathcal{Q}\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}\right)$.
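A minimal sketch of this loop in Python, for a one-dimensional two-component Gaussian mixture; the function name, synthetic data, and tolerance below are illustrative choices, not taken from the text above:

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at each element of x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def em_gmm_1d(x, K=2, max_iter=200, tol=1e-6, seed=0):
    """EM for a 1-D Gaussian mixture; returns weights, means, variances, log likelihood."""
    rng = np.random.default_rng(seed)
    N = x.shape[0]
    # Initialize parameters theta
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)
    var = np.full(K, x.var())
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E step: responsibilities gamma[n, k] = p(z_nk = 1 | x_n, theta_old)
        dens = np.stack([pi[k] * gaussian_pdf(x, mu[k], var[k]) for k in range(K)], axis=1)
        ll = np.log(dens.sum(axis=1)).sum()      # marginal log likelihood ln p(X | theta)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: maximize Q(theta, theta_old) in closed form
        Nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / N
        # Convergence check on the (non-decreasing) log likelihood
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, var, ll

# Toy usage: two well-separated Gaussian clusters
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
pi, mu, var, ll = em_gmm_1d(x, K=2)
print("weights:", pi, "means:", mu, "variances:", var, "log-likelihood:", ll)
```

The log likelihood is computed from the same densities used for the responsibilities, so tracking it costs nothing extra and gives a natural convergence check.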
## Properties of EM
From the above discussion, it follows that EM has the following properties:
- The marginal likelihood $p(\mathbf{X} \mid \boldsymbol{\theta})$ never decreases from one EM cycle to the next; it increases strictly unless the parameters are already at a stationary point.
- Since the marginal likelihood is bounded above by its global maximum and is non-decreasing across cycles, the sequence of likelihood values must converge.
However, since we are optimizing a non-convex objective, there is no guarantee of finding the global optimum. In practice, EM almost always converges to a local optimum, and that optimum depends heavily on the choice of initialization, which is the main downside of the method.
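A common mitigation (illustrative, reusing the hypothetical `em_gmm_1d` sketch above) is to restart EM from several random initializations and keep the run with the highest final log likelihood:

```python
# Run the sketch above from several seeds and keep the best local optimum.
results = [em_gmm_1d(x, K=2, seed=s) for s in range(10)]
best_pi, best_mu, best_var, best_ll = max(results, key=lambda r: r[-1])
print("best log-likelihood over 10 restarts:", best_ll)
```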
---
## References
1. Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*, Section 9.4.
2. https://people.cs.pitt.edu/~milos/courses/cs2750-Spring04/lectures/class22.pdf
3. Stanford CS228 notes on latent variable models: https://ermongroup.github.io/cs228-notes/learning/latent/