# Gaussian Mixture Model

By using a sufficient number of [[Gaussian Distribution|Gaussian distributions]], and by adjusting their means and covariances as well as the coefficients in the linear combination, almost any continuous density can be approximated to arbitrary accuracy. The goal of a GMM is to partition the data into $K$ classes/clusters by maximizing the likelihood of the probabilistic model $p(\mathbf{x})$.

Recall the discrete latent variable model,
$$
p(\boldsymbol{x})=\sum_{\boldsymbol{z}} p(\boldsymbol{x}, \boldsymbol{z})=\sum_{\boldsymbol{z}} p(\boldsymbol{x} \mid \boldsymbol{z}) p(\boldsymbol{z})
$$
For the discrete latent variable, the prior is modeled as a generalized Bernoulli (1-of-$K$) distribution, $z_{k} \in\{0,1\}$ for the $K$ clusters, with $p\left(z_{k}=1\right)=\pi_{k}$, $\pi_{k} \in[0,1]$, $\sum_{k=1}^{K} \pi_{k}=1$. The cluster conditionals are modeled as Gaussians,
$$
p\left(\boldsymbol{x} \mid z_{k}=1\right)=\mathcal{N}\left(\boldsymbol{x} \mid \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right)
$$
The joint then follows as $p\left(\boldsymbol{x}, z_{k}=1\right)=p\left(\boldsymbol{x} \mid z_{k}=1\right) p\left(z_{k}=1\right)=\pi_{k} \mathcal{N}\left(\boldsymbol{x} \mid \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right)$, and the marginal describes the full generative model, called a *mixture of Gaussians*:
$$
p(\mathbf{x})=\sum_{k=1}^{K} \pi_{k} \mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right)
$$
The posterior is then given by [[Bayes Theorem]] as
$$
\begin{aligned}
p\left(z_{k}=1 \mid \boldsymbol{x}\right) &=\frac{p\left(z_{k}=1\right) p\left(\boldsymbol{x} \mid z_{k}=1\right)}{p(\boldsymbol{x})} \\
&=\frac{p\left(z_{k}=1\right) p\left(\boldsymbol{x} \mid z_{k}=1\right)}{\sum_{j} p\left(z_{j}=1\right) p\left(\boldsymbol{x} \mid z_{j}=1\right)} \\
&=\frac{\pi_{k} \mathcal{N}\left(\boldsymbol{x} \mid \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right)}{\sum_{j} \pi_{j} \mathcal{N}\left(\boldsymbol{x} \mid \boldsymbol{\mu}_{j}, \boldsymbol{\Sigma}_{j}\right)}= \gamma\left(z_{k}\right)
\end{aligned}
$$
where $\gamma\left(z_{k}\right)$ is the *responsibility* that component $k$ takes for explaining the data point.

The form of the Gaussian mixture distribution is governed by the parameters $\boldsymbol{\pi}$, $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, where we have used the notation $\boldsymbol{\pi} \equiv\left\{\pi_{1}, \ldots, \pi_{K}\right\}$, $\boldsymbol{\mu} \equiv\left\{\boldsymbol{\mu}_{1}, \ldots, \boldsymbol{\mu}_{K}\right\}$ and $\boldsymbol{\Sigma} \equiv\left\{\boldsymbol{\Sigma}_{1}, \ldots, \boldsymbol{\Sigma}_{K}\right\}$. The log-likelihood is
$$
\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})=\sum_{n=1}^{N} \ln \left\{\sum_{k=1}^{K} \pi_{k} \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right)\right\}
$$
Because of the summation over $k$ inside the logarithm, maximum likelihood no longer has a closed-form analytical solution for the parameters. Maximum likelihood also has further difficulties with mixture models, e.g. singularities where a component collapses onto a single data point, as explained in 9.2 Bishop 2006. We therefore approximate the maximum likelihood solution with the iterative method of [[Expectation Maximization]].
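As a concrete illustration of the model above, here is a minimal NumPy/SciPy sketch that evaluates the mixture density $p(\mathbf{x})$ and the responsibilities $\gamma(z_k)$ for given parameters. The function name `responsibilities` and the toy parameter values are my own choices for the example, not from the source.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    """Return gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / p(x_n) and p(x_n)."""
    N, K = X.shape[0], len(pis)
    weighted = np.zeros((N, K))
    for k in range(K):
        # pi_k * N(x_n | mu_k, Sigma_k) for every data point x_n
        weighted[:, k] = pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
    p_x = weighted.sum(axis=1, keepdims=True)  # mixture density p(x_n)
    return weighted / p_x, p_x.ravel()

# Toy example: a two-component mixture in 2-D
pis = np.array([0.4, 0.6])
mus = np.array([[0.0, 0.0], [3.0, 3.0]])
Sigmas = np.array([np.eye(2), np.eye(2)])
gamma, p_x = responsibilities(np.array([[0.5, 0.2], [2.8, 3.1]]), pis, mus, Sigmas)
print(gamma)  # each row sums to 1; row n gives gamma(z_nk) for point x_n
```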
## Expectation Maximization for Gaussian mixtures

We can find a local maximum of the likelihood by an iterative algorithm: alternate updating the (expected) posterior $\gamma\left(z_{n k}\right)$ and maximizing with respect to $\pi_{k}, \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}$.

We use the useful property of the [[Gaussian Distribution#Some useful facts|density derivative with respect to the mean]],
$$
\frac{\partial}{\partial \boldsymbol{\mu}_{k}} \mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right)=\mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right)\left(\mathbf{x}-\boldsymbol{\mu}_{k}\right)^{T} \boldsymbol{\Sigma}_{k}^{-1}
$$
Maximizing with respect to $\boldsymbol{\mu}_k$,
$$
\begin{aligned}
&\frac{\partial}{\partial \boldsymbol{\mu}_{k}} \sum_{n=1}^{N} \log p\left(\boldsymbol{x}_{n} \mid\left\{\pi_{k}\right\},\left\{\boldsymbol{\mu}_{k}\right\},\left\{\boldsymbol{\Sigma}_{k}\right\}\right) \\
&= \sum_{n=1}^{N} \frac{1}{p\left(\boldsymbol{x}_{n} \mid\left\{\pi_{k}\right\},\left\{\boldsymbol{\mu}_{k}\right\},\left\{\boldsymbol{\Sigma}_{k}\right\}\right)} \frac{\partial}{\partial \boldsymbol{\mu}_{k}} p\left(\boldsymbol{x}_{n} \mid\left\{\pi_{k}\right\},\left\{\boldsymbol{\mu}_{k}\right\},\left\{\boldsymbol{\Sigma}_{k}\right\}\right) \\
&= \sum_{n=1}^{N} \frac{\pi_{k} \mathcal{N}\left(\boldsymbol{x}_{n} \mid \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right)}{\sum_{j=1}^{K} \pi_{j} \mathcal{N}\left(\boldsymbol{x}_{n} \mid \boldsymbol{\mu}_{j}, \boldsymbol{\Sigma}_{j}\right)}\left(\boldsymbol{x}_{n}-\boldsymbol{\mu}_{k}\right)^{T} \boldsymbol{\Sigma}_{k}^{-1} \\
&=\sum_{n=1}^{N} \gamma\left(z_{n k}\right)\left(\boldsymbol{x}_{n}-\boldsymbol{\mu}_{k}\right)^{T} \boldsymbol{\Sigma}_{k}^{-1}
\end{aligned}
$$
Equating it to zero, we get
$$
\boldsymbol{\mu}_{k}=\frac{\sum_{n=1}^{N} \gamma\left(z_{n k}\right) \mathbf{x}_{n}}{\sum_{n=1}^{N} \gamma\left(z_{n k}\right)}
$$
Thus the mean is the weighted average of the points $\boldsymbol{x}_n$ for which cluster $k$ takes responsibility.

Now, maximizing with respect to $\pi_k$, we make use of a [[Lagrange Multipliers|Lagrange multiplier]] $\lambda$ since we have the constraint $\sum_k \pi_k = 1$:
$$
\begin{aligned}
&\frac{\partial}{\partial \pi_{k}}\left(\sum_{n=1}^{N} \log p\left(\boldsymbol{x}_{n} \mid\left\{\pi_{k}\right\},\left\{\boldsymbol{\mu}_{k}\right\},\left\{\boldsymbol{\Sigma}_{k}\right\}\right)+\lambda\left(\sum_{j=1}^{K} \pi_{j}-1\right)\right) \\
&= \sum_{n=1}^{N} \frac{\mathcal{N}\left(\boldsymbol{x}_{n} \mid \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right)}{\sum_{j=1}^{K} \pi_{j} \mathcal{N}\left(\boldsymbol{x}_{n} \mid \boldsymbol{\mu}_{j}, \boldsymbol{\Sigma}_{j}\right)}+\lambda
\end{aligned}
$$
Setting the derivative to zero and multiplying through by $\pi_k$,
$$
\pi_{k}=-\frac{1}{\lambda} \sum_{n=1}^{N} \gamma\left(z_{n k}\right)
$$
Similarly,
$$
\frac{\partial}{\partial \lambda} L\left(\left\{\pi_{k}\right\}, \lambda\right)=\sum_{j=1}^{K} \pi_{j}-1=-\frac{1}{\lambda} \sum_{j=1}^{K} \sum_{n=1}^{N} \gamma\left(z_{n j}\right)-1=0
$$
and since $\sum_{j} \gamma\left(z_{n j}\right)=1$ for every $n$, this gives
$$
\lambda=-N
$$
Thus we have
$$
\pi_{k}=\frac{1}{N} \sum_{n=1}^{N} \gamma\left(z_{n k}\right)
$$
so $\pi_k$ can be interpreted as the fraction of points for which cluster $k$ takes responsibility.

For the M-step, let's define the "effective number of points in cluster $k$" by
$$
N_{k}=\sum_{n=1}^{N} \gamma\left(z_{n k}\right)
$$
The solutions for $\pi_{k}, \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}$ (which depend on the posterior) are
$$
\pi_{k}=\frac{N_{k}}{N}
$$
$$
\boldsymbol{\mu}_{k}=\frac{1}{N_{k}} \sum_{n=1}^{N} \gamma\left(z_{n k}\right) \boldsymbol{x}_{n}
$$
$$
\boldsymbol{\Sigma}_{k}=\frac{1}{N_{k}} \sum_{n=1}^{N} \gamma\left(z_{n k}\right)\left(\boldsymbol{x}_{n}-\boldsymbol{\mu}_{k}\right)\left(\boldsymbol{x}_{n}-\boldsymbol{\mu}_{k}\right)^{\top}
$$
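To make the full procedure concrete, here is a minimal NumPy/SciPy sketch of the EM loop using the update equations derived above. The function name `em_gmm`, the fixed iteration count, the random initialization, and the small diagonal jitter added to each covariance for numerical stability are my own assumptions for the sketch, not part of the source.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """Plain EM for a Gaussian mixture (minimal sketch: no convergence test,
    random initialization instead of K-means)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # 1. Initialize pi_k, mu_k, Sigma_k
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)]
    Sigmas = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(D) for _ in range(K)])

    for _ in range(n_iter):
        # E-step: gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)
        gamma = np.stack(
            [pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k]) for k in range(K)],
            axis=1,
        )
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step: closed-form solutions for N_k, pi_k, mu_k, Sigma_k
        Nk = gamma.sum(axis=0)                 # effective number of points per cluster
        pis = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            # jitter term (1e-6 I) is only for numerical stability, not in the derivation
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)

    return pis, mus, Sigmas, gamma

# Hard assignments from the soft responsibilities: argmax_k gamma(z_nk)
# labels = em_gmm(X, K=3)[3].argmax(axis=1)
```

In practice one would typically reach for a library implementation such as `sklearn.mixture.GaussianMixture`, which adds convergence checks, regularization, and better initialization.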
Thus, to summarize:

1. Initialize with random $\pi_{k}, \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}$.
2. Repeat until convergence:
	1. Update the posterior (Expectation step): $\gamma\left(z_{n k}\right)=\frac{\pi_{k} \mathcal{N}\left(\boldsymbol{x}_{n} \mid \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right)}{\sum_{j=1}^{K} \pi_{j} \mathcal{N}\left(\boldsymbol{x}_{n} \mid \boldsymbol{\mu}_{j}, \boldsymbol{\Sigma}_{j}\right)}$
	2. Update the parameters (Maximization step) with the equations above.
3. To get soft clusters, compute the posterior $\gamma\left(z_{k}\right)$; hard clusters are given by $k=\underset{j=1, \ldots, K}{\operatorname{argmax}}\, \gamma\left(z_{j}\right)$.

## Notes

1. GMM gives soft assignments, in contrast to [[K-Means]].
2. GMM is more flexible because we can model a different covariance per cluster.
3. GMM is slower than K-means. We can use K-means to initialize the cluster means.
4. Same local convergence issues as for K-means.
5. GMM is similar to Quadratic Discriminant Analysis, but the class labels are unknown and we use the EM algorithm for learning.

---

## References

1. Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Sections 2.3.9 and 9.2.