# Probabilistic Generative Models
A model is generative if it places a joint distribution over all observed dimensions of the data. In generative models, we model the class-conditional densities $p\left(\mathbf{x} \mid \mathcal{C}_{k}\right)$, as well as the class priors $p\left(\mathcal{C}_{k}\right)$, and then use these to compute posterior probabilities $p\left(\mathcal{C}_{k} \mid \mathbf{x}\right)$ through [[Bayes Theorem]].
Let's first consider the binary case, i.e. $K=2$.
- Class-conditional density: $p\left(\mathbf{x} \mid C_{k}\right)$
- Prior class probabilities: $p\left(C_{k}\right)$
- Joint distribution: $p\left(\mathbf{x}, C_{k}\right)=p\left(\mathbf{x} \mid C_{k}\right) p\left(C_{k}\right)$
The posterior distribution is then given by
$
\begin{align}
p\left(C_{1} \mid \mathbf{x}\right) &=\frac{p\left(\mathbf{x} \mid C_{1}\right) p\left(C_{1}\right)}{p\left(\mathbf{x} \mid C_{1}\right) p\left(C_{1}\right)+p\left(\mathbf{x} \mid C_{2}\right) p\left(C_{2}\right)} \\
&= \frac{1}{1+\frac{p\left(\mathbf{x} \mid C_{2}\right) p\left(C_{2}\right)}{p\left(\mathbf{x} \mid C_{1}\right) p\left(C_{1}\right)}} \\
&= \frac{1}{1+e^{-a}} \\
&= \sigma(a) \\
\end{align}
$
where $\sigma(a)$ is the _logistic sigmoid_ function and $a=\ln \frac{p\left(\mathbf{x} \mid C_{1}\right) p\left(C_{1}\right)}{p\left(\mathbf{x} \mid C_{2}\right) p\left(C_{2}\right)}$ is the log of the ratio of the probabilities, known as the _log odds_. The inverse of the sigmoid is $a=\ln \frac{\sigma}{1-\sigma}$, the logit.
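As a concrete illustration, the posterior can be computed by evaluating the log odds and passing them through the sigmoid. A minimal sketch in Python/NumPy, assuming two univariate Gaussian class-conditionals with hypothetical means, scales and priors:
```python
import numpy as np
from scipy.stats import norm

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical class-conditional densities and priors
x = 1.5
log_odds = (norm.logpdf(x, loc=2.0, scale=1.0) + np.log(0.4)     # ln p(x|C1) p(C1)
            - norm.logpdf(x, loc=0.0, scale=1.0) - np.log(0.6))  # - ln p(x|C2) p(C2)

print(sigmoid(log_odds))  # p(C1 | x) = sigma(a)
```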
For multiple classes, i.e. general $K$,
$
p\left(C_{k} \mid \mathbf{x}\right)=\frac{p\left(\mathbf{x} \mid C_{k}\right) p\left(C_{k}\right)}{\sum_{j=1}^{K} p\left(\mathbf{x} \mid C_{j}\right) p\left(C_{j}\right)} = \frac{\exp(a_k)}{\sum_{j=1}^{K}\exp(a_j)}
$
which is known as the _softmax function_, with $a_{k}=\ln \left(p\left(\mathbf{x} \mid C_{k}\right) p\left(C_{k}\right)\right)$.
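The same computation for general $K$ amounts to a softmax over the per-class log joint probabilities. A small sketch, again with hypothetical Gaussian class-conditionals and priors:
```python
import numpy as np
from scipy.stats import norm

def softmax(a):
    a = a - np.max(a)            # shift for numerical stability
    e = np.exp(a)
    return e / e.sum()

x = 1.5
means, priors = [0.0, 2.0, 4.0], [0.5, 0.3, 0.2]
# a_k = ln p(x|C_k) + ln p(C_k)
a = np.array([norm.logpdf(x, loc=m, scale=1.0) + np.log(p) for m, p in zip(means, priors)])
print(softmax(a))                # posteriors p(C_k | x), summing to 1
```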
## Continuous inputs: Linear Discriminant Analysis
Let's assume the inputs are continuous and use a [[Gaussian Distribution]] to model the class-conditional densities:
$
p\left(\mathbf{x} \mid C_{k}\right)=\frac{1}{(2 \pi)^{D / 2}} \frac{1}{\left|\mathbf{\Sigma}_{k}\right|^{1 / 2}} \exp \left\{-\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_{k}\right)^{T} \mathbf{\Sigma}_{k}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_{k}\right)\right\}
$
Assuming the classes share the same covariance matrix, $\mathbf{\Sigma}_k = \mathbf{\Sigma}$, this probabilistic generative model is known as _Linear Discriminant Analysis (LDA)_.
For the case of K=2, the class posterior is given as
$
p\left(C_{1} \mid \mathbf{x}\right)=\frac{1}{1+\exp (-a)}=\sigma(a)
$
where
$
\begin{align}
a&=\ln \frac{p\left(\mathbf{x} \mid C_{1}\right) p\left(C_{1}\right)}{p\left(\mathbf{x} \mid C_{2}\right) p\left(C_{2}\right)} \\
&= \ln \mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu}_{1}, \mathbf{\Sigma}\right)-\ln \mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu}_{2}, \mathbf{\Sigma}\right)+\ln \frac{p\left(C_{1}\right)}{p\left(C_{2}\right)} \\
&= -\frac{1}{2} \ln |\boldsymbol{\Sigma}|-\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_{1}\right)^{T} \boldsymbol{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_{1}\right)+\frac{1}{2} \ln |\boldsymbol{\Sigma}|+\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_{2}\right)^{T} \boldsymbol{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_{2}\right)+\ln \frac{p\left(C_{1}\right)}{p\left(C_{2}\right)} \\
&= (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_1^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 + \ln \frac{p\left(C_{1}\right)}{p\left(C_{2}\right)}
\end{align}
$
Writing this in the form of a _generalized linear model_, we have
$
p\left(C_{1} \mid \mathbf{x}\right)=\sigma\left(\mathbf{w}^{T} \mathbf{x}+w_{0}\right)
$
where
$
\begin{array}{l}
\mathbf{w}=\mathbf{\Sigma}^{-1}\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right) \\
w_{0}=-\frac{1}{2} \boldsymbol{\mu}_{1}^{T} \mathbf{\Sigma}^{-1} \boldsymbol{\mu}_{1}+\frac{1}{2} \boldsymbol{\mu}_{2}^{T} \mathbf{\Sigma}^{-1} \boldsymbol{\mu}_{2}+\ln \frac{p\left(C_{1}\right)}{p\left(C_{2}\right)}
\end{array}
$
We see that the quadratic terms in $\mathbf{x}$ from the exponents of the Gaussian densities have cancelled due to the shared covariance matrix, leaving a linear function of $\mathbf{x}$ in the argument of the logistic sigmoid. The decision boundaries correspond to surfaces along which the posterior probabilities $p\left(\mathcal{C}_{k} \mid \mathbf{x}\right)$ are constant, and since these are given by linear functions of $\mathbf{x}$, the decision boundaries are linear in input space. The prior probabilities enter only through the bias parameter $w_0$, so changing the priors produces parallel shifts of the decision boundary and, more generally, of the parallel contours of constant posterior probability.
![[Probabilistic generative models.jpg]]
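To make the linear form concrete, here is a minimal sketch that computes $\mathbf{w}$ and $w_0$ from given Gaussian parameters and evaluates the posterior; the means, covariance and priors are hypothetical placeholders:
```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical shared-covariance Gaussian parameters and priors
mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
prior1, prior2 = 0.5, 0.5

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)
w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
      + 0.5 * mu2 @ Sigma_inv @ mu2
      + np.log(prior1 / prior2))

x = np.array([0.5, 0.2])
print(sigmoid(w @ x + w0))   # p(C1 | x)
```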
For the general case of $K$ classes we have
$
p\left(\mathbf{x} \mid C_{k}\right)=\frac{1}{(2 \pi)^{D / 2}} \frac{1}{|\mathbf{\Sigma}|^{1 / 2}} \exp \left\{-\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_{k}\right)^{T} \mathbf{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_{k}\right)\right\}
$
$
p\left(C_{k} \mid \mathbf{x}\right)=\frac{\exp \left(a_{k}(\mathbf{x})\right)}{\sum_{j=1}^{K} \exp \left(a_{j}(\mathbf{x})\right)}
$
$
a_{k}(\mathbf{x})=\mathbf{w}_{k}^{\mathrm{T}} \mathbf{x}+w_{k 0}
$
where we have
$
\begin{aligned}
\mathbf{w}_{k} &=\boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{k} \\
w_{k 0} &=-\frac{1}{2} \boldsymbol{\mu}_{k}^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{k}+\ln p\left(\mathcal{C}_{k}\right)
\end{aligned}
$
The decision boundary between classes $C_k$ and $C_j$ is given by $p\left(C_{k} \mid \mathbf{x}\right)=p\left(C_{j} \mid \mathbf{x}\right)$, i.e. $a_{k}(\mathbf{x})=a_{j}(\mathbf{x})$, which is again linear in $\mathbf{x}$.
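A sketch of the corresponding $K$-class computation with placeholder parameters; the predicted class is the one with the largest $a_k(\mathbf{x})$, since the softmax is monotonic in each argument:
```python
import numpy as np

def class_scores(x, mus, Sigma, priors):
    """Return a_k(x) = w_k^T x + w_k0 for each class k."""
    Sigma_inv = np.linalg.inv(Sigma)
    scores = []
    for mu, prior in zip(mus, priors):
        w_k = Sigma_inv @ mu
        w_k0 = -0.5 * mu @ Sigma_inv @ mu + np.log(prior)
        scores.append(w_k @ x + w_k0)
    return np.array(scores)

# Hypothetical parameters for K = 3 classes in D = 2 dimensions
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0]), np.array([-1.0, 2.0])]
Sigma = np.eye(2)
priors = [0.4, 0.4, 0.2]

a = class_scores(np.array([1.0, 1.0]), mus, Sigma, priors)
posteriors = np.exp(a - a.max()) / np.exp(a - a.max()).sum()  # softmax
print(posteriors, "predicted class:", int(np.argmax(a)))
```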
If we relax the assumption of a shared covariance matrix and allow each class-conditional density $p(\mathbf{x}|\mathcal{C}_k)$ to have its own covariance matrix $\boldsymbol{\Sigma}_{k}$, then the cancellations no longer occur and we obtain quadratic terms in $\mathbf{x}$, giving rise to a _quadratic discriminant_.
![[LDA and QDA.jpg]]
### Maximum Likelihood Solution
Let's use maximum likelihood to estimate the parameters $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}$ and the priors $p(C_k)$.
Denote $p\left(C_{1}\right)=\pi$ and $p\left(C_{2}\right)=1-\pi$
For this binary classification problem we use the target coding $t_n = 1$ for class $C_1$ and $t_n = 0$ for class $C_2$, so that $t_n$ selects the appropriate class for each data point. Then for $C_1$ we have
$
p\left(\mathbf{x}_{n}, \mathcal{C}_{1}\right)=p\left(\mathcal{C}_{1}\right) p\left(\mathbf{x}_{n} \mid \mathcal{C}_{1}\right)=\pi \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{1}, \mathbf{\Sigma}\right)
$
And for $C_2$
$
p\left(\mathbf{x}_{n}, \mathcal{C}_{2}\right)=p\left(\mathcal{C}_{2}\right) p\left(\mathbf{x}_{n} \mid \mathcal{C}_{2}\right)=(1-\pi) \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{2}, \mathbf{\Sigma}\right)
$
Thus the likelihood is given by
$
\begin{align}
p\left(\mathbf{t}, \mathbf{X} \mid \pi, \boldsymbol{\mu}_{1}, \boldsymbol{\mu}_{2}, \mathbf{\Sigma}\right) &= \prod_{n=1}^{N}\left[\pi \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{1}, \mathbf{\Sigma}\right)\right]^{t_{n}}\left[(1-\pi) \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{2}, \mathbf{\Sigma}\right)\right]^{1-t_{n}}
\end{align}
$
Taking the log of the likelihood,
$
\begin{align}
\ln p\left(\mathbf{t}, \mathbf{X} \mid \pi, \boldsymbol{\mu}_{1}, \boldsymbol{\mu}_{2}, \mathbf{\Sigma}\right) &= \sum_{n=1}^{N}\left\{t_{n} \ln \pi+t_{n} \ln \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{1}, \mathbf{\Sigma}\right)+ \left(1-t_{n}\right) \ln (1-\pi)+\left(1-t_{n}\right) \ln \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{2}, \mathbf{\Sigma}\right)\right\}
\end{align}
$
Taking only the terms that depend on $\pi$, we get
$
\sum_{n=1}^{N}\left\{t_{n} \ln \pi+\left(1-t_{n}\right) \ln (1-\pi)\right\}
$
Taking the derivative with respect to $\pi$, setting it to zero, and rearranging,
$
\pi_{ML}=\frac{1}{N} \sum_{n=1}^{N} t_{n}=\frac{N_{1}}{N}=\frac{N_{1}}{N_{1}+N_{2}}
$
where $N_1$ is the total number of data points in class $C_1$ and $N_2$ the number in class $C_2$.
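For example, with $N_1 = 30$ points in $C_1$ and $N_2 = 70$ in $C_2$, the estimate is simply the empirical class fraction $\pi_{ML} = 30/(30+70) = 0.3$.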
Now, to estimate $\boldsymbol{\mu}_1$, let's pick out the terms of the log likelihood that depend on $\boldsymbol{\mu}_1$ and take the derivative:
$
\begin{align}
\frac{\partial}{\partial \boldsymbol{\mu}_{1}} \sum_{n=1}^{N} t_{n} \ln \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{1}, \mathbf{\Sigma}\right) &= -\frac{1}{2} \frac{\partial}{\partial \boldsymbol{\mu}_{1}} \sum_{n=1}^{N} t_{n}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)^{T} \boldsymbol{\Sigma}^{-1}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)
\end{align}
$
Setting to zero, we have
$
\boldsymbol{\mu}_{1}=\frac{1}{N_{1}} \sum_{n=1}^{N} t_{n} \mathbf{x}_{n}
$
Similarly, $\boldsymbol{\mu}_2$ is given by
$
\boldsymbol{\mu}_{2}=\frac{1}{N_{2}} \sum_{n=1}^{N}\left(1-t_{n}\right) \mathbf{x}_{n}
$
For the maximum likelihood solution of the shared covariance matrix $\mathbf{\Sigma}$, we have
$
\begin{align}
\frac{\partial}{\partial \boldsymbol{\Sigma}} \ln p\left(\mathbf{t}, \mathbf{X} \mid \pi, \boldsymbol{\mu}_{1}, \boldsymbol{\mu}_{2}, \mathbf{\Sigma}\right)&=\frac{\partial}{\partial \boldsymbol{\Sigma}}\left[-\frac{N}{2} \ln |\boldsymbol{\Sigma}|-\frac{1}{2} \sum_{n=1}^{N} t_{n}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)^{T} \boldsymbol{\Sigma}^{-1}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)\right.\\
&\qquad \left.-\frac{1}{2} \sum_{n=1}^{N}\left(1-t_{n}\right)\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{2}\right)^{T} \boldsymbol{\Sigma}^{-1}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{2}\right)\right] = 0
\end{align}
$
Rearranging, we obtain $\boldsymbol{\Sigma} = \mathbf{S}$, where
$
\begin{aligned}
\mathbf{S} &=\frac{N_{1}}{N} \mathbf{S}_{1}+\frac{N_{2}}{N} \mathbf{S}_{2} \\
\mathbf{S}_{1} &=\frac{1}{N_{1}} \sum_{n \in \mathcal{C}_{1}}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)^{\mathrm{T}} \\
\mathbf{S}_{2} &=\frac{1}{N_{2}} \sum_{n \in \mathcal{C}_{2}}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{2}\right)\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{2}\right)^{\mathrm{T}}
\end{aligned}
$
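Putting the estimators together, here is a minimal sketch of the maximum likelihood fit on labelled binary data; the synthetic data generation is purely illustrative:
```python
import numpy as np

def fit_lda(X, t):
    """ML estimates for binary LDA: t[n] = 1 for C1, 0 for C2."""
    N1, N2 = t.sum(), (1 - t).sum()
    pi = N1 / (N1 + N2)
    mu1 = (t[:, None] * X).sum(axis=0) / N1
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2
    S1 = (X[t == 1] - mu1).T @ (X[t == 1] - mu1) / N1
    S2 = (X[t == 0] - mu2).T @ (X[t == 0] - mu2) / N2
    Sigma = (N1 / (N1 + N2)) * S1 + (N2 / (N1 + N2)) * S2  # S = (N1/N) S1 + (N2/N) S2
    return pi, mu1, mu2, Sigma

# Illustrative synthetic data
rng = np.random.default_rng(0)
X1 = rng.multivariate_normal([1.0, 1.0], np.eye(2), size=30)
X2 = rng.multivariate_normal([-1.0, 0.0], np.eye(2), size=70)
X = np.vstack([X1, X2])
t = np.concatenate([np.ones(30), np.zeros(70)])

pi, mu1, mu2, Sigma = fit_lda(X, t)
print(pi, mu1, mu2)
```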
### Prediction
For a new data point $\mathbf{x}'$,
$
\begin{align}
p\left(C_{1} \mid \mathbf{x}^{\prime}\right)=\sigma\left(\mathbf{w}_{\mathrm{ML}}^{T} \mathbf{x}^{\prime}+w_{0, \mathrm{ML}}\right)
\end{align}
$
where,
$
\begin{aligned}
\mathbf{w}_{\mathrm{ML}} &=\mathbf{\Sigma}_{\mathrm{ML}}^{-1}\left(\boldsymbol{\mu}_{1, \mathrm{ML}}-\boldsymbol{\mu}_{2, \mathrm{ML}}\right) \\
w_{0, \mathrm{ML}} &=-\frac{1}{2} \boldsymbol{\mu}_{1, \mathrm{ML}}^{T} \mathbf{\Sigma}_{\mathrm{ML}}^{-1} \boldsymbol{\mu}_{1, \mathrm{ML}}+\frac{1}{2} \boldsymbol{\mu}_{2, \mathrm{ML}}^{T} \mathbf{\Sigma}_{\mathrm{ML}}^{-1} \boldsymbol{\mu}_{2, \mathrm{ML}}+\ln \frac{\pi_{\mathrm{ML}}}{1-\pi_{\mathrm{ML}}}
\end{aligned}
$
and we assign $\mathbf{x}'$ to $C_1$ if $p\left(C_{1} \mid \mathbf{x}^{\prime}\right) \geq \frac{1}{2}$.
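A corresponding prediction sketch that plugs the ML estimates into $\mathbf{w}$ and $w_0$; the numerical values below are hypothetical stand-ins for the output of a fit such as the one sketched above:
```python
import numpy as np

def predict_c1_posterior(x_new, pi, mu1, mu2, Sigma):
    """Plug the ML estimates into w and w0 and return p(C1 | x')."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(pi / (1 - pi)))
    return 1.0 / (1.0 + np.exp(-(w @ x_new + w0)))

# Hypothetical ML estimates (e.g. from the fit sketch above)
pi, mu1, mu2, Sigma = 0.3, np.array([1.0, 1.0]), np.array([-1.0, 0.0]), np.eye(2)
p = predict_c1_posterior(np.array([0.5, 0.5]), pi, mu1, mu2, Sigma)
print(p, "-> C1" if p >= 0.5 else "-> C2")
```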
### Disadvantages of LDA
1. The Gaussian class-conditional model is sensitive to outliers
2. Linearity and reliance on handcrafted features restrict its applicability
3. Maximum likelihood estimation is prone to overfitting
## Discrete inputs: Naive Bayes
Let's consider discrete feature vectors $\mathbf{x}_{n}=\left(x_{1}, \ldots, x_{D}\right)^{T}$ with $x_{i} \in\{0,1\}$. A general distribution over the $2^D$ possible binary vectors requires $2^{D}-1$ independent parameters per class: one probability for each configuration, minus one for the normalization constraint.
The likelihood can be expressed as (with $\lambda$ denoting the model parameters and $t_{nk}$ the one-of-$K$ target coding)
$
\begin{align}
p(\mathbf{T}, \mathbf{X} \mid \lambda)&=\prod_{n=1}^{N} p\left(t_{n}, x_{n} \mid \lambda\right) \\
&=\prod_{n=1}^{N} p\left(t_{n} \mid \lambda\right) \cdot p\left(x_{n} \mid t_{n}, \lambda\right) \\
&=\prod_{n=1}^{N} \prod_{k=1}^{K}\left[p\left(\mathcal{C}_{k} \mid \lambda\right) \cdot p\left(x_{n} \mid \mathcal{C}_{k}, \lambda\right)\right]^{t_{n k}}
\end{align}
$
To obtain a more restricted representation, we make the _naive Bayes_ assumption that the feature values are independent **when conditioned on the class $C_k$**:
$
\begin{align}
&=\prod_{n=1}^{N} \prod_{k=1}^{K}\left[p\left(\mathcal{C}_{k}\right) \cdot \prod_{d=1}^{D} p\left(x_{n d} \mid \mathcal{C}_{k}, \lambda_{d k}\right)\right]^{t_{n k}} \\
&= \prod_{n=1}^{N} \prod_{k=1}^{K}\left[\pi_{k} \cdot \prod_{d=1}^{D} p\left(x_{n d} \mid \mathcal{C}_{k}, \lambda_{d k}\right)\right]^{t_{n k}}
\end{align}
$
In this case, the number of parameters per class is $D$, so the total number of parameters is $K \cdot D$. Note again that the features are not marginally independent, only conditionally independent given the class.
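For example, with $D = 10$ binary features and $K = 2$ classes, a full joint model over the features would need $K(2^{D}-1) = 2 \times 1023 = 2046$ parameters, whereas the naive Bayes factorization needs only $K \cdot D = 20$.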
For binary features, we can model each class-conditional with a Bernoulli distribution:
$
p\left(\mathbf{x} \mid \mathcal{C}_{k}\right)=\prod_{i=1}^{D} \mu_{k i}^{x_{i}}\left(1-\mu_{k i}\right)^{1-x_{i}}
$
The posterior class probabilities then take the form
$
p\left(C_{k} \mid \mathbf{x}\right)=\frac{\exp \left(a_{k}(\mathbf{x})\right)}{\sum_{j=1}^{K} \exp \left(a_{j}(\mathbf{x})\right)}
$
where,
$
\begin{align}
a_{k}(\mathbf{x})&=\ln p\left(\mathbf{x} \mid C_{k}\right) p\left(C_{k}\right)\\
&= \sum_{i=1}^{D}\left\{x_{i} \ln \mu_{k i}+\left(1-x_{i}\right) \ln \left(1-\mu_{k i}\right)\right\}+\ln p\left(\mathcal{C}_{k}\right) \\
\end{align}
$
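A minimal Bernoulli naive Bayes sketch that evaluates these posteriors; the per-class parameters $\mu_{ki}$ and priors below are hypothetical:
```python
import numpy as np

def naive_bayes_posteriors(x, mu, priors):
    """x: (D,) binary vector; mu: (K, D) Bernoulli parameters; priors: (K,)."""
    # a_k(x) = sum_i [ x_i ln mu_ki + (1 - x_i) ln(1 - mu_ki) ] + ln p(C_k)
    a = (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1) + np.log(priors)
    a = a - a.max()                     # stabilise the softmax
    return np.exp(a) / np.exp(a).sum()  # p(C_k | x)

# Hypothetical parameters: K = 2 classes, D = 4 binary features
mu = np.array([[0.9, 0.7, 0.2, 0.1],
               [0.2, 0.3, 0.8, 0.6]])
priors = np.array([0.4, 0.6])
x = np.array([1, 1, 0, 0])

print(naive_bayes_posteriors(x, mu, priors))
```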
## References
1. Bishop 4.2
2. http://mlss2018.net.ar/slides/Adams-1.pdf