# Probabilistic Generative Models
A model is generative if it places a joint distribution over all observed dimensions of the data. In generative models, we model the class-conditional densities $p\left(\mathbf{x} \mid \mathcal{C}_{k}\right)$, as well as the class priors $p\left(\mathcal{C}_{k}\right)$, and then use these to compute posterior probabilities $p\left(\mathcal{C}_{k} \mid \mathbf{x}\right)$ through [[Bayes Theorem]].
Let's first consider the binary case, i.e. $K=2$.
- Class-conditional density: $p\left(\mathbf{x} \mid C_{k}\right)$
- Prior class probabilities: $p\left(C_{k}\right)$
- Joint distribution: $p\left(\mathbf{x}, C_{k}\right)=p\left(\mathbf{x} \mid C_{k}\right) p\left(C_{k}\right)$
The posterior distribution is then given by
$
\begin{align}
p\left(C_{1} \mid \mathbf{x}\right) &=\frac{p\left(\mathbf{x} \mid C_{1}\right) p\left(C_{1}\right)}{p\left(\mathbf{x} \mid C_{1}\right) p\left(C_{1}\right)+p\left(\mathbf{x} \mid C_{2}\right) p\left(C_{2}\right)} \\
&= \frac{1}{1+\frac{p\left(\mathbf{x} \mid C_{2}\right) p\left(C_{2}\right)}{p\left(\mathbf{x} \mid C_{1}\right) p\left(C_{1}\right)}} \\
&= \frac{1}{1+e^{-a}} \\
&= \sigma(a) \\
\end{align}
$
where $\sigma(a)$ is the _logistic sigmoid_ function and $a=\ln \frac{p\left(\mathbf{x} \mid C_{1}\right) p\left(C_{1}\right)}{p\left(\mathbf{x} \mid C_{2}\right) p\left(C_{2}\right)}$ is the log of the ratio of the probabilities, known as the _log odds_. The inverse of the sigmoid is $a=\ln \frac{\sigma}{1-\sigma}$, the logit.
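As a concrete illustration, the posterior can be computed by evaluating the log odds and passing them through the sigmoid. A minimal sketch in Python/NumPy, assuming two univariate Gaussian class-conditionals with hypothetical means, scales and priors:
```python
import numpy as np
from scipy.stats import norm

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical class-conditional densities and priors
x = 1.5
log_odds = (norm.logpdf(x, loc=2.0, scale=1.0) + np.log(0.4)     # ln p(x|C1) p(C1)
            - norm.logpdf(x, loc=0.0, scale=1.0) - np.log(0.6))  # - ln p(x|C2) p(C2)

print(sigmoid(log_odds))  # p(C1 | x) = sigma(a)
```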
For multiple classes, i.e. general $K$,
$
p\left(C_{k} \mid \mathbf{x}\right)=\frac{p\left(\mathbf{x} \mid C_{k}\right) p\left(C_{k}\right)}{\sum_{j=1}^{K} p\left(\mathbf{x} \mid C_{j}\right) p\left(C_{j}\right)} = \frac{\exp(a_k)}{\sum_{j=1}^{K}\exp(a_j)}
$
which is known as the _softmax function_, with $a_{k}=\ln \left(p\left(\mathbf{x} \mid C_{k}\right) p\left(C_{k}\right)\right)$.
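The same computation for general $K$ amounts to a softmax over the per-class log joint probabilities. A small sketch, again with hypothetical Gaussian class-conditionals and priors:
```python
import numpy as np
from scipy.stats import norm

def softmax(a):
    a = a - np.max(a)            # shift for numerical stability
    e = np.exp(a)
    return e / e.sum()

x = 1.5
means, priors = [0.0, 2.0, 4.0], [0.5, 0.3, 0.2]
# a_k = ln p(x|C_k) + ln p(C_k)
a = np.array([norm.logpdf(x, loc=m, scale=1.0) + np.log(p) for m, p in zip(means, priors)])
print(softmax(a))                # posteriors p(C_k | x), summing to 1
```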
## Continuous inputs: Linear Discriminant Analysis
Let's assume the inputs are continuous and use a [[Gaussian Distribution]] to model the class-conditional densities:
$
p\left(\mathbf{x} \mid C_{k}\right)=\frac{1}{(2 \pi)^{D / 2}} \frac{1}{\left|\mathbf{\Sigma}_{k}\right|^{1 / 2}} \exp \left\{-\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_{k}\right)^{T} \mathbf{\Sigma}_{k}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_{k}\right)\right\}
$
Assuming the classes share the same covariance matrix, $\mathbf{\Sigma}_k = \mathbf{\Sigma}$, this probabilistic generative model is known as _Linear Discriminant Analysis (LDA)_.
For the case of K=2, the class posterior is given as
$
p\left(C_{1} \mid \mathbf{x}\right)=\frac{1}{1+\exp (-a)}=\sigma(a)
$
where
$
\begin{align}
a&=\ln \frac{p\left(\mathbf{x} \mid C_{1}\right) p\left(C_{1}\right)}{p\left(\mathbf{x} \mid C_{2}\right) p\left(C_{2}\right)} \\
&= \ln \mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu}_{1}, \mathbf{\Sigma}\right)-\ln \mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu}_{2}, \mathbf{\Sigma}\right)+\ln \frac{p\left(C_{1}\right)}{p\left(C_{2}\right)} \\
&= -\frac{1}{2} \ln |\boldsymbol{\Sigma}|-\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_{1}\right)^{T} \boldsymbol{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_{1}\right)+\frac{1}{2} \ln |\boldsymbol{\Sigma}|+\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_{2}\right)^{T} \boldsymbol{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_{2}\right)+\ln \frac{p\left(C_{1}\right)}{p\left(C_{2}\right)} \\
&= (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_1^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 + \ln \frac{p\left(C_{1}\right)}{p\left(C_{2}\right)}
\end{align}
$
Writing this in the form of a _generalized linear model_, we have
$
p\left(C_{1} \mid \mathbf{x}\right)=\sigma\left(\mathbf{w}^{T} \mathbf{x}+w_{0}\right)
$
where
$
\begin{array}{l}
\mathbf{w}=\mathbf{\Sigma}^{-1}\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right) \\
w_{0}=-\frac{1}{2} \boldsymbol{\mu}_{1}^{T} \mathbf{\Sigma}^{-1} \boldsymbol{\mu}_{1}+\frac{1}{2} \boldsymbol{\mu}_{2}^{T} \mathbf{\Sigma}^{-1} \boldsymbol{\mu}_{2}+\ln \frac{p\left(C_{1}\right)}{p\left(C_{2}\right)}
\end{array}
$
We see that the quadratic terms in $\mathbf{x}$ from the exponents of the Gaussian densities have cancelled due to the shared covariance matrix, leaving a linear function of $\mathbf{x}$ in the argument of the logistic sigmoid. The decision boundaries correspond to surfaces along which the posterior probabilities $p\left(\mathcal{C}_{k} \mid \mathbf{x}\right)$ are constant, and since these are given by linear functions of $\mathbf{x}$, the decision boundaries are linear in input space. The prior probabilities enter only through the bias parameter $w_0$, so changing the priors produces parallel shifts of the decision boundary and, more generally, of the parallel contours of constant posterior probability.
![[Probabilistic generative models.jpg]]
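To make the linear form concrete, here is a minimal sketch that computes $\mathbf{w}$ and $w_0$ from given Gaussian parameters and evaluates the posterior; the means, covariance and priors are hypothetical placeholders:
```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical shared-covariance Gaussian parameters and priors
mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
prior1, prior2 = 0.5, 0.5

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)
w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
      + 0.5 * mu2 @ Sigma_inv @ mu2
      + np.log(prior1 / prior2))

x = np.array([0.5, 0.2])
print(sigmoid(w @ x + w0))   # p(C1 | x)
```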
For the general case of $K$ classes we have
$
p\left(\mathbf{x} \mid C_{k}\right)=\frac{1}{(2 \pi)^{D / 2}} \frac{1}{|\mathbf{\Sigma}|^{1 / 2}} \exp \left\{-\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_{k}\right)^{T} \mathbf{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_{k}\right)\right\}
$
$
p\left(C_{k} \mid \mathbf{x}\right)=\frac{\exp \left(a_{k}(\mathbf{x})\right)}{\sum_{j=1}^{K} \exp \left(a_{j}(\mathbf{x})\right)}
$
$
a_{k}(\mathbf{x})=\mathbf{w}_{k}^{\mathrm{T}} \mathbf{x}+w_{k 0}
$
where we have
$
\begin{aligned}
\mathbf{w}_{k} &=\boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{k} \\
w_{k 0} &=-\frac{1}{2} \boldsymbol{\mu}_{k}^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{k}+\ln p\left(\mathcal{C}_{k}\right)
\end{aligned}
$
The decision boundary between classes $C_k$ and $C_j$ is given by $p\left(C_{k} \mid \mathbf{x}\right)=p\left(C_{j} \mid \mathbf{x}\right)$, i.e. $a_{k}(\mathbf{x})=a_{j}(\mathbf{x})$, which is again linear in $\mathbf{x}$.
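A sketch of the corresponding $K$-class computation with placeholder parameters; the predicted class is the one with the largest $a_k(\mathbf{x})$, since the softmax is monotonic in each argument:
```python
import numpy as np

def class_scores(x, mus, Sigma, priors):
    """Return a_k(x) = w_k^T x + w_k0 for each class k."""
    Sigma_inv = np.linalg.inv(Sigma)
    scores = []
    for mu, prior in zip(mus, priors):
        w_k = Sigma_inv @ mu
        w_k0 = -0.5 * mu @ Sigma_inv @ mu + np.log(prior)
        scores.append(w_k @ x + w_k0)
    return np.array(scores)

# Hypothetical parameters for K = 3 classes in D = 2 dimensions
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0]), np.array([-1.0, 2.0])]
Sigma = np.eye(2)
priors = [0.4, 0.4, 0.2]

a = class_scores(np.array([1.0, 1.0]), mus, Sigma, priors)
posteriors = np.exp(a - a.max()) / np.exp(a - a.max()).sum()  # softmax
print(posteriors, "predicted class:", int(np.argmax(a)))
```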
If we relax the assumption of a shared covariance matrix and allow each class-conditional density $p(\mathbf{x}|\mathcal{C}_k)$ to have its own covariance matrix $\boldsymbol{\Sigma}_{k}$, then the cancellations no longer occur and we obtain quadratic terms in $\mathbf{x}$, giving rise to a _quadratic discriminant_.
![[LDA and QDA.jpg]]
### Maximum Likelihood Solution
Let's use maximum likelihood to estimate the parameters $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}$ and the priors $p(C_k)$.
Denote $p\left(C_{1}\right)=\pi$ and $p\left(C_{2}\right)=1-\pi$
For this binary classification problem we use the target coding $t_n = 1$ for class $C_1$ and $t_n = 0$ for class $C_2$, so that $t_n$ selects the appropriate class for each data point. Then for $C_1$ we have
$
p\left(\mathbf{x}_{n}, \mathcal{C}_{1}\right)=p\left(\mathcal{C}_{1}\right) p\left(\mathbf{x}_{n} \mid \mathcal{C}_{1}\right)=\pi \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{1}, \mathbf{\Sigma}\right)
$
And for $C_2$
$
p\left(\mathbf{x}_{n}, \mathcal{C}_{2}\right)=p\left(\mathcal{C}_{2}\right) p\left(\mathbf{x}_{n} \mid \mathcal{C}_{2}\right)=(1-\pi) \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{2}, \mathbf{\Sigma}\right)
$
Thus the likelihood is given by
$
\begin{align}
p\left(\mathbf{t}, \mathbf{X} \mid \pi, \boldsymbol{\mu}_{1}, \boldsymbol{\mu}_{2}, \mathbf{\Sigma}\right) &= \prod_{n=1}^{N}\left[\pi \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{1}, \mathbf{\Sigma}\right)\right]^{t_{n}}\left[(1-\pi) \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{2}, \mathbf{\Sigma}\right)\right]^{1-t_{n}}
\end{align}
$
Taking the log of the likelihood,
$
\begin{align}
\ln p\left(\mathbf{t}, \mathbf{X} \mid \pi, \boldsymbol{\mu}_{1}, \boldsymbol{\mu}_{2}, \mathbf{\Sigma}\right) &= \sum_{n=1}^{N}\left\{t_{n} \ln \pi+t_{n} \ln \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{1}, \mathbf{\Sigma}\right)+ \left(1-t_{n}\right) \ln (1-\pi)+\left(1-t_{n}\right) \ln \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{2}, \mathbf{\Sigma}\right)\right\}
\end{align}
$
Taking only the terms that depend on $\pi$, we get
$
\sum_{n=1}^{N}\left\{t_{n} \ln \pi+\left(1-t_{n}\right) \ln (1-\pi)\right\}
$
Taking the derivative with respect to $\pi$, setting it to zero, and rearranging,
$
\pi_{ML}=\frac{1}{N} \sum_{n=1}^{N} t_{n}=\frac{N_{1}}{N}=\frac{N_{1}}{N_{1}+N_{2}}
$
where $N_1$ is the total number of data points in class $C_1$ and $N_2$ the number in class $C_2$.
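For example, with $N_1 = 30$ points in $C_1$ and $N_2 = 70$ in $C_2$, the estimate is simply the empirical class fraction $\pi_{ML} = 30/(30+70) = 0.3$.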
Now, to estimate $\boldsymbol{\mu}_1$, let's pick out the terms of the log likelihood that depend on $\boldsymbol{\mu}_1$ and take the derivative:
$
\begin{align}
\frac{\partial}{\partial \boldsymbol{\mu}_{1}} \sum_{n=1}^{N} t_{n} \ln \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{1}, \mathbf{\Sigma}\right) &= -\frac{1}{2} \frac{\partial}{\partial \boldsymbol{\mu}_{1}} \sum_{n=1}^{N} t_{n}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)^{T} \boldsymbol{\Sigma}^{-1}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)
\end{align}
$
Setting to zero, we have
$
\boldsymbol{\mu}_{1}=\frac{1}{N_{1}} \sum_{n=1}^{N} t_{n} \mathbf{x}_{n}
$
Similarly, $\boldsymbol{\mu}_2$ is given by
$
\boldsymbol{\mu}_{2}=\frac{1}{N_{2}} \sum_{n=1}^{N}\left(1-t_{n}\right) \mathbf{x}_{n}
$
For the maximum likelihood solution of the shared covariance matrix $\mathbf{\Sigma}$, we have
$
\begin{align}
\frac{\partial}{\partial \boldsymbol{\Sigma}} \ln p\left(\mathbf{t}, \mathbf{X} \mid \pi, \boldsymbol{\mu}_{1}, \boldsymbol{\mu}_{2}, \mathbf{\Sigma}\right)&=\frac{\partial}{\partial \boldsymbol{\Sigma}}\left[-\frac{N}{2} \ln |\boldsymbol{\Sigma}|-\frac{1}{2} \sum_{n=1}^{N} t_{n}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)^{T} \boldsymbol{\Sigma}^{-1}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)\right.\\
&\qquad \left.-\frac{1}{2} \sum_{n=1}^{N}\left(1-t_{n}\right)\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{2}\right)^{T} \boldsymbol{\Sigma}^{-1}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{2}\right)\right] = 0
\end{align}
$
Rearranging, we obtain $\boldsymbol{\Sigma} = \mathbf{S}$, where
$
\begin{aligned}
\mathbf{S} &=\frac{N_{1}}{N} \mathbf{S}_{1}+\frac{N_{2}}{N} \mathbf{S}_{2} \\
\mathbf{S}_{1} &=\frac{1}{N_{1}} \sum_{n \in \mathcal{C}_{1}}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)^{\mathrm{T}} \\
\mathbf{S}_{2} &=\frac{1}{N_{2}} \sum_{n \in \mathcal{C}_{2}}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{2}\right)\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{2}\right)^{\mathrm{T}}
\end{aligned}
$
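Putting the estimators together, here is a minimal sketch of the maximum likelihood fit on labelled binary data; the synthetic data generation is purely illustrative:
```python
import numpy as np

def fit_lda(X, t):
    """ML estimates for binary LDA: t[n] = 1 for C1, 0 for C2."""
    N1, N2 = t.sum(), (1 - t).sum()
    pi = N1 / (N1 + N2)
    mu1 = (t[:, None] * X).sum(axis=0) / N1
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2
    S1 = (X[t == 1] - mu1).T @ (X[t == 1] - mu1) / N1
    S2 = (X[t == 0] - mu2).T @ (X[t == 0] - mu2) / N2
    Sigma = (N1 / (N1 + N2)) * S1 + (N2 / (N1 + N2)) * S2  # S = (N1/N) S1 + (N2/N) S2
    return pi, mu1, mu2, Sigma

# Illustrative synthetic data
rng = np.random.default_rng(0)
X1 = rng.multivariate_normal([1.0, 1.0], np.eye(2), size=30)
X2 = rng.multivariate_normal([-1.0, 0.0], np.eye(2), size=70)
X = np.vstack([X1, X2])
t = np.concatenate([np.ones(30), np.zeros(70)])

pi, mu1, mu2, Sigma = fit_lda(X, t)
print(pi, mu1, mu2)
```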
### Prediction
For a new data point $\mathbf{x}'$,
$
\begin{align}
p\left(C_{1} \mid \mathbf{x}^{\prime}\right)=\sigma\left(\mathbf{w}_{\mathrm{ML}}^{T} \mathbf{x}^{\prime}+w_{0, \mathrm{ML}}\right)
\end{align}
$
where,
$
\begin{aligned}
\mathbf{w}_{\mathrm{ML}} &=\mathbf{\Sigma}_{\mathrm{ML}}^{-1}\left(\boldsymbol{\mu}_{1, \mathrm{ML}}-\boldsymbol{\mu}_{2, \mathrm{ML}}\right) \\
w_{0, \mathrm{ML}} &=-\frac{1}{2} \boldsymbol{\mu}_{1, \mathrm{ML}}^{T} \mathbf{\Sigma}_{\mathrm{ML}}^{-1} \boldsymbol{\mu}_{1, \mathrm{ML}}+\frac{1}{2} \boldsymbol{\mu}_{2, \mathrm{ML}}^{T} \mathbf{\Sigma}_{\mathrm{ML}}^{-1} \boldsymbol{\mu}_{2, \mathrm{ML}}+\ln \frac{\pi_{\mathrm{ML}}}{1-\pi_{\mathrm{ML}}}
\end{aligned}
$
and we assign $\mathbf{x}'$ to $C_1$ if $p\left(C_{1} \mid \mathbf{x}^{\prime}\right) \geq \frac{1}{2}$.
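A corresponding prediction sketch that plugs the ML estimates into $\mathbf{w}$ and $w_0$; the numerical values below are hypothetical stand-ins for the output of a fit such as the one sketched above:
```python
import numpy as np

def predict_c1_posterior(x_new, pi, mu1, mu2, Sigma):
    """Plug the ML estimates into w and w0 and return p(C1 | x')."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(pi / (1 - pi)))
    return 1.0 / (1.0 + np.exp(-(w @ x_new + w0)))

# Hypothetical ML estimates (e.g. from the fit sketch above)
pi, mu1, mu2, Sigma = 0.3, np.array([1.0, 1.0]), np.array([-1.0, 0.0]), np.eye(2)
p = predict_c1_posterior(np.array([0.5, 0.5]), pi, mu1, mu2, Sigma)
print(p, "-> C1" if p >= 0.5 else "-> C2")
```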
### Disadvantages of LDA
1. The Gaussian class-conditional model is sensitive to outliers
2. Linearity and reliance on handcrafted features restrict its applicability
3. Maximum likelihood estimation is prone to overfitting
## Discrete inputs: Naive Bayes
Let's consider discrete feature vectors $\mathbf{x}_{n}=\left(x_{1}, \ldots, x_{D}\right)^{T}$ with $x_{i} \in\{0,1\}$. A general distribution over the $2^D$ possible binary vectors requires $2^{D}-1$ independent parameters per class: one probability for each configuration, minus one for the normalization constraint.
The likelihood can be expressed as (with $\lambda$ denoting the model parameters and $t_{nk}$ the one-of-$K$ target coding)
$
\begin{align}
p(\mathbf{T}, \mathbf{X} \mid \lambda)&=\prod_{n=1}^{N} p\left(t_{n}, x_{n} \mid \lambda\right) \\
&=\prod_{n=1}^{N} p\left(t_{n} \mid \lambda\right) \cdot p\left(x_{n} \mid t_{n}, \lambda\right) \\
&=\prod_{n=1}^{N} \prod_{k=1}^{K}\left[p\left(\mathcal{C}_{k} \mid \lambda\right) \cdot p\left(x_{n} \mid \mathcal{C}_{k}, \lambda\right)\right]^{t_{n k}}
\end{align}
$
To obtain a more restricted representation, we make the _naive Bayes_ assumption that the feature values are independent **when conditioned on the class $C_k$**:
$
\begin{align}
&=\prod_{n=1}^{N} \prod_{k=1}^{K}\left[p\left(\mathcal{C}_{k}\right) \cdot \prod_{d=1}^{D} p\left(x_{n d} \mid \mathcal{C}_{k}, \lambda_{d k}\right)\right]^{t_{n k}} \\
&= \prod_{n=1}^{N} \prod_{k=1}^{K}\left[\pi_{k} \cdot \prod_{d=1}^{D} p\left(x_{n d} \mid \mathcal{C}_{k}, \lambda_{d k}\right)\right]^{t_{n k}}
\end{align}
$
In this case, the number of parameters per class is $D$, so the total number of parameters is $K \cdot D$. Note again that the features are not marginally independent, only conditionally independent given the class.
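For example, with $D = 10$ binary features and $K = 2$ classes, a full joint model over the features would need $K(2^{D}-1) = 2 \times 1023 = 2046$ parameters, whereas the naive Bayes factorization needs only $K \cdot D = 20$.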
For binary features, we can model each class-conditional with a Bernoulli distribution:
$
p\left(\mathbf{x} \mid \mathcal{C}_{k}\right)=\prod_{i=1}^{D} \mu_{k i}^{x_{i}}\left(1-\mu_{k i}\right)^{1-x_{i}}
$
The posterior class probabilities then take the form
$
p\left(C_{k} \mid \mathbf{x}\right)=\frac{\exp \left(a_{k}(\mathbf{x})\right)}{\sum_{j=1}^{K} \exp \left(a_{j}(\mathbf{x})\right)}
$
where,
$
\begin{align}
a_{k}(\mathbf{x})&=\ln p\left(\mathbf{x} \mid C_{k}\right) p\left(C_{k}\right)\\
&= \sum_{i=1}^{D}\left\{x_{i} \ln \mu_{k i}+\left(1-x_{i}\right) \ln \left(1-\mu_{k i}\right)\right\}+\ln p\left(\mathcal{C}_{k}\right) \\
\end{align}
$
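A minimal Bernoulli naive Bayes sketch that evaluates these posteriors; the per-class parameters $\mu_{ki}$ and priors below are hypothetical:
```python
import numpy as np

def naive_bayes_posteriors(x, mu, priors):
    """x: (D,) binary vector; mu: (K, D) Bernoulli parameters; priors: (K,)."""
    # a_k(x) = sum_i [ x_i ln mu_ki + (1 - x_i) ln(1 - mu_ki) ] + ln p(C_k)
    a = (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1) + np.log(priors)
    a = a - a.max()                     # stabilise the softmax
    return np.exp(a) / np.exp(a).sum()  # p(C_k | x)

# Hypothetical parameters: K = 2 classes, D = 4 binary features
mu = np.array([[0.9, 0.7, 0.2, 0.1],
               [0.2, 0.3, 0.8, 0.6]])
priors = np.array([0.4, 0.6])
x = np.array([1, 1, 0, 0])

print(naive_bayes_posteriors(x, mu, priors))
```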
## References
1. Bishop 4.2
2. http://mlss2018.net.ar/slides/Adams-1.pdf