# Why Generative Models
[[Discriminant Functions|Discriminative models]] frame the problem as: given an individual input $x$, predict
- the correct label (classification)
- the correct score (regression)
They are optimized by maximizing the probability of _individual_ targets.
In [[Probabilistic Generative Models]], we model the data _jointly_, i.e., we want to know the distribution of the data. For instance, we want to know how likely $x_a$ is, or whether it is more likely than $x_b$.
![[data-distribution.jpg]]
## Why/when to learn a distribution?
- Density estimation: estimate the probability of $x$
- Sampling: generate new plausible $x$, e.g., [[Model Based Reinforcement Learning]]
- Structure/representation learning: learn good features of $x$ without supervision
- Generative models are widely used to pretrain for downstream tasks
- Generative models can help ensure generalization, e.g., [[Model Based Reinforcement Learning]], [[Semi Supervised Learning]], simulations
## The world as a distribution
![[world-distribution.jpg]]
## Challenges
We are interested in parametric models $p_{\theta}$ from a family of models $\mathcal{M}$. This raises several questions:
1. How to pick the right family of models $\mathcal{M}$?
2. How to know which model $p_{\theta}$ from $\mathcal{M}$ is a good one?
3. How to learn/optimize our models from the family $\mathcal{M}$?
## Properties for modelling distributions
We want to learn distributions $p_{\theta}(x)$.
Our model must therefore have the following properties:
- Non-negativity: $p_{\theta}(x) \geq 0 \;\; \forall x$
- Normalization: probabilities of all events must sum up to 1: $\int_{x} p_{\theta}(x)\, dx = 1$
Summing up to 1 (normalization) makes sure predictions can only improve relative to each other (see the sketch below):
- The model cannot trivially get better scores by predicting higher numbers
- The pie remains the same -> the model is forced to make non-trivial improvements
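A minimal numerical sketch of this point (the scores below are made up): scaling every unnormalized score by a constant changes nothing once we divide by the total.
```python
import numpy as np

# Hypothetical unnormalized scores g_theta(x) for four candidate inputs.
scores = np.array([3.0, 1.0, 4.0, 2.0])

# A model could trivially push its raw scores up by a constant factor...
inflated = 10.0 * scores

# ...but after normalization the resulting probabilities are identical.
p = scores / scores.sum()
p_inflated = inflated / inflated.sum()

print(p)                           # [0.3 0.1 0.4 0.2]
print(np.allclose(p, p_inflated))  # True -> no trivial improvement
```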
Easy to obtain non-negativity
- Consider: $g_{\theta}(x)=f_{\theta}^{2}(x)$ where $f_{\theta}$ is a neural network
- Or $g_{\theta}(x)=\exp \left(f_{\theta}(x)\right)$
But they do not sum up to 1. What can we do?
Normalize by the total volume of the function
$$
p_{\theta}(x)=\frac{1}{\text{volume}\left(g_{\theta}\right)}\, g_{\theta}(x)=\frac{1}{\int_{x} g_{\theta}(x)\, dx}\, g_{\theta}(x)
$$
In simple words, this is equivalent to normalizing $(3,1,4)$ as $\frac{1}{3+1+4}(3,1,4)$
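A quick numerical check of the formula (a sketch; the 1D grid and its bounds are arbitrary choices), using the Gaussian-shaped $g_{\theta}$ from the examples below: estimate the volume on a grid, divide, and verify the result integrates to 1.
```python
import numpy as np

# Unnormalized, non-negative model: g_theta(x) = exp(-(x - mu)^2 / (2 sigma^2)).
mu, sigma = 0.0, 1.0
def g(x):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

# volume(g_theta) = integral of g over x, approximated on a fine 1D grid.
x = np.linspace(-10.0, 10.0, 100_001)
volume = np.trapz(g(x), x)

# Normalized density p_theta(x) = g_theta(x) / volume(g_theta).
p = g(x) / volume

print(volume, np.sqrt(2 * np.pi * sigma ** 2))  # both ~2.5066
print(np.trapz(p, x))                           # ~1.0 -> a valid density
```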
Examples:
$g_{\theta=(\mu, \sigma)}(x)=\exp\left(-(x-\mu)^{2} / 2 \sigma^{2}\right) \Rightarrow \text{Volume}(g_{\theta})=\sqrt{2 \pi \sigma^{2}} \Rightarrow$ Gaussian
$g_{\theta=\lambda}(x)=\exp(-\lambda x), \; x \geq 0 \Rightarrow \text{Volume}(g_{\theta})=\frac{1}{\lambda} \Rightarrow$ Exponential
We must find a convenient $g_{\theta}$ for which the integral can be computed analytically. Otherwise we cannot guarantee valid probabilities.
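A sketch of the "inconvenient $g_{\theta}$" case, assuming a tiny randomly initialized MLP as $f_{\theta}$ (names and weights are illustrative, not from the lecture): $g_{\theta}(x)=\exp(f_{\theta}(x))$ has no analytic volume, so we fall back on numerical integration, which is only feasible here because $x$ is 1-dimensional.
```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny random MLP f_theta: R -> R (weights are illustrative, not trained).
W1, b1 = rng.normal(size=(16, 1)), rng.normal(size=16)
W2, b2 = rng.normal(size=(1, 16)), rng.normal(size=1)

def f(x):  # x: array of shape (N,)
    h = np.tanh(W1 @ x[None, :] + b1[:, None])
    return (W2 @ h + b2[:, None])[0]

# Non-negative by construction, but its volume has no closed form.
def g(x):
    return np.exp(f(x))

# Numerical normalization, treating [-10, 10] as the support of x for this sketch.
x = np.linspace(-10.0, 10.0, 100_001)
volume = np.trapz(g(x), x)
p = g(x) / volume

print(np.trapz(p, x))  # ~1.0, but this brute-force integration does not scale
```
In 1D this grid-based normalization is cheap; for image-sized $x$ the same integral is hopeless, which is why the choice of $g_{\theta}$ matters.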
## Why is learning a distribution hard?
The integrals mean that learning distributions becomes harder with scale.
Think of $300 \times 400$ color images with 256 values per color channel
- The number of possible images $x$ is $256^{3 \cdot 300 \cdot 400}$ (see the quick check below)
- In principle we must assign a probability to all of them
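A quick back-of-the-envelope check of that count (a sketch in Python):
```python
import math

# Number of distinct 300 x 400 RGB images with 256 values per channel:
# 256 ** (3 * 300 * 400) -- far too large to print, so count its digits instead.
num_values = 3 * 300 * 400                     # 360,000 channel values per image
exponent = num_values * math.log10(256)
print(f"~10^{exponent:,.0f} possible images")  # ~10^866,966
```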
While it is easy to define a family of models, we are stuck with the integral $\int_{x} g_{\theta}(x)\, dx$
- Not always easy to sample (needed for evaluation)
- Not always easy to optimize (needed for training)
- Not always data efficient (long training times)
- Not always sample efficient (many samples needed for accuracy)
## Why/when not to learn a distribution?
> "One should solve the [classification] problem directly and never solve a more general [and harder] problem as an intermediate step." ~ V. Vapnik, father of SVMs.
Generative models are to be preferred
- when probabilities are important
- when you have no human annotations and want to learn features
- when you want to generalize to (many) downstream tasks
- when the answer to your question is not "more data"
If you have a very specific classification task and lots of data
- no need to make things complicated
## Map of generative models
![[map-of-generative models.jpg]]
---
## References
1. Lecture 8, UvA DL course 2020