# Why Generative Models

[[Discriminant Functions|Discriminative models]] frame the problem as: given an individual input $x$, predict
- the correct label (classification)
- the correct score (regression)

They are optimized by maximizing the probability of _individual_ targets.

In [[Probabilistic Generative Models]], we model the data _jointly_, i.e. we want to know the distribution of the data. For instance, we want to know how likely $x_a$ is, or whether it is more likely than $x_b$.

![[data-distribution.jpg]]

## Why/when to learn a distribution?

- Density estimation: estimate the probability of $x$
- Sampling: generate new plausible $x$, e.g. [[Model Based Reinforcement Learning]]
- Structure/representation learning: learn good features of $x$ unsupervised
- Generative models are widely used to pretrain for downstream tasks
- Generative models help generalization, e.g. [[Model Based Reinforcement Learning]], [[Semi Supervised Learning]], simulations

## The world as a distribution

![[world-distribution.jpg]]

## Challenges

1. We are interested in parametric models from a family of models $\mathcal{M}$.
2. How to pick the right family of models $\mathcal{M}$?
3. How to know which $\theta$ from $\mathcal{M}$ is a good one?
4. How to learn/optimize our models from family $\mathcal{M}$?

## Properties for modelling distributions

We want to learn distributions $p_{\theta}(x)$. Our model must therefore have the following properties:
- Non-negativity: $p_{\theta}(x) \geq 0 \ \forall x$
- Probabilities of all events must sum up to 1: $\int_{x} p_{\theta}(x)\, dx = 1$

Summing up to 1 (normalization) makes sure predictions improve _relatively_:
- The model cannot trivially get better scores by predicting higher numbers
- The pie remains the same -> the model is forced to make non-trivial improvements

Non-negativity is easy to obtain:
- Consider $g_{\theta}(x) = f_{\theta}^{2}(x)$, where $f_{\theta}$ is a neural network
- Or $g_{\theta}(x) = \exp\left(f_{\theta}(x)\right)$

But these do not sum up to 1. What can we do? Normalize by the total volume of the function:
$$p_{\theta}(x) = \frac{1}{\text{volume}(g_{\theta})}\, g_{\theta}(x) = \frac{1}{\int_{x} g_{\theta}(x)\, dx}\, g_{\theta}(x)$$

In simple words, this is equivalent to normalizing $(3,1,4)$ as $\frac{1}{3+1+4}(3,1,4)$.

Examples:
- $g_{\theta=(\mu, \sigma)}(x) = \exp\left(-(x-\mu)^{2} / 2\sigma^{2}\right) \Rightarrow \text{Volume}(g_{\theta}) = \sqrt{2\pi\sigma^{2}} \Rightarrow$ Gaussian
- $g_{\theta=\lambda}(x) = \exp(-\lambda x)$ for $x \geq 0 \Rightarrow \text{Volume}(g_{\theta}) = \frac{1}{\lambda} \Rightarrow$ Exponential

We must find a convenient $g_{\theta}$ for which the integral can be computed analytically; otherwise we cannot guarantee valid probabilities. (A numerical sketch of this normalization recipe is given below.)

## Why is learning a distribution hard?

The integrals mean that learning distributions becomes harder with scale. Think of $300 \times 400$ colour images with 256 possible values per channel:
- The number of possible images $x$ is $256^{3 \cdot 300 \cdot 400}$ (see the back-of-the-envelope calculation below)
- In principle we must assign a probability to every one of them

While it is easy to define a family of models, we are stuck with the integral $\int_{x} g_{\theta}(x)\, dx$:
- Not always easy to sample from (needed for evaluation)
- Not always easy to optimize (needed for training)
- Not always data efficient (long training times)
- Not always sample efficient (many samples needed for accuracy)

## Why/when not to learn a distribution?

> "One should solve the [classification] problem directly and never solve a more general [and harder] problem as an intermediate step." ~ V. Vapnik, father of SVMs
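To make the normalization recipe above concrete, here is a minimal NumPy sketch (not from the lecture; `f_theta` and the integration grid are arbitrary stand-ins). It builds a non-negative $g_{\theta}$ with the exponential trick, approximates the volume integral numerically, and checks that the resulting $p_{\theta}$ satisfies both properties, plus the analytic Gaussian volume from the example above.

```python
import numpy as np

# Stand-in for a learned scalar function f_theta(x); in practice this would
# be a neural network. Any real-valued function works for the illustration.
def f_theta(x):
    return -0.5 * (x - 1.0) ** 2 + 0.3 * np.sin(3.0 * x)

# Non-negativity via the exponential trick: g_theta(x) = exp(f_theta(x)) >= 0.
def g_theta(x):
    return np.exp(f_theta(x))

# Approximate the volume integral with a Riemann sum on a dense grid
# (for a generic f_theta the analytic integral is not available).
xs = np.linspace(-10.0, 10.0, 200_001)
dx = xs[1] - xs[0]
volume = np.sum(g_theta(xs)) * dx

# Normalized density: p_theta(x) = g_theta(x) / volume(g_theta).
p_theta = g_theta(xs) / volume

print(bool(np.all(p_theta >= 0)))   # True  -> non-negativity
print(np.sum(p_theta) * dx)         # ~1.0  -> sums (integrates) to 1

# Sanity check against the Gaussian example, whose volume is known
# analytically: int exp(-(x - mu)^2 / (2 sigma^2)) dx = sqrt(2 pi sigma^2).
mu, sigma = 0.0, 1.0
g_gauss = np.exp(-((xs - mu) ** 2) / (2.0 * sigma**2))
print(np.sum(g_gauss) * dx, np.sqrt(2.0 * np.pi * sigma**2))  # both ~2.5066
```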
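And to get a feel for the scale argument above, a quick back-of-the-envelope calculation (same $300 \times 400$ RGB setup as in the text):

```python
import math

# 300 x 400 RGB images, 256 possible values per channel.
num_entries = 3 * 300 * 400                   # 360,000 channel values per image
log10_images = num_entries * math.log10(256)  # log10 of 256^(3 * 300 * 400)
print(f"roughly 10^{log10_images:,.0f} possible images")
# -> roughly 10^866,966 possible images; a valid p_theta must spread a total
#    probability mass of 1 over all of them.
```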
Still, generative models are to be preferred:
- when probabilities are important
- when you have no human annotations and want to learn features
- when you want to generalize to (many) downstream tasks
- when the answer to your question is not "more data"

If you have a very specific classification task and lots of data, there is no need to make things complicated.

## Map of generative models

![[map-of-generative models.jpg]]

---

## References

1. Lecture 8, UvA DL course 2020