# Variational Autoencoders

VAEs leverage the flexibility of neural networks to learn [[Latenent Variable Models|latent representations]]. A latent variable model defines a distribution $p(x)$ over $x$ as $p(x)=\int p(x \mid z)\, p(z)\, dz$, an integral that is usually expensive to compute. VAEs provide solutions to the two main difficulties of latent variable models:

1. How to define the latent variables $z$
2. How to deal with the integral over $z$

VAEs exploit two important properties of Gaussians: [[Gaussian Distribution#Approximating other distributions]] for modeling the latent variables and the [[Gaussian Distribution#Reparameterization Trick]] for applying backpropagation.

## Defining the latent model

VAEs specify the latent variable model as

$$
\begin{aligned}
z_{n} &\sim \mathcal{N}\left(0, I_{D}\right) \\
x_{n} &\sim p_{X}\left(f_{\theta}\left(z_{n}\right)\right)
\end{aligned}
$$

where $f_{\theta}$ is some function, parameterized by $\theta$, that maps $z_{n}$ to the parameters of a distribution over $x_{n}$. This function is specified using (deep) [[Neural Networks]].

How is it possible to use such a simple unit Gaussian prior? $p(Z)$ has no trainable parameters; isn't it too restrictive in practice? It is not, because any distribution in $d$ dimensions can be generated by taking a set of $d$ normally distributed variables and mapping them through a sufficiently complicated function (e.g. a neural network).

## The Decoder - Defining the generative model

Assume our dataset $\mathcal{D}$ is an image dataset, and that the pixels of each image $x_{n}$ are Bernoulli$(p)$ distributed:

$$
\begin{aligned}
p\left(z_{n}\right) &=\mathcal{N}\left(0, \boldsymbol{I}_{D}\right) \\
p\left(\boldsymbol{x}_{n} \mid \boldsymbol{z}_{n}\right) &=\prod_{m=1}^{M} \operatorname{Bern}\left(\boldsymbol{x}_{n}^{(m)} \mid f_{\theta}\left(\boldsymbol{z}_{n}\right)_{m}\right)
\end{aligned}
$$

where $x_{n}^{(m)}$ is the $m$-th pixel of the $n$-th image in $\mathcal{D}$, and $f_{\theta}: \mathbb{R}^{D} \rightarrow[0,1]^{M}$ is a neural network parameterized by $\theta$ that outputs the means of the Bernoulli distributions for each pixel of $x_{n}$.

Now that we have defined the model, we can write out an expression for the log probability of the data $\mathcal{D}$ under this model:

$$
\begin{aligned}
\log p(\mathcal{D}) &=\sum_{n=1}^{N} \log p\left(\boldsymbol{x}_{n}\right) \\
&=\sum_{n=1}^{N} \log \int p\left(\boldsymbol{x}_{n} \mid \boldsymbol{z}_{n}\right) p\left(\boldsymbol{z}_{n}\right) d \boldsymbol{z}_{n} \\
&=\sum_{n=1}^{N} \log \mathbb{E}_{p\left(z_{n}\right)}\left[p\left(\boldsymbol{x}_{n} \mid \boldsymbol{z}_{n}\right)\right] \qquad (1)
\end{aligned}
$$

Evaluating this is very expensive. We could approximate it with Monte Carlo integration by drawing samples (latent vectors) $\boldsymbol{z}_{n}^{(l)}$ from $p\left(\boldsymbol{z}_{n}\right)$:

$$
\begin{aligned}
\log p\left(x_{n}\right) &=\log \mathbb{E}_{p\left(z_{n}\right)}\left[p\left(x_{n} \mid z_{n}\right)\right] \\
& \approx \log \frac{1}{L} \sum_{l=1}^{L} p\left(x_{n} \mid z_{n}^{(l)}\right), \quad z_{n}^{(l)} \sim p\left(z_{n}\right)
\end{aligned}
$$

However, Monte Carlo integration is not used for training VAEs because its inefficiency scales with the dimensionality of $\mathbf{z}$: most samples drawn from the prior land where $p(x_{n} \mid z_{n})$ is negligible, so a reliable estimate needs an impractically large number of samples.
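To make the generative model and the naive Monte Carlo estimate in equation (1) concrete, here is a minimal PyTorch sketch. The decoder `f_theta`, the dimensions `D` and `M`, and the helper `naive_log_px` are hypothetical illustrations of my own (an untrained network, chosen only to show the shapes and the computation), not part of any particular VAE implementation.

```python
import torch
from torch import nn

# Hypothetical decoder f_theta: maps a latent z in R^D to M Bernoulli means in (0, 1).
D, M = 2, 784
f_theta = nn.Sequential(nn.Linear(D, 256), nn.ReLU(), nn.Linear(256, M), nn.Sigmoid())

def naive_log_px(x, L=1000):
    """Monte Carlo estimate of log p(x) = log E_{p(z)}[p(x|z)] from Eq. (1)."""
    z = torch.randn(L, D)                              # z^(l) ~ p(z) = N(0, I_D)
    probs = f_theta(z)                                 # Bernoulli means, shape (L, M)
    # log p(x | z^(l)) = sum_m [ x_m log p_m + (1 - x_m) log(1 - p_m) ]
    log_px_given_z = (x * probs.log() + (1 - x) * (1 - probs).log()).sum(dim=1)
    # log (1/L) sum_l p(x | z^(l)), computed stably in log space
    return torch.logsumexp(log_px_given_z, dim=0) - torch.log(torch.tensor(float(L)))

x = torch.bernoulli(torch.full((M,), 0.5))             # a dummy binarized "image"
print(naive_log_px(x))
```

Run as-is this only exercises an untrained decoder; the point is the shape of the computation, not the number it prints.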
## The Encoder - Approximating the integral with a variational distribution

The posterior $p\left(z_{n} \mid x_{n}\right)$ is as difficult to compute as $p(\mathbf{x}_n)$ itself. VAEs solve this problem by learning an approximate posterior distribution $q\left(z_{n} \mid x_{n}\right)$, referred to as the variational distribution. This gives us an efficient handle on the integral $p(x_n) = \int p\left(x_{n} \mid z_{n}\right) p\left(z_{n}\right) d z_{n}$.

Now we can derive an efficient bound on the log likelihood $\log p(\mathcal{D})$. We continue from equation (1) above and use [[Jensen's Inequality]] and the [[KL Divergence]] to derive the bound:

$$
\begin{aligned}
\log p\left(x_{n}\right) &=\log \mathbb{E}_{p\left(z_{n}\right)}\left[p\left(x_{n} \mid z_{n}\right)\right] \\
&=\log \mathbb{E}_{p\left(z_{n}\right)}\left[\frac{q\left(z_{n} \mid x_{n}\right)}{q\left(z_{n} \mid x_{n}\right)} p\left(x_{n} \mid z_{n}\right)\right] \quad \left(\text{multiply by } q\left(z_{n} \mid x_{n}\right) / q\left(z_{n} \mid x_{n}\right)\right) \\
&=\log \mathbb{E}_{q\left(z_{n} \mid x_{n}\right)}\left[\frac{p\left(z_{n}\right)}{q\left(z_{n} \mid x_{n}\right)} p\left(x_{n} \mid z_{n}\right)\right] \quad (\text{switch expectation distribution}) \\
&\geq \mathbb{E}_{q\left(z_{n} \mid x_{n}\right)} \log \left[\frac{p\left(z_{n}\right)}{q\left(z_{n} \mid x_{n}\right)} p\left(x_{n} \mid z_{n}\right)\right] \quad (\text{Jensen's inequality}) \\
&=\mathbb{E}_{q\left(z_{n} \mid x_{n}\right)}\left[\log p\left(x_{n} \mid z_{n}\right)\right]+\mathbb{E}_{q\left(z_{n} \mid x_{n}\right)} \log \left[\frac{p\left(z_{n}\right)}{q\left(z_{n} \mid x_{n}\right)}\right] \quad (\text{rearranging}) \\
&=\underbrace{\mathbb{E}_{q\left(z_{n} \mid x_{n}\right)}\left[\log p\left(x_{n} \mid z_{n}\right)\right]-\mathrm{KL}\left(q\left(Z \mid x_{n}\right) \| p(Z)\right)}_{\text{Evidence Lower Bound (ELBO)}} \quad (\text{writing the 2nd term as a KL})
\end{aligned}
$$

Therefore, we have

$$
\log p\left(x_{n}\right) \geq \mathbb{E}_{q\left(z_{n} \mid x_{n}\right)}\left[\log p\left(x_{n} \mid z_{n}\right)\right]-\mathrm{KL}\left(q\left(Z \mid x_{n}\right) \| p(Z)\right) \qquad (2)
$$

The left-hand side, $\log p(x_n)$, is the quantity we want to maximize, and the right-hand side is something we can optimize via stochastic gradient descent, which is awesome! This right-hand side quantity is called the _evidence lower bound_ (ELBO) on the log-probability of the data.

From an alternate derivation (expand $\mathrm{KL}\left(q\left(Z \mid x_{n}\right) \| p\left(Z \mid x_{n}\right)\right)$ using Bayes' rule, $p(z \mid x)=p(x \mid z)\,p(z)/p(x)$, and rearrange), we have

$$
\log p\left(x_{n}\right)-\mathrm{KL}\left(q\left(Z \mid x_{n}\right) \| p\left(Z \mid x_{n}\right)\right)=\mathbb{E}_{q\left(z_{n} \mid x_{n}\right)}\left[\log p\left(x_{n} \mid z_{n}\right)\right]-\mathrm{KL}\left(q\left(Z \mid x_{n}\right) \| p(Z)\right)
$$

We can see that the gap between the ELBO and the log-probability of the data is exactly the KL term $\mathrm{KL}\left(q\left(Z \mid x_{n}\right) \| p\left(Z \mid x_{n}\right)\right)$. Since $\log p(x_n)$ does not depend on $q$, maximizing the ELBO both pushes $\log p(x_n)$ up and drives this KL term down. $p(z \mid x)$ is not something we can compute analytically, but by minimizing this KL term we bring $q(z \mid x)$ closer to it. This also confirms that the ELBO is a lower bound: $\log p(x)$ equals the ELBO plus $\mathrm{KL}(q(z \mid x) \| p(z \mid x))$, which is non-negative. If we use a high-capacity model for $q(z \mid x)$, hopefully it will actually match $p(z \mid x)$, which makes the KL term zero; in that case we are directly optimizing $\log p(x)$. Incidentally, we have then also obtained a tractable approximation, $q(z \mid x)$, to the otherwise intractable posterior $p(z \mid x)$.
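The bound (2) can already be evaluated numerically. Below is a rough sketch, continuing with the hypothetical `f_theta` decoder and dimensions from the previous snippet, and assuming a diagonal-Gaussian $q$ as chosen in the "Choosing q and p" section below: the reconstruction term is estimated with samples from $q$, and the KL term is computed in closed form.

```python
import torch
from torch.distributions import Normal, Bernoulli, kl_divergence

def elbo_estimate(x, mu, sigma, decoder, L=10):
    """Estimate the ELBO of Eq. (2) for one example x.

    mu, sigma parameterize q(z|x) = N(mu, diag(sigma^2)); decoder maps z to
    Bernoulli means (e.g. the hypothetical f_theta above).
    """
    q = Normal(mu, sigma)                                         # q(z | x)
    prior = Normal(torch.zeros_like(mu), torch.ones_like(sigma))  # p(z) = N(0, I)
    z = q.sample((L,))                                            # z^(l) ~ q(z | x), shape (L, D)
    recon = Bernoulli(probs=decoder(z)).log_prob(x).sum(-1).mean()  # E_q[log p(x | z)]
    kl = kl_divergence(q, prior).sum()                            # KL(q(Z | x) || p(Z)), analytic
    return recon - kl                                             # lower bound on log p(x)

# Example usage with the toy decoder from before (arbitrary, untrained q parameters):
# mu, sigma = torch.zeros(D), torch.ones(D)
# print(elbo_estimate(x, mu, sigma, f_theta))
```

Because the sampling here uses `q.sample()`, this snippet is only good for evaluating the bound; training also needs gradients through the sampling step, which is the subject of the reparameterization trick below.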
Finally, we can define our loss as the mean negative lower bound over samples:

$$
\mathcal{L}(\theta, \phi)=-\frac{1}{N} \sum_{n=1}^{N}\left(\mathbb{E}_{q_{\phi}\left(z \mid x_{n}\right)}\left[\log p_{\theta}\left(x_{n} \mid Z\right)\right]-D_{\mathrm{KL}}\left(q_{\phi}\left(Z \mid x_{n}\right) \| p_{\theta}(Z)\right)\right)
$$

where $\theta$ are the generative (decoder) parameters and $\phi$ the variational (encoder) parameters. This is usually rewritten in terms of per-sample losses:

$$
\mathcal{L}=\frac{1}{N} \sum_{n=1}^{N}\left(\mathcal{L}_{n}^{\mathrm{recon}}+\mathcal{L}_{n}^{\mathrm{reg}}\right)
$$

where

$$
\begin{aligned}
\mathcal{L}_{n}^{\text{recon}} &=-\mathbb{E}_{q_{\phi}\left(z \mid x_{n}\right)}\left[\log p_{\theta}\left(x_{n} \mid Z\right)\right] \\
\mathcal{L}_{n}^{\text{reg}} &=D_{\mathrm{KL}}\left(q_{\phi}\left(Z \mid x_{n}\right) \| p_{\theta}(Z)\right)
\end{aligned}
$$

## Choosing q and p

In a VAE, we parameterize $q$ and $p$ with neural networks, mainly because they are extremely expressive function approximators that can be optimized over large datasets. But what does it mean to parameterize a distribution with a neural network?

Typically, we assume $q$ is a normal distribution (we could choose any family flexible enough to represent the richness of the data), i.e.

$$
q_{\phi}\left(z_{n} \mid x_{n}\right)=\mathcal{N}\left(z_{n} \mid \mu_{\phi}\left(x_{n}\right), \operatorname{diag}\left(\Sigma_{\phi}\left(x_{n}\right)\right)\right)
$$

This means our neural network outputs two vectors, $\mu_{\phi}\left(x_{n}\right)$ and $\Sigma_{\phi}\left(x_{n}\right)$, which we plug into the normal distribution to obtain $q$, as sketched below.
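As an illustration, a minimal encoder sketch for a diagonal-Gaussian $q_{\phi}(z \mid x)$ might look like the following. The class name, layer sizes, and the choice to output a log-variance are my own assumptions, not something these notes prescribe.

```python
import torch
from torch import nn
from torch.distributions import Normal

class Encoder(nn.Module):
    """Hypothetical encoder: maps x to the parameters of q_phi(z|x) = N(mu, diag(sigma^2))."""

    def __init__(self, x_dim=784, h_dim=256, z_dim=2):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu_head = nn.Linear(h_dim, z_dim)        # outputs mu_phi(x)
        self.logvar_head = nn.Linear(h_dim, z_dim)    # outputs log of the diagonal of Sigma_phi(x)

    def forward(self, x):
        h = self.body(x)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        sigma = torch.exp(0.5 * logvar)               # std > 0 by construction
        return Normal(mu, sigma)                      # the distribution q_phi(z | x)
```

Predicting a log-variance rather than $\Sigma_{\phi}$ directly is a common convention (again an assumption here) because exponentiation keeps the standard deviation positive. At training time, samples from this distribution must be drawn with the reparameterization trick described next (in PyTorch terms, `q.rsample()` rather than `q.sample()`), so that gradients can flow back into $\mu_{\phi}$ and $\Sigma_{\phi}$.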
## Reparameterization trick

Our optimization objective is

$$
\mathbb{E}_{x \sim \mathcal{D}}\left[\mathbb{E}_{q_\phi(z \mid x)}[\log p(x \mid z)]-\mathrm{KL}\left(q(z \mid x) \| p(z)\right)\right]
$$

To train a VAE reliably, we need to update the parameters of $q$ so that it produces an appropriate $z$ for $p$ to reconstruct, which means we also need the gradient of $\mathbb{E}_{q_{\phi}(z \mid x)}[\log p(x \mid z)]$ with respect to $\phi$. The difficulty arises from the fact that we are sampling $z$, which is a non-differentiable operation and has no gradient. SGD can handle stochastic inputs, but not stochastic layers!

Plain Monte Carlo is not possible here:

$$
\nabla_{\varphi} \mathbb{E}_{z \sim q_{\varphi}(z \mid x)}\left[\log p_{\theta}(x \mid z)\right]=\int_{z} \nabla_{\varphi}\left[q_{\varphi}(z \mid x)\right] \log p_{\theta}(x \mid z)\, d z
$$

We see there is no density to sample from: $\nabla_{\varphi}\left[q_{\varphi}(z \mid x)\right]$ is the gradient of a density function, and $\log p_{\theta}(x \mid z)$ is the logarithm of a density function. How can we turn this expression into something Monte Carlo friendly? The solution is to use the [[Gaussian Distribution#Reparameterization Trick]] of Gaussians to make sampling a differentiable operation.

We can then rewrite the gradient as

$$
\begin{aligned}
\nabla_{\varphi} \mathbb{E}_{z \sim q_{\varphi}(z \mid x)}\left[\log p_{\theta}(x \mid z)\right] &=\nabla_{\varphi} \int_{z} \log p_{\theta}(x \mid z)\, q_{\varphi}(z \mid x)\, d z \\
&=\nabla_{\varphi} \int_{\varepsilon} \log p_{\theta}\left(x \mid \mu_{z, \varphi}, \sigma_{z, \varphi}, \varepsilon\right) q(\varepsilon)\, d \varepsilon \\
&=\int_{\varepsilon} \nabla_{\varphi} \log p_{\theta}\left(x \mid \mu_{z, \varphi}, \sigma_{z, \varphi}, \varepsilon\right) q(\varepsilon)\, d \varepsilon \\
&\approx \frac{1}{K}\sum_{k} \nabla_{\varphi} \log p_{\theta}\left(x \mid \mu_{z, \varphi}, \sigma_{z, \varphi}, \varepsilon_{k}\right), \quad \varepsilon_{k} \sim \mathcal{N}(0, I)
\end{aligned}
$$

This means the sampling in the Monte Carlo estimate no longer depends on the encoder distribution. Sampling $\varepsilon \sim \mathcal{N}(0, I)$ also leads to lower-variance gradient estimates than sampling $z \sim \mathcal{N}\left(\mu_{z}, \sigma_{z}\right)$ directly, because the network now knows that the stochasticity comes from one specific external noise source, $\varepsilon$.

Remember: since we are sampling $z$, we are also sampling gradients; this is a stochastic gradient estimator.

Our mean and standard deviation functions have deterministic outputs, so we can formulate the latent variable as

$$
z=\mu(X)+\Sigma^{1 / 2}(X) \odot \epsilon
$$

where $\epsilon \sim \mathcal{N}(0, I)$ and $\odot$ is the elementwise product. Now the randomness is no longer associated with the neural network and the parameters we have to learn; it comes from the external $\varepsilon$, and gradients can flow through $\mu_{z}$ and $\sigma_{z}$. This makes the computational graph deterministic and allows backpropagation to work without any problems.

![[vae-reparameterization-trick.jpg]]

---

## References

1. Auto-Encoding Variational Bayes (original paper): https://arxiv.org/abs/1312.6114
2. An Introduction to Variational Autoencoders (detailed paper by the original authors): https://arxiv.org/pdf/1906.02691.pdf
3. Tutorial on Variational Autoencoders by Carl Doersch: https://arxiv.org/pdf/1606.05908.pdf
4. Notes on PGMs from Stanford CS228: https://ermongroup.github.io/cs228-notes/