# KL Divergence

KL divergence, or relative entropy, is a measure of how one probability distribution differs from a reference probability distribution.

$$
D_{\mathrm{KL}}(q \| p)=-\mathbb{E}_{q(x)}\left[\log \frac{p(x)}{q(x)}\right]=-\int q(x) \log \frac{p(x)}{q(x)} \, dx
$$

## Properties

1. It is asymmetric, $D_{\mathrm{KL}}(q \| p) \neq D_{\mathrm{KL}}(p \| q)$ in general, and thus cannot be used as a distance metric.
2. It is non-negative. A KL divergence of 0 indicates that the two distributions are identical (almost everywhere), while larger values indicate increasingly different behavior; there is no upper bound.

## Forward and backward KL

Assume $p$ is the true distribution and we want to approximate it with $q$.

![[kl-diffs.jpg]]

[Image Credit](https://arxiv.org/pdf/1804.00140.pdf)

Forward KL: $\mathrm{KL}(p \| q)=\int p \log \frac{p}{q} \, dz$

In this case, the model will try to avoid $q \approx 0$ anywhere $p>0$, since this would lead to an exploding KL. This means it is safer to place non-zero $q$ mass anywhere $p$ is plausibly $>0$, which leads to overestimating the variance (mass-covering behavior). This case is called zero-avoiding.

Reverse KL: $\mathrm{KL}(q \| p)=\int q \log \frac{q}{p} \, dz$

In this case, the model will try to avoid situations where $p \approx 0$ and $q>0$, thus it is safer to choose a single mode and underestimate the variance, rather than overestimate it and risk placing $q$ mass where $p$ is negligible. This is called zero-forcing (mode-seeking).

![[kl-1.jpg]]

![[kl-2.jpg]]

Above visualizations are from a nice interactive demo: https://observablehq.com/@stwind/forward-and-reverse-kl-divergences

## KL divergence with unit gaussian

With a Gaussian approximation $q=\mathcal{N}(\mu_q, \sigma_q^2)$ and a zero-mean, unit-variance [[Gaussian Distribution]] $p=\mathcal{N}(0,1)$, such as the [[Variational Autoencoders]] prior, we can actually find a closed-form solution of the KL divergence:

$$
\begin{aligned}
\mathrm{KL}(q \| p) &=-\int q(x) \log p(x) \, dx+\int q(x) \log q(x) \, dx \\
&=\frac{1}{2} \log \left(2 \pi \sigma_{p}^{2}\right)+\frac{\sigma_{q}^{2}+\left(\mu_{q}-\mu_{p}\right)^{2}}{2 \sigma_{p}^{2}}-\frac{1}{2}\left(1+\log 2 \pi \sigma_{q}^{2}\right) \\
&=\log \frac{\sigma_{p}}{\sigma_{q}}+\frac{\sigma_{q}^{2}+\left(\mu_{q}-\mu_{p}\right)^{2}}{2 \sigma_{p}^{2}}-\frac{1}{2} \\
&=\frac{\sigma_{q}^{2}+\mu_{q}^{2}-1-\log \sigma_{q}^{2}}{2}
\end{aligned}
$$

where the first term is the cross entropy $H(q, p)$, the second term is the negative entropy of $q$, and the last line substitutes $\mu_p=0$, $\sigma_p=1$.

## Relationship with MLE and Cross Entropy

Why is KL divergence referenced so much in machine learning? One reason is that it can be shown that [[Maximum Likelihood Estimation]] of data under a model is the same as minimizing the KL divergence between the data distribution and the model distribution, i.e. $D_{\mathrm{KL}}(p_{\text{data}} \| p_{\theta})$.

Minimizing the KL divergence between two distributions also corresponds to minimizing the [[Cross entropy]] between them, since $D_{\mathrm{KL}}(p \| q)=H(p, q)-H(p)$; the two objectives can be used interchangeably when the entropy of the first distribution is fixed, e.g. for a fixed training data distribution.
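To make this last identity concrete, here is a minimal numerical sketch, assuming NumPy is available; the two discrete distributions are arbitrary example values. It checks that $D_{\mathrm{KL}}(p_{\text{data}} \| p_{\theta}) = H(p_{\text{data}}, p_{\theta}) - H(p_{\text{data}})$:

```python
import numpy as np

# Two discrete distributions over the same support (hypothetical example values).
p_data = np.array([0.1, 0.4, 0.5])   # "data" distribution (fixed)
p_theta = np.array([0.2, 0.3, 0.5])  # model distribution

# Entropy of the data distribution: H(p) = -sum p log p
entropy_p = -np.sum(p_data * np.log(p_data))

# Cross entropy: H(p, q) = -sum p log q
cross_entropy = -np.sum(p_data * np.log(p_theta))

# KL divergence: D_KL(p || q) = sum p log(p / q)
kl = np.sum(p_data * np.log(p_data / p_theta))

# D_KL(p || q) = H(p, q) - H(p); since H(p) is fixed, minimizing cross entropy
# over the model parameters is the same as minimizing the KL divergence.
print(kl, cross_entropy - entropy_p)  # the two numbers agree
```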
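The closed-form unit-Gaussian expression derived above can be sanity-checked the same way, against a Monte Carlo estimate of $\mathbb{E}_{q}[\log q(x) - \log p(x)]$. A small sketch, with $\mu_q$ and $\sigma_q$ chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# q = N(mu_q, sigma_q^2), p = N(0, 1); values are arbitrary for illustration.
mu_q, sigma_q = 1.5, 0.7

# Closed form from the derivation above.
kl_closed = (sigma_q**2 + mu_q**2 - 1 - np.log(sigma_q**2)) / 2

# Monte Carlo estimate: KL(q || p) = E_{x ~ q}[log q(x) - log p(x)]
x = rng.normal(mu_q, sigma_q, size=1_000_000)
log_q = -0.5 * np.log(2 * np.pi * sigma_q**2) - (x - mu_q) ** 2 / (2 * sigma_q**2)
log_p = -0.5 * np.log(2 * np.pi) - x**2 / 2
kl_mc = np.mean(log_q - log_p)

print(kl_closed, kl_mc)  # the two estimates should agree closely
```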
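Finally, the zero-avoiding vs. zero-forcing behavior from the forward/backward KL section can be reproduced by fitting a single Gaussian $q$ to a bimodal target $p$ under each objective. The sketch below is a rough grid search on a discretized axis; the mixture target, grid ranges, and step sizes are arbitrary choices for illustration:

```python
import numpy as np

# Bimodal target p: equal mixture of N(-2, 0.5^2) and N(+2, 0.5^2) on a grid.
z = np.linspace(-6, 6, 2001)
dz = z[1] - z[0]

def normal_pdf(z, mu, sigma):
    return np.exp(-(z - mu) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

p = 0.5 * normal_pdf(z, -2, 0.5) + 0.5 * normal_pdf(z, 2, 0.5)

def kl(a, b):
    # Numerical KL(a || b) on the grid; small epsilon avoids log(0).
    eps = 1e-12
    return np.sum(a * np.log((a + eps) / (b + eps))) * dz

# Grid search over single-Gaussian approximations q = N(mu, sigma^2).
best_fwd, best_rev = None, None
for mu in np.linspace(-3, 3, 61):
    for sigma in np.linspace(0.2, 3, 57):
        q = normal_pdf(z, mu, sigma)
        fwd, rev = kl(p, q), kl(q, p)   # forward KL(p||q), reverse KL(q||p)
        if best_fwd is None or fwd < best_fwd[0]:
            best_fwd = (fwd, mu, sigma)
        if best_rev is None or rev < best_rev[0]:
            best_rev = (rev, mu, sigma)

print("forward KL fit: mu=%.2f sigma=%.2f" % best_fwd[1:])
print("reverse KL fit: mu=%.2f sigma=%.2f" % best_rev[1:])
```

Minimizing the forward KL spreads $q$ across both modes ($\mu$ near 0, large $\sigma$, mass-covering), while minimizing the reverse KL locks onto a single mode with a small variance ($\mu$ near $\pm 2$, mode-seeking), matching the intuition above.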