Tags: #notesonai #mnemonic
Topics: [[Information Theory]], [[Probability Theory]]
ID: 20201223212615

---

# Jensen–Shannon Divergence

JS divergence is a symmetrized, smoothed version of [[KL Divergence#Forward and backward KL]]: the equally weighted sum of the KL divergences from $p$ and from $q$ to their mixture $\frac{p+q}{2}$. Unlike KL divergence, it is symmetric in $p$ and $q$. It is defined as

$
D_{JS}(p \| q)=\frac{1}{2} D_{KL}\left(p \,\Big\|\, \frac{p+q}{2}\right)+\frac{1}{2} D_{KL}\left(q \,\Big\|\, \frac{p+q}{2}\right)
$

Why is [[KL Divergence]] used more than JS divergence? [Answers](https://www.quora.com/Why-isnt-the-Jensen-Shannon-divergence-used-more-often-than-the-Kullback-Leibler-since-JS-is-symmetric-thus-possibly-a-better-indicator-of-distance)

While KL divergence measures how much one distribution differs from another, it violates both symmetry, $\delta(p, q)=\delta(q, p)$, and the triangle inequality, $\delta(a, b)+\delta(b, c) \geq \delta(a, c)$, so it is not a metric. The square root of the Jensen–Shannon divergence satisfies both properties and is a metric, called the Jensen–Shannon distance.

JS divergence is used to address the problems KL divergence runs into in high-dimensional spaces, for example in [[Generative Adversarial Networks]]. Empirically, most high-dimensional real-world data lies close to a low-dimensional manifold. Therefore, when optimizing the KL divergence between the model distribution and the data distribution, the two rarely overlap: it is like finding a needle in a haystack. If there is little or no overlap between the distributions, the KL divergence becomes infinite and the gradient signal vanishes. JS divergence is better behaved in the sense that it stays finite (bounded by $\log 2$) even when $p_{\theta}(\boldsymbol{x})=0$, but it suffers from the same problem of having zero gradient when there is little or no overlap.

---

## References

1. http://www.moreisdifferent.com/assets/science_notes/notes_on_GAN_objective_functions.pdf
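
---

## Numerical Sketch

A minimal sketch, assuming NumPy, that checks the definition above and the finite-versus-infinite behavior from the GAN paragraph: KL blows up on disjoint supports while JS saturates at $\log 2$. The example histograms are made up for illustration, and everything is computed in nats.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats for discrete distributions given as arrays over the same bins."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                            # terms with p = 0 contribute nothing
    with np.errstate(divide="ignore"):      # p / 0 -> inf is exactly the blow-up we want to see
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    """Jensen-Shannon divergence: average KL from p and from q to the mixture m = (p + q) / 2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Overlapping supports: both divergences are finite.
p = np.array([0.1, 0.4, 0.5, 0.0])
q = np.array([0.3, 0.3, 0.2, 0.2])
print(kl(p, q), js(p, q))        # ~0.46 and ~0.13 nats

# Disjoint supports: KL is infinite, JS saturates at log 2 (finite, but flat -> zero gradient).
p = np.array([0.5, 0.5, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.5, 0.5])
print(kl(p, q))                  # inf
print(js(p, q), np.log(2))       # 0.693... = log 2
```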