# Contrastive Divergence

To motivate contrastive divergence, we revisit [[Maximum Likelihood Estimation]] (Note: [[KL Divergence#Relationship with MLE and Cross Entropy]]):

$$
\mathrm{KL}\left(p_{0} \| p_{\infty}\right)=\int p_{0} \log p_{0}-\int p_{0} \log p_{\infty} \propto-\int p_{0} \log p_{\infty}
$$

where $p_{0}$ is the data distribution and $p_{\infty}$ is the model's equilibrium distribution (the distribution the Markov chain converges to). Contrastive divergence instead minimizes

$$
\mathrm{CD}_{n}=\mathrm{KL}\left(p_{0} \| p_{\infty}\right)-\mathrm{KL}\left(p_{n} \| p_{\infty}\right)
$$

and updates the weights using the $\mathrm{CD}_{n}$ gradient instead of the ML gradient:

$$
\frac{\partial}{\partial \boldsymbol{\theta}} \mathrm{CD}_{n}=-\mathbb{E}_{0}\left[\frac{\partial}{\partial \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})\right]+\mathbb{E}_{n}\left[\frac{\partial}{\partial \boldsymbol{\theta}} E_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right)\right]+\frac{\partial}{\partial \boldsymbol{\theta}}[\ldots]
$$

where $\mathbb{E}_{n}$ is the expectation under $p_{n}$, computed by sampling after $n$ steps of the Markov chain. The last term is small and can be ignored.

### Intuition

Make sure that after $n$ sampling steps the chain has not moved far from the data distribution:

- Usually a single step ($n=1$) is enough
- This is similar in spirit to minimizing a reconstruction error

Because the components of $v \mid x$ and of $x \mid v$ are conditionally independent, both conditionals factorize and each sampling step can be computed in parallel. One CD$_1$ step then proceeds as follows (see the sketch after this list):

- Sample a data point $x$
- Compute the posterior $p(v \mid x)$
- Sample the latents $v \sim p(v \mid x)$
- Compute the conditional $p(x \mid v)$
- Sample $x^{\prime} \sim p(x \mid v)$
- Minimize the difference using $x, x^{\prime}$
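The note does not fix a particular model, but the $x$/$v$ split above matches a restricted Boltzmann machine, where both conditionals factorize over their components. Below is a minimal sketch of one CD$_1$ training step under that assumption; the binary-unit parameterization, the parameter names `W`, `b`, `c`, and the learning rate are illustrative choices, not taken from the note.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class BinaryRBM:
    """Minimal binary RBM trained with CD-1 (illustrative sketch)."""

    def __init__(self, n_visible, n_hidden, lr=0.1):
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b = np.zeros(n_visible)  # visible biases
        self.c = np.zeros(n_hidden)   # latent (hidden) biases
        self.lr = lr

    def p_v_given_x(self, x):
        # Posterior over latents: p(v_j = 1 | x), factorizes over j.
        return sigmoid(x @ self.W + self.c)

    def p_x_given_v(self, v):
        # Conditional over visibles: p(x_i = 1 | v), factorizes over i.
        return sigmoid(v @ self.W.T + self.b)

    def cd1_step(self, x):
        """One CD-1 update on a batch x of shape (batch, n_visible)."""
        # Positive phase: statistics under the data (the E_0 term).
        pv0 = self.p_v_given_x(x)
        v = (rng.random(pv0.shape) < pv0).astype(x.dtype)    # v ~ p(v | x)

        # Negative phase: one Gibbs step back to visible space (the E_n term, n=1).
        px1 = self.p_x_given_v(v)
        x1 = (rng.random(px1.shape) < px1).astype(x.dtype)   # x' ~ p(x | v)
        pv1 = self.p_v_given_x(x1)

        # CD-1 statistics: data term minus one-step-reconstruction term.
        batch = x.shape[0]
        dW = (x.T @ pv0 - x1.T @ pv1) / batch
        db = (x - x1).mean(axis=0)
        dc = (pv0 - pv1).mean(axis=0)

        # Standard CD-1 parameter update: the two expectation terms above,
        # with the small last term of the gradient dropped.
        self.W += self.lr * dW
        self.b += self.lr * db
        self.c += self.lr * dc

        # Reconstruction error is a rough but convenient progress monitor.
        return np.mean((x - px1) ** 2)


# Usage: fit a tiny RBM to random binary data.
data = (rng.random((100, 6)) < 0.5).astype(float)
rbm = BinaryRBM(n_visible=6, n_hidden=3)
for _ in range(10):
    err = rbm.cd1_step(data)
```

Using the conditional probabilities `pv0`/`pv1` in the statistics rather than their binary samples is a common variance-reduction choice; sampling them instead would follow the listed procedure more literally.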