# Contrastive Divergence
To motivate contrastive divergence, we revisit [[Maximum Likelihood Estimation]] (Note: [[KL Divergence#Relationship with MLE and Cross Entropy]]). With $p_{0}$ the data distribution and $p_{\infty}$ the model distribution, maximum likelihood amounts to minimizing
$
\mathrm{KL}\left(p_{0} \| p_{\infty}\right)=\int p_{0} \log p_{0}-\int p_{0} \log p_{\infty}=-\int p_{0} \log p_{\infty}+\text{const.}
$
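Since the entropy term $\int p_{0} \log p_{0}$ does not depend on $\boldsymbol{\theta}$, minimizing this KL is the same as maximizing the expected log-likelihood. Assuming an energy-based model $p_{\infty}(\boldsymbol{x})=e^{-E_{\boldsymbol{\theta}}(\boldsymbol{x})} / Z(\boldsymbol{\theta})$ (consistent with the energy $E_{\boldsymbol{\theta}}$ used below), the ML gradient works out to
$
-\frac{\partial}{\partial \boldsymbol{\theta}} \mathrm{KL}\left(p_{0} \| p_{\infty}\right)=\mathbb{E}_{0}\left[\frac{\partial}{\partial \boldsymbol{\theta}} \log p_{\infty}(\boldsymbol{x})\right]=-\mathbb{E}_{0}\left[\frac{\partial}{\partial \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})\right]+\mathbb{E}_{\infty}\left[\frac{\partial}{\partial \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})\right]
$
The expectation under $p_{\infty}$ requires samples from the model's equilibrium distribution, which is intractable in general; this is the term that contrastive divergence approximates with a short Markov chain.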
Contrastive divergence minimizes
$
\mathrm{CD}_{n}=\mathrm{KL}\left(p_{0} \| p_{\infty}\right)-\mathrm{KL}\left(p_{n} \| p_{\infty}\right)
$
The weights are then updated using the $\mathrm{CD}_{n}$ gradient instead of the ML gradient; written as the descent direction,
$
-\frac{\partial}{\partial \boldsymbol{\theta}} \mathrm{CD}_{n}=-\mathbb{E}_{0}\left[\frac{\partial}{\partial \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})\right]+\mathbb{E}_{n}\left[\frac{\partial}{\partial \boldsymbol{\theta}} E_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right)\right]+\frac{\partial}{\partial \boldsymbol{\theta}}[\ldots]
$
where $\mathbb{E}_{0}$ is an expectation under the data distribution and $\mathbb{E}_{n}$ is estimated from samples obtained after $n$ steps of the Markov chain started at the data. The last term is small in practice and can be ignored.
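As an illustration (not from the note itself), here is a minimal NumPy sketch of one CD-$n$ update, assuming a Bernoulli-Bernoulli RBM with energy $E_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{v})=-\boldsymbol{x}^{\top} W \boldsymbol{v}-\boldsymbol{b}^{\top} \boldsymbol{x}-\boldsymbol{c}^{\top} \boldsymbol{v}$; the function name `cd_n_update` and the parameter names are hypothetical.
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd_n_update(W, b, c, x0, n=1, lr=0.05, rng=None):
    """One CD-n update for a Bernoulli-Bernoulli RBM (illustrative sketch).

    Energy: E(x, v) = -x @ W @ v - b @ x - c @ v, with binary visibles x
    and binary latents v. x0 has shape (batch, n_vis).
    """
    rng = np.random.default_rng() if rng is None else rng

    # Positive phase: p(v | x) for the data -- factorizes over latent units.
    pv0 = sigmoid(x0 @ W + c)

    # Negative phase: n steps of block Gibbs sampling starting from the data.
    x, pv = x0, pv0
    for _ in range(n):
        v = (rng.random(pv.shape) < pv).astype(x0.dtype)   # v  ~ p(v | x)
        px = sigmoid(v @ W.T + b)                          # p(x | v), also factorized
        x = (rng.random(px.shape) < px).astype(x0.dtype)   # x' ~ p(x | v)
        pv = sigmoid(x @ W + c)                            # p(v | x')

    # CD-n gradient estimate: data statistics minus n-step statistics.
    batch = x0.shape[0]
    dW = (x0.T @ pv0 - x.T @ pv) / batch
    db = (x0 - x).mean(axis=0)
    dc = (pv0 - pv).mean(axis=0)

    # Move uphill on the approximate log-likelihood (downhill on CD_n).
    return W + lr * dW, b + lr * db, c + lr * dc
```
Using the probabilities $p(v \mid x)$ rather than binary samples for the statistics is a common variance-reduction choice.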
### Intuition
The idea is to make sure that after $n$ sampling steps the chain has not drifted far from the data distribution:
- Usually a single step ($n=1$) is enough
- The procedure is similar in spirit to minimizing a reconstruction error
Because the components of $x$ are conditionally independent given $v$, and vice versa, both conditionals factorize and the sampling steps below can be computed in parallel across units:
- Sample a data point $x$
- Compute the posterior $\boldsymbol{p}(\boldsymbol{v} \mid \boldsymbol{x})$
- Sample the latents $\boldsymbol{v} \sim \boldsymbol{p}(\boldsymbol{v} \mid \boldsymbol{x})$
- Compute the conditional $p(x \mid v)$
- Sample a reconstruction $x^{\prime} \sim p(x \mid v)$
- Minimize the difference between $x$ and $x^{\prime}$, i.e., use them to estimate the two expectations in the $\mathrm{CD}_{n}$ gradient (see the sketch after this list)
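Continuing the hypothetical `cd_n_update` sketch above, a toy CD-1 loop (illustrative shapes and hyperparameters) simply repeats these steps with $n=1$:
```python
import numpy as np  # assumes cd_n_update from the sketch above is in scope

rng = np.random.default_rng(1)
n_vis, n_hid = 6, 3
W = 0.01 * rng.standard_normal((n_vis, n_hid))  # small random init
b = np.zeros(n_vis)                              # visible biases
c = np.zeros(n_hid)                              # latent biases

x_batch = (rng.random((32, n_vis)) < 0.5).astype(float)  # toy binary "data"

for step in range(100):
    # Each call: compute p(v | x), sample v, compute p(x | v), sample x',
    # then update the parameters from the x vs. x' statistics.
    W, b, c = cd_n_update(W, b, c, x_batch, n=1, lr=0.05, rng=rng)
```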