# Boltzmann Machines
[[Hopfield Networks]] minimize the quadratic energy function
$
E=-f_{\theta}(x)=-\left(\sum_{i, j} w_{i j} x_{i} x_{j}+\sum_{i} b_{i} x_{i}\right)
$
Boltzmann machines are stochastic Hopfield networks: given the activation $a_{i}=\sum_{j} w_{i j} x_{j}+b_{i}$, the response of neuron $i$ is
$
x_{i}=\left\{\begin{array}{ll}+1 & \text{with probability } \frac{1}{1+\exp\left(-2 a_{i}\right)} \\ -1 & \text{otherwise}\end{array}\right.
$
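A minimal numpy sketch of this stochastic update rule (the function name, variable names, and toy sizes are my own):

```python
import numpy as np

def boltzmann_update(x, W, b, i, rng):
    """Stochastically update neuron i of a ±1 state vector x."""
    a = W[i] @ x + b[i]                       # activation a_i = sum_j w_ij x_j + b_i
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * a))   # P(x_i = +1)
    x[i] = 1.0 if rng.random() < p_plus else -1.0
    return x

# toy usage: 5 neurons, random symmetric weights with zero diagonal
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 5)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
b = np.zeros(5)
x = rng.choice([-1.0, 1.0], size=5)
for i in range(5):
    x = boltzmann_update(x, W, b, i, rng)
print(x)
```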
## Restricted Boltzmann Machines
We can interpret Boltzmann machines as implementing Gibbs sampling for the pdf $p(x)=\frac{1}{Z} \exp \left(\frac{1}{2} x^{T} W x\right)$.
Boltzmann machines are too parameter-heavy: for an $x$ with $256 \times 256=65536$ pixels, $W$ has roughly 4.3 billion parameters.
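Worked count, with a bipartite (RBM-style) weight matrix for contrast; the $m=1000$ hidden units are an illustrative choice of mine, not a value from the source:
$
\begin{aligned}
\text{fully connected } W &: d^{2}=65536^{2} \approx 4.3 \times 10^{9} \text{ weights} \\
\text{bipartite } W \text{ (RBM, below)} &: d \cdot m=65536 \cdot 1000 \approx 6.6 \times 10^{7} \text{ weights}
\end{aligned}
$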
Plain Boltzmann machines learn no features. RBMs instead add bottleneck latent variables $v$:
$
E=-f_{\theta}(x)=-\left(\sum_{i, j} w_{i j} x_{i} v_{j}+\sum_{i} b_{i} x_{i}+\sum_{j} c_{j} v_{j}\right)
$
$x_{i}$ and $v_{j}$ are still binary variables in the original model.
- The quadratic term captures correlations
- The unary terms capture priors: how likely is a (latent) pixel to be +1 or -1
Energy function: $E(x, v)=-x^{T} W v-b^{T} x-c^{T} v$
$
p(x)=\frac{1}{Z} \sum_{v} \exp (-E(x, v))
$
This is not of the form $\frac{1}{Z}\exp(-E(x))$ because of the sum over $v$.
Rewriting in terms of the free energy: $F(x)=-\boldsymbol{b}^{T} \boldsymbol{x}-\sum_{j} \log \sum_{v_{j}} \exp \left(v_{j}\left(c_{j}+\boldsymbol{W}_{\cdot j}^{T} \boldsymbol{x}\right)\right)$
$
\begin{array}{l}
p(x)=\frac{1}{Z} \exp (-F(x)) \\
Z=\sum_{x} \exp (-F(x))
\end{array}
$
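A minimal numpy sketch of this free energy for $\pm 1$ hidden units (all names and toy sizes are my own; it uses $\sum_{v_{j} \in\{-1,+1\}} \exp \left(v_{j} a_{j}\right)=\exp(a_j)+\exp(-a_j)$):

```python
import numpy as np

def free_energy(x, W, b, c):
    """F(x) = -b^T x - sum_j log sum_{v_j in {-1,+1}} exp(v_j (c_j + (W^T x)_j))."""
    a = c + W.T @ x                        # per-hidden-unit activation
    # sum over v_j in {-1, +1}: exp(a_j) + exp(-a_j), computed stably
    return -b @ x - np.sum(np.logaddexp(a, -a))

# the unnormalized log-probability of x is then -free_energy(x, W, b, c);
# the intractable part, Z = sum_x exp(-F(x)), is what the rest of the note is about
rng = np.random.default_rng(0)
d, m = 6, 3
W, b, c = rng.normal(size=(d, m)), rng.normal(size=d), rng.normal(size=m)
x = rng.choice([-1.0, 1.0], size=d)
print(free_energy(x, W, b, c))
```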
The RBM energy defines a bipartite graph between visible and hidden units, with undirected connections: information flows both forward and backward.
![[rbms.jpg]]
The hidden variables $v_{j}$ are independent conditioned on the visible variables
$
p(v \mid x)=\prod_{j} p\left(v_{j} \mid x, \boldsymbol{\theta}\right)
$
The visible variables $x_{i}$ are independent conditioned on the hidden variables
$
p(\boldsymbol{x} \mid \boldsymbol{v})=\prod_{i} p\left(x_{i} \mid \boldsymbol{v}, \boldsymbol{\theta}\right)
$
## Training RBMs: conditional probabilities
The conditional probabilities are defined as sigmoids
$
\begin{array}{l}
p\left(v_{j}=1 \mid \boldsymbol{x}, \boldsymbol{\theta}\right)=\sigma\left(\boldsymbol{W}_{\cdot j}^{T} \boldsymbol{x}+c_{j}\right) \\
p\left(x_{i}=1 \mid \boldsymbol{v}, \boldsymbol{\theta}\right)=\sigma\left(\boldsymbol{W}_{i \cdot} \boldsymbol{v}+b_{i}\right)
\end{array}
$
Since RBMs are bidirectional, we can "loop" between the visible and latent variables, sampling back and forth:
$
\begin{array}{l}
v^{(1)} \sim \sigma\left(\boldsymbol{W}^{T} \boldsymbol{x}^{(0)}+\boldsymbol{c}\right) \Rightarrow \\
x^{(1)} \sim \sigma\left(\boldsymbol{W} \boldsymbol{v}^{(1)}+\boldsymbol{b}\right) \Rightarrow \\
v^{(2)} \sim \sigma\left(\boldsymbol{W}^{T} \boldsymbol{x}^{(1)}+\boldsymbol{c}\right) \Rightarrow \ldots
\end{array}
$
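A sketch of this back-and-forth (block Gibbs) chain, assuming $\{0,1\}$ units so the sigmoids above apply directly (for $\pm 1$ units the activations pick up a factor of 2, as in the stochastic update at the top of the note); all names and toy sizes are my own:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_step(x, W, b, c, rng):
    """One back-and-forth pass: sample v given x, then a new x given v."""
    p_v = sigmoid(W.T @ x + c)                        # p(v_j = 1 | x)
    v = (rng.random(p_v.shape) < p_v).astype(float)   # sample the hidden units
    p_x = sigmoid(W @ v + b)                          # p(x_i = 1 | v)
    x = (rng.random(p_x.shape) < p_x).astype(float)   # sample the visible units
    return x, v

# run the chain for a few steps from a random visible state
rng = np.random.default_rng(0)
d, m = 8, 4
W, b, c = 0.1 * rng.normal(size=(d, m)), np.zeros(d), np.zeros(m)
x = (rng.random(d) < 0.5).astype(float)
for t in range(5):
    x, v = gibbs_step(x, W, b, c, rng)
print(x, v)
```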
## Training any energy model
Maximizing log-likelihood
$
\mathcal{L}(\boldsymbol{\theta})=\frac{1}{\mathrm{~N}} \sum_{n} \log p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{n}\right)=\mathbb{E}_{p_{0}}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x})\right]
$
The expectation w.r.t. a pdf is equivalent to
- sampling from the pdf and
- then taking the average
$
\mathbb{E}_{x \sim p_{0}}[\log p(x \mid \theta)]=\mathbb{E}_{x \sim p_{0}}\left[-E_{\theta}(x)\right]-\log Z(\theta)
$
- where $\log Z(\boldsymbol{\theta})=\log \sum_{x^{\prime}} \exp \left(-E_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right)\right)$
- and $p_{0}(x)$ is the data distribution
## Taking gradients of any energy model
$
\begin{aligned}
\frac{d}{d \boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\boldsymbol{x}) &=-\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})-\frac{d}{d \boldsymbol{\theta}} \log Z(\boldsymbol{\theta}) \\
&=-\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})-\frac{1}{Z(\boldsymbol{\theta})} \frac{d}{d \boldsymbol{\theta}} Z(\boldsymbol{\theta}) \\
&=-\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})-\sum_{\boldsymbol{x}^{\prime}} \frac{1}{Z(\boldsymbol{\theta})} \exp \left(-E_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right)\right)\left(-\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right)\right) \\
&=-\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})+\sum_{\boldsymbol{x}^{\prime}} p_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right) \frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right) \\
&=-\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})+\mathbb{E}_{\boldsymbol{x}^{\prime} \sim p_{\boldsymbol{\theta}}}\left[\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right)\right]
\end{aligned}
$
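A small numerical check of this identity on a toy model of my own (a 3-bit energy model with $E_{\theta}(x)=-\theta^{T} x$, small enough to enumerate $Z$ exactly and compare against finite differences):

```python
import numpy as np
from itertools import product

# toy energy model over x in {0,1}^3 with E_theta(x) = -theta^T x (illustrative choice)
theta = np.array([0.5, -1.0, 0.2])
states = np.array(list(product([0, 1], repeat=3)), dtype=float)   # all 8 states

dE_dtheta = -states                       # dE/dtheta = -x, for every state
p = np.exp(states @ theta)                # exp(-E_theta(x'))
p /= p.sum()                              # exact p_theta(x') = exp(-E)/Z

x_data = np.array([1.0, 0.0, 1.0])        # one "data point"
# d/dtheta log p_theta(x_data) = -dE/dtheta(x_data) + E_{x' ~ p_theta}[dE/dtheta(x')]
grad = x_data + p @ dE_dtheta

# finite-difference check of log p_theta(x_data)
def logp(th):
    return x_data @ th - np.log(np.exp(states @ th).sum())

eps = 1e-6
fd = np.array([(logp(theta + eps * np.eye(3)[k]) - logp(theta - eps * np.eye(3)[k])) / (2 * eps)
               for k in range(3)])
print(grad)   # analytic gradient via the energy-model identity
print(fd)     # finite differences; the two should agree to numerical precision
```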
## Taking gradients in an RBM
For an RBM we must integrate out the latent variables
$
\begin{aligned}
\mathcal{L}(\boldsymbol{\theta}) &=\frac{1}{\mathrm{~N}} \sum_{n} \log p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{\boldsymbol{n}}\right)=\frac{1}{\mathrm{~N}} \sum_{n} \log \sum_{\boldsymbol{v}} p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{n}, \boldsymbol{v}\right) \\
\frac{\partial}{\partial \boldsymbol{\theta}} \log \sum_{\boldsymbol{v}} p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{n}, \boldsymbol{v}\right) &=-\mathbb{E}_{v \sim p_{\theta}\left(v \mid x_{n}\right)}\left[\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{n}, \boldsymbol{v}\right)\right]+\mathbb{E}_{\boldsymbol{x}^{\prime}, \boldsymbol{v} \sim p_{\theta}(x, v)}\left[\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}, \boldsymbol{v}\right)\right]
\end{aligned}
$
And since for an RBM $E_{\theta}(x, v)=-x^{T} W v-b^{T} x-c^{T} v$,
$
\begin{array}{c}
\frac{d}{d W_{i j}} E_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{v})=-x_{i} v_{j} \Rightarrow \\
\frac{d \mathcal{L}}{d W_{i j}}=\mathbb{E}_{\boldsymbol{v} \sim p_{\boldsymbol{\theta}}\left(\boldsymbol{v} \mid \boldsymbol{x}_{n}\right)}\left[x_{n, i} v_{j}\right]-\mathbb{E}_{\boldsymbol{x}^{\prime}, \boldsymbol{v} \sim p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{v})}\left[x_{i}^{\prime} v_{j}\right]
\end{array}
$
The first term is easy: substitute $x_{n}$ and sum over $v$ (tractable thanks to the factorized $p(v \mid x_{n})$).
The second term (coming from the normalization) is hard: it sums over all $2^{m+d}$ combinations of images & latents.
- Intractable due to exponential complexity w.r.t. $m+d$
- Evaluating and optimizing $p_{\theta}(x, v)$ takes a long time
- If we had only the unnormalized part we would have no problem
### Tackling intractability by sampling
$\mathbb{E}_{x^{\prime}, v \sim p_{\theta}(x, v)}\left[\frac{d}{d \theta} E_{\theta}\left(x^{\prime}, v\right)\right]$ stands for an expectation
- One can sample very many $x^{\prime}, v$ from $p_{\theta}(x, v)$
- Take average instead of computing analytically (Monte Carlo sampling)
Question: how to even sample from a hard pdf?
- Markov Chain Monte Carlo with Gibbs sampling
- Convergence after many rounds
### Sampling the normalization constant
We can rewrite the gradient as
$
\frac{d}{d \boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})=-\mathbb{E}_{0}\left[\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})\right]+\mathbb{E}_{\infty}\left[\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right)\right]
$
where $\mathbb{E}_{0} \equiv \mathbb{E}_{x \sim p_{0}}$ means sampling from the training data and averaging the gradients,
and $\mathbb{E}_{\infty} \equiv \mathbb{E}_{x \sim p_{\theta}(x)}$ means sampling from the model and averaging the gradients.
Unfortunately, MCMC can be very slow $\rightarrow 2^{\text {nd }}$ source of intractability.
## Contrastive divergence for RBMs
[[Contrastive Divergence]] approximates the gradient with a k-step Gibbs sampler:
$
\frac{d}{d \boldsymbol{\theta}} \log p\left(\boldsymbol{x}_{n} \mid \boldsymbol{\theta}\right) \approx-\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{n}, \boldsymbol{v}_{0}\right)+\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{k}^{\prime}, \boldsymbol{v}_{k}\right)
$
Intuitively: push the numerator up (lower the energy of the data) while pushing the denominator down (raise the energy of the model samples).
![[contrastive-RBMs.jpg]]
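A minimal CD-k sketch for a $\{0,1\}$ RBM, reusing the block Gibbs step from above (learning rate, batch handling, and all names are my own; this illustrates the update rule, it is not a tuned implementation):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd_k_update(X, W, b, c, k=1, lr=0.01, rng=None):
    """One CD-k parameter update on a batch X of {0,1} visible vectors (one per row)."""
    if rng is None:
        rng = np.random.default_rng()
    # positive phase: hidden probabilities given the data
    pv0 = sigmoid(X @ W + c)                               # shape (n, m)
    # negative phase: k steps of block Gibbs sampling, started from the data
    v = (rng.random(pv0.shape) < pv0).astype(float)
    for _ in range(k):
        px = sigmoid(v @ W.T + b)
        xk = (rng.random(px.shape) < px).astype(float)     # x'_k
        pvk = sigmoid(xk @ W + c)
        v = (rng.random(pvk.shape) < pvk).astype(float)    # v_k
    # approximate log-likelihood gradient; probabilities are used for the statistics
    # instead of samples (a common variance-reduction heuristic)
    n = X.shape[0]
    dW = (X.T @ pv0 - xk.T @ pvk) / n                      # E_data[x v^T] - E_model[x v^T]
    db = (X - xk).mean(axis=0)
    dc = (pv0 - pvk).mean(axis=0)
    return W + lr * dW, b + lr * db, c + lr * dc           # gradient ascent step

# toy usage on random binary data
rng = np.random.default_rng(0)
d, m = 16, 8
W, b, c = 0.01 * rng.normal(size=(d, m)), np.zeros(d), np.zeros(m)
X = (rng.random((32, d)) < 0.5).astype(float)
for _ in range(10):
    W, b, c = cd_k_update(X, W, b, c, k=1, rng=rng)
```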
### Sampling with Markov Chain Monte Carlo
We want to sample an $x$ from a pdf $p_{\theta}(x)$ with MCMC (the steps below describe a Metropolis-style sampler rather than a Gibbs sampler):
Step 1. Initialize $x^{0}$ randomly
Step 2. Let $\hat{x}=x^{t}+$ noise
- If the negative energy increases, $f_{\theta}(\hat{x})>f_{\theta}\left(x^{t}\right)$, accept: $x^{t+1}=\hat{x}$
- Otherwise, still accept $\hat{x}$ as $x^{t+1}$ with probability $\frac{p(\hat{x})}{p\left(x^{t}\right)}=\exp \left(f_{\theta}(\hat{x})-f_{\theta}\left(x^{t}\right)\right)$; else keep $x^{t+1}=x^{t}$
Step 3. Go to Step 2
Note: Because of the ratio of likelihoods $\rightarrow$ no $Z(\boldsymbol{\theta})$
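A sketch of these steps for a continuous $x$ and a generic negative energy $f_{\theta}$, using Gaussian proposal noise (the step size and all names are my own choices):

```python
import numpy as np

def metropolis_sample(f, x0, n_steps=5000, step=0.5, rng=None):
    """Sample from p(x) ∝ exp(f(x)) given only the negative energy f; Z never appears."""
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        x_hat = x + step * rng.normal(size=x.shape)   # Step 2: propose x^t + noise
        log_ratio = f(x_hat) - f(x)                   # log p(x_hat)/p(x^t), Z cancels
        if log_ratio > 0 or rng.random() < np.exp(log_ratio):
            x = x_hat                                 # accept the proposal
        samples.append(x.copy())                      # otherwise keep x^t
    return np.array(samples)

# toy usage: f(x) = -||x||^2 / 2 corresponds to a standard Gaussian
s = metropolis_sample(lambda x: -0.5 * np.sum(x ** 2), x0=np.zeros(2))
print(s.mean(axis=0), s.var(axis=0))   # roughly 0 and roughly 1
```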
## Using RBMs
- Some of the first models to show nice generations of images
- Use RBMs to pretrain networks for classification afterward
![[rbm-examples.jpg]]