# Boltzmann Machines

[[Hopfield Networks]] minimize the quadratic energy function

$$
E=-f_{\theta}(x)=-\left(\sum_{i, j} w_{i j} x_{i} x_{j}+\sum_{i} b_{i} x_{i}\right)
$$

Boltzmann machines are stochastic Hopfield networks. In a Boltzmann machine the neuron response to activation $a_{i}$ is

$$
x_{i}=\begin{cases}+1 & \text{with probability } \dfrac{1}{1+\exp\left(-2 a_{i}\right)} \\ -1 & \text{otherwise}\end{cases}
$$

## Restricted Boltzmann Machines

We can interpret Boltzmann machines as implementing Gibbs sampling for the pdf $p(x)=\frac{1}{Z} \exp \left(\frac{1}{2} x^{T} W x\right)$.

Boltzmann machines are too parameter heavy: for an $x$ with $256 \times 256=65536$ pixels, $W$ has $65536^{2} \approx 4.3$ billion parameters. Boltzmann machines also learn no features. Restricted Boltzmann machines (RBMs) instead add bottleneck latents $v$:

$$
E=-f_{\theta}(x)=-\left(\sum_{i, j} w_{i j} x_{i} v_{j}+\sum_{i} b_{i} x_{i}+\sum_{j} c_{j} v_{j}\right)
$$

$x_{i}$ and $v_{j}$ are still binary variables, as in the original model.

- The quadratic term captures correlations
- The unary terms capture priors: how likely a (latent) pixel is to be $+1$ or $-1$

Energy function: $E(x, v)=-x^{T} W v-b^{T} x-c^{T} v$

$$
p(x)=\frac{1}{Z} \sum_{v} \exp (-E(x, v))
$$

Because of the sum over $v$, this is not directly of the form $\frac{1}{Z} \exp (\cdot)$. Rewriting it in terms of the free energy $F(x)$:

$$
F(x)=-\boldsymbol{b}^{T} \boldsymbol{x}-\sum_{j} \log \sum_{v_{j}} \exp \left(v_{j}\left(c_{j}+\boldsymbol{W}_{\cdot j}^{T} \boldsymbol{x}\right)\right)
$$

$$
\begin{aligned}
p(x) &=\frac{1}{Z} \exp (-F(x)) \\
Z &=\sum_{x} \exp (-F(x))
\end{aligned}
$$

The RBM defines a bipartite graph with undirected connections: information flows both forward and backward.

![[rbms.jpg]]

The hidden variables $v_{j}$ are independent conditioned on the visible variables

$$
p(\boldsymbol{v} \mid \boldsymbol{x})=\prod_{j} p\left(v_{j} \mid \boldsymbol{x}, \boldsymbol{\theta}\right)
$$

The visible variables $x_{i}$ are independent conditioned on the hidden variables

$$
p(\boldsymbol{x} \mid \boldsymbol{v})=\prod_{i} p\left(x_{i} \mid \boldsymbol{v}, \boldsymbol{\theta}\right)
$$

## Training RBM conditional probabilities

The conditional probabilities are defined as sigmoids

$$
\begin{aligned}
p\left(v_{j} \mid \boldsymbol{x}, \boldsymbol{\theta}\right) &=\sigma\left(\boldsymbol{W}_{\cdot j}^{T} \boldsymbol{x}+c_{j}\right) \\
p\left(x_{i} \mid \boldsymbol{v}, \boldsymbol{\theta}\right) &=\sigma\left(\boldsymbol{W}_{i \cdot}\, \boldsymbol{v}+b_{i}\right)
\end{aligned}
$$

where $\boldsymbol{W}_{\cdot j}$ is the $j$-th column and $\boldsymbol{W}_{i \cdot}$ the $i$-th row of $\boldsymbol{W}$.

Since RBMs are bidirectional, we can "loop" between the visible and latent variables, going back and forth again and again:

$$
\begin{aligned}
v_{j}^{(1)} &\sim \sigma\left(\boldsymbol{W}_{\cdot j}^{T} \boldsymbol{x}^{(0)}+c_{j}\right) \Rightarrow \\
x_{i}^{(1)} &\sim \sigma\left(\boldsymbol{W}_{i \cdot}\, \boldsymbol{v}^{(1)}+b_{i}\right) \Rightarrow \\
v_{j}^{(2)} &\sim \sigma\left(\boldsymbol{W}_{\cdot j}^{T} \boldsymbol{x}^{(1)}+c_{j}\right) \Rightarrow \ldots
\end{aligned}
$$
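To make the alternating chain concrete, here is a minimal NumPy sketch of this block Gibbs loop for a tiny RBM. It assumes 0/1 units, toy sizes `d`, `m`, and helper names (`sample_v_given_x`, `sample_x_given_v`) chosen purely for illustration; none of these details are fixed by the notes above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes and parameters (assumed for illustration): d visible units, m latent units
d, m = 6, 4
W = 0.01 * rng.standard_normal((d, m))   # W[:, j] connects all visibles to latent j
b = np.zeros(d)                          # visible biases
c = np.zeros(m)                          # latent biases

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_v_given_x(x):
    """Latents are conditionally independent given x: v_j ~ Bernoulli(sigma(W_{.j}^T x + c_j))."""
    p = sigmoid(x @ W + c)               # shape (m,)
    return (rng.random(m) < p).astype(float)

def sample_x_given_v(v):
    """Visibles are conditionally independent given v: x_i ~ Bernoulli(sigma(W_{i.} v + b_i))."""
    p = sigmoid(W @ v + b)               # shape (d,)
    return (rng.random(d) < p).astype(float)

# The "loop": start from some visible configuration x^(0) and alternate v -> x -> v -> ...
x = (rng.random(d) < 0.5).astype(float)
for t in range(3):
    v = sample_v_given_x(x)              # v^(t+1) ~ p(v | x^(t))
    x = sample_x_given_v(v)              # x^(t+1) ~ p(x | v^(t+1))
```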
## Training any energy model

Maximizing the log-likelihood

$$
\mathcal{L}(\boldsymbol{\theta})=\frac{1}{N} \sum_{n} \log p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{n}\right)=\mathbb{E}_{p_{0}}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x})\right]
$$

The expectation w.r.t. a pdf is equivalent to
- sampling from the pdf and
- then taking the average

$$
\mathbb{E}_{x \sim p_{0}}[\log p(x \mid \theta)]=\mathbb{E}_{x \sim p_{0}}\left[-E_{\theta}(x)\right]-\log Z(\theta)
$$

- where $\log Z(\boldsymbol{\theta})=\log \sum_{x^{\prime}} \exp \left(-E_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right)\right)$
- and $p_{0}(x)$ is the data distribution

## Taking gradients of any energy model

$$
\begin{aligned}
\frac{d}{d \boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\boldsymbol{x}) &=-\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})-\frac{d}{d \boldsymbol{\theta}} \log Z(\boldsymbol{\theta}) \\
&=-\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})-\frac{1}{Z(\boldsymbol{\theta})} \frac{d}{d \boldsymbol{\theta}} Z(\boldsymbol{\theta}) \\
&=-\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})-\sum_{\boldsymbol{x}^{\prime}} \frac{1}{Z(\boldsymbol{\theta})} \exp \left(-E_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right)\right)\left(-\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right)\right) \\
&=-\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})+\sum_{\boldsymbol{x}^{\prime}} p_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right) \frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right) \\
&=-\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})+\mathbb{E}_{\boldsymbol{x}^{\prime} \sim p_{\boldsymbol{\theta}}}\left[\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right)\right]
\end{aligned}
$$

## Taking gradients in an RBM

For an RBM we must integrate out the latent variables

$$
\mathcal{L}(\boldsymbol{\theta})=\frac{1}{N} \sum_{n} \log p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{n}\right)=\frac{1}{N} \sum_{n} \log \sum_{\boldsymbol{v}} p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{n}, \boldsymbol{v}\right)
$$

$$
\frac{\partial}{\partial \boldsymbol{\theta}} \log \sum_{\boldsymbol{v}} p_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{n}, \boldsymbol{v}\right)=-\mathbb{E}_{\boldsymbol{v} \sim p_{\boldsymbol{\theta}}\left(\boldsymbol{v} \mid \boldsymbol{x}_{n}\right)}\left[\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{n}, \boldsymbol{v}\right)\right]+\mathbb{E}_{\boldsymbol{x}^{\prime}, \boldsymbol{v} \sim p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{v})}\left[\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}, \boldsymbol{v}\right)\right]
$$

And since for an RBM $E_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{v})=-\boldsymbol{x}^{T} \boldsymbol{W} \boldsymbol{v}-\boldsymbol{b}^{T} \boldsymbol{x}-\boldsymbol{c}^{T} \boldsymbol{v}$,

$$
\frac{d}{d W_{i j}} E_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{v})=-x_{i} v_{j} \Rightarrow \frac{d \mathcal{L}}{d W_{i j}}=\mathbb{E}_{\boldsymbol{v} \sim p_{\boldsymbol{\theta}}\left(\boldsymbol{v} \mid \boldsymbol{x}_{n}\right)}\left[x_{i} v_{j}\right]-\mathbb{E}_{\boldsymbol{x}^{\prime}, \boldsymbol{v} \sim p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{v})}\left[x_{i} v_{j}\right]
$$

The first term is easy: substitute $x_{n}$ and sum over $v$. The second term (the normalization) is hard: it sums over all $2^{m+d}$ combinations of images & latents.
- Intractable due to exponential complexity w.r.t. $m+d$
- Evaluating and optimizing $p_{\theta}(x, v)$ takes a long time
- If we had only the unnormalized part, we would have no problem

### Tackling intractability by sampling

$\mathbb{E}_{x^{\prime}, v \sim p_{\theta}(x, v)}\left[\frac{d}{d \theta} E_{\theta}\left(x^{\prime}, v\right)\right]$ stands for an expectation
- One can sample very many $x^{\prime}, v$ from $p_{\theta}(x, v)$
- and take the average instead of computing it analytically (Monte Carlo sampling)

Question: how do we even sample from such a hard pdf?
- Markov Chain Monte Carlo with Gibbs sampling
- Convergence only after many rounds

### Sampling the normalization constant

We can rewrite the gradient as

$$
\frac{d}{d \boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})=-\mathbb{E}_{0}\left[\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})\right]+\mathbb{E}_{\infty}\left[\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right)\right]
$$

where
- $\mathbb{E}_{0} \equiv \mathbb{E}_{x \sim p_{0}}$ means: sample from the training data and average the gradients
- $\mathbb{E}_{\infty} \equiv \mathbb{E}_{x \sim p_{\theta}(x)}$ means: sample from the model and average the gradients

Unfortunately, MCMC can be very slow $\rightarrow$ a second source of intractability.

## Contrastive divergence for RBMs

[[Contrastive Divergence]] approximates the gradient with a $k$-step Gibbs sampler:

$$
\frac{d}{d \boldsymbol{\theta}} \log p\left(\boldsymbol{x}_{n} \mid \boldsymbol{\theta}\right) \approx-\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{n}, \boldsymbol{v}_{0}\right)+\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{k}^{\prime}, \boldsymbol{v}_{k}\right)
$$

where $\boldsymbol{v}_{0} \sim p_{\boldsymbol{\theta}}\left(\boldsymbol{v} \mid \boldsymbol{x}_{n}\right)$ and $\left(\boldsymbol{x}_{k}^{\prime}, \boldsymbol{v}_{k}\right)$ comes from $k$ steps of the Gibbs chain started at $\boldsymbol{x}_{n}$. This pushes the numerator up (lowers the energy of the data) while pushing the denominator down (raises the energy of model samples).

![[contrastive-RBMs.jpg]]
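A minimal NumPy sketch of a CD-$k$ parameter update, under the same toy 0/1-unit convention as the sampling sketch above. The batch handling, learning rate, and the use of conditional probabilities rather than binary samples in the gradient estimates are common practical choices assumed here, not details fixed by these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd_k_step(W, b, c, x_batch, k=1, lr=0.1):
    """One CD-k update on a batch of 0/1 visible vectors x_batch with shape (n, d)."""
    # Positive phase: statistics of p(v | x_n) at the data (the "E_0" term)
    p_v0 = sigmoid(x_batch @ W + c)                       # closed form for E[v_j | x_n]
    v = (rng.random(p_v0.shape) < p_v0).astype(float)     # v_0

    # Negative phase: k steps of the Gibbs chain, started at the data
    x = x_batch
    for _ in range(k):
        p_x = sigmoid(v @ W.T + b)                        # p(x | v)
        x = (rng.random(p_x.shape) < p_x).astype(float)   # x'_t
        p_v = sigmoid(x @ W + c)                          # p(v | x'_t)
        v = (rng.random(p_v.shape) < p_v).astype(float)   # v_t

    n = x_batch.shape[0]
    # dL/dW_ij ~ <x_i v_j>_data - <x_i v_j>_model, and analogously for the biases
    dW = (x_batch.T @ p_v0 - x.T @ p_v) / n
    db = (x_batch - x).mean(axis=0)
    dc = (p_v0 - p_v).mean(axis=0)

    # Gradient ascent on the approximate log-likelihood
    return W + lr * dW, b + lr * db, c + lr * dc

# Purely illustrative usage on random binary "data"
d, m = 6, 4
W, b, c = 0.01 * rng.standard_normal((d, m)), np.zeros(d), np.zeros(m)
data = (rng.random((32, d)) < 0.5).astype(float)
for _ in range(10):
    W, b, c = cd_k_step(W, b, c, data, k=1)
```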
### Sampling with Markov Chain Monte Carlo

We want to sample an $x$ from a pdf $p_{\theta}(x)$ with MCMC (here a Metropolis-style accept/reject rule):

Step 1. Initialize $x^{0}$ randomly.

Step 2. Let $\hat{x}=x^{t}+$ noise
- If the negative energy $f_{\theta}(\hat{x})>f_{\theta}\left(x^{t}\right)$, set $x^{t+1}=\hat{x}$
- Otherwise, still accept $x^{t+1}=\hat{x}$, but only with probability $\frac{p(\hat{x})}{p\left(x^{t}\right)}=\exp \left(f_{\theta}(\hat{x})-f_{\theta}\left(x^{t}\right)\right)$; else keep $x^{t+1}=x^{t}$

Go to Step 2.

Note: because only the ratio of likelihoods appears, $Z(\boldsymbol{\theta})$ cancels out.

## Using RBMs

- Some of the first models to show nice generations of images
- Use RBMs to pretrain networks for classification afterwards

![[rbm-examples.jpg]]

---

## References