# Energy based models

Distribution as:

$$
p_{\boldsymbol{\theta}}(\boldsymbol{x})=\frac{1}{\int_{\boldsymbol{x}} g_{\boldsymbol{\theta}}(\boldsymbol{x}) \, d \boldsymbol{x}} \, g_{\boldsymbol{\theta}}(\boldsymbol{x})
$$

Choosing $p_{\theta}$ from known probability distributions (Gaussian, exponential) can be restrictive. Maybe we want to encode domain knowledge of how variables interact.

We can also define an energy function and divide by its volume

$$
g_{\theta}(x)=\exp \left(f_{\theta}(x)\right) \Rightarrow p_{\theta}(x)=\frac{1}{Z(\theta)} \exp \left(f_{\theta}(x)\right)
$$

- $-f_{\theta}(x)$ is the energy function
- Partition function is the hard bit

$$
Z(\boldsymbol{\theta})=\int_{\boldsymbol{x}} \exp \left(f_{\boldsymbol{\theta}}(\boldsymbol{x})\right) d \boldsymbol{x}
$$

- Note the multi-dimensional integral due to $x$

Why exponential and not square?

- Couples well with maximum likelihood and natural logarithms
- Many existing distributions are exponential-based
- They arise often in statistical physics -> Good inspiration

## Advantages and disadvantages

- Very flexible in defining our energy function
- Sampling from $p_{\theta}(x)$ can be very hard
  - The CDF introduces another integral
- Evaluating and optimizing likelihood can be hard -> Learning is hard
  - Must be able to compute the partition function
- In vanilla case no latent variables -> no representation learning
  - Latent variables can be added though

## Ratio of likelihoods

- The partition function is often very hard to compute analytically
- But if we have pairs of inputs

$$
\frac{p_{\theta}\left(x_{a}\right)}{p_{\theta}\left(x_{b}\right)}=\exp \left(f_{\theta}\left(x_{a}\right)-f_{\theta}\left(x_{b}\right)\right)
$$

- No partition function anymore

## Examples

Ising model

$$
p_{\boldsymbol{\theta}}(\boldsymbol{y}, \boldsymbol{x})=\frac{1}{Z} \exp \left(\sum_{i} \psi_{i}\left(x_{i}, y_{i}\right)+\sum_{(i, j) \in E} \psi_{i j}\left(y_{i}, y_{j}\right)\right)
$$

Product of experts (similar to AND)

$$
p_{\boldsymbol{\theta}}(\boldsymbol{x})=\frac{1}{Z(\boldsymbol{\theta}, \boldsymbol{\varphi}, \boldsymbol{\omega})} q_{\theta}(\boldsymbol{x}) \, r_{\varphi}(\boldsymbol{x}) \, s_{\omega}(\boldsymbol{x})
$$

[[Hopfield Networks]]

[[Boltzmann Machines]]

[[Deep Belief Networks]]

## Applications

Given trained model

- [[Anomaly Detection]]
- Denoising & restoration
- Classification
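To make the ratio-of-likelihoods point concrete, here is a minimal Python sketch on a toy Ising-style chain. The chain length, the `coupling` and `field` values, and the function names are illustrative assumptions, not part of these notes. The observed variables $x_i$ are replaced by a constant field to keep the example small. The partition function is brute-forced over all $2^n$ states only to check that the ratio computed with $Z$ matches the one computed without it.

```python
import itertools
import math


def f(y, coupling=1.0, field=0.2):
    """Negative energy f_theta(y) of a toy Ising chain, spins in {-1, +1}.

    coupling and field are made-up parameters for illustration.
    """
    unary = field * sum(y)                                   # sum_i psi_i(y_i)
    pairwise = coupling * sum(a * b for a, b in zip(y, y[1:]))  # neighbouring pairs
    return unary + pairwise


def partition_function(n):
    """Brute-force Z: sum of exp(f) over all 2**n spin configurations."""
    return sum(math.exp(f(y)) for y in itertools.product((-1, 1), repeat=n))


n = 10
y_a = (1,) * n                              # all spins aligned
y_b = tuple((-1) ** i for i in range(n))    # alternating spins

Z = partition_function(n)                   # 1024 terms here, 2**n in general
p_a = math.exp(f(y_a)) / Z
p_b = math.exp(f(y_b)) / Z

ratio_with_Z = p_a / p_b                          # needs the expensive sum
ratio_without_Z = math.exp(f(y_a) - f(y_b))       # needs only two evaluations of f

print(ratio_with_Z, ratio_without_Z)
```

The brute-force sum doubles in cost with every added spin, while the ratio only ever needs two evaluations of $f_{\theta}$. That is why methods that compare pairs of inputs can avoid $Z$ entirely.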