# Gaussian Processes

A Gaussian Process is a distribution over functions that lets us do Bayesian inference directly on function space:

$f(\cdot) \sim \mathcal{GP}(m(\cdot), k(\cdot, \cdot))$

**Intuition:** Think of it as having a "prior belief" about what functions look like. The covariance function $k(x, x')$ encodes our assumptions: if it is large when $x \approx x'$ and decays smoothly with distance, nearby inputs get similar function values, so sampled functions are smooth; periodic kernels encode repeating patterns, etc. For any input $x$, $f(x)$ is a random variable with $\mathbb{E}[f(x)] = m(x)$ and $\text{cov}(f(x), f(x')) = k(x, x')$. When we observe data points, we condition the GP to get a posterior distribution over functions that (1) fits our observations and (2) maintains our structural assumptions everywhere else.

**Why it's powerful:** Unlike parametric models, we don't assume a fixed functional form. Instead, we get a full probability distribution over functions, providing both predictions and uncertainty estimates, which is crucial for decision-making when data is scarce.

Some interesting applications:

**Hyperparameter optimization:** See the worked example below.

**Spatial modeling (Kriging):** Predicting rainfall/temperature at unobserved locations using nearby weather stations. The covariance function encodes "closer locations have more similar weather."

**Active learning:** Deciding which data points to label next. GP uncertainty guides where more labels would be most informative, which matters most when labeling is expensive.

**Robotics:** Sensor fusion and control. A robot arm learning to pick up objects can use GPs to model how different grasping strategies perform, with uncertainty helping it avoid dangerous actions.

**Medical diagnosis:** Modeling patient response to treatments. With limited patient data, GPs provide both treatment recommendations and confidence intervals for safety.

**Computer experiments:** When simulations are expensive (climate models, fluid dynamics), GPs create fast "surrogate models" that approximate the simulation while tracking where the approximation is unreliable.

**Common theme:** GPs excel whenever you have expensive function evaluations, need uncertainty estimates, or want to incorporate domain knowledge through kernel design.

## Example: Hyperparameter Optimization

**Step 1:** You want to optimize the learning rate (lr) and batch size (bs) of a neural network. The "black box" function is f(lr, bs) = validation accuracy after training.

**Step 2:** Start with 3-5 random evaluations:

- f(0.01, 32) = 0.85
- f(0.1, 64) = 0.78
- f(0.001, 128) = 0.82

**Step 3:** Fit a GP to this data. The GP now has a posterior belief about the function shape, with high uncertainty in unexplored regions.

**Step 4:** Use an **acquisition function** to pick the next point:

**Upper Confidence Bound (UCB):** $\mu(x) + \beta \cdot \sigma(x)$

- **The magic:** σ(x) automatically shrinks near observed points but stays large in unexplored regions. Early on, high σ(x) dominates → exploration. As you gather data, μ(x) becomes more reliable → exploitation. The β parameter controls this tradeoff.

**Expected Improvement (EI):** More sophisticated. If the current best is $f_{best} = 0.85$:

$\text{EI}(x) = \sigma(x) \cdot \left[ Z \cdot \Phi(Z) + \phi(Z) \right]$

where $Z = \frac{\mu(x) - f_{best}}{\sigma(x)}$, and $\Phi, \phi$ are the standard normal CDF/PDF.

**Intuition:** EI asks "how much better than 0.85 do I expect this point to be?" It is essentially zero if μ(x) ≤ 0.85 and the uncertainty is low, but positive if there is a decent chance of beating the current best.

**Step 5:** Evaluate the chosen point and update the GP.

**Step 6:** Repeat.
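To make Steps 3-6 concrete, here is a minimal sketch of the loop in Python, assuming scikit-learn's `GaussianProcessRegressor` as the surrogate. The hyperparameter ranges, the random candidate-sampling strategy, and the `train_and_validate` function (standing in for the expensive black box) are illustrative placeholders, not part of any particular library.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(mu, sigma, f_best):
    # EI(x) = sigma * (Z * Phi(Z) + phi(Z)), with Z = (mu - f_best) / sigma
    sigma = np.maximum(sigma, 1e-12)           # avoid division by zero
    z = (mu - f_best) / sigma
    return sigma * (z * norm.cdf(z) + norm.pdf(z))

def upper_confidence_bound(mu, sigma, beta=2.0):
    # UCB(x) = mu(x) + beta * sigma(x)
    return mu + beta * sigma

def train_and_validate(params):
    # Placeholder for the expensive black box: train the network with these
    # hyperparameters and return validation accuracy. Dummy quadratic here.
    log_lr, log_bs = params
    return 0.9 - 0.05 * (log_lr + 2.5) ** 2 - 0.01 * (log_bs - 6) ** 2

# Observations so far: (log10 learning rate, log2 batch size) -> val. accuracy
X = np.array([[-2.0, 5.0], [-1.0, 6.0], [-3.0, 7.0]])
y = np.array([0.85, 0.78, 0.82])

for _ in range(20):                             # optimization budget
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X, y)

    # Score a random set of candidate configurations with the acquisition fn.
    candidates = np.column_stack([
        np.random.uniform(-4, -1, 500),         # log10(lr) in [1e-4, 1e-1]
        np.random.uniform(4, 8, 500),           # log2(bs) in [16, 256]
    ])
    mu, sigma = gp.predict(candidates, return_std=True)
    scores = expected_improvement(mu, sigma, f_best=y.max())

    x_next = candidates[np.argmax(scores)]      # most promising candidate
    y_next = train_and_validate(x_next)
    X = np.vstack([X, x_next])
    y = np.append(y, y_next)
```

Swapping `expected_improvement` for `upper_confidence_bound` in the scoring line switches between the two acquisition functions discussed above.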
The acquisition function automatically shifts from exploration to exploitation as uncertainty decreases. This typically finds near-optimal hyperparameters in 20-50 evaluations vs. the thousands needed for random search.

## Functions as vectors

Think of a function $f(\cdot)$ drawn from a $GP$ as an extremely high-dimensional vector drawn from an extremely high-dimensional multivariate Gaussian distribution. Each dimension corresponds to an element $\mathbf{x} \in \mathbb{R}^{n}$, and each entry of the vector is the value $f(\mathbf{x})$ for that particular $\mathbf{x} \in \mathbb{R}^{n}$.

Sampling from a GP then takes the following form:

1. Sample input points $\mathbf{x} \in \mathbb{R}^{n}$.
2. Construct the Gram matrix $\boldsymbol{K}$ for all sampled $\mathbf{x}$.
3. Sample the vector:

$
\left[\begin{array}{c} f\left(\mathbf{x}_{1}\right) \\ \vdots \\ f\left(\mathbf{x}_{N}\right) \end{array}\right] \sim \mathcal{N}\left(\left[\begin{array}{c} m\left(\mathbf{x}_{1}\right) \\ \vdots \\ m\left(\mathbf{x}_{N}\right) \end{array}\right],\left[\begin{array}{ccc} k\left(\mathbf{x}_{1}, \mathbf{x}_{1}\right) & \ldots & k\left(\mathbf{x}_{1}, \mathbf{x}_{N}\right) \\ \vdots & \ddots & \vdots \\ k\left(\mathbf{x}_{N}, \mathbf{x}_{1}\right) & \ldots & k\left(\mathbf{x}_{N}, \mathbf{x}_{N}\right) \end{array}\right]\right)
$

Specifying a kernel determines the characteristics of the functions drawn from the $GP$. For simplicity, let's take the mean vector to be $\mathbf{0}$:

$
\mathbf{f}=\left[\begin{array}{c} f\left(\mathbf{x}_{1}\right) \\ \vdots \\ f\left(\mathbf{x}_{N}\right) \end{array}\right] \sim \mathcal{N}\left(\mathbf{0},\left[\begin{array}{ccc} k\left(\mathbf{x}_{1}, \mathbf{x}_{1}\right) & \ldots & k\left(\mathbf{x}_{1}, \mathbf{x}_{N}\right) \\ \vdots & \ddots & \vdots \\ k\left(\mathbf{x}_{N}, \mathbf{x}_{1}\right) & \ldots & k\left(\mathbf{x}_{N}, \mathbf{x}_{N}\right) \end{array}\right]\right)
$

One widely used kernel function for GPs is given by the exponential of a quadratic form, with the addition of constant and linear terms:

$
k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)=\theta_{0} \exp \left\{-\frac{\theta_{1}}{2}\left\|\mathbf{x}_{n}-\mathbf{x}_{m}\right\|^{2}\right\}+\theta_{2}+\theta_{3} \mathbf{x}_{n}^{\mathrm{T}} \mathbf{x}_{m}
$

Note that the $\theta$ are hyperparameters of the kernel. Then, to sample functions from this Gaussian Process (see the code sketch after the plots below):

1. Sample a fine grid of points $\left\{\mathbf{x}_{1}, \ldots, \mathbf{x}_{N}\right\} \in[-5,5]$.
2. Compute $\mathbf{K}$.
3. Find the Cholesky decomposition $\mathbf{K}=\mathbf{L} \mathbf{L}^{T}$.
4. Sample a random vector of size $N$: $\mathbf{z} \sim \mathcal{N}\left(\mathbf{0}, \mathbf{I}_{N}\right)$.
5. Obtain $\mathbf{f}=\left[f\left(\mathbf{x}_{1}\right), \ldots, f\left(\mathbf{x}_{N}\right)\right]^{T} \sim \mathcal{N}(\mathbf{0}, \mathbf{K})$ by computing $\mathbf{f}=\mathbf{L} \mathbf{z}$ (recall that this is the [[Gaussian Distribution#Reparameterization Trick|reparameterization trick]]).

Varying the pre-factor $\theta_0$, we see that it governs the amplitude of the sampled functions.

![[theta0.jpg]]

Varying the length-scale parameter $\theta_1$, we see that it governs how quickly the functions vary, i.e. their smoothness.

![[theta1.jpg]]

Varying the offset $\theta_2$, we see that it governs the vertical position of the samples.

![[theta2.jpg]]

Varying the linear term $\theta_3$, we see that it governs the slope of the function.

![[theta3.jpg]]
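Here is a minimal NumPy sketch of the sampling procedure above, using the exponential-quadratic kernel with constant and linear terms; the particular $\theta$ values, grid size, and diagonal jitter are arbitrary illustrative choices.

```python
import numpy as np

def kernel(xn, xm, theta0=1.0, theta1=4.0, theta2=0.0, theta3=0.0):
    # k(xn, xm) = theta0 * exp(-theta1/2 * ||xn - xm||^2) + theta2 + theta3 * xn^T xm
    sq_dist = np.sum((xn - xm) ** 2)
    return theta0 * np.exp(-0.5 * theta1 * sq_dist) + theta2 + theta3 * np.dot(xn, xm)

# 1. A fine grid of input points in [-5, 5]
X = np.linspace(-5, 5, 200).reshape(-1, 1)
N = len(X)

# 2. Gram matrix K (a small jitter on the diagonal keeps the Cholesky stable)
K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
K += 1e-8 * np.eye(N)

# 3. Cholesky decomposition K = L L^T
L = np.linalg.cholesky(K)

# 4.-5. z ~ N(0, I), then f = L z ~ N(0, K)  (reparameterization trick)
z = np.random.randn(N, 5)     # five independent samples at once
f = L @ z                     # each column is one sampled function
```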
## Regression with GPs

Recall that [[Bayesian Linear Regression]]:

1. Suffers from limited expressiveness.
2. Can become more expressive for larger $M$, where $\phi(\mathbf{x}) \in \mathbb{R}^{M}$, but this also requires more parameters ($\mathbf{w} \in \mathbb{R}^{M}$).
3. Requires us to choose the basis functions at "good" locations.

We know that the final mean prediction $\mu_{N}\left(\mathbf{x}^{*}\right)=\boldsymbol{\phi}\left(\mathbf{x}^{*}\right)^{T} \mathbf{m}_{N}=\beta \boldsymbol{\phi}\left(\mathbf{x}^{*}\right)^{T} \mathbf{S}_{N} \boldsymbol{\Phi}^{T} \mathbf{t}$ can be written in [[Equivalent Kernel]] form. This allows us to dispense with the parametric model and instead define a prior probability distribution over functions directly with Gaussian processes.

How can we define this with a GP? Take any finite set $\left\{\mathbf{x}_{1}, \ldots, \mathbf{x}_{N}\right\}$ with corresponding random variables $\left\{f\left(\mathbf{x}_{1}\right), \ldots, f\left(\mathbf{x}_{N}\right)\right\}$; then

$
p\left(\left[\begin{array}{c} f\left(\mathbf{x}_{1}\right) \\ \vdots \\ f\left(\mathbf{x}_{N}\right) \end{array}\right]\right)=\mathcal{N}\left(\left[\begin{array}{c} m\left(\mathbf{x}_{1}\right) \\ \vdots \\ m\left(\mathbf{x}_{N}\right) \end{array}\right],\left[\begin{array}{ccc} k\left(\mathbf{x}_{1}, \mathbf{x}_{1}\right) & \ldots & k\left(\mathbf{x}_{1}, \mathbf{x}_{N}\right) \\ \vdots & \ddots & \vdots \\ k\left(\mathbf{x}_{N}, \mathbf{x}_{1}\right) & \ldots & k\left(\mathbf{x}_{N}, \mathbf{x}_{N}\right) \end{array}\right]\right)
$

Consistency requirement: any subset of $\left\{f\left(\mathbf{x}_{1}\right), \ldots, f\left(\mathbf{x}_{N}\right)\right\}$ should also be Gaussian distributed. This holds because of the [[Gaussian Distribution#Marginalization property|marginalization property]].

We have observed $\left\{\left(\mathbf{x}_{i}, f_{i}\right)\right\}_{i=1}^{N}$, where we assume

$
f_{i}=f\left(\mathbf{x}_{i}\right)=y\left(\mathbf{x}_{i}\right)+\varepsilon, \quad \varepsilon \sim \mathcal{N}\left(0, \beta^{-1}\right)
$

Because the sum of independent Gaussian random variables is also Gaussian,

$
\mathbf{f} \sim \mathcal{N}\left(\mathbf{0}, \mathbf{K}(\mathbf{X}, \mathbf{X})+\beta^{-1} \mathbf{I}\right)
$

The joint distribution of the test points $\mathbf{f}^{*}$ (at $\mathbf{X}^{*}$) and the train points $\mathbf{f}$, according to our $GP$, is given by

$
\left[\begin{array}{c} \mathbf{f} \\ \mathbf{f}^{*} \end{array}\right] \sim \mathcal{N}\left(\mathbf{0},\left[\begin{array}{cc} \mathbf{K}(\mathbf{X}, \mathbf{X})+\beta^{-1} \mathbf{I} & \mathbf{K}\left(\mathbf{X}, \mathbf{X}^{*}\right) \\ \mathbf{K}\left(\mathbf{X}^{*}, \mathbf{X}\right) & \mathbf{K}\left(\mathbf{X}^{*}, \mathbf{X}^{*}\right)+\beta^{-1} \mathbf{I} \end{array}\right]\right)
$

From the [[Gaussian Distribution#Conditioning property|conditioning property]], we can directly obtain the posterior distribution over the test values $\mathbf{f}^{*}$ given the training points $\mathbf{f}$:
$
p\left(\mathbf{f}^{*} \mid \mathbf{X}^{*}, \mathbf{X}, \mathbf{f}\right)=\mathcal{N}\left(\boldsymbol{\mu}^{*}, \boldsymbol{\Sigma}^{*}\right)
$

with

$
\begin{array}{l} \boldsymbol{\mu}^{*}=\mathbf{K}\left(\mathbf{X}^{*}, \mathbf{X}\right)\left(\mathbf{K}(\mathbf{X}, \mathbf{X})+\beta^{-1} \mathbf{I}\right)^{-1} \mathbf{f} \\ \boldsymbol{\Sigma}^{*}=\mathbf{K}\left(\mathbf{X}^{*}, \mathbf{X}^{*}\right)+\beta^{-1} \mathbf{I}-\mathbf{K}\left(\mathbf{X}^{*}, \mathbf{X}\right)\left(\mathbf{K}(\mathbf{X}, \mathbf{X})+\beta^{-1} \mathbf{I}\right)^{-1} \mathbf{K}\left(\mathbf{X}, \mathbf{X}^{*}\right) \end{array}
$

A nice consequence of Gaussian Processes is that they naturally give rise to active learning, i.e. identifying the regions with high uncertainty and gathering more data in those regions.

![[GP.jpg]]

### Choosing the kernel parameters

The predictions of a Gaussian process model depend on the choice of the covariance function, i.e. the kernel. A simple approach with which these hyperparameters can be learned is maximum likelihood estimation. Take the training observations, for which we know

$
p(\mathbf{f})=\mathcal{N}\left(\mathbf{f} \mid \mathbf{0}, \mathbf{C}(\mathbf{X}, \mathbf{X})\right)=\frac{1}{(2 \pi)^{N / 2}|\mathbf{C}|^{1 / 2}} \exp \left(-\frac{1}{2} \mathbf{f}^{T} \mathbf{C}^{-1} \mathbf{f}\right), \quad \text{with } \mathbf{C}(\mathbf{X}, \mathbf{X})=\mathbf{K}(\mathbf{X}, \mathbf{X})+\beta^{-1} \mathbf{I}
$

The log-likelihood then takes the standard form for a multivariate Gaussian distribution (writing $\mathbf{t}$ for the observed targets and $\mathbf{C}_{N}=\mathbf{C}(\mathbf{X}, \mathbf{X})$):

$
\ln p(\mathbf{t} \mid \boldsymbol{\theta})=-\frac{1}{2} \ln \left|\mathbf{C}_{N}\right|-\frac{1}{2} \mathbf{t}^{\mathrm{T}} \mathbf{C}_{N}^{-1} \mathbf{t}-\frac{N}{2} \ln (2 \pi)
$

with gradient

$
\frac{\partial}{\partial \theta_{i}} \ln p(\mathbf{t} \mid \boldsymbol{\theta})=-\frac{1}{2} \operatorname{Tr}\left(\mathbf{C}_{N}^{-1} \frac{\partial \mathbf{C}_{N}}{\partial \theta_{i}}\right)+\frac{1}{2} \mathbf{t}^{\mathrm{T}} \mathbf{C}_{N}^{-1} \frac{\partial \mathbf{C}_{N}}{\partial \theta_{i}} \mathbf{C}_{N}^{-1} \mathbf{t}
$

In general this is a nonconvex function of $\boldsymbol{\theta}$ and can have multiple maxima, so the solutions are obtained numerically, e.g. with gradient-based optimization.
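Putting the regression and likelihood equations above into code, here is a minimal NumPy sketch for a zero-mean GP. The RBF kernel, its length scale, the noise precision $\beta$, and the toy data are illustrative assumptions rather than values from the text.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    # k(a, b) = exp(-||a - b||^2 / (2 * length_scale^2))
    sq_dist = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(X_train, f_train, X_test, beta=25.0):
    # C = K(X, X) + beta^{-1} I   (noisy training covariance)
    C = rbf_kernel(X_train, X_train) + np.eye(len(X_train)) / beta
    K_star = rbf_kernel(X_test, X_train)          # K(X*, X)
    C_inv = np.linalg.inv(C)                      # fine for small N

    # mu*    = K(X*, X) (K(X, X) + beta^{-1} I)^{-1} f
    mu_star = K_star @ C_inv @ f_train
    # Sigma* = K(X*, X*) + beta^{-1} I - K(X*, X) (K(X, X) + beta^{-1} I)^{-1} K(X, X*)
    Sigma_star = (rbf_kernel(X_test, X_test) + np.eye(len(X_test)) / beta
                  - K_star @ C_inv @ K_star.T)
    return mu_star, Sigma_star

def log_marginal_likelihood(X_train, t, beta=25.0):
    # ln p(t | theta) = -1/2 ln|C_N| - 1/2 t^T C_N^{-1} t - N/2 ln(2 pi)
    N = len(X_train)
    C_N = rbf_kernel(X_train, X_train) + np.eye(N) / beta
    _, logdet = np.linalg.slogdet(C_N)
    return -0.5 * logdet - 0.5 * t @ np.linalg.solve(C_N, t) - 0.5 * N * np.log(2 * np.pi)

# Toy data: noisy observations of sin(x)
X_train = np.array([[-4.0], [-2.0], [0.0], [1.0], [3.0]])
f_train = np.sin(X_train).ravel() + 0.1 * np.random.randn(5)
X_test = np.linspace(-5, 5, 100).reshape(-1, 1)

mu, Sigma = gp_posterior(X_train, f_train, X_test)
std = np.sqrt(np.diag(Sigma))                     # predictive uncertainty per test point
print(log_marginal_likelihood(X_train, f_train))
```

In practice the explicit inverse would be replaced by a Cholesky solve, and the log marginal likelihood would be maximized over the kernel hyperparameters as described above.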
## Intuition

In Gaussian processes it is often assumed that $\mu=0$, which simplifies the equations needed for conditioning. We can always assume such a distribution, even if $\mu \neq 0$, and add $\mu$ back to the resulting function values after the prediction step.

The clever step of Gaussian processes is how we set up the covariance matrix $\Sigma$. The covariance matrix not only describes the shape of our distribution, but ultimately determines the characteristics of the function that we want to predict. We generate the covariance matrix by evaluating the kernel $k$, often also called the covariance function, pairwise on all the points. The kernel receives two points $t, t^{\prime} \in \mathbb{R}^n$ as input and returns a similarity measure between those points in the form of a scalar:

$
k: \mathbb{R}^n \times \mathbb{R}^n \rightarrow \mathbb{R}, \quad \Sigma=\operatorname{Cov}\left(X, X^{\prime}\right)=k\left(t, t^{\prime}\right)
$

Kernels are widely used in machine learning, for example in support vector machines. The reason for this is that they allow similarity measures that go far beyond the standard Euclidean ($L2$) distance. Many of these kernels conceptually embed the input points into a higher-dimensional space in which they then measure the similarity.

If we have not yet observed any training examples, this distribution is centred around $\mu=0$, according to our original assumption. The prior distribution will have the same dimensionality as the number of test points $N=|X|$. We will use the kernel to set up the covariance matrix, which has dimensions $N \times N$.

So what happens if we observe training data? Let's go back to the model of Bayesian inference, which states that we can incorporate this additional information into our model, yielding the posterior distribution $P_{X \mid Y}$.

## References

1. Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*, Section 6.4.
2. https://distill.pub/2019/visual-exploration-gaussian-processes
3. https://thegradient.pub/gaussian-process-not-quite-for-dummies/