# Equivalent Kernel

The predictive distribution of [[Bayesian Linear Regression]] is given by

$$
\begin{align}
p\left(t^{\prime} \mid x^{\prime}, \mathbf{X}, \mathbf{t}, \alpha, \beta\right)&=\int p\left(t^{\prime} \mid x^{\prime}, \mathbf{w}, \beta\right) p(\mathbf{w} \mid \mathbf{X}, \mathbf{t}, \alpha, \beta) \,\mathrm{d} \mathbf{w} \\
&= \mathcal{N}\left(t^{\prime} \mid \mathbf{m}_{N}^{T} \boldsymbol{\phi}\left(x^{\prime}\right), \sigma_{N}^{2}\left(x^{\prime}\right)\right)
\end{align}
$$

where

$$
\mathbf{m}_{N}=\beta \mathbf{S}_{N} \boldsymbol{\Phi}^{T} \mathbf{t}, \quad \sigma_{N}^{2}\left(x^{\prime}\right)=\frac{1}{\beta}+\boldsymbol{\phi}\left(x^{\prime}\right)^{T} \mathbf{S}_{N} \boldsymbol{\phi}\left(x^{\prime}\right), \quad \mathbf{S}_{N}^{-1}=\alpha \mathbb{1}+\beta \boldsymbol{\Phi}^{T} \boldsymbol{\Phi}
$$

The predictive mean can be written in the form

$$
\begin{align}
y\left(x^{\prime}, \mathbf{m}_{N}\right)&=\boldsymbol{\phi}\left(x^{\prime}\right)^{T} \mathbf{m}_{N} \\
&= \beta \boldsymbol{\phi}\left(x^{\prime}\right)^{T} \mathbf{S}_{N} \boldsymbol{\Phi}^{T} \mathbf{t} \\
&= \sum_{n=1}^{N} \beta \boldsymbol{\phi}\left(x^{\prime}\right)^{T} \mathbf{S}_{N} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right) t_{n}
\end{align}
$$

Thus the mean of the predictive distribution at a point $\mathbf{x}$ is a linear combination of the training-set target variables $t_n$, so we can write

$$
y\left(\mathbf{x}, \mathbf{m}_{N}\right)=\sum_{n=1}^{N} k\left(\mathbf{x}, \mathbf{x}_{n}\right) t_{n}
$$

where the function

$$
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\beta \boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}} \mathbf{S}_{N} \boldsymbol{\phi}\left(\mathbf{x}^{\prime}\right)
$$

is known as the *equivalent kernel* or *smoother matrix*. For each choice of basis functions there is a corresponding equivalent kernel, which describes how predictions are formed as a linear combination of the training targets. The equivalent kernel can be thought of as a similarity measure between the new data point and each observed input, with the contribution of each observation weighted through the posterior covariance $\mathbf{S}_N$.

We can also see how the equivalent kernel arises by taking the covariance between two predictions:

$$
\begin{align}
\operatorname{cov}\left[t_{1}, t_{2} \mid x_{1}, x_{2}\right]&=\operatorname{cov}_{\mathbf{w}}\left[y\left(x_{1}, \mathbf{w}\right), y\left(x_{2}, \mathbf{w}\right)\right] \\
&= \operatorname{cov}_{\mathbf{w}}\left[\boldsymbol{\phi}\left(x_{1}\right)^{T} \mathbf{w}, \mathbf{w}^{T} \boldsymbol{\phi}\left(x_{2}\right)\right] \\
&= \mathbb{E}_{\mathbf{w}}\left[\boldsymbol{\phi}(x_1)^{T}\mathbf{w}\mathbf{w}^{T}\boldsymbol{\phi}(x_2)\right] - \mathbb{E}_{\mathbf{w}}\left[\boldsymbol{\phi}(x_1)^{T}\mathbf{w}\right]\mathbb{E}_{\mathbf{w}}\left[\mathbf{w}^{T}\boldsymbol{\phi}(x_2)\right]\\
&= \boldsymbol{\phi}(x_1)^{T}\left(\mathbb{E}_{\mathbf{w}}\left[\mathbf{w}\mathbf{w}^{T}\right] - \mathbb{E}_{\mathbf{w}}[\mathbf{w}]\,\mathbb{E}_{\mathbf{w}}[\mathbf{w}]^{T}\right) \boldsymbol{\phi}(x_2) \\
&= \boldsymbol{\phi}(x_1)^{T} \operatorname{cov}[\mathbf{w}]\,\boldsymbol{\phi}(x_2) \\
&= \boldsymbol{\phi}(x_1)^{T} \mathbf{S}_{N}\boldsymbol{\phi}(x_2) \\
&= \frac{1}{\beta}k\left(x_1, x_2\right)
\end{align}
$$

Thus the predictive mean at nearby points is highly correlated, whereas for more distant pairs of points the correlation is smaller.

The equivalent kernel also satisfies the summation constraint

$$
\sum_{n=1}^{N} k\left(\mathbf{x}, \mathbf{x}_{n}\right)=1
$$

Informally: if all targets were $t_n = 1$, a model that fits the data well would predict $y(\mathbf{x}) \approx 1$ everywhere. Note that the kernel function can take negative as well as positive values, so the predictive mean is not, in general, a convex combination of the training targets.
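As a quick numerical sanity check, here is a minimal sketch that builds $\mathbf{S}_N$ for a Gaussian basis, forms the equivalent kernel, and verifies that $\sum_n k(x', \mathbf{x}_n)\, t_n$ reproduces $\boldsymbol{\phi}(x')^T \mathbf{m}_N$. The Gaussian basis, the sinusoidal data, and the values of $\alpha$, $\beta$ are all illustrative assumptions, not taken from the note.

```python
import numpy as np

# --- Illustrative setup (assumed, not from the note) ---
rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0                      # prior / noise precision (assumed)
N, M = 50, 9

def phi(x, centers=np.linspace(0, 1, M), s=0.1):
    """Design vector(s) of Gaussian basis functions, shape (len(x), M)."""
    x = np.atleast_1d(x)
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

X = rng.uniform(0, 1, N)                                      # training inputs
t = np.sin(2 * np.pi * X) + rng.normal(0, beta ** -0.5, N)    # noisy targets

Phi = phi(X)                                                  # (N, M) design matrix
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)   # posterior covariance
m_N = beta * S_N @ Phi.T @ t                                  # posterior mean of w

def k(x, x_prime):
    """Equivalent kernel k(x, x') = beta * phi(x)^T S_N phi(x')."""
    return beta * phi(x) @ S_N @ phi(x_prime).T

x_new = 0.3
weights = k(x_new, X)                        # shape (1, N): one weight per target

# The two forms of the predictive mean agree:
print(np.allclose(phi(x_new) @ m_N, weights @ t))   # True
```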
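Continuing the sketch above, the same objects let us check the covariance identity and the summation constraint numerically. The summation holds only approximately here, since the prior shrinks the fit slightly and the check uses a finite basis over a finite data region.

```python
# Covariance between two predictive means:
# cov[y(x1), y(x2)] = phi(x1)^T S_N phi(x2) = (1/beta) * k(x1, x2)
x1, x2 = 0.30, 0.32
print(phi(x1) @ S_N @ phi(x2).T, k(x1, x2) / beta)   # equal; large for nearby x

# Summation constraint: the kernel weights over the training set
# sum to (approximately) one for x within the data region.
print(k(0.5, X).sum())                               # ≈ 1.0
```

---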