# Kernel Methods
Pattern-matching algorithms such as nearest neighbours are examples of _memory-based_ methods, which store the entire training set and formulate a _similarity metric_ in order to make predictions for future data points.
Many linear parametric models can be re-cast into an equivalent 'dual representation' in which the predictions are also based on linear combinations of a _kernel function_ evaluated at the training data points.
$
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\phi(\mathbf{x})^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}^{\prime}\right)
$
The power of kernels comes from the fact that they can be substituted into any model formulated in terms of inner products of feature vectors. Such kernels have no explicitly defined parameters, but implicitly work with a finite or even infinite number of features, which makes them very powerful.
An important contribution to arise from the kernel viewpoint has been the extension to inputs that are symbolic, rather than simply vectors of real numbers. Kernel functions can be defined over objects as diverse as graphs, sets, strings and text documents.
## Kernelized Ridge Regression
The goal is to minimize the sum of squared errors with a quadratic weight penalty:
$
J(\mathbf{w})=\frac{1}{2} \sum_{n=1}^{N}\left\{\mathbf{w}^{T} \phi\left(\mathbf{x}_{n}\right)-t_{n}\right\}^{2}+\frac{\lambda}{2} \mathbf{w}^{T} \mathbf{w}
$
To obtain the solution, we take the derivative of the error function with respect to the parameters and set it to zero:
$
\begin{align}
\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}=\sum_{n=1}^{N}\left\{\mathbf{w}^{T} \phi\left(\mathbf{x}_{n}\right)-t_{n}\right\} \phi\left(\mathbf{x}_{n}\right)^{T}+\lambda \mathbf{w}^{T}&=\mathbf{0} \\
\mathbf{w}^{T}\left(\sum_{n=1}^{N} \phi\left(\mathbf{x}_{n}\right) \phi\left(\mathbf{x}_{n}\right)^{T}+\lambda \mathbf{I}\right)&=\sum_{n=1}^{N} t_{n} \phi\left(\mathbf{x}_{n}\right)^{T} \\
\left(\sum_{n=1}^{N} \phi\left(\mathbf{x}_{n}\right) \phi\left(\mathbf{x}_{n}\right)^{T}+\lambda \mathbf{I}\right) \mathbf{w}&=\sum_{n=1}^{N} t_{n} \phi\left(\mathbf{x}_{n}\right)
\end{align}
$
$
\begin{align}
\mathbf{w}&=\left(\sum_{n=1}^{N} \phi\left(\mathbf{x}_{n}\right) \phi\left(\mathbf{x}_{n}\right)^{T}+\lambda \mathbf{I}\right)^{-1} \sum_{n=1}^{N} t_{n} \phi\left(\mathbf{x}_{n}\right) \\
&=\left(\mathbf{\Phi}^{T} \mathbf{\Phi}+\lambda \mathbf{I}\right)^{-1} \mathbf{\Phi}^{T} \boldsymbol{t}
\end{align}
$
We now define a _Gram_ matrix $\mathbf{K}=\mathbf{\Phi} \mathbf{\Phi}^{\mathrm{T}}$, which is a symmetric $N \times N$ matrix with elements
$
K_{n m}=\phi\left(\mathbf{x}_{n}\right)^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{m}\right)=k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)
$
Applying the matrix identity $\left(\mathbf{\Phi}^{T} \mathbf{\Phi}+\lambda \mathbf{I}_{M}\right)^{-1} \mathbf{\Phi}^{T}=\mathbf{\Phi}^{T}\left(\mathbf{\Phi} \mathbf{\Phi}^{T}+\lambda \mathbf{I}_{N}\right)^{-1}$, we get
$
\mathbf{w}=\mathbf{\Phi}^{T}\left(\mathbf{\Phi} \mathbf{\Phi}^{T}+\lambda \mathbf{I}_{N}\right)^{-1} \boldsymbol{t}=\mathbf{\Phi}^{T}\left(\mathbf{K}+\lambda \mathbf{I}_{N}\right)^{-1} \boldsymbol{t}
$
Thus this gives rise to a _dual representation_ $\mathbf{w}=\mathbf{\Phi}^{T} \mathbf{a}$ with dual variables $\mathbf{a}=\left(\mathbf{K}+\lambda \mathbf{I}_{N}\right)^{-1} \boldsymbol{t}$, so that the solution is expressed entirely in terms of the kernel function.
In the dual formulation, the closed-form solution requires inverting an $N\times N$ matrix at cubic cost $O(N^3)$, whereas the primal form requires inverting an $M\times M$ matrix at cost $O(M^3)$. So why is the dual representation desirable? (See the sketch after the list below.)
1. Allows us to implicitly use feature spaces of high, even infinite, dimensionality.
2. Does not rely on explicit [[Basis Functions]] but on similarity kernel functions.
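As a concrete illustration, here is a minimal NumPy sketch of kernelized ridge regression via the dual solution $\mathbf{a}=(\mathbf{K}+\lambda\mathbf{I}_{N})^{-1}\boldsymbol{t}$. The Gaussian kernel, the toy data, and the values of $\sigma$ and $\lambda$ are arbitrary choices for the example, not prescribed by the text.
```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    """Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2 * X1 @ X2.T)
    return np.exp(-sq_dists / (2 * sigma**2))

# Toy 1-D regression data (arbitrary example values)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
t = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

lam = 0.1                                          # regularization strength lambda
K = gaussian_kernel(X, X)                          # N x N Gram matrix
a = np.linalg.solve(K + lam * np.eye(len(X)), t)   # dual variables a = (K + lambda I)^{-1} t

# Prediction for new inputs: y(x) = sum_n a_n k(x_n, x)
X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
y_new = gaussian_kernel(X_new, X) @ a
print(y_new)
```
Note that the feature map $\phi$ never appears explicitly; only kernel evaluations between pairs of inputs are needed.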
## The Kernel Trick
Formulate your optimization problem in such a way that the input vectors $\mathbf{x}_{n}$ enter only in the form of scalar products:
$
\mathbf{x}_{n}^{T} \mathbf{x}_{m} \quad \text{(or, when using basis functions, } \boldsymbol{\phi}\left(\mathbf{x}_{n}\right)^{T} \boldsymbol{\phi}\left(\mathbf{x}_{m}\right))
$
Replace all instances of $\mathbf{x}_{n}^{T} \mathbf{x}_{m}$ with a kernel function $k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)$.
In order to exploit this kernel substitution, the kernel must be a _valid kernel function_, which means the Gram matrix $\mathbf{K}$ must be symmetric and positive semidefinite for all possible choices of $\left\{\mathbf{x}_{n}\right\}_{n=1}^{N}$.
One approach to constructing valid kernel functions is to choose an explicit set of [[Basis Functions]], for example:
$
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\phi(\mathbf{x})^{T} \phi\left(\mathbf{x}^{\prime}\right)=\sum_{i=1}^{M} \phi_{i}(\mathbf{x}) \phi_{i}\left(\mathbf{x}^{\prime}\right)
$
Or, similarly, using the [[Equivalent Kernel]] with the [[Linear Algebra#Eigen Decomposition|eigen decomposition]] of the covariance matrix:
$
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\phi(\mathbf{x})^{T} \mathbf{\Sigma} \phi\left(\mathbf{x}^{\prime}\right)=\psi(\mathbf{x})^{T} \psi\left(\mathbf{x}^{\prime}\right)=\sum_{i=1}^{M} \psi_{i}(\mathbf{x}) \psi_{i}\left(\mathbf{x}^{\prime}\right)
$
For every positive definite kernel there exists a feature map $\phi: \mathbb{R}^{d} \rightarrow \mathbb{R}^{M}$ (where $M$ may be infinite) such that
$
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\phi(\mathbf{x})^{T} \phi\left(\mathbf{x}^{\prime}\right)
$
For example, in the case of a two-dimensional input space $\mathbf{x}=\left(x_{1}, x_{2}\right)$, we can identify the nonlinear feature mapping explicitly:
$
\begin{aligned}
k(\mathbf{x}, \mathbf{z}) &=\left(\mathbf{x}^{\mathrm{T}} \mathbf{z}\right)^{2}=\left(x_{1} z_{1}+x_{2} z_{2}\right)^{2} \\
&=x_{1}^{2} z_{1}^{2}+2 x_{1} z_{1} x_{2} z_{2}+x_{2}^{2} z_{2}^{2} \\
&=\left(x_{1}^{2}, \sqrt{2} x_{1} x_{2}, x_{2}^{2}\right)\left(z_{1}^{2}, \sqrt{2} z_{1} z_{2}, z_{2}^{2}\right)^{\mathrm{T}} \\
&=\phi(\mathbf{x})^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{z})
\end{aligned}
$
Therefore, a valid kernel always corresponds to some nonlinear mapping function, but we don't necessarily need to know what it is. In general it is difficult to recover the corresponding $\phi(\mathbf{x})$ for a given kernel.
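As a quick numerical sanity check of this identity (a throwaway sketch with arbitrary input vectors):
```python
import numpy as np

def phi(x):
    """Explicit feature map for the kernel k(x, z) = (x^T z)^2 in two dimensions."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])     # arbitrary example inputs
z = np.array([3.0, -1.0])

k_direct = (x @ z) ** 2      # kernel evaluated directly on the inputs
k_feature = phi(x) @ phi(z)  # inner product in the explicit feature space
print(k_direct, k_feature)   # both equal 1.0
```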
## Constructing Kernels from Other Kernels
Given valid kernels $k_{1}$ and $k_{2}$, the following are also valid kernels, where $c>0$ is a constant, $f(\cdot)$ is any function, $q(\cdot)$ is a polynomial with nonnegative coefficients, $k_{3}$ is a valid kernel on the range of $\boldsymbol{\phi}$, $\mathbf{A}$ is a symmetric positive semidefinite matrix, and $\mathbf{x}=\left(\mathbf{x}_{a}, \mathbf{x}_{b}\right)$ with $k_{a}$, $k_{b}$ valid kernels over their respective spaces:
$
\begin{aligned}
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=c k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right) \\
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=f(\mathbf{x}) k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right) f\left(\mathbf{x}^{\prime}\right) \\
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=q\left(k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right)\right) \\
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=\exp \left(k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right)\right) \\
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right)+k_{2}\left(\mathbf{x}, \mathbf{x}^{\prime}\right) \\
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right) k_{2}\left(\mathbf{x}, \mathbf{x}^{\prime}\right) \\
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=k_{3}\left(\boldsymbol{\phi}(\mathbf{x}), \boldsymbol{\phi}\left(\mathbf{x}^{\prime}\right)\right) \\
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=\mathbf{x}^{\mathrm{T}} \mathbf{A} \mathbf{x}^{\prime} \\
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=k_{a}\left(\mathbf{x}_{a}, \mathbf{x}_{a}^{\prime}\right)+k_{b}\left(\mathbf{x}_{b}, \mathbf{x}_{b}^{\prime}\right) \\
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=k_{a}\left(\mathbf{x}_{a}, \mathbf{x}_{a}^{\prime}\right) k_{b}\left(\mathbf{x}_{b}, \mathbf{x}_{b}^{\prime}\right)
\end{aligned}
$
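As a sanity check, here is a small sketch (with arbitrary random data and kernel choices) that combines two valid kernels via the exponential, product, and sum rules above, and verifies numerically that the resulting Gram matrix is symmetric and positive semidefinite:
```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3))   # 20 random 3-D input points

K1 = X @ X.T                       # linear kernel k1(x, x') = x^T x'
K2 = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1))  # Gaussian kernel

K = np.exp(K1) + K1 * K2           # new kernel k = exp(k1) + k1 * k2 (elementwise on the Gram matrices)

print("symmetric:", np.allclose(K, K.T))
print("min eigenvalue:", np.linalg.eigvalsh(K).min())  # should be >= 0 up to rounding error
```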
## Examples
### Generalized polynomial kernel
With $c>0$, this kernel contains all monomial terms up to order $M$:
$
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\left(c+\mathbf{x}^{T} \mathbf{x}^{\prime}\right)^{M}
$
### Gaussian kernel
Also known as the squared exponential kernel; the corresponding feature space is infinite-dimensional.
$
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\exp \left(-\frac{1}{2 \sigma^{2}}|| \mathbf{x}-\mathbf{x}^{\prime}||^{2}\right)
$
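Its validity follows from the construction rules above: expanding the squared norm, the kernel factorizes as
$
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\exp \left(-\frac{\mathbf{x}^{T} \mathbf{x}}{2 \sigma^{2}}\right) \exp \left(\frac{\mathbf{x}^{T} \mathbf{x}^{\prime}}{\sigma^{2}}\right) \exp \left(-\frac{\left(\mathbf{x}^{\prime}\right)^{T} \mathbf{x}^{\prime}}{2 \sigma^{2}}\right)
$
which is the exponential of the (valid) linear kernel $\mathbf{x}^{T} \mathbf{x}^{\prime} / \sigma^{2}$, multiplied on each side by a function of a single argument, so the rules $k=\exp \left(k_{1}\right)$ and $k=f(\mathbf{x}) k_{1} f\left(\mathbf{x}^{\prime}\right)$ apply.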
### Radial basis functions
RBFs are a general class of _homogeneous_ kernels that depend only on the magnitude of the distance between the inputs (the term is often used synonymously with the Gaussian kernel):
$
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=k\left(\left\|\mathbf{x}-\mathbf{x}^{\prime}\right\|\right)
$
## Pros and Cons
Pros:
- We have no explicit parameters/features anymore; they enter only implicitly through the kernel function $k\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)$
- No need to handpick locations of basis functions
Cons:
- The computational cost to retrieve the dual variables $\boldsymbol{a}$ is $\mathcal{O}\left(N^3\right)$ since $\boldsymbol{K} \in \mathbb{R}^{N \times N}$, compared to $\mathcal{O}\left(M^3\right)$ for computing $\boldsymbol{w}$ in the standard (primal) way (in both cases the cost comes from the matrix inverse)
- During prediction, we need $\mathcal{O}(N \cdot M)$ to compute the output for a new point, but would only need $\mathcal{O}(M)$ with the primal parameters $\boldsymbol{w} \Rightarrow$ slow prediction for large datasets (see the sketch below)
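To make the trade-off concrete, here is a minimal sketch using a hypothetical explicit quadratic feature map and arbitrary toy data; it shows that the primal and dual solutions produce the same predictions while requiring an $M \times M$ versus an $N \times N$ inverse.
```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 2))   # arbitrary toy inputs
t = rng.standard_normal(50)        # arbitrary toy targets
lam = 0.5

def phi(X):
    """Hypothetical explicit quadratic feature map (illustration only, M = 3)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([x1**2, np.sqrt(2) * x1 * x2, x2**2], axis=1)

Phi = phi(X)                       # N x M design matrix
N, M = Phi.shape

# Primal: invert an M x M matrix, then predict in O(M) per point.
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

# Dual: invert an N x N matrix, then predict in O(N * M) per point.
K = Phi @ Phi.T
a = np.linalg.solve(K + lam * np.eye(N), t)

x_new = rng.standard_normal((1, 2))
print(phi(x_new) @ w)              # primal prediction
print(phi(x_new) @ Phi.T @ a)      # dual prediction: agrees up to numerical precision
```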
---
## References
1. Bishop (2006), *Pattern Recognition and Machine Learning*, Sections 6.0–6.2