# Support Vector Machines

Learning algorithms based on [[Kernel Methods]] have a significant limitation: the kernel function $k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)$ must be evaluated for every pair of input points, which can become computationally infeasible for both training and inference. SVMs are kernel-based methods that have _sparse_ solutions, i.e. their predictions for new inputs depend only on the kernel function evaluated at a subset of the training data points. Another important property of SVMs is that the determination of the model parameters is a convex optimization problem, so any local solution is also a global optimum. Support vector machines for binary classification are not probabilistic models.

## Maximum Margin Classifier

Generally, there are multiple solutions that all classify the training data set exactly. In such cases, we should try to find the one that gives the smallest generalization error. Support vector machines approach this problem through the concept of the _margin_, which is the smallest distance between the decision boundary and any of the samples. The goal of the algorithm is then to maximize the margin.

Consider the hyperplane defined by $y(\mathbf{x})=\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x})+b=0$. The perpendicular distance of a point $\mathbf{x}$ from this hyperplane is $\frac{|y(\mathbf{x})|}{\|\mathbf{w}\|}$. Assuming all training points are correctly classified, i.e. $t_{n} y\left(\mathbf{x}_{n}\right)>0$, the distance of a point $\mathbf{x}_n$ to the decision surface is given by

$$
\frac{t_{n} y\left(\mathbf{x}_{n}\right)}{\|\mathbf{w}\|}=\frac{t_{n}\left(\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right)+b\right)}{\|\mathbf{w}\|}
$$

The maximum margin solution is thus found by solving

$$
\underset{\mathbf{w}, b}{\arg \max }\left\{\frac{1}{\|\mathbf{w}\|} \min _{n}\left[t_{n}\left(\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right)+b\right)\right]\right\}
$$

A direct solution of this optimization problem is very complex, so we convert it to an equivalent problem by rescaling $\mathbf{w}$ and $b$ such that $t_{n}\left(\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{n}\right)+b\right)=1$ for the point nearest to the decision surface. The size of the margin is then $\frac{1}{\|\mathbf{w}\|}$, and all data points satisfy $t_{n}\left(\mathbf{w}^{T} \mathbf{x}_{n}+b\right) \geq 1$. Since maximizing $\frac{1}{\|\mathbf{w}\|}$ is equivalent to minimizing $\|\mathbf{w}\|^{2}$, the optimization boils down to

$$
\underset{\mathbf{w}, b}{\arg \min } \frac{1}{2}\|\mathbf{w}\|^{2} \quad \text { subject to the } N \text { constraints } \quad t_{n}\left(\mathbf{w}^{T} \mathbf{x}_{n}+b\right) \geq 1
$$

This is an example of a _quadratic programming_ problem, in which we minimize a quadratic function subject to a set of linear inequality constraints.
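As a concrete illustration, here is a minimal sketch of this primal quadratic program solved with CVXOPT (the solver covered in the tutorial linked in the references). The toy data set and the stacking of $(\mathbf{w}, b)$ into a single optimization variable are assumptions made purely for illustration.

```python
# Hard-margin SVM primal written as the standard QP
#   minimize (1/2) z^T P z + q^T z   subject to   G z <= h,
# with z = (w_1, ..., w_D, b) and the constraint t_n (w^T x_n + b) >= 1
# rewritten as -t_n (w^T x_n + b) <= -1.
import numpy as np
from cvxopt import matrix, solvers

# Toy linearly separable data (assumed for illustration); t_n in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
N, D = X.shape

P = np.zeros((D + 1, D + 1))
P[:D, :D] = np.eye(D)                              # quadratic term acts on w only, not on b
q = np.zeros(D + 1)
G = -t[:, None] * np.hstack([X, np.ones((N, 1))])  # each row is -t_n [x_n, 1]
h = -np.ones(N)

solvers.options["show_progress"] = False
sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
z = np.array(sol["x"]).ravel()
w, b = z[:D], z[D]
print("w =", w, " b =", b, " margin =", 1.0 / np.linalg.norm(w))
```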
We can solve this constrained optimization problem using [[Lagrange Multipliers#Inequality constraints optimization|lagrange multipliers]], giving the Lagrangian function

$$
L(\mathbf{w}, b, \mathbf{a})=\frac{1}{2}\|\mathbf{w}\|^{2}-\sum_{n=1}^{N} a_{n}\left\{t_{n}\left(\mathbf{w}^{T} \mathbf{x}_{n}+b\right)-1\right\}
$$

where $f(\mathbf{w})=\frac{1}{2}\|\mathbf{w}\|^{2}$, $g_{n}(\mathbf{w}, b)=t_{n}\left(\mathbf{w}^{T} \mathbf{x}_{n}+b\right)-1$, and $a_{n}$ for $n=1, \ldots, N$ are the Lagrange multipliers. Note the minus sign in front of the Lagrange multiplier terms: we are minimizing with respect to $\mathbf{w}$ and $b$, and maximizing with respect to $\mathbf{a}$.

Recall the KKT conditions:

$$
\begin{aligned}
\text {(primal feasibility)} \quad & t_{n}\left(\mathbf{w}^{T} \mathbf{x}_{n}+b\right)-1 \geq 0 && \text {for } n=1, \ldots, N \\
\text {(dual feasibility)} \quad & a_{n} \geq 0 && \text {for } n=1, \ldots, N \\
\text {(complementary slackness)} \quad & a_{n}\left(t_{n}\left(\mathbf{w}^{T} \mathbf{x}_{n}+b\right)-1\right)=0 && \text {for } n=1, \ldots, N
\end{aligned}
$$

Setting the derivatives of the Lagrangian with respect to $\mathbf{w}$ and $b$ to zero gives the stationarity conditions

$$
\frac{\partial L}{\partial \mathbf{w}}=\mathbf{w}^{T}-\sum_{n=1}^{N} a_{n} t_{n} \mathbf{x}_{n}^{T}=0 \quad \rightarrow \quad \mathbf{w}=\sum_{n=1}^{N} a_{n} t_{n} \mathbf{x}_{n}
$$

$$
\frac{\partial L}{\partial b}=-\sum_{n=1}^{N} a_{n} t_{n}=0 \quad \rightarrow \quad \sum_{n=1}^{N} a_{n} t_{n}=0
$$

Using these conditions together with the KKT conditions, we can get the dual representation of the Lagrangian by eliminating $\mathbf{w}$ and $b$:

$$
\begin{align}
\tilde{L}(\mathbf{a})&=\mathbf{w}^{T}\left(\frac{1}{2} \mathbf{w}-\sum_{n=1}^{N} a_{n} t_{n} \mathbf{x}_{n}\right)-b \sum_{n=1}^{N} a_{n} t_{n}+\sum_{n=1}^{N} a_{n}\\
&=-\frac{1}{2} \mathbf{w}^{T} \mathbf{w}-b \sum_{n=1}^{N} a_{n} t_{n}+\sum_{n=1}^{N} a_{n} \\
&=\sum_{n=1}^{N} a_{n}-\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_{n} a_{m} t_{n} t_{m} \mathbf{x}_{n}^{T} \mathbf{x}_{m}
\end{align}
$$

which we maximize with respect to $\mathbf{a}$ subject to the constraints $a_{n} \geq 0$ for $n=1, \ldots, N$ and $\sum_{n=1}^{N} a_{n} t_{n}=0$. Applying the [[Kernel Methods#The Kernel Trick|kernel trick]] by replacing $\mathbf{x}_{n}^{T} \mathbf{x}_{m}$ with $k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)$ we get

$$
\tilde{L}(\mathbf{a})=\sum_{n=1}^{N} a_{n}-\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_{n} a_{m} t_{n} t_{m} k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)
$$

The predicted class for a new data point $\mathbf{x}$ is then given by the sign of

$$
y(\mathbf{x})=\sum_{n=1}^{N} a_{n} t_{n} k\left(\mathbf{x}_{n}, \mathbf{x}\right)+b
$$

From the KKT conditions above, we know that for every data point either $a_{n}=0$ or $t_{n} y\left(\mathbf{x}_{n}\right)=1$. Data points for which $a_{n}=0$ play no role in the prediction. The remaining data points lie on the maximum margin hyperplanes and are called _support vectors_.

$$
\begin{array}{ll}
a_{n}>0 \;\rightarrow\; t_{n} y\left(\mathbf{x}_{n}\right)=1 & \text {(support vectors)} \\
a_{n}=0 \;\leftarrow\; t_{n} y\left(\mathbf{x}_{n}\right)>1 & \text {(all other points)}
\end{array}
$$

After solving for $\mathbf{a}$, we can determine the bias term by noting that any support vector $\mathbf{x}_{n}$ satisfies $t_{n} y\left(\mathbf{x}_{n}\right)=1$, which gives

$$
\begin{align}
t_{n}\left(\sum_{m \in \mathcal{S}} a_{m} t_{m} k\left(\mathbf{x}_{m}, \mathbf{x}_{n}\right)+b\right)&=1\\
\sum_{m \in \mathcal{S}} a_{m} t_{m} k\left(\mathbf{x}_{m}, \mathbf{x}_{n}\right)+b&=t_{n} \\
b&=t_{n}-\sum_{m \in \mathcal{S}} a_{m} t_{m} k\left(\mathbf{x}_{m}, \mathbf{x}_{n}\right)
\end{align}
$$

where $\mathcal{S}$ is the set of support vectors and the second line uses $t_{n}^{2}=1$. For numerical stability, we can average over all the support vectors to give

$$
b=\frac{1}{N_{\mathcal{S}}} \sum_{n \in \mathcal{S}}\left(t_{n}-\sum_{m \in \mathcal{S}} a_{m} t_{m} k\left(\mathbf{x}_{m}, \mathbf{x}_{n}\right)\right)
$$
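The dual problem is itself a quadratic program, so it can be handed to the same solver. Below is a minimal sketch, assuming a linear kernel and the same toy data as before, that solves the dual with CVXOPT, identifies the support vectors through $a_{n}>0$ (up to a small tolerance), and computes the bias by averaging over them as in the last equation.

```python
# Hard-margin SVM dual as a QP:
#   minimize (1/2) a^T P a + q^T a  s.t.  -a <= 0  and  t^T a = 0,
# where P_nm = t_n t_m k(x_n, x_m) and q = -1 (maximizing sum(a_n)).
import numpy as np
from cvxopt import matrix, solvers

def linear_kernel(X1, X2):
    """k(x_n, x_m) = x_n^T x_m evaluated for all pairs."""
    return X1 @ X2.T

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
N = len(t)

K = linear_kernel(X, X)
P = matrix(np.outer(t, t) * K)
q = matrix(-np.ones(N))
G = matrix(-np.eye(N))              # -a_n <= 0, i.e. a_n >= 0
h = matrix(np.zeros(N))
A = matrix(t.reshape(1, -1))        # equality constraint: sum_n a_n t_n = 0
b_eq = matrix(0.0)

solvers.options["show_progress"] = False
a = np.array(solvers.qp(P, q, G, h, A, b_eq)["x"]).ravel()

S = a > 1e-6                        # support vectors have a_n > 0 (tolerance for round-off)
bias = np.mean(t[S] - (a[S] * t[S]) @ K[np.ix_(S, S)])   # average over support vectors

def predict(X_new):
    # y(x) = sum_{n in S} a_n t_n k(x_n, x) + b
    return linear_kernel(X_new, X[S]) @ (a[S] * t[S]) + bias

print("multipliers:", np.round(a, 4), " bias:", round(bias, 4))
print("predictions:", predict(np.array([[4.0, 4.0], [-3.0, -2.0]])))
```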
## Soft margin classifiers

In the maximum margin classifier setup above, we assumed that the training data points are separable in feature space, either by a linear decision boundary or by a nonlinear decision boundary obtained through a nonlinear kernel. In the real world, however, class-conditional distributions usually overlap. The maximum margin classifier would severely overfit the decision hyperplane in such situations, leading to a drastic reduction in generalizability. We therefore modify it to allow some training points to be misclassified. This is done by allowing points to lie on the "wrong" side of the margin boundary, at the cost of a penalty that grows with the distance from that boundary.

Soft margin classifiers introduce _slack variables_ $\xi_{n} \geq 0$ for $n=1, \ldots, N$, which allow data points to violate the margin while incurring a penalty. If a data point $n$ lies on the correct side of the margin, $\xi_{n}=0$; otherwise $\xi_{n}=\left|t_{n}-y\left(\mathbf{x}_{n}\right)\right|$. This means that points with $0<\xi_{n} \leq 1$ lie inside the margin but on the correct side of the decision boundary, while points with $\xi_{n}>1$ lie on the wrong side of the decision boundary and are misclassified. We now replace the previous hard constraint $t_{n} y\left(\mathbf{x}_{n}\right) \geq 1$ with the soft constraint $t_{n} y\left(\mathbf{x}_{n}\right) \geq 1-\xi_{n}$, so the objective becomes

$$
\underset{\mathbf{w}, b, \boldsymbol{\xi}}{\arg \min } \frac{1}{2}\|\mathbf{w}\|^{2}+C \sum_{n=1}^{N} \xi_{n}
$$

subject to the constraints

$$
\begin{aligned}
t_{n} y\left(\mathbf{x}_{n}\right) \geq 1-\xi_{n}, & \quad \text {for } n=1, \ldots, N \\
\xi_{n} \geq 0, & \quad \text {for } n=1, \ldots, N
\end{aligned}
$$

The parameter $C>0$ controls the trade-off between the slack variable penalty and the margin, and therefore plays a role analogous to (the inverse of) a regularization coefficient. The Lagrangian now becomes

$$
L(\mathbf{w}, b, \boldsymbol{\xi}, \mathbf{a}, \boldsymbol{\mu})=\frac{1}{2}\|\mathbf{w}\|^{2}+C \sum_{n=1}^{N} \xi_{n}-\sum_{n=1}^{N} a_{n}\left\{t_{n}\left(\mathbf{w}^{T} \mathbf{x}_{n}+b\right)-1+\xi_{n}\right\}-\sum_{n=1}^{N} \mu_{n} \xi_{n}
$$

where $a_{n} \geq 0$ and $\mu_{n} \geq 0$ are Lagrange multipliers. Minimizing $L$ with respect to the primal variables $\mathbf{w}, b, \xi_{n}$ gives

$$
\begin{aligned}
&\frac{\partial L}{\partial \mathbf{w}}=\mathbf{w}^{T}-\sum_{n=1}^{N} a_{n} t_{n} \mathbf{x}_{n}^{T}=0 \quad \rightarrow \quad \mathbf{w}=\sum_{n=1}^{N} a_{n} t_{n} \mathbf{x}_{n}\\
&\frac{\partial L}{\partial b}=-\sum_{n=1}^{N} a_{n} t_{n}=0 \quad \rightarrow \quad \sum_{n=1}^{N} a_{n} t_{n}=0\\
&\frac{\partial L}{\partial \xi_{n}}=C-a_{n}-\mu_{n}=0 \quad \rightarrow \quad a_{n}=C-\mu_{n}
\end{aligned}
$$

The primal Lagrangian has the following KKT conditions:

$$
\begin{array}{rr}
a_{n} \geq 0 & \mu_{n} \geq 0 \\
t_{n} y\left(\mathbf{x}_{n}\right)-1+\xi_{n} \geq 0 & \xi_{n} \geq 0 \\
a_{n}\left\{t_{n} y\left(\mathbf{x}_{n}\right)-1+\xi_{n}\right\}=0 & \mu_{n} \xi_{n}=0
\end{array}
$$

Note: the stationarity conditions obtained above by optimizing the Lagrangian with respect to the primal variables are also considered part of the KKT conditions. We now eliminate $\mathbf{w}, b, \xi_{n}$ from the primal Lagrangian using these conditions to obtain the dual formulation

$$
\tilde{L}(\mathbf{a})=\sum_{n=1}^{N} a_{n}-\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_{n} a_{m} t_{n} t_{m} \mathbf{x}_{n}^{T} \mathbf{x}_{m}
$$

which is identical to the hard margin case, but now subject to the constraints $0 \leq a_{n} \leq C$ and $\sum_{n=1}^{N} a_{n} t_{n}=0$; the former are called _box constraints_. The dual formulation again allows the use of the kernel trick. We solve the dual problem with a numerical optimizer to obtain the values of $a_{n}$.
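In terms of the CVXOPT dual sketch above, only the inequality constraints change: the box constraint $0 \leq a_{n} \leq C$ stacks a second block onto $G$ and $h$, while $P$, $q$, and the equality constraint stay the same. A minimal sketch, with $C$ and $N$ chosen to match the toy example (both are illustrative assumptions):

```python
# Soft-margin dual: same QP as before, but with box constraints 0 <= a_n <= C.
import numpy as np
from cvxopt import matrix

N = 4      # number of training points, matching the toy data above
C = 1.0    # assumed value of the trade-off parameter C

# Stack the two one-sided constraints  -a_n <= 0  and  a_n <= C  into G a <= h.
G = matrix(np.vstack([-np.eye(N), np.eye(N)]))
h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
# Pass these G, h (together with the unchanged P, q, A, b) to solvers.qp.
```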
Then we can classify new points using

$$
y(\mathbf{x})=\sum_{n=1}^{N} a_{n} t_{n} k\left(\mathbf{x}_{n}, \mathbf{x}\right)+b
$$

We are also interested in which points end up giving the sparse solution, i.e. which points are the support vectors. The support vectors are the points for which $a_{n}>0$, in which case $t_{n} y\left(\mathbf{x}_{n}\right)=1-\xi_{n}$. Then we have two scenarios:

1. If $a_{n}<C$, then $\mu_{n}>0$ and hence $\xi_{n}=0$: the point lies on the margin.
2. If $a_{n}=C$, then $\mu_{n}=0$ and $\xi_{n} \geq 0$ can be nonzero:
    1. If $\xi_{n} \leq 1$, the point lies within the margin and is classified correctly.
    2. If $\xi_{n}>1$, the point is misclassified.

![[Screenshot 2020-10-11 at 7.52.47 PM.jpg]]

When $C \rightarrow \infty$, the penalty for misclassification goes to infinity and we recover the hard margin classifier above, with at least 2 support vectors. When $C \rightarrow 0$, we get an infinite margin and every point becomes a support vector. For large $C$ we therefore expect a more complex decision boundary than for small $C$, and the classifier is more sensitive to outliers for large $C$ than for small $C$.

---

## References

1. Bishop 2006, Sections 7.0 and 7.1; [[Lecture 11 - ML1]]
2. Quadratic Programming with Python and CVXOPT: https://courses.csail.mit.edu/6.867/wiki/images/a/a7/Qp-cvxopt.pdf