# Perceptron
The perceptron is an example of a [[Discriminant Functions|linear discriminant function]] and occupies an important place in the history of pattern recognition as one of the earliest and most elegant classification algorithms.
The perceptron uses a generalized linear model with a nonlinear activation function:
$
y(\mathbf{x})=f\left(\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x})\right)
$
The activation function $f$ is a step function taking the values $-1$ and $+1$:
$
f(a)=\left\{\begin{array}{ll}
+1, & a \geqslant 0 \\
-1, & a<0
\end{array}\right.
$
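As a minimal sketch of this decision rule (assuming the feature vector $\boldsymbol{\phi}(\mathbf{x})$ already includes a bias component; the NumPy implementation and the function name are choices made here, not fixed by the model):

```python
import numpy as np

def perceptron_predict(w, phi_x):
    """Evaluate y(x) = f(w^T phi(x)) with the step activation f.

    `w` and `phi_x` are 1-D arrays of the same length; phi_x is assumed
    to already contain a bias component.
    """
    a = w @ phi_x           # activation a = w^T phi(x)
    return 1 if a >= 0 else -1
```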
For the targets we use $t=+1$ for class $\mathcal{C}_{1}$ and $t=-1$ for class $\mathcal{C}_{2}$, rather than the 1-of-$K$ coding scheme used in [[Least squares for classification]].
A natural choice of error function would be the total number of misclassified points. However, this error is a piecewise constant function of $\mathbf{w}$, with discontinuities wherever a change in the parameters causes the decision boundary to move across one of the data points. Methods that adjust $\mathbf{w}$ using the gradient of the error function therefore cannot be applied, because the gradient is zero almost everywhere.
An alternative error function, known as the _perceptron criterion_, is used instead. It exploits the fact that a pattern $\mathbf{x}_{n}$ is classified correctly exactly when $\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_{n}) t_{n}>0$, and penalises each misclassified pattern by the positive quantity $-\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}_{n} t_{n}$:
$
E_{\mathrm{P}}(\mathbf{w})=-\sum_{n \in \mathcal{M}} \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}_{n} t_{n}
$
where $\mathcal{M}$ denotes the set of all misclassified data points.
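As a sketch, the criterion could be evaluated like this (assuming `Phi` is an $N \times D$ design matrix whose rows are the $\boldsymbol{\phi}_{n}$ and `t` holds targets in $\{-1,+1\}$; points with zero activation contribute nothing to the sum, so including them in $\mathcal{M}$ does not change the value):

```python
import numpy as np

def perceptron_criterion(w, Phi, t):
    """Compute E_P(w) = -sum over misclassified n of w^T phi_n t_n."""
    margins = (Phi @ w) * t        # w^T phi_n t_n for every point
    misclassified = margins <= 0   # the set M (zero-margin points add 0)
    return -np.sum(margins[misclassified])
```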
We now apply [[Stochastic Gradient Descent]] to this error function.
$
\mathbf{w}^{(\tau+1)}=\mathbf{w}^{(\tau)}-\eta \nabla E_{\mathrm{P}}(\mathbf{w})=\mathbf{w}^{(\tau)}+\eta \boldsymbol{\phi}_{n} t_{n}
$
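A compact sketch of the resulting learning algorithm (same assumptions on `Phi` and `t` as above; the stopping rule, the zero initialization, and treating zero activation as a mistake are implementation choices here, not part of the criterion itself):

```python
import numpy as np

def train_perceptron(Phi, t, eta=1.0, max_epochs=100):
    """Perceptron learning via stochastic gradient descent on E_P.

    Cycles through the data, applying w <- w + eta * phi_n * t_n whenever
    point n is misclassified, and stops after a full pass with no mistakes
    or after `max_epochs` passes.
    """
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for phi_n, t_n in zip(Phi, t):
            if t_n * (w @ phi_n) <= 0:   # misclassified (zero counted as a mistake)
                w += eta * phi_n * t_n   # perceptron update
                mistakes += 1
        if mistakes == 0:                # every point classified correctly
            break
    return w
```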
The _perceptron convergence theorem_ ensures that if the training data set is linearly separable, then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps.
![[training perceptron.jpg]]
Keep in mind, however, that until convergence is achieved we cannot distinguish between a problem that is not linearly separable and one that is simply slow to converge. The solution found can also depend on the initialization, and for data sets that are not linearly separable the perceptron algorithm will never converge.
## Limitations of perceptron algorithm
1. Works only for two classes and does not generalize readily to multiple classes.
2. There may be many solutions, depending on the initialization of $\mathbf{w}$ and on the order in which the data points are presented in SGD.
3. If the data set is not linearly separable, the perceptron algorithm will never converge.
4. It is based on a linear combination of fixed basis functions.