# Principal Component Analysis (PCA)
Principal component analysis, or PCA, is a dimensionality reduction algorithm.
There are two equivalent deterministic definitions/derivations of PCA:
## Maximal variance formulation
Given the data $\mathbf{X}=\left\{\mathbf{x}_{1}, \ldots, \mathbf{x}_{N}\right\}$, $\mathbf{x}_{n} \in \mathbb{R}^{D}$, the goal of PCA is to project the data onto an $M<D$ dimensional space while **maximizing the variance** of the projected data.
Let's consider the projection onto a one-dimensional space ($M=1$). We can define the direction of this space using a $D$-dimensional vector $\mathbf{u}_1$, which for convenience we choose to be a unit vector so that $\mathbf{u}_{1}^{T} \mathbf{u}_{1}=1$ (we are only interested in the direction, not the magnitude). Each data point $\mathbf{x}_n$ is projected onto the scalar value $\mathbf{u}_1^T\mathbf{x}_n$. The mean of the projected data is $\mathbf{u}_{1}^{T} \overline{\mathbf{x}}$ where $\overline{\mathbf{x}}$ is the sample mean given by,
$
\overline{\mathbf{x}}=\frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_{n}
$
and the variance of the projected data is given by,
$
\begin{aligned}
\frac{1}{N} \sum_{n=1}^{N}\left(\mathbf{u}_{1}^{T} \mathbf{x}_{n}-\mathbf{u}_{1}^{T} \overline{\mathbf{x}}\right)^{2} &=\frac{1}{N} \sum_{n=1}^{N}\left(\mathbf{u}_{1}^{T}\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)\right)^{2} \\
&=\frac{1}{N} \sum_{n=1}^{N} \mathbf{u}_{1}^{T}\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{1} \\
&=\mathbf{u}_{1}^{T}\left(\frac{1}{N} \sum_{n=1}^{N}\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T}\right) \mathbf{u}_{1}\\
&={\mathbf{u}_{1}^{T} \mathbf{S u}_{1}}
\end{aligned}
$
where $\mathbf{S}$ is the data covariance matrix defined by
$
\mathbf{S}=\frac{1}{N} \sum_{n=1}^{N}\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{\mathrm{T}}
$
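As a quick numerical sanity check of this identity, here is a minimal sketch, assuming numpy and arbitrary synthetic data, comparing the directly computed variance of the projections with $\mathbf{u}_{1}^{T} \mathbf{S} \mathbf{u}_{1}$:
```python
import numpy as np

# A minimal sketch (synthetic data, illustrative only) checking that the
# variance of the 1-D projections u^T x_n equals u^T S u.
rng = np.random.default_rng(0)
N, D = 500, 3
X = rng.normal(size=(N, D))            # rows are the data points x_n

u1 = rng.normal(size=D)
u1 /= np.linalg.norm(u1)               # unit-norm direction u_1

x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / N    # covariance matrix S (1/N convention)

var_direct = np.mean((X @ u1 - x_bar @ u1) ** 2)   # variance of the projections
var_quadratic = u1 @ S @ u1                        # u_1^T S u_1
assert np.isclose(var_direct, var_quadratic)
```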
We now maximize the projected variance $\mathbf{u}_{1}^{T} \mathbf{S} \mathbf{u}_{1}$ with respect to $\mathbf{u}_1$. This has to be a constrained maximization to prevent $\left\|\mathbf{u}_{1}\right\| \rightarrow \infty$. The constraint comes from the normalization condition $\mathbf{u}_{1}^{T} \mathbf{u}_{1}=1$. From the method of [[Lagrange Multipliers]],
$
L\left(\mathbf{u}_{1}, \lambda_{1}\right)=\mathbf{u}_{1}^{T} \mathbf{S} \mathbf{u}_{1}+\lambda_{1}\left(1-\mathbf{u}_{1}^{T} \mathbf{u}_{1}\right)
$
Then,
$
\frac{\partial}{\partial \mathbf{u}_{1}} L\left(\mathbf{u}_{1}, \lambda_{1}\right)=2\mathbf{S} \mathbf{u}_{1}-2\lambda_{1} \mathbf{u}_{1}=0
$
Thus the projected variance has a stationary point when
$
\mathbf{S u}_{1}=\lambda_{1} \mathbf{u}_{1}
$
which means that $\mathbf{u}_{1}$ must be an [[Linear Algebra#Eigen Decomposition|eigenvector]] of $\mathbf{S}$, and the variance is maximized when $\mathbf{u}_1$ is the eigenvector with the largest eigenvalue $\lambda_1$. This eigenvector is known as the first principal component.
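Left-multiplying by $\mathbf{u}_{1}^{T}$ and using $\mathbf{u}_{1}^{T} \mathbf{u}_{1}=1$ shows that the projected variance at this stationary point is the eigenvalue itself,
$
\mathbf{u}_{1}^{T} \mathbf{S} \mathbf{u}_{1}=\lambda_{1} \mathbf{u}_{1}^{T} \mathbf{u}_{1}=\lambda_{1}
$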
More generally, projecting onto the $M$ leading eigenvectors of $\mathbf{S}$ gives projected data $\mathbf{z}_n$ whose total variance is $\operatorname{Tr}[\operatorname{Cov}[\mathbf{z}]]=\sum_{j=1}^{M} \lambda_{j}$.
Thus principal component analysis involves evaluating the mean and the covariance matrix of the data set and then finding the $M$ eigenvectors of $\mathbf{S}$ with the $M$ largest eigenvalues.
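A minimal sketch of this recipe, assuming numpy; the function name and the $1/N$ covariance convention are illustrative choices, not from the source:
```python
import numpy as np

def pca(X, M):
    """Sketch of the PCA recipe above: mean, covariance, top-M eigenvectors.

    X has shape (N, D); returns (U_M, Z) with U_M of shape (D, M) holding the
    M leading eigenvectors of S as columns and Z = (X - mean) @ U_M of shape
    (N, M) holding the projected data.
    """
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    S = Xc.T @ Xc / X.shape[0]             # covariance matrix (1/N convention)
    eigvals, eigvecs = np.linalg.eigh(S)   # S is symmetric; eigenvalues ascending
    order = np.argsort(eigvals)[::-1]      # sort eigenvalues descending
    U_M = eigvecs[:, order[:M]]            # M eigenvectors with largest eigenvalues
    Z = Xc @ U_M                           # projected data z_n
    return U_M, Z
```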
## Minimum error formulation
PCA can also be derived by minimizing the reconstruction error. To do this, we represent the points in a complete orthonormal basis $\left\{\mathbf{u}_{i}\right\}_{i=1}^{D}$ where,
$
\mathbf{u}_{i}^{T} \mathbf{u}_{j}=\left\{\begin{array}{ll}
1 & \text { if } i=j \\
0 & \text { otherwise }
\end{array}\right.
$
Because this basis is complete, each data point can be represented exactly by a linear combination of the [[Linear Algebra#Basis Vectors|basis vectors]].
$
\mathbf{x}_{n}=\sum_{i=1}^{D} \alpha_{n i} \mathbf{u}_{i}
$
This means each original data point $\mathbf{x}_{n} = \left(x_{n 1}, \ldots, x_{n D}\right)^{T}$ is replaced by the coefficients $\left\{\alpha_{n 1}, \ldots, \alpha_{n D}\right\}$. Using the orthonormality of the basis (taking the inner product of the expansion with $\mathbf{u}_i$), we have $\alpha_{n i}=\mathbf{x}_{n}^{T} \mathbf{u}_{i}$, and so,
$
\mathbf{x}_{n}=\sum_{i=1}^{D}\left(\mathbf{x}_{n}^{\mathrm{T}} \mathbf{u}_{i}\right) \mathbf{u}_{i}
$
Our goal here is to approximate each data point using a representation restricted to $M < D$ dimensions. So,
$
\tilde{\mathbf{x}}_{n}=\sum_{i=1}^{M}\left(\mathbf{x}_{n}^{T} \mathbf{u}_{i}\right) \mathbf{u}_{i}+\sum_{i=M+1}^{D} b_{i} \mathbf{u}_{i}
$
We are free to choose the $\mathbf{u}_i$, the coefficients $z_{n i} = \left(\mathbf{x}_{n}^{T} \mathbf{u}_{i}\right)$, and the constants $b_i$ (which are shared by all data points) so as to minimize the reconstruction error
$
\frac{1}{N} \sum_{n=1}^{N}\left\|\mathbf{x}_{n}-\tilde{\mathbf{x}}_{n}\right\|^{2}=\frac{1}{N} \sum_{n=1}^{N}\left\|\sum_{i=M+1}^{D}\left(\mathbf{x}_{n}^{T} \mathbf{u}_{i}-b_{i}\right) \mathbf{u}_{i}\right\|^{2}
$
Minimizing with respect to $b_i$ gives $b_{i}=\overline{\mathbf{x}}^{T} \mathbf{u}_{i}$.
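Explicitly, because the $\mathbf{u}_i$ are orthonormal the squared norm separates over the discarded directions, and setting the derivative with respect to $b_i$ to zero gives,
$
\frac{\partial}{\partial b_{i}}\left[\frac{1}{N} \sum_{n=1}^{N} \sum_{j=M+1}^{D}\left(\mathbf{x}_{n}^{T} \mathbf{u}_{j}-b_{j}\right)^{2}\right]=-\frac{2}{N} \sum_{n=1}^{N}\left(\mathbf{x}_{n}^{T} \mathbf{u}_{i}-b_{i}\right)=0 \quad \Rightarrow \quad b_{i}=\frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_{n}^{T} \mathbf{u}_{i}=\overline{\mathbf{x}}^{T} \mathbf{u}_{i}
$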
Substituting $b_i$ back, the error to be minimized with respect to the $\mathbf{u}_i$ becomes,
$
\begin{align}
&\frac{1}{N} \sum_{n=1}^{N}\left\|\sum_{i=M+1}^{D}\left(\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{i}\right) \mathbf{u}_{i}\right\|^{2}\\
&= \frac{1}{N} \sum_{n=1}^{N}\left(\sum_{i=M+1}^{D}\left(\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{i}\right) \mathbf{u}_{i}\right)^{T}\left(\sum_{j=M+1}^{D}\left(\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{j}\right) \mathbf{u}_{j}\right) \\
&= \frac{1}{N} \sum_{n=1}^{N} \sum_{i=M+1}^{D} \sum_{j=M+1}^{D}\left(\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{i}\right) \mathbf{u}_{i}^{T} \mathbf{u}_{j}\left(\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{j}\right) \\
&= \frac{1}{N} \sum_{n=1}^{N} \sum_{i=M+1}^{D} \mathbf{u}_{i}^{T}\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{i} \\
&= \sum_{i=M+1}^{D} \mathbf{u}_{i}^{T} \mathbf{S} \mathbf{u}_{i}
\end{align}
$
Minimizing with respect to $\mathbf{u}_{i}$ under the constraint $\mathbf{u}_{i}^{T} \mathbf{u}_{i}=1$ (otherwise we would get $\mathbf{u}_i=0$) with the method of [[Lagrange Multipliers]], we obtain the general solution
$
\mathbf{S} \mathbf{u}_{i}=\lambda_{i} \mathbf{u}_{i}
$
Thus the reconstruction error is minimized by choosing the discarded directions $\mathbf{u}_{M+1}, \ldots, \mathbf{u}_{D}$ to be the eigenvectors of the covariance matrix $\mathbf{S}$ with the $D-M$ smallest eigenvalues, so that the retained subspace is spanned by the $M$ eigenvectors with the largest eigenvalues.
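A minimal numerical check of this result, assuming numpy and arbitrary synthetic data: reconstructing from the top $M$ eigenvectors, with the $b_i$ chosen as above, leaves a mean squared error equal to the sum of the $D-M$ smallest eigenvalues of $\mathbf{S}$.
```python
import numpy as np

# Minimal numerical check (synthetic data, 1/N convention): keeping the top-M
# eigenvectors leaves a reconstruction error equal to the sum of the D-M
# smallest eigenvalues of S.
rng = np.random.default_rng(1)
N, D, M = 1000, 5, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))  # correlated synthetic data

x_bar = X.mean(axis=0)
Xc = X - x_bar
S = Xc.T @ Xc / N
eigvals, eigvecs = np.linalg.eigh(S)                   # eigenvalues in ascending order
U_M = eigvecs[:, -M:]                                  # M eigenvectors with largest eigenvalues

X_tilde = x_bar + (Xc @ U_M) @ U_M.T                   # reconstruction with b_i = x_bar^T u_i
recon_err = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
assert np.isclose(recon_err, eigvals[:D - M].sum())    # sum of the D-M smallest eigenvalues
```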
## Linear Projection for dimensionality reduction
The final $k$-dimensional linear projection $\mathbf{y}_n$ is obtained as
$
\mathbf{y}_n = \boldsymbol{L}\mathbf{x}_n
$
where $\boldsymbol{L}=\boldsymbol{\Lambda}^{\frac{1}{2}}_k \boldsymbol{U}_k^T$, with $\boldsymbol{U}_k$ holding the top $k$ eigenvectors of $\mathbf{S}$ as columns and $\boldsymbol{\Lambda}_k$ the diagonal matrix of the corresponding eigenvalues.
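A minimal sketch of this projection, assuming numpy; $\boldsymbol{U}_k$ and $\boldsymbol{\Lambda}_k$ come from the eigendecomposition of $\mathbf{S}$, and the $\boldsymbol{\Lambda}_k^{1/2}$ scaling follows the formula above (a plain PCA projection would use $\boldsymbol{U}_k^T$ alone):
```python
import numpy as np

def linear_projection(X, k):
    """Sketch of y_n = L x_n with L = Lambda_k^(1/2) U_k^T as defined above.

    U_k holds the top-k eigenvectors of S as columns and Lambda_k the
    corresponding eigenvalues on its diagonal. The data are projected as
    written above (without centring); a plain PCA projection would use
    U_k^T alone.
    """
    x_bar = X.mean(axis=0)
    S = (X - x_bar).T @ (X - x_bar) / X.shape[0]   # covariance matrix S
    eigvals, eigvecs = np.linalg.eigh(S)           # ascending eigenvalues
    U_k = eigvecs[:, ::-1][:, :k]                  # top-k eigenvectors as columns
    Lam_k = np.diag(eigvals[::-1][:k])             # corresponding eigenvalues
    L = np.sqrt(Lam_k) @ U_k.T                     # L = Lambda_k^(1/2) U_k^T
    Y = X @ L.T                                    # rows are y_n = L x_n
    return L, Y
```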
---
## References
1. Bishop 2006, Section 12.1.1; [[Lecture 10 - ML1]]
2. http://www.cs.cmu.edu/~tom/10601_fall2012/slides/pca.pdf
3. Tutorial on PCA, University of Otago: http://www.cs.otago.ac.nz/research/publications/OUCS-2002-12.pdf