# Principal Component Analysis (PCA)
Principal component analysis, or PCA, is a dimensionality reduction algorithm.
There are two equivalent deterministic definitions/derivations of PCA:
## Maximal variance formulation
Given the data $\mathbf{X}=\left\{\mathbf{x}_{1}, \ldots, \mathbf{x}_{N}\right\}$, $\mathbf{x}_{n} \in \mathbb{R}^{D}$, the goal of PCA is to project the data onto an $M<D$ dimensional space while **maximizing the variance** of the projected data.
Let's consider the projection onto a one-dimensional space ($M=1$). We can define the direction of this space using a $D$-dimensional vector $\mathbf{u}_1$, which for convenience we choose to be a unit vector so that $\mathbf{u}_{1}^{T} \mathbf{u}_{1}=1$ (we are only interested in the direction, not the magnitude). Each data point $\mathbf{x}_n$ is projected onto the scalar value $\mathbf{u}_1^T\mathbf{x}_n$. The mean of the projected data is $\mathbf{u}_{1}^{T} \overline{\mathbf{x}}$ where $\overline{\mathbf{x}}$ is the sample mean given by,
$
\overline{\mathbf{x}}=\frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_{n}
$
and the variance of the projected data is given by,
$
\begin{aligned}
\frac{1}{N} \sum_{n=1}^{N}\left(\mathbf{u}_{1}^{T} \mathbf{x}_{n}-\mathbf{u}_{1}^{T} \overline{\mathbf{x}}\right)^{2} &=\frac{1}{N} \sum_{n=1}^{N}\left(\mathbf{u}_{1}^{T}\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)\right)^{2} \\
&=\frac{1}{N} \sum_{n=1}^{N} \mathbf{u}_{1}^{T}\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{1} \\
&=\mathbf{u}_{1}^{T}\left(\frac{1}{N} \sum_{n=1}^{N}\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T}\right) \mathbf{u}_{1}\\
&={\mathbf{u}_{1}^{T} \mathbf{S u}_{1}}
\end{aligned}
$
where $\mathbf{S}$ is the data covariance matrix defined by
$
\mathbf{S}=\frac{1}{N} \sum_{n=1}^{N}\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{\mathrm{T}}
$
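As a quick numerical sanity check of this identity, here is a minimal sketch, assuming numpy and arbitrary synthetic data, comparing the directly computed variance of the projections with $\mathbf{u}_{1}^{T} \mathbf{S} \mathbf{u}_{1}$:
```python
import numpy as np

# A minimal sketch (synthetic data, illustrative only) checking that the
# variance of the 1-D projections u^T x_n equals u^T S u.
rng = np.random.default_rng(0)
N, D = 500, 3
X = rng.normal(size=(N, D))            # rows are the data points x_n

u1 = rng.normal(size=D)
u1 /= np.linalg.norm(u1)               # unit-norm direction u_1

x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / N    # covariance matrix S (1/N convention)

var_direct = np.mean((X @ u1 - x_bar @ u1) ** 2)   # variance of the projections
var_quadratic = u1 @ S @ u1                        # u_1^T S u_1
assert np.isclose(var_direct, var_quadratic)
```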
We now maximize the projected variance $\mathbf{u}_{1}^{T} \mathbf{S} \mathbf{u}_{1}$ with respect to $\mathbf{u}_1$. This has to be a constrained maximization to prevent $\left\|\mathbf{u}_{1}\right\| \rightarrow \infty$. The constraint comes from the normalization condition $\mathbf{u}_{1}^{T} \mathbf{u}_{1}=1$. From the method of [[Lagrange Multipliers]],
$
L\left(\mathbf{u}_{1}, \lambda_{1}\right)=\mathbf{u}_{1}^{T} \mathbf{S} \mathbf{u}_{1}+\lambda_{1}\left(1-\mathbf{u}_{1}^{T} \mathbf{u}_{1}\right)
$
Then,
$
\frac{\partial}{\partial \mathbf{u}_{1}} L\left(\mathbf{u}_{1}, \lambda_{1}\right)=2\mathbf{S} \mathbf{u}_{1}-2\lambda_{1} \mathbf{u}_{1}=0
$
Thus the projected variance has a stationary point when
$
\mathbf{S u}_{1}=\lambda_{1} \mathbf{u}_{1}
$
which means that $\mathbf{u}_{1}$ must be an [[Linear Algebra#Eigen Decomposition|eigenvector]] of $\mathbf{S}$, and the variance is maximized when $\mathbf{u}_1$ is the eigenvector with the largest eigenvalue $\lambda_1$. This eigenvector is known as the first principal component.
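Left-multiplying by $\mathbf{u}_{1}^{T}$ and using $\mathbf{u}_{1}^{T} \mathbf{u}_{1}=1$ shows that the projected variance at this stationary point is the eigenvalue itself,
$
\mathbf{u}_{1}^{T} \mathbf{S} \mathbf{u}_{1}=\lambda_{1} \mathbf{u}_{1}^{T} \mathbf{u}_{1}=\lambda_{1}
$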
More generally, projecting onto the $M$ leading eigenvectors of $\mathbf{S}$ gives projected data $\mathbf{z}_n$ whose total variance is $\operatorname{Tr}[\operatorname{Cov}[\mathbf{z}]]=\sum_{j=1}^{M} \lambda_{j}$.
Thus principal component analysis involves evaluating the mean and the covariance matrix of the data set and then finding the $M$ eigenvectors of $\mathbf{S}$ with the $M$ largest eigenvalues.
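A minimal sketch of this recipe, assuming numpy; the function name and the $1/N$ covariance convention are illustrative choices, not from the source:
```python
import numpy as np

def pca(X, M):
    """Sketch of the PCA recipe above: mean, covariance, top-M eigenvectors.

    X has shape (N, D); returns (U_M, Z) with U_M of shape (D, M) holding the
    M leading eigenvectors of S as columns and Z = (X - mean) @ U_M of shape
    (N, M) holding the projected data.
    """
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    S = Xc.T @ Xc / X.shape[0]             # covariance matrix (1/N convention)
    eigvals, eigvecs = np.linalg.eigh(S)   # S is symmetric; eigenvalues ascending
    order = np.argsort(eigvals)[::-1]      # sort eigenvalues descending
    U_M = eigvecs[:, order[:M]]            # M eigenvectors with largest eigenvalues
    Z = Xc @ U_M                           # projected data z_n
    return U_M, Z
```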
## Minimum error formulation
PCA can also be derived by minimizing the reconstruction error. To do this, we represent the points in a complete orthonormal basis $\left\{\mathbf{u}_{i}\right\}_{i=1}^{D}$ where,
$
\mathbf{u}_{i}^{T} \mathbf{u}_{j}=\left\{\begin{array}{ll}
1 & \text { if } i=j \\
0 & \text { otherwise }
\end{array}\right.
$
Because this basis is complete, each data point can be represented exactly by a linear combination of the [[Linear Algebra#Basis Vectors|basis vectors]].
$
\mathbf{x}_{n}=\sum_{i=1}^{D} \alpha_{n i} \mathbf{u}_{i}
$
This means each original data point $\mathbf{x}_{n} = \left(x_{n 1}, \ldots, x_{n D}\right)^{T}$ is replaced by the coefficients $\left\{\alpha_{n 1}, \ldots, \alpha_{n D}\right\}$. Using the orthonormality of the basis (taking the inner product of the expansion with $\mathbf{u}_i$), we have $\alpha_{n i}=\mathbf{x}_{n}^{T} \mathbf{u}_{i}$, and so,
$
\mathbf{x}_{n}=\sum_{i=1}^{D}\left(\mathbf{x}_{n}^{\mathrm{T}} \mathbf{u}_{i}\right) \mathbf{u}_{i}
$
Our goal here is to approximate each data point using a representation restricted to $M < D$ dimensions. So,
$
\tilde{\mathbf{x}}_{n}=\sum_{i=1}^{M}\left(\mathbf{x}_{n}^{T} \mathbf{u}_{i}\right) \mathbf{u}_{i}+\sum_{i=M+1}^{D} b_{i} \mathbf{u}_{i}
$
We are free to choose the $\mathbf{u}_i$, the coefficients $z_{n i} = \left(\mathbf{x}_{n}^{T} \mathbf{u}_{i}\right)$, and the constants $b_i$ (which are shared by all data points) so as to minimize the reconstruction error
$
\frac{1}{N} \sum_{n=1}^{N}\left\|\mathbf{x}_{n}-\tilde{\mathbf{x}}_{n}\right\|^{2}=\frac{1}{N} \sum_{n=1}^{N}\left\|\sum_{i=M+1}^{D}\left(\mathbf{x}_{n}^{T} \mathbf{u}_{i}-b_{i}\right) \mathbf{u}_{i}\right\|^{2}
$
Minimizing with respect to $b_i$ gives $b_{i}=\overline{\mathbf{x}}^{T} \mathbf{u}_{i}$.
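Explicitly, because the $\mathbf{u}_i$ are orthonormal the squared norm separates over the discarded directions, and setting the derivative with respect to $b_i$ to zero gives,
$
\frac{\partial}{\partial b_{i}}\left[\frac{1}{N} \sum_{n=1}^{N} \sum_{j=M+1}^{D}\left(\mathbf{x}_{n}^{T} \mathbf{u}_{j}-b_{j}\right)^{2}\right]=-\frac{2}{N} \sum_{n=1}^{N}\left(\mathbf{x}_{n}^{T} \mathbf{u}_{i}-b_{i}\right)=0 \quad \Rightarrow \quad b_{i}=\frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_{n}^{T} \mathbf{u}_{i}=\overline{\mathbf{x}}^{T} \mathbf{u}_{i}
$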
Substituting $b_i$ back, the error to be minimized with respect to the $\mathbf{u}_i$ becomes,
$
\begin{align}
&\frac{1}{N} \sum_{n=1}^{N}\left\|\sum_{i=M+1}^{D}\left(\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{i}\right) \mathbf{u}_{i}\right\|^{2}\\
&= \frac{1}{N} \sum_{n=1}^{N}\left(\sum_{i=M+1}^{D}\left(\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{i}\right) \mathbf{u}_{i}\right)^{T}\left(\sum_{j=M+1}^{D}\left(\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{j}\right) \mathbf{u}_{j}\right) \\
&= \frac{1}{N} \sum_{n=1}^{N} \sum_{i=M+1}^{D} \sum_{j=M+1}^{D}\left(\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{i}\right) \mathbf{u}_{i}^{T} \mathbf{u}_{j}\left(\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{j}\right) \\
&= \frac{1}{N} \sum_{n=1}^{N} \sum_{i=M+1}^{D} \mathbf{u}_{i}^{T}\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{i} \\
&= \sum_{i=M+1}^{D} \mathbf{u}_{i}^{T} \mathbf{S} \mathbf{u}_{i}
\end{align}
$
Minimizing with respect to $\mathbf{u}_{i}$ under the constraint $\mathbf{u}_{i}^{T} \mathbf{u}_{i}=1$ (otherwise we would get $\mathbf{u}_i=0$) with the method of [[Lagrange Multipliers]], we obtain the general solution
$
\mathbf{S} \mathbf{u}_{i}=\lambda_{i} \mathbf{u}_{i}
$
Thus the reconstruction error is minimized by choosing the discarded directions $\mathbf{u}_{M+1}, \ldots, \mathbf{u}_{D}$ to be the eigenvectors of the covariance matrix $\mathbf{S}$ with the $D-M$ smallest eigenvalues, so that the retained subspace is spanned by the $M$ eigenvectors with the largest eigenvalues.
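A minimal numerical check of this result, assuming numpy and arbitrary synthetic data: reconstructing from the top $M$ eigenvectors, with the $b_i$ chosen as above, leaves a mean squared error equal to the sum of the $D-M$ smallest eigenvalues of $\mathbf{S}$.
```python
import numpy as np

# Minimal numerical check (synthetic data, 1/N convention): keeping the top-M
# eigenvectors leaves a reconstruction error equal to the sum of the D-M
# smallest eigenvalues of S.
rng = np.random.default_rng(1)
N, D, M = 1000, 5, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))  # correlated synthetic data

x_bar = X.mean(axis=0)
Xc = X - x_bar
S = Xc.T @ Xc / N
eigvals, eigvecs = np.linalg.eigh(S)                   # eigenvalues in ascending order
U_M = eigvecs[:, -M:]                                  # M eigenvectors with largest eigenvalues

X_tilde = x_bar + (Xc @ U_M) @ U_M.T                   # reconstruction with b_i = x_bar^T u_i
recon_err = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
assert np.isclose(recon_err, eigvals[:D - M].sum())    # sum of the D-M smallest eigenvalues
```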
## Linear Projection for dimensionality reduction
The final $k$-dimensional linear projection $\mathbf{y}_n$ is obtained as
$
\mathbf{y}_n = \boldsymbol{L}\mathbf{x}_n
$
where $\boldsymbol{L}=\boldsymbol{\Lambda}^{\frac{1}{2}}_k \boldsymbol{U}_k^T$, with $\boldsymbol{U}_k$ holding the top $k$ eigenvectors of $\mathbf{S}$ as columns and $\boldsymbol{\Lambda}_k$ the diagonal matrix of the corresponding eigenvalues.
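A minimal sketch of this projection, assuming numpy; $\boldsymbol{U}_k$ and $\boldsymbol{\Lambda}_k$ come from the eigendecomposition of $\mathbf{S}$, and the $\boldsymbol{\Lambda}_k^{1/2}$ scaling follows the formula above (a plain PCA projection would use $\boldsymbol{U}_k^T$ alone):
```python
import numpy as np

def linear_projection(X, k):
    """Sketch of y_n = L x_n with L = Lambda_k^(1/2) U_k^T as defined above.

    U_k holds the top-k eigenvectors of S as columns and Lambda_k the
    corresponding eigenvalues on its diagonal. The data are projected as
    written above (without centring); a plain PCA projection would use
    U_k^T alone.
    """
    x_bar = X.mean(axis=0)
    S = (X - x_bar).T @ (X - x_bar) / X.shape[0]   # covariance matrix S
    eigvals, eigvecs = np.linalg.eigh(S)           # ascending eigenvalues
    U_k = eigvecs[:, ::-1][:, :k]                  # top-k eigenvectors as columns
    Lam_k = np.diag(eigvals[::-1][:k])             # corresponding eigenvalues
    L = np.sqrt(Lam_k) @ U_k.T                     # L = Lambda_k^(1/2) U_k^T
    Y = X @ L.T                                    # rows are y_n = L x_n
    return L, Y
```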
---
## References
1. Bishop 2006, Section 12.1.1; [[Lecture 10 - ML1]]
2. http://www.cs.cmu.edu/~tom/10601_fall2012/slides/pca.pdf
3. Tutorial on PCA, University of Otago: http://www.cs.otago.ac.nz/research/publications/OUCS-2002-12.pdf