# Difference between revisions of "stat841f14"

## Principal components analysis (Lecture 1: Sept. 10, 2014)

### Introduction

Principal Component Analysis (PCA), first invented by Karl Pearson in 1901, is a statistical technique for data analysis. Its main purpose is to reduce the dimensionality of the data.

Suppose there is a set of data points in a d-dimensional space. The goal of PCA is to find a linear subspace with lower dimensionality $\, p$ ($p \leq d$), such that maximum variance is retained in this lower-dimensional space. The linear subspace can be specified by p orthogonal vectors, such as $\, u_1 , u_2 , ... , u_p$, which form a new coordinate system. Ideally $\, p \ll d$ (worst case would be to have $\, p = d$). These vectors are called the 'Principal Components'. In other words, PCA aims to reduce the dimensionality of the data, while preserving its information (or minimizing the loss of information). Information comes from variation. In other words, capturing more variation in the lower dimensional subspace means that more information about the original data is preserved.

For example, if all data points have the same value along one dimension (as depicted in the figures below), then that dimension does not carry any information. To preserve the original information in a lower dimensional subspace, the subspace needs to contain components (or dimensions) along which data has most of its variability. In this case, we can ignore the dimension where all data points have the same value. In practical problems, however, finding a linear subspace of lower dimensionality, where no information about the data is lost, is not possible. Thus the loss of information is inevitable. Through PCA, we try to reduce this loss and capture most of the features of data.

The figure below demonstrates an example of PCA. Data is transformed from original 3D space to 2D coordinate system where each coordinate is a principal component.

Now consider the ability of two of the components from the above example (PC1 and PC2) to retain information from the original data. The data in the original space is projected onto each of these two components separately in the figures below. Notice that PC1 captures the variation in the original data better than PC2. If the goal were to reduce the original d = 3 dimensional data to p = 1 dimension, PC1 would be preferable to PC2.

Comparison of two different components used for dimensionality reduction
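The comparison above can be checked numerically. The following is a minimal NumPy sketch (synthetic data, illustrative variable names): we generate 3D points whose variance is concentrated along one rotated direction, recover PC1 and PC2 as eigenvectors of the sample covariance matrix, and confirm that projecting onto PC1 retains more variance than projecting onto PC2.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 3-D data: most variance lies along one direction (illustrative)
n = 500
z = rng.normal(size=(n, 3)) * np.array([5.0, 1.0, 0.2])
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random rotation of the axes
X = z @ Q.T

# Principal components = eigenvectors of the sample covariance matrix
S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)          # eigenvalues in ascending order
pc1, pc2 = eigvecs[:, -1], eigvecs[:, -2]

# Variance of the data projected onto each component
var_pc1 = np.var(X @ pc1)
var_pc2 = np.var(X @ pc2)
print(var_pc1 > var_pc2)  # PC1 retains more variance than PC2
```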

### PCA Plot example

For example<ref> https://onlinecourses.science.psu.edu/stat857/sites/onlinecourses.science.psu.edu.stat857/files/lesson05/PCA_plot.gif </ref>, in the top left corner of the image below, the point $x_i$ is shown in a two-dimensional space and its coordinates are $(x_{i,1}, x_{i,2})$.

All the red points in the plot are represented by their projected values on the two original coordinates (Feature 1, Feature 2).

PCA uses a new coordinate system (shown in green). As the coordinate system changes, each point is described by a new pair of coordinate values, for example $(z_{i,1}, z_{i,2})$. These values are determined by the original vector and by the relation between the rotated coordinates and the original ones.

In the original coordinates the two variables $x_1$ and $x_2$ are correlated. After applying PCA, if the two principal components are used as the new coordinates, the data along these coordinates are guaranteed to have zero correlation.

By rotating the coordinates we performed an orthonormal transform on the data. After the transform, the correlation is removed - which is the whole point of PCA.
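This decorrelation property can be verified directly. A small NumPy sketch (synthetic correlated data, illustrative names): rotating the centered data into the eigenvector basis of the covariance matrix drives the off-diagonal covariance to (numerically) zero.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two correlated features (illustrative construction)
x1 = rng.normal(size=1000)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=1000)
X = np.column_stack([x1, x2])

S = np.cov(X, rowvar=False)
_, eigvecs = np.linalg.eigh(S)
Z = (X - X.mean(axis=0)) @ eigvecs   # rotate into the PCA coordinates

# Off-diagonal covariance of the rotated data is (numerically) zero
off_diag = np.cov(Z, rowvar=False)[0, 1]
print(abs(off_diag) < 1e-10)
```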

Consider an extreme case where all of the points lie exactly on a straight line: in the original coordinate system, two coordinates are still needed to describe these points.

After rotating the coordinates, we need only one coordinate, since the value on the other is always zero. This example shows PCA's dimensionality reduction in an extreme case; real-world points may not fit exactly on a line, but only approximately.
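The extreme case can be sketched in NumPy (synthetic points near a line; the noise level is an assumption): nearly all the variance ends up on the first principal component, so one coordinate suffices.

```python
import numpy as np

rng = np.random.default_rng(6)
# 2-D points lying (almost) on the line y = 2x
t = rng.uniform(-1, 1, size=200)
X = np.column_stack([t, 2 * t]) + 1e-3 * rng.normal(size=(200, 2))

S = np.cov(X, rowvar=False)
eigvals, _ = np.linalg.eigh(S)

# Fraction of total variance captured by the first principal component
explained = eigvals[-1] / eigvals.sum()
print(explained > 0.999)
```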

### PCA applications

As mentioned, PCA is a method to reduce the dimension of the data, where possible, to principal components that cover as much of the data's variation as possible. This technique is useful in many applications involving high-dimensional data, such as data pre-processing, neuroscience, computer graphics, meteorology, oceanography, gene expression analysis, economics, and finance.

Data are usually represented by many variables. In data preprocessing, PCA is a technique for selecting a subset of variables in order to figure out the best model for the data. In neuroscience, PCA is used to identify the specific properties of a stimulus that increase a neuron's probability of generating an action potential.

Figures below show some examples of PCA in real-world applications.

Tracking useful features based on PCA:

In real-world applications we need to know which features remain important after dimensionality reduction, so that those features can serve as the new data set for the rest of a project. To do this, we examine the principal components that have been chosen. The entries of each principal component (eigenvector) are the weights of the original features within that component. Based on these weights we can judge the importance of each feature in the principal component, and then choose the several most important features from the original data instead of using all of them.
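A minimal NumPy sketch of this idea (synthetic data; the feature scales are an assumption chosen so that feature 0 dominates): the loadings of the first principal component reveal which original feature carries the most weight.

```python
import numpy as np

rng = np.random.default_rng(2)
# Illustrative data: feature 0 varies strongly, feature 2 barely at all
X = rng.normal(size=(200, 3)) * np.array([4.0, 1.0, 0.05])

S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
pc1 = eigvecs[:, -1]                      # first principal component

# The loadings (entries of pc1) weight the original features;
# a large |loading| marks an important feature in this component
importance = np.abs(pc1)
top_feature = int(np.argmax(importance))
print(top_feature)  # the high-variance feature, index 0
```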

### Mathematical details

PCA is a transformation from the original (d-dimensional) space to a linear subspace with a new (p-dimensional) coordinate system. Each coordinate of this subspace is called a Principal Component. The first principal component is the coordinate of this system along which the data points have the maximum variation. That is, projecting the data points along this coordinate yields the maximum variance of the data (compared to any other vector in the original space). The second principal component is the coordinate in the direction of the second greatest variance of the data, and so on.

For example, if we are mapping from a 3-dimensional space to 2 dimensions, the first two principal components span the 2-dimensional plane that retains the maximum variance of the data when each 3D data point is mapped to its nearest point on that plane.

Let's denote the basis of the original space by $\mathbf{v_1}$, $\mathbf{v_2}$, ... , $\mathbf{v_d}$. Our goal is to find the principal components (the coordinates of the linear subspace), denoted by $\mathbf{u_1}$, $\mathbf{u_2}$, ... , $\mathbf{u_p}$, where ideally $p \leq d$. First, we would like to obtain the first principal component $\mathbf{u_1}$, the component in the direction of maximum variance. This component can be treated as a vector in the original space, and so can be written as a linear combination of the basis of the original space.

$\mathbf{u_1}=w_1\mathbf{v_1}+w_2\mathbf{v_2}+...+w_d\mathbf{v_d}$

Vector $\mathbf{w}$ contains the weight of each basis in this combination.

$\mathbf{w}=\begin{bmatrix} w_1\\w_2\\w_3\\\vdots\\w_d \end{bmatrix}$

Suppose we have n data points in the original space. We represent each data point by $\mathbf{x_1}$, $\mathbf{x_2}$, ..., $\mathbf{x_n}$. The projection of each point $\mathbf{x_i}$ onto $\mathbf{u_1}$ is $\mathbf{w}^T\mathbf{x_i}$.

Let $\mathbf{S}$ be the sample covariance matrix of data points in original space. The variance of the projected data points, denoted by $\Phi$ is

$\Phi = Var(\mathbf{w}^T \mathbf{x_i}) = \mathbf{w}^T \mathbf{S} \mathbf{w}$

We would like to maximize $\Phi$ over the set of all vectors $\mathbf{w}$ in the original space. But this problem is not yet well-defined: for any choice of $\mathbf{w}$, we can increase $\Phi$ simply by multiplying $\mathbf{w}$ by a scalar greater than one, so the maximum goes to infinity. We therefore add the following constraint to bound the length of the vector $\mathbf{w}$:

$\max_{\mathbf{w}} \Phi = \mathbf{w}^T \mathbf{S} \mathbf{w}$

subject to : $\mathbf{w}^T \mathbf{w} =1$

This constraint restricts $\mathbf{w}$ to unit vectors in d-dimensional Euclidean space, so the maximization is now well-defined.

Using the Lagrange multiplier technique we have:

$L(\mathbf{w} , \lambda ) = \mathbf{w}^T \mathbf{S} \mathbf{w} - \lambda (\mathbf{w}^T \mathbf{w} - 1 )$

By taking derivative of $L$ w.r.t. primary variable $\mathbf{w}$ we have:

$\frac{\partial L}{\partial \mathbf{w}}=( \mathbf{S}^T + \mathbf{S})\mathbf{w} -2\lambda\mathbf{w}= 2\mathbf{S}\mathbf{w} -2\lambda\mathbf{w}= 0$

Note that $\mathbf{S}$ is symmetric so $\mathbf{S}^T = \mathbf{S}$.

From the above equation we have:

$\mathbf{S} \mathbf{w} = \lambda\mathbf{w}$.

So $\mathbf{w}$ is an eigenvector and $\lambda$ an eigenvalue of the matrix $\mathbf{S}$. (Taking the derivative of $L$ w.r.t. $\lambda$ just regenerates the constraint of the optimization problem.)

By multiplying both sides of the above equation by $\mathbf{w}^T$ and applying the constraint, we obtain:

$\Phi = \mathbf{w}^T \mathbf{S} \mathbf{w} = \mathbf{w}^T \lambda \mathbf{w} = \lambda \mathbf{w}^T \mathbf{w} = \lambda$

The interesting result is that the objective function is equal to an eigenvalue of the covariance matrix. So, to obtain the first principal component, which maximizes the objective function, we just need the eigenvector corresponding to the largest eigenvalue of $\mathbf{S}$. Subsequently, the second principal component is the eigenvector corresponding to the second largest eigenvalue, and so on.
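This result can be tested numerically. A NumPy sketch (random synthetic covariance; names are illustrative): the objective $\mathbf{w}^T \mathbf{S} \mathbf{w}$ at the top eigenvector equals the largest eigenvalue, and no random unit vector does better.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
S = np.cov(X, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(S)    # ascending eigenvalues
w_star = eigvecs[:, -1]                 # eigenvector of the largest eigenvalue
phi_star = w_star @ S @ w_star          # objective at the optimum = lambda_max

# Compare against 1000 random unit vectors w (the constraint w^T w = 1)
ws = rng.normal(size=(1000, 4))
ws /= np.linalg.norm(ws, axis=1, keepdims=True)
phis = np.einsum('ij,jk,ik->i', ws, S, ws)   # w^T S w for each row

print(np.isclose(phi_star, eigvals[-1]))     # objective equals the eigenvalue
print(phis.max() <= phi_star + 1e-9)         # no unit vector beats it
```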

### Principal component extraction using singular value decomposition

Singular Value Decomposition (SVD) is a well-known way to decompose any m-by-n matrix $\mathbf{A}$ into three useful matrices.

$\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$.

where $\mathbf{U}$ is an m-by-m unitary matrix, $\mathbf{UU^T} = I_m$, whose columns are eigenvectors of $\mathbf{AA^T}$;

$\mathbf{\Sigma}$ is an m-by-n diagonal matrix whose non-zero entries are the square roots of the eigenvalues of $\mathbf{AA^T}$ (the singular values of $\mathbf{A}$); and

$\mathbf{V}$ is an n-by-n unitary matrix, $\mathbf{VV^T} = I_n$, whose columns are eigenvectors of $\mathbf{A^TA}$.

Now, comparing the concepts of PCA and SVD, one may find out that we can perform PCA using SVD.

Let's construct a d-by-n matrix from our n data points, such that each column represents one data point in d-dimensional space:

$\mathbf{X} = [\mathbf{x_1} \mathbf{x_2} .... \mathbf{x_n} ]$

and make another matrix $\mathbf{X}^*$ simply by subtracting the mean of data points from $\mathbf{X}$.

$\mathbf{X}^* = \mathbf{X} - \mu_X[1, 1, \dots , 1]$, where $\mu_X = \frac{1}{n} \sum_{i=1}^n \mathbf{x_i}$

Then we get a zero-mean version of our data points, for which $\frac{1}{n}\mathbf{X^*} \mathbf{X^{*T}}$ is the sample covariance matrix.

Applying SVD to $\mathbf{X}^* = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$, the columns of $\mathbf{U}$ are eigenvectors of $\mathbf{X^*}\mathbf{X^{*T}}$, and the squared singular values in $\mathbf{\Sigma}$ (divided by n) are the corresponding eigenvalues of the covariance matrix. So $\mathbf{\Sigma}$ and $\mathbf{U}$ give us the eigenvalues and corresponding eigenvectors of the covariance matrix, and we can easily extract the desired principal components.
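The equivalence can be sketched in NumPy (random synthetic data, columns as data points as in the text): the eigendecomposition of the covariance matrix and the SVD of the centered data matrix yield the same eigenvalues and the same first principal component (up to sign).

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 200, 5
X = rng.normal(size=(d, n))             # columns are data points, as in the text

# Center the data, then take its SVD
Xs = X - X.mean(axis=1, keepdims=True)
U, sigma, Vt = np.linalg.svd(Xs, full_matrices=False)

# Eigendecomposition of the covariance (1/n) Xs Xs^T
eigvals, eigvecs = np.linalg.eigh(Xs @ Xs.T / n)   # ascending order

# Largest eigenvalue matches the largest squared singular value / n
same_eigval = np.isclose(eigvals[-1], sigma[0] ** 2 / n)
# First principal component matches the first column of U up to sign
same_pc = np.isclose(abs(eigvecs[:, -1] @ U[:, 0]), 1.0)
print(same_eigval, same_pc)
```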

### MATLAB example

In this example we use different pictures of a man's face with different facial expressions, but the pictures are noisy, so the face is not easily recognizable. The data set consists of 1965 pictures, each 20 by 28 pixels, so the dimensionality is 20*28 = 560.

Our goal is to remove the noise using PCA, by means of the svd function in MATLAB. We know that noise is not the main feature that makes these pictures look different, so the noise lies among the least-variance components. We first extract the principal components and then remove those corresponding to the smallest eigenvalues of the covariance matrix. Here, the singular values are sorted in descending order along the diagonal of the matrix S. For example, the first column of the matrix U, which corresponds to the first diagonal element of S, is the first principal component.

We then reconstruct the picture using the first d principal components. If d is too large, we cannot completely remove the noise; if it is too small, we lose some information from the original data, for example the facial expression (smile, sadness, etc.). We can vary d until we achieve a reasonable balance between noise reduction and information loss.

File:noisy-face.jpg
Noise reduced version of a picture in MATLAB example
>> % load the noisy data; the file "noisy" stores our variable X, which contains the pictures
>> load noisy
>>
>> % show a sample image from column 1 of matrix X
>> imagesc(reshape(X(:,1),20,28)')
>>
>> % set the color map of the image to grayscale
>> colormap gray
>>
>> % perform SVD; if X is full rank, we obtain 560 PCs
>> [U S V] = svd(X);
>>
>> d = 10;
>>
>> % reconstruct X using only the first d principal components, removing the noise
>> XX = U(:, 1:d)* S(1:d,1:d)*V(:,1:d)' ;
>>
>> % show the image in column 1 of XX, which is a noise-reduced version
>> imagesc(reshape(XX(:,1),20,28)')
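The same rank-d reconstruction can be sketched in NumPy. Since the "noisy" face file is not available here, the data is a hypothetical stand-in: a low-rank "signal" plus noise, where the truncated SVD reconstruction recovers the signal better than the raw data.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical stand-in for the face data: a rank-3 signal plus noise
signal = rng.normal(size=(560, 3)) @ rng.normal(size=(3, 100))
X = signal + 0.1 * rng.normal(size=(560, 100))

U, s, Vt = np.linalg.svd(X, full_matrices=False)

d = 3  # keep only the first d principal components, as in the MATLAB code
XX = U[:, :d] * s[:d] @ Vt[:d, :]      # rank-d reconstruction

# The truncated reconstruction is closer to the clean signal than X itself
err_noisy = np.linalg.norm(X - signal)
err_denoised = np.linalg.norm(XX - signal)
print(err_denoised < err_noisy)
```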


## PCA continued, Lagrange multipliers, singular value decomposition (SVD) (Lecture 2: Sept. 15, 2014)

### Principal component analysis (continued)

PCA is a method to reduce dimensions in data or to extract features from data.

Given a data point in vector form $\,x$, the goal of PCA is to map $\,x$ to $\,y$, where $x \in \mathbb{R}^d$ and $y \in \mathbb{R}^p$, such that $p$ is much less than $d$. For example, we could map a two-dimensional data point onto any one-dimensional vector in the original 2D space, or we could map a three-dimensional data point onto any 2D plane in the original 3D space.

The transformation from d-dimensional space to p-dimensional space is chosen in such a way that maximum variance is retained in the lower dimensional subspace. This is useful because variance represents the main differences between the data points, which is exactly what we are trying to capture when we perform data visualization. For example, when taking 64-dimensional images of hand-written digits, projecting them onto 2-dimensional space using PCA captures a significant amount of the structure we care about.

In terms of data visualization, it is fairly clear why PCA is important. High dimensional data is sometimes impossible to visualize. Even a 3D space can be fairly difficult to visualize, while a 2D space, which can be printed on a piece of paper, is much easier to visualize. If a higher dimensional dataset can be reduced to only 2 dimensions, it can be easily represented by plots and graphs.

In the case of many data points $\,x_i$ in d-dimensional space:

$\,X_{d \times n} = [x_1\; x_2\; \dots\; x_n]$

There is an infinite number of vectors in $\mathbb{R}^p$ to which the points in X can be mapped. When mapping these data points to lower dimensions, information will be lost. To preserve as much information as possible, the points are mapped to a vector in $\mathbb{R}^p$ w