# Difference between revisions of "stat841f14"

## Principal components analysis (Lecture 1: Sept. 10, 2014)

### Introduction

Principal Component Analysis (PCA), first invented by Karl Pearson in 1901, is a statistical technique for data analysis. Its main purpose is to reduce the dimensionality of the data.

Suppose there is a set of data points in a d-dimensional space. The goal of PCA is to find a linear subspace with lower dimensionality p (p $\leq$ d), such that maximum variance is retained in this lower-dimensional space. The linear subspace can be specified by p orthogonal vectors, such as $u_1 , u_2 , ... , u_p$, which form a new coordinate system. Ideally p << d (worst case would be to have p == d). These vectors are called the 'Principle Components'. In other words, PCA aims to reduce the dimensionality of the data, while preserving its information (or minimizing the loss of information). Information comes from variation. In other words, the more variation is captured in the lower dimensional subspace, the more information about the original data is preserved.

For example, consider if all data points have the same value along one dimension (as depicted in the figures below), then that dimension does not carry any information. To preserve the original information in a lower dimensional subspace, the subspace needs to contain components (or dimensions) along which data has most of its variability. In this case, we can ignore the dimension where all data points have the same value. In practical problems, however, finding a linear subspace of lower dimensionality, where no information about the data is lost, is not possible. Thus the loss of information is inevitable. Through PCA, we try to reduce this loss and capture most of the features of data.

The figure below demonstrates an example of PCA. Data is transformed from original 3D space to 2D coordinate system where each coordinate is a principal component.

Now, consider the ability of the two of the above example components (PC1 and PC2) to retain information from the original data. The data in the original space is projected onto each of these two components separately in the figures below. Notice that PC1 is better able to capture the variation in the original data than PC2. If the goal was to reduce the original d=3 dimensional data to p=1 dimension, PC1 would be preferable to PC2.

Comparison of two different components used for dimensionality reduction

### PCA Plot example

For example<ref> https://onlinecourses.science.psu.edu/stat857/sites/onlinecourses.science.psu.edu.stat857/files/lesson05/PCA_plot.gif </ref>, in the top left corner of the image below, the point $x_1$ is shown in a two-dimensional space and it's coordinates are $(x_i, 1)$ and $(x_i, 2)$

All the red points in the plot represented by their projected values on the two original coordinators (Feature 1, Feature 2).

In PCA technique, it uses new coordinators (showed in green). As coordinate system changed, the points also are shown by their new pair of coordinate values for example $(z_i, 1)$ and $(z_i, 2)$. These values are determined by original vector, and the relation between the rotated coordinates and the original ones.

In the original coordinates the two vectors $x_1$ and $x_2$ are not linearly uncorrelated. In other words, after applying PCA, if there are two principal component, and use those as the new coordinates, then it is guaranteed that the data along the two coordinates have correlation = 0

By rotating the coordinates we performed an orthonormal transform on the data. After the transform, the correlation is removed - which is the whole point of PCA.

As an example of an extreme case where all of the points lie on a straight line, which in the original coordinates system it is needed two coordinates to describe these points.

In short, after rotating the coordinates, we need only one coordinate and clearly the value on the other one is always zero. This example shows PCA's dimension reduction in an extreme case while in real world points may not fit exactly on a line, but only approximately.

### PCA applications

As mentioned, PCA is a method to reduce data dimension if possible to principal components such that those PCs cover as much data variation as possible. This technique is useful in different type of applications which involve data with a huge dimension like data pre-processing, neuroscience, computer graphics, meteorology, oceanography, gene expression, economics, and finance among of all other applications.

Data usually is represented by lots of variables. In data preprocessing, PCA is a technique to select a subset of variables in order to figure our best model for data. In neuroscience, PCA used to identify the specific properties of a stimulus that increase a neuron’s probability of generating an action potential.

Figures below show some example of PCA in real world applications. Click to enlarge.

Tracking the useful features based on PCA:

Since in real world application, we need to know which features are important after dimensionality reduction. Then we can use those features as our new data set to continuous our project. To do this, we need to check the principal components have been chosen. Values of principal components or eigenvectors represent the weight of features within that specific principal component. Based on these weights we can tell the importance of features in the principal component. Then choose several most importance features from the original data instead of using all.

### Mathematical details

PCA is a transformation from original (d-dimensional) space to a linear subspace with a new (p-dimensional) coordinate system. Each coordinate of this subspace is called a Principle Component. First principal component is the coordinate of this system along which the data points has the maximum variation. That is, if we project the data points along this coordinate, maximum variance of data is obtained (compared to any other vector in original space). Second principal component is the coordinate in the direction of the second greatest variance of the data, and so on.

For example, if we are mapping from 3 dimensional space to 2 dimensions, the first principal component is the 2 dimensional plane which has the maximum variance of data when each 3D data point is mapped to its nearest 2D point on the 2D plane.

Lets denote the basis of original space by $\mathbf{v_1}$, $\mathbf{v_2}$, ... , $\mathbf{v_d}$. Our goal is to find the principal components (coordinate of the linear subspace), denoted by $\mathbf{u_1}$, $\mathbf{u_2}$, ... , $\mathbf{u_p}$ in the hope that $p \leq d$. First, we would like to obtain the first principal component $\mathbf{u_1}$ or the components in the direction of maximum variance. This component can be treated as a vector in the original space and so is written as a linear combination of the basis in original space.

$\mathbf{u_1}=w_1\mathbf{v_1}+w_2\mathbf{v_2}+...+w_d\mathbf{v_d}$

Vector $\mathbf{w}$ contains the weight of each basis in this combination.

$\mathbf{w}=\begin{bmatrix} w_1\\w_2\\w_3\\...\\w_d \end{bmatrix}$

Suppose we have n data points in the original space. We represent each data points by $\mathbf{x_1}$, $\mathbf{x_2}$, ..., $\mathbf{x_n}$. Projection of each point $\mathbf{x_i}$ on the $\mathbf{u_1}$ is $\mathbf{w}^T\mathbf{x_i}$.

Let $\mathbf{S}$ be the sample covariance matrix of data points in original space. The variance of the projected data points, denoted by $\Phi$ is

$\Phi = Var(\mathbf{w}^T \mathbf{x_i}) = \mathbf{w}^T \mathbf{S} \mathbf{w}$

we would like to maximize $\Phi$ over set of all vectors $\mathbf{w}$ in original space. But, this problem is not yet well-defined, because for any choice of $\mathbf{w}$, we can increase $\Phi$ by simply multiplying $\mathbf{w}$ by a positive scalar greater than one. And the maximum will goes to infinity which is not the proper solution for our problem. So, we add the following constraint to the problem to bound the length of vector $\mathbf{w}$:

$max_w \Phi = \mathbf{w}^T \mathbf{S} \mathbf{w}$

subject to : $\mathbf{w}^T \mathbf{w} =1$

This constraint makes $\mathbf{w}$ the unit vector in d-dimensional euclidian space and as a result the optimization problem is no longer unconstrained.

Using Lagrange Multiplier technique we have:

$L(\mathbf{w} , \lambda ) = \mathbf{w}^T \mathbf{S} \mathbf{w} - \lambda (\mathbf{w}^T \mathbf{w} - 1 )$

By taking derivative of $L$ w.r.t. primary variable $\mathbf{w}$ we have:

$\frac{\partial L}{\partial \mathbf{w}}=( \mathbf{S}^T + \mathbf{S})\mathbf{w} -2\lambda\mathbf{w}= 2\mathbf{S}\mathbf{w} -2\lambda\mathbf{w}= 0$

Note that $\mathbf{S}$ is symmetric so $\mathbf{S}^T = \mathbf{S}$.

From the above equation we have:

$\mathbf{S} \mathbf{w} = \lambda\mathbf{w}$.

So $\mathbf{w}$ is the eigenvector and $\lambda$ is the eigenvalue of the matrix $\mathbf{S}$ .(Taking derivative of $L$ w.r.t. $\lambda$ just regenerate the constraint of the optimization problem.)

By multiplying both sides of the above equation to $\mathbf{w}^T$ and considering the constraint, we obtain:

$\Phi = \mathbf{w}^T \mathbf{S} \mathbf{w} = \mathbf{w}^T \lambda \mathbf{w} = \lambda \mathbf{w}^T \mathbf{w} = \lambda$

The interesting result is that objective function is equal to eigenvalue of the covariance matrix. So, to obtain the first principle component, which maximizes the objective function, we just need the eigenvector corresponding to the largest eigenvalue of $\mathbf{S}$. Subsequently, the second principal component is the eigenvector corresponding to the second largest eigenvalue, and so on.

### Principal component extraction using singular value decomposition

Singular Value Decomposition (SVD), is a well-known way to decompose any kind of matrix $\mathbf{A}$ (m by n) into three useful matrices.

$\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$.

where $\mathbf{U}$ is m by m unitary matrix, $\mathbf{UU^T} = I_m$, each column of $\mathbf{U}$ is an eigenvector of $\mathbf{AA^T}$

$\mathbf{\Sigma}$ is m by n diagonal matrix, non-zero elements of this matrix are square roots of eigenvalues of $\mathbf{AA^T}$.

$\mathbf{V}$ is n by n unitary matrix, $\mathbf{VV^T} = I_n$, each column of $\mathbf{V}$ is an eigenvector of $\mathbf{A^TA}$

Now, comparing the concepts of PCA and SVD, one may find out that we can perform PCA using SVD.

Let's construct a matrix p by n with our n data points such that each column of this matrix represent one data point in p-dimensional space:

$\mathbf{X} = [\mathbf{x_1} \mathbf{x_2} .... \mathbf{x_n} ]$

and make another matrix $\mathbf{X}^*$ simply by subtracting the mean of data points from $\mathbf{X}$.

$\mathbf{X}^* = \mathbf{X} - \mu_X[1, 1, ... , 1], \mu_X = \frac{1}{n} \sum_{i=1}^n x_i$

Then we will get a zero-mean version of our data points for which $\mathbf{X^*} \mathbf{X^{*^T}}$ is the covariance matrix.

So, $\mathbf{\Sigma}$ and $\mathbf{U}$ give us the eigenvalues and corresponding eigenvectors of covariance matrix, respectively. We can then easily extract the desired principal components.

### MATLAB example

In this example we use different pictures of a man's face with different facial expressions. But, these pictures are noisy. So the face is not easily recognizable. The data set consists of 1965 pictures each 20 by 28. So dimensionality is 20*28= 560.

Our goal is to remove noise using PCA and of course by means of SVD function in matlab. We know that noise is not the main feature that makes these pictures look different, so noise is among least variance components. We first extract the principal components and then remove those which correspond to the least values of eigenvalues of covariance matrix. Here, eigenvalues are sorted in a descending order from column to column of matrix S. For example, the first column of matrix U, which corresponds to the first element in S, is the first principal component.

We then reconstruct the picture using first d principal components. If d is too large, we can not completely remove the noise. If it is too small, we will lose some information from the original data. For example we may lose the facial expression (smile, sadness and etc.). We can change d until we achieve a reasonable balance between noise reduction and information loss.

File:noisy-face.jpg
Noise reduced version of a picture in MATLAB example
 >> % loading the noisy data, file "noisy" stores our variable X which contains the pictures
>>
>>
>> % show a sample image in column 1 of matrix X
>> imagesc(reshape(X(:,1),20,28)')
>>
>> % set the color of image to grayscale
>> colormap gray
>>
>> % perform SVD, if X matrix if full rank, will obtain 560 PCs
>> [U S V] = svd(X);
>>
>> d= 10;
>>
>> % reconstruct X using only the first d principal components and removing noise
>> XX = U(:, 1:d)* S(1:d,1:d)*V(:,1:d)' ;
>>
>> % show image in column 1 of XX which is a noise reduced version
>> imagesc(reshape(XX(:,1),20,28)')


## PCA continued, Lagrange multipliers, singular value decomposition (SVD) (Lecture 2: Sept. 15, 2014)

### Principal component analysis (continued)

PCA is a method to reduce dimensions in data or to extract features from data.

Given a data point in vector form $\,x$ the goal of PCA is to map $\,x$ to $\,y$ where $\,x$ $\,\isin$ $\,\real$d and $\,y$ $\,\isin$ $\,\real$p such that p is much less than d. For example we could map a two-dimensional data point onto any one-dimensional vector in the original 2D space or we could map a three-dimensional data point onto any 2D plane in the original 3D space.

The transformation from d-dimensional space to p-dimensional space is chosen in such a way that maximum variance is retained in the lower dimensional subspace. This is useful because variance represents the main differences between the data points, which is exactly what we are trying to capture when we perform data visualization. For example, when taking 64-dimensional images of hand-written digits, projecting them onto 2-dimensional space using PCA captures a significant amount of the structure we care about.

In terms of data visualization, it is fairly clear why PCA is important. High dimensional data is sometimes impossible to visualize. Even a 3D space can be fairly difficult to visualize, while a 2D space, which can be printed on a piece of paper, is much easier to visualize. If a higher dimensional dataset can be reduced to only 2 dimensions, it can be easily represented by plots and graphs.

In the case of many data points $\,x_i$ in d-dimensional space:

$\,X_{dxn} = [x_1 x_2 ... x_n]$

There is an infinite amount of vectors in $\,\real$p to which the points in X can be mapped. When mapping these data points to lower dimensions information will be lost. To preserve as much information as possible the points are mapped to a vector in $\real$p which will preserve as much variation in the data as possible. In other words, the data is mapped to the vector in $\real$p that will have maximum variance. In the images below, the data points mapped onto u in Mapping 1 have greater variance than in Mapping 2, and thus preserves more information in the data.

Notice(Figure 1.) the projection of a single data point x on u is p = uTx (a scalar), thus the projection of multiple data points(Figure 2.) onto u is computed by Y = uTX, where X is the matrix representing multiple data points. The variance of Y can be calculated easily as Y is a 1 by n vector.

Expand to higher dimensional space $\,\real$q where q > 1, Covariance is used to find the variance of Y. The formula is given by: Var(uTX) = uTSu where S is the sample covariance matrix of X.

In finding the first principal component the objective is to find the direction u which will have the greatest variance of the points projected onto it. This can be expressed as the following optimization problem:

maxu uTSu
s.t. uTu = 1

The restriction uTu = 1 is to bound the problem. If the length of u were not restricted, uTSu would go to infinity as the magnitude of u goes to infinity. The selection of 1 as the bound is arbitrary.

Also from the three images above, we can see that the first principal component is the "line of best fit" that passes through the multi-dimensional mean of all data points, that minimizes the sum of distance squared from data points to the vector. Tho not illustrated but if the above example is doing dimensional reduction from R3 to R1, then the second principal component will have the same fashion as first component, with correlation between the first component and data points removed.

### Lagrange multipliers

The Lagrange multiplier is a method for solving optimization problems. After constraining the principal component analysis output vector to an arbitrary size the task of maximizing variance in lower dimensional spaces becomes an optimization problem.

#### Lagrange multipler theorem

The method of Lagrange multipliers states that given the optimization problem:

   $max\ f(x,y)$
$s.t.\ g(x,y) = c$


The Lagrangian is defined by:

$\ L(x,y,\lambda)= f(x,y) - \lambda(g(x,y)-c)$

,where $\ \lambda$ term may be either added or subtracted.

Which implies, where $\bigtriangledown\$ is the gradient, that:

$\bigtriangledown\ f(x,y) = \lambda \bigtriangledown g(x,y)$

$\bigtriangledown\ f(x,y) - \lambda \bigtriangledown g(x,y) = 0$

In practice,

$\bigtriangledown\ L$ is calculated by computing the three partial derivatives of $\ L$ with respect to $\ x, y, \lambda$, setting them to 0, and solving the resulting system of equations.

That is to say, the solution is given by taking the solution of:

$\ \bigtriangledown L = \begin{bmatrix} \frac{\partial L}{\partial x}\\\frac{\partial L}{\partial y}\\\frac{\partial L}{\partial \lambda}\end{bmatrix}= 0$

which maximize the objective function f(x,y).

#### Intuitive explanation

The Lagrange multiplier utilizes gradients to find maximum values subject to constraints. A gradient $\bigtriangledown$ is a vector which collects all of the partial derivatives of a function. Intuitively, this can be thought of as a vector which contains the slopes of all of a graph's tangent lines, or the direction which each point on the graph points.

Finding a point where the gradient of f(x,y) and g(x,y) differ by a factor lambda: $\bigtriangledown f(x,y) = \lambda \bigtriangledown g(x,y)$ defines a point where our restraint and objective functions are both pointing in the same direction. This is important because if our objective function is pointing in a higher direction than g, increasing our objective function will increase our output, meaning we are not at a maximum. If the restraint is pointing in a higher direction than f, then reducing our input value will yield a higher output. The factor lambda is important because it is adjusted to ensure that the constraint is satisfied and maximized at our current location.

#### Lagrange multiplier example

Consider the function $\ f(x,y) = x - y$ such that $\ g(x,y) = x^2 + y^2 = 1$. We would like to maximize this objective function.

Calculating the Lagrangian gives

$\ \text{L}(x,y,\lambda) = x - y - \lambda g(x,y) = x - y - \lambda (x^2 + y^2 - 1)$

This gives three partial derivatives:

$\frac{\partial L}{\partial x} = 1 - 2\lambda x = 0, \frac{\partial L}{\partial y} = -1 - 2\lambda y = 0, \frac{\partial L}{\partial \lambda} = - (x^2 + y^2 - 1) = 0$

This gives two solutions:

1. $x = \frac{\sqrt{2}}{2}, y = \frac{-\sqrt{2}}{2}, \lambda = \frac{1}{\sqrt{2}}$
2. $x = \frac{-\sqrt{2}}{2}, y = \frac{\sqrt{2}}{2}, \lambda = \frac{1}{\sqrt{2}}$

Both solutions give extreme values of our objective function. Because this is an optimization problem we are interested in the solution which gives the maximum output, which is solution 1.

### Singular value decomposition revisited

Any matrix X can be decomposed to the three matrices: $\ U \Sigma V^T$ where:

$\ U$ is an m x m matrix of the eigenvector XXT

$\ \Sigma$ is an m x n matrix containing singular values along the diagonal and zeroes elsewhere

$\ V^T$ is an orthogonal transpose of a n x n matrix of the eigenvector XTX

#### Connection of Singular Value Decomposition to PCA

If $X$ is centered (has a mean of 0) then $XX^T$ is the covariance matrix.

Recall that for a square, symmetric matrix $A$, we can have $A = W \Lambda W^T$ where columns of $W$ are eigenvectors of $A$, and $\Lambda$ is a diagonal matrix containing the eigenvalues.

Using the SVD of $X$ we get $X=U \Sigma V^T$. Then consider $XX^T$:

$\ XX^T = (U \Sigma V^T)(U \Sigma V^T)^T$

$\ XX^T = (U \Sigma V^T)(V \Sigma U^T)$

$\ XX^T = U \Sigma V^T V \Sigma U^T$

$\ XX^T = U \Sigma I \Sigma U^T$ (By the definition of V)

$\ XX^T = U \Sigma^2 U^T$

Thus $U$ is the matrix containing eigenvectors of the covariance matrix $XX^T$.

It is important to remember that $U$ and $V$ are unitary matricies, i.e.: $U^TU=UU^T=I$ and $V^TV=VV^T=I$, each equality using the appropriate dimensional identity matrix. It is also interesting that the nonsquare matrix formed by taking a subset of the eigenvectors contained in $U$ or $V$ has the property that $V^TV=I$ (the appropriate lower dimensional identity matrix).

### Matlab example of dimensionality reduction using PCA

The following example shows how principal component analysis can be used to identify between a handwritten 2 and a handwritten 3. In the example the original images are each 64 pixels and thus 64 dimensional. By reducing the dimensionality to 2D the 2's and 3's are visualized on a 2D plot and can it could be estimated as to which digit one was.

 >> % load file containing 200 handwritten 2's and 200 handwritten 3's
>>
>> who?
>> % displays that X is 64x400 matrix containing the handwritten digits
>>
>> imagesc(reshape(X(:,1),8,8)')
>> colormap gray
>> % displays first image, a handwritten 2
>>
>> M = mean(X,2)*ones(1,400);
>> Xb = X-M
>> % M is the matrix of the mean of column vectors repeated 400 times
>> % Xb is data centred about mean
>>
>> % Now use SVD to find eigenvectors, eigenvalues of Xb
>> [U S V] = svd(Xb);
>>
>> % 2D projection on points
>> Y = U(:,1:2)'*X;
>>
>> % 2D projections of 2's
>> plot(Y(1,1:200),Y(2,1:200),'.')
>>
>> % 2D projections of 3's
>> plot(Y(1,201:400),Y(2,201:400),'.')


## PCA Algorithm, Dual PCA, and Kernel PCA (Lecture 3: Sept. 17, 2014)

### PCA Algorithm (Algorithm 1)

Recover basis
To find the basis for PCA we first center the data by removing the mean of the sample, then calculate the covariance:
$XX^T = \sum_{i=1}^n x_i x_i^T$
Let $U$ be the eigenvectors of $XX^T$ corresponding to the top $\,d$ eigenvalues.
Encode training data
To encode our data using PCA we let
$\,Y = U^TX$
where $Y$ is an encoded matrix of the original data
Reconstruct training data
To project our data back to the higher dimension
$\hat{X} = UY = UU^T X$
Encode test example
$\,y = U^T x$
where $\,y$ is an encoding of $\,x$
Reconstruct test example
$\hat{x} = Uy = UU^T x$

### Dual principal component analysis

Motivation

In some cases the dimension of our data can be much larger than the number of data points $( d \gt \gt n )$. For example, suppose we are interested in weekly closing rates of the Dow Jones Industrial Average, S&P 500 Index, and the NASDAQ Composite Index over the past forty years. In this example we only have three data points $( n = 3 )$, but our dimension is over two thousand! $( d \approx 2080 )$ Consequently, the matrix $XX^T$ used to find the basis $U$ is $2080\times2080$, which is computationally heavy to calculate. As another example is studying genes for patients since the number of genes studied is likely much larger than the number of patients. When the data's dimensionality is high relative to the number of data points collected, a less compute-intensive way of recovering the basis is desirable. Calculating dual PCA is more efficient than PCA when the number of data points is lower relative to the dimensionality of the data. This is intuitive because it allows us to model our optimization problem as a decomposition of our data, rather than our dimensions.

Recall from Lecture 2 how we used singular value decomposition for PCA:
$\,X = U \Sigma V^T$
Multiply both sides by V to get:
$\,X V = U \Sigma V^T V$
Since $V^T V = I$ (the identity matrix) the SVD can be rearranged as
$\,X V = U \Sigma$
Then, multiply both sides by $\Sigma$-1 to get:
$\,U = X V \Sigma$-1
And so $U$ can be calculated in terms of $V$ and $\Sigma$. $V$ is the eigenvectors of $X^T X_{n \times n}$, so it is much easier to find than $U$ because $n\lt \lt d$.
Algorithm
We can replace all instances of $\,U$ in Algorithm 1 with the dual form, $\,U = X V \Sigma^{-1}$, to obtain the algorithm for Dual PCA.
Recover Basis:
For the basis of Dual PCA, we calculate
$\,XX^T$
then let $\,V$ be the eigenvectors of $\,XX^T$ with respect to the top d eigenvalues. Finally we let $\,\Sigma$ be the diagonal matrix of square roots of the top d eigenvalues.
Encode Training Data:
Using Dual PCA to encode the data, let
$\,Y = U^TX = \Sigma V^T$
where $Y$ is an encoded matrix of the original data
Reconstruct Training Data:
Project the data back to the higher dimension by
$\hat{X} = UY = U \Sigma V^T = X V \Sigma^{-1} \Sigma V^T = X V V^T$
Encode Test Example:
$\,y = U^T x = \Sigma^{-1} V^T X^T x = \Sigma^{-1} V^T X^T x$
where $\,y$ is an encoding of $\,x$
Reconstruct Test Example:
$\hat{x} = Uy = UU^T x = X V \Sigma^{-2} V^T X^T x = X V \Sigma^{-2} V^T X^T x$
Note that the steps of Reconstructing training data and Reconstructioning test example still depend on $d$, and therefore still will

be impractical in the case that the original dimensionality of the data ($d$) is very large

### Kernel Principal Component Analysis

Kernel methods

Kernel methods are a class of algorithms used for analyzing patterns and measuring similarity. They are used for finding relations within a data set. Kernel methods map data into a higher-dimensional space to make it easier to find these relations. While PCA is designed for linear variabilities, a lot of higher-dimensional data are nonlinear. This is where Kernel PCA can be useful to help model the data variability when it comes to a nonlinear manifold. Generally speaking, Kernels is a measure of "similarity" between data points without explicitly performing the dot product between data points pairs. This can be done using different methods. For example, the dot product looks at whether vectors are orthogonal. The Gaussian kernel looks at whether two objects are similar on a scale of 0 to 1. Kernels also measure similarity between trees. We will not go in too much detail at the moment.

Some common examples of kernels include:
1. Linear kernel:
$\, k_{ij} = \lt x_{i},x_{j}\gt$
2. Polynomial kernel:
$\, k_{ij} = (1+\lt x_{i},x_{j}\gt )^P$
3. Gaussian kernel:
$\, k_{ij} = exp(||x_{i}-x_{j}||^2/2\sigma^2)$

Where $\ x_{i},x_{j} ,\isin$ $\,\real$d

The kernel matrix $\, K_{n \times n}$ is the matrix where $\, k_{ij}$ is the chosen kernel between d-dimensional points $\ x_{i}$ and $\ x_{j}$

Other kernel examples can be found here:
Kernel PCA
With the Kernel PCA, we take the original or observed space and map it to the feature space and then map the feature space to the embedded space of a lower dimension.
$X_{observed space} \rightarrow \Eta_{feature space} \rightarrow Y_{embedded space}$
In the Kernel PCA algorithm, we can use the same argument as PCA and again use the Singular Value Decomposition (SVD) where:
$\, \Phi(X) = U \Sigma V^T$
where $U$ contains the eigenvectors of $\Phi(X)\Phi(X)^T$
The goal is to reduce the dimensionality from the space X to the dimensionality of space Y by passing through H without having to know $\Phi(X)$ exactly.
The algorithm is similar to the Dual PCA algorithm except that the training and the test data cannot be reconstructed since $\Phi(X)$ is unknown.
Algorithm
Recover Basis:
For the basis of Kernel PCA, we calculate
$\, K = \Phi(X)^T \Phi(X)$
using the kernel $K$ and let $V$ be the eigenvectors of $\Phi(X)^T \Phi(X)$ with respect to the top $p$ eigenvalues. Finally we let $\Sigma$ be the diagonal matrix of square roots of the top $p$ eigenvalues.
Encode Training Data:
Using Kernel PCA to encode the data, let
$\,Y = U^T \Phi(X) = \Sigma V^T$
Recall $U = \Phi(X) V \Sigma^{-1}$:
$\,Y = U^T \Phi(X) = \Sigma^{-1} V^T \Phi(X)^T \Phi(X) = \Sigma^{-1} V^T K(X,X)$
where $Y$ is an encoded matrix of the image of the original data in the function $\Phi$. $Y$ is computable without knowledge of $\Phi(X)$ via the kernel function.
Reconstruct Training Data:
$\hat{\Phi(X)} = UY = UU^T\Phi(X)=\Phi(X)V\Sigma^{-1}\Sigma V^T=\Phi(X) V V^T$ $\hat{X} = \Phi^{-1}(\hat{\Phi(X)})$
However, $\Phi(X)$ is unknown so we cannot reconstruct the data.
Encode Test Example:
$\,y = U^T \Phi(x) = \Sigma^{-1} V^T \Phi(X)^T \Phi(x) = \Sigma^{-1} V^T K(X,x)$
This is possible because we know that $\Phi(X)^T \Phi(X) = K(X,x)$.
Recall
$U = \Phi(X) V \Sigma^{-1}$:
$\Phi(X)$ is the transformation of the training data $X_{d \times n}$ while $\Phi(x)$ is the transformation of the test example $x_{d \times 1}$

Reconstruct Test Example:
$\hat{\Phi(x)} = Uy = UU^T \Phi(x) = \Phi(X) V \Sigma^{-2} V^T \Phi(X)^T \Phi(x) = \Phi(X) V \Sigma^{-2} V^T K(X,x)$
Once again, since $\Phi(X)$ is unknown, we cannot reconstruct the data.

## Centering for kernel PCA and multidimensional scaling (Lecture 4: Sept. 22, 2014)

### Centering for kernel PCA

In the derivation of kernel PCA we assumed that $\sum_{i=1}^n\Phi(x_i)=0$ which is improbable for a random data set and an arbitrary mapping $\Phi$. We must be able to ensure this condition though we don't know $\Phi$.

Let $\mu=\frac{1}{n}\sum_{j=1}^n\Phi(x_j)$

Define the centered transformation (not computable)

$\tilde{\Phi}:x\mapsto(\Phi(x)-\mu)$

and the centered kernel (computable)

$\tilde{K}:(x,y)\mapsto \tilde{\Phi}(x)^T \tilde{\Phi}(y)$

Then

$\tilde{K}(x,y) = (\Phi(x)-\mu)^T (\Phi(y)-\mu)=\Phi(x)^T\Phi(y)-\mu^T\Phi(y)-\Phi(x)^T\mu+\mu^T\mu$

$=K(x,y)-\frac{1}{n}\sum_{j=1}^n\Phi(x_j)^T\Phi(y)-\frac{1}{n}\sum_{j=1}^n\Phi(x_j)^T\Phi(x)+\frac{1}{n^2}\sum_{j=1}^n\sum_{i=1}^n\Phi(x_j)^T\Phi(x_i)$
$=K(x,y)-\frac{1}{n}\sum_{j=1}^n K(x_j,x)-\frac{1}{n}\sum_{j=1}^n K(x_j,y)+\frac{1}{n^2}\sum_{j=1}^n\sum_{i=1}^n K(x_j,x_i)$
$=K(x,y)-\frac{1}{n}\bold{1}^TK(X,x)-\frac{1}{n}\bold{1}^TK(X,y)+\frac{1}{n^2}\bold{1}^TK(X,X)\bold{1}$
where $\bold{1}=([1,1,...,1]_{1\times n})^T$

Here $X_{d\times n}$ and $\{x_j\}_{j=1}^n$ denotes the data set matrix and elements respectively and $x,y$ represent test elements at which the centered kernel is evaluated.

In practice, with $\, X = [x_1 | x_2 | ... | x_n]$ as the data matrix, the Kernel Matrix, denoted by $K$, will have elements $\, [K]_{ij} = K(x_i,x_j)$.

We evaluate $\tilde{K} = K - \frac{1}{n}\bold{1}^TK-\frac{1}{n}K\bold{1}+\frac{1}{n^2}\bold{1}^TK\bold{1}$, and find the eigenvectors/eigenvalues of this matrix for Kernel PCA.

### Metric Multidimensional Scaling

Multidimensional Scaling (MDS) is another dimension reduction method that attempts to preserve the pair-wise distance, $\,d_{ij}$, between data points. MDS helps with the problem of making a configuration of points in Euclidean space by using the distance information between the patterns. For example, suppose we want to map a set of points X, $x \in \mathbb{R}^d$, to a set of points Y, $y \in \mathbb{R}^p$. We want to preserve the pair-wise distances such that $\,d(x_i, x_j)$ is approximately the same as $\,d(y_i, y_j)$.

#### MDS example

We can map points in 2 dimensions to points in 1 dimension by projecting the points onto a straight line:

Note that the pair-wise distances are relatively preserved. However, the triangle inequality is not preserved:

$\, d(x_1,x_2)+d(x_2,x_3) \gt d(x_1, x_3)$, but $\, d(y_1,y_2)+d(y_2,y_3) = d(y_1, y_3)$

Another example of when it's crucial to preserve the distances between data points in lower dimension space (MDS) is when dealing with "distances between cities", as shown in here

#### Distance matrix

Given a $d x n$ matrix, $X$, the distance matrix is defined by:

$D^X_{n \times n} = \begin{bmatrix} d^{(X)}_{11} & d^{(X)}_{12} & ... & d^{(X)}_{1n}\\d^{(X)}_{21} & d^{(X)}_{22} & ... & d^{(X)}_{2n}\\...\\d^{(X)}_{n1} & d^{(X)}_{n2} & ... & d^{(X)}_{nn} \end{bmatrix}$

Where $d^{(X)}_{ij}$ is defined as the euclidean distance between points $x_i$ and $x_j$ in $\mathbb{R}^d$

Important properties:

• The matrix is symmetric
• ie $d^{(X)}_{ij} = d^{(X)}_{ji} \; \forall \; 1 \leq i,j \leq n$
• The matrix is always non-negative
• ie $d^{(X)}_{ij} \ge 0 \; \forall \; 1 \leq i,j \leq n$
• The diagonal of the matrix is always 0
• ie $d^{(X)}_{ii}= 0 \; \forall \; 1 \leq i \leq n$
• If $x_i$'s don't overlap, then all elements that are not on the diagonal are positive
• Triangle inequality holds for all distance
• Triangle inequality: $d_{ab} + d_{bc} \gt = d_{ac}$ whenever $a, b, c$ are three different indices

Example:
For:

$x_1 = (0,1)$
$x_2 = (1,0)$
$x_3 = (1,1)$

Or:

$X = \begin{bmatrix} 0 & 1 & 1\\1 & 0 & 1 \end{bmatrix}$

The distance matrix is:

$D^X_{3 \times 3} = \begin{bmatrix} 0 & \sqrt{2} & 1\\\sqrt{2} & 0 & 1\\1 & 1 & 0 \end{bmatrix}$

For any mapping to matrix $Y_{p \times n}$, we can calculate the corresponding distance matrix

$D^Y_{n \times n} = \begin{bmatrix} d^{(Y)}_{11} & d^{(Y)}_{12} & ... & d^{(Y)}_{1n}\\d^{(Y)}_{21} & d^{(Y)}_{22} & ... & d^{(Y)}_{2n}\\...\\d^{(Y)}_{n1} & d^{(Y)}_{n2} & ... & d^{(Y)}_{nn} \end{bmatrix}$

Where $d^{(Y)}_{ij}$ is defined as the euclidean distance between points $y_i$ and $y_j$ in $\mathbb{R}^p$

Note that a kernel matrix is a positive semidefinite matrix.

#### Finding the optimal Y

MDS minimizes the metric

$\text{min}_Y \sum_{i=1}^n{\sum_{j=1}^n{(d_{ij}^{(X)}-d_{ij}^{(Y)})^2}}$

Where $d_{ij}^{(X)} = ||x_i - x_j||$, and $d_{ij}^{(Y)} = ||y_i - y_j||$

The distance matrix $D^{(X)}$ can be converted to a kernel matrix:

$K = -\frac{1}{2}HD^{(X)}H$

Where $H = I - \frac{1}{n}ee^T$ and $e=\begin{bmatrix} 1 \\ 1 \\ ... \\ 1 \end{bmatrix}$

Here, K is a kernel matrix implies that K is positive semi-definite, so $x^TKx \geq 0$ $\forall x$

In fact, there is a one-to-one correspondence between symmetric matrices and kernel matrices

#### Theorem

If $D^{(X)}$ is a distance matrix and if $K=-\frac{1}{2}HD^{(X)}H$, then $D^{(X)}$ is Euclidean if and only if $K$ is Positive Semi-Definite,
where $H=I-\frac{1}{n}\bold{e}\bold{e} ^T$ and $\bold{e}=([1,1,...,1]_{1\times n})^T$

#### Relation between MDS and PCA

It has been shown that the goal of MDS is to obtain $Y$ by minimizing $\text{min}_Y \sum_{i=1}^n{\sum_{j=1}^n{(d_{ij}^{(X)}-d_{ij}^{(Y)})^2}}$
K is positive semi-definite so it can be written as $K = X^TX$. Thus

$\text{min}_Y \sum_{i=1}^n{\sum_{j=1}^n{(d_{ij}^{(X)}-d_{ij}^{(Y)})^2}} =\text{min}_Y(\sum_{i=1}^n\sum_{j=1}^n(x_i^Tx_j-y_i^Ty_j)^2)$

and

$\text{min}_Y(\sum_{i=1}^n\sum_{j=1}^n(x_i^Tx_j-y_i^Ty_j)^2)=\text{min}_Y(\text{Tr}((X^TX-Y^TY)^2))$

From SVD we may make the decompositions $\, X^TX=V\Lambda V^T$ and $Y^TY=Q\hat{\Lambda}Q^T$

$\text{min}_Y(\text{Tr}((X^TX-Y^TY)^2))=\text{min}_Y(\text{Tr}((V\Lambda V^T-Q\hat{\Lambda}Q^T)^2))=\text{min}_Y(\text{Tr}((\Lambda-V^TQ\hat{\Lambda}Q^TV)^2))$

Let $G=Q^TV$

$\text{min}_{G,\hat{\Lambda}}(\text{Tr}((X^TX-Y^TY)^2))=\text{min}_{G,\hat{\Lambda}}(\text{Tr}((\Lambda-G^T\hat{\Lambda}G)^2))=\text{min}_{G,\hat{\Lambda}}(\text{Tr}(\Lambda^2+G^T\hat{\Lambda}GG^T\hat{\Lambda}G-2\Lambda G^T\hat{\Lambda}G))$

$=\text{min}_{G,\hat{\Lambda}}(\text{Tr}(\Lambda^2+G^T\hat{\Lambda}^2G-2\Lambda G^T\hat{\Lambda}G))$
It can be shown that for fixed $\hat{\Lambda}$ that the minimum over $G$ occurs at $G=I$, which implies that

$\text{min}_{G,\hat{\Lambda}}(\text{Tr}((X^TX-Y^TY)^2))=\text{min}_{\hat{\Lambda}}(\text{Tr}(\Lambda^2-2\Lambda\hat{\Lambda}+\hat{\Lambda}^2)=\text{min}_{\hat{\Lambda}}(\text{Tr}((\Lambda-\hat{\Lambda})^2)$

Since $\, \Lambda$ and $\hat{\Lambda}$ are diagonal matrices containing the eigenvalues of $X^TX,\text{ and } Y^TY$ respectively, and all said eigenvalues are positive, then the min is clearly obtained when the $p$ nonzero diagonal elements in $Y^TY$ are the $p$ largest eigenvalues in $X^TX$

Therefore in Euclidean space, the solution of MDS is the same as that of PCA.

Approximation to MDS Algorithm <ref> http://www.cmap.polytechnique.fr/~peyre/cours/x2005signal/dimreduc_landmarks.pdf </ref>

The problem with the MDS algorithm is that the matrices $D^X$ and $K$ are not sparse. It is therefore expensive to compute eigen-decompositions when the number of data points is very large. To reduce the computational work required we can use Landmark Multidimensional Scaling (LMDS). LMDS is an algorithm designed to run classical MDS to embed a chosen subset of data, landmark points, and then located the remaining data points within this space given knowledge of its distance to the landmark points. The following paper has details on FastMap, MetricMap, and Landmark MDS and unifies the mathematical foundation of the three presented MDS algorithms.

## ISOMAP and LLE (Lecture 5: Sept. 29, 2014)

### ISOMAP

Isomap is a nonlinear dimensionality reduction method. In Multidimensional scaling we seek to find a low dimensional embedding $\,Y$ for some high dimensional data $\,X$, minimizing $\,\|D^X - D^Y\|^2$ where $\, D^Y$ is a matrix of euclidean distances of points on $\,Y$ (same with $\,D^X$). In Isomap we assume the high dimensional data in $X$ lie on a low dimensional manifold, and we'd like to replace the euclidean distances in $\, D^X$ with the distance on this manifold. Obviously we don't have information about this manifold, so we approximate it by forming the k-nearest-neighbour graph between points in the dataset. The distance on the manifold can then be approximated by the shortest path between points. This is called the geodesic distance which is a generalization of the notion of straight line to curved spaces ans denoted by $\, D^G$. This is useful because it allows for nonlinear patterns to be accounted for in the lower dimensional space. From this point the rest of the algorithm is the same as MDS. The approach of Isomap is capable of discovering nonlinear degrees of freedom that underlie the complex nature observation. The efficiency here compute a global optimal solution and thus converged asymptotically to the true structure( also see the article: A Global Geometric Framework for Nonlinear Dimensionality Reduction, by Joshua B. Tenenbaum, Vin de Silva, John C. Langford )

The Isomap Algorithm can thus be described through the following steps:

1. Construct the Neighbourhood Graph

Determine which points are neighbours on manifold M based on distances $d_{i,j}^{(X)}$ between pairs of points i and j in the input space X. To do this, can use:
a) K-nearest neighbours
b) ${\epsilon}$ radius neighbourhood
The neighbourhood relations are represented as a weighted graph G over the data points, with edges of weight $d_{i,j}^{(X)}$ between the neighbouring points.

2. Compute shortest paths

Estimate geodesic distances $d_{i,j}^{(M)}$ between all pairs of points on the manifold M by computing their shortest path distance $d_{i,j}^{(G)}$ in graph G.

3. Construct low dimensional embeddings (classical MDS)

Apply classical MDS to the matrix of graph distances $D_{G} = \{ d_{i,j}^{(G)} \}$, constructing an embedding of the data in a d-dimensional Euclidean space Y that best represents the manifolds estimated intrinsic geometry.

It is worth noting that as $D^{G}$ is not a Euclidean distance, the matrix $K = -\frac{1}{2}HD^{(G)}H$ is not guaranteed to be positive semi-definite. This is handled in the implementation [5] by considering only the positive value of the real part only, which can be considered as projecting the matrix onto the cone of nearest positive semi-definite matrix.

It is important to note that k determines "locality". Isomap captures local structure and "unfolds" the manifold.

The swiss roll manifold is a commonly discussed example of this. Note that the Euclidean distance between points A and B drastically differs from the Geodesic distance.

#### ISOMAP Results

A Global Geometric Framework for Nonlinear Dimensionality Reduction is the paper that proposed ISOMAP, the below figures show some images for ISOMAP results. The first image shows the different orientation of the face images dataset, as shown in the figure the x axis captures the horizontal orientation, while the y axis captures the vertical orientation and the bar captures the lightning direction. The second image shows the 2 handwritten digits and ISOMAP results in this dataset.

### Locally Linear Embedding (LLE)

Locally Linear Embedding (LLE) is another technique for nonlinear dimensionality reduction that tries to find a mapping from high dimensional space to low dimensional space such that local distances between points are preserved. LLE has some advantages over Isomap, including faster optimization and better result in many problems. LLE reconstructs each point based on its k-nearest neighbour.

E.g. $x_i \approx w_1x_{1}+w_2x_{2}+...+w_kx_{k} , \sum_i w_i = 1$

The method optimizes the equation $\mathbf{x_i-w_1x_{i1}+w_2x_{i2}+...+w_kx_{ik}}$ for each $x_i \in X$ with corresponding k-th nearest neighbour set $\mathbf{\, {x_{i1},x_{i2},...,x_{ik}}}$. LLE is therefore an optimization problem for:

$\mathbf{ \text{min}_W \sum_{i=1}^n{(x_i-\sum_{j=1}^n{w_{ij}x_j)^2}}}$ such that $\mathbf{\sum_{j=1}^n w_{ij} = 1}$ and also $\mathbf{w_{ij} = 0}$ if $\mathbf{x_i}$ and$\mathbf{x_j}$are not neighbors

#### Algorithm

• Find the k-nearest neighbour for each data point
• Compute $\mathbf{W_{nxn}}$ through $\mathbf{\text{min}_W \sum_{i=1}^n{(x_i-\sum_{j=1}^n{W_{ij}x_j)^2}}}$ such that $\mathbf{\sum w_{ij} = 1}$ (which means that it's convex optimization of the points) where $\mathbf{x_j}$ are the k-nearest neighbours of $\mathbf{x_i}$
• Let $\mathbf{V_i = [x_j | x_j}$ is a neighbour of $\mathbf{ x_i ]_{dxk}}$ Let $\mathbf{w(i)}$ be all k weights of point $\mathbf{x_i}$ (ie row i of W with all zeroes removed) Then:
$\mathbf{V_i*w(i) = \sum_{j=1}^n{w_{ij}x_j}}$
and hence we need to optimize
$\mathbf{\, \text{min}_{w(i)} |x_i-V_i*w(i)|^2}$ under the constraint $\mathbf{\sum_{j=1}^n{w_{ij}}=1}$
This is a separable optimization problem which can be solved for each point independently, which can be simplified to
$\mathbf{min|x_ie_k^Tw(i)-V_iw(i)|^2 }$, where $e=\begin{bmatrix} 1 \ 1 \ ... \ 1 \end{bmatrix}^T$.
Now we have
$\mathbf{\, |x_ie_k^Tw(i)-V_iw(i)|^2 =(x_ie^Tw(i)-V_iw(i))^T(x_ie^Tw(i)-V_iw(i))= w(i)^T(x_ie^T-V_i)^T(x_ie^T-V_i)w(i) }$

Let $\mathbf{\, G =(x_ie^T-V_i)^T(x_ie^T-V_i) }$. Therefore we can instead simplify
$\mathbf{ min \;w(i)^TGw(i)}$ under the constraint $\mathbf{\, w(i)^Te=1}$

• Use Lagrange Multipliers to optimize the above equation:
$\mathbf{\, L(w(i),\lambda)=w(i)^TGw(i)-\lambda(w(i)^Te-1) }$

$\mathbf{\frac{\partial L}{\partial {w(i)}}=2Gw(i)-\lambda e=0 }$ which implies $\mathbf{ Gw(i)=\frac{\lambda}{2}e}$ and hence $\mathbf{ w(i)=\frac{\lambda}{2}G^{-1}e }$
Note that we do not need to compute $\mathbf{\lambda}$, but can just compute $\mathbf{\, G^{-1}e}$, then re-scale so $\mathbf{\, w(i)^Te=1}$
• Finally, solve $\mathbf{\text{min}_Y \sum_{i=1}^n{(y_i-\sum_{j=1}^t{w_{ij}y_j)^2}}}$ where w is known and y is unknown

Notes:

$\, G_{d*d}$ is invertible under some conditions depending on $\, k$, as it must have rank $\, d$ so $\, k$ must be greater than or equal $\, d$

LLE doesn't work for out of sample case there are extensions for it to handle this as in this [6]

If we want to reduce the dimensionality to p, we need to choose the p+1 smallest eigenvectors.

Pros of Isomap:

1. Calculation is based on eigenvalue and eigenvectors, thus overall stability and optimality is maintained.
2. Only need one pre-determined parameter: k(nearest neighbours) or $\, {\epsilon}$ (neighbourhood radius)
3. Able to determine intrinsic dimension of signal from residual.

Cons of Isomap

1. It is slow - it requires comparing all possible paths between each N(N-1)/2 pairs of points and determining which is shortest. (But this can be fixed via the Nystrom approximation which we will see later.)
2. It is not obvious how to handle out of sample data

Pros of LLE:

1. LLE is capable of retain the structure and characterstic of original data.
2. LLE requires little parameters to calculate
3. Combining 1 & 2, LLE is good for trouble shooting and error dignostics, since it can retain most from original data and needs minimal amount of parameter to calculate.
4. Much faster than ISOMAP, often with comparable or better performance

Cons of LLE

1. It is not obvious how to handle out of sample data
2. There is no way of knowing if reducing the data to p dimensions is the optimal reduction.

## LLE continued, introduction to maximum variance unfolding (MVU) (Lecture 6: Oct. 1, 2014)

### LLE

Recall the steps of LLE can be summarized as follows:

1. For each data point, find its $k$ nearest neighbours for some fixed $k$.
2. Find reconstruction weights $W$, minimizing
$\sum_{i=1}^n ||x_i - \sum_{j=1}^n w_{ij}x_j||^2, where \sum_j w_{ij} = 1$
3. Reconstruct the low dimensional embedding with respect to these weights.
That is, find low dimensional $Y$ minimizing
$\sum_{i=1}^n||y_i - \sum_{j=1}^nw_{ij}y_j||^2$.

The previous lecture covered how to find the weights $W$. Here we will discuss how to find the low-dimensional embedding $Y$ using these weights.

First, we want to write the previous equation in a matrix form.
Consider that $Y$ is a $p*n$ matrix and $I_{:i}$ is the $i^{th}$ column ($n*1$) in the identity matrix (like Matlab notation)

Then the sample, $i$, can be represented as follows: $y_i = Y * I_{:i} = \begin{bmatrix} y_1, y_2 ,... y_i,..., y_n \end{bmatrix} \begin{bmatrix} 0 \\0 \\... \\ 1 \\ ... \\0 \end{bmatrix}$

Note that: $I_{i:} Y^T = y_i^T$ where $I_{i:}$ is the $i^{th}$ row from the identity matrix

Similarly, $\sum w_{ij} y_i = W_{i:} Y^T$

So the objective is equivalent to minimizing $\sum_{i=1}^n||I_{i:} Y^T - W_{i:}Y^T||^2 = ||IY^T - WY^T||^2$

But first, to prevent the trivial solution of the 0-matrix, we add the constraint that $\frac{1}{n}Y^TY = I$. Now, the objective can be rewritten as minimizing:

$||Y^T - WY^T||^2_F$

where the above norm is the Frobenius norm, which takes the square root of the sum of squares of all the entries in the matrix. Recall this norm can be written in terms of the trace operator:

$||Y^T - WY^T||^2_F = ||Y^T (I - W)||^2_F = Tr((Y^T (I - W)) (Y^T (I - W))^T)= Tr(Y^T(I-W)(I-W)^TY)$

By letting $\, M=(I-W)(I-W)^T$, the objective function will become

$\, \min_Y Tr(Y^T M Y)$

This minimization has a well-known solution: We can set the rows of $Y$ to be the $d$ lowest eigenvectors of $M$, where $d$ is the dimension of the reduced space.

To solve this minimization using the Lagrange multiplier, the Lagrangian is:

$L(Y, \Lambda) = Tr(Y^TMY) - Tr(\Lambda (\frac{1}{n} Y^T Y - I)))$

$\frac{d}{dy} L(Y, \Lambda) = 2MY- \Lambda \frac{2}{n} Y$

Setting this equal to zero and solving gives

$MY = \frac{1}{n} \Lambda Y$
$MY = \Lambda^* Y$

where $\Lambda = \begin{bmatrix} \lambda_1 & \ldots & 0 \\ 0 & \lambda_2 & \ldots\\ \vdots & \ddots & \vdots\\ 0 &\ldots & \lambda_p \end{bmatrix}$

If we want to reduce the dimensions to p, we need to choose p + 1 smallest eigenvectors. This is because the first eigenvalue of the matrix is equal to 1 since it is a Laplacian (the number of $0$ eigenvalues of a Laplacian is the number of connected components).

Note that for LLE, the bottom d+1 eigenvectors of the matrix M represents a free translation mode of eigenvalue 0. By discarding this eigenvector, we enforce the constraint of $\sum_{i=1}^n = y_i=0$ since it means that the components of the other eigenvectors must sum to 0, because of orthogonality. The remaining d eigenvectors form the d embedding coordinates found by LLE.

### Comparison between Dimensionality Reduction Approaches (Unified Framework)

Approach Solution \ Eigenvectors of
PCA $\boldsymbol{ X^T X }$ or $\boldsymbol{ \Sigma V^T }$
MDS $\boldsymbol{ - \frac{1}{2} (I - \frac{1}{n} e e^T) D (I - \frac{1}{n} e e^T) }$ a.k.a. $\boldsymbol{-\frac{1}{2} H D H}$ where $\boldsymbol{H=(I - \frac{1}{n} e e^T)}$
ISOMAP $\boldsymbol{ - \frac{1}{2} (I - \frac{1}{n} e e^T) D^G (I - \frac{1}{n} e e^T) }$
LLE $\boldsymbol{ (I - W)(I - W)^T }$ or $\boldsymbol{ \lambda I - (I - W)(I - W)^T }$
KPCA $\boldsymbol{K}$

From the above table we can interpret all of these algorithms as variations of kernel PCA. More details are presented in this paper: A kernel view of the dimensionality reduction of manifolds.

Note that PCA, MDS, ISOMAP and KPCA choose the largest eigenvalue/eigenvector pairs for embedding while LLE chooses the smallest pair(For MDS, D is Euclidean distance). However, in LLE, if we let $\lambda_{max}$ be the largest eigenvalue of $\mathbf{L = (I-W)^T(I-W)}$ and replace the matrix $\mathbf{ (I - W)(I - W)^T }$ with $\mathbf{ \lambda_{max}I - (I - W)(I - W)^T }$, the largest eigenvalue/eigenvector pairs will be chosen instead.

### Relationship between LLE and Kernel PCA on LLE kernel

The smallest eigenvalue of LLE Kernel $\,(I-W)^T(I-W)$ is 0. The eigenvector corresponding to this eigenvalue is $\,c[1,1,1,....1,1,1]$.

So when we do LLE we find eigenvectors corresponding to smallest p+1 eigenvalues and discard the first eigenvector which is $\,c[1,1,1,....1,1,1]$.

Now, If we construct a new matrix, a so-called deflated matrix by multiplying $\,\Phi(x)=I-W$ with a factor $\,(I-\frac{1}{n}ee^T)$, we will get a new kernel with rank d-1.

i.e. $\,\hat{K} = \hat{\phi(x)}^T \hat{\phi(x)} = (I-\frac{1}{n}ee^T)K(I-\frac{1}{n}ee^T)$ has the eigenvector $\,c[1,1,1,....1,1,1]$ removed.

Next we only need to find eigenvectors corresponding to smallest p eigenvalues and do not need to discard the first one, since it has been moved already.

However, we realize that $\,(I-\frac{1}{n}ee^T)$ equals the centering matrix $\,H$.

So what we do in LLE is similar to what we do in Kernel PCA.

In fact doing LLE is equivalent to perform Kernel PCA on LLE kernel up to a scaling factor.

Supervised Locally Linear Embedding

Dick et al (2003) proposed a supervised LLE. The big difference between LLE and SLLE is that in SLLE, the class information of data would be considered to get the nearest neighbors.

### Maximum variance unfolding

In maximum variance unfolding (MVU), as in isomap and LLE, we try to reduce the dimensionality of a nonlinear data. We may assume that the data is sampled from a low-dimensional manifold that is embedded within a higher-dimensional Euclidean space. The aim is to "unfold" this manifold in a lower-dimensional Euclidean space such that the local structure of each point is preserved. We would like to preserve the distance from any point to each of its $k$ nearest neighbours, while maximizing the pairwise variance for all points which are not its neighbours. The MVU algorithm uses semidefinite programming to find such a kernel.

#### Optimization constraints

For a kernel $\Phi$ such that $x \mapsto \phi(x)$, we have the following 3 constraints:

1. $\Phi$ must be centred, i.e. $\sum\limits_{i,j} k_{ij} = 0$.
2. $k$ should be positive semi-definite, i.e. $k \succeq 0$.
3. The local distance should be maintained between data and its embedding.
This means that for $i, j$ such that $x_i, x_j$ are neighbours, we require that $\Phi$ satisfies
$\, | x_i - x_j |^2 = |\phi(x_i) - \phi(x_j)|^2$

$\, = [\phi(x_i) - \phi(x_j)]^T [\phi(x_i) - \phi(x_j)]$
$\, =\phi(x_i)^T \phi(x_i) - \phi(x_i)^T \phi(x_j) - \phi(x_j)^T \phi(x_i) + \phi(x_j)^T \phi(x_j)$
$\, =k_{ii} - k_{ij} - k_{ji} + k_{jj}$
$\, =k_{ii} -2k_{ij} + k_{jj}$
where $\, K=\phi^T \phi$ is the $n \times n$ kernel matrix.
We can summarize this third constraint as:
$\, n_{ij} ||x_i - x_j||^2 = n_{ij} || \phi(x_i) - \phi(x_J) ||^2$
where $n_{ij}$ is the neighbourhood indicator, that is, $n_{ij} = 1$ if $x_i, x_j$ are neighbours and $0$ otherwise.

#### Optimization problem

One option for finding the correct K would be to minimize the rank of K. The intuition behind this is that even though the data might be high-dimensional, it lies on a manifold of low dimension. Since $rank$ is not a convex function, the objective function is ill-defined and is replaced with $Tr(K),$ since when K is semidefinite the trace is just equal to the sum of the singular values. However, in practice minimizing the trace does not result in a faithful low-dimensional representation. An explanation for this phenomenon is still an open problem.

Recall that in PCA, $Var(X) = \boldsymbol{\lambda}_1 + \boldsymbol{\lambda}_2 + \cdots + \boldsymbol{\lambda}_d = Tr(S)$. To find a useful K, the objective of $Tr(K)$ can still be used, but the problem has to be changed from a minimization to a maximization. The intuition here is that you are maximizing the total variance explained by the kernel. Thus the problem is

$\text{max } Tr(K)$
$\text{subject to above constraints}$

which is a semidefinite program.

#### MVU (Maximum Variance Unfolding) Method

1. Construct the neighbourhood graph using k-nearest neighbours
2. Find the kernel K by solving the SDP problem: max $Tr(K)$ such that
$K_{ii}+K_{jj}-2K_{ij} = D_{ij}, \sum_{ij} K_{ij} = 0, K \succeq 0$
3. Apply Kernel PCA on the learned kernel K

Maximum Variance Unfolding is a useful non-linear interpretation of the PCA intuition. MVU is robust and will preserve the nearest neighbours of the original data in any mapping.

To categorize dimension reduction methods we have learnt so far

Locality Preserving methods Globality Preserving methods
Linear methods LLE MDS, PCA
Non-linear methods ISOMAP, LPP(locality preserving projection) MVU, KPCA

## Action Respecting Embedding and Spectral Clustering (Lecture 7: Oct. 6, 2014)

### SDE Paper Notes

Unsupervised Learning of Image Manifolds by Semidefinite Programming introduced the maximum variance unfolding technique. The results of the paper compare the proposed technique against existing dimensionality reduction techniques. The first figure compares SDE against PCA in the "faces" dataset. To maintain the entire dataset variation we need 5 dimensions using SDE, while we need 10 dimensions using PCA. The second figure compares SDE against ISOMAP in a non-convex dataset. To maintain the whole dataset variation we need only 2 dimensions using SDE, while we need more than 5 dimensions using ISOMAP. The two figures show that SDE can represent data variation using a much smaller number of eigenvectors than PCA or ISOMAP. Also, if two points take the same action, the distance between theses two points should not change( this is important ). Rotation and transaction matrix are the only way to remain the distance.

SDE against ISOMAP showcases a known problem that ISOMAP can struggle against non-convex data.

### Action Respecting Embedding

When we are dealing with temporal data (data in which the order is important) we could consider a technique called Action Respecting Embedding. Examples include stock prices, which are time series, or a robot that moves around and takes pictures at different positions. Usually there are various actions that transform one data point into the next data point. For example, consider a robot taking pictures at different positions: actions include stepping forward, stepping backward, stepping right, turning, etc. After each action, the robot takes a picture at its current position, then undertakes another action, continuing until a set of data points is collected. Or consider a financial time series, where actions include stock splits, interest rate reduction, etc. Or consider a patient under a treatment, where actions include taking a specific step in the treatment or taking a specific drug, which could affect blood sugar level, cholesterol, etc.

Some information cannot be captured in the form of similarities.

Action Respecting Embedding takes a set of data points $\, x_{1},...,x_{n}$ along with associated discrete actions $\, a_{1},...,a_{n-1}$. The data points are assumed to be in temporal order, where $\, a_{i}$ was taken between $\ x_{i}$ and $\ x_{i+1}$.

Contain all actions to be distance preserving transformations in the feature space.

Now we realize that intuitively, if two points have the same action, then the distance between them should not change after the action. There are only three types of transformations that do not change the distance between points. They are rotation, reflection, and translation. Since rotation and reflection have the same properties, we will call both of these rotation for our purposes. Any other transformation does not preserve distances. Note that we are able to model any action in low dimensional space by rotation, translation or any affine combination of them.

Action Respecting Embedding has similar algorithm to SDE[7]

Algorithm

1) construct a neighbourhood representation

2) solve to find maximum variance subject to constraints

3) construct lower dimensional embedding from eigenvectors of the kernel

Differences:

1) construct neighbourhoods based on action pairs

2) adding action constraints

### Optimization Problem

Recall that kernel function satisfies the property: $\, |\phi(x_i) - \phi(x_j)|^2 = | x_i - x_j |^2$. We can formulate the idea of distance preserving mathematically as $\, |\phi(x_i) - \phi(x_j)|^2 = |\phi(x_i+1) - \phi(x_j+1)|^2$, if $\, x_i, x_j$ are points that perform the same type of transformations.

The maximum variance unfolding problem is

$\, \text{max tr}(K)$
such that
$\, |\phi(x_i) - \phi(x_j)|^2 = k_{ii} -2k_{ij} + k_{jj}$

$\sum\limits_{i,j} k_{ij} = 0$

$k \succeq 0$

The Action Respecting Embedding problem is similar to the maximum variance unfolding problem, with another constraint:

For all i, j, $a_{i} = a_{j}$,

$\,k_{(i+1)(i+1)} - 2k_{(i+1)(j+1)} + k_{(j+1)(j+1)} = k_{ii} - 2k_{ij} + k_{jj}$

### A Simple Example

Consider the two examples presented in the ARE paper [8] of a robot moving around taking pictures with some action between them. The input is a series of pictures with actions between them. Note that there are no semantics for the actions, only labels

The first example is a robot that takes 40 steps forward then takes a 20 steps backward. The following figure (which is taken from ARE paper [9] shows the action taking by the robot and the results of ARE and SDE. We can see that ARE captures the trajectory of the robot. Although ARE doesn't have semantics for each action it captured an essential property of these actions that they are opposite in direction and equal in magnitude. We could look at the position on manifold vs. time for a robot that took 40 steps forward and 20 steps backward. We could also look at the same analysis for a robot that took 40 steps forward and 20 steps backward that are twice as big.

Another complicated action and the result of ARE and SDE are shown in the next figure (taken from ARE paper [10]), where we can see that ARE captured the trajectory of the robot.

Sometimes it's very hard for humans to recognize what happened by merely looking at the data, but this algorithm will make it easier for you to see exactly what happened.

### Types of Applications

Action-respecting embeddings are very useful for many AI applications, especially those that involve robotics, because actions can be defined as simple transformations that are easy to understand. Another common type of application is planning, that is, when we need to find a sequence of decisions which achieve a desired goal, e.g. coming up with an optimal sequence of actions to get from one point to another.

For example, what type of treatment should I give a patient in order to get from a particular state to a desired state?

For each action $a$, we have a collection of data points $\, (y_{t},y_{t+1})$ such that $\, y_{t} \to y_{t+1}$.

Through the algorithm, we learn $\, f_{a}$: $\, f_{a}(y_{t}) = A_{a}y_{t} + b_{a} = y_{t+1}$ where $A_{a}$ is that rotation matrix for action $a$, and $b_{a}$ is the translation vector vector for action $a$. We can estimate $A_{a}$ and $b_{a}$. These can be solved in closed form so that we will know what happens to any input $(y_t)$ if action $a$ is taken.

### Spectral Clustering

Clustering is the task of grouping data so that elements in the same group are more similar (or dissimilar) to each other than to the elements in other groups. In spectral clustering, we represent first the data as a graph and then perform clustering.

For the time being, we will only consider the case where there are two clusters.

Given $\{x_i\}^{n}_{i=1}$ and notation of similarity or dissimilarity, we can represent the data as a weight undirected graph $\,G=(V,E)$ where V represents the vertices (the data) and E represents the edges (similarity or dissimilarity between two data - for now we will assume simliarity).

Using intuition, we can see that the goal is to minimize the sum of the weights of the edges that we cut (i.e. edges we remove from G until we have a disconnected graph)

$cut(A,B)=\sum_{i\in A, j\in B} w_{ij}$

where A and B are the partitions of the vertices in G made by a cut, and $\,w_{ij}$ is the weight between points i and j.

Consider the following graphs of data:

Although "cut" may minimize the edge weights that we use to partition, note that using cut(A,B) may cause many edges to be cut, but which have smaller weights. For example, if there is one datum that is very dissimilar to the rest of the data, we may end up partitioning this one point to itself despite it resulting in many edges being cut. This is the case in Example 3 above. This occurrence may cause an imbalance in partition sizes. To avoid this problem, we minimize the ratio-cut.

We define ratio-cut as: $ratiocut = \frac{cut(A, \bar{A})}{|A|} + \frac{cut(\bar{A}, A)}{|\bar{A}|}$ where $\, |A|$ is the cardinality of the set $\, A$ and $\, \bar{A}$ is the complement of $\, A$.

This equation will be minimized if there is a balance between set $\, A$ and $\bar{A}$. This, however, is an NP Hard problem; we must relax the constraints to arrive at a computationally feasible solution.

Let $f \in \mathbb{R}^n$ such that

$f_{i} = \left\{ \begin{array}{lr} \sqrt{\frac{|\bar{A}|}{|A|}} & i \in A\\ -\sqrt{\frac{|A|}{|\bar{A}|}} & i \in \bar{A} \end{array} \right.$

We will show in the next lecture that minimizing $\, \sum_{ij}w_{ij}(f_{i}-f_{j})^2$ is equivalent to minimizing the ratio-cut.

#### Choices of $W$

##### The $\epsilon$-neighbourhood graph

If the pairwise distance are smaller than $\epsilon$ between two points $i, j$, let $w_{ij}=1$, else let $w_(i,j)=0$. This similarity graph is generally considered an unweighted graph.

##### The $k$-nearest neighbourhood graph

If the $v_j$ is among the $k$-nearest neighbours $v_i$, then $w_{ij}=1$. This will result in a non-symmetric matrix, so we can either let

1. $w_{ij}=w_{ij}=1$ as long as $v_j$ is among the $k$-nearest neighbour of $v_i$ or $v_i$ is among the $k$-nearest neighbour of $v_j$; or let
2. $w_{ij}=w_{ij}=1$ only when $v_j$ is among the $k$-nearest neighbour of $v_i$ and $v_i$ is among the $k$-nearest neighbour of $v_j$.

The resulting matrix is symmetric.

##### The fully connected graph

As the graph is fully connected, the weight function should reflect the local neighbourhood relationships. A Gaussian similarity function is often chosen in this case.

## Spectral Clustering (Lecture 8: Oct. 8, 2014)

### Problem Formulation and Notations

Once we are given a set of points $\, x_{1},...,x_{n}$ and a sense of what it is to be similar or dissimilar, we can represent these data points as a weighted undirected graph. Let $\, G=( V, E)$ be a graph, where V is the set of vertices and E is the set of edges.

In graph theory, partitioning the set of vertices of a graph into two disjoint sets is called a cut. Consequently the set of edges which has one end point in each subset (partition) is called a cut-set, which crosses the cut. In a connected graph each cut-set determines a unique cut. So, in some cases cuts are identified with their cut-sets rather than with their vertex partitions.

The cut $\, C = ( A, B )$ partitions $\, V$ from $\, G$ into two subsets $A$ and $B$.

The cut-set of a cut $\, C = ( A, B )$ is the set $\{ ( i,j ) \in E | i \in A, j \in B \}$ of edges that have one endpoint in $A$ and the other endpoint in $B$. If $i$ and $j$ are specified vertices of the graph $G$, an $i$ $-$ $j$ cut is a cut in which $i$ belongs to the set $A$ , and $j$ belongs to the set $B$.

The weight of a cut is different for different types of graphs:

• In an Unweighted Undirected graph, the size or weight of a cut is the number of edges crossing the cut, or 1 for each edges crossing the cut.
• In a Weighted graph, the value or weight is defined by the sum of the weights of the edges crossing the cut.

Cut can be minimized or maximized, as defined below:

• A cut is minimum if the size or weight of the cut is not larger than the size of any other cut. In other words, partitioning vertices of a graph into two disjoint subsets that are joined by at least one edge is minimum cut, which is 2 here in the following example.
• A cut is maximum if the size of the cut is not smaller than the size of any other cut, which is 5 here in the following example.

The total weight of the cut is

$cut ( A, B) = \sum_{i \in A, j \in B} w_{ij}$

Minimizing this is a combinatorial problem. In order to result in partition that are closer to balanced, consider ratio cut defined in terms of the size of each subgraph as follows:

$ratiocut (A, \bar{A}) = \frac{cut( A, \bar{A})} {|A|} + \frac{cut( \bar{A}, A)} {| \bar{A}|}$

Solving this problem is NP hard, and thus relaxation of the constraints is required to arrive at a sub-optimal solution. Consider the following function $f$:

$f \in {R}^n$ $f_i = \left\{ \begin{array}{lr} \sqrt{\frac{| \bar{A}|}{| A|}} & i \in A\\ -\sqrt{\frac{| A|}{| \bar{A}|}} & i \in \bar{A} \end{array} \right.$

The function $f$ assigns a positive value to any point $i$ that belongs to $A$, and a negative value to any point $j$ that belongs to $\bar{A}$.

Ex. If $| A| = | \bar{A}|$ then the values for $i$ will be +1 and values for $j$ will be -1.

### Derivation

Problem: Prove that minimizing $\, \sum_{i,j} w_{ij} * ( f_i - f_j)^2$ is equivalent to minimizing ratiocut $( A, \bar{A})$

Proof:

We have 4 cases:

1. $i \in A \; and\; j \in A$ ,
2. $i \in A \; and\; j \in \bar{A}$ ,
3. $i \in \bar{A} \; and\; j \in A$ , and
4. $i \in \bar{A} \; and\; j \in \bar{A}$

$\sum_{i,j} w_{ij} * ( f_i - f_j)^2 = \sum_{ i \in A, j \in \bar{A}} w_{ij} * (\sqrt{\frac{| \bar{A}|}{| A|}} + \sqrt{\frac{| A|}{| \bar{A}|}}) ^ 2 + \sum_{ i \in \bar{A}, j \in A} w_{ij} * (- \sqrt{\frac{| A|}{| \bar{A}|}} - \sqrt{\frac{| \bar{A}|}{| A|}}) ^ 2$

since both

$\sum_{ i \in A, j \in A} w_{ij} * ( f_i - f_j)^2 =0 \;$
and
$\; \sum_{ i \in \bar{A}, j \in \bar{A}} w_{ij} * ( f_i - f_j)^2 = 0$

$= \sum_{ i \in A, j \in \bar{A}} w_{ij} * (\frac{| \bar{A}|}{| A|} + \frac{| A|}{| \bar{A}|} + 2 * \sqrt {\frac{| \bar{A}|} {| A|} * \frac{| A|} {| \bar{A}|}}) + \sum_{ i \in \bar{A}, j \in A} w_{ij} * ( \frac{| A|}{| \bar{A}|} + \frac{| \bar{A}|}{| A|} + 2 * \sqrt {\frac{| A|} {| \bar{A}|} * \frac{| \bar{A}|} { | A|}})$

Note that

$\frac{| \bar{A}|} {| A|} * \frac{| A|} {| \bar{A}|} = I$
and
$2 = 1 + 1 = \frac{| A|} {| A|} + \frac {| \bar{A}|} {| \bar{A}|}$

$= \sum_{ i \in A, j \in \bar{A}} w_{ij} * (\frac{| \bar{A}|+| A|}{| A|} + \frac{| A| + | \bar{A}|}{| \bar{A}|}) + \sum_{ i \in \bar{A}, j \in A} w_{ij} * ( \frac{| A|+| \bar{A}|}{| \bar{A}|} + \frac{| \bar{A}| + | A|}{| A|} )$

$= (\frac{| \bar{A}|+| A|}{| A|} + \frac{| A| + | \bar{A}|}{| \bar{A}|}) * \sum_{ i \in A, j \in \bar{A}} w_{ij} + (\frac{| \bar{A}|+| A|}{| A|} + \frac{| A| + | \bar{A}|}{| \bar{A}|}) * \sum_{ i \in \bar{A}, j \in A} w_{ij}$

Note that

$\sum_{ i \in A, j \in \bar{A}} w_{ij} = cut ( A, \bar{A})$
and
$\sum_{ i \in \bar{A}, j \in A} w_{ij} = cut ( \bar{A}, A)$ .

Therefore, our main equation becomes:

$(\frac{| \bar{A}|+| A|}{| A|} + \frac{| A| + | \bar{A}|}{| \bar{A}|}) * cut ( A, \bar{A}) + (\frac{| \bar{A}|+| A|}{| A|} + \frac{| A| + | \bar{A}|}{| \bar{A}|}) * cut ( \bar{A}, A)$

Since $cut( A, \bar{A}) = cut( \bar{A}, A)$, then:

$(\frac{| \bar{A}|+| A|}{| A|} + \frac{| A| + | \bar{A}|}{| \bar{A}|}) * 2 * cut ( A, \bar{A})$

Let $n = | \bar{A}| + | A|$, then:

$(\frac{ n}{| A|} + \frac{ n}{| \bar{A}|}) * 2 * cut ( A, \bar{A})$

$= 2 * n * (\frac{cut( A, \bar{A})}{| A|} + \frac{cut( \bar{A}, A)}{| \bar{A}|})$

$= 2 * n * ratiocut( A, \bar{A})$

Since n is always non-negative, minimizing ratiocut is equivalent to minimizing $2 * n * ratiocut( A, \bar{A})$. Therefore, minimizing $\, \sum_{i,j} w_{ij} * ( f_i - f_j)^2$ is equivalent to minimizing $ratiocut( A, \bar{A})$. This completes the proof.

Problem: find an appropriate relaxation of the optimization problem.

Define $\, {W = [w_{ij}]_{n \times n}}$ to be the matrix of all weights.

${W = \begin{bmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,n} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n,1} & w_{n,2} & \cdots & w_{n,n} \end{bmatrix}}$

Define $D$ to be a diagonal nxn matrix such that $d_i$ is the sum of the elements in row $i$ of $W$:

$\, {d_i = \sum_j w_{ij}}$

${D = \begin{bmatrix} d_{1} & 0 & \cdots & 0 \\ 0 & d_{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_{n} \end{bmatrix}}$

Define $L$ is a ${n \times n}$ matrix such that $L = D - W$, $L$ is called the Laplacian of the graph.

Show: $\, {2 f^TLf = \sum_{ij} w_{ij} (f_i - f_j)^2}$

Proof:

$\, {L}$ and $\, {w_{ij}}$ are known.

$\, {L.H.S = 2 f^T L f = 2 f^T (D-W) f = 2 f^T D f - 2 f^T W f}$

Note that $\, {f^T D f}$ can be written as ${\sum_i d_i f_i^2}$ and $\, {f^T W f}$ can be written as $\, {\sum_{i,j} w_{ij} f_i f_j}$

Recall that we want to reach to the quadratic form in the right hand side ${(f_i- f_j)^2 = f_i^2 - 2f_i f_j + f_j^2}$

${L.H.S = 2 \sum_i d_i f_i^2 - 2 \sum_{i,j} w_{ij} f_i f_j = \sum_i d_i f_i^2 + \sum_{i} d_i f_i^2 - 2 \sum_{i,j} w_{ij} f_i f_j}$

By changing the $i$ in the second term into $j$

${L.H.S = \sum_i d_i f_i^2 + \sum_{j} d_j f_j^2 - 2 \sum_{i,j} w_{ij} f_i f_j = \sum_i d_i f_i^2 - 2 \sum_{i,j} w_{ij} f_i f_j + \sum_{j} d_j f_j^2}$

By substituting the value of $d_i$

${L.H.S = \sum_{i,j} w_{ij} f_i^2 - 2 \sum_{i, j} w_{ij} f_i f_j + \sum_{i,j}w_{ij}f_j^2 =\sum_{i,j} w_{ij} (f_i - f_j)^2 = R.H.S}$

Which means minimizing the R.H.S minimizes the L.H.S

### Selection of L

In class we discussed the Laplacian, other selections for L include normalized Laplacians such as $\, L_{rw}=D^{-1}L=I-D^{-1}W$ or $\, L_{sym}=D^{-\frac{1}{2}}LD^{-\frac{1}{2}}$

### Relaxation of constraint

Observe that $f^T f = \sum_{i=1} f_i^2 = \sum_{i \in A} \frac{|\bar{A}|}{|A|} + \sum_{i \in \bar{A}} \frac{|A|}{|\bar{A}|} = |A|*\frac{|\bar{A}|}{|A|} + |\bar{A}| * \frac{|A|} {|\bar{A}|} = n$

Therefore our goal is to solve the following optimization problem

$\, \text{min}_f( {f^T L f})$

$\, s.t. {f^T f = n}$

instead of solving the more difficult ratiocut optimization problem

Our new problem can be solved by using Lagrange multiplier:

$\, {L(f, \lambda) = f^T L f - \lambda (f^Tf -n)}$

${\frac{\delta L}{\delta f} = 2 Lf - 2 \lambda f}$

Set $\frac{\delta L}{\delta f} = 0$, we get: $\, { Lf = \lambda f}$
The equation above is clearly satisfied if and only if $f$ is an eigenvector of L, and $\lambda$ is the corresponding smallest eigenvalue.

[11] Since a graph with one connected component has only one constant eigenvector ${u = 1_n}$ that is associated with ${0}$ eigenvalue. And as we can see from the previous equation that ${f}$ is the eigenvector of ${L}$. So ${f}$ is the eigenvector corresponding to the second smallest eigenvalue, with positive elements corresponds to one class and negative elements correspond to the other one.

Similarly to LLE, the eigenvector corresponding to the zero eigenvalue is ignored. It turns out that LLE computes a specific Laplacian for the graph.

### Spectral Clustering Results

The following figures show [12]the results of the spectral clustering against other clustering algorithms. As clustering is considered as an ill-defined problem it's hard to say that one algorithm groups the points better than other algorithms. However, in the figures below, spectral clustering groups the points as a person may.

The most notable issue with spectral clustering is its dependency on the definition of "similar" and/or "dissimilar". For example, suppose a class of students were to be cut into two groups. If weights were given to pairs of students according to the similarity of their gender, then the two groups would presumably be separated into males and females. However, this weighting doesn't account for other similarities the class might have such as colour of clothing worn that day, or less visible similarities such as number of hours slept the previous night. The definition of "similar" is therefore what drove the cut, and not the actual similarities between the students. Spectral clustering is a problem that only has subjective solutions due to the subjective nature of the problem.

### Multiclass Spectral Clustering

Spectral clustering can be used in multiclass case by choosing the smallest $k+1$ eigenvectors and ignoring the smallest one (associated with $0$ eigenvalue), this will result in a matrix of size $n * k$ which can be used as a covariate of other clustering algorithm.

Alternatively, the k eigenvectors can be used to create $2^k$ clusters where the the cluster to which a point is assigned is given by converting the signs of each element in the matrix to binary then reading off the cluster number of each row as the number whose binary representation is given by that row. This is akin to cutting the data into two parts $k$ times where each subsequent cut approximates the best ratiocut based cut when the previous cuts have already been made. This assigns points with the same sign in the matrix to the same cluster.

### Laplacian Eigenmap

Laplacian eigenmap [13] solves the problem of dimensionality reduction by obtaining the eigenvectors that correspond to the lowest ${d+1}$ eigenvalues and project the data to these eigenvectors. This is very similar to the spectral clustering.

We can interpret the multiclass spectral clustering as a dimensionality reduction step before applying a clustering technique.

Also spectral clustering can be interpreted from Random walk point of view as explained here [14]: "Random walk is a stochastic process on a graph which moves randomly from node to node. So spectral clustering tries to find a graph partition such that random walk stays long within a cluster and with seldom moves from one cluster to another."

A side note: The frobenius norm of the covariance matrix can measure the linear correlation between two variables, the larger the norm the larger the correlation. And this is true only for linear correlation as for the nonlinear correlation data should be mapped to Hilbert space first before applying the covariance operator and obtain the Schmidt norm to get a measure of dependence. Supervised PCA uses this concept.

The following is a matlab sample code that shows the norm of a covariance matrix for uncorrelated, linearly correlated, non-linearly correlated random variables.

>> X1 = randn(1, 10000);
>> X2 = randn(1, 10000);  % x1 and x2 are independent
>> X = [X1;X2];
>> cov(X')        % 2 by 2 matrix.
%    Covariance is small since x1 and x2 and independent.  So we look at the norm of covariance.
>> norm(cov(X'))  % small as X1 and X2 are independent.
%    This is a measure of correlation between these two, which is a measure of linear dependence.
>> X2 = 2 * X1;
>> X = [X1;X2];
>> cov(X')
>> norm(cov(X')) % large as X1 and X2 are linearly dependent
>> X2 = sin(X1);
>> X = [X1;X2];
>> cov(X')
>> norm(cov(X')) % not large as the dependency between X1 and X2 is not linear.


LE(Laplacian eigenmap) uses the similar approach with LLE. It is fast but sometimes can not provide an ideal outcome. It's difference with other DR(dimension reduction) methods is that LE maintains an ideal robustness while there are outliers in the dataset, which is a property that other DR methods do not process.

## Supervised Principal Component Analysis (Lecture 9: Oct. 15, 2014)

### Assignment 1 Solutions

Overall the first assignment was done quite well. The average mark for undergrads was approximately 70% and the average grade for graduates was approximately 75%. This lecture will focus on the details of the first assignment and going over the solutions. The solutions will not be posted online.

•  Question 1 was done quite well. Some marks were deducted for unusual scaling of eigenvalues (division by (n-1)). In general it is a good idea to scale the eigenvalues, however, it will usually not affect whether the PCA function works or not. Marks were given simply for having a picture for most of the questions. Note that the explanations for the questions should have some detail, one-word answers are not sufficient.
• Question 2 part(a) had some issues. Some students plotted too many eigenvectors. Also in general, the fewer PCs we keep, the better. When determining how many eigenvectors to keep, look for an "elbow" in the plot for when the variance represented by a given eigenvector drops off.
• Question 3 was very well done, as long as you had some pictures, you would have gotten most of the marks.
• Question 4 had some issues. These are very well-known properties, so you should prove it using the definitions and show more detailed steps.
• Question 5 was marked based on how far a student correctly followed the solution. Take the partial derivatives according to $\mu$, and according to $y_i$, set equal to 0 and simplify. Substituting one into the other and further simplifying gives the answer.
• Question 6 part(a) as well as part(c) were done well. Part(b) required showing that all eigenvectors of $\tilde S$ are also eigenvectors of $\, S$

### Introduction to Supervised Principal Component Analysis

Motivation: There are several applications in which we would like to reduce the dimensionality of data to be able to deal with it easier, however we would like to keep some desired properties or features of data in low dimensional representation at the same time. In fact, we don't want to reduce the dimensionality blindly.

For example, suppose that we would like to rank undergrad students in CS departments in top universities in Canada according to their grades in different courses. Suppose that there are 2000 students and 20 common courses. Since the data set is large, we want to reduce the dimensionality (number of courses) to 3 dimensions to process faster. If we run a PCA algorithm blindly, the algorithm returns the 3 courses in which students' grades have most variation. Assume there is a course (for example, Data Visualization) that we would like to keep in our top three courses because its grade is important in our evaluation.

Sometimes there is a feature or dimension of the data that we don't want to omit during dimensionality reduction. This is where supervised PCA is used.

Principal component analysis (PCA) may not be the ideal way for processing multiple sets of data at once or data with different classes of characteristics (dataset with different modes of variation), since the principal component which maximizes variation may not preserve these characteristics/classes. As such, maximizing the variance sometimes is not the best way to reduce the dimensionality of the data. As shown in the below figure (taken from here [15]), we can see that projecting the data on the direction of maximum variation will not make the two classes discriminated in the lower dimensions compared to FDA.

Sometimes we will do clustering as a pre-algorithm, or we can incorporate side data and use labels in PCA in order to create a better dimensionality reduction. In these cases, we are more interested in finding subspaces with respect to the labels of the original data instead of those that maximizes the variance.

Example to show the shortcoming of PCA compared to FDA.

In dimensionality reduction we might have some "side information" or labels. In low dimension data we want points with the same label to be close to each other. Consider if you have a set of images of faces. The different faces can be divided into glasses/no glasses classes or beards/no beards classes, or I have other information about which people are similar. In low dimensional representation I want these points to be close to each other. KPCA may be able to distinguish some pre-specified features in the picture, while PCA could only give components that have max variance. In dimension reduction, we might have some "side information" or labels.

### Problem Statement

Given a set of data $(X,Y),\; X \in \mathbb{R}^p$ are p-dimensional explanatory variates and $Y\in \mathbb{R}^l$ l-dimensional response variables, the goal is to estimate a low-dimension subspace S such that S contains as much predictive information about the response variable Y as the original explanatory variable X.

Alternatively, given an i.i.d (independent and identically distributed) sample $\{(x_1,y_1), (x_2,y_2) \dots (x_N,y_N)\}$, we are looking for an orthogonal projection of X onto $\,S=U^T X$ such that Y depends mainly on $\,U^T X$.

### Related Work

There are many families of algorithms to solve this problem, some of which are briefly described below.

#### Classical Fisher's discriminant analysis (FDA)

This algorithm is very famous, however it will not be covered in this class. This[16] or [17] are brief introductions to FDA.

#### Metric Learning (ML)

The technique attempts to construct a Mahalanobis distance over the input space, which could then be used instead of Euclidean distances. ML methods search for a metric (or equivalently, a linear transformation) under which points in the same class (similar points) are near each other and simultaneously far from points in the other classes. We need to minimize the sum of distances between similar class samples and maximize the sum of distance between different class samples.

Euclidean Distance: $\,(x_i - x_j)^T(x_i - x_j)$

Mahalanobis Distance: $\,(X_i - x_j)^TA(x_i - x_j)$ , where A is a positive, semi-definite matrix

Write $\, A = UU^T$

$\,d_A(x_i, x_j) = (U^Tx_i - U^Tx_j)^T(U^Tx_i - U^Tx_j)$

Thus, we obtain the mapping

$x \mapsto U^Tx$

More information can be found in Metric Learning.

This website introduces several strategies on metric selection Distance Metric Learning

#### Sufficient Dimension Reduction (SDR)

In sufficient dimension reduction, we attempt to find a linear subspace $S \subset X$ such that $P_{Y\mid X}(y\mid X)$ is preserved.

i.e $P_{Y\mid X}(y\mid X) = P_{Y\mid U^TX}(y\mid U^TX)\; \forall x,y$

#### Supervised Principal Components (SPCA)

• This version of Supervised Principal Components is different from the the Supervised Principal Component Analysis that we are going to talk about in this lecture. It is proposed by Bair er. al., and the original paper can be found here[18].
• The idea is to perform PCA only on selected data with a strong estimated correlation to the response variable (Y). This removes some unseen factors that are unrelated to the response variable (Y).
• This method is commonly used in biostatistics, but it's different from the one we are covering this course.
##### Steps
1. Recall that in linear regression the coefficient for feature j is computed as:
$s_j = \dfrac{X_j^T Y}{\sqrt{X^T_jX_j}}$
Note that $X$ doesn't have to be centered. This is essentially a measure of similarity between feature j (row j) and output Y.
2. Replace the data matrix so that it only consists of features for whom the absolute value of the calculated coefficients exceeds some determined threshold $\theta$.
3. Compute the first (few) principal components of the reduced data matrix.
4. Use the principal component(s) calculated in step 3 in a regression model or classification algorithm to predict the outcome.

SPCA method focuses on a region of interest from the spatial domain as training set, thus it will be more suitable to extract the property of data that is at most interest.

### Hilbert-Schmidt Independence Criterion (HSIC)

Two random variables $X\in \mathbb{R}^p$ and $Y\in \mathbb{R}^q$ are independent if and only if any bounded continuous function of them is uncorrelated.

If we map these random variables to reproduce kernel Hilbert spaces (RFHS), we can test for correlation. Problem: Given ${(x_1, y_1), . . . ,(x_m, y_m)}\∼ . Pr(x, y)$ determines whether $\,Pr(x, y) = Pr(x)Pr(y)$.

To do so, we define a RKHS $F:X\to \mathbb{R}$ containing all continuous bounded real-valued functions of x,

and a RKHS $G:Y\to \mathbb{R}$ containing all continuous bounded real-valued functions of y.

Distance correlation can provide a new approach to testing the joint independence of random vectors. Measuring and testing dependence by correlation of distances is explored here.

Let $Z:= \{(x_1,y_1),\dots, (x_n,y_n) \} \subseteq X\times Y$ be n independent observations from $\, \rho_{X,Y}$.

An estimator of HSIC is:

$HSIC:= \dfrac{1}{(n-1)^2}Tr(KHBH)$

where $H,K,B \in \mathbb{R}^{n\times n}$, $\,K_{ij} = K(x_i,x_j)$ is a kernel of X, $\,B_{ij} = b(y_i,y_j)$ is a kernel of Y,and $\,H = I - \frac{1}{n}ee^T$ is the centering matrix. Note that K and B are positive semidefinite.

The HSIC is 0 if and only if X and Y are independent. Thus the estimator described above can be used to test for independence. For more details, see [19].

How far are two distributions from each other? A measure of similarity of $\, P$ and $\, Q$ is $\, X \sim P, Y \sim Q$ where P and Q are probability distributions. We can calculate the distance between distributions $\, P$ and $\, Q$ as follows:

$\mid P-Q \mid^2 = \mid \frac{1}{n}\sum \phi(x_i)-\frac{1}{m}\sum\phi (y_i) \mid^2$
$= (\frac{1}{n}\sum \phi(x_i)-\frac{1}{m}\sum\phi (y_i))^T (\frac{1}{n}\sum \phi(x_i)-\frac{1}{m}\sum\phi (y_i))$
$= \frac{1}{n^2} \sum K(x_i,x_j)+\frac{1}{m^2}\sum K(y_i,y_j) -\frac{2}{mn} \sum K(x_i,y_j)$

So does the mean capture all the characteristics of the distribution in the higher dimensional space. The idea of this measure of distance is counter-intuitive at first glance, since it states that after mapping into high dimension space using $\phi(x)$, we can measure the distance of distribution by using the mean of the sample data points. It's difficult to find an explicit $\phi(x)$, but we can use kernel trick to solve the problem.

Note that, the mean in the original dimensions is not enough to capture the characteristics of the distribution. For example, the following figure shows a two different distributions with exact same mean but with different variance. So, say we have a point in 2 dimensions, the first dimension of the point is the mean and the second dimension of the point is the variance. So, this point captures these two characteristic of the distribution. So, in infinity dimension, a point can capture all momentum of the distribution and therefore, only mean in higher space can capture the characteristic of the distribution.

Example of different distributions with same mean.

Now that we have a way to measure the distance between two distributions, we have a way to test for independence since:

$X, Y$ independent $\Rightarrow P_{X,Y}(x,y) = P_X(x)P_Y(y) \Rightarrow \mid P_{X,Y}(x,y)-P_X(x)P_Y(y) \mid = 0$

## Hilbert Schmidt Independence Criterion (Lecture 10: Oct. 20, 2014)

### Introduction to the Hilbert Schmidt Independence Criterion (HSIC)

Last lecture we reviewed the background and set-up for SPCA. Given data, X, and corresponding labels, Y, if we want to maximize the dependency between $U^TX$ (the projected data) and $Y$, then we will need the Hilberts Schmidt Independence Criterion (HSIC). When data are mapped to a higher dimension space (e.g. a kernel), we hope that the data and dependency are both linear with respect to that space. However, some cases won't necessarily conform to these assumptions. This was explained last lecture with the example of two distributions with the same mean, but with different variances.

Intuition: Two random variables $x$ ~ $p$ and $y$ ~ $q$ in $R^n$ are independent if and only if any bounded continuous function of them is uncorrelated

$|p-q|^2$ (which is called Maximum Mean Discrepancy (MMD) [20]) measures distance between two distributions $p$ and $q$. It also allows us to check independence since $p(x,y)=p(x)p(y)$ implies independency. We can re-write the distance equation as $|p(x,y)-p(x)p(y)|^2$ where expanding $p(x,y)-p(x)p(y)$ gives the Hilbert Schmidt Independence Criterion (HSIC). This term is based on the norm of the covariance matrix in Hilbert space.

#### Derivation of HSIC

Goal: Given an $i.i.d$. sample, $\,X$, and labels, $\,Y$, we want an orthogonal projection of the data onto $\, U^TX$ such that $\,Y$ depends on $\,U^TX$.

An estimator of $HSIC$ is given by:

$HSIC = \frac{1}{(n-1)^2} Tr(KHBH)$
where
• $K$ is a kernel over $X$, positive semidefinite
• $B$ is a kernel over $Y$, positive semidefinite
• $H$ is the centering matrix given by $H=I-\frac{1}{n} ee^T$

We then want to maximize maximize the dependency of $\,Y$ on $\, U^TX$

This is done by:

1. Compute the linear kernel of $\,U^TX$ :
$\,K_{n \times n}=X^TUU^TX$ is a $\,n \times n$ matrix
Notice the similarity to Kernel PCA, which finds $\,X^TX$, the linear kernel of $\,X$
2. Compute a kernel of labels $\,Y$ :
$\,B_{n \times n}=Y^TY$ is a $\,n \times n$ matrix
3. According to HSIC, the dependency between $\,U^TX$ and $\,Y$ can be computed as
$\,\underset{U}{max} \; \frac{1}{(n-1)^2}Tr(KHBH)$
where $K_{n \times n}= (U^TX)^T U^T X = X^TUU^TX$
• Since $\frac{1}{(n-1)^2}$ doesn't depend on $U$, we only need to maximize the trace, where the only unknown is $U$
• So we solve
$\underset{U}{max} \; Tr(X^TUU^TXHBH)$

Since the trace of a matrix is invariant under cyclic permutations, this can be rewritten as:

$= \underset{U}{max} \; Tr(HX^TUU^TXHB)$
$= ... = \underset{U}{max}\; Tr(U^TXHBHX^TU)$
Letting $\,Q = XHBHX^T$, the problem becomes $\underset{U}{max} \; U^TQU$.
• There is no upper bound on this optimization problem which means with larger U our objective function get bigger and bigger. So, our problem is ill-defined. Thus, we need put such a constraint to restrict $\,U$ and solve our optimization problem under that consideration.
Therefore, we are going to solve our objective function such that: $\,U^TU = I$ (a.k.a. $\,U$ is an orthogonal matrix) as our restriction, just like with $PCA$
• To optimise this problem, use the Lagrange Multiplier:
$L(U, \Lambda) = Tr(U_{}^TXHBHX^TU^T)-Tr(\Lambda(U^TU-I))$
$\, = Tr(U^TQU)-Tr(\Lambda(U^TU-I))$
Taking the derivative of this with respect to U and setting the derivative to zero yields
$\,2QU - 2\Lambda U = 0$
$\,QU = \Lambda U$
$\,(XHBHX^T)U = \Lambda U$
Hence adding this restriction implies we need to find the $p$ largest eigenvectors of $XHLHX^T$ which is the optimal solution for U
• $\,U^TX$ is the projection of X on a lower dimension space, but in this context we are ensuring it has maximum dependence on $Y$, rather than trying to maximize variance as with $PCA$

Note that if $\,B=I$, then $\,X_{}HBHX^T = XHHX^T$. Since this is now the sample covariance matrix of centered $\,X$, the solution to $SPCA$ will be identical to the solution to $PCA$. Therefore $PCA$ is just $SPCA$ with the condition that $\,B=I$.

If $\,B=I$, then, because $\,B$ is the kernel over all the labels and so represents the pairwise similarity over labels, we are working only under the knowledge that each datum is similar to itself - we do not know how similar any datum is to another datum. Or we can interpret it as knowing that each sample is in a separate class, this is how PCA is conducted.
SPCA is therefore PCA with additional knowledge regarding similarity between data. So if you don't have labels use PCA, if you do have more information about the similarity of the points then you can impose it using SPCA. Therefore PCA can be seen as a special case of SPCA.

#### Algorithm for SPCA

1. Recover basis
Calculate $\; Q=XHBHX^T$
Define $U$ based on the top $p$ eigenvectors of $Q$ corresponding to the top p eigenvalues
2. Encode training data
Let $\; Z=U^TXH$ where $Z_{d \times n}$ is a $d \times n$ matrix
3. Reconstruct training data
$\hat{X} = UZ = UU^TX$
4. Encode test example
$\; z = U^T*(x-\mu)$ where $z$ is a $p$-dimensional encoding of $x$
5. Reconstruct test example
$x=Uz = UU^T(x-\mu) \;$

Note that if $B$ is the linear kernel over Y then the dimension of the embedding one can create with this algorithm is bounded above by the rank of $Y^T Y$ since in this case the rank of $Q$ is bounded above by rank of $Y^T Y$. Any additional eigenvectors returned if the algorithm were asked for more would be arbitrary vectors from the null space of $Q$.

### Dual Supervised Principal Component Analysis

Recall how Dual PCA led to Kernel PCA. Dual PCA was based on cases where dimensionality $d$ of data is large compared to the number of samples $n$, and hence the eigenvectors of $\, XX^T_{d*d}$ are hard to find. Instead, singular value decomposition was used to decompose the data into $X = U \Sigma V^T$ where

U is the eigenvectors of $\, XX^T_{d*d}$
V is the eigenvectors of $\ X^TX_{n*n}$

re-arranging shows $\ U = XV \Sigma^{-1}$

Similarly, we can decompose $\, Q = XHBHX^T$ instead of just decomposing $X$. First as $\Psi$ and $Q$ are positive semi-definite matrices we define

• $\ Q = \Psi \Psi^T$ (As it's Positive Semi-Definite (PSD) matrix
• $\ B = \Delta^T \Delta$

So $\ Q = \Psi \Psi^T = XH\Delta^T \Delta H X^T = (XH\Delta^T) (XH^T\Delta^T)^T = (XH\Delta^T) (XH\Delta^T)^T$ So $\ \Psi$ can be calculated as

• $\ \Psi =XH \Delta^T$

The solution for U can be decomposed from $\, \Psi =U \Sigma V^T$. Note that DSPCA depends only on the number of points, and is independent of the dimensionality.

Note that if $B$ is the linear kernel over Y then the dimension of the embedding one can create with this algorithm is bounded above by the rank of $Y^T Y$ since in this case the rank of $Q$ is bounded above by rank of $Y^T Y$. Any additional eigenvectors returned if the algorithm were asked for more would be arbitrary vectors from the null space of $Q$.

### Kernel Supervised Principal Component Analysis (Kernel SPCA)

In Kernel SPCA, we maximize $Tr(U^TXHBHX^TU)$ subject to $U^TU=I$. There is a trick to doing this: U can be written as a linear combination of all data.

$U = \underset{i}{\sum} \alpha_{i} \; x_i$

This algorithm allows us to re-write our problem using dot product. In matrix form, we assume that, for each column of U, there are n coefficients such that $\, U=X \beta$ where $U$ is d x p, $X$ is d x n, $\beta$ is n x p

• Our objective function can be re-written in terms of $\beta$ as
$\underset{\beta}{max} \; Tr(U^{T}XHBHX^{T}U)$
$= \underset{\beta}{max} \; Tr(\beta^TX^TXHBHX^TX\beta)$
$= \underset{\beta}{max} \; Tr(\beta^TKHBHK\beta)$
notice that $X^TX$ and $XX^T$ are our linear kernel: $\psi (x^T) \psi (x) = \psi (x) \psi (x^T)$

This constraint term can be re-written in terms of $\beta$ as:

$U^TU=I$ which implies $\beta^TX^TX \beta = I$ which implies $\beta^T K \beta = I$

Therefore the new optimization problem is $\underset{\beta}{max} \; Tr( \beta^TKHBHK \beta)$ subject to $\beta^T K \beta = I$

Here we don't have $X$, but rather we have $K$. If we were to write the Lagrange multiplier, we can see that the solution to this problem is the generalized eigenvectors of $HBHK$:
$\; L(\beta) = Tr( \beta^T KHBHK \beta ) - \lambda Tr( \beta^TK \beta - I)$
$\frac{\partial L}{\partial \beta} = 2KHBHK \beta - 2 \lambda K \beta = 0$
$let \; Q=KHBHK$
this implies that $Q \beta = \lambda K \beta$
Notice that if this was just $Q \beta = \lambda \beta$, then Q would be the eigenvectors, and $\lambda$ the eigenvalues of $\beta$. However, since we instead have $\lambda K$, we are solving for the generalized eigenvectors of $\, Q=HBHK$
Therefore the solution will be the top p eigenvectors of the generalized eigenproblem $eigen(Q,K)$

In high dimensional space, X isn't explicitly available. However, by denoting$\, X \mapsto\Phi(X)$, then $\, U= \Phi(X) \Beta$, but we don't have X. Therefore we need to project $\, \Phi(X)$ to a lower dimension space and compute
$\, Y=U^T \Phi(X) = \beta^T \Phi^T(X) \Phi(X) = \beta^T K$ where $\, Y$ is p x n, $\, \beta^T$ is p x n, and $\, K$ is n x n.

To project an out-of-sample point, x, we have:

$\, y=U^T \Phi(x) = \beta^T \Phi(X) \Phi(x) = \beta^T K_{test}$ where
• $\, X$ is the training data
• $\, x$ is the test data
• $\, K_{test}$ is the kernel between the training and test data. (ie. it codes how similar the training and test data are)

Note: There is an alternative way to derive the Kernel SPCA, which will be shown on Assignment 2

#### Algorithm for Kernel SPCA

Input:

• $\mathbf{K}$ the kernel matrix of the training data
• $\mathbf{K_{test}}$ the kernel matrix of the test data
• $\mathbf{B}$ kernel matrix of target variable

Note: in lecture note professor use $\mathbf{L}$ to represent the kernel of target variable. For consistency $\mathbf{B}$ is used instead.

• $\mathbf{x}$ testing example
• $\mathbf{n}$ training data size

Output:

• $\mathbf{Z}$ dimension reduced training data
• $\mathbf{z}$ dimension reduced test data

1. Define: $H = I - \frac{1}{n}ee^T$ and $\ Q = KHBHK$
2. Compute basis: $\beta_{n \times p}$, the top $p$ generalized eigenvectors of $Q$ and $K$
3. Encode the training data: $\ Z = \beta^T \Phi^T(X) \Phi(X) = \beta^TK$
4. Encode the test example: $\ z = \beta^T \Phi^T(X) \Phi(x) = \beta^TK_{test}$

Note that if $B$ is the linear kernel over Y then the dimension of the embedding created with this algorithm has an upper bound defined by the rank of $Y^T Y$. This is because the rank of $Q$ is bounded above by rank of $Y^T Y$. Any additional eigenvectors returned by the algorithm would be arbitrary vectors generated from the null space of $Q$.

#### MATLAB Example for Kernel Matrix

Polynomial degree 4 kernel with kernel function as defined in 'kernel.m'

 >> load X % data to find kernel matrix of
>>
>> n = size(X,2);
>> K = zeros(n);
>> for i=1:n
>>      for j =1:n
>>           K(i,j) = kernel(poly,X(:,i),X(:,j),4,[]);
>>      end
>> end


#### Recommended Kernels for Classification

##### Delta Kernel

When applying KPCA for classification purposes, the recommended kernel to use on Y is the delta kernel.

$\mathbf{K}(x, x') = 1$ if $\mathbf{x}$ and $\mathbf{x'}$ are equal, and 0 otherwise.

This kernel distinguishes between elements belonging to different classes.

##### RBF Kernel

A good kernel to use on X is the Radial Basis Function Kernel (RBF).

$\mathbf{K}(x, x') = exp(-\frac{\|x-x'\|^2}{2\sigma^2})$, where $\mathbf{\sigma}$ is a free parameter.

$\|x-x'\|^2$ is the squared Euclidean distance between the two vectors. The value of the RBF kernel decreases as distance between the two vectors increases.

Since the value of the kernel ranges between 0 and 1, it can also be interpreted as a similarity measure.

### Experimental Results Visualization

The following images are the 2-dimensional projections to the Iris flower data set produced by (in order from left to right):

a. Supervised PCA with $L = I$

b. Bair's SPC

c. Kernel Dimension Reduction

d. Kernel Supervised PCA

File:irisFlower.png
Comparing the different 2-dimensional projections of the Iris flower data set, image taken from here

The following are two more examples to show the effects of the 4 projections on different data sets. We are going to use the Binary XOR dataset and the Concentric Ring dataset to demonstrate.

##### The Binary XOR dataset

Original form of the Binary XOR dataset

File:binaryxor.png
The original form of Binary XOR data set, image taken from here

An 1-dimensional Gaussian noise is added to the dataset before projected to 2-dimensional space.

2-dimensional projections of the Binary XOR dataset produced by (in order from left to right):

a. Supervised PCA with $L = I$

b. Bair's SPC

c. Kernel Dimension Reduction

d. Kernel Supervised PCA

File:binaryxor all.png
Comparing the different 2-dimensional projections of the Binary XOR data set, image taken from here

KSPCA is able to clearly seperate the data into the 2 categories while the other methods fail to do so

##### The Concentric Ring dataset

Original form of the Concentric Ring data set

The original form of Binary XOR data set, image taken from here

2-dimensional projections of the Concentric Ring dataset produced by (in order from left to right):

a. Supervised PCA with $L = I$

b. Bair's SPC

c. Kernel Dimension Reduction

d. Kernel Supervised PCA

Comparing the different 2-dimensional projections of the Concentric Ring data set, image taken from here

KSPCA is able to seperate the data into 2 distinct clusters while the other methods are unble to do so

## SPCA continued and Metric Learning(Lecture 12: Oct. 22, 2014)

### SPCA continued

#### SPCA compared to other methods

The following figure compares the performance of different dimensionality reduction techniques across 5 datasets. The one-nearest-neighbor classifier is used, as it doesn't have parameters to tune for each dimensionality reduction algorithm, which guarantees a fair comparison between the different algorithms. (Figure from Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds by Barshana, Ghodsi, Azimifara, and Zolghadri Jahromia.)

We can only project up to k-1 dimensions using FDA, where k is the number of classes. Increasing the number of dimensions beyond $K-1$ has no effect and the error becomes flat.

Classification error rates on five UCI data sets for different projection dimensions, as computed by the algorithms supervised PCA (SPCA), kernel supervised PCA (KSPCA), KDR, Bair’s SPC (BSPC), CFML, FDA, and mKDR. (a) Balance, (b) Heart Disease, (c) Ionosphere, (d) Parkinsons, (e) Sonar.

Methods like KDR([21]) do not work in all of the datasets above, however SKPCA did pretty well for most of them. The consistently good performance of SKPCA lets us use it as a pre-processing classification. It can also be used as a pre-processing step for regression, which is another technique for dimension reduction.

KSPCA has the same runtime efficiency as SPCA because its optimization is based on the number of observations, rather than the number of dimensionality of the data.

#### Feature Selection

There are two main approaches for dimensionality reduction [22] [23]

• Feature selection: Selecting a subset of the existing features
• Feature extraction: Combining the existing features to produce a smaller number of features.

The formulation of SPCA can also be used for feature selection as follows:

Remember that we have $\,U^T X$ as our projection. We computed $\,U$ in a way that $\,U^T X$ has maximum dependence with the label $\,Y$.

Now consider only the first column of $\,U$ , denote it as the vector $\,u$. We can project the data into one dimensional space using this vector $\,u$.

If vector $\,u$ is sparse. i.e. $\,u=[0,0,...0,a,0,...0,b,0,...]$ we multiply it with $\,X$ to get a linear combination of rows of $\,X$. However, since we have a lot of zeros in $\,u$ the resulting matrix will only contain information of rows that is not multiplied by $0$. If we remove the unused rows we will still have maximum dependence with the label $\,Y$. This is equivalent to saying that we don't need these features/variables, because they have no dependence with $\,Y$. (i.e for $\,u=[0,0,...0,a,0,...0,b,0,...]$, the first few features of Y which corresponds to the first few zero entries of $\,u$ are redundant.)

To compute sparse eigenvectors, $\,U$, we can use LASSO [24]. The method was previously interpreted using linear programming in 1999. In 2009, a new efficient algorithm called coordinate descent [25] was introduced.

Using this method we can project the data on only one dimension. If we want to project on $p$ dimensions we will need to use a $p$-column matrix with block-sparsity property. Block sparsity can be achieved by minimizing the $l_{2,1}$ norm, defined by

$\|A\|_{2,1} = \sum_{i=1}^r \sqrt{\sum_{j=1}^p A_{ij}^2}$ [26]

### Metric Learning

#### Introduction

Metric learning computes the Mahalanobis distance instead of Euclidean distance over the input space. The Mahalanobis distance is defined as
$d_A(x_i,x_j) = \|x_i,x_j\|_A = \sqrt{(x_i - x_j)^T A (x_i-x_j)}$
where $A$ is a symmetric positive semi-definite matrix.

Motivation and Explanation:

Metric Learning was proposed by Xing et al in 2002. They proposed to construct a new distance metric over the points to use in place of Euclidean distances. In this approach, some amount of side information is employed to learn a distance metric that captures the desired information.

Suppose we only have class-related side information (not as strong as a label) about some data set. For example, we only know a pair of $\,(x_i, x_j)$ is similar or dissimilar. We notate Set $\,S$ containing all similar point pairs and Set $\,D$ containing all dissimilar point pairs:

$\mathcal{S}: (\boldsymbol{x_i, x_j}) \in \mathcal{S}$ , if $\boldsymbol{x_i}$ and $\boldsymbol{x_j}$ are similar;

$\mathcal{D}: (\boldsymbol{x_i, x_j}) \in \mathcal{D}$ , if $\boldsymbol{x_i}$ and $\boldsymbol{x_j}$ are dissimilar.

For square of Euclidean distance between $\,(x_i, x_j)$,

$\, |x_i - x_j|^2 = (x_i - x_j)^T (x_i - x_j)$

we can generalize it by adding a matrix $A$

$\, d^2_A(x_i,x_j) = \|x_i - x_j\|_A^2 = (x_i - x_j)^T A (x_i - x_j)$.

Matrix $A$ should make $\, d_{ij}$ satisfy the four properties of a metric:

1. Non-negativity: $\,d_{ij} \geq 0$
2. Coincidence: $\,d_{ii} = 0$
3. Symmetry: $\,d_{ij} = d_{ji}$
4. Triangular Inequality: $\,d_{ij} +d_{jk} \geq d_{ik}$

It can be proven that when Matrix $A$ is positive semi-definite, $\, d_{ij}$ must satisfy these four properties. The distance $\, d_{ij}$ is called the Mahalanobis distance.

Note that Mahalanobis distance is equivalent to Euclidean distance by doing Cholesky decomposition to Matrix $A$.

$\, A = w w^T$

thus, we have

$\, d^2_A(x_i,x_j) = (x_i - x_j)^T A (x_i - x_j)$

$\, = (x_i - x_j)^T w w^T (x_i - x_j)$

$\, = (w^T x_i - w^T x_j)^T (w^T x_i - w^T x_j)$

This can be interpreted as calculating the Euclidean distance after transforming $X$ using transformation matrix $W$ by $f(x)=W^TX$.

#### Distance Metric Learning

The main idea behind this method is to minimize the sum of distances between samples that should be similar while maximizing the sum of distances between samples that should be dissimilar. We need to find a matrix A that accomplishes this given that Matrix A is positive semidefinite, i.e. $\, A \geq 0$.

We can define an Objective Function where we want to minimize the sum of the distances between the pairs of points in Set S:

$\, min_{A} \sum_{(x_i,x_j) \in S} d_{ij}^2 = min_{A} \sum_{(x_i,x_j) \in S} ||x_i -x_j ||^2_A$

When $\, A =0$, this criterion becomes not useful. We define the following constraint:

$\, \sum_{(x_i,x_j) \in D} ||x_i -x_j ||_A \geq 1$

(Notice, not squared). This constraint is necessary because, if points are dissimilar, we don't want them to collapse into a single point. So we set the criterion to be, arbitrarily, greater or equal to 1.

The matrix $A$ changes the definition of distance/similarity. But we don't know A, so the problem is then reduced to finding $A$.

Xing et al proposed an approach to get Matrix A by Newton-Raphson method. However, this approach does not work well.

#### MCML approach (Maximally Collapsing Metric Learning)

A popular and efficient approach is Maximally Collapsing Metric Learning, proposed by Amir Globerson and Sam Roweis. The objective function here is a set of conditional probabilities $p^A(j|i)$. For each $\, x_i$, the authors introduced a conditional distribution over points $i \neq j$ such that

$p^A(j|i) = \dfrac {e^{-d^A_{ij}}}{\sum_{k \neq i} e^{-d^A_{ik}}} ~~ i \neq j$

In an ideal situation, all points in Set S (similar set) are mapped to a single point and are mapped infinitely far from the points in Set D (dissimilar set). The ideal distribution is then:

$p_0 (j | i) \propto \begin{cases} 1& y_i=y_j\\ 0& y_i \neq y_j. \end{cases}$

Our goal is to make the conditional as close as possible to the ideal case. When $\,p^A(j|i) = p_0 (j | i)$, then under A all points in the same class will be mapped to a single point, infinitely far from other class points. Thus our objective is to find a Matrix A such that $\,p^A(j|i)$ is as close to $\, p_0 (j | i)$ as possible. KL divergence [27] is applied to solve it.

KL divergence is a measure of the difference between two probability distributions. We denote the difference between probability distribution $\,P$ and $\,Q$ by $\,D_{KL}[P|Q]$.

Thus, the objective becomes minimizing the KL divergence between $\,p_0$ and $\,p^A$, i.e.

$min \sum_i KL[p_0(j|i)|p^A(j|i)] ~~ s.t ~~ A \geq 0$

#### Learning a Metric from Class-Equivalence Side Information

##### Ghodsi et al 's approach

Instead of minimizing the Mahalanobis distance, we define a loss function, which, when minimized, attempts to minimize the squared induced distance between similar points and maximize the squared induced distance between dissimilar points. So the new objective function becomes:

$L(A) = \dfrac {1}{\left| S\right|} \sum_{(x_i,x_j) \in S} ||x_i -x_j ||^2_A - \dfrac {1}{\left| D\right|} \sum_{(x_i,x_j) \in D} ||x_i -x_j ||^2_A$

The optimization problem then becomes $\, min_{A} L(A)$ such that $\, A\geq 0$ and $\, Tr\left( A\right) =1$.

The Matrix $\, A$ should still be positive semi-definite, hence the first constraint.

We also don't want to get a trivial solution of $\, A$ being 0 matrix (i.e. all zero distances), hence the second constraint.

Consider$\,\sum_{(x_i,x_j) \in S} ||x_i -x_j ||^2_A$, we can rewrite it as $\,\sum_{(x_i,x_j) \in S} (x_i-x_j)^T A (x_i-x_j)$

The term inside the summation is a constant, therefore it can be written as

$\,\sum_{(x_i,x_j) \in S} Tr[(x_i-x_j)^T A (x_i-x_j)]$

Since $\,A$is positive semi-definite, therefore $\,A = W^T W$

we can write the equation as

$\,\,\sum_{(x_i,x_j) \in S} ||x_i -x_j ||^2_A=\sum_{(x_i,x_j) \in S} Tr[(x_i-x_j)^T W^T W (x_i-x_j)]$

Again, since the term inside the trace is a constant, we can write it as

$\,\,\sum_{(x_i,x_j) \in S} ||x_i -x_j ||^2_A=\sum_{(x_i,x_j) \in S} Tr[W^T (x_i-x_j) (x_i-x_j)^T W]$

So the objective function becomes:

$\, min_{A} \quad L(A) = \dfrac {1}{\left| S\right|} \sum_{(x_i,x_j) \in S} Tr[W^T (x_i-x_j) (x_i-x_j)^T W] \quad - \quad \dfrac {1}{\left| D\right|} \sum_{(x_i,x_j) \in D} Tr[W^T (x_i-x_j) (x_i-x_j)^T W]$

We can further simply it by letting

$\, M_S = \dfrac {1}{\left| S\right|} \sum_{(x_i,x_j) \in S} (x_i-x_j) (x_i-x_j)^T$

$\, M_D = \dfrac {1}{\left| D\right|} \sum_{(x_i,x_j) \in D} (x_i-x_j) (x_i-x_j)^T$

and rewriting the equation as

$\, min_{A} \quad L(A) = Tr[W^T M_S W] - Tr[W^T M_D W] = Tr[W^T (M_S - M_D) W]$

such that

$\,Tr(A) =1$ or $\,Tr(W W^T) =1$

Now we use Lagrange Multiplier:

$\,Tr(W^T (M_S - M_D) W) - \lambda(Tr(W W^T)-1) = 0$

We get:

$\,(M_S - M_D)W = \lambda W$

It implies that $\,W$ are eigenvectors of the matrix $\,M_S - M_D$

or the solution is the P smallest eigenvectors of the matrix $\,M_S - M_D$

Hence we can find $\, A=W W^T$ easily.

##### Advantages over Distance Metric Learning

Aside from its convenient form, this formulation has other advantages over that used by Xing et al., especially with respect to the side information they convey Ghodsi et al

Xing et al. require at least one dissimilar pair in order to avoid a solution where all distances are zero. On the other hand, the constraint employed on the trace in this approach means that no restrictions need to be placed on the pairings. Side information can consist of similar pairs only, dissimilar pairs only, or any combination of the two.

In the absence of specific information regarding dissimilarities, Xing et al. assume that all points not explicitly identified as similar are dissimilar. This may be misleading and may force the algorithm to separate points that are actually similar. The formulation in Ghodsi et al. approach uses only the side information one actually has, partitioning the pairings into similar, dissimilar, and unknown.

##### Embeddings

Two sets of informed embeddings were generated from the same data set, each with side information pertaining to two different characteristics. The embeddings were generated with and without a preprocessing step. The preprocessing step informs the preprocessed embeddings about the characteristic of interest. It uses the new distances calculated with the learned metric.

In this experiment, all similar pairs were identified but no dissimilar pairs (five-pairs beards and all-pairs glasses omitted). The pairs were selected at random. The "Equivalence-Informed MDS" uses the preprocessor.

Inspection (of this and other Figures) reveals that, in general, the informed versions of the embeddings manage to separate the data based on the target property, whereas the uninformed versions are typically chaotic. Side information can be used to capture characteristics of interest where uninformed embedding techniques fail to automatically discover them. Moreover, two different characteristics (beards and glasses) have been captured within the same data set, even when using only small quantities of class-equivalence side information.

Generally, preprocessing is effective with a wide range of embedding techniques.

#### Dimensionality issue with this approach

We showed previously that W satisfies

$\,(M_s - M_D)W = \lambda W$

So W is composed of eigenvectors of the matrix $\,M_S-M_D$.

However, all these eigenvectors have to correspond to the same eigenvalue $\,\lambda$.

This implies that the rank of the matrix $\,M_S - M_D$ is 1.

So the dimensionality of the reconstructed data is only of dimension 1.

This is due to the constraint $\,Tr(W W^T) = 1$ that we picked initially. We can modify this constraint so that the rank of the solution matrix $\,A$ is greater than 1.

#### Solution with other constraints

In order to get more eigenvalues, possible new constraints were proposed.

One of them is to let

$\, W^T W = I$

We follow similar procedure as the original approach.

Using the Lagrange Multiplier we get:

$\,Tr(W^T (M_S - M_D) W) - Tr(\Lambda (W^T W - I)) = 0$

$\,(M_S - M_D)W = \Lambda W$, where $\,\Lambda$ is a diagonal matrix containing the coefficients.

Since we are minimizing the objective function. The solution for $\,W$ will be p smallest eigenvectors of
$\,M_S - M_D$

Now that we learned that changing our constraint will change our solution, we can think of this constraint more carefully. Another constraint we can have is:

$\,W^T M_S W = I$

Basically, rescale each direction by the variance of points in that direction.

#### Relation to FDA

Recall that FDA solution is the eigenvectors of $\mathbf{S_w^{-1} * S_B}$ Where

$\mathbf{S_w = \sum_c\sum_{i \in c}(x_i - \mu_c) * (x_i - \mu_c)^T}$

$\mathbf{S_B = \sum_c(\hat{x} - \mu_c) * (\hat{x} - \mu_c)^T}$

$\mathbf{S_w}$ is called within class co-variance. $\mathbf{S_B}$ is called between class co-variance.

Recall MCML solution is the eigenvectors of MS - MD.

Conceptually FDA and MCML are similar as FDA evaluate eigenvectors of within class (similar to S similar objects in MCML) and between class (similar to D different objects in MCML).

## Non-negative Matrix Factorization Via Rank-one Doundate(NMF) (Lecture 13: Oct 27, 2014)

The following was finished with reference to Nonnegative matrix factorization via rank-one downdate (Ghodsi et al 2008) [28]. The video of the presentation for this paper can be found at this link [29]

### Background

Several problems in information retrieval can be posed as low-rank matrix approximation. The seminal paper by Deerwester et al. (1990) on latent semantic indexing (LSI) [30] showed that approximating a term-document matrix describing a corpus of articles via the SVD led to powerful query and classification techniques. A drawback of LSI is that the low-rank factors in general will have both positive and negative entries, and there is no obvious statistical interpretation of the negative entries.

Non-negative matrix factorization might be useful for image segmentation, such that all the basis images are non-negative. For example, below we see two basis images. The one on left contains only non-negative entries. The one on right does not have this constraint.

We see the photo on the left looks more like a normal face, since faces do not have negative entries.

Another application is text analysis. We can use a term frequency matrix to represent the number of times a word appears in different pieces of text. A non-negative matrix is suitable here because a word cannot appear in a given text a negative number of times. Other possible applications include DNA analysis and retrieving notes from polyphonic music.

### History

Nonnegative matrix factorization has its roots in the work of Gregory and Pullman (1983), Paatero and Tapper (1994) and Cohen and Rothblum (1993). Lee and Seung (1999) proposed nonnegative matrix factorization (NMF) as used commonly today: given a matrix ARm×n, we want to find non-negative matrix factors WRm×k and HRk×n such that $A\approx WH$, both W and H have nonnegative entries, and k ≤ min(m, n). In a related work, Hofmann (1999) demonstrated the application of NMF to text retrieval.

Since the problem is NP-hard (Vavasis, 2007), it is not surprising that no algorithm is known to solve NMF to optimality. Heuristic algorithms proposed for NMF have generally been based on incrementally improving the objective $\,\left\|A - WH\right\|$ in some norm using local moves.

### Perron-Frobenius Theorem

#### Perron-Frobenius Theorem

Leading singular vectors of a non-negative matrix are non-negative.

This immediately suggests that there is a trivial rank-one NMF:

Compute $[U,\, \Sigma,\, V] = svd(A)$ ;
$W = U_{:1} \,$;
$H = \Sigma_{11} V_{:1}^T$;


Therefore, $W H = U_{:1} \Sigma_{11} V_{:1}^T \approx A$.

However this is a rank one approximation of the matrix $A$ which is not useful in many cases

#### Power Method for Singular Value Decomposition

Power method computes the leading singular vectors / values (or eigenvectors / eigenvalues) of a matrix using an iterative approach. First, it chooses a random solution and then iteratively tries to improve the answer until the answer converges.

In the algorithm below, we make a random guess for the vector u (for example, a vector of ones). The algorithm then repeats until u, σ, v converge.


Algorithm for SVD using power method

input A ∈ R m×n and k ≤ min(m, n)
output U, Σ, V
1: for µ = 1 to k do
2:   Select a random nonzero u ∈ Rm
3:   u = u/‖u‖
4:   repeat {power method}
5:     v = ATu/‖ATu‖
6:     σ = ‖Av‖
7:     u = Av/‖Av‖
8:   until stagnation in u, σ, v
9:   A = A − uσvT
10:  U(:, µ) = u
11:  V (:, µ) = v
12:  Σ(µ, µ) = σ
13:end for



Note that there is no guarantee that the entries of the residual matrix A in line 9 are still non-negative, therefore this algorithm cannot be used directly as an NMF algorithm for k ≥ 2.

One intuitive idea is that we can replace all negative entries of the residual matrix A with 0, but this is obviously not a very good method.

#### First Attempt to NMF

Modify the above algorithm following our intuitive idea, we have the following algorithm for NMF based on power method:


Algorithm for NMF based on power method

input A ∈ R m×n and k ≤ min(m, n)
output W, H
1: for µ = 1 to k do
2: Select a random nonzero u ∈ Rm
3:   u = u/‖u‖
4:   repeat {power method}
5:     v = ATu/‖ATu‖
6:     σ = ‖Av‖
7:     u = Av/‖Av‖
8:   until stagnation in u, σ, v
9:   A = A − uσvT
10:  Replace negative entries of A with 0
11:  U(:, µ) = u
12:  V (:, µ) = v
13:  Σ(µ, µ) = σ
14:end for
15:W = UΣ
16:H = VT



The problem is that when changing the negative entries of the residual matrix $A-u_1\sigma_1v_1$ to zero, we cannot assure that the modified matrix is similar to the original residual matrix.

### Similarity to Clustering

Given $[u, σ, v] = powermethod(A)$, our second observation is that singular vectors can be used to cluster rows and columns of A. Namely, similar entries in u correspond to similar rows of A, and similar entries in v correspond to similar columns of A.

For the reasoning behind, we can refer to spectral clustering, or the algorithm of power method, where we have $u = \frac{Av}{ || Av || }$ and $v = \frac{A^T u}{ || A^T u || }$ . Note that $u(i) = A(i, :)v$, entries of u can be view as cosine similarity between rows of A with v. If we cluster u and v and replace the entries in the cluster with smaller mean with 0, then we are reconstructing only part of the matrix A.


Algorithm for Modified Power Iteration

1: Find initial v using power method
2: while not converged
3:   u = Av / ‖Av‖
4:   cluster u and remove the cluster with smaller mean
5:   v = ATu / ‖ATu‖
6:   cluster v and remove the cluster with smaller mean
7:   σ = ‖ATu‖
8: end



### Rank-One Downdate (R1D)

Objective: find M, N such that A(M, N) is large and approximately rank-one: $A(M, N) \approx \boldsymbol{u \sigma v^T}$. You can find the reference paper of Rank-One Downdate here [31].

Therefore the objective function can be written as:

$max\, f(M, N, \boldsymbol{u}, \boldsymbol{\sigma}, \boldsymbol{v}) = \| A(M, N) \|_F^2 - \gamma \| A(M, N) - \boldsymbol{u}\boldsymbol{\sigma} \boldsymbol{v}^T \|_F^2$

This optimization problem is separable, so for each row i:

$max\, \| A(i, N) \|_F^2 - \gamma \| A(i, N) - \boldsymbol{u}\boldsymbol{\sigma} \boldsymbol{v}^T \|_F^2$

If N and $\boldsymbol{v}$ are given, then M and $\boldsymbol{u}$ can be computed via least square method, and vice versa.

In the second term $\| A(i, N) - \boldsymbol{u}\boldsymbol{\sigma} \boldsymbol{v}^T \|_F^2$, assume N and $\boldsymbol{v}$ are known, we have $\boldsymbol{u\sigma} = A(i, N)\boldsymbol{v}$ by least square method.

$\| A(i, N) \|_F^2 - \gamma \| A(i, N) - \boldsymbol{u}\boldsymbol{\sigma} \boldsymbol{v}^T \|_F^2$

$= A(i, N)A(i, N)^T - \gamma \left( A(i, N)-A(i, N)\boldsymbol{v}\boldsymbol{v}^T\right) \left( A(i, N)-A(i, N)\boldsymbol{v}\boldsymbol{v}^T\right)^T$

$= A(i, N)A(i, N)^T - \gamma \left( A(i, N)A(i, N)^T + A(i, N)\boldsymbol{v}\boldsymbol{v}^T\boldsymbol{v}\boldsymbol{v}^T A(i, N)^T -2A(i, N)\boldsymbol{v}\boldsymbol{v}^TA(i, N)^T \right)$

$= A(i, N)A(i, N)^T - \gamma \left( A(i, N)A(i, N)^T - A(i, N)\boldsymbol{v}\boldsymbol{v}^TA(i, N)^T \right)$

$= (1 - \gamma)A(i, N)A(i, N)^T + \gamma A(i, N)\boldsymbol{v}\boldsymbol{v}^TA(i, N)^T$

$= (1 - \gamma)A(i, N)A(i, N)^T + \gamma \left( A(i, N)\boldsymbol{v} \right)^2$

Therefore,

$f(M, N, \boldsymbol{u}, \boldsymbol{\sigma}, \boldsymbol{v}) = \sum_{i} \gamma \left( A(i, N)\boldsymbol{v} \right)^2 + (1 - \gamma)A(i, N)A(i, N)^T$

where each item in the summation represents the contribution of a row of A in the objective function. Intuitively, we can go through each row of A, calculate its contribution, and keep those that have positive contribution and drop those that have negative contribution.

Algorithm R1D

input A ∈ Rm×n, k > 0
output W ∈ Rm×k, H ∈ Rn×k
1: for µ = 1 to k do
2:   [M, N, u, v, σ] = ApproxRankOneSubmatrix(A)
3:   W(M, µ) = u(M)
4:   H(N, µ) = σv(N)
5:   A(M, N) = 0
6: end for


The ApproxRankOneSubmatrix procedure could be formulated as follows:


Algorithm ApproxRankOneSubmatrix

input A ∈ Rm×n, parameter γ > 1
output M ⊂ {1,...,m}, N ⊂ {1,...,n},
u ∈ Rm, v ∈ Rn, σ ∈ R
1: Select j0 ∈ {1, . . . , n} to maximize ‖A(:, j0)‖
2: M = {1,..., m}
3: N = {j0}
4: σ = ‖A(:, j0)‖
5: u = A(:, j0)/σ
6: repeat
7:   Let v = A(M, :)Tu(M)
8:       N = {j: γv(j)2 - ‖A(M, j)‖ > 0}
9:       v(N) = v(N)/‖v(N)‖
10:  Let u = A(:, N)v(N)
11:      M = {i: γu(i)2 - ‖A(i, N)‖ > 0}
12:      σ = ‖u(M)‖
13:      u(M) = u(M)/σ
14:until stagnation in M, N, u, σ, v


The basic idea would be repeatedly select a subset of rows and columns, based on the criteria whether they could contribute (as positive) to the sum of the rows/columns.

The R1D algorithm is quite scalable, as it does not need all the data to be kept in memory at once.

### Some Results

The results in this section comes from [32].

In Tables 1 and 2 result of LIS and R1D algorithms are presented on the TDT Pilot Study (TDT Study, 1997). The columns of each table are the leading columns of W, with the leading terms per column displayed.

Table 1- Topics found by LSI on the TDT pilot study corpus

The LSI results show that the topics are not properly separated and terms from different topics recur or are mixed.

Table 2- Topics found by R1D on the TDT pilot study corpus

The columns in the R1D table are clearly identifiable topics, and the terms in each columns are all correctly associated with the given topics.

The following figure shows a simple binary image dataset (a), where each of the images is composed of one or two basis "triangles". The leftmost column of (b) and (c) illustrates the four leading columns of W which can be interpreted as the learned features. For each feature the images with the largest entries in the corresponding columns in H is showed to the right of each feature. The below figure shows that R1D can learn the correct basis (triangles as features and assign the suitable image to that base while the results of LSI are can not be interpreted (This may be due to assigning negative weights).

The below figure shows the results of R1D and LSI on Freyface dataset where also figures (b) and (c) illustrates the eigenfaces (leading columns of W) of discovered by R1D and LSI respectively. And the images with max entries in the corresponding columns in H. This figure shows that R1D have interpretable results more than LSI.

## Text/Image Biclustering, Nyström Approximation, and Clustering (Lecture 14: Oct 29, 2014)

### Text and Image Biclustering

In this section, we discuss two different methods for clustering of images and text: LSI (Latent Semantic Indexing), and R1D.

#### LSI (Latent Semantic Indexing)

Latent Semantic Indexing is one method for grouping text or images together. In the case of text, the LSI algorithm takes as input a set of documents as a matrix. The $\, m$ columns of this matrix can represent $\, m$ documents, and the $\, n$ rows of this matrix represent all of the words present in the entire document set. The $\, ij^{th}$ entry of the input matrix generally represents the number of times that term $\, i$ appears in document $\, j$. It is common to rescale the raw frequencies of the input matrix to combat noisy data.

LSI proceeds to group the data by expressing the input matrix $\, X$ as $X \approx W * H^T$.

$\, W$ is an $\, n \times p$ matrix, and it can be viewed as a basis for the set of documents. $\, H$ is an $\, m \times p$ matrix, and it can be viewed as the expression of the basis elements (i.e. the coefficients for each document expressed in the $\, W$ basis). $\, p$ is a parameter, and is taken as an integer that is much smaller than $\, m$ or $\, n$.

Determining the $\, W$ and $\, H$ matrices is done identically to PCA except that we do not need to assume that the $\, X$ matrix has zero mean. This is also equivalent to taking a Singular Value Decomposition of the data.

In the case that the $\, X$ matrix is considered to be text, we interpret the basis to be different topics that are present in the document corpus. We can then cluster documents based on the coefficients used to represent them in this basis. Conversely, we can cluster terms based on their similarity to the basis vectors.

Unfortunately, this type of method does not do the best job of distinguishing topics, as seen in the example used in class.

#### R1D

R1D is another method for grouping text or images (or other sets of data). R1D is very similar to Non-negative Matrix Factorization, and text clustering for R1D can be formulated similarly to the above.

As seen in the example in class, R1D does a very good job of extracting distinct, important features from a large set of data. In both the image segmentation and text clustering examples, R1D significantly outperformed LSI in interpretation and differentiation of features. R1D is scalable, and can easily be applied to large data sets. The implementation only requires one row or column of the data in memory at any given time.

### Nyström Approximation

Many of the algorithms we have seen thus far are not scalable on large data sets. Finding the eigenvalue decomposition of an $\,n$ by $\,n$ matrix for large $\,n$ is very computationally expensive. A solution to this is the Nyström Approximation. The aim of Nyström approximation is that for large scale of data set, we randomly choose a subset of data using the result of singular decomposition of this subset of kernel matrix to approximate the whole singular decomposition of its kernel matrix.

Suppose that we have n data points $\, x_i$ for $\, i = 1, 2, ..., n$ and we select $\,m$ data points randomly ($\,m\lt n$).

Without loss of generality, we write this data set in this form:

$\, X = \begin{bmatrix} x_1, x_2 ,..., x_m, x_{m+1},..., x_n \end{bmatrix}$

Then, we let $\, R = \begin{bmatrix} x_1, x_2 ,..., x_m\end{bmatrix}$ and $\, S = \begin{bmatrix} x_{m+1}, x_{m+2} ,..., x_n\end{bmatrix}$

The data set can be written as

$\, X = \begin{bmatrix} R&S\end{bmatrix}$.

Now we'll compute the kernel matrix

$\, K = X^TX= \begin{bmatrix} R^T\\S^T\end{bmatrix}\begin{bmatrix} R&S\end{bmatrix}=\begin{bmatrix} R^TR&R^TS\\S^TR&S^TS\end{bmatrix}$.

Let $\,A= R^TR$, $\,B= R^TS$ and $\,C= S^TS$ where $\,A$ is an $\,m$ x $\,m$ matrix, $\,B$ is an $\,m$ x $\,(n-m)$ matrix, and $\,C$ is an $\,(n-m)$ x $\,(n-m)$ matrix. Then,

$\, K =\begin{bmatrix} R^TR&R^TS\\S^TR&S^TS\end{bmatrix}=\begin{bmatrix} A&B\\B^T&C\end{bmatrix}_{n \times n}$.

Now, since $\, A=R^TR$ is s.p.d (symmetric positive definite), we can apply SVD to matrix $\, A$:

$\, A = U{\Sigma}U^T=U{\Sigma}^\frac{1}{2}{\Sigma}^\frac{1}{2}U^T = R^TR$

where $R = {\Sigma}^\frac{1}{2}U^T$.

Therefore, we can use this outcome to calculate $\, S$:

$B = R^TS = U{\Sigma}^\frac{1}{2}S$

U is orthonormal, so we can multiply both sides by $\, U^T$ and we get:

$U^TB = U^TU{\Sigma}^\frac{1}{2}S = {\Sigma}^\frac{1}{2}S$

${\Sigma}^{-\frac{1}{2}}U^TB = S$

therefore $C = S^TS = B^TU {\Sigma}^{-\frac{1}{2}} {\Sigma}^{-\frac{1}{2}}U^TB = B^TU{\Sigma}^{-1}U^TB = B^TA^{-1}B$

Note that $\, A = U{\Sigma}U^T$ so the inverse matrix is$\, A^{-1} = U{\Sigma}^{-1}U^T$

Thus, we have $K = \begin{bmatrix} A&B\\B^T&B^TA^{-1}B\end{bmatrix}$

Thus, knowing the first $\,m$ rows of the Kernel matrix allows us to compute the rest of matrix using this method. Also, the approximation is exact if the rank of the Kernel matrix $\, K \lt = m$.

#### Example: Apply Nyström approximation to MDS

Recall that MDS starts with an $n \times n$ distance matrix $\,D$. We compute $K= \frac{-1}{2}HDH$. Then $Y = {\Lambda}^\frac{1}{2}V^T$ where $\,{\Lambda}$ are the eigenvalues, and $\,V$contains the eigenvectors of $\,K$.
Note: $\,D$ is a positive semi-definite matrix since it is Euclidean distance matrix.

To relax this into a Nyström problem we can write $\,D$ as a block matrix as follows:

$D = \begin{bmatrix} E&F\\F^T&G\end{bmatrix}$

Here, $\,E$ is $\,m \times \,m$ and represents the pairwise distances between $\,m$ selected points, and $\,F$ is an $\,m \times \,{(n-m)}$ matrix with the distances between the $\,m$ selected points and all other $\,n-m$ points. We do not need to compute the pairwise distance between all points as is normally required for MDS.

Once we have $\,E$ and $\,F$, we can approximate $\,A$ and $\,B$ in $\,K$ using centering formula:

$\, A_{ij} = \frac{-1}{2} ( E^2_{ij} - \frac{1}{m} \sum_i {E^2_{ij}} - \frac{1}{m} \sum_j { E^2_{ij}} + \frac{1}{m^2} \sum_{i,j}{ E^2_{ij}} )$

$\, B_{ij} = \frac{-1}{2} ( F^2_{ij} - \frac{1}{m} \sum_i {F^2_{ij}} - \frac{1}{m} \sum_j { E^2_{ij}} )$ to calculate elements of A and B.

(note that the constant centering term for $\,B$ is dropped, because it introduces an irrelevant shift of origin).

We can compute $\,C$ approximately using $\,{B^T A^{-1} B}$. The approximation is exact when $\,K$ is of rank $\,m$ or less.

The eigenvalues and eigenvectors of $\,K$ can be approximated by the eigendecomposition of $\,A$, using the top $\,d$ eigenvectors according to the desired dimensions. We set $\,{X = \Gamma ^{1/2} U^T}$ from the eigendecomposition of $\,A$, and then the coordinates corresponding to $\,B$ are $\,{Y = X^{-T}B = \Gamma^{-1/2} U^T B}$. Taken together, $\,X$ and $\,Y$ give the new $\,d$ dimensional coordinates of the transformed data.

In this way, we do not need to compute the (computationally expensive) eigendecomposition of the full matrix, but only that of the subselected $\,A$.

More details can be found in this paper: FastMap, MetricMap, and Landmark MDS are all Nystrom Algorithms. Current research is exploring which points when selected yield the best approximation. This paper also outlines the different calculation of the elements of B for Landmark MDS (LMDS):

$\, B_{ij} = \frac{-1}{2} ( F^2_{ij} - \frac{1}{m} \sum_p {E^2_{ip}} )$

This alternative is valid since the only change to B is a translation of all columns by a multiple of the vector of all ones. Since B is only used in dot products the shift does not effect the results.

#### Application of the Nyström Method to Image Segmentation

The Nystrom Method can be used for representing a full set of data by finding similarities between groups given a small sample of the full data. For example, Fowlkes, Belongie, Chung, and Malik (2004) utilised the Nystrom Method to segment images based on groupings rather than on pixels. Clustering by pixels is problematic for high resolution images. Instead, the Nystrom Method allows a small number of samples to represent the larger picture.

### Clustering

Cluster analysis groups data into clusters such that data within a cluster is similar and different from the data in other clusters. Clusters are better and more distinct when the similarity within a cluster is greater and the difference between clusters is greater as well. Cluster analysis groups data based on information found in the data and relationships between data points. In this way, cluster analysis is a type of unsupervised classification.

Clustering is not a well defined problem in general.

The image above demonstrates how a set of points can be divided into clusters in three different ways. The shapes of the markers indicate cluster membership. "This illustrates that the definition of a cluster is imprecise and that the best definition depends on the nature of data and desired results." ref

We represent a clustering function as $\,{\Gamma} = f(D)$.

$\,{\Gamma}$ is the partition of points, $\,D$ is the distance metric.

#### Properties We Want

There are 3 desirable properties we want in a good clustering function or algorithm $\,f$:

1. Scale invariant: $\,f(D) = f({\alpha}D)\; \forall scale\; \alpha$ (For example, scaling distance measure $\,D$ by a constant factor should not affect clustering.)
2. Richness: Any possible partitioning $\,{\Gamma}$ on $\,S$ should be a possible outcome of $\,f$.
3. Consistency: Assume $\,f(D)=\Gamma$. If D' is such that distances between points in the same cluster of D are reduced and distances between points in different clusters are increased, then $\,\Gamma =f(D')$.

#### Impossibility Theorem

It is impossible to satisfy all three above conditions at the same time for any $\,f$.

Generally we deem certain aspects less important and cluster anyway.

#### 3 Types of Clustering

1. Combinatorial algorithm - Clustering itself is a combinatorial problem. An example of Combinatorial algorithm is K-means clustering.
2. Mixture models - In mixture models we interpret data points as a sample from a mixed probability distribution model. An example of a "mixture model" is the Gaussian Mixture Model
3. Mode seeking - This method assumes no specific distribution however it uses a non-parametric function to describe the data in order to compute something similar in nature to the data given. One this is done, then we look for the mode of the distribution. Clustering is combinatorial by nature. For example, given 3 points, there are many ways to make 2 clusters of data. Combinatorial problems are generally very difficult to solve on large data sets, so we apply heuristics and relaxations to solve the problem sub-optimally. An example of "mode seeking" clustering is the Mean Shift Algorithm.

## Clustering (Lecture 15: Nov 3, 2014)

### Clustering

Clustering is the task of grouping objects into groups (clusters) such that objects between groups will be more similar than objects in the different clusters. To do clustering we will consider the similarity with the distance between points as follow :

• Suppose we have n data points that are indexed 1,...,n
• Suppose we need K clusters, k ∈ {1,2,...,K}
• We need to assign each point to one cluster k=C(i). Where C is an encoder
• The goal is to find a grouping of data such that the distances between points within a cluster tend to be small and distances between points in different clusters tend to be large.

### Combinatorial Algorithms

Consider now the sum of all the distances between all of the points. Call this distance T.

$T= \sum\limits_{k=1}^K \sum_{C(i)=k}(\sum_{C(j)=k} d_{ij} + \sum_{C(j) \neq k} d_{ij}) = \sum\limits_{k=1}^K \sum_{C(i)=k} \sum_{C(j)=k} d_{ij} + \sum\limits_{k=1}^K \sum_{C(i)=k} \sum_{C(j) \neq k} d_{ij} = w(C) + b(C)$

where w(C) is the within cluster distance and b(C) is the between cluster distance. Note that this summation double counts, but it double counts everything so it is still valid.

K-means minimizes within-cluster point scatter

We want to minimize w and maximize b. But since T is a constant and $T = w(C) + b(C)$, $\min\, w(C)=\max\, b(C)$ and vice versa. So from this we can conclude we either need to minimize W or maximize B. So our new goal is to minimize W. The number of distinct assignments (the complexity of the algorithm), for m points and k clusters is $S(n,k) = \frac{1}{k!} \sum_{k=1}^K (-1)^{K-k} {K \choose k} k^n$ which is very large so clearly we need a method with which to cluster.

### K-Means Clustering

K-means is one of the most popular and widely used clustering algorithms, which minimizes the variance within each cluster.

Idea: Take a greedy descent approach.

First initialize C(i) to a starting value.

Then, continue in such a way that the criterion W(C) improve at each step.

In the case of K-means clustering this is done by finding the C(i) (cluster centre) that is closest to each given data point at each step. Each data point is assigned to this cluster. Once this is done for all data points, we take each cluster and recalculate W(C) by taking the mean of all the points assigned to the given cluster. This step improves W(C) and we have just redefined C(i). Now, we repeat the step with the new C(i). The idea is that the points that are not clustered correctly will be very far from the new mean C(i), and they are likely to be assigned to a different cluster as the newly calculated mean of another cluster may be closer. This process is repeated until no points are assigned to a new cluster. This indicates that the algorithm will have converged.

$W(C)= \frac{1}{2} \sum\limits_{k=1}^K \sum_{C(i)=k} \sum_{C(j)=k} d_{ij} = \frac{1}{2} \sum\limits_{k=1}^K \sum_{C(i)=k} \sum_{C(j)=k} || x_{i} - x_{j} || ^2 = \sum\limits_{k=1}^K \eta_{k} \sum_{C(i)=k} || x_{i} - \bar{x}_{k} || ^2$ Where $\eta_{k}$ is the number of points in cluster k.

Aside: For any set of observations S: $\bar{x}_{S} = argmin_{m} \sum_{i \in S} ||x_{i}-m||^2$

Now, $c_\ast = argmin_{C} \sum_{k=1}^K \eta_k \sum_{C(i)=k} ||x_{i} - \bar{x}_k||^2 = argmin_{C, \{m_k\}_{k=1}^K} \sum_{k=1}^K \eta_k \sum_{C(i)=k} ||x_{i} - m_k||^2$ (2)

A very important property of the above objective function is the role of $\,\eta_k$. You may have seen some other objective functions in other sources (including wikipedia page of K-means Algorithm [33] and even MATLAB kmeans function) in which there is no such variable in the objective function. The role of this variable here, is basically preventing a cluster to have so many points while the other cluster has few ones. Of course there are many cases which this phenomenon happens inevitably but, if a data set has some how a symmetric structure, there should not be a cluster that has so many more points than the other(s).

Just for clarification, consider this exaggerated example. Suppose we have 1000 data points and we want to group them into two clusters. Assume that if we use the objective function without $\,\eta_k$ coefficient, we would end up to two clusters, one of them with 980 points and the other 20 points. Assume again that the sum of squared distance to clusters' centers for cluster one is 500 and for cluster two is 10. Now, if we evaluate this instance of clustering with the above objective function with coefficient $\,\eta_k$, it will be 980*500+20*10 = 490200. But, assume that if we take 180 points from cluster one which has the maximum distance from cluster one's center and minimum distance from cluster two'center (of course this maximum distance from cluster one's center is still less than the minimum distance to cluster two's center because they are grouped in cluster one), the value of sum squared distances will become 280 for cluster one and 240 for cluster two(including the effect of change in centers position). These 180 points are actually located some where in between two clusters' centers. Now the objective function will become 800*280+200*240 = 272000 which is much less than before. Finally, it is worthwhile to say that however adding $\,\eta_k$ to our objective function makes our clusters to share data points in a more fairer manner, it also makes this optimization problem even harder.

#### Optimization

We can optimize (2) in two steps:

• Given C, (2) is minimized by letting ${m_1, ... , m_k}$ to be the centroid of all data points in each cluster.
• Given $\{m_k\}_{k=1}^K$,￼￼ (2) is minimized by assigning each $x_i$ to the closest $m_k$
• So $c_\ast = argmin_{1 \leq k \leq L} | x_{i} - \bar{x}_{k} |$

This is basically k-means clustering.

#### Iterative algorithm

• Initialize: pick K random points as cluster centers
• Alternate:
1. Assign data points to closest cluster center.
2. Change the cluster center to the average of its assigned points.
• Stop when no points' assignments change.

This applet illustrates the steps of K-means

File:kmeansViz.png
Example of K-means clustering algorithm with k=2. Image obtained from http://stanford.edu/~cpiech/cs221/handouts/kmeans.html

#### Implementation Details and Extensions

Selecting the number of classes and initial points to start with are very important. If a bad number of initial points are chosen, we may end up losing some clusters in the essence of combining two of more clusters into one and if too small a K is selected then we will create almost sparse clusters or clusters with just one member.

There are a bunch of techniques to choose the proper initial values to prevent this phenomena, and also some other techniques that converges much faster than the classical k-means method as described above.

• The Iterative Self-Organizing Data Analysis Technique (ISODATA) algorithm can help us to find the number of clusters to start with.
• The algorithm "K-means++" can be used to find proper initial points.
• Other techniques applies the kernel trick to K-means, and produced kernel k-means algorithm.

### Expectation-Maximization Algorithm

#### Motivation: Mixture Model

Another method of clustering is the usage of a mixture model. In general, when presented with a set of data points, one intuitive method to organize these data points is to fit a predetermined model or distribution to the data points then use the given data points to estimate the parameters of the model or distribution. One common distribution is the Gaussian, or Normal, distribution. However one major drawback of this is that the spread of the data has only one mode (unimodal), thus, since the reason for clustering is to organize the data into separate sections, the usage of only one mode gives little benefit. However, this concept can be extended by assuming that there exists a mixture of multiple Gaussian distributions in which an unknown parameter determines which one a certain data point is sampled from. This concept is known as the Mixture Model which is illustrated in the following figure, where the unknown parameter (Z) is used to select one of the Gaussian distributions X. For example if we using a mixture of two Gaussians Z will have a Bernoulli distribution and according to its result one of the two distributions can be selected.

Mixture of Gaussians

It is more obvious to see as follows:

1) We get the observation and we assume these points are from Gaussian distribution

2) The cluster appears that points are close to other data

3) We can then fit a mixture of Gaussian distributions

For example, you roll a dice. The dire rent result correspond different Gaussian distribution.

An interesting extension to the method would be to learn the number of clusters in the data by considering the data as being made of k clusters or of j clusters where k is less than j. This allows the mixture of k normals to be a submodel of the mixture of j normals. As one is a submodel of the other, and the log-likelihoods of each are easy enough to compute, one may perform a Likelihood Ratio Test to determine if the model with fewer clusters is sufficient to explain the data relative to the model with more clusters. One may then work backwards from 1 cluster per datum (which has perfect fit and so the likelihood is 1) towards a more reasonable number of clusters, or to work forwards from 1 cluster to a more reasonable number of clusters.

#### Review of Maximum Likelihood Estimator

This note from New York University is a good tutorial on MLE.

If the form of the distribution P is known but its parameters $\ \theta$ are unknown, we can write likelihood function and maximize it with respect to $\ \theta$ Consider a cluster of the data that only has one mode. One of the most popular methods to determine the parameters of a distribution used to fit these data points is Maximum Likelihood Estimation (MLE). Essentially the given data points are assumed to be samples from the distribution $\ f(x;\theta)$ where $\ \theta$ denotes the parameters of the distribution. Assuming we know what $\ \theta$ is and that all sample points are independent from each other we calculate the total probability of sampling the given points given the distribution and parameters as follows:

$\ P(X_{i},\theta) = L(X_{i};\theta)$, and for the whole data set, $L(X;\theta) = \prod_{i=1}^nP(X_i;\theta)$. However, we usually calculate the log likelihood instead as it is much easier to work with. It is defined as $\ l(X;\theta) = \sum_{i=1}^n log P(X_i;\theta)$

Example:

$\ P(X;\theta) = N(\mu,{\sigma}^{2})$ where $\ \theta = \{\mu,\sigma\}$

$\ P(X; \theta) = \frac{1}{\sqrt{2\pi}\sigma}{e}^{\frac{-{(x - \mu)}^{2}}{{2\sigma}^{2}}}$

$\ L(X;\theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma}{e}^{\frac{-{(x_{i} - \mu)}^{2}}{{2\sigma}^{2}}}$

$\ l(X;\theta) = \sum_{i=1}^n log(\frac{1}{\sqrt{2\pi}\sigma}) - \frac{{(x_{i} - \mu)}^{2}}{{2\sigma}^{2}}$

Thus we need to maximize only the last part as the rest is constant:

$\ \sum_{i=1}^n \frac{-{(x_{i} - \mu)}^{2}}{{2\sigma}^{2}} = \frac{-1}{{2\sigma}^{2}}\sum_{i=1}^n {(x_{i} - \mu)}^{2}$

taking the derivative and setting it equal to zero gives the estimates:

$\ \mu = \frac{\sum_{i=1}^n x_{i}}{n}$, $\ {\sigma}^{2} = \frac{\sum_{i=1}^n{(x_{i}-\mu)}^{2}}{n}$

#### Expectation Maximization (EM) Algorithm

Another way of looking at the method the data is generated by is to think about it where there is a hidden variable. You see the observed variables $\{x_{i}\}_{i=1}^n$, but there exists some latent variables $\{z_{i}\}_{i=1}^n$ that you never observe. So we have original data in the form of $\{(x_{i}, z_{i})\}_{i=1}^n$. A popular method to solve this kind of problem is called the EM algorithm.

In short, the EM algorithm is a way to maximize the likelihood when a part of the data is missing.

• We can write $\,l(x; \theta)$ which is likelihood of observation.
• We can also define $\,l_{c}(x, z; \theta)$ which is called complete likelihood. x is observed but z is hidden. $\,l_{c}(x, z; \theta)$ is the log likelihood of joint distribution of observed and missing data.

$\,l(x; \theta) = log P(x; \theta)$
$\,l_{c}(x, z; \theta) = log P(x, z; \theta)$

• We can write the likelihood of observation $l$ in terms of the complete likelihood $l_{c}$:

$\,l(x, \theta) = log \sum_{Z} P(x, z; \theta)$

##### Idea

$\,log \sum_{Z} P(x, z; \theta)$ is the likelihood of our optimization. Calculating the log of a sum is difficult, so we want to change it to a sum of logs.

Suppose this function is greater than or equal to another function; that is, the second function is the lower bound of the first. Then maximizing the first function is equivalent to maximizing the lower bound, or what we can call the auxiliary function. When we are maximizing the auxiliary function we can say that it has two terms. One of the terms is the expectation of $\,l_{c}(x, z; \theta)$ which we treat as random variable.

The EM algorithm has two alternating steps:

1. Estimate $\, z$, which makes it is easy to optimize for $\, \theta$ because it separates the set into different subsets. Now you know the parameters of the components.
2. Knowing the parameters, what is the best estimate for $\, z$? Then knowing z what is the best guess for the parameters.

Iterate between these two.

EM is more general than computing mixture of Gaussian.

##### Pros and Cons of EM algorithm

Pros:

• You don't need to choose a step size
• The algorithm is guaranteed to converge after a finite number of iterations
• The algorithm is fast

Cons:

• Sometimes the algorithm only results in a local maxima
• The algorithm is sensitive to the initialization of parameters

[34] Shows that K-Means is really just the EM (Expectation Maximization) algorithm applied to a particular naive bayes model.

## EM algorithm and Mixture Model (Cont) (Lecture 16: Nov 5, 2014)

### EM algorithm (Continue from last lecture)

Let's suppose variable X is observed and variable Z is missing. So our observed log likelihood for X is $l(\theta ; x) = \log P(x;\theta)$. And our complete likelihood function is $l_c(\theta; x, z) = \log P(x, z; \theta)$ (where complete means it includes missing data). We can consider observed likelihood function as the marginal density function of the complete likelihood function that sum over (or integrate over, if the variable Z is continuous) Z. Therefore we can write $l(\theta, x) = \log(\sum p(x, z, \theta))$. This formula has the form of log of sum, which is difficult to manipulate in optimization problem. Therefore, we seek a way to transform this formula in terms of sum of log function. We can use Jenson inequality to help us perform this transformation.

#### Jensen Inequality

The most commonly seen Jensen inequality [35] is defined in term of a convex function. For our purpose, since we are going to apply Jenson inequality towards a concave log function, we need to write it in terms of a concave function.

Theorem: Let f be a convex function, and let X be a random variable. Then: $\,E[f(X)] \geq f(EX)$.

We randomly pick two points a and b within the domain of the concave function, and $\, \alpha \in [0,1].$. Therefore $\, \alpha a + (1-\alpha)b$ is a point between $\, a$ and $\, b$. By Jenson's inequality for concave function (Theorem), we have the following inequality correct: $\, f(\alpha a + (1- \alpha)b) \geq \alpha f(a) + (1-\alpha) f(b)$.

#### Jenson inequality within EM Algorithm

To apply Jenson's inequality to the EM algorithm, we first rewrite the observed likelihood formula as:

$l(\theta; x) = \log \sum_z q(z|x) \frac{p(x,z,\theta)}{q(z|x)}$, where $\ q(z|x)$ is a any valid probability density function of z conditioning on x. Note that log is a concave function, hence by Jenson's inequality, we arrive at the following inequality:

$\log \sum_z q(z|x) \frac{ p(x,z;\theta)}{q(z|x)} \geq \sum_z q(z|x) \log \frac{p(x,z;\theta)}{q(z|x)}$.

We define the second part of inequality as auxiliary function $L(q,\theta) = \sum_z q(z|x) \log \frac{p(x,z;\theta)}{q(z|x)}$. Since the auxiliary function is a lower bound of the likelihood function we want to maximize, instead of maximizing the likelihood function, we can maximize the auxiliary function instead.

#### EM Algorithm in General

The EM Algorithm consists of the E step and M step, which are defined as:

E-step (expectation): Using the current value of parameter $\theta$, find the distribution $q(z|x)$, for the latent variable z.

$\ q^{t+1} = \arg \max_q L(q, \theta^t)$

M-step (maximization): Maximize the expected value of $log P(x,z | \theta)$ with respect to $\theta$, where the expectation is with respect to the distribution $q(z|x)$ that found in the E step

$\ \theta^{t+1} = \arg \max_\theta L(q^{t+1}, \theta)$

The upper $t$ and $t+1$ notation means the current arguments is at $t'th$ iteration.

In EM algorithm, we iterate through these two steps, until they converge. They are guarantee to converge, since at each iteration we will increase the maximum. However there is no guarantee that they will converge to a global maximum, since the summation of log function may not be a global concave function.

We can expand auxiliary function as $L(q,\theta) = \sum_z q(z|x) \log \frac{p(x,z; \theta)}{q(z|x)}$

$\ = \sum_z q(z|x) [ \log p(x, z, \theta) - \log q(z|x) ]$

$\ = \sum_z q(z|x) \log p(x,z; \theta) - \sum_z q(z|x) \log q(z|x)$.

#### Detail of M-step

Since the second term of $L(q, \theta)$ expansion don't contains $\theta$, it is constant with respect to $\theta$. When we maximize $L(q, \theta)$ in M-step with respect to $\theta$, we don't need to consider it.

$\mathbf{\sum_{z}q(z|x)\log p(x,z,\theta) = E[l_c(\theta, x, z)]_{q(z|x)}}$

which is the expectation of complete likelihood function with respect to q(z|x).

#### Detail of E-step

Claim: if $q(z|x)$ is $p(z|x;\theta)$, then $L(q, \theta)$ is maximize with respect to q.

Proof: Substitute $p(z|x;\theta)$ in the auxiliary function.

$\, L(q, \theta) = \sum_z q(z|x) \log \frac{p(x,z,\theta)}{q(z|x)}$ $\, = \sum_z p(z|x, \theta) \log \frac{p(x,z;\theta)}{p(z|x;\theta)}$ $\, = \sum_z p(z|x, \theta) \log \frac{p(z|x;\theta)p(x;\theta)}{p(z|x;\theta)}$ $\, = \sum_z p(z|x, \theta) \log p(x;\theta)$ $\, = log p(x;\theta) \sum_z p(z|x; \theta)$ $\, = p(x; \theta)$ $\, = l(x;\theta)$

Since $\, l(\theta;x) \leq L(q,\theta) = l(\theta;x)$ when $\, q(z|x) = p(z|x;\theta)$. From the previous equation we know that $\, l(\theta;x)$ is the upper bond of the auxiliary function Therefore we can conclude that $\, q(z|x) = p(z|x;\theta)$ is the optimal solution of E step.

#### Trivial Example

assume $\, Y = (y_{1},y_{2})$, where $y_{1}=5$,observed value of $y_{2}$ is missing.

$\, P(y) = \theta {e}^{- \theta y}$

The goal is to estimate $\theta$. This is trivial since we can just do log likelihood as follows:

#### Log Likelihood Solution

$\, L(x;\theta) = \theta{e}^{- \theta 5}$

$\, l(x;\theta) = \log(\theta) - 5\theta$

$\, \frac{\partial l}{\partial \theta} = \frac{1}{\theta} - 5 = 0$ -> $\frac{1}{\theta} = 5$

$\, \theta = \frac{1}{5}$

#### EM Solution

Say instead we want to use EM algorithm. We can do so as follows:

$\, E[l_{c} (\theta,y_{1},y_{2}) ]_{p(y_{2}|y_{1};\theta)}$

$\,L_{c} (\theta,y_{1},y_{2}) = \theta {e}^{- 5 \theta} \theta {e}^{- y_{2} \theta}$

$\,l_{c} (\theta,y_{1},y_{2}) = 2\log \theta - 5 \theta - y_{2} \theta$

$\,E[l_{c} (\theta,y_{1},y_{2}) ]_{p(y_{2}|y_{1};\theta^{t})} = 2 log(\theta) - 5 \theta - E[y_{2}]_{p(y_{2}|y_{1};\theta^{t})}$

Note: $\,E[y_{2}]_{p(y_{2}|y_{1};\theta^{t})} = E[y_{2}]_{p(y_{2};\theta^{t})} = \frac{1}{\theta^{t}}$ since $y_i's$ are independent.

$\,E[l_{c}] = 2log (\theta) - 5\theta - \frac{\theta}{\theta^{t}}$

$\,\frac{\partial E[l_{c}]}{\partial \theta} = \frac{2}{\theta} - 5 - \frac{1}{\theta^{t}} = 0$

$\,\frac{2}{\theta} = \frac{1}{\theta^{t}} + 5 = \frac{1 + 5 \theta^{t}}{\theta^{t}}$

$\,\theta^{t+1} = \frac{2 \theta^{t}}{1 + 5 \theta^{t}}$

%Matlab apporach for EM algorthm in example
theta = 12; % initial
theta_o = 11;
while abs(theta - theta_o) ~= 0 % check if it converages
theta_o = theta;
theta = 2*theta/(1+5*theta); % theta for t+1
end
theta %return theta


$\theta$ = 0.200

The value of $\theta$ from EM algorithm is the same as the value when we do likelihood.

#### Mixture of Gaussian

Gaussian Mixture Models (GMM) are using a multivariate Gaussian mixture density. Expectation Maximization (EM) algorithm is used to fit data and estimate parameters for GMMs. GMMs is often used for data clustering, and sometimes it is considered as a soft clustering method. Similar as k-means clustering, GMMs classify different clusters by using an iterative algorithm that converges to a local optimal. If the clusters have different size and correlations within them, GMM should be more appropriate than k-means clustering. GMMs are also commonly used as a parametric model of the probability distribution of continuous measurements or features in a bio-metric system.

A Gaussian mixture model is a weighted sum of M component Gaussian distributions,where

$p(x|\lambda) = \sum_{i=1}^{M} w_{i}N(x|\mu_{i},\Sigma_{i})$

where $w_{i}$, i = 1,...,M, are the mixture weights, and $N(x|\mu_{i},\sigma_{i}^{2})$, are component Gaussian distributions, since we have

$N(x|\mu_{i},\sigma_{i}^{2}) = \frac {1}{2\pi^{D/2}|\Sigma_{i}|^{\frac {1}{2}}}\exp (- \frac {1}{2}(x-\mu_{i})'\Sigma_{i}^{-1}(x-\mu_{i}))$

with mean vector $\mu_{i}$ and covariance matrix $\Sigma_{i}$. The mixture weights must satisfy the constraint that $\sum_{i=1}^{M} w_{i} = 1$.

Then expectation maximization algorithm is applied to estimate parameters of training data.

For instance, assume data is generated as mixture of two Normal Distribution

$\alpha N (x;\mu_{1},\sigma^{2}_{1}) + (1 - \alpha) N (x;\mu_{2},\sigma^{2}_{2})$

where:

$\,N (x;\mu_{1},\sigma^{2}_{1}) = \Phi_{1}(x)$,

$N (x;\mu_{2},\sigma^{2}_{2}) = \Phi_{2}(x)$.

So:

If $\,z = 1$ then we sample from $\,\Phi_{1}(x)$, with $\, P(z=1)= \alpha$.

If $\,z = 0$ then we sample from $\,\Phi_{2}(x)$, with $\, P(z=0)= 1- \alpha$.

Then we define the join distribution $\,\Phi (x) = \alpha^{z} (1-\alpha)^{1-z} \Phi_{1}(x)^{z} \Phi_{2}(x)^{1-z}$

$\,L_{c} (\theta;x,z) = \prod_{i=1}^{n} \alpha^{z_{i}} (1 - \alpha)^{(1 - z_{i})} \Phi_{1}(x_{i})^{z_{i}} \Phi_{2}(x_{i})^{1-z_{i}}$

$\,l_{c} (\theta;x,z) = \sum_{i=1}^{n} (z_{i} log(\alpha) + (1 - z_{i}) log(1 - \alpha) + z_{i} log(\Phi_{1}(x_{i})) + (1 - z_{i}) log(\Phi_{2}(x_{i})))$

Now we are going to maximize this expectation $\,E[l_{c} (\theta;x,z)]_{p(z|x;\theta)}$

$\,p(z|x;\theta) = \frac {p(x|z;\theta) p(z;\theta)}{\sum p(x|z;\theta) p(z;\theta)}$

More details will be taught in next lecture.

## EM algorithm and Mixture Model (Cont) (Lecture 17: Nov 10, 2014)

### Assignment 2 Solutions

Overall the second assignment was done better that the first one. The coding was done quite well (less crazy pictures than in the first assignment). More issues for the written questions.

### Mixture of Gaussian

Suppose that we have a mixture of 2 Gaussian models, such that $\alpha N(x;\mu_{1},\sigma^{2}_{1}) + (1-\alpha)N(x;\mu_{2},\sigma^{2}_{2})$. The general idea of this model is that we have a random variable Z which we can use to decide the distribution of the corresponding X. Let's suppose variable X is observed and variable Z is missing. We have $\, z=1$ or $\, z=0$ (Usually we assume Z has Bernoulli distribution with the probability function $\,P(z = 0)=(1-\alpha)$.

If $\,z = 1$ then we sample from $\,\Phi_{1}(x)$

If $\,z = 0$ then we sample from $\,\Phi_{2}(x)$

We also choose an arbitrary distribution $\, q(z|x)$.

$\, P(x_{i}|z_{i})=\begin{cases} N(x;\mu_{1},\sigma^{2}_{1}) = \Phi_{1}(x) & \mbox{if } z_{i}=1 \\ N(x;\mu_{2},\sigma^{2}_{2}) = \Phi_{2}(x) & \mbox{if } z_{i}=0 \end{cases}$

We can write the complete likelihood function : $\,L_{c} (x,z,\theta) = \prod_{i=1}^{n} P(x_{i},z_{i},\theta )$

So the log-likelihood is : $\,l_{c} (x,z,\theta) = \sum_{i=1}^{n} log P(x_{i},z_{i},\theta )$

We can write the join distribution of this model : $\,\Phi = \alpha^{z_{i}} (1-\alpha)^{1-z_{i}} \Phi_{1}^{z_{i}} \Phi_{2}^{1-z_{i}}$

Note that if $\, z_{i}=1$ then $\,\Phi =\Phi_{1}$ and if $\, z_{i}=0$ then $\,\Phi =\Phi_{2}$

Now we can compute the log-likelihood : $\,l_{c} (x,z,\theta) = \sum_{i=1}^{n}( z_{i}log(\alpha) + (1-z_{i})log(1-\alpha)+ z_{i}log(\Phi_{1}) + (1-z_{i})log(\Phi_{2}))$

Our goal is to maximize the expectation of the log-likelihood but to do that we have to know the expectation of $\, z_{i}$ corresponding to the distribution $\, q(z|x)$. The previous lecture concluded that the best choice for $\, q(z|x)$ is $\, p(z|x;\theta)$. Assuming that we know this and that $\, E[z_{i}]= w_{i}$ (This is the expectation of z under the distribution $\, p(z|x;\theta)$ .

We now have : $\,E[l_{c} (x,z,\theta)] = \sum_{i=1}^{n} (w_{i}log(\alpha) + (1-w_{i})log(1-\alpha)+ w_{i}log(\Phi_{1}) + (1-w_{i})log(\Phi_{2}))$

### M-Step

We want to maximize the expectation of the log-likelihood with respect to $\, \theta= ( \alpha , \mu_{1} , \mu_{2} , \sigma^{2}_{1} , \sigma^{2}_{2} )$ assuming that we know the expectation of $\ w_i$ (which is found in the previous E step). We do this by taking the derivative with respect to each component of $\ \theta$ and setting them equal to zero. This is case, since we are assuming we know the value of $\ w_i$, it is considered to be a constant with respect to each component of $\ \theta$.

• Find $\, \alpha$

$\, \frac{\partial E[l_{c}]}{\partial \alpha} = \sum_{i=1}^{n} \frac{w_{i}}{\alpha} - \frac{1-w_{i}}{1-\alpha} = 0$

$\, \sum_{i=1}^{n} \frac{(1-\alpha)w_{i} -\alpha (1-w_{i} ) }{\alpha (1 - \alpha)} = 0$

$\, \sum_{i=1}^{n} \frac{w_{i} -\alpha }{\alpha (1 - \alpha)} = 0$

$\, \sum_{i=1}^{n} w_{i} -\alpha = 0$

$\, \sum_{i=1}^{n} w_{i} =\sum_{i=1}^{n} \alpha = n\alpha$

Thus $\, \alpha = \frac{\sum_{i=1}^{n} w_{i}}{n}$

If we know $\, w_{i}$, we can find the optimal $\, \alpha$.

• Find $\, \mu_{1}$ and $\, \mu_{2}$

Recall : $\, \phi_{1}(x_{i})= \frac{1}{\sqrt{2\pi}\sigma_{1}}{e}^{\frac{-{( x_{i} - \mu_{1})}^{2} }{{2\sigma}^{2} } }$

So $\, w_{i} log( \Phi_{1} ) = w_{i} [ log ( \frac{1}{\sqrt{2 \pi} \sigma_{1} }) - \frac{ (x_{i} - \mu_{1} )^{2} }{2\sigma_{1}^{2} }]$

$\, \frac{\partial E[ l_{c} ] }{\partial \mu_{1}} = \sum_{i=1}^{n} w_{i} \frac{ (x_{i} - \mu_{1})}{\sigma_{1}^{2}} = 0$

$\, \sum_{i=1}^{n} \frac{ w_{i} x_{i} - w_{i} \mu_{1}}{\sigma_{1}^{2}} = 0$

$\, \sum_{i=1}^{n} w_{i} x_{i} - w_{i} \mu_{1} = 0$

$\, \sum_{i=1}^{n} w_{i} x_{i} = \sum_{i=1}^{n} w_{i} \mu_{1} =\mu_{1} \sum_{i=1}^{n} w_{i}$

Thus $\, \mu_{1} = \frac{ \sum_{i=1}^{n} w_{i} x_{i} }{ \sum_{i=1}^{n} w_{i}}$

We can do the same for $\, \mu_{2}$ and we obtain $\, \mu_{2} = \frac{ \sum_{i=1}^{n}(1 - w_{i}) x_{i} }{ \sum_{i=1}^{n} (1-w_{i})}$

• Find $\, \sigma_{1}$ and $\, \sigma_{2}$

Recall : $\, \phi_{1}(x_{i})= \frac{1}{\sqrt{2\pi}\sigma_{1}}{e}^{\frac{-{( x_{i} - \mu_{1})}^{2} }{{2\sigma}^{2} } }$

So $\, w_{i} log( \Phi_{1} ) = w_{i} [ log ( \frac{1}{\sqrt{2 \pi} \sigma_{1} }) - \frac{ (x_{i} - \mu_{1} )^{2} }{2\sigma_{1}^{2} }]$

$\, \frac{\partial E[ l_{c} ] }{\partial \sigma_{1}} =\sum_{i=1}^{n} \frac{-w_{i}\sqrt {2\pi}}{\sqrt {2\pi}\sigma_{1}} + \frac{2w_{i}{( x_{i} - \mu_{1})}^{2}}{{2\sigma}^{3}} = 0$

$\,\sum_{i=1}^{n} w_{i}{\sigma}^{2} = \sum_{i=1}^{n} w_{i} ( x_{i} - \mu_{1})^{2}$

$\, \sigma_{1} ^{2}= \frac{ \sum_{i=1}^{n} w_{i} (x_{i} - \mu_{1})^{2} }{ \sum_{i=1}^{n} w_{i}}$

and $\, \sigma_{2} ^{2}= \frac{ \sum_{i=1}^{n} (1- w_{i} )(x_{i} - \mu_{2})^{2} }{ \sum_{i=1}^{n}(1- w_{i}) }$

The M-Step allows us to compute $\, \theta= ( \alpha , \mu_{1} , \mu_{2} , \sigma^{2}_{1} , \sigma^{2}_{2} )$ if we assume that $\, w_{i}$ is known.

### E-Step

Note that according to Bayes rule: $P(y|x) = \frac{ P(x|y)P(x) }{ P(x) }$

The E-Step allows us to compute $\, w_{i}$ if we assume that $\, \theta= ( \alpha , \mu_{1} , \mu_{2} , \sigma^{2}_{1} , \sigma^{2}_{2} )$ is known.

Let's go back to the expectation of $\, z_{i}$. Recall $\, z_{i}=1$ or $\, z_{i}=0$.

$\, E(z_{i})=1*P(z_{i}=1|x;\theta) + 0*P(z_{i}=0|x;\theta)= P(z_{i}=1|x;\theta)$

$\, E(z_{i})= p(z_{i}=1|x;\theta) = \frac{ P(x | z_{i}=1;\theta)P(z_{i}=1) }{ P(x) }$ (by Bayes' theorem)

Note that $\, P(z_{i}=1) = \alpha$ and $\, E(z_{i})= w_{i}$.

So $\, w_{i} = \frac{ \alpha N(x_{i};\mu_{1},\sigma^{2}_{1}) }{ \alpha N(x_{i};\mu_{1},\sigma^{2}_{1}) + (1-\alpha) N(x_{i};\mu_{2},\sigma^{2}_{2}) }$

### Termination

Iteration will end if $l_{c}(x,z,\theta^{t})$ and $l_{c}(x,z,\theta^{t-1})$ are close enough.(Where t is the time of iteration.)

### Idea of EM-algorithm

The expectation–maximization (EM) algorithm is an iterative method for finding the maximum likelihood or maximum a posterior (MAP) estimates of parameters when missing data occurs. The E represent Expectation step and M represent Maximization step. The expectation step uses the current estimates for parameters to create a new function for the expectation of the log-likelihood evaluated. The maximization step uses the expected log-likelihood found in expectation step, and computes parameters which maximize the expected log-likelihood. Then these estimated parameters are used in the E step. The iteration ends until the solution converges. It could be done as follows:

1. We choose random initial values and we alternatively iterate the M-step and E-step.
2. If we start we a random $\, \theta$ , we do the E-Step and compute $\, w_{i}$.
3. Then we do the M-Step and compute $\, \theta$ and so on until it converges (= parameters do not change).

(In each iteration, the value of parameters will change. But the form of distribution q will not change) You can also think of EM algorithm in terms of graphical models where random variable x depends on random variable z. if z = 0 (like flipping a coin) then x is generated from normal distribution of $\, \mu_1$ and $\, \sigma_1$ and if $\, z = 1$, then x is generated from normal distribution of $\, \mu_2$ and $\, \sigma_2$. The following figure shows the graphical model representation.

EM algorithm on Gaussian Mixture model is a popular tool for clustering purposes or density estimation. Given an n-dimensional data set, one can start clustering with some reasonable initial guess about the number of clusters (which will be the number of Gaussian components) and relative covariance and mean for each of these clusters and then run the EM algorithm.

### Convergence of EM-algorithm

As the number of iterations of the M step and E step increase, the results will converge which makes this algorithm valid. Intuitively the likelihood function monotonically increases, so it will always have the maximum value. In this paper, proof is given in detail.

## Feature Selection (Lecture 18: Nov. 12, 2014)

### Introduction

Feature selection (variable selection or attribute selection) is a process for selecting a subset of existing features which best represent the data in a lower dimensional space. This is a similar but more generalized version of the concept feature extraction. Feature extraction is the process of generating new features from the features of the original data where as feature selection simply returns a subset of features. Feature selection is mostly used when the number of feature is too high (sometimes more than number of samples, e.g. DNA microarrays) and computation becomes too expensive when processing the data. Feature selection assumes that some of the features are redundant and cannot add any further information to the data, or some of the features are irrelevant, i.e. they do not have any useful information about the data (for example when you are processing the data of patients with heart disease, the patient ID is irrelevant)

### Mixture of Gaussian continued

Code for a small example of a mixed Gaussian model in Matlab:

x1 = mvnrnd([1;2]',eye(2),1000);
x2 = mvnrnd([3;4]',5*eye(2),1000);
x = [x1,;x2];
[p,mu,phi,Pxtr] = mdgEM(x,2,100,0.1)


This code forms a mixture of two Gaussian models, with 2 distinct means, then tries to use the EM algorithm to separate the points into two classes.

That was a more obvious example. What about the following circumstances?

Question: The data points are not Gaussian distributed.
Answer: Even if the data is not Gaussian distributed, we can still use mixture of Gaussian to do the clustering.

Question: The two Gaussian distributions have same center but with different variances, and the number of the data points in each are not the same
Answer: If the number of data points are not the same, the results of the EM method above will assume that the number of data in two clusters are the same. The algorithm works decently well as a density modelling function when the centers are the same, however this is not useful for clustering.

### Definitions and Relation to Previous Works

A Feature is defined as a prominent characteristic. In the context of statistics, features are vectors representing aspects (attributes) of the data. Feature selection is the process of judging the importance of features and selecting features that are most relevant. Principal components can be associated with features of the data with most variation. This leads to the idea of PCA-based feature selection and sparse PCA-based feature selection, where we use PCA and sparse PCA to extract the principle components (features).

### Unsupervised Feature Selection

A criterion for feature selection is to minimize the error of the reconstructed data matrix based on a set of selected features. This is called Unsupervised Feature Selection. The following has been summarized using the following pdf, [36].

Definition: Let A be a m X n matrix whose rows represent the set of data instances and whose columns represent the set of features, S a set of features, and $P^{(s)}$ the projection of A onto the columns of S

The unsupervised feature selection criterion is $\, F(s) = \| A - P^{(S)} A\|_F^2$

#### Problem 1

Find a subset of features $\,L$ such that:

$\,L={arg min}_{S}F(S)$

where

$\, F(s) = \| A - P^{(S)} A\|_F^2$ is the feature selection criterion and

$\,P^{(s)} = A_{:S}(A_{:S}^TA{:_S})^{-1}A_{:S}^T$ is the projection matrix

Proof

Approximate $\,A$ as $\,A_{:S} T$

Given $\,A_{:S}$ solve the following least square problem for $\,T$

$\, \text{min} ||A - A_{:S}T||_F$

$\,T = (A_{:S}^T A_{:S})^{-1} A_{:S}^T A$

$\,\tilde{A} = A_{:S}(A_{:S}^T A_{:S})^{-1} A_{:S}^T A$

### Recursive Selection Criterion

A recursive formula for feature selection criterion can be developed. This formula is based on a recursive formula for the projection matrix $P^{(S)}$

#### Lemma 1

Given a set of features, for any S, P a subset of S we have the following;

$\, P^{(S)} = P^{(P)} + R^{(R)}$

where,

$\, E = A - P^{(P)}A$

$\, R = S \setminus P$

$\, R^{(R)}$ is a projection matrix which projects the columns of E onto the span of the subset R of columns:

$\, R^{(R)} = E_{:R}(E_{:R}^TE_{:R})^{-1}E_{:R}^T$

Proof

We define the following matrices:

$\, P^{(S)} = A_{:S}(A_{:S}^{T}A_{:S})^{-1}A_{:S}^{T}$
$\, A_{:S} = \begin{bmatrix} A_{:P} & A_{:R} \end{bmatrix}$
$\, B = A_{:S}^{T}A_{:S}$

B can also be written as:

$\, B = \begin{bmatrix} B_{PP} & B_{PR} \\ B_{PR}^{T} & B_{RR} \end{bmatrix}$
where $\ B_{PP} = A_{:P}^{T}A_{:P},$ $\ B_{PR} = A_{:P}^{T}A_{:R},$ $\ B_{RR} = A_{:R}^{T}A_{:R}$

Let S be the Schur complement of $\, B_{PP}$ in B. $\, S = B_{RR} - B_{PR}^{T}B_{PP}^{-1}B_{PR}$

Now we compute the following:

$\, P^{(S)} = \begin{bmatrix} A_{:P} & A_{:R} \end{bmatrix} \begin{bmatrix} B_{PP}^{-1}+B_{PP}^{-1}B_{PR}S^{-1}B_{PR}^{T}B_{PP}^{-1} & -B_{PP}^{-1}B_{PR}S^{-1} \\ -S^{-1}B_{PR}^{-1}B_{PP}^{-1} & S^{-1} \end{bmatrix} \begin{bmatrix} A_{:P}^{T} & A_{:R}^{T} \end{bmatrix}$

$\, P^{(S)} = A_{:P}B_{PP}^{-1}A_{:P}^{T} + (A_{:R}-A_{:P}B_{PP}^{-1}B_{PR})S^{-1}(A_{:R}^{T}-B_{PR}^{T}B_{PP}^{-1}A_{:P}^{T})$

We can see that:

$\, P^{(P)} = A_{:P}B_{PP}^{-1}A_{:P}^{T}$

$\, E_{:R} = A_{:R}-A_{:P}B_{PP}^{-1}B_{PR}$

$\, S^{-1} = (E_{:R}^{T}E_{:R})^{-1}$

By substitution we can see that:

$\, P^{(S)} = P^{(P)} + E_{:R}(E_{:R}^{T}E_{:R})^{-1}E_{:R}^{T}$

And so,

$\, P^{(S)} = P^{(P)} + R^{(R)}$

#### Corrollary 1

Given a matrix A and a subset of columns S. For any P⊂S

$\,\tilde{A}_{S}=\tilde{A}_P+\tilde{E}_R$

$\,\tilde{A}_{S}=P^{(S)}A$

$\,E=A-P^{(P)}A$

$\,\tilde{E}_{:R}=E_{:R}(E_{:R}^TE_{:R})^{-1}E_{:R}^TE$

##### Proof

Using Lemma 1, we compute:

$\, \tilde{A}_{S} = P^{P}A + E_{:R}(E_{:R}^TE_{:R})^{-1}E_{:R}^TA$

The first term is the low-rank approximation of A based on $\, \tilde{A}_P = P^{(P)}A$.

The second term is equal to $\, \tilde{E}_{R}$ as $\, E_{:R}^{T}A = E_{:R}^{T}E$.

We can show this by multiplying $\, E_{:R}^{T}$ by E. This gives:

$\, E_{:R}^{T}E = E_{:R}^{T}A - E_{:R}^{T}P^{(P)}A$

Using $\, E_{:R} = A_{:R} - P^{(P)}A_{:R}$ , we can write:

$\, E_{:R}^{T}P^{(P)} = A_{:R}^{T}P^{(P)} - A_{:R}^{T}P^{(P)}P^{(P)}$

This expression is equal to zero since $\, P^{(P)}P^{(P)} = P^{(P)}$

This means that $\, E_{:R}^{T}E = E_{:R}^{T}A$, thus proving that the second expression is equal to $\, \tilde{E}_R$ and so completing our proof.

#### Theorem 1

Given a set of features, for any S, P a subset of S we have the following:

$\, F(S) = F{(P)} - \| E_{R} \|_F^2$

where,

$\, E = A - P^{(P)}A$

$\, R = S \setminus P$

$\, E_R = E_{:R}(E_{:R}^TE_{:R})^{-1}E_{:R}^TE$

##### Proof

$\, F(S) = \| A - \tilde{A}_{S} \|_{F}^{2} = \| A - \tilde{A}_{P} - \tilde{E}_{R} \|_{F}^{2} = \| E - \tilde{E}_{R} \|_{F}^{2}$

We can rewrite this expression with the trace function,

$\, \| E - \tilde{E}_{R} \|_{F}^{2} = trace((E - \tilde{E}_{R})^{T}(E - \tilde{E}_{R}))$

$\, =\gt trace(E^{T}E - 2E^{T}\tilde{E}_{R} + \tilde{E}_{R}^{T}\tilde{E}_{R})$

As $\, R^{(R)}R^{(R)} = R^{(R)}$, we can write the following expression,

$\, \tilde{E}_{R}^{T}\tilde{E}_{R} = E^{T}R^{(R)}R^{(R)}E = E^{T}R^{(R)}E = E^{T}\tilde{E}_{R}$

This results in:

$\, F(S) = \| E - \tilde{E}_{R} \|_{F}^{2} = trace(E^{T}E - \tilde{E}_{R}\tilde{E}_{R}) = \| E \|_{F}^{2} - \| \tilde{E}_{R} \|_{F}^{2}$

We just replace $\, \| E \|_{F}^{2}$ with $\, F(P)$ to prove the theorem.

### Greedy Selection Criterion

#### Problem 2

At iteration t, find feature l such that,

$\, l={arg min}_iF(S \cup {i})$

S is the set of the features selected during the first t-1 iterations.

This implementation is of order $O(m^2n^2)$ floating-point operations per iteration. This is not very efficient for large data sets. A more efficient approach is to use the recursive formula from Theorem 1 for calculating the reconstruction error.

$F (S \cup \{i\}) = F (S) - \|\tilde{E}_{\{i\}}\|^2_F$

where

$\, E= A - \tilde{A}_s$

$\,l=arg max \|\tilde{E}_{\{i\}}\|^2_F$

$\, \|\tilde{E}_{\{i\}}\|^2_F=trace(\tilde{E}_{\{i\}}^T\tilde{E}_{\{i\}})=trace(E^TR^{({i})}E)$

$\, =trace(E^TE_{:i}(E^T_{:i}E_{:i})^{-1}E^T_{:i}E)$

$\, =\frac{1}{E^T_{:i}E_{:i}}trace(E^TE_{:i}E^T_{:i}E)=\frac{\|E^TE_{:i}\|^2}{E^T_{:i}R_{:i}}$

#### Problem 3

At interation t, find feature l such that,

$\, l=\frac{\|E^TE_{:i}\|^2}{E^T_{:i}R_{:i}}$

where

$\, E= A - \tilde{A}_s$

S is the set of the features selected during the first t-1 iterations

#### Problems With Greedy Feature Selection

Greedy feature selection can be very calculation intensive at each iteration. This makes it memory intensive and computationally complex, making its run time extremely long

#### Problem 4 (Partition based Greedy Feature Selection)

See Efficient Greedy Feature Selection for Unsupervised Learning paper for more details.

At interaction t, find feature l such that,

$\, l=\frac{\|F^TE_{:i}\|^2}{E^T_{:i}E_{:i}}$

where

$\, E= A - \tilde{A}_s$

S is the set of set of features selected during the first t-1 iterations

$\, F_{:j}= \sum_{r \in P_j}E_{:r}$

$\, P=\{P_1,P_2,...,P_c\}$ is the partition of features into c groups

solution: Determine the best feature to represent the centroids of each of the group. Update the formulas for $f$ and $g$

### Conclusion

This work presents a novel greedy algorithm for unsupervised feature selection. It shows a feature selection criterion which measures the reconstruction error of the data matrix based on the subset of selected features and forms a recursive formula for calculating the feature selection criterion.We compute an efficient greedy algorithm for the feature selection, and two memory and time efficient variants.

We have shown that the proposed algorithm achieves better clustering performance and is less computationally demanding than the methods that give comparable clustering performances.

### Supervised Feature Selection

Like we introduced Supervised PCA based on PCA, there is also a supervised version of Feature Selection. Supervised Feature Selection also use the Hilbert-Schmidt independence criterion (HSIC) to measure the dependence between features and labels.Song et al proposed this method in 2007.

### Gap statistics

Gap statistics [37] is a method of choosing the optimal number of clusters for a set of data. The idea is to divide the data in to n clusters and compute the within class distance of the n clusters k-means. As the number of clusters (n) increases, the within class distance shrinks because points are clustered with more similar points.

At a certain point, increasing the number of cluster won't cause a noticeable decrease in the within class distances because there are enough clusters to accommodate the difference in data and there is no merit in adding more. To determine the optimal number of clusters, graph the number of clusters vs the within cluster sum of squares. Generally the point at which the elbow occurs in the plot is the number of clusters you should use.

### Affinity Propagation

Suppose S is a similarity measure. Affinity propagation minimizes the cost function:

$S(c)=\sum^N_{i=1}S(i,c_i)+\sum\delta_k(c)$

The first sum tells us if c is a good measure based on the other measures and the second sum tells us if c is a good measure based on itself.

$\delta_k(c)=\begin{cases} -\infty & \mbox{if } c_k \neq k \ but \ \exists{i} \ c_i=k \\ 0 & \mbox{otherwise } \end{cases}$

Essentially by minimizing the above cost function, we find the best "representative" of each cluster of data points by looking at which point does each individual point believe is the best representation for themselves. However the point that has been selected as the representative must also select itself. Otherwise the second term will severely punish the cost function.

## Learning Affine Transformations for Nonlinear Dimensionality Recuction (Lecture 19: Nov. 27, 2014)

### Affine Transformation

In PCA, it was quite easy to compute out of sample points (test set). However, this is not the case for nonlinear transformation (e.g. LLE) and there is no exact solution, only an approximation method. We address this problem by introducing a transformation that handles the nonlinear cases. Affine Transformation preserves the local distances between neighbors while pulling non-neighbors as far apart as possible. Affine transformation also do a good job dealing with out sample points which fails in Isomap and LLE method.

To unfold non-linear manifold, we define two different sets:

S = {(i,j)|$x_{i}$ and $x_{j}$ are neighbours}

O = {(i,j)|$x_{i}$ and $x_{j}$ are non neighbours}

$y = W^{T}x$, where W is our transformation matrix.

Since Distance of neighbouring points fixed,

$\sum_{(i,j) \in S} (\|y_{i} - y_{j}\|^{2} - \tau_{ij}^{2})^{2}$

Normalize:

$Err = \sum_{(i,j) \in S} (\| \frac {y_{i} - y_{j}} {\tau_{ij}} \| - 1)^{2}$

We try to minimize our Err function, since $y = W^{T}x$, we have

$Err = \sum_{(i,j) \in S} ((\frac {x_{i} - x_{j}}{\tau_{ij}})^{T}WW^{T}(\frac {x_{i} - x_{j}}{\tau_{ij}}) - 1)^{2}$

$= \sum_{(i,j) \in S} (\delta_{ij}^{T}A\delta_{ij} - 1)^{2}$

where $\delta_{ij} = \frac {x_{i} - x_{j}} {\tau_{ij}}$, and A = $WW^{T}$ which is positive semi-definite.

Let $\delta_{ij}^{T}A\delta_{ij} = vec(A)^{T} vec(\delta_{ij}\delta_{ij}^{T}) = vec(A)^{T}vec(\Delta_{ij})$

$vec(A) = vech(A)^{T}Qvech(A) - 2vech(A)^{T}p + |s|$, where $Q = \sum_{(i,j) \in S} \zeta_{ij}\zeta_{ij}^{T}$ and $p = \sum_{(i,j) \in S} \zeta_{ij}$

$vech = \alpha u + \beta \bar{u}$, where u is in the space S and $\bar{u}$ is in null space

then we want to pull non neighbours as far away as possible,

$max_{A \ge 0} \beta^{T} \bar(u)^{T} s$, where $s = \sum_{(i,j) \in O} \zeta_{ij}$

Kernelizing: (Let $W = X\Omega$ and $A_{\phi} = \Omega\Omega$)

$\|y_{i} - y_{j}\|^{2} = (x_{i} - x_{j})^{T}WW^{T}(x_{i} - x_{j})$

$= (X^{T}x_{i} - X^{T}x_{j})A_{\phi}(X^{T}x_{i} - X^{T}x_{j})$

$= (K(X,x_{i}) - K(X,x_{j}))A_{\phi}(K(X,x_{i}) - K(X,x_{j}))$

Embedding by Affine Transformation.

Given training set of n d-dimensional points $x_{i} \in R^{d}$ we wish to learn a d by d transformation matrix that will unfold the underlying manifold and embed into $y_{i}$ by:

$\mathbf Y = W^T X$

### Patch Alignment

2 clusters may overlap with each other in the original space. The end points, in the original dimensions, are connected and form a smooth curve. This property is not guaranteed when we map the data onto lower dimensions using some kernels. To connect the end points, rotation and translation matrices are used. The goal is to find the rotation and translation matrices that minimize the matching error of the two patches. Sometimes when we plot the data, we may find a "hole" in it. The "hole" means the data set is not convex which means you can not apply LLE and Isomap method on it. And at this time, Patch Alignment method could be considered.

The intuition behind Patch Alignment is that by preprocessing data into over lapping clusters, dimensionality reduction techniques can be sped up by calculating the results on each cluster. Then, various techniques are used to smooth the edges between clusters, creating a final clean output.