# stat946s13

## Contents

## Sing up for paper presentation

Sign up for your presentation in the following table. Put your name and a link to the paper that you are going to present. Chose a date between July 16 and July 30 (inclusive).

Date | Presentation (1) | Presentation (2) |

July 16 | Ahmed Elgohary [1] Summary | name [link], Summary |

July 23 | Jiaxi Liang [2] Summary | Lu Xin [link], Summary |

July 25 | Fahmida Homayra[link]Summary | M.Hassan Z.Ashtiani [link], Summary |

July 30 | Huan Cheng[link]Summary |

## Set B

Name | Second paper (The paper that you are going to write a critic on it. This is different from the paper that you have chosen for presentation.) |

Ali Ghodsi (This is an example | A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization [3] Summary |

## paper summaries

## Proposal for Final Project

# Background

## Introduction

Manifold learning is a significant problem across a wide variety of information processing fields including pattern recognition, data compression, machine learning, and database navigation. In many problems, the measured data vectors are high-dimensional but we may have reason to believe that the data lie near a lower-dimensional manifold. In other words, we may believe that high-dimensional data are multiple, indirect measurements of an underlying source, which typically cannot be directly measured. Learning a suitable low-dimensional manifold from high-dimensional data is essentially the same as learning this underlying source.

Dimensionality reduction <ref>In this thesis ‘manifold learning’ and ‘dimensionality reduction’ are used interchangeably. </ref> can also be seen as the process of deriving a set of degrees of freedom which can be used to reproduce most of the variability of a data set. Consider a set of images produced by the rotation of a face through different angles. Clearly only one degree of freedom is being altered, and thus the images lie along a continuous curve through image space.

Manifold learning techniques can be used in different ways including:

- : Produces a compact low-dimensional encoding of a given high-dimensional data set.
- : Provides an interpretation of a given data set, usually as a by-product of data dimensionality reduction.
- : Unsupervised methods for data dimensionality reduction are used as a preprocessing step in order to simplify subsequent training of a supervised method such as classification.

Many algorithms for dimensionality reduction have been developed to accomplish these tasks. However, since the need for such analysis is raised in many areas of study, contributions to the field have come from many disciplines. While all of these methods have a similar goal, approaches to the problem are different.

Principal components analysis (PCA) is a classical method which provides a sequence of best linear approximations to a given high-dimensional observation. It is one of the most popular techniques for dimensionality reduction. However, its effectiveness is limited by its global linearity. Multidimensional scaling (MDS) , which is closely related to PCA, suffers from the same drawback. Factor analysis and independent component analysis (ICA) also assume that the underling manifold is a linear subspace. However, they differ from PCA in the way they model the subspace.

The subspace modeled by PCA captures the maximum variability in the data, and can be viewed as modeling the covariance structure of the data, whereas factor analysis models the correlation structure. ICA starts from a factor analysis solution and searches for rotations that lead to independent

components . In order to resolve the problem of dimensionality reduction in nonlinear cases, many techniques including kernel PCA , locally linear embedding (LLE) , Laplacian Eigenmaps Method (LEM) , Isomap , and Semidefinite Embedding (SDE) have been proposed.

This chapter provides a brief overview of these different approaches and shows their close connection. Section 2 of this chapter explains Principal components analysis which is the core of many other techniques. In Section 3, kernel PCA, a recent extension to PCA, is discussed. Locally linear embedding, a new and very popular algorithm, is reviewed in Section 4. Section 5 explains Laplacian Eigenmaps. Multidimensional scaling and its recent extension, Isomap, are discussed in Sections 6 and 7 respectively. The last section discusses Semidefinite Embedding, a new approach to dimensionality reduction based on Semidefinite programming.

## Principal components analysis

Principal components analysis (PCA) is a very popular technique for dimensionality reduction. Given a set of data on [math]n[/math] dimensions, PCA aims to find a linear subspace of dimension lower than [math]n[/math] such that the data points lie mainly on this linear subspace. Such a reduced subspace attempts to maintain most of the variability of the data.

The linear subspace can be specified by [math]d[/math] orthogonal vectors which form a new coordinate system and are called the ‘principal components’. The principal components are orthogonal, linear transformations of the original data points, so there can be no more than [math]n[/math] of them. However, the hope is that only [math]d \lt n[/math] principal components are needed to approximate the space spanned by the [math]n[/math] original axes.

The most common definition of PCA, due to Hotelling , is that, for a given set of data vectors [math]{x_i}[/math], [math]i \in {1 ... t}[/math], the [math]d[/math] principal axes are those orthonormal axes onto which the variance retained under projection is maximal.

In order to capture as much of the variability as possible, let’s choose the first principal component, denoted by [math]U_1[/math], to have maximum variance. Suppose that all centered observations are stacked into the columns of an [math]n \times t[/math] matrix [math]X[/math], where each column corresponds to an [math]n[/math] dimensional observation and there are [math]t[/math] observations. Let principal component be a linear combination of [math]X[/math] defined by coefficients (or weights) [math]W=[w_1 ... w_t][/math]. In matrix form:

[math]U_1=W^T X[/math] [math]var(U_1)=var(W^T X)= W^T S W[/math]

where [math]S[/math] is the [math]t \times t[/math] sample covariance matrix of [math]X[/math].

Clearly [math]var(U_1)[/math] can be made arbitrarily large by increasing the magnitude of [math]W[/math]. Therefore, we choose [math]W[/math] to maximize [math] W^T S W [/math] while constraining [math]W[/math] to have unit length.

[math]\begin{align} \max~W^T S W\\ subject~to~W^T W = 1\end{align}[/math]

To solve this optimization problem a Lagrange multiplier [math]\alpha_1[/math] is introduced:

[math]\begin{align} L(W, \alpha)= W^T S W - \alpha_1 ( W^T W - 1) \label{1}\end{align}[/math]

Differentiating with respect to [math]W[/math] gives [math]t[/math] equations,

[math]SW= \alpha_1W[/math]

Premultiplying both sides by [math]W^T[/math] we have:

[math]W^T S W = \alpha_1 W^T W = \alpha_1[/math]

[math]var(U_1)[/math] is maximized if [math]\alpha_1[/math] is the largest eigenvalue of [math]S[/math].

Clearly [math]\alpha_1[/math] and [math]W[/math] are an eigenvalue and an eigenvector of [math]S[/math]. Differentiating ([1]) with respect to the Lagrange multiplier [math]\alpha_1[/math] gives us back the constraint:

[math]W^T W =1[/math]

This shows that the first principal component is given by the normalized eigenvector with the largest associated eigenvalue of the sample covariance matrix [math]S[/math]. A similar argument can show that the [math]d[/math] dominant eigenvectors of covariance matrix [math]S[/math] determine the first [math]d[/math] principal components.

Another nice property of PCA, closely related to the original discussion by Pearson , is that the projection onto the principal subspace minimizes the squared reconstruction error, [math]\sum_{i=1}^t ||x_i- \hat{x}_i||^2[/math]. In other words, the principal components of a set of data in [math]\Re^n[/math] provide a sequence of best linear approximations to that data, for all ranks [math]d \le n[/math]

Consider the rank-[math]d[/math] linear approximation model as :

[math]f(y) =\bar{x} + U_d y[/math]

This is the parametric representation of a hyperplane of rank [math]d[/math].

For convenience, suppose [math]\bar{x} = 0[/math] (otherwise the observations can be simply replaced by their centered versions [math]\tilde {x} = x_i - \bar{x}[/math]). Under this assumption the rank [math]d[/math] linear model would be [math]f(y) = U_d y[/math], where [math]U_d[/math] is a [math]n \times d[/math] matrix with [math]d[/math] orthogonal unit vectors as columns and [math]y[/math] is a vector of parameters. Fitting this model to the data by least squares leaves us to minimize the reconstruction error:

[math]\begin{align} \min_{U_d,y_i} \;\sum_i^{t} || x_i -U_dy_i||^2\end{align}[/math]

By partial optimization for [math]y_i[/math] we obtain:

[math]\frac{d}{d y_i} \Rightarrow y_i=U_d ^T x_i[/math]

Now we need to find the orthogonal matrix [math]U_d[/math]:

[math]\begin{align} \min_{U_d} \; \sum_i^{t} || x_i -U_dU_d^T x_i||^2\end{align}[/math]

Define [math]H_d= U_d U_d^T [/math]. [math]H_d[/math] is a [math]n \times n[/math] matrix which acts as a projection matrix and projects each data point [math]x_i[/math] onto its rank [math]d[/math] reconstruction.

In other words, [math]H_d x_i[/math] is the orthogonal projection of [math]x_i[/math] onto the subspace spanned by the columns of [math]U_d[/math]. A unique [math]H^+[/math] solution can be obtained by finding the pseudo inverse of [math]X[/math], denoted as [math]X^+[/math].

[math]H^+ = X^+ X[/math] [math]X= U \Sigma V^T[/math] [math]X^+ = V \Sigma^+ U^T[/math] [math]H^+= U \Sigma V^T V \Sigma^+ U^T =UU^T[/math] For each rank [math]d[/math], [math]U_d[/math] consists of the first [math]d[/math] columns of [math]U[/math].

[h] [alg 1]

Clearly the solution for [math]U[/math] can be expressed as singular value decomposition (SVD) of [math]X[/math]. [math]X=U \Sigma V^T[/math] since the columns of [math]U[/math] in the SVD contain the eigenvectors of [math]XX^T[/math]. The PCA procedure is summarized in Algorithm 1.

## Kernel PCA

Through the use of kernels, principle components can be computed efficiently in high-dimensional feature spaces that are related to the input space by some nonlinear mapping. PCA is an orthogonal transformation of the coordinate system in which we describe our data.

Kernel PCA finds principal components which are nonlinearly related to the input space. PCA can be formulated entirely in terms of dot products between data points. In kernel PCA, this dot product is replaced by the inner product of a Hilbert space. This is equivalent to performing PCA in the space produced by the nonlinear mapping, where the low-dimensional latent structure is, hopefully, easier to discover.

Consider a feature space [math]\mathcal{H}[/math] such that:

[math]\Phi : x \rightarrow \mathcal{H} , x \mapsto \Phi(x)[/math]

Suppose [math]\sum_i^t \Phi(x_i) = 0 [/math] (we will return to this point and show how this condition can be satisfied in Hilbert space).

This allows us to formulate the kernel PCA objective as follows:

[math]\min~ \sum_i^t || \Phi(x_i)- U_qU_q^T \Phi(x_i) ||[/math]

By the same argument used for PCA, the solution can be found by SVD: [math]\Phi(X) = U \Sigma V^T[/math]

where [math]U[/math] contains the eigenvectors of [math]\Phi(X) \Phi(X)^T[/math]

However, the singular value decomposition allows us to do much more than simply rederive the principle components algorithm. In fact, given the matrices [math]\Sigma[/math] and [math]V[/math], one can derive a *dual* form of principle components analysis which allows us to limit the direct dependence on the original dimensionality [math]n[/math], via the kernel trick.

Assume that the dimensionality [math]n[/math] of the [math]n\times t[/math] matrix of data [math]X[/math] is large (i.e. [math]n\gt \gt t[/math]). In this case, Algorithm 1 is impractical. We would prefer a run time that depends only on the number of training examples [math]t[/math], or that at least has a reduced dependence on [math]n[/math].

To reduce the dependence on [math]n[/math], first assume that we have a kernel [math]k(\cdot,\cdot)[/math] that allows us to compute [math]k(x,y)=x^\top y[/math]. Given such a function, we can then compute the matrix [math]X^\top X =K[/math], such that [math]k_{ij}=k(x_i,x_j)[/math]. Let [math][X^\top X][/math] denote the fact that we could compute the matrix [math]X^\top X[/math] efficiently using the kernel trick.

The eigenvectors in [math]U[/math] corresponding to nonzero singular values in [math]\Sigma[/math] (square roots of eigenvalues) are in a one-to-one correspondence with the eigenvectors in [math]V[/math].

Now assume that we perform dimensionality reduction on [math]U[/math] and keep only the first [math]d[/math] eigenvectors, corresponding to the top [math]d[/math] nonzero singular values in [math]\Sigma[/math]. These eigenvectors will still be in a one-to-one correspondence with the first [math]d[/math] eigenvectors in [math]V[/math]: [math]X\;V \;\;=\;\; U\;\Sigma[/math] where the dimensions of these matrices are: [math]\begin{array}{cccc} X & U & \Sigma & V \\ n\times t & n\times d & d\times d & t\times d \\ & & \mbox{diagonal} \end{array}[/math] Crucially, [math]\Sigma[/math] is now square and invertible, because its diagonal has nonzero entries. Thus, the following conversion between the top [math]d[/math] eigenvectors can be derived:

[math]\begin{align} U&=&X\;V\;\Sigma^{-1}\end{align}[/math]

Replacing all uses of [math]U[/math] in Algorithm 1 with [math]XV\Sigma^{-1}[/math] gives us the dual form of PCA, Algorithm 2 (see Figure [alg 2]).

[h] [alg 2]

In the derivation of the kernel PCA we assumed that [math]\Phi(x)[/math] has zero mean. The following normalization of the kernel satisfies this condition.

[math]\tilde{k}(x,y)= k(x,y)-E_x[k(x,y)]-E_y[k(x,y)]+E_x[E_y[k(x,y)]][/math]

In order to prove that, define:

[math]\tilde{\Phi}(X) = \Phi(X) - E_x[\Phi(X)][/math]

Finally, the corresponding kernel is:

[math]\tilde{k}(x,y)=\tilde{\Phi}(x) \tilde{\Phi(y)}[/math]

This expands as follows:

[math]\tilde{k}(x,y)= (\Phi(x) - E_x[\Phi(x)]) . (\Phi(y) - E_y[\Phi(y)])[/math] [math]=k(x,y)-E_x[k(x,y)]-E_y[k(x,y)]+E_x[E_y[k(x,y)]][/math]

## Locally Linear Embedding

[LLE] Locally linear embedding (LLE), computes low-dimensional, neighborhood preserving embedding of high-dimensional data. A data set of dimensionality [math]n[/math], which is assumed to lie on or near a smooth nonlinear manifold of dimensionality [math]d \lt n[/math], is mapped into a single global coordinate system of lower dimensionality, [math]d[/math]. The global nonlinear structure is recovered by locally linear fits.

Consider [math]t[/math] [math]n[/math]-dimensional real-valued vectors [math]x_i[/math] sampled from some underlying manifold. We can assume each data point and its neighbors lie on, or are close to, a locally linear patch of the manifold. By a linear mapping, consisting of a translation, rotation, and rescaling, the high-dimensional coordinates of each neighborhood can be mapped to global internal coordinates on the manifold. Thus, the nonlinear structure of the data can be identified through two linear steps: first, compute the locally linear patches, and second, compute the linear mapping to the coordinate system on the manifold.

The main goal here is to map the high-dimensional data points to the single global coordinate system of the manifold such that the relationships between neighboring points are preserved. This proceeds in three steps:

- Identify the neighbors of each data point [math]x_i[/math]. This can be done by finding the [math]k[/math] nearest neighbors, or by choosing all points within some fixed radius, [math]\epsilon[/math].
- Compute the weights that best linearly reconstruct [math]x_i[/math] from its neighbors.
- Find the low-dimensional embedding vector [math]y_i[/math] which is best reconstructed by the weights determined in the previous step.

After finding the nearest neighbors in the first step, the second step must compute a local geometry for each locally linear patch. This geometry is characterized by linear coefficients that reconstruct each data point from its neighbors.

[math]\min_{W}\sum_{i=1}^{t}||\mathbf{x}_{i}-\sum_{j=1}^{k}W_{ij}\mathbf{x}_{N_{i}(j)}||^{2}[/math] where [math]N_{i}(j)[/math] is the index of the [math]j[/math]th neighbor of the [math]i[/math]th point. It then selects code vectors so as to preserve the reconstruction weights by solving

[math]\min_{Y}\sum_{i=1}^{t}||\mathbf{y}_{i}-\sum_{j=1}^{k}W_{ij}\mathbf{y}_{N_{i}(j)}||^{2}[/math] This objective can be reformulated as

[math]\begin{align} \min_{Y}\textrm{Tr}(Y^{T}YL) \label{LLECOST}\end{align}[/math]

where [math]L=(I-W)^{T}(I-W)[/math].

The solution for [math]Y[/math] can have an arbitrary origin and orientation. In order to make the problem well-posed,these two degrees of freedom must be removed. Requiring the coordinates to be centered on the origin ([math]\sum_i y_i =0 [/math]), and constraining the embedding vectors to have unit covariance ([math]Y^T Y=I[/math]), removes the first and second degrees of freedom respectively.

The cost function can be optimized initially by the second of these two constraints. Under this constraint, the cost is minimized when the columns of [math]Y^T[/math] (rows of [math]Y[/math]) are the eigenvectors associated with the lowest eigenvalues of [math]L[/math].

Discarding the eigenvector associated with eigenvalue 0 satisfies the first constraint.

## Laplacian Eigenmaps

[LEM] Given [math]t[/math] points in [math]n[/math]-dimensional space, Laplacian Eigenmaps Method (LEM) starts by constructing a weighted graph with [math]t[/math] nodes and a set of edges connecting neighboring points. Similar to LLE, the neighborhood graph can be constructed by finding the [math]k[/math] nearest neighbors, or by choosing all points within some fixed radius [math]\epsilon[/math]. For weighting the edges, there are two variations: either each edge is weighted by [math]W_{ij}=e ^{-\frac{||x_i-x_j||^2}{s}}[/math], where [math]s[/math] is a free parameter which should be chosen a priori, or simply all [math]W_{ij}[/math] is set to [math]1[/math] if vertices [math]i[/math] and [math]j[/math] are connected. The embedding map is then provided by the following objective

[math]\min_{Y}\sum_{i=1}^{t}\sum_{j=1}^{t}(\mathbf{y}_{i}-\mathbf{y}_{j})^{2}W_{ij}[/math] subject to appropriate constraints. This objective can be reformulated as

[math]\min_{Y} \textrm{Tr} (Y L Y^{T})[/math] where [math]L=R-W[/math], [math]R[/math] is diagonal, and [math]R_{ii}=\sum_{_{j=1}}^{t}W_{ij}[/math]. This [math]L[/math] is called the *Laplacian function*. Similar to [LLECOST], after adding orthogonality and centering constraint, a solution to this problem can be found by making [math]Y[/math] to be the eigenvectors of [math]L[/math](non-normalized solution). As an alternative, [LLECOST] can be constrained to [math] Y^T L Y=I[/math]. In this case , the solution is provided by the eigenvectors of the generalized eigenvalue problem [math]My =\lambda D y[/math] (normalized solution). Note that the final objectives for both LEM and LLE have the same form and differ only in how the matrix [math]L[/math] is constructed. Therefore, same closed form solution (taking [math]Y[/math] to be the eigenvectors of [math]L[/math]) works.

## Multidimensional scaling (MDS)

[MDS] Multidimensional scaling (MDS) is another classical approach that maps the original high dimensional space to a lower dimensional space that preserves pairwise distances. MDS addresses the problem of constructing a configuration of [math]t[/math] points in Euclidean space by using information about the distances between the [math]t[/math] patterns.

A [math]t \times t[/math] matrix [math]D[/math] is called a distance or affinity matrix if it is symmetric, [math]\verb"d"_{ii} = 0 [/math], and [math]~~\verb"d"_{ij} \gt 0, ~~ i \neq j[/math].

Given a distance matrix [math]D[/math], MDS attempts to find [math]t[/math] data points [math]y_1, ..., y_t[/math] in [math]d[/math] dimensions, such that if [math]\hat{d}_{ij}[/math] denotes the Euclidean distance between [math]y_i[/math] and [math]y_j[/math], then [math]\hat{D}[/math] is similar to [math]D[/math]. In particular, we consider metric MDS , which minimizes

[math]\begin{align} \min_Y \sum_{i=1}^t \sum_{i=1}^t (\verb"d"_{ij}^{(X)}-\verb"d"_{ij}^{(Y)})^2 \label{20}\end{align}[/math]

where [math]\verb"d"_{ij}^{(X)} = ||x_i-x_j||[/math] and [math]\verb"d"_{ij}^{(Y)} = ||y_i-y_j||[/math]. The distance matrix [math]D^{(X)}[/math] can be converted to inner products [math]X^TX[/math]. [math]X^T X = -\frac{1}{2} H D^{(X)H}[/math] where [math]H=I-\frac{1}{t}ee^T[/math] and [math]e[/math] is a column vector of all [math]1[/math]. Now ([20]) can be reduced to

[math]\begin{align} \min_Y \sum_{i=1}^t \sum_{i=1}^t (x_i^T x_j-y_i^Ty_i)^2 \label{(21)}\end{align}[/math]

It can be shown that the solution is [math]Y=\Lambda^{1/2} V^T[/math] where [math]V[/math] is the eigenvectors of [math]X^TX[/math] corresponding to the top [math]d[/math] eigenvalues, and [math]\Lambda[/math] is the top [math]d[/math] eigenvalues of [math]X^TX[/math]. Clearly the solution for MDS is identical to dual PCA (see Figure [alg 2]), and as far as Euclidean distance is concerned, MDS and PCA produce the same results. However, the distances need not be based on Euclidean distances and can represent many types of dissimilarities between objects.

## Isomap

Another recent approach to nonlinear dimensionality reduction is the Isomap algorithm. Isomap is a nonlinear generalization of classical MDS. The main contribution is to compute the MDS, not in the input space, but in the geodesic space of the manifold. The geodesic distances represent the shortest paths along the curved surface of the manifold measured as if the surface were flat. This can be approximated by a sequence of short steps between neighboring sample points. Isomap then applies MDS to the geodesic distances to find a low-dimensional mapping with similar pairwise distances.

Like LLE, the Isomap algorithm proceeds through three steps:

- Find the neighbors of each data point in high-dimensional data space.
- Compute the geodesic pairwise distances between all points.
- Embed the data via MDS so as to preserve these distances.

Again like LLE, the first step can be performed by identifying the [math]k[/math] nearest neighbors, or by choosing all points within some fixed radius, [math]\epsilon[/math]. These neighborhood relations are represented by a graph [math]G[/math] in which each data point is connected to its nearest neighbors, with edges of weight [math]d_X(i,j)[/math] between neighbors.

The geodesic distances [math]d_M(i,j)[/math] between all pairs of points on the manifold [math]M[/math] are then estimated in the second step. Isomap approximates [math]d_M(i,j)[/math] as the shortest path distance [math]d_G(i,j)[/math] in the graph [math]G[/math]. This can be done in different ways including Dijkstra’s algorithm and Floyd’s algorithm .

These algorithms finds matrix of graph distances [math]D^{(\mathcal{G})} [/math] contains the shortest path distance between all pairs of points in [math]G[/math].

In its final step, Isomap applies classical MDS to [math]D^{(\mathcal{G})}[/math] to generate an embedding of the data in a [math]d[/math]-dimensional Euclidean space [math]Y[/math].

The global minimum of the cost function is obtained by setting the coordinates of [math]y_i[/math] to the top [math]d[/math] eigenvectors of the inner-product matrix [math]B[/math] obtained from [math]D^{(\mathcal{G})}[/math]

## Semidefinite Embedding (SDE)

[SDE]

In 2004 Weinberger and Saul introduced Semidefinite Embedding (SDE) , which learns a kernel matrix instead of choosing a kernel function a priori. They formulated the problem of learning the kernel matrix as an instance of semidefinite programming. Since the kernel matrix [math]K[/math] represents inner products of vectors in a Hilbert space it must be positive semidefinite. Also the kernel should be centered, [math]\sum_{ij} K_{ij} = 0[/math]. Lastly, SDE imposes constraints on the kernel matrix to ensure that the distances and angles between points and their neighbors are preserved under the neighborhood graph [math]\eta[/math]. That is, if both [math]x_i[/math] and [math]x_j[/math] are neighbors ([math]\eta_{ij}=1[/math]) or are common neighbors of another input ([math][\eta^T \eta]_{ij} \gt 0[/math]), then: [math]||\Phi(x_i)-\Phi(x_j)||^2 = ||x_i - x_j||^2.[/math] In terms of the kernel matrix, this can be written as: [math]K_{ij}-2K_{ij}+K_{jj} = ||x_i - x_j||^2.[/math] By adding an objective function to maximize [math]\textrm{Tr}(K)[/math] which represents the variance of the data points in the learned feature space, SDE constructs a semidefinite program for learning the kernel matrix [math]K[/math]. The last detail of SDE is the construction of the neighborhood graph [math]\eta_{ij}[/math]. This graph is constructed by connecting the [math]k[/math] nearest neighbors using a similarity function over the data, [math]||x_i - x_j||[/math]. The algorithm is summarized in Table [tab:sde].

[h] [tab:sde]

<references />