# stat946f10

## Contents

- 1 June 2nd: **Maximum Variance Unfolding** (**Semidefinite Embedding**)
- 1.1 Constraints
- 1.2 Objective Functions
- 1.3 Algorithm for Optimization Problem
- 1.4 Colored Maximum Variance Unfolding
- 1.5 Steps for SDE algorithm
- 1.6 Advantages
- 1.7 Disadvantages
- 1.8 Application in SVM classification

- 2 June 4th
- 3 June 9th
- 4 June 11th
- 5 References

## June 2nd: **Maximum Variance Unfolding** (**Semidefinite Embedding**)

Maximum Variance Unfolding (MVU) is a variation of Kernel PCA in which the kernel matrix itself is learned from the data. The main idea is not to choose a kernel function a priori, as in classical kernel PCA, nor to construct the kernel matrix by a fixed algorithm, as in LLE and ISOMAP, but instead to learn a kernel [math]K[/math] from the given data set by optimizing an objective function subject to several constraints.

First, we give the constraints for the kernel.

### Constraints

**1. Positive semidefiniteness**

Kernel PCA is a kind of spectral decomposition in Hilbert space. The kernel matrix stores the inner products of vectors in a Hilbert space, hence it must be positive semidefinite. Positive semidefiniteness means that all eigenvalues are non-negative, i.e. [math]K \succeq 0[/math].

**2. Centering**

As in Kernel PCA, the mapped data must be centered. The condition is given by

[math]\sum_i \Phi\left(x_i\right) =0 .[/math]

Equivalently,

[math] 0 = \left|\sum_i \Phi(x_i)\right|^2 = \sum_{ij}\Phi(x_i)^T\Phi(x_j)=\sum_{ij}K_{ij}. [/math]

**3. Isometry**

The local distance between a pair of data points [math]x_i, x_j[/math] under the neighbourhood relation [math]\eta[/math] (i.e. [math]\eta_{ij}=1 [/math] indicates that [math]x_i, x_j[/math] are neighbours) should be preserved in the new space after the mapping [math]\Phi(\cdot)[/math]. In other words, for all [math]\eta_{ij}\gt 0 [/math],

[math]\left|\Phi(x_i) - \Phi(x_j)\right|^2 = \left|x_i - x_j\right|^2. [/math]

Additionally, so that the map is locally conformal, the pairwise distance between two points that have a common neighbour should also be preserved. Two data points having a common neighbour can be identified by [math] [\eta^T\eta]_{ij}\gt 0. [/math] This ensures that if two points have a common neighbour, both their pairwise distances and the angles between them are preserved.

Expanding the squared distance in terms of inner products,

[math] \left|\Phi(x_i) - \Phi(x_j)\right|^2 = \left(\Phi(x_i) - \Phi(x_j)\right)^{T}\left(\Phi(x_i) - \Phi(x_j)\right) [/math]

[math] \left|\Phi(x_i) - \Phi(x_j)\right|^2 = \Phi(x_i)^{T}\Phi(x_i) + \Phi(x_j)^{T}\Phi(x_j) - 2 \Phi(x_i)^{T}\Phi(x_j)[/math]

Thus, [math] K_{ii}+K_{jj}-2K_{ij}=\left|x_i - x_j\right|^2[/math] for all [math]i, j[/math] with [math] \eta_{ij}\gt 0 [/math] or [math][\eta^T\eta]_{ij}\gt 0.[/math]
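The three constraints can be checked numerically. The following is a minimal sketch, assuming a k-nearest-neighbour rule for [math]\eta[/math]; the function name `mvu_constraint_pairs` and the choice of `k` are illustrative and not part of the lecture. It builds [math]\eta[/math], lists the pairs whose distances must be preserved, and confirms that the plain Gram matrix of the centered inputs is one feasible kernel.

```python
import numpy as np

def mvu_constraint_pairs(X, k):
    """Return eta and the isometry constraints (i, j, squared distance)."""
    n = X.shape[0]
    # Squared Euclidean distances between all pairs of inputs.
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    eta = np.zeros((n, n), dtype=int)
    for i in range(n):
        order = np.argsort(D2[i])
        nbrs = [j for j in order if j != i][:k]  # k nearest, excluding i itself
        eta[i, nbrs] = 1
    # eta_ij > 0: direct neighbours; [eta^T eta]_ij > 0: common neighbour.
    mask = (eta > 0) | (eta.T > 0) | (eta.T @ eta > 0)
    pairs = [(i, j, D2[i, j])
             for i in range(n) for j in range(i + 1, n) if mask[i, j]]
    return eta, pairs

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
eta, pairs = mvu_constraint_pairs(X, k=3)

# The Gram matrix of the centered inputs satisfies all three constraints.
Xc = X - X.mean(axis=0)
K = Xc @ Xc.T
centered = abs(K.sum()) < 1e-9
psd = np.linalg.eigvalsh(K).min() > -1e-9
isometric = all(abs(K[i, i] + K[j, j] - 2 * K[i, j] - d2) < 1e-9
                for i, j, d2 in pairs)
```

MVU then searches among all kernels that satisfy the same constraints, rather than settling for this trivial one.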

### Objective Functions

Given these constraints, we now consider the objective function. The aim of dimensionality reduction is to map high-dimensional data into a low-dimensional space with minimal loss of information. Recall that the dimension of the new space depends on the rank of the kernel. Hence, the ideal kernel is the one with minimum rank, and the ideal objective function would be

[math] \min\quad rank(K). [/math]

However, minimizing the rank of a matrix is a hard problem, so we look at the question in another way. When doing dimensionality reduction, we try to maximize the distance between non-neighbouring data points. In other words, we want to maximize the variance between non-neighbouring data points, which is the same as maximizing the sum of the eigenvalues of [math]K[/math]. In this sense, we can change the objective function to

[math] \max \quad Trace(K) [/math] .

*Note that whether these two objective functions are equivalent is an interesting question.*
Although they are not exactly equivalent, maximizing the trace usually leads to a low-rank kernel as well.
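The link between the trace and the spread of the points can be made concrete. For a centered kernel, summing [math]K_{ii}+K_{jj}-2K_{ij}[/math] over all pairs gives [math]2n\,Trace(K)[/math], so the trace is proportional to the total squared pairwise distance. The sketch below checks this identity numerically; the helper name is an assumption made for illustration.

```python
import numpy as np

def trace_as_mean_pairwise_distance(K):
    """(1 / 2n) * sum_ij |Phi(x_i) - Phi(x_j)|^2, computed from K alone."""
    n = K.shape[0]
    d = np.diag(K)
    D2 = d[:, None] + d[None, :] - 2.0 * K  # squared feature-space distances
    return D2.sum() / (2.0 * n)

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 4))
Xc = X - X.mean(axis=0)          # centering makes sum_ij K_ij = 0
K = Xc @ Xc.T

lhs = trace_as_mean_pairwise_distance(K)
rhs = np.trace(K)
eig_sum = np.linalg.eigvalsh(K).sum()
```

Because neighbouring distances are fixed by the isometry constraints, any increase in the trace has to come from pulling non-neighbouring points apart.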

### Algorithm for Optimization Problem

The linear objective function together with the linear constraints and the positive semidefiniteness constraint forms a typical semidefinite programming (SDP) problem. The optimization is convex, so any optimum found is global. Standard solvers already exist for this kind of optimization problem.

### Colored Maximum Variance Unfolding<ref>Song, L. and colleagues; Proceedings of the 2007 Conference, 1385-1392.</ref>

MVU maximizes the overall variance while preserving the local distances between neighbouring points, and it uses only one source of information. Colored MVU uses more than one source of information, i.e. it reduces the dimension while satisfying a combination of two goals:

1- preserving the local distances (first information)

2- optimal alignment with the second information (side information)

##### Examples of how Colored MVU can leverage the side information

- Given text data from a newsgroup as first information, a hierarchy of topics can be used as side information to guide the embedding.
- Given term-frequency and inverse-document-frequency representation of academic papers as first information, co-author relationship can be used as side information to guide the embedding.

##### Rationale of separating the side information from the data

- We cannot merge all kinds of information into one distance metric because the data (first information) and the side information may be heterogeneous.
- The side information may be a feature of similarity (papers with the same co-authors tend to be more similar) rather than of difference (papers with different co-authors are not necessarily far apart).
- When new information is inserted, usually only new data, but not new side information, is added.

#### Algorithmic Modification

In Colored MVU, [math]Trace(KL)[/math] is maximized instead of [math]Trace(K)[/math], where [math]L[/math] is the matrix of covariance of the first and side information.

#### Application

One of the drawbacks of MVU is that its statistical interpretation is not always clear. However, Colored MVU does have a clear statistical interpretation in one of its applications: it can be used as a criterion for measuring Hilbert-Schmidt independence.

### Steps for SDE algorithm

- Generate a k-nearest-neighbour graph. The graph must be connected; if k is too small the graph may be disconnected, in which case the problem is unbounded and has no solution.
- Semidefinite programming: maximize [math]Tr(K)[/math] subject to the constraints given above.
- Do kernel PCA with the learned kernel.
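The last step can be sketched as follows. The semidefinite program itself needs an SDP solver and is not shown; in this illustration a linear Gram matrix stands in for the learned kernel, and the function name `embed_from_kernel` is an assumption rather than anything from the lecture.

```python
import numpy as np

def embed_from_kernel(K, d):
    """Kernel PCA step: coordinates from the top-d eigenpairs of a centered PSD kernel."""
    K = (K + K.T) / 2.0                      # remove numerical asymmetry
    vals, vecs = np.linalg.eigh(K)           # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:d]         # indices of the d largest
    lam = np.clip(vals[idx], 0.0, None)      # guard against tiny negatives
    return vecs[:, idx] * np.sqrt(lam)       # row i is the embedding of x_i

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
Xc = X - X.mean(axis=0)
K = Xc @ Xc.T                                # stand-in for the learned kernel
Y = embed_from_kernel(K, d=2)
```

With a kernel returned by the SDP, the same function would produce the unfolded low-dimensional coordinates.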

### Advantages

- The kernel that is learned from the data can actually reflect the intrinsic dimensionality of the data. More specifically, the eigen-spectrum of the kernel matrix [math]K[/math] provides an estimate. The dimension needed to preserve local distances while maximizing variance depends on the number of dominant eigenvalues of [math]K[/math]. That is, if the top r eigenvalues of [math]K[/math] account for 90% of the trace, then an r-dimensional representation can reveal about 90% of the unfolded data's variance.
- MVU is a convex problem, so the optimum found is guaranteed to be global.
- Distance-preserving constraints can be easily expressed and enforced in the semidefinite programming framework. This flexibility allows tailor-made constraints to be imposed for particular applications, for example analyzing robot motions (ARE).
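The eigen-spectrum rule in the first point can be sketched in a few lines. The function name and the small tolerance are illustrative assumptions; the 90% threshold is the one used in the text.

```python
import numpy as np

def dims_for_variance(K, fraction=0.9):
    """Smallest r such that the top r eigenvalues reach `fraction` of the trace."""
    vals = np.clip(np.linalg.eigvalsh(K), 0.0, None)[::-1]  # descending order
    cum = np.cumsum(vals) / vals.sum()
    r = int(np.searchsorted(cum, fraction - 1e-12)) + 1
    return min(r, len(vals))

# Toy kernel with eigenvalues 7, 2.5, 0.3 and 0.2 (trace 10).
K = np.diag([7.0, 2.5, 0.3, 0.2])
r90 = dims_for_variance(K, 0.9)
```

Here the top two eigenvalues carry 95% of the trace, so two dimensions are enough to pass the 90% threshold.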

### Disadvantages

- SDE can be solved in polynomial time but still has a high computational complexity: O(matrix_size^3 + number_of_constraints^3).
- SDE is limited to an isometric map.

### Application in SVM classification

The learned kernel can replace the popular kernels used in SVM (e.g. the linear kernel) for classification. In practice, however, it performs worse than kernel functions chosen a priori.

## June 4th

### Action Respecting Embedding (ARE)

It is a variation of Maximum Variance Unfolding.

The data here is temporal or ordered, i.e. we move from one point to the next by taking an action. In other words, action [math] a_i [/math] is taken between data points [math] x_i [/math] and [math] x_{i+1} [/math].

Action labels, even with no interpretation or implied meaning, provide more information about the underlying generation of the data. It is natural to expect that the actions correspond to some simple operator on the generator's own degrees of freedom. For example, a camera that is being panned left and then right has actions that correspond to a simple translation in the camera's actuator space. We therefore want to constrain the learned representation so that the labeled actions correspond to simple transformations in that space. In particular, we can require all actions to be a simple rotation plus translation in the resulting low-dimensional representation.<ref>M. Bowling, A. Ghodsi, and D. Wilkinson. Action respecting embedding. In International Conference on Machine Learning, 2005.</ref>

Consider an action [math] a [/math] taken between the points [math] x_i , x_{i+1} [/math] and also between the points [math] x_j , x_{j+1} [/math]. In the original data space, this action may not be a simple transformation (rotation, translation, or a combination of both).

A transformation [math] T [/math] is called simple or distance preserving if and only if

[math] \forall x, x' [/math] [math]\left \Vert T(x)-T(x') \right \|=\left \Vert x - x' \right \|[/math]

Notice that [math] T_a(x_i)=x_{i+1}[/math] and [math] T_a(x_j)=x_{j+1}[/math]

In the low-dimensional space, as in the camera case where actions correspond to a simple translation in the camera's actuator space, the action can become a simple transformation. Therefore, constraining the action to be a simple transformation during dimensionality reduction helps us find a low-dimensional representation close to the true one, provided the action indeed corresponds to a simple transformation in the intrinsic space.

The goal here is not only to reduce the dimensionality of the data but also to reduce the complexity of the actions, in the sense that each action in the low-dimensional representation is a simple transformation. Therefore, to obtain a low-dimensional embedding of the high-dimensional temporal data, each action in the low-dimensional space must be subject to a constraint that preserves distances. This constraint is called the action respecting constraint.

### Constraint

For any two data points [math] x_i[/math],[math] x_j [/math] if the same action a [math]\left(a_{i}=a_{j}\right)[/math] is carried out, transforming them into [math] x_{i+1} [/math] and [math] x_{j+1}[/math] respectively, then the distance between [math] y_i [/math] and [math] y_j [/math] must be equal to the distance between [math] y_{i+1}[/math] and [math]y_{j+1}[/math] where [math] y_i [/math] , [math] y_j [/math] , [math]y_{i+1} [/math] , [math] y_{j+1} [/math] are the corresponding points in the low dimension. This constraint is given as:

[math]\left|y_i - y_j\right|^2=\left|y_{i+1} - y_{j+1}\right|^2 \rightarrow \left|\Phi(x_i) - \Phi(x_j)\right|^2=\left|\Phi(x_{i+1}) - \Phi(x_{j+1})\right|^2[/math]

The kernel form of the above constraint is:

[math] \forall i, j: \quad a_{i}=a_{j} \Rightarrow K_{ii}+K_{jj}-2K_{ij}=K_{(i+1)(i+1)}+K_{(j+1)(j+1)}-2K_{(i+1)(j+1)} [/math]

The action respecting constraint above is added to the constraints of MVU, and the MVU algorithm is then run to obtain a low-dimensional embedding of the temporal data.
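To see what the kernel form of the constraint demands, the sketch below builds a short toy trajectory in which every action is a rotation plus translation, then evaluates the left-hand side minus the right-hand side for every pair of time steps sharing an action. The function names and the toy actions are hypothetical and chosen only for illustration.

```python
import numpy as np

def action_pairs(actions):
    """Index pairs (i, j), i < j, for which the same action is taken."""
    T = len(actions)
    return [(i, j) for i in range(T) for j in range(i + 1, T)
            if actions[i] == actions[j]]

def are_residuals(K, actions):
    """Left side minus right side of the action respecting constraint."""
    res = []
    for i, j in action_pairs(actions):
        before = K[i, i] + K[j, j] - 2 * K[i, j]
        after = K[i + 1, i + 1] + K[j + 1, j + 1] - 2 * K[i + 1, j + 1]
        res.append(before - after)
    return np.array(res)

theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
actions = [0, 1, 0, 0, 1, 1, 0]   # a_i is taken between points i and i+1

def step(y, a, rigid=True):
    if a == 1:
        return y + np.array([0.0, 0.5])            # pure translation
    # Action 0: rotation plus translation, or a scaling when rigid=False.
    return R @ y + np.array([1.0, 0.0]) if rigid else 2.0 * y

def trajectory(rigid):
    Y = [np.array([0.2, -0.1])]
    for a in actions:
        Y.append(step(Y[-1], a, rigid))
    return np.array(Y)

Y_rigid, Y_scaled = trajectory(True), trajectory(False)
res_rigid = are_residuals(Y_rigid @ Y_rigid.T, actions)
res_scaled = are_residuals(Y_scaled @ Y_scaled.T, actions)
```

The residuals vanish for the distance-preserving actions and are clearly non-zero when action 0 is replaced by a scaling, which is exactly the kind of embedding the constraint rules out.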

### Example

This example is extracted from the "Action Respecting Embedding" paper listed in the references.

Consider a virtual robot that observes a 100 by 100 patch of a 2048 by 1536 image. The actions of the robot consist of four translations (rightward/leftward/upward/downward). In this example, we consider two action sequences and compare their representations by SDE and ARE.

The first sequence of actions lies in a one-dimensional subspace and the second sequence lies in a two-dimensional subspace. Although both SDE and ARE succeed in capturing this low dimensionality, the embedding achieved by ARE is much smoother and corresponds much better (almost exactly) to the actual actions.

## June 9th

### Applications of ARE

- Planning: find a sequence of actions that achieves a desired goal, i.e. find a path that leads to the goal given the initial point and the set of all possible actions.

Given a set of points with transitions [math]y_t \rightarrow^a y_{t+1}[/math], we want to predict [math]y_{t+1}[/math] given [math]y_t[/math]. This can be formulated as a regression problem, so for each action [math]a[/math] we find a function such that

[math] f_a(y_t)=A_ay_t+b_a [/math] subject to [math] A_a^TA_a=I [/math]

We build a tree starting from the initial point by considering all possible actions, and then find the shortest path that reaches the goal.

- Robot localization: this is usually accomplished using probabilistic motion and sensor models. With ARE, we can do robot localization in the low-dimensional map rather than in the original space. This has the advantage that it becomes independent of the environmental constraints.
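The constrained regression [math]f_a(y_t)=A_ay_t+b_a[/math] with [math]A_a^TA_a=I[/math] used for planning is an orthogonal Procrustes problem, which has a closed-form least-squares solution through the SVD. The lecture does not say how the fit is computed, so the following is a sketch under that assumption; `fit_rigid_action` is a hypothetical name.

```python
import numpy as np

def fit_rigid_action(Y_before, Y_after):
    """Least-squares fit of y_after ~ A y_before + b with A^T A = I."""
    mu_b = Y_before.mean(axis=0)
    mu_a = Y_after.mean(axis=0)
    # Cross-covariance between centered targets and centered inputs.
    M = (Y_after - mu_a).T @ (Y_before - mu_b)
    U, _, Vt = np.linalg.svd(M)
    A = U @ Vt                    # closest orthogonal matrix to M
    b = mu_a - A @ mu_b
    return A, b

# Synthetic transitions generated by a known orthogonal map plus a shift.
rng = np.random.default_rng(1)
A_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
b_true = rng.normal(size=3)
Y_before = rng.normal(size=(20, 3))
Y_after = Y_before @ A_true.T + b_true

A_hat, b_hat = fit_rigid_action(Y_before, Y_after)
```

Once one pair [math](A_a, b_a)[/math] has been fitted per action, the search tree can be expanded by applying each fitted map to the current point.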

### Metric Learning

Metric learning is a supervised approach to dimensionality reduction in which some extra source of information (side information) is used besides the first source of information. In more detail, two types of class-related information are taken into consideration.

Given a set of points [math]\{x_i, i=1, \cdots, m\}[/math], we define two different sets, similar and dissimilar.

**Similar Set**

a set of pairs of similar points, denoted by [math]S[/math]

[math]S : (x_i, x_j) \in S [/math] if [math]x_i[/math] and [math]x_j[/math] are similar;

**Dissimilar Set**

a set of pairs of dissimilar points, denoted by [math]D[/math]

[math]D : (x_i, x_j) \in D [/math] if [math]x_i[/math] and [math]x_j[/math] are dissimilar.

Note that a particular pair of points may not be known to be similar or dissimilar, in which case it will not be placed in either set.

We want to learn a distance metric

[math]d_A(x_i, x_j) = \|x_i - x_j\|_A = \sqrt{(x_i-x_j)^T A(x_i-x_j)}[/math],

where [math]\|x_i - x_j\|_A[/math] is not the Euclidean distance but the Mahalanobis distance determined by the positive semidefinite matrix [math]A[/math].
Equivalently, we want to learn [math]A[/math] from the given data.

We can write [math] A= WW^T[/math], where [math]W[/math] is the transformation that maps data from the high-dimensional space to a low-dimensional space (via [math]x \mapsto W^Tx[/math]). The Euclidean distance between points in the low-dimensional space is then the Mahalanobis distance in the high-dimensional space.
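The equivalence between the Mahalanobis distance under [math]A=WW^T[/math] and the Euclidean distance after mapping with [math]W^T[/math] can be checked directly. This is a small sketch with a randomly chosen [math]W[/math]; the dimensions are arbitrary assumptions.

```python
import numpy as np

def mahalanobis(xi, xj, A):
    """d_A(x_i, x_j) = sqrt((x_i - x_j)^T A (x_i - x_j)) for PSD A."""
    diff = xi - xj
    return float(np.sqrt(max(diff @ A @ diff, 0.0)))

rng = np.random.default_rng(2)
W = rng.normal(size=(5, 2))       # maps 5-dimensional inputs to 2 dimensions
A = W @ W.T                       # PSD by construction, rank at most 2
xi, xj = rng.normal(size=5), rng.normal(size=5)

d_mahal = mahalanobis(xi, xj, A)
d_low = float(np.linalg.norm(W.T @ xi - W.T @ xj))
d_identity = mahalanobis(xi, xj, np.eye(5))
```

Learning [math]A[/math] is therefore the same as learning a linear map followed by the ordinary Euclidean distance.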

This idea was first introduced in 2002. Since then, several different approaches have been proposed to find the metric.

### 1. Original Optimization Problem

It was proposed by Eric P. Xing, Andrew Y. Ng, Michael I. Jordan and Stuart Russell in 2002.<ref>Eric P. Xing, Andrew Y. Ng, Michael I. Jordan and Stuart Russell, Distance metric learning, with application to clustering with side-information, </ref>
The authors give the optimization problem in the following form:

[math]\min_A \sum_{(x_i , x_j) \in S} \|x_i - x_j\|^2_A[/math]

[math]s.t. \sum_{(x_i , x_j) \in D} \|x_i - x_j\|_A \ge 1 , \quad (*)[/math]

[math] A \succeq 0 .[/math]

The constraint (*) keeps dissimilar points apart. If it is dropped, [math]A = 0[/math] is an obvious solution, which means all points collapse to a single point. The choice of the constant [math]1[/math] is not important; it can be changed to any other positive number. In the paper, it is also shown that this is a convex optimization problem. Hence, we can solve it with efficient algorithms without worrying about getting stuck in local minima. The authors also note that there are possible alternatives to (*). For example, [math]\sum_{(x_i, x_j) \in D }\|x_i - x_j\|_A^2 \ge 1[/math] would not be a good choice even though it is a linear constraint in [math]A[/math]: it would result in [math]A[/math] always being rank 1 (i.e., the data are always projected onto a line).

#### Further discussion about A

Obviously, letting [math]A = I[/math] gives the Euclidean distance. Now, let us suppose we want to learn a diagonal matrix, that is [math]A=diag(A_{11},A_{22},...,A_{nn})[/math].

Define [math]g(A)=g(A_{11},A_{22},...,A_{nn})=\sum_{(x_i , x_j) \in S} \|x_i - x_j\|^2_A-\log\left(\sum_{(x_i , x_j) \in D} \|x_i - x_j\|_A\right) [/math]

We can use the Newton-Raphson algorithm to minimize [math]g(A)[/math].

### 2. PSD formulation

Another approach is given in <ref> Ali Ghodsi, Dana Wilkinson, Finnegan Southey, Improving Embeddings by Flexible Exploitation of Side Information</ref>, in which the loss function is given by

[math]L(A) = \sum_{(x_i, x_j) \in S } \|x_i - x_j\|^2_A - \sum_{(x_i, x_j)\in D} \|x_i - x_j\|_A^2.[/math]

The motivation for this loss function is that it decreases both when its first component (the sum of squared distances between points in the similar set) decreases and when its second component (the sum of squared distances between points in the dissimilar set) increases.

The optimization problem is

[math] \min_A L(A) \quad s.t. \quad A \succeq 0, \; Tr(A) = 1 \quad (1)[/math].

The positive semidefiniteness constraint ([math] A \succeq 0 [/math]) guarantees a valid Euclidean metric, and the trace constraint prevents the trivial solution [math]A =0[/math].
In order to use standard semidefinite programming software, [math] L(A) [/math] must be written explicitly as a linear function of the entries of [math]A[/math]. To do so, we use the function [math] vec() [/math] (which rearranges a matrix into a vector by concatenating its columns), which satisfies useful identities such as

[math] vec(ABC)=(C^{T}\otimes A)vec(B) [/math],

in which [math]\otimes[/math] is the Kronecker product.

Since [math] (x_{i}-x_{j})^{T}A(x_{i}-x_{j}) [/math] is a scalar, we can write

[math] (x_{i}-x_{j})^{T}A(x_{i}-x_{j})=vec((x_{i}-x_{j})^{T}A(x_{i}-x_{j})) [/math]

Also, for any two vectors [math] a , b [/math] of the same size, the Kronecker product satisfies

[math] (a^{T}\otimes b^{T})=vec(ba^{T})^{T} [/math]

Using the two results above, it is easy to derive the following.

[math] L(A) = \sum_{(x_i, x_j) \in S } (x_i - x_j)^T A (x_i - x_j) - \sum_{(x_i, x_j)\in D} (x_i - x_j)^T A (x_i - x_j)[/math]

[math] = \sum_{(x_i, x_j) \in S } vec(A)^T vec((x_i - x_j)(x_i - x_j)^T) - \sum_{(x_i, x_j)\in D} vec(A)^T vec((x_i - x_j)(x_i - x_j)^T) [/math]

[math] = vec(A)^T \left[ \sum_{(x_i, x_j) \in S } vec((x_i - x_j)(x_i - x_j)^T) - \sum_{(x_i, x_j)\in D} vec((x_i - x_j)(x_i - x_j)^T) \right] [/math]

This form, along with the two constraints given in (1), gives a semidefinite program that can be easily solved by an SDP solver such as SeDuMi in Matlab. Therefore, it is a more convenient form than that used by Xing et al. Furthermore, in the original form at least one dissimilar pair is required, while this is not necessary in the form given by Ali Ghodsi et al. because of the trace constraint. There can be only similar pairs, only dissimilar pairs, or any combination of the two, and the method will still avoid the trivial solution. Moreover, in the absence of specific information regarding dissimilarities, Xing et al. assume that all points not explicitly identified as similar are dissimilar. This information may be misleading, forcing the algorithm to separate points that should in fact be similar. The formulation presented by Ali Ghodsi et al. allows one to specify only the side information one actually has, partitioning the pairs into similar, dissimilar, and unknown.
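The vec identity and the linearized form of [math]L(A)[/math] can be verified numerically. The sketch below uses column-stacking for `vec` and random data; the pair lists `S` and `D` are made-up examples, and no SDP is solved here.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one vector."""
    return M.reshape(-1, order="F")

def loss_direct(A, X, S, D):
    q = lambda i, j: (X[i] - X[j]) @ A @ (X[i] - X[j])
    return sum(q(i, j) for i, j in S) - sum(q(i, j) for i, j in D)

def loss_linear(A, X, S, D):
    B = lambda i, j: np.outer(X[i] - X[j], X[i] - X[j])
    # Coefficient vector: sum over S minus sum over D of vec(B_ij).
    c = sum(vec(B(i, j)) for i, j in S) - sum(vec(B(i, j)) for i, j in D)
    return float(vec(A) @ c)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
S = [(0, 1), (2, 3)]              # hypothetical similar pairs
D = [(0, 4), (1, 5), (2, 5)]      # hypothetical dissimilar pairs
G = rng.normal(size=(3, 3))
A = G @ G.T / np.trace(G @ G.T)   # PSD with unit trace, as in (1)

# vec(PQR) = (R^T kron P) vec(Q) for conformable matrices.
P, Qm, Rm = (rng.normal(size=s) for s in [(2, 3), (3, 4), (4, 5)])
kron_ok = np.allclose(vec(P @ Qm @ Rm), np.kron(Rm.T, P) @ vec(Qm))
```

Since the coefficient vector depends only on the data, it can be computed once and handed to the solver as the linear objective.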

## June 11th

### Closed form Metric learning (CFML)

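Since [math]A = WW^T[/math], we have [math] (x_i-x_j)^TA(x_i-x_j)=Tr((x_i-x_j)^T WW^T(x_i-x_j))=Tr(W^T(x_i-x_j)(x_i-x_j)^T W)[/math].

Writing [math]M_S = \sum_{(x_i,x_j)\in S}(x_i-x_j)(x_i-x_j)^T[/math] and [math]M_D = \sum_{(x_i,x_j)\in D}(x_i-x_j)(x_i-x_j)^T[/math], the cost function to be minimized is:

[math] \min_W \frac{1}{|S|} \operatorname{trace}(W^TM_SW)- \frac{1}{|D|} \operatorname{trace}(W^TM_DW)[/math]

s.t. [math]\operatorname{trace}(A)=1[/math], or equivalently [math]\operatorname{trace}(WW^T)=1[/math]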

Solving this as a Lagrange multiplier problem (with the normalizing factors absorbed into [math]M_S[/math] and [math]M_D[/math]), we get

[math]\mathbf {(M_S-M_D)} \mathbf W = \lambda \mathbf W[/math]

Every column of [math]\mathbf W[/math] is therefore an eigenvector of [math]\mathbf{M_S-M_D}[/math] with the same eigenvalue [math]\lambda[/math]. Since they share one eigenvalue, the columns are (generically) all multiples of a single eigenvector, so [math]\mathbf W[/math], and therefore [math]A[/math], has rank 1. As a result, all the data points are projected onto a line.

This projection onto a line is due to the constraint imposed on the cost function, so to avoid it we need to change the constraint. There are two alternative constraints that can be imposed on the cost function:

- The constraint imposed is: [math]\mathbf W^T\mathbf W= \mathbf I_m[/math].

So, the objective function is:

[math] \min_{\mathbf W}\operatorname {trace}(\mathbf W^T(\mathbf M_S-\mathbf M_D)\mathbf W)[/math] s.t [math] \mathbf W^T\mathbf W=\mathbf I_m[/math].

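The columns of [math]\mathbf W[/math] are eigenvectors of [math](\mathbf M_S-\mathbf M_D)[/math]; to minimize the trace, we take those corresponding to the [math]m[/math] smallest eigenvalues.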

- The constraint is: [math]\mathbf W^T\mathbf M_S\mathbf W= \mathbf I_m[/math].

So, the objective function is:

[math] \min_{\mathbf W}\operatorname {trace}(\mathbf W^T(\mathbf M_S-\mathbf M_D)\mathbf W)[/math]

s.t. [math] \mathbf W^T\mathbf M_S\mathbf W=\mathbf I_m[/math]

This alternative algorithm is called CFML-II.

To solve this new form of optimization problem, let [math]\mathbf M_S=\mathbf {HH^T}[/math] (assuming [math]\mathbf M_S[/math] is invertible). Substituting this into the constraint and letting [math]\mathbf {H^TW}=\mathbf Q[/math], the cost function becomes:

[math] \min\operatorname {trace}(\mathbf Q^T(\mathbf I-\mathbf {H^{-1}M_DH^{-T}})\mathbf Q)[/math] s.t. [math] \mathbf Q^T\mathbf Q=\mathbf I[/math]

The columns of [math]\mathbf Q[/math] are eigenvectors of [math](\mathbf I-\mathbf {H^{-1}M_DH^{-T}})[/math].

That is, [math](\mathbf I-\mathbf {H^{-1}M_DH^{-T}})\mathbf Q=\lambda\mathbf Q[/math]

i.e. [math]\mathbf {H^{-T}H^{-1}}\mathbf M_D\mathbf W=(1-\lambda)\mathbf W[/math]

i.e. [math]\mathbf {M_S^{-1}}\mathbf M_D\mathbf W=(1-\lambda)\mathbf W[/math]
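Both closed-form solutions reduce to symmetric eigenvalue problems. The sketch below is one possible implementation under stated assumptions: the scatter matrices are averaged over their pairs, [math]\mathbf M_S[/math] is assumed positive definite so that a Cholesky factor [math]\mathbf H[/math] exists, and the names `scatter`, `cfml` and `cfml2` are illustrative.

```python
import numpy as np

def scatter(X, pairs):
    """Average of (x_i - x_j)(x_i - x_j)^T over the given pairs."""
    M = np.zeros((X.shape[1], X.shape[1]))
    for i, j in pairs:
        diff = X[i] - X[j]
        M += np.outer(diff, diff)
    return M / max(len(pairs), 1)

def cfml(M_S, M_D, m):
    """Constraint W^T W = I: eigenvectors for the m smallest eigenvalues."""
    _, vecs = np.linalg.eigh(M_S - M_D)      # ascending eigenvalues
    return vecs[:, :m]

def cfml2(M_S, M_D, m):
    """Constraint W^T M_S W = I, solved through M_S = H H^T and Q = H^T W."""
    H = np.linalg.cholesky(M_S)              # lower triangular, M_S = H H^T
    H_inv = np.linalg.inv(H)
    C = np.eye(M_S.shape[0]) - H_inv @ M_D @ H_inv.T
    _, Q = np.linalg.eigh((C + C.T) / 2.0)   # symmetrize against round-off
    return H_inv.T @ Q[:, :m]                # W = H^{-T} Q

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
S = [(i, i + 1) for i in range(20)]          # hypothetical similar pairs
D = [(i, i + 20) for i in range(20)]         # hypothetical dissimilar pairs
M_S, M_D = scatter(X, S), scatter(X, D)

W1 = cfml(M_S, M_D, m=2)
W2 = cfml2(M_S, M_D, m=2)
mu = np.diag(W2.T @ M_D @ W2)                # eigenvalues of M_S^{-1} M_D
```

The columns of `W2` are eigenvectors of [math]\mathbf{M_S^{-1}M_D}[/math], which is the form that connects CFML-II to discriminant analysis.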

This optimization problem is related to a classical technique called Fisher discriminant analysis (FDA).

### Comparison

So far, we have discussed a number of algorithms for metric learning: Xing et al., MCML, CFML, CFML-II, and FDA. Compared to the others, the method of Xing et al. does not give good results; CFML and MCML are competitive with each other; CFML has a closed form and runs fast; and FDA has a restriction on the rank: given k classes, the rank is at most k-1.

### Using partial distance side information <ref> Ali Ghodsi, Dana Wilkinson, Finnegan Southey, Improving Embeddings by Flexible Exploitation of Side Information</ref>

In this case, only partial distances are known, i.e. we are given exact distances between some pairs of points.

Suppose a set of similarities is given, [math]S : (x_i, x_j) \in S [/math] if the target distance [math]d_{ij}[/math] is known. Then the cost function that preserves these distances is:

[math]\min_{\mathbf A} \sum_S\left(\|x_i-x_j\|_A^2-d_{ij}\right)^2[/math] s.t. [math] \mathbf A \succeq 0[/math]

Using [math]vec()[/math], the loss can be written as:

[math]L(A)=\sum_S\left(vec(A)^T vec(B_{ij})-d_{ij}\right)^2[/math]

[math]L(A)=\sum_S\left(vec(A)^T vec(B_{ij})vec(B_{ij})^Tvec(A)-2d_{ij}vec(A)^Tvec(B_{ij})+d_{ij}^2\right)[/math]

where [math]B_{ij}=(x_i-x_j)(x_i-x_j)^T[/math], and as [math]d_{ij}^2 [/math] is independent of [math]A[/math], it can be dropped.

Therefore, the loss function is:

[math]L(A)=vec(A)^T[Qvec(A)-2R][/math] where [math]Q=\sum_S vec(B_{ij})vec(B_{ij})^T[/math] and [math]R=\sum_S d_{ij}vec(B_{ij})[/math]

Since the above loss function is quadratic in [math]vec(A)[/math], semidefinite programming cannot be applied directly. It can be converted into a linear objective with a linear matrix inequality using the Schur complement.

**Schur Complement**

For [math]\mathbf X \succ 0[/math], [math]\begin{bmatrix} \mathbf X & \mathbf Y \\ \mathbf {Y^T} & \mathbf Z\end{bmatrix}\succeq 0 [/math] if and only if [math]\mathbf {Z-Y^TX^{-1}Y}\succeq 0[/math]

By decomposing [math]\mathbf {Q = S^TS}[/math], a matrix of the form

[math]J =\begin{bmatrix} I & Svec(A)\\(Svec(A))^T &2vec(A)^TR + t\end{bmatrix}[/math] is constructed. By the Schur complement, [math]J \succeq 0[/math] if and only if

[math]2vec(A)^TR + t - vec(A)^T S^T Svec(A) \ge 0[/math]

Hence the scalar [math]t[/math] is an upper bound on the loss:

[math]vec(A)^T S^T Svec(A) - 2vec(A)^TR = vec(A)^TQvec(A) - 2vec(A)^TR \le t[/math]

Therefore, minimizing [math]t[/math] subject to [math]J \succeq 0 [/math] also minimizes the objective. This optimization problem can be readily solved by standard semidefinite programming software:

[math]\min_{A, t} t [/math] s.t. [math]\mathbf A \succeq 0 [/math] and [math]\mathbf J \succeq 0[/math]
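The quadratic form and the Schur complement bound can be checked numerically without running an SDP solver. In this sketch [math]R[/math] is taken as [math]\sum_S d_{ij}vec(B_{ij})[/math], the factor [math]S[/math] with [math]Q=S^TS[/math] is obtained from an eigendecomposition, and the pairs, target values and function names are made-up illustrations.

```python
import numpy as np

def vec(M):
    return M.reshape(-1, order="F")

def partial_distance_terms(X, pairs, dists):
    """Q = sum vec(B)vec(B)^T and R = sum d_ij vec(B) over the known pairs."""
    p = X.shape[1] ** 2
    Q, R = np.zeros((p, p)), np.zeros(p)
    for (i, j), d in zip(pairs, dists):
        b = vec(np.outer(X[i] - X[j], X[i] - X[j]))
        Q += np.outer(b, b)
        R += d * b
    return Q, R

def loss_direct(A, X, pairs, dists):
    return sum(((X[i] - X[j]) @ A @ (X[i] - X[j]) - d) ** 2
               for (i, j), d in zip(pairs, dists))

def schur_matrix(A, Q, R, t):
    """J = [[I, S vec(A)], [(S vec(A))^T, 2 vec(A)^T R + t]] with Q = S^T S."""
    vals, V = np.linalg.eigh(Q)
    S_fac = (V * np.sqrt(np.clip(vals, 0.0, None))).T
    u = S_fac @ vec(A)
    c = 2.0 * vec(A) @ R + t
    return np.block([[np.eye(len(u)), u[:, None]],
                     [u[None, :], np.array([[c]])]])

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 2))
pairs = [(0, 1), (2, 5), (3, 7), (4, 6)]
dists = [1.0, 0.5, 2.0, 1.5]                 # hypothetical target values d_ij
G = rng.normal(size=(2, 2))
A = G @ G.T

Q, R = partial_distance_terms(X, pairs, dists)
a = vec(A)
quad = a @ Q @ a - 2.0 * a @ R               # loss without the constant term
min_eig_above = np.linalg.eigvalsh(schur_matrix(A, Q, R, quad + 1.0)).min()
min_eig_below = np.linalg.eigvalsh(schur_matrix(A, Q, R, quad - 1.0)).min()
```

[math]J[/math] is positive semidefinite exactly when [math]t[/math] is at least the quadratic loss, which is why minimizing [math]t[/math] under [math]J \succeq 0[/math] recovers the original objective.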

## References

<references/>