stat946f10
June 2nd
Maximum Variance Unfolding (Semidefinite Embedding)
Maximum Variance Unfolding (MVU), also known as Semidefinite Embedding, is a variation of kernel PCA. Rather than choosing a kernel function a priori, this technique learns a suitable kernel matrix [math]\displaystyle{ K }[/math] from the given data set, subject to several constraints.
First, we give the constraints for the kernel.
Constraints
1. Positive semidefiniteness
Kernel PCA is a kind of spectral decomposition in a Hilbert space. Positive semidefiniteness lets the kernel matrix be interpreted as storing the inner products of vectors in a Hilbert space. It also means all eigenvalues of the kernel matrix are non-negative, i.e. [math]\displaystyle{ K \succeq 0 }[/math].
2. Centering
As in kernel PCA, the mapped data must be centred. The condition is given by
[math]\displaystyle{ \sum_i \Phi\left(x_i\right) =0 . }[/math]
Equivalently,
[math]\displaystyle{ 0 = \left|\sum_i \Phi(x_i)\right|^2 = \sum_{ij}\Phi(x_i)^T\Phi(x_j)=\sum_{ij}K_{ij}. }[/math]
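As a quick numerical illustration (not part of the original notes), one can check with a linear kernel that centring the data makes the entries of [math]\displaystyle{ K }[/math] sum to zero; the toy data below is made up for the check.

```python
import numpy as np

# Toy data: rows are points x_i. With the linear feature map Phi(x) = x,
# centring the rows enforces sum_i Phi(x_i) = 0.
X = np.random.randn(10, 3)
X_centred = X - X.mean(axis=0)

# Linear kernel on the centred data: K_ij = <Phi(x_i), Phi(x_j)>.
K = X_centred @ X_centred.T

# The centring constraint sum_ij K_ij = 0 holds up to floating-point error.
print(np.isclose(K.sum(), 0.0))  # True
```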
3. Isometry
The local distance between a pair of data points [math]\displaystyle{ x_i, x_j }[/math] that are neighbours under the neighbourhood relation [math]\displaystyle{ \eta }[/math] should be preserved in the new space after the mapping [math]\displaystyle{ \Phi(\cdot) }[/math]. In other words, whenever [math]\displaystyle{ \eta_{ij}\gt 0 }[/math],
[math]\displaystyle{ \left|\Phi(x_i) - \Phi(x_j)\right|^2 = \left|x_i - x_j\right|^2. }[/math]
Additionally, so that the map also preserves angles (a conformal consideration), the pairwise distance between two points that share a common neighbour should also be preserved. Two data points having a common neighbour can be identified by [math]\displaystyle{ [\eta^T\eta]_{ij}\gt 0. }[/math] This ensures that if two points have a common neighbour, we preserve both their pairwise distance and the angles.
Thus, [math]\displaystyle{ K_{ii}+K_{jj}-2K_{ij}=\left|x_i - x_j\right|^2 }[/math] for all [math]\displaystyle{ i,j }[/math] with [math]\displaystyle{ \eta_{ij}\gt 0 }[/math] or [math]\displaystyle{ [\eta^T\eta]_{ij}\gt 0. }[/math]
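The sketch below (a rough illustration; it assumes a k-nearest-neighbour rule defines [math]\displaystyle{ \eta }[/math], and the function name and toy data are made up) shows how the set of constrained pairs, neighbours plus pairs sharing a common neighbour, could be computed.

```python
import numpy as np

def neighbourhood_relation(X, k):
    """eta[i, j] = 1 if x_j is one of the k nearest neighbours of x_i
    (one common way to define the relation; the notes do not fix a rule)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]      # indices of the k nearest neighbours
    eta = np.zeros_like(d, dtype=int)
    eta[np.repeat(np.arange(len(X)), k), nn.ravel()] = 1
    return eta

X = np.random.randn(20, 3)
eta = neighbourhood_relation(X, k=4)

# Distances must be preserved for neighbours (eta_ij > 0) and for pairs with
# a common neighbour ([eta^T eta]_ij > 0).
constrained_pairs = (eta + eta.T + eta.T @ eta) > 0
```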
Objective Functions
Given these constraints, we now choose an objective function. The aim of dimensionality reduction is to map high-dimensional data into a low-dimensional space with minimal loss of information. Recall that the dimension of the new space depends on the rank of the kernel. Hence, the ideal kernel is the one of minimum rank, so the ideal objective function would be
[math]\displaystyle{ \min\quad rank(K). }[/math]
However, minimizing the rank of a matrix is a hard problem, so we look at the question another way. When reducing the dimension, we try to maximize the distances between non-neighbouring points; in other words, we want to maximize the variance of the unfolded data. In this sense, we can change the objective function to
[math]\displaystyle{ \max \quad Trace(K) }[/math] .
It is an interesting question whether these two objective functions are equivalent. Although they are not entirely equivalent, it can be shown that they usually converge to each other.
Algorithm for Optimization Problem
The objective function together with the linear constraints forms a typical semidefinite programming (SDP) problem. The optimization is convex, so a global optimum can be found, and standard solvers exist for this kind of problem.
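As a concrete sketch of how the SDP could be set up with an off-the-shelf solver, the snippet below uses cvxpy; the toy data, the k-nearest-neighbour rule, and the variable names are illustrative assumptions, not part of the original notes.

```python
import numpy as np
import cvxpy as cp

# Toy high-dimensional inputs (made up for illustration).
rng = np.random.default_rng(0)
X = rng.standard_normal((15, 3))
n, k = len(X), 4

# k-nearest-neighbour relation eta (one possible choice of neighbourhood rule).
d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
np.fill_diagonal(d2, np.inf)
eta = np.zeros((n, n))
eta[np.repeat(np.arange(n), k), np.argsort(d2, axis=1)[:, :k].ravel()] = 1
np.fill_diagonal(d2, 0.0)
mask = (eta + eta.T + eta.T @ eta) > 0
pairs = np.argwhere(np.triu(mask, 1))      # each constrained pair once, i < j

# SDP: maximise Tr(K) subject to K being PSD, centred, and locally isometric.
K = cp.Variable((n, n), PSD=True)
constraints = [cp.sum(K) == 0]
constraints += [K[i, i] + K[j, j] - 2 * K[i, j] == d2[i, j] for i, j in pairs]
problem = cp.Problem(cp.Maximize(cp.trace(K)), constraints)
problem.solve()                            # needs an SDP-capable solver, e.g. SCS
K_learned = K.value
```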
Colored Maximum Variance Unfolding
MVU maximizes the overall variance while preserving the local distances between neighbouring points, and it uses only one source of information. Colored MVU uses more than one source of information, i.e. it reduces the dimension while satisfying a combination of two goals:
1- preserving the local distances (the first source of information)
2- optimally aligning with a second source of information (the side information)
Examples of how Colored MVU can leverage the side information
- Given text data from a newsgroup as first information, a hierarchy of topics can be used as side information to guide the embedding.
- Given term-frequency and inverse-document-frequency representation of academic papers as first information, co-author relationship can be used as side information to guide the embedding.
Rationale for separating the side information from the data
- We cannot merge all kinds of information into one distance metric, because the data (first information) and the side information may be heterogeneous.
- The side information may express similarity (papers with the same co-authors tend to be more similar) rather than difference (papers with different co-authors are not necessarily far apart).
- When new information arrives, usually only new data, not new side information, is added.
Algorithmic Modification
In Colored MVU, [math]\displaystyle{ Trace(KL) }[/math] is maximized instead of [math]\displaystyle{ Trace(K) }[/math], where [math]\displaystyle{ L }[/math] is the covariance (kernel) matrix of the side information.
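A minimal sketch of this modification (assuming a cvxpy kernel variable as in the earlier SDP sketch, and a made-up "same label" side-information kernel): only the objective changes; all MVU constraints stay as before.

```python
import numpy as np
import cvxpy as cp

# Hypothetical side information: a class label per point. The side-information
# kernel L is a simple "same label" similarity, centred with H = I - (1/n) 11^T.
labels = np.array([0, 0, 1, 1, 2, 2])
n = len(labels)
L = (labels[:, None] == labels[None, :]).astype(float)
H = np.eye(n) - np.ones((n, n)) / n
L = H @ L @ H

# Colored MVU: maximise Tr(KL) instead of Tr(K); the PSD, centring and
# distance-preserving constraints are unchanged (omitted here).
K = cp.Variable((n, n), PSD=True)
objective = cp.Maximize(cp.trace(K @ L))
```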
Application
One of the drawbacks of MVU is that its statistical interpretation is not always clear. Colored MVU, however, has a clear statistical interpretation: its objective serves as a criterion for measuring dependence via the Hilbert-Schmidt Independence Criterion (HSIC).
Steps for SDE algorithm
- Generate a k-nearest-neighbour graph. The graph should be connected; if k is too small the problem becomes unbounded and has no solution.
- Semidefinite programming: maximize [math]\displaystyle{ Tr(K) }[/math] subject to [math]\displaystyle{ K \succeq 0 }[/math], [math]\displaystyle{ \sum_{ij}K_{ij}=0 }[/math], and the distance-preserving constraints mentioned above.
- Do kernel PCA with this learned kernel.
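A minimal sketch of this last step (assuming K_learned is the kernel returned by the SDP; the function name is made up): the embedding is formed from the top eigenvectors of the learned kernel, scaled by the square roots of their eigenvalues.

```python
import numpy as np

def kernel_pca_embedding(K, dim):
    """Embed the points using the top `dim` eigenvectors of the learned kernel.
    K is assumed symmetric and already centred by the SDP constraints."""
    eigvals, eigvecs = np.linalg.eigh(K)             # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:dim]            # indices of the largest ones
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0.0))

# e.g. a 2-D embedding of the unfolded data:
# Y = kernel_pca_embedding(K_learned, dim=2)
```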
Advantages
- The kernel that is learned from the data can actually reflect the intrinsic dimensionality of the data. More specifically, the eigen-spectrum of the kernel matrix K provides an estimate: the dimension needed to preserve local distances while maximizing variance depends on the number of dominant eigenvalues of K. That is, if the top r eigenvalues of K account for 90% of the trace, then an r-dimensional representation reveals about 90% of the unfolded data's variance (see the sketch after this list).
- MVU is a convex problem, can be solved efficiently in polynomial time, and guarantees a unique solution.
- Distance-preserving constraints can be easily expressed and enforced in the semidefinite programming framework. This flexibility allows tailor-made constraints to be imposed in particular applications, for example analyzing robot motions (ARE).
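As referenced in the list above, a small sketch (helper name made up) of how the eigen-spectrum of the learned kernel could be used to estimate the intrinsic dimensionality:

```python
import numpy as np

def spectrum_fraction(K, r):
    """Fraction of Tr(K) captured by the top r eigenvalues of the learned kernel.
    A value of about 0.9 suggests an r-dimensional embedding preserves roughly
    90% of the unfolded data's variance."""
    eigvals = np.sort(np.linalg.eigvalsh(K))[::-1]
    return eigvals[:r].sum() / eigvals.sum()
```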
Disadvantages
- SDE has a high computational complexity, O(n^3 + c^3), where n is the size of the kernel matrix and c is the number of constraints.
June 4th
Action Respecting Embedding (ARE)
It is a variation of Maximum Variance Unfolding.
The data here is temporal or ordered, i.e. we move from one point to the next by taking an action. In other words, action [math]\displaystyle{ a_i }[/math] is taken between data points [math]\displaystyle{ x_i }[/math] and [math]\displaystyle{ x_{i+1} }[/math]. The goal here is not only to reduce the dimensionality of the data but also to reduce the complexity of the actions.
If two points undergo the same action, then the distance between those points must be preserved both before and after the action. Distance-preserving transformations are rotations, translations, or any combination of them; hence, to obtain a low-dimensional embedding of the high-dimensional temporal data, each action in the low-dimensional space must be represented by a constraint that preserves distances. This constraint is called the action-respecting constraint.
For any two data points [math]\displaystyle{ x_i }[/math], [math]\displaystyle{ x_j }[/math], if the same action a is carried out, transforming them into [math]\displaystyle{ x_{i+1} }[/math] and [math]\displaystyle{ x_{j+1} }[/math] respectively, then the distance between [math]\displaystyle{ y_i }[/math] and [math]\displaystyle{ y_j }[/math] must equal the distance between [math]\displaystyle{ y_{i+1} }[/math] and [math]\displaystyle{ y_{j+1} }[/math], where [math]\displaystyle{ y_i }[/math], [math]\displaystyle{ y_j }[/math], [math]\displaystyle{ y_{i+1} }[/math], [math]\displaystyle{ y_{j+1} }[/math] are the corresponding points in the low-dimensional space. This constraint is given as:
[math]\displaystyle{ \left|\Phi(x_i) - \Phi(x_j)\right|^2=\left|\Phi(x_{i+1}) - \Phi(x_{j+1})\right|^2 }[/math]
The kernel form of the above constraint is:
[math]\displaystyle{ K_{ii}+K_{jj}-2K_{ij}=K_{(i+1)(i+1)}+K_{(j+1)(j+1)}-2K_{(i+1)(j+1)} }[/math]
This action-respecting constraint is added to the constraints of MVU, and the MVU algorithm is run to obtain a low-dimensional embedding of the temporal data.
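A rough sketch (assuming a cvxpy kernel variable as in the earlier MVU sketch, and a made-up action sequence) of how the action-respecting constraints could be generated and appended to the MVU constraints:

```python
import numpy as np
import cvxpy as cp

# Hypothetical ordered data: actions[i] is the action taken between x_i and x_{i+1}.
actions = np.array([0, 1, 0, 1, 0])
n = len(actions) + 1
K = cp.Variable((n, n), PSD=True)

# For every pair (i, j) that undergoes the same action, the pairwise distance
# must be the same before and after the action.
are_constraints = []
for i in range(len(actions)):
    for j in range(i + 1, len(actions)):
        if actions[i] == actions[j]:
            before = K[i, i] + K[j, j] - 2 * K[i, j]
            after = K[i + 1, i + 1] + K[j + 1, j + 1] - 2 * K[i + 1, j + 1]
            are_constraints.append(before == after)

# These constraints are added to the usual MVU constraints before solving the SDP.
```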