stat946f10: Difference between revisions
No edit summary |
|||
Line 175: | Line 175: | ||
Define <math>g(A)=g(A_{11},A_{22},...,A_{nn})=\sum_{(x_i , x_j) \in S} \|x_i - x_j\|^2_A-log(\sum_{(x_i , x_j) \in D} \|x_i - x_j\|_A) </math><br /> | Define <math>g(A)=g(A_{11},A_{22},...,A_{nn})=\sum_{(x_i , x_j) \in S} \|x_i - x_j\|^2_A-log(\sum_{(x_i , x_j) \in D} \|x_i - x_j\|_A) </math><br /> | ||
We can use Newton-Raphson algorithm to optimize | We can use Newton-Raphson algorithm to optimize <math>g(A)</math>. | ||
===2. PSD formulation === | ===2. PSD formulation === |
Revision as of 22:24, 10 June 2009
June 2nd Maximum Variance Unfolding (Semidefinite Embedding)
Maximum Variance Unfolding (MVU) is a variation of Kernel PCA in which the kernel matrix is also obtained from the data. The main proposal of this technique is not to choose a kernel function a priori like classical kernel PCA or construct a kernel matrix by algorithm like LLE and ISOMAP, but instead learn a kernel [math]\displaystyle{ K }[/math] optimizing an objective function with several constraints when the data set is given.
First, we give the constraints for the kernel.
Constraints
1. Semipositive definiteness
Kernel PCA is a kind of spectral decomposition in Hilbert space. The kernel matrix stores the inner products of vectors in a Hilbert space, hence it must be positive semidefinite. The semipositive definiteness means all eigenvalues are non-negative, i.e. [math]\displaystyle{ K\gt =0 }[/math].
2. Centering
Considering the centering process in Kernel PCA, it is also required here. The condition is given by
[math]\displaystyle{ \sum_i \Phi\left(x_i\right) =0 . }[/math]
Equivalently,
[math]\displaystyle{ 0 = \left|\sum_i \Phi(x_i)\right|^2 = \sum_{ij}\Phi(x_i)\Phi(x_j)=\sum_{ij}K_{ij}. }[/math]
3. Isometry
The local distance between a pairwise of data [math]\displaystyle{ x_i, x_j }[/math], under neighbourhood relation [math]\displaystyle{ \eta }[/math] (i.e. [math]\displaystyle{ \eta_{ij}=1 }[/math] indicates data [math]\displaystyle{ x_i, x_j }[/math] are neighbours), should be preserved in new space after mapping [math]\displaystyle{ \Phi(\cdot) }[/math]. In other words, for all [math]\displaystyle{ \eta_{ij}\gt 0 }[/math],
[math]\displaystyle{ \left|\Phi(x_i) - \Phi(x_j)\right|^2 = \left|x_i - x_j\right|^2. }[/math]
Additonally, for the consider of conformal map, the pairwise distance between two points having a common neighbour point should also be preserved. Two data points having a common neighbour can be identified as [math]\displaystyle{ [\eta^T\eta]_{ij}\gt 0. }[/math] This ensures that if two points have a common neighbour, we preserve their pairwise distances and angles.
[math]\displaystyle{ \left|\Phi(x_i) - \Phi(x_j)\right|^2 = \left(\Phi(x_i) - \Phi(x_j)\right)^{T}\left(\Phi(x_i) - \Phi(x_j)\right) }[/math]
[math]\displaystyle{ \left|\Phi(x_i) - \Phi(x_j)\right|^2 = \Phi(x_i)^{T}\Phi(x_i) - \Phi(x_j)^{T}\Phi(x_j) - 2 \Phi(x_i)^{T}\Phi(x_j) }[/math]
Thus, [math]\displaystyle{ K_{ii}+K_{jj}-2K_{ij}=\left|x_i - x_j\right|^2 }[/math] for all ij [math]\displaystyle{ \eta_{ij}\gt 0 }[/math] or [math]\displaystyle{ [\eta^T\eta]_{ij}\gt 0. }[/math]
Objective Functions
Given the conditions, the objective functions should be considered. The aim of dimensional reduction is to map high dimension data into a low dimension space with the minimum information losing cost. Recall the fact that the dimension of new space depends on the rank of the kernel. Hence, the best ideal kernel is the one which has minimum rank. So the ideal objective function should be
[math]\displaystyle{ \min\quad rank(K). }[/math]
However, minimizing the rank of a matrix is a hard problem. So we look at the question in another way. When doing dimensional reduction, we try to maximize the distance between non-neighbour data. In other words, we want to maximize the variance between non-neighbour datas, and it is the same as maximizing the sum of the eigenvalues. In such sense, we can change the objective function to
[math]\displaystyle{ \max \quad Trace(K) }[/math] .
Note that it is an interesting question that whether these two objective functions can be equivalent to each other. Although they are not totally equivallent, it can be shown that they usually converge to each other.
Algorithm for Optimization Problem
The objective function with linear constraints form a typical semidefinite programming problem. The optimization is convex and globally. We already have methods to slove such kind of optimization problem.
Colored Maximum Variance Unfolding .<ref>Song, L. and colleagues; Proceedings of the 2007 Conference, 1385-1392.</ref>
MVU is based on maximizing the overall variance while the local distances between neighbor points are preserved and it uses only one source of information. Colored MVU uses more than one source of information, i.e it reducing the dimension satisfying a combination of to goals
1- preserving the local distance (as first information)
2- optimum alignment with second information (side information)
Examples of how Colored MVU can leverage the side information
- Given text data from a newsgroup as first information, a hierarchy of topics can be used as side information to guide the embedding.
- Given term-frequency and inverse-document-frequency representation of academic papers as first information, co-author relationship can be used as side information to guide the embedding.
Rationale of separating the side information from the data
- We cannot merge all kind of information in one distance metric because the data(first information) and the side information may be heterogeneous
- The side information may be a feature of similarity(papers with the same co-authors tend to be more similar) rather than difference(papers with different co-authors are not necessarily far apart).
- When inserting new information, usually only new data but not new side information is added.
Algorithmic Modification
In Colored MVU, [math]\displaystyle{ Trace(KL) }[/math] is maximized instead of [math]\displaystyle{ K }[/math], where [math]\displaystyle{ L }[/math] is the matrix of covariance of first and side information.
Application
One of the drawback of MVU is that its statistical interpretation is not always clear. However one of the application of Colored MVU, which has great statistical interpretation is to be used as a criterion to for measuring the Hilbert-Schmidt Independence.
Steps for SDE algorithm
- Generate a K nearest neighbor graph. It should be a connected graph and so if K is too small it would be an unbounded problem, having no solution.
- Semidefinite programming: Maximize [math]\displaystyle{ Tr(K) }[/math] subject to the above mentioned constraints.
- Do kernel PCA with this learned kernel.
Advantages
- The kernel that is learned from the data can actually reflect the intrinsic dimensionality of the data. More specifically, the eigen-spectrum of the kernel matrix K provides an estimation.
The dimension needed to preserve local distance while maximizing variance is dependent on the number of dominant eigenvalues of K. That is, if top r eigenvalues of K account for 90% of the trace then an r dimensional representation can reveal about 90% of the unfolded data's variance. - MVU is a convex problem which guarantees a unique solution.
- Distance-preserving constraints can be easily expressed and enforced in the semi-definite programming framework. This flexibility allows tailor-made constraints to be imposed on particular applications, for example analyzing robot motions(ARE).
Disadvantages
- SDE can be solved efficiently in polynomial time but still has a high computational complexity. (O(matrix_size ^ 3 + number_of_constraints ^ 3))
- SDE is limited to a isometric map
Application in SVM classification
The optimized kernel replaces the popular kernels using in SVM (i.e. linear kernel) for classification. It actually performs worse than other kernel functions chose in priori.
June 4th
Action Respecting Embedding (ARE)
It is a variation of Maximum Variance Unfolding.
The data here is temporal or ordered, i.e we move from one point to another by taking an action. In other words action [math]\displaystyle{ a_i }[/math] is taken between data points [math]\displaystyle{ x_i }[/math] and [math]\displaystyle{ x_{i+1} }[/math].
Action labels,even with no interpretation or implied meaning,provide more information about the underlying generation of the data.It is natural to expect that the actions correspond to some simple operator on the generator's own degrees of freedom.For example,a camera that is being panned left and then right,has actions that correspond to a simple translation in the camera's actuator space.We therefore want to constrain the learned representation so that the labeled actions correspond to simple transformations in that space.In particular,we can require all actions to be a simple rotation plus translation in the resulting low-dimensional representation.<ref>
M.Bowling, A.Ghodsi, and D.Wilkinson. Action respecting embedding. In International Conferenceon Machine Learning,2005.
</ref>
Consider action [math]\displaystyle{ a }[/math] taken between the points [math]\displaystyle{ x_i , x_{i+1} }[/math], and the points [math]\displaystyle{ x_j , x_{j+1} }[/math], in the original data space, it may not be a simple transformation (Rotation, Translation or combination of both).
A transformation [math]\displaystyle{ T }[/math] is called simple or distance preserving if and only if
[math]\displaystyle{ \forall x, x' }[/math] [math]\displaystyle{ \left \Vert T(x)-T(x') \right \|=\left \Vert x - x' \right \| }[/math]
Notice that [math]\displaystyle{ T_a(x_i)=x_{i+1} }[/math] and [math]\displaystyle{ T_a(x_j)=x_{j+1} }[/math]
In the low dimension space, as in the camera case where actions corresponds to a simple translation in the camera's actuator space, the action can become a simple transformation. Therefore constraining the action to be a simple transformation in dimension reduction would help us to find a low dimension representation close to the true one, if the action indeed corresponds to a simple transformation in the intrinsic dimension space.
The goal here is not only to reduce the dimensionality of the data but also reducing the complexity of actions in the sense that actions in this low dimension representation is a simple transformation. Therefore to obtain a low dimensional embedding of the high dimensional temporal data, the action in low dimension must be represented by a constraint that preserves the distance. This constraint is called action respecting constraint.
Constraint
For any two data points [math]\displaystyle{ x_i }[/math],[math]\displaystyle{ x_j }[/math] if the same action a [math]\displaystyle{ \left(a_{i}=a_{j}\right) }[/math] is carried out, transforming them into [math]\displaystyle{ x_{i+1} }[/math] and [math]\displaystyle{ x_{j+1} }[/math] respectively, then the distance between [math]\displaystyle{ y_i }[/math] and [math]\displaystyle{ y_j }[/math] must be equal to the distance between [math]\displaystyle{ y_{i+1} }[/math] and [math]\displaystyle{ y_{j+1} }[/math] where [math]\displaystyle{ y_i }[/math] , [math]\displaystyle{ y_j }[/math] , [math]\displaystyle{ y_{i+1} }[/math] , [math]\displaystyle{ y_{j+1} }[/math] are the corresponding points in the low dimension. This constraint is given as:
[math]\displaystyle{ \left|y_i - y_j\right|^2=\left|y_{i+1} - y_{j+1}\right|^2 \rightarrow \left|\Phi(x_i) - \Phi(x_j)\right|^2=\left|\Phi(x_{i+1}) - \Phi(x_{j+1})\right|^2 }[/math]
The kernel form of the above constarint is:
[math]\displaystyle{ \forall i, j a_{i}=a_{j} \Rightarrow K_{ii}+K_{jj}-2K_{ij}=K_{(i+1)(i+1)}+K_{(j+1)(j+1)}-2K_{(i+1)(j+1)} }[/math]
The above, action respecting constraint is added to the constraints of MVU and the algorithm of MVU is run to obtain a low dimension embedding for the temporal data.
Example
This example is extracted from the "Action Respecting Embedding" paper listed in the references.
Consider a virtual robot that observe a 100 by 100 patch of a 2048 by 1536 image. The actions of the robot consists of four translations(rightward/leftward/upward/downward). In this example, we consider two action sequences and compare their representations by SDE and ARE.
It is obvious that the first sequence of actions lie in a one-dimensional subspace and the second sequence lies in a two-dimensional subspace. Although both SDE and ARE succeed in capturing this low dimensionality, the embedding achieved by ARE is much smoother and corresponds much better(almost exactly) to the actual actions.
June 9th Metric Learning
Metric Learning is a supervised algorithm used for dimensionality reduction, in which some kind of extra sources of information (side information)are used besides the first source of variation. In more detail, two types of class-related information are brought in consideration.
Given a set of points [math]\displaystyle{ \{x_i, i=1, \cdots, m\} }[/math], we define two different sets, similar and dissimilar.
Similar Set
a set of pairs of similar points, denoted by [math]\displaystyle{ S }[/math]
[math]\displaystyle{ S : (x_i, x_j) \in S }[/math] if [math]\displaystyle{ x_i }[/math] and [math]\displaystyle{ x_j }[/math] are similar;
Dissimilar Set
a set of pairs of dissimilar points, denoted by [math]\displaystyle{ D }[/math]
[math]\displaystyle{ D : (x_i, x_j) \in D }[/math] if [math]\displaystyle{ x_i }[/math] and [math]\displaystyle{ x_j }[/math] are dissimilar.
Note that a particular pair of points may not be known to be similar or dissimilar, in which case it will not be placed in either set.
We want to learn a distance metric
[math]\displaystyle{ d_A(x_i, x_j) = \|x_i - x_j\|_A = \sqrt{(x_i-x_j)^T A(x_i-x_j)} }[/math],
which determined by semi-definite matrix [math]\displaystyle{ A }[/math].
Equivalently, we want to know [math]\displaystyle{ A }[/math] from the given data.
Such idea comes from firstly in 2004. After that, several different approaches are given to find the metric.
1. Original Optimization Problem
It is given by Eric P. Xing, Andrew Y. Ng, Michael I. Jordan and Stuart Russell in 2004 .<ref>Eric P. Xing, Andrew Y. Ng, Michael I. Jordan and Stuart Russell, Distance metric learning, with application to clustering with side-information, </ref> .
The authors give the optimization problem in following form:
[math]\displaystyle{ \min_A \sum_{(x_i , x_j) \in S} \|x_i - x_j\|^2_A }[/math]
[math]\displaystyle{ s.t. \sum_{(x_i , x_j) \in D} \|x_i - x_j\|_A \ge 1 ,(*) }[/math]
[math]\displaystyle{ A \ge 0 . }[/math]
The constraint is given to keep the distance between dissimilar points. If the constraint is ignored, [math]\displaystyle{ A = 0 }[/math] will be an obvious solution, which means all points collapse to a single point. The choice of constant [math]\displaystyle{ 1 }[/math] is not important, and can be changed to any other positive number. In the paper, it is also shown that it is a convex optimization problem. Hence, we can solve it use some efficient and direct algoritms without considering getting stuck at local minimas. In this paper, the author also notes that there are some possible alternatives to (*). [math]\displaystyle{ \sum_{(x_i, x_j) \in D }\|x_i - x_j\|_A \ge 1 }[/math] would not be a good choice despite it maintain a linear constraint. It would result in A always being rank 1 (i.e., the data are always projected onto a line).
Futher discussion about A
Obviously, letting A = I gives Euclidean distance. Now, let us suppose we want to learn a diagonal, that is [math]\displaystyle{ A=diag(A_{11},A_{22},...,A_{nn}) }[/math]
Define [math]\displaystyle{ g(A)=g(A_{11},A_{22},...,A_{nn})=\sum_{(x_i , x_j) \in S} \|x_i - x_j\|^2_A-log(\sum_{(x_i , x_j) \in D} \|x_i - x_j\|_A) }[/math]
We can use Newton-Raphson algorithm to optimize [math]\displaystyle{ g(A) }[/math].
2. PSD formulation
Another approach is given in <ref> Ali Ghodsi, Dana Wilkinson, Finnegan Southey, Improving Embeddings by Flexible Exploitation of Side Information</ref>, in which the loss function is given by
[math]\displaystyle{ L(A) = \sum_{(x_i, x_j) \in S } \|x_i - x_j\|^2_A - \sum_{(x_i, x_j)\in D} \|x_i - x_j\|_A^2. }[/math]
Motivation for using this loss function is that, it is minimized equivalently if its first component (sum of the differences between points in similarity class) is minimized while, its second component (sum of the differences between points in dissimilarity class) is maximized.
The optimization problem is
[math]\displaystyle{ \min_A L(A); s.t. A \ge 0, Tr(A) = 1 (1) }[/math].
The Positive semi-definiteness ([math]\displaystyle{ A \ge 0 }[/math]) constrain guarantees a valid Euclidean metric and the trace constraints is to prevent the solution [math]\displaystyle{ A =0 }[/math].
In order to be able to use standard semidefinite programing software [math]\displaystyle{ L(A) }[/math] must be linearized. To do so function [math]\displaystyle{ vec() }[/math] (which rearranges a matrix by concatenating its columns) which gives quite useful results like,
[math]\displaystyle{ vec(ABC)=(C^{T}*A)vec(B) }[/math].
in which [math]\displaystyle{ * }[/math] is the Kroneker product.
since [math]\displaystyle{ (x_{i}-x_{j})^{T}A(x_{i}-x_{j}) }[/math] is a scalar we can write
[math]\displaystyle{ (x_{i}-x_{j})^{T}A(x_{i}-x_{j})=vec((x_{i}-x_{j})^{T}A(x_{i}-x_{j})) }[/math]
also from Kroneker product for any two (same size) vectors [math]\displaystyle{ a , b }[/math] we have
[math]\displaystyle{ (a^{T}*b^{T})=vec(ba^{T})^{T} }[/math]
using the two results above it is easy to drive the following conclusion.
[math]\displaystyle{ L(A) = \sum_{(x_i, x_j) \in S } (x_i - x_j)^T A (x_i - x_j) - \sum_{(x_i, x_j)\in D} (x_i - x_j)^T A (x_i - x_j) }[/math]
[math]\displaystyle{ = \sum_{(x_i, x_j) \in S } vec(A)^T vec((x_i - x_j)(x_i - x_j)^T) - \sum_{(x_i, x_j)\in D} vec(A)^T vec((x_i - x_j)(x_i - x_j)^T) }[/math]
[math]\displaystyle{ = vec(A)^T \left[ \sum_{(x_i, x_j) \in S } vec((x_i - x_j)(x_i - x_j)^T) - \sum_{(x_i, x_j)\in D} vec((x_i - x_j)(x_i - x_j)^T) \right] }[/math]
This form along with the two linear constraints given in (1), makes a semidefinite positive problem that can be easily solved by a SDP solver, called SeDumi in Matlab. Therefore, it is a more convenient form than that used by Xing et all. Furthermore, in the original form, at least one dissimilar pair is required, while it is not necessary in the form given by Ali Ghodsi et al., because of the trace constraint. There can be only similar pairs, only dissimilar pairs, or any combination of the two, and the method will still avoid the trivial solution. Furthermore, in the absence of specific information regarding dissimilarities, Xing et al. assume that all points not explicitly identified as similar are dissimilar. This information may be misleading, forcing the algorithm to separate points that should be in fact be similar. The formulation presented by Ali Ghodsi et al. allows one to specify only the side information one actually has, partitioning the pairing into similar, dissimilar, and unknown.
References
<references/>