
June 2nd Maximum Variance Unfolding (Semidefinite Embedding)

Maximum Variance Unfolding (MVU) is a variation of kernel PCA in which the kernel matrix is itself learned from the data. The main idea of this technique is not to choose a kernel function a priori, as in classical kernel PCA, nor to construct a kernel matrix by a fixed algorithm, as in LLE and ISOMAP, but instead to learn a kernel matrix [math]\displaystyle{ K }[/math] by optimizing an objective function subject to several constraints once the data set is given.

First, we give the constraints for the kernel.

Constraints

1. Positive semidefiniteness
Kernel PCA is a spectral decomposition in a Hilbert space. The kernel matrix stores the inner products of vectors in that Hilbert space, hence it must be positive semidefinite. Positive semidefiniteness means all eigenvalues are non-negative, i.e. [math]\displaystyle{ K \succeq 0 }[/math].

2. Centering
As in kernel PCA, the mapped data must be centered. The condition is given by
[math]\displaystyle{ \sum_i \Phi\left(x_i\right) =0 . }[/math]
Equivalently,
[math]\displaystyle{ 0 = \left|\sum_i \Phi(x_i)\right|^2 = \sum_{ij}\Phi(x_i)^T\Phi(x_j)=\sum_{ij}K_{ij}. }[/math]

3. Isometry
The local distance between a pair of data points [math]\displaystyle{ x_i, x_j }[/math] under the neighbourhood relation [math]\displaystyle{ \eta }[/math] (i.e. [math]\displaystyle{ \eta_{ij}=1 }[/math] indicates that [math]\displaystyle{ x_i, x_j }[/math] are neighbours) should be preserved in the new space after the mapping [math]\displaystyle{ \Phi(\cdot) }[/math]. In other words, for all [math]\displaystyle{ \eta_{ij} > 0 }[/math],
[math]\displaystyle{ \left|\Phi(x_i) - \Phi(x_j)\right|^2 = \left|x_i - x_j\right|^2. }[/math]
Additionally, so that the map is conformal (angle preserving), the pairwise distance between two points having a common neighbour should also be preserved. Two data points having a common neighbour can be identified by [math]\displaystyle{ [\eta^T\eta]_{ij} > 0. }[/math] This ensures that if two points have a common neighbour, we preserve their pairwise distances and angles.

[math]\displaystyle{ \left|\Phi(x_i) - \Phi(x_j)\right|^2 = \left(\Phi(x_i) - \Phi(x_j)\right)^{T}\left(\Phi(x_i) - \Phi(x_j)\right) }[/math]
[math]\displaystyle{ \left|\Phi(x_i) - \Phi(x_j)\right|^2 = \Phi(x_i)^{T}\Phi(x_i) + \Phi(x_j)^{T}\Phi(x_j) - 2 \Phi(x_i)^{T}\Phi(x_j) }[/math]

Thus, [math]\displaystyle{ K_{ii}+K_{jj}-2K_{ij}=\left|x_i - x_j\right|^2 }[/math] for all [math]\displaystyle{ i, j }[/math] with [math]\displaystyle{ \eta_{ij} > 0 }[/math] or [math]\displaystyle{ [\eta^T\eta]_{ij} > 0. }[/math]

Objective Functions

Given these constraints, we now consider the objective function. The aim of dimensionality reduction is to map high-dimensional data into a low-dimensional space while losing as little information as possible. Recall that the dimension of the new space is determined by the rank of the kernel. Hence, the ideal kernel is the one of minimum rank, and the ideal objective function would be
[math]\displaystyle{ \min\quad rank(K). }[/math]
However, minimizing the rank of a matrix is a hard problem, so we look at the question in another way. When doing dimensionality reduction, we try to push non-neighbouring points as far apart as possible. In other words, we want to maximize the variance of the embedding, which is the sum of the eigenvalues of the kernel, i.e. its trace. In this sense, we can change the objective function to
[math]\displaystyle{ \max \quad Trace(K) }[/math] .

Note that it is an interesting question whether these two objective functions are equivalent to each other. Although they are not entirely equivalent, it can be shown that they usually converge to each other.

Algorithm for Optimization Problem

The objective function together with the linear constraints forms a typical semidefinite programming (SDP) problem. The optimization is convex, so the solution obtained is a global one, and standard methods exist to solve this kind of optimization problem. A small sketch of the resulting semidefinite program is given below.
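
The following is a minimal sketch of the MVU semidefinite program, assuming a small data matrix X with points as rows and using the cvxpy package; the neighbourhood relation is built here with a simple k-nearest-neighbour rule, and all names are illustrative rather than taken from the lecture.

<pre>
import numpy as np
import cvxpy as cp

def mvu_kernel(X, k=4):
    """Learn an MVU kernel for the points in the rows of X (rough sketch)."""
    n = X.shape[0]
    # squared Euclidean distances in the original space
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    # simple k-nearest-neighbour relation (eta), symmetrized
    eta = np.zeros((n, n), dtype=bool)
    for i in range(n):
        eta[i, np.argsort(D2[i])[1:k + 1]] = True
    eta = eta | eta.T
    common = (eta.astype(int) @ eta.astype(int)) > 0   # pairs sharing a neighbour

    K = cp.Variable((n, n), PSD=True)      # constraint 1: positive semidefiniteness
    cons = [cp.sum(K) == 0]                # constraint 2: centering
    for i in range(n):
        for j in range(i + 1, n):
            if eta[i, j] or common[i, j]:  # constraint 3: isometry
                cons.append(K[i, i] + K[j, j] - 2 * K[i, j] == D2[i, j])

    prob = cp.Problem(cp.Maximize(cp.trace(K)), cons)
    prob.solve()                           # any installed SDP-capable solver, e.g. SCS
    return K.value
</pre>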

Colored Maximum Variance Unfolding <ref>Song, L. and colleagues; Proceedings of the 2007 Conference, 1385-1392.</ref>

MVU is based on maximizing the overall variance while the local distances between neighbouring points are preserved, and it uses only one source of information. Colored MVU uses more than one source of information; that is, it reduces the dimension while satisfying a combination of two goals:
1- preserving the local distances (the first information)
2- optimal alignment with the second information (side information)

Examples of how Colored MVU can leverage the side information
  • Given text data from a newsgroup as first information, a hierarchy of topics can be used as side information to guide the embedding.
  • Given term-frequency and inverse-document-frequency representation of academic papers as first information, co-author relationship can be used as side information to guide the embedding.
Rationale for separating the side information from the data
  • We cannot merge all kinds of information into one distance metric, because the data (first information) and the side information may be heterogeneous.
  • The side information may be a feature of similarity (papers with the same co-authors tend to be more similar) rather than difference (papers with different co-authors are not necessarily far apart).
  • When inserting new information, usually only new data but not new side information is added.

Algorithmic Modification

In Colored MVU, [math]\displaystyle{ Trace(KL) }[/math] is maximized instead of [math]\displaystyle{ Trace(K) }[/math], where [math]\displaystyle{ L }[/math] is a covariance (kernel) matrix derived from the side information.

Application

One of the drawbacks of MVU is that its statistical interpretation is not always clear. However, one application of Colored MVU that has a clear statistical interpretation is its use as a criterion for measuring Hilbert-Schmidt independence.

Steps for SDE algorithm

  • Generate a K-nearest-neighbour graph. The graph should be connected; if K is too small, the graph may be disconnected and the problem becomes unbounded, having no solution.
  • Semidefinite programming: Maximize [math]\displaystyle{ Tr(K) }[/math] subject to the above mentioned constraints.
  • Do kernel PCA with this learned kernel.

Advantages

  • The kernel that is learned from the data can actually reflect the intrinsic dimensionality of the data. More specifically, the eigen-spectrum of the kernel matrix K provides an estimate (see the short check after this list).
    The dimension needed to preserve local distances while maximizing variance depends on the number of dominant eigenvalues of K. That is, if the top r eigenvalues of K account for 90% of the trace, then an r-dimensional representation can reveal about 90% of the unfolded data's variance.
  • MVU is a convex problem which guarantees a unique solution.
  • Distance-preserving constraints can be easily expressed and enforced in the semidefinite programming framework. This flexibility allows tailor-made constraints to be imposed for particular applications, for example analyzing robot motions (ARE).
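
The following is a small numpy check of the eigen-spectrum statement above, assuming K is the learned kernel matrix returned by an MVU/SDE solver (illustrative name):

<pre>
import numpy as np

def variance_captured(K, r):
    """Fraction of the unfolded variance explained by the top r eigenvalues of K."""
    eigvals = np.sort(np.linalg.eigvalsh(K))[::-1]   # eigenvalues in descending order
    return eigvals[:r].sum() / eigvals.sum()

# e.g. if variance_captured(K, 2) is about 0.9, a 2-D embedding keeps ~90% of the variance
</pre>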

Disadvantages

  • SDE can be solved in polynomial time but still has a high computational complexity, roughly O(n^3 + c^3), where n is the size of the kernel matrix and c is the number of constraints.
  • SDE is limited to isometric maps.

Application in SVM classification

The learned kernel can replace the popular kernels used in SVM classification (e.g. the linear kernel). In practice, however, it tends to perform worse than kernel functions chosen a priori.

June 4th

Action Respecting Embedding (ARE)

It is a variation of Maximum Variance Unfolding.
The data here is temporal or ordered, i.e. we move from one point to another by taking an action. In other words, action [math]\displaystyle{ a_i }[/math] is taken between data points [math]\displaystyle{ x_i }[/math] and [math]\displaystyle{ x_{i+1} }[/math].
Action labels, even with no interpretation or implied meaning, provide more information about the underlying generation of the data. It is natural to expect that the actions correspond to some simple operator on the generator's own degrees of freedom. For example, a camera that is being panned left and then right has actions that correspond to a simple translation in the camera's actuator space. We therefore want to constrain the learned representation so that the labeled actions correspond to simple transformations in that space. In particular, we can require all actions to be a simple rotation plus translation in the resulting low-dimensional representation.<ref> M.Bowling, A.Ghodsi, and D.Wilkinson. Action respecting embedding. In International Conference on Machine Learning, 2005. </ref>
Consider an action [math]\displaystyle{ a }[/math] taken between the points [math]\displaystyle{ x_i , x_{i+1} }[/math] and between the points [math]\displaystyle{ x_j , x_{j+1} }[/math]; in the original data space this action may not be a simple transformation (a rotation, a translation, or a combination of both).

A transformation [math]\displaystyle{ T }[/math] is called simple or distance preserving if and only if

[math]\displaystyle{ \forall x, x' }[/math] [math]\displaystyle{ \left \Vert T(x)-T(x') \right \|=\left \Vert x - x' \right \| }[/math]

Notice that [math]\displaystyle{ T_a(x_i)=x_{i+1} }[/math] and [math]\displaystyle{ T_a(x_j)=x_{j+1} }[/math]

In the low-dimensional space, as in the camera example where actions correspond to simple translations in the camera's actuator space, an action can become a simple transformation. Therefore, constraining the actions to be simple transformations during dimensionality reduction helps us find a low-dimensional representation close to the true one, provided the actions indeed correspond to simple transformations in the intrinsic space.

The goal here is not only to reduce the dimensionality of the data but also to reduce the complexity of the actions, in the sense that each action in the low-dimensional representation is a simple transformation. Therefore, to obtain a low-dimensional embedding of the high-dimensional temporal data, each action must be represented in the low dimension by a constraint that preserves distances. This constraint is called the action respecting constraint.

Constraint

For any two data points [math]\displaystyle{ x_i }[/math], [math]\displaystyle{ x_j }[/math], if the same action [math]\displaystyle{ a }[/math] [math]\displaystyle{ \left(a_{i}=a_{j}\right) }[/math] is carried out, transforming them into [math]\displaystyle{ x_{i+1} }[/math] and [math]\displaystyle{ x_{j+1} }[/math] respectively, then the distance between [math]\displaystyle{ y_i }[/math] and [math]\displaystyle{ y_j }[/math] must be equal to the distance between [math]\displaystyle{ y_{i+1} }[/math] and [math]\displaystyle{ y_{j+1} }[/math], where [math]\displaystyle{ y_i }[/math], [math]\displaystyle{ y_j }[/math], [math]\displaystyle{ y_{i+1} }[/math], [math]\displaystyle{ y_{j+1} }[/math] are the corresponding points in the low dimension. This constraint is given as:
[math]\displaystyle{ \left|y_i - y_j\right|^2=\left|y_{i+1} - y_{j+1}\right|^2 \rightarrow \left|\Phi(x_i) - \Phi(x_j)\right|^2=\left|\Phi(x_{i+1}) - \Phi(x_{j+1})\right|^2 }[/math]
The kernel form of the above constraint is:
[math]\displaystyle{ \forall i, j: a_{i}=a_{j} \Rightarrow K_{ii}+K_{jj}-2K_{ij}=K_{(i+1)(i+1)}+K_{(j+1)(j+1)}-2K_{(i+1)(j+1)} }[/math]


The above action respecting constraint is added to the constraints of MVU, and the MVU algorithm is then run to obtain a low-dimensional embedding of the temporal data. A sketch of how these constraints can be appended to the earlier MVU sketch is given below.
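
The following is a minimal sketch of how the action respecting constraints could be appended to the constraint list from the earlier MVU sketch, assuming K is the cvxpy kernel variable, cons is the existing constraint list, and actions is a list where actions[t] is the label of the action taken between points t and t+1 (all names are illustrative):

<pre>
# append action respecting constraints to the MVU constraint list (sketch)
T = len(actions)                         # actions[t] acts between points t and t+1
for i in range(T):
    for j in range(i + 1, T):
        if actions[i] == actions[j]:     # same action => distances must be preserved
            cons.append(K[i, i] + K[j, j] - 2 * K[i, j]
                        == K[i + 1, i + 1] + K[j + 1, j + 1] - 2 * K[i + 1, j + 1])
</pre>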

Example

This example is extracted from the "Action Respecting Embedding" paper listed in the references.

Consider a virtual robot that observes a 100 by 100 patch of a 2048 by 1536 image. The actions of the robot consist of four translations (rightward/leftward/upward/downward). In this example, we consider two action sequences and compare their representations by SDE and ARE.

File:image2.jpg
This is the 2048 by 1536 image. The rectangular box corresponds to the area covered by the first sequence of actions. The square box corresponds to the area covered by the second sequence of actions.
File:image1.jpg
This is the first sequence of actions which consists of 40 rightward translations followed by 20 leftward translations.
File:image3.jpg
This is the second sequence of actions which consists of translations in all the four directions.

It is obvious that the first sequence of actions lies in a one-dimensional subspace and the second sequence lies in a two-dimensional subspace. Although both SDE and ARE succeed in capturing this low dimensionality, the embedding achieved by ARE is much smoother and corresponds much better (almost exactly) to the actual actions.

File:image4.jpg
Representations of the first sequence of actions.
File:image5.jpg
Representations of the second sequence of actions.



June 9th

Applications of ARE

  • Planning: To find a sequence of events to achieve a desired goal i.e. we want to find a path that leads us to the desired goal given the initial point and the set of all possible actions.


In ARE, each action is constrained to be a simple transformation in the low-dimensional space. After obtaining the low-dimensional representation through ARE, we have a set of points [math]\displaystyle{ \lbrace y_t \rbrace }[/math].

Consider a collection of data point pairs [math]\displaystyle{ \lbrace (y_t }[/math], [math]\displaystyle{ y_{t+1}) \rbrace }[/math] such that [math]\displaystyle{ y_t \xrightarrow{a} y_{t+1} }[/math]. We can learn the action a as a simple transformation [math]\displaystyle{ f_a }[/math] such that

[math]\displaystyle{ f_a(y_t)=A_ay_t+b_a }[/math] subject to [math]\displaystyle{ A_a^TA_a=I }[/math]
We can then do a breadth-first expansion of a tree starting from the root (which represents the initial point), applying all possible actions (simple transformations) at each node until the desired goal is reached. A small sketch of fitting such a simple transformation from the data pairs is given below.
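
The following is a small numpy sketch of fitting the simple transformation [math]\displaystyle{ f_a }[/math] from the pairs, assuming Y and Z are d-by-n arrays whose columns are the [math]\displaystyle{ y_t }[/math] and the corresponding [math]\displaystyle{ y_{t+1} }[/math] for a fixed action a; this is the classical orthogonal Procrustes solution, and the names are illustrative:

<pre>
import numpy as np

def fit_simple_transformation(Y, Z):
    """Fit f_a(y) = A_a y + b_a with A_a^T A_a = I, mapping columns of Y to columns of Z."""
    y_mean = Y.mean(axis=1, keepdims=True)
    z_mean = Z.mean(axis=1, keepdims=True)
    Yc, Zc = Y - y_mean, Z - z_mean
    # orthogonal Procrustes: minimize ||A Yc - Zc||_F over orthogonal A
    U, _, Vt = np.linalg.svd(Yc @ Zc.T)
    A = Vt.T @ U.T
    b = z_mean - A @ y_mean
    return A, b
</pre>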

  • Robot localization: This is usually accomplished by using probabilistic motion and sensor models. Using ARE, however, we can do robot localization in the low-dimensional map rather than in the original space, which has the advantage of being independent of the environmental constraints.

Metric Learning

Metric Learning is a supervised algorithm used for dimensionality reduction, in which class-related side information is used. Given a set of points [math]\displaystyle{ \{x_i, i=1, \cdots, m\} }[/math], two types of class-related information are taken into consideration. The first one is the similar set or class-equivalent set.

Similar Set
a set of pairs of similar points, denoted by [math]\displaystyle{ S }[/math]
[math]\displaystyle{ S : (x_i, x_j) \in S }[/math] if [math]\displaystyle{ x_i }[/math] and [math]\displaystyle{ x_j }[/math] are similar;

The second one is the dissimilar set or class-inequivalent set
Dissimilar Set
a set of pairs of dissimilar points, denoted by [math]\displaystyle{ D }[/math]
[math]\displaystyle{ D : (x_i, x_j) \in D }[/math] if [math]\displaystyle{ x_i }[/math] and [math]\displaystyle{ x_j }[/math] are dissimilar.

Note that pairs of points that are not known to be either similar or dissimilar are not placed in either set. These two sets can come from knowing the class labels, or just from knowing the similarity or dissimilarity of some data. Some algorithms, like Maximally Collapsing Metric Learning, may require knowing the class labels of the data.

We want to learn a distance metric
[math]\displaystyle{ d_A(x_i, x_j) = \|x_i - x_j\|_A = \sqrt{(x_i-x_j)^T A(x_i-x_j)} }[/math], where [math]\displaystyle{ \|x_i - x_j\|_A }[/math] is not the Euclidean distance but the Mahalanobis distance determined by the positive semidefinite matrix [math]\displaystyle{ A }[/math].
Equivalently, we want to learn a semidefinite matrix [math]\displaystyle{ A }[/math] from the given data such that similar points are close but dissimilar points are far apart.
Writing [math]\displaystyle{ A= WW^T }[/math], [math]\displaystyle{ W }[/math] is the transformation that maps the data to the other space: the Euclidean distance between points in the transformed space equals the Mahalanobis distance in the original space.
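
The following is a small numpy illustration of this equivalence, assuming W is a d-by-m transformation matrix (illustrative names):

<pre>
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 2
W = rng.standard_normal((d, m))
A = W @ W.T                                     # positive semidefinite by construction

xi, xj = rng.standard_normal(d), rng.standard_normal(d)
diff = xi - xj
d_A = np.sqrt(diff @ A @ diff)                  # Mahalanobis distance under A
d_euclid = np.linalg.norm(W.T @ xi - W.T @ xj)  # Euclidean distance after mapping x -> W^T x
assert np.isclose(d_A, d_euclid)
</pre>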

This idea first appeared in 2004. Since then, several different approaches have been proposed for finding the metric.

1. Original Optimization Problem

It was given by Eric P. Xing, Andrew Y. Ng, Michael I. Jordan and Stuart Russell in 2004 <ref name="Xing">Eric P. Xing, Andrew Y. Ng, Michael I. Jordan and Stuart Russell, Distance metric learning, with application to clustering with side-information, </ref>. The authors state the optimization problem in the following form:
[math]\displaystyle{ \min_A \sum_{(x_i , x_j) \in S} \|x_i - x_j\|^2_A }[/math]
[math]\displaystyle{ s.t. \sum_{(x_i , x_j) \in D} \|x_i - x_j\|_A \ge 1 ,(*) }[/math]
[math]\displaystyle{ A \ge 0 . }[/math]

The constraint is given to keep the dissimilar points apart. If the constraint were ignored, [math]\displaystyle{ A = 0 }[/math] would be a trivial but useless solution, meaning all points collapse to a single point. The choice of the constant [math]\displaystyle{ 1 }[/math] is not important and can be changed to any other positive number. In the paper, it is also shown that this is a convex optimization problem; hence, we can solve it with efficient algorithms without getting stuck at local minima. The authors also note that some possible alternatives to (*) would not be good choices. For example, if this constraint is changed to [math]\displaystyle{ s.t. \sum_{(x_i , x_j) \in D} \|x_i - x_j\|^2_A \ge 1 }[/math], it remains a linear constraint, but it would result in A always being rank 1 (i.e., the data are always projected onto a line).

Algorithms to Find A

Since no analytical formula is known for solving for [math]\displaystyle{ A }[/math] in the above formulation, iterative algorithms have been developed to approximate [math]\displaystyle{ A }[/math]. During the iterative process, the algorithms have to ensure that [math]\displaystyle{ A \ge 0 }[/math].

Algorithm to find a full matrix A

In the general case where we seek a full matrix [math]\displaystyle{ A }[/math], the constraint [math]\displaystyle{ A \ge 0 }[/math] is tricky to enforce and a brute-force Newton's method is prohibitively expensive. An algorithm which uses the ideas of gradient descent and iterative projections is given in the aforementioned paper as follows (a sketch of the two projection steps is given after the algorithm):

Iterate
Iterate
[math]\displaystyle{ A := \arg \min_{A'} \{ \| A' - A \|_F : A' \in C_1 \} }[/math](first projection)
[math]\displaystyle{ A := \arg \min_{A'} \{ \| A' - A \|_F : A' \in C_2 \} }[/math](second projection)
until [math]\displaystyle{ A }[/math] converges
[math]\displaystyle{ A := A + \alpha \nabla_A(g(A))_{\perp \nabla_A f} }[/math](gradient ascent)
until convergence

where [math]\displaystyle{ \| \cdot \|_F }[/math] is the Frobenius norm,
[math]\displaystyle{ C_1 = \{A:\sum_{(x_i, x_j) \in S}\| x_i - x_j \|^2_A \le 1\} }[/math] and [math]\displaystyle{ C_2 = \{A:A \ge 0 \} }[/math]
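
The following is a minimal numpy sketch of the two projection steps of the inner loop, assuming S_pairs is a list of (x_i, x_j) tuples from the similar set; the gradient-ascent step on the dissimilarity objective is omitted, and all names are illustrative:

<pre>
import numpy as np

def project_C1(A, S_pairs):
    """Project A onto C1 = {A : sum_S ||x_i - x_j||_A^2 <= 1}, a half-space in vec(A)."""
    C = sum(np.outer(xi - xj, xi - xj) for xi, xj in S_pairs)
    val = np.sum(A * C)                      # <A, C>_F = sum_S ||x_i - x_j||_A^2
    if val <= 1:
        return A
    return A - ((val - 1) / np.sum(C * C)) * C

def project_C2(A):
    """Project A onto C2 = {A : A >= 0} by clipping negative eigenvalues."""
    A = (A + A.T) / 2                        # symmetrize
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.clip(w, 0, None)) @ V.T
</pre>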

Efficient Algorithm to find a diagonal A

In the special case where we seek a diagonal matrix [math]\displaystyle{ A }[/math], the authors have derived a much more efficient algorithm, as explained below.

Obviously, letting A = I gives the Euclidean distance. Now, suppose we want to learn a diagonal matrix, that is [math]\displaystyle{ A=diag(A_{11},A_{22},...,A_{nn}) }[/math]

Define [math]\displaystyle{ g(A)=g(A_{11},A_{22},...,A_{nn})=\sum_{(x_i , x_j) \in S} \|x_i - x_j\|^2_A-log(\sum_{(x_i , x_j) \in D} \|x_i - x_j\|_A) }[/math]

We can use the Newton-Raphson algorithm to minimize [math]\displaystyle{ g(A) }[/math]. A rough sketch of this diagonal case is given below.
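
The following is a rough sketch of minimizing g over the diagonal entries; for brevity it hands the objective to scipy's bounded quasi-Newton routine (L-BFGS-B) instead of a hand-coded Newton-Raphson iteration, and S_pairs / D_pairs are assumed to be lists of (x_i, x_j) tuples (illustrative names):

<pre>
import numpy as np
from scipy.optimize import minimize

def learn_diagonal_metric(S_pairs, D_pairs, dim):
    """Learn a diagonal metric A = diag(a) by minimizing g(a) with a >= 0."""
    dS = np.array([(xi - xj) ** 2 for xi, xj in S_pairs])   # squared coordinate gaps
    dD = np.array([(xi - xj) ** 2 for xi, xj in D_pairs])

    def g(a):
        sim_term = np.sum(dS @ a)                 # sum_S ||x_i - x_j||_A^2
        dis_term = np.sum(np.sqrt(dD @ a))        # sum_D ||x_i - x_j||_A
        return sim_term - np.log(dis_term + 1e-12)

    a0 = np.ones(dim)
    res = minimize(g, a0, bounds=[(0, None)] * dim, method="L-BFGS-B")
    return np.diag(res.x)
</pre>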

2. Maximally Collapsing Metric Learning

Globerson & Roweis <ref> Globerson, A., and Roweis, S. 2006. Metric learning by collapsing classes. In Weiss, Y.; Sch¨olkopf, B.; and Platt, J., eds., Advances in Neural Information Processing Systems 18, 451–458. Cambridge, MA: MIT Press.</ref> proposed another metric learning method which considerably overperforms the original technique suggested by Xing, et. al. Based on a distance metric matrix [math]\displaystyle{ A }[/math], they define a conditional probability distribution function as
[math]\displaystyle{ P^A(j|i)=\frac{e^{-({d_{ij}^A})^2}}{\sum_{k\neq i} e^{-({d_{ik}^A})^2}} }[/math]
Ideally, we expect all the points in the same class to collapse to one point and all the points in different classes to get infinitely far apart from each other. In this ideal situation the conditional probability distribution would be
[math]\displaystyle{ p_0(j|i)\propto \left\{\begin{matrix} 1 & y_i=y_j \\ 0 & y_i \neq y_j \end{matrix}\right. }[/math]

To learn a distance metric, Globerson & Roweis find the matrix [math]\displaystyle{ A }[/math] for which the conditional probability distribution introduced above gets as close as possible to the ideal conditional distribution. To this end, they minimize the KL divergence between the two distributions (the KL divergence is a measure of the difference between two probability distributions):
[math]\displaystyle{ \min_A \sum_{i} \textbf{KL}\left[ p_0(j|i) \| p^A(j|i)\right] }[/math]
subject to [math]\displaystyle{ A\succeq 0 }[/math]. This is a convex optimization problem and may be solved by a projected gradient approach similar to the one used by Xing et al. <ref name="Xing"/>.
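
The following is a small numpy sketch of the two distributions and the resulting objective, assuming X is an n-by-d data matrix with integer class labels y; the optimization over A (e.g. by projected gradient) is not shown, and names are illustrative:

<pre>
import numpy as np

def mcml_objective(A, X, y):
    """Sum over i of KL(p0(.|i) || p^A(.|i)); assumes every class has at least two points."""
    diffs = X[:, None, :] - X[None, :, :]
    d2 = np.einsum('ijk,kl,ijl->ij', diffs, A, diffs)     # squared Mahalanobis distances
    np.fill_diagonal(d2, np.inf)                          # exclude j = i
    P = np.exp(-d2)
    P /= P.sum(axis=1, keepdims=True)                     # P^A(j|i)

    same = (y[:, None] == y[None, :]).astype(float)
    np.fill_diagonal(same, 0.0)
    P0 = same / same.sum(axis=1, keepdims=True)           # ideal distribution p0(j|i)

    mask = P0 > 0
    return np.sum(P0[mask] * np.log(P0[mask] / P[mask]))  # KL divergence summed over i
</pre>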

3. PSD formulation

Another approach is given in <ref name="Ali Ghodsi"> Ali Ghodsi, Dana Wilkinson, Finnegan Southey, Improving Embeddings by Flexible Exploitation of Side Information</ref>, in which the loss function is given by

[math]\displaystyle{ L(A) = \sum_{(x_i, x_j) \in S } \|x_i - x_j\|^2_A - \sum_{(x_i, x_j)\in D} \|x_i - x_j\|_A^2. }[/math]

The motivation for using this loss function is that minimizing it makes its first component (the sum of squared distances between similar points) small while making its second component (the sum of squared distances between dissimilar points) large.
The optimization problem is
[math]\displaystyle{ \min_A L(A); s.t. A \ge 0, Tr(A) = 1 (1) }[/math].
The positive semidefiniteness constraint ([math]\displaystyle{ A \ge 0 }[/math]) guarantees a valid Euclidean metric, and the trace constraint prevents the trivial solution [math]\displaystyle{ A =0 }[/math]. In order to use standard semidefinite programming software, [math]\displaystyle{ L(A) }[/math] must be linearized. To do so, we use the function [math]\displaystyle{ vec() }[/math] (which rearranges a matrix by concatenating its columns), which satisfies useful identities such as

[math]\displaystyle{ vec(ABC)=(C^{T}*A)vec(B) }[/math],
in which [math]\displaystyle{ * }[/math] is the Kronecker product.

Since [math]\displaystyle{ (x_{i}-x_{j})^{T}A(x_{i}-x_{j}) }[/math] is a scalar, we can write

[math]\displaystyle{ (x_{i}-x_{j})^{T}A(x_{i}-x_{j})=vec((x_{i}-x_{j})^{T}A(x_{i}-x_{j})) }[/math]

Also, from the Kronecker product, for any two (same-size) vectors [math]\displaystyle{ a , b }[/math] we have

[math]\displaystyle{ (a^{T}*b^{T})=vec(ba^{T})^{T} }[/math]

Using the two results above, it is easy to derive the following conclusion.

[math]\displaystyle{ L(A) = \sum_{(x_i, x_j) \in S } (x_i - x_j)^T A (x_i - x_j) - \sum_{(x_i, x_j)\in D} (x_i - x_j)^T A (x_i - x_j) }[/math]
[math]\displaystyle{ = \sum_{(x_i, x_j) \in S } vec(A)^T vec((x_i - x_j)(x_i - x_j)^T) - \sum_{(x_i, x_j)\in D} vec(A)^T vec((x_i - x_j)(x_i - x_j)^T) }[/math]
[math]\displaystyle{ = vec(A)^T \left[ \sum_{(x_i, x_j) \in S } vec((x_i - x_j)(x_i - x_j)^T) - \sum_{(x_i, x_j)\in D} vec((x_i - x_j)(x_i - x_j)^T) \right] }[/math]

This form, along with the two linear constraints given in (1), forms a semidefinite program that can be easily solved by an SDP solver such as SeDuMi in Matlab. Therefore, it is a more convenient form than that used by Xing et al. Furthermore, in the original form at least one dissimilar pair is required, while this is not necessary in the form given by Ali Ghodsi et al. because of the trace constraint. There can be only similar pairs, only dissimilar pairs, or any combination of the two, and the method will still avoid the trivial solution. Moreover, in the absence of specific information regarding dissimilarities, Xing et al. assume that all points not explicitly identified as similar are dissimilar. This information may be misleading, forcing the algorithm to separate points that should in fact be similar. The formulation presented by Ali Ghodsi et al. allows one to specify only the side information one actually has, partitioning the pairs into similar, dissimilar, and unknown. A small sketch of this formulation is given below.
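
The following is a minimal cvxpy sketch of problem (1), assuming S_pairs and D_pairs are lists of (x_i, x_j) tuples; the text mentions SeDuMi in Matlab, while this sketch simply uses whatever solver cvxpy selects (names are illustrative):

<pre>
import numpy as np
import cvxpy as cp

def learn_metric_psd(S_pairs, D_pairs, dim):
    """Minimize L(A) subject to A >= 0 and Tr(A) = 1 (problem (1) above)."""
    # C = sum_S B_ij - sum_D B_ij, so that L(A) = trace(A C)
    C = sum(np.outer(xi - xj, xi - xj) for xi, xj in S_pairs) \
        - sum(np.outer(xi - xj, xi - xj) for xi, xj in D_pairs)
    A = cp.Variable((dim, dim), PSD=True)
    prob = cp.Problem(cp.Minimize(cp.trace(A @ C)), [cp.trace(A) == 1])
    prob.solve()
    return A.value
</pre>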

June 11th

Closed form Metric learning (CFML)

Since [math]\displaystyle{ (x_i-x_j)^TA(x_i-x_j)=Tr((x_i-x_j)^T WW^T(x_i-x_j))=Tr(W^T(x_i-x_j)(x_i-x_j)^T W) }[/math], the similarity and dissimilarity terms can be collected into the scatter matrices [math]\displaystyle{ \mathbf M_S=\sum_{(x_i,x_j)\in S}(x_i-x_j)(x_i-x_j)^T }[/math] and [math]\displaystyle{ \mathbf M_D=\sum_{(x_i,x_j)\in D}(x_i-x_j)(x_i-x_j)^T }[/math].
The cost function to be minimized is:
[math]\displaystyle{ \min \frac{1}{|S|} \operatorname{trace}(W^TM_SW)- \frac{1}{|D|} \operatorname{trace}(W^TM_DW) }[/math]
s.t. [math]\displaystyle{ \operatorname{trace}(A)=1 }[/math] or [math]\displaystyle{ \operatorname{trace}(WW^T)=1 }[/math]
Solving this as a Lagrange multiplier problem we get
[math]\displaystyle{ \mathbf {(M_S-M_D)} \mathbf W = \lambda \mathbf W }[/math]
This results in the matrix [math]\displaystyle{ \mathbf W }[/math] being rank 1, i.e. its columns are all eigenvectors having the same eigenvalue, and therefore A is also rank 1. As a result, all the data points are projected onto a line.
This projection onto a line is due to the constraint imposed on the cost function, so to avoid it we need to change the constraint. There are two alternative constraints that can be imposed on the cost function:

  • The constraint imposed is: [math]\displaystyle{ \mathbf W^T\mathbf W= \mathbf I_m }[/math].
    So, the objective function is:
    [math]\displaystyle{ \min_{\mathbf W}\operatorname {trace}(\mathbf W^T(\mathbf M_S-\mathbf M_D)\mathbf W) }[/math] s.t [math]\displaystyle{ \mathbf W^T\mathbf W=\mathbf I_m }[/math].

The columns of [math]\displaystyle{ \mathbf W }[/math] are the eigenvectors of [math]\displaystyle{ (\mathbf M_S-\mathbf M_D) }[/math] corresponding to its smallest eigenvalues.

  • The constraint is: [math]\displaystyle{ \mathbf W^T\mathbf M_S\mathbf W= \mathbf I_m }[/math].
    So, the objective function is:

[math]\displaystyle{ \min_{\mathbf W}\operatorname {trace}(\mathbf W^T(\mathbf M_S-\mathbf M_D)\mathbf W) }[/math]
s.t. [math]\displaystyle{ \mathbf W^T\mathbf M_S\mathbf W=\mathbf I_m }[/math]
This alternative algorithm is called CFML-II.
To solve this new form of the optimization problem, let [math]\displaystyle{ \mathbf M_S=\mathbf {HH^T} }[/math]. Substituting this in our constraint and also setting [math]\displaystyle{ \mathbf {H^TW}=\mathbf Q }[/math], we get our cost function as:
[math]\displaystyle{ \min\operatorname {trace}(\mathbf Q^T(\mathbf I-\mathbf {H^{-1}M_DH^{-T}})\mathbf Q) }[/math] s.t. [math]\displaystyle{ \mathbf Q^T\mathbf Q=\mathbf I }[/math]
The columns of [math]\displaystyle{ \mathbf Q }[/math] are eigenvectors of [math]\displaystyle{ (\mathbf I-\mathbf {H^{-1}M_DH^{-T}}) }[/math].
That is, [math]\displaystyle{ (\mathbf I-\mathbf {H^{-1}M_DH^{-T}})\mathbf Q=\lambda\mathbf Q }[/math]

i.e., [math]\displaystyle{ \mathbf {H^{-T}H^{-1}}\mathbf M_D\mathbf W=\lambda\mathbf W }[/math]
i.e., [math]\displaystyle{ \mathbf {M_S^{-1}}\mathbf M_D\mathbf W=\lambda\mathbf W }[/math]

This optimization problem is related to an older technique called Fisher discriminant analysis (FDA). A small sketch of both closed forms is given below.
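
The following is a small numpy/scipy sketch of both closed forms, assuming S_pairs and D_pairs are lists of (x_i, x_j) tuples, dim is the data dimension, and m is the target dimension (illustrative names; CFML-II additionally assumes [math]\displaystyle{ \mathbf M_S }[/math] is positive definite):

<pre>
import numpy as np
from scipy.linalg import eigh

def scatter(pairs, dim):
    """M = (1/|pairs|) * sum of (x_i - x_j)(x_i - x_j)^T over the given pairs."""
    M = np.zeros((dim, dim))
    for xi, xj in pairs:
        M += np.outer(xi - xj, xi - xj)
    return M / max(len(pairs), 1)

def cfml(S_pairs, D_pairs, dim, m):
    """CFML: columns of W are eigenvectors of (M_S - M_D) with the smallest eigenvalues."""
    M_S, M_D = scatter(S_pairs, dim), scatter(D_pairs, dim)
    w, V = eigh(M_S - M_D)          # eigenvalues in ascending order
    return V[:, :m]                 # W satisfies W^T W = I_m

def cfml2(S_pairs, D_pairs, dim, m):
    """CFML-II: generalized eigenvectors of M_S^{-1} M_D, i.e. M_D w = lambda M_S w."""
    M_S, M_D = scatter(S_pairs, dim), scatter(D_pairs, dim)
    w, V = eigh(M_D, M_S)           # generalized problem; vectors satisfy W^T M_S W = I
    return V[:, ::-1][:, :m]        # keep the largest generalized eigenvalues
</pre>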

Comparison

So far we have discussed a number of metric learning algorithms: Xing et al., MCML, CFML, CFML-II, and FDA. Compared to the others, the method of Xing et al. does not give very good results; CFML and MCML are competitive with each other; CFML has a closed form and runs quite fast; and FDA has a restriction on the rank: given k classes, the rank is at most k-1.

Using partial distance side information <ref name="Ali Ghodsi"/>

In this case, only partial distances are known, i.e. we are given the exact distances between some pairs of points.
Suppose a set of similarities is given, [math]\displaystyle{ S : (x_i, x_j) \in S }[/math] if the target distance [math]\displaystyle{ d_{ij} }[/math] is known. Then the cost function that preserves these local distances is:
[math]\displaystyle{ \min_{\mathbf A} \sum_S\left(\|x_i-x_j\|_A^2-d_{ij}\right)^2 }[/math] s.t. [math]\displaystyle{ \mathbf A \succeq 0 }[/math]
The above loss can be written as:
[math]\displaystyle{ L(A)=\sum_S\left(vec(A)^T vec(B_{ij})-d_{ij}\right)^2 }[/math]
[math]\displaystyle{ L(A)=\sum_S\left(vec(A)^T vec(B_{ij})vec(B_{ij})^Tvec(A)-2d_{ij}vec(A)^Tvec(B_{ij})+d_{ij}^2\right) }[/math]
where [math]\displaystyle{ B_{ij}=(x_i-x_j)(x_i-x_j)^T }[/math], and since [math]\displaystyle{ d_{ij}^2 }[/math] is independent of A, it can be dropped.
Therefore, the loss function is:
[math]\displaystyle{ L(A)=vec(A)^T[Qvec(A)-2R] }[/math] where [math]\displaystyle{ Q=\sum_S vec(B_{ij})vec(B_{ij})^T }[/math] and [math]\displaystyle{ R=\sum_S d_{ij}vec(B_{ij}) }[/math]

The above loss function is quadratic in [math]\displaystyle{ vec(A) }[/math], so it cannot be handed directly to a semidefinite program, whose objective must be linear. It can be converted to a linear objective using the Schur complement.

Schur Complement
[math]\displaystyle{ \begin{bmatrix} \mathbf X & \mathbf Y \\ \mathbf {Y^T} & \mathbf Z\end{bmatrix}\succeq 0 }[/math] if and only if [math]\displaystyle{ \mathbf {Z-Y^TX^{-1}Y}\succeq 0 }[/math] (assuming [math]\displaystyle{ \mathbf X \succ 0 }[/math])
By decomposing [math]\displaystyle{ \mathbf {Q = S^TS} }[/math], a matrix of the form
[math]\displaystyle{ J =\begin{bmatrix} I & Svec(A)\\(Svec(A))^T &2vec(A)^TR + t\end{bmatrix} }[/math] is constructed. By the Schur complement, if [math]\displaystyle{ J \succeq 0 }[/math], then the following relation holds
[math]\displaystyle{ \mathbf {2vec(A)^TR + t}- \mathbf {vec(A)^T S^T Svec(A)} \succeq 0 }[/math]
Scalar [math]\displaystyle{ \mathbf t }[/math] is an upper bound on the loss and therefore,
[math]\displaystyle{ \mathbf {vec(A)^T S^T Svec(A) }-\mathbf { 2vec(A)^TR} = \mathbf {vec(A)^TQvec(A)}-\mathbf{ 2vec(A)^TR} \le \mathbf t }[/math]
Therefore, minimizing t subject to [math]\displaystyle{ J \succeq 0 }[/math] also minimizes the objective. This optimization problem can be readily solved by standard semidefinite programming software:
[math]\displaystyle{ \min_A \mathbf t }[/math] s.t. [math]\displaystyle{ \mathbf A \succeq 0 }[/math] and [math]\displaystyle{ \mathbf J \succeq 0 }[/math]
A small sketch of this construction is given below.
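
The following is a minimal cvxpy sketch of this construction, assuming S_pairs is a list of (x_i, x_j, d_ij) triples with known target distances (illustrative names). The matrix S is taken to have rows [math]\displaystyle{ vec(B_{ij})^T }[/math], so that [math]\displaystyle{ S^TS=Q }[/math] and the entries of [math]\displaystyle{ Svec(A) }[/math] are [math]\displaystyle{ \operatorname{trace}(B_{ij}A) }[/math]; the block matrix J is encoded as a PSD variable with its blocks pinned by equality constraints.

<pre>
import numpy as np
import cvxpy as cp

def learn_metric_partial_distances(S_pairs, dim):
    """Minimize t subject to A >= 0 and the Schur-complement matrix J >= 0 (sketch)."""
    B_list = [np.outer(xi - xj, xi - xj) for xi, xj, _ in S_pairs]
    d_list = [d for _, _, d in S_pairs]
    p = len(B_list)

    A = cp.Variable((dim, dim), PSD=True)      # the metric, A >= 0
    t = cp.Variable()                          # upper bound on the loss

    # S vec(A): each entry is vec(B_ij)^T vec(A) = trace(B_ij A)
    u = cp.hstack([cp.trace(B @ A) for B in B_list])
    # 2 vec(A)^T R = 2 * sum_S d_ij * trace(B_ij A)
    lin = 2 * cp.sum(cp.hstack([d * cp.trace(B @ A) for B, d in zip(B_list, d_list)]))

    # J = [[I, S vec(A)], [(S vec(A))^T, 2 vec(A)^T R + t]], encoded as a PSD variable
    J = cp.Variable((p + 1, p + 1), PSD=True)
    cons = [J[:p, :p] == np.eye(p),
            J[:p, p] == u,
            J[p, p] == lin + t]

    prob = cp.Problem(cp.Minimize(t), cons)
    prob.solve()
    return A.value
</pre>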

June 16th

Nonnegative Matrix Factorization (NMF)

Assume [math]\displaystyle{ A }[/math] is our data matrix with columns representing the data points. A matrix factorization [math]\displaystyle{ A_{m\times n} \approx W_{m\times k}H_{k\times n} }[/math], in general, can be thought of as a way of expressing the columns of A as weighted sums of [math]\displaystyle{ k }[/math] bases. The [math]\displaystyle{ k }[/math] bases are, in fact, the columns of [math]\displaystyle{ W }[/math], and the weights corresponding to the [math]\displaystyle{ i }[/math]th data point are located in the [math]\displaystyle{ i }[/math]th column of the matrix [math]\displaystyle{ H }[/math].
As is well known, PCA, or equivalently the SVD, gives matrices [math]\displaystyle{ W }[/math] and [math]\displaystyle{ H }[/math] solving the following optimization problem:
[math]\displaystyle{ \min_{W,H} |A-WH|^2 }[/math]

In the above factorization, matrix [math]\displaystyle{ A }[/math] and the output matrices [math]\displaystyle{ W }[/math] and [math]\displaystyle{ H }[/math] in general may have negative or non-negative entries.
However, there are many applications in which the matrix [math]\displaystyle{ A }[/math] has only non-negative entries. Examples are images, word-frequency vectors of texts, DNA microarrays and music notes. In these applications, it is helpful to impose non-negativity constraints on the [math]\displaystyle{ W }[/math] and [math]\displaystyle{ H }[/math] matrices above. In summary, we want to factorize a nonnegative matrix A into two non-negative matrices, [math]\displaystyle{ W }[/math] and [math]\displaystyle{ H }[/math], such that [math]\displaystyle{ A \approx WH }[/math]. Nonnegative means all entries in the matrix are non-negative. So NMF, in a sense, is similar to the Singular Value Decomposition (SVD), but SVD does not guarantee non-negative entries.

As mentioned, for some applications we may require or prefer non-negative entries. For example, in face/image data we may want our image intensities to be non-negative, and it makes sense to interpret an image reconstruction as adding a set of images with non-negative intensities.

The related notion of nonnegative rank dates back to 1989 (Gregory and Pullman).

History

NMF became popular with the publication of the work of Lee and Seung in 1999 <ref name="Lee1999"/>. They presented an algorithm for NMF capable of learning parts of faces and semantic features of text, which is in contrast to other methods, such as PCA and vector quantization, that learn holistic, not parts-based, representations.

Applications

DNA microarray experiments (Ilmels and Barkai, 2003) using gene expression data.

Retrieve notes from an audio recording of polyphonic music (Smaragdis and Brown, 2003).

NMF Algorithms

Alternate updates to [math]\displaystyle{ W }[/math] and [math]\displaystyle{ H }[/math] using simple multiplicative update rules (Lee and Seung).

We want to minimise [math]\displaystyle{ \|A - WH\|_F }[/math]; with either [math]\displaystyle{ W }[/math] or [math]\displaystyle{ H }[/math] fixed, this is a linear least squares problem in the other. A sketch of the multiplicative updates is given below.
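
The following is a short numpy sketch of the standard Lee-Seung multiplicative updates for the Frobenius objective, following the [math]\displaystyle{ A \approx WH }[/math] convention used earlier in this section; a small constant is added to the denominators to avoid division by zero:

<pre>
import numpy as np

def nmf(A, k, n_iter=200, eps=1e-9):
    """Factor a nonnegative A (m x n) as A ~ W H with W (m x k) and H (k x n) nonnegative."""
    rng = np.random.default_rng(0)
    m, n = A.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(n_iter):
        # multiplicative updates keep all entries nonnegative
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H
</pre>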

Alternatively use an [math]\displaystyle{ L^1 }[/math] penalty term to enhance sparsity (Kim and Park).

Observations

The leading singular vectors of a nonnegative matrix can be chosen to be nonnegative. This is used as the basis for R1D (rank-1 downdate).

The leading singular vectors therefore give a simple rank-1 NMF via the SVD, i.e. a rank-1 nonnegative approximation of [math]\displaystyle{ A }[/math].

The rows and columns of [math]\displaystyle{ A }[/math] can be clustered using the singular vectors. If there are similar entries in [math]\displaystyle{ U }[/math], there will be corresponding similar rows in [math]\displaystyle{ A }[/math]. Likewise, similar entries in [math]\displaystyle{ V }[/math] correspond to similar columns of [math]\displaystyle{ A }[/math]. That is, [math]\displaystyle{ U }[/math] and [math]\displaystyle{ V }[/math] can be thought of as lower-dimensional representations of the rows and columns, respectively, of [math]\displaystyle{ A }[/math].

References

<references/>