sparse PCA — statwiki (wiki.math.uwaterloo.ca), revision by Amir, 2009-08-07
<hr />
<div>==Introduction==<br />
In PCA, given <math>n</math> observations of <math>d</math> variables (in other words, <math>n</math> <math>d</math>-dimensional data points), our goal is to find the directions in the data space along which the input data has the greatest variance. In practice each of the <math>d</math> variables has its own meaning, and it may be desirable to obtain principal components each of which is a combination of only a few of these variables. This makes the directions more interpretable and meaningful, but it is not what PCA typically produces: in most cases each resulting direction is a linear combination of all the variables, with no zero coefficients. <br />
<br />
To address these concerns we add a sparsity constraint to the PCA problem, which makes it much harder to solve, since we have added a combinatorial constraint to the optimization problem. This paper shows how to find directions of maximum variance in the data space that have a limited number of non-zero elements. In other words, it lets us perform feature selection by selecting a subset of features in each direction.<br />
<br />
==Contribution==<br />
In this paper, a direct approach (called DSPCA) that improves the sparsity of the principal components is presented. This is done in two stages: first, incorporating a sparsity criterion into the PCA formulation; second, forming a convex relaxation of the problem that is a semidefinite program. For small problems, semidefinite programs can be solved via general-purpose interior-point methods. However, these methods cannot be used for high-dimensional problems. In that case, our particular problem can be expressed as a saddle-point problem, for which a smoothing argument combined with an optimal first-order smooth minimization algorithm offers a significant reduction in computational time, and can therefore be used instead of generic interior-point SDP solvers.<br />
<br />
== Notation ==<br />
The following notations are used in this note.<br />
<br />
<math>S^n \,</math> is the set of symmetric matrices of size <math>n \,</math>.<br /><br />
<math> \textbf{1} \,</math> is a column vector of ones.<br /><br />
<math> \textbf{Card}(x) \, </math> denotes the cardinality (number of non-zero elements) of a vector <math>x \, </math><br /><br />
<math> \textbf{Card}(X) \, </math> denotes the cardinality (number of non-zero elements) of a matrix <math>X \, </math><br /><br />
For <math> X \in S^n \, </math>, <math> X \succeq 0 \, </math> means <math>X \,</math> is positive semi-definite.<br /><br />
<math>|X| \,</math> is the matrix whose elements are the absolute values of the elements of <math> X \, </math><br />
<br />
==Problem Formulation==<br />
<br />
Given the covariance matrix <math>A</math>, the problem can be written as:<br><br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& x^TAx\\<br />
\textrm{subject\ to}& ||x||_2=1\\<br />
&\textbf{Card}(x)\leq k<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (1) </td></tr></table><br />
<br />
The cardinality constraint makes this problem hard (NP-hard) and we are looking for a convex and efficient relaxation.<br /><br />
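To make the combinatorial nature of problem (1) concrete, here is a brute-force solver (a sketch of ours, not from the paper): for a fixed support <math>S</math> the optimum is the top eigenvector of the submatrix <math>A_{SS}</math>, so one can enumerate all <math>\binom{d}{k}</math> supports — feasible only for tiny <math>d</math>, which is exactly why a convex relaxation is needed. Function and variable names are hypothetical.

```python
import numpy as np
from itertools import combinations

def sparse_pc_bruteforce(A, k):
    """Exhaustively solve problem (1): max x^T A x s.t. ||x||_2 = 1, Card(x) <= k.

    For each support S of size k, the restricted optimum is the top eigenvector
    of A[S, S]; enumerating all supports is exponential in d.
    """
    d = A.shape[0]
    best_val, best_x = -np.inf, None
    for S in map(list, combinations(range(d), k)):
        vals, vecs = np.linalg.eigh(A[np.ix_(S, S)])  # eigenvalues in ascending order
        if vals[-1] > best_val:
            best_val = vals[-1]
            best_x = np.zeros(d)
            best_x[S] = vecs[:, -1]
    return best_val, best_x

rng = np.random.default_rng(0)
B = rng.standard_normal((20, 6))
A = B.T @ B / 20                      # a small 6x6 sample covariance matrix
val, x = sparse_pc_bruteforce(A, k=2)
```

The returned <math>x</math> is unit-norm with at most <math>k</math> non-zero entries, and its variance is at most the unconstrained <math>\lambda^{\max}(A)</math>.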
<br />
Defining <math>X=xx^T</math>, the above formula can be rewritten as<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \textbf{Tr}(AX)\\<br />
\textrm{subject\ to}&\textbf{Tr}(X)=1\\<br />
&\textbf{Card}(X)\leq k^2\\<br />
&X\succeq 0, \textbf{Rank}(X)=1\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (2) </td></tr></table><br />
<br />
The conditions <math>X\succeq 0</math> and <math>\textbf{Rank}(X)=1</math> in formula (2) guarantee that <math>X</math> can be written as <math>xx^T</math> for some <math>x</math>. But this formulation must be relaxed before it can be solved efficiently, because the constraints <math>\textbf{Card}(X)\leq k^2</math> and <math>\textbf{Rank}(X)=1</math> are not convex. So we replace the cardinality constraint with a weaker one, <math>\textbf{1}^T|X|\textbf 1\leq k</math>, and drop the rank constraint. So we get:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \textbf{Tr}(AX)\\<br />
\textrm{subject\ to}&\textbf{Tr}(X)=1\\<br />
&\textbf{1}^T|X|\textbf 1\leq k\\<br />
&X\succeq 0\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> &nbsp; <br> (i) </td></tr></table><br />
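One can check that (i) really is a relaxation of (2): for any unit vector <math>x</math> with <math>\textbf{Card}(x)\leq k</math>, the lifted matrix <math>X=xx^T</math> satisfies <math>\textbf{Tr}(X)=\|x\|_2^2=1</math> and <math>\textbf{1}^T|X|\textbf 1=\|x\|_1^2\leq k\|x\|_2^2=k</math> by Cauchy–Schwarz. A quick numerical sanity check (numpy; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(4)
d, k = 10, 3

# Build a random unit vector with exactly k non-zero entries.
x = np.zeros(d)
support = rng.choice(d, size=k, replace=False)
x[support] = rng.standard_normal(k)
x /= np.linalg.norm(x)

X = np.outer(x, x)              # the rank-one lifting X = x x^T
trace_X = np.trace(X)           # equals ||x||_2^2 = 1
l1_X = np.abs(X).sum()          # equals ||x||_1^2, bounded by k
```

So every point feasible for (2) maps to a point feasible for (i), as a relaxation requires.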
<br />
The above semidefinite relaxation can even be generalised to a non-square matrix <math> A \in R^{m\times n}</math> as follows:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \textbf{Tr}(AX^{12})\\<br />
\textrm{subject\ to}&\textbf{Tr}(X^{ii})=1\\<br />
&\textbf{1}^T|X^{ii}|\textbf 1\leq k_i, i=1,2\\<br />
&\textbf{1}^T|X^{12}|\textbf 1\leq \sqrt{k_1k_2}\\<br />
&X\succeq 0\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> &nbsp; </td></tr></table><br />
<br />
in the variable <math> X \in S^{m+n} </math> with blocks <math> X^{ij} </math> for <math>i,j=1,2</math>.<br />
<br />
We then change the modified cardinality constraint to a penalty term in the goal function with some positive factor <math>\rho</math>. So we get a semidefinite form of the problem:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \textbf{Tr}(AX)-\rho\textbf{1}^T|X|\textbf 1\\<br />
\textrm{subject\ to}&\textbf{Tr}(X)=1\\<br />
&X\succeq 0\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (3) </td></tr></table><br />
<br />
where <math> \rho </math> controls the penalty magnitude.<br />
<br />
The goal function can be rewritten as <math>\textbf{Tr}(AX)-\rho\textbf{1}^T|X|\textbf 1=\min_{|U_{ij}|\leq\rho}\textbf{Tr}((A+U)X)</math>. So the problem (3) is equivalent to:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \min_{|U_{ij}|\leq\rho}\textbf{Tr}(X(A+U))\\<br />
\textrm{subject\ to}&\textbf{Tr}(X)=1\\<br />
&X\succeq 0\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (4) </td></tr></table><br />
<br />
or equivalently, due to convexity:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{lll}<br />
\textrm{minimize}& \lambda^{\max}(A+U)\\<br />
\textrm{subject\ to}&|U_{ij}|\leq\rho,&i,j=1,\cdots,n\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (5) </td></tr></table><br />
<br />
where <math>\lambda^{\max}(M)</math> is the largest eigenvalue of the matrix <math>M</math>.<br />
<br />
The problem in formulation (5) can be seen as computing a robust version of the maximum eigenvalue: it is the least possible value of the maximum eigenvalue, given that each coefficient of the matrix can be perturbed by a noise of intensity at most <math>\rho</math>. In other words, it is a worst-case maximum eigenvalue computation under componentwise bounded noise.<br><br />
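The variational identity used in passing from (3) to (4) can be verified numerically: the inner minimum over <math>|U_{ij}|\leq\rho</math> is attained at <math>U=-\rho\,\textbf{sign}(X)</math>, recovering the penalized objective of (3). A small numpy check (matrix sizes and names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho = 5, 0.3
A = rng.standard_normal((n, n)); A = (A + A.T) / 2   # a symmetric stand-in for the covariance
v = rng.standard_normal(n); v /= np.linalg.norm(v)
X = np.outer(v, v)                                   # feasible for (3): Tr(X) = 1, X >= 0

penalized = np.trace(A @ X) - rho * np.abs(X).sum()  # objective of (3)

U_star = -rho * np.sign(X)                           # elementwise minimizer, |U_ij| <= rho
inner = np.trace((A + U_star) @ X)                   # inner value in (4)

# Any other feasible U gives a value no smaller than the minimum.
U_other = rho * (2 * rng.random((n, n)) - 1)
other = np.trace((A + (U_other + U_other.T) / 2) @ X)
```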
<br />
The KKT conditions for optimization problems (3) and (5) are given by:<br><br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math><br />
\left\{<br />
\begin{array}{rl}<br />
&(A+U)X=\lambda^{\max}(A+U)X\\<br />
&U\circ X=\rho |X| \\<br />
&\text{Tr}(X)=1,\,\,\,X\succeq 0 \\<br />
&|U_{i,j}|\leq \rho,\,\,\, i,j=1,\cdots ,n<br />
\end{array} \right.<br />
</math><br />
</td><td valign=top> <br> </td></tr></table><br><br />
If the <math>\lambda^{\max}</math> in the first equation is simple (meaning it is of multiplicity 1) and <math>\rho</math> is sufficiently small, from the first equation it follows that <math>\textbf{Rank}(X)=1</math>. In fact, the form of this equation implies that all columns of <math>\,X</math> are eigenvectors of matrix <math>A+U</math> corresponding to its maximum eigenvalue. So the rank one constraint is automatically satisfied in this special case.<br />
<br />
==The Algorithm==<br />
===The Main Loop===<br />
The algorithm iteratively creates the semidefinite program (4) and solves it to obtain the next most important sparse principal component. At each iteration, given the optimal solution <math>X</math> of the semidefinite program, we first recover a solution <math>x</math> of the corresponding problem of form (1). That is straightforward if <math>X</math> is of rank 1, but since we have dropped the rank constraint this may not be the case, and then we need to obtain the ''dominant'' eigenvector of <math>X</math> by methods known in the literature; for example, the power method efficiently provides the leading eigenvector of a matrix. Note, however, that in this case the resulting vector is not guaranteed to be as sparse as the matrix itself.<br />
After obtaining a (hopefully) sparse vector <math>x</math> we replace the matrix <math>A</math> with <math>A-(x^TAx)xx^T</math> and repeat the above steps to obtain the next sparse component.<br />
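A minimal numpy sketch of this loop's deflation step (in the actual algorithm <math>x</math> would come from the SDP solution; here the power method on <math>A</math> itself is only a stand-in, so this illustrates the update rather than the full method):

```python
import numpy as np

def power_method(M, iters=500, seed=0):
    """Return the leading eigenvector of a symmetric matrix by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(M.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = M @ v
        v = w / np.linalg.norm(w)
    return v

rng = np.random.default_rng(2)
B = rng.standard_normal((50, 8))
A = B.T @ B / 50                       # sample covariance matrix

x = power_method(A)                    # stand-in for the (sparse) vector from the SDP
A_deflated = A - (x @ A @ x) * np.outer(x, x)   # the deflation step from the text
```

After deflation the variance of <math>A</math> along <math>x</math> is removed (<math>x^TA'x=0</math>), so the next iteration finds a new direction.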
<br />
The question then is when to stop. Two approaches are proposed. First, at iteration <math>j</math> we include the constraints <math>x_i^TXx_i=0</math> for all <math>i<j</math>, to make sure that each principal component we compute is orthogonal to the previous ones. Then the procedure stops automatically after <math>n</math> steps (there is no solution to the <math>(n+1)</math>'th problem). <br />
<br />
The other way is to stop as soon as all entries of <math>A</math> are smaller than <math>\rho</math> in absolute value, because at that point the remaining variance is below the noise level <math>\rho</math>.<br />
<br />
=== A Smoothing Technique ===<br />
<br />
The numerical difficulties arising in large scale semidefinite programs stem from two distinct origins. <br />
<br />
I) Memory issue: beyond a certain problem size <math>n</math>, it becomes essentially impossible to form and store any second-order information (Hessian) about the problem, which is the key to the numerical efficiency of interior-point SDP solvers. <br />
<br />
II) Smoothness issue: the constraint <math> X \succeq 0 </math> is not smooth, hence the number of iterations required by first-order methods to solve problem (i) to an accuracy <math> \epsilon </math> grows as <math> O(1/ \epsilon^2) </math>.<br />
<br />
===Solving the Semidefinite Problem===<br />
The cardinality constraint in the formulation (or its corresponding term in the penalized form) introduces a quadratic number of terms in the problem. This makes it practically impossible to use interior-point methods to solve the problem for large input dimensions, so we must use other methods, and then speed becomes the issue. Denoting the required accuracy by <math>\epsilon</math>, we can expect an interior-point-based solver to converge after <math>\textstyle O(F(n)\log\frac1{\epsilon})</math> iterations, for some function <math>F</math> of the input size <math>n</math>. Here we will manage to solve the problem in <math>\textstyle O(\frac{F(n)}{\epsilon})</math> iterations using a first-order scheme.<br />
<br />
First-order methods can solve a problem in <math>\textstyle O(\frac1{\epsilon})</math> iterations if the problem satisfies a ''smoothness'' condition. But in our case the constraint <math>X\succeq 0</math> is not ''smooth'', and as a result a direct application of first-order methods to our problem yields an algorithm that stops only after <math>\textstyle O(\frac1{\epsilon^2})</math> iterations, which is too slow. To address this, we consider formulation (5) and define a smooth approximation of the function <math>\lambda^{\max}</math>.<br />
<br />
To come up with a smooth approximation of our goal function, we define <math>f_{\mu}(X)=\mu\log\textbf{Tr}(e^{\frac X\mu})</math>. Then, one can verify that <math>\lambda^{\max}(X)\leq f_\mu(X)\leq\lambda^{\max}(X)+\mu\log n</math> and so, for <math>\textstyle\mu=\frac{\epsilon}{\log n}</math>, <math>f_{\mu}</math> is a smooth approximation of <math>\lambda^{\max}</math> with an additive error of <math>\epsilon</math>.<br />
This way we obtain a scheme for solving the program in <math>\textstyle O(\frac n{\epsilon}\sqrt{\log n})</math> iterations, each taking <math>O(n^3)</math> time.<br />
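The sandwich bounds <math>\lambda^{\max}(X)\leq f_\mu(X)\leq\lambda^{\max}(X)+\mu\log n</math> are easy to check numerically, computing <math>\textbf{Tr}(e^{X/\mu})</math> through the eigenvalues of <math>X</math> (a sketch of ours; the stability shift is a standard log-sum-exp trick, not from the paper):

```python
import numpy as np

def f_mu(X, mu):
    """Smooth surrogate f_mu(X) = mu * log Tr exp(X/mu) for lambda_max(X)."""
    lam = np.linalg.eigvalsh(X)
    m = lam.max()
    # Tr exp(X/mu) = sum_i exp(lam_i/mu); shift by m to avoid overflow.
    return m + mu * np.log(np.exp((lam - m) / mu).sum())

rng = np.random.default_rng(3)
n = 6
X = rng.standard_normal((n, n)); X = (X + X.T) / 2   # a random symmetric matrix
mu = 0.05

lmax = np.linalg.eigvalsh(X)[-1]
val = f_mu(X, mu)
```

With <math>\mu=\epsilon/\log n</math> the additive gap <math>\mu\log n</math> equals <math>\epsilon</math>, as claimed above.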
<br />
==Experimental Results==<br />
Each point in the figures below corresponds to an experiment on 500 genes. The points are pre-clustered into 4 clusters based on some prior knowledge. The top three principal components are computed using each of the PCA and sparse PCA methods, and the points are plotted in the bases defined by these three components. For PCA, each principal component is a combination of all 500 variables (corresponding to 500 genes), while in sparse PCA each involves variables corresponding to at most 6 genes.<br />
[[Image:Spca-g.jpg|thumb|900px|center|Figure 1. Distribution of gene expression data under PCA vs. sparse PCA. The point colors are based on a pre-computed independent clustering.]]<br />
<br />
In the next figure, the left diagram compares the cumulative number of non-zero elements in the principal components for three methods: SPCA, the method explained here with <math>k=5</math>, and the method explained here with <math>k=6</math>. The right diagram shows the cumulative percentage of total variance explained by the first principal components resulting from PCA, SPCA (dashed), and the method explained here with <math>k=5</math> and <math>k=6</math> (solid lines). <br />
[[Image:Spca-a.jpg|thumb|900px|center|Figure 2. Cumulative cardinality (left) and cumulative percentage of total variance explained (right) by the first principal components resulting from PCA, SPCA (dashed), and the method explained here with <math>k=5</math> and <math>k=6</math> (solid lines).]]</div>
<hr />
<div>==Introduction==<br />
In PCA, Given <math>n</math> observations on <math>d</math> variables (or in other words <math>n</math> <math>d</math>-dimensional data points), our goal is to find directions in the space of the data set that correspond to the directions with biggest variance in the input data. In practice each of the <math>d</math> variables has its own special meaning and it may be desirable to come up with some directions, as principal components, each of which is a combination of just a few of these variables. This makes the directions more interpretable and meaningful. But this is not something that usually happens as the original result of PCA method. Each of resulting directions from PCA in most cases is a linear combination of all variable with no zero coefficients. <br />
<br />
To address the above concerns we add a sparsity constraint to the PCA problem, which makes the PCA problem much harder to solve. That's because we have just added a combinatorial constraint to optimization problem. This paper is showing us how to find directions in the data space with maximum variance that have a limited number of non-zero elements. In other words, this helps us to perform feature selection, by selecting a subset of features in each direction.<br />
<br />
==Contribution==<br />
In this paper, a direct approach (called DSPCA) that improves the sparsity of the principle components is presented. This is done in 2 stages. First, incorporating a sparsity criterion in the PCA formulation. Second, forming a convex relation of the problem that is a semidefinite program. For small problems, semidifinite programs can be solved via general purpose interior-point methods. However, these methods can not be used for high dimensional problems. In this case, a saddle point problem can express our particular problem. For this kind of problems, smoothing argument algorithms combined with an optimal first-order smooth minimization algorithm offer a significant reduction in computational time and therefore can be used instead of generic interior point SDP solvers.<br />
<br />
== Notation ==<br />
The following notations are used in this note.<br />
<br />
<math>S^n \,</math> is the set of symmetric matrices of size <math>n \,</math>.<br /><br />
<math> \textbf{1} \,</math> is a column vector of ones.<br /><br />
<math> \textbf{Card}(x) \, </math> denotes the cardinality (number of non-zero elements) of a vector <math>x \, </math><br /><br />
<math> \textbf{Card}(X) \, </math> denotes the cardinality (number of non-zero elements) of a matrix <math>X \, </math><br /><br />
For <math> X \in S^n \, </math>, <math> X \succeq 0 \, </math> means <math>X \,</math> is positive semi-definite.<br /><br />
<math>|X| \,</math> is the matrix whose elements are the absolute values of the elements of <math> X \, </math><br />
<br />
==Problem Formulation==<br />
<br />
Given the covariance matrix <math>A</math>, the problem can be written as:<br><br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& x^TAx\\<br />
\textrm{subject\ to}& ||x||_2=1\\<br />
&\textbf{Card}(x)\leq k<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (1) </td></tr></table><br />
<br />
The cardinality constraint makes this problem hard (NP-hard) and we are looking for a convex and efficient relaxation.<br /><br />
<br />
Defining <math>X=x^Tx</math>, the above formula can be rewritten as<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \textbf{Tr}(AX)\\<br />
\textrm{subject\ to}&\textbf{Tr}(X)=1\\<br />
&\textbf{Card}(X)\leq k^2\\<br />
&X\succeq 0, \textbf{Rank}(X)=1\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (2) </td></tr></table><br />
<br />
The conditions <math>X\succeq 0</math> and <math>\textbf{Rank}(X)=1</math> in formula 2 guarantees that <math>X</math> can be written as <math>x^Tx</math>, for some <math>x</math>. But this formulation should be relaxed before it can be solved efficiently, because the constraintS <math>\textbf{Card}(X)\leq k^2</math> and <math>\textbf{Rank}(X)=1</math> are not convex. So we replace the cardinality constraint with a weaker one: <math>\textbf{1}^T|X|\textbf 1\leq k</math>. We also drop the rank constraint. So we get:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \textbf{Tr}(AX)\\<br />
\textrm{subject\ to}&\textbf{Tr}(X)=1\\<br />
&\textbf{1}^T|X|\textbf 1\leq k\\<br />
&X\succeq 0\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> &nbsp; <br> (i) </td></tr></table><br />
<br />
The above semidefinite relaxation can even be generalised to a non square matrix <math> A \in R^{mxn}</math> as follows:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \textbf{Tr}(AX^{12})\\<br />
\textrm{subject\ to}&\textbf{Tr}(X^{ii})=1\\<br />
&\textbf{1}^T|X^{ii}|\textbf 1\leq k_i, i=1,2\\<br />
&\textbf{1}^T|X^{12}|\textbf 1\leq \sqrt{k_1k_2}\\<br />
&X\succeq 0\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> &nbsp; </td></tr></table><br />
<br />
in the variable <math> X \in S^{m+n} </math> with blocks <math> X^{ij} </math> for i,j=1,2.<br />
<br />
We then change the modified cardinality constraint to a penalty term in the goal function with some positive factor <math>\rho</math>. So we get a semidefinite form of the problem:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \textbf{Tr}(AX)-\rho\textbf{1}^T|X|\textbf 1\\<br />
\textrm{subject\ to}&\textbf{Tr}(X)=1\\<br />
&X\succeq 0\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (3) </td></tr></table><br />
<br />
where, <math> \rho </math> controls the penalty magnitude.<br />
<br />
The goal function can be rewritten as <math>\textbf{Tr}(AX)-\rho\textbf{1}^T|X|\textbf 1=\min_{|U_{ij}|\leq\rho}\textbf{Tr}((A+U)X)</math>. So the problem (3) is equivalent to:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \min_{|U_{ij}|\leq\rho}\textbf{Tr}(X(A+U))\\<br />
\textrm{subject\ to}&\textbf{Tr}(X)=1\\<br />
&X\succeq 0\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (4) </td></tr></table><br />
<br />
or equivalently, due to convexity:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{lll}<br />
\textrm{minimize}& \lambda^{\max}(A+U)\\<br />
\textrm{subject\ to}&|U_{ij}|\leq\rho,&i,j=1\cdots,n\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (5) </td></tr></table><br />
<br />
where <math>\lambda^{\max}(M)</math> is the largest eigenvalue of the matrix <math>M</math>.<br />
<br />
The problem as described in formulation (5) can be seen as computing a robust version of maximum eigenvalue: it is the least possible value of maximum eigenvalue, given that each element can be changed by at most noise value <math>\rho</math>. Also, it corresponds to a worst-case maximum eigenvalue computation with a bounded noise of intensity <math>\rho</math> in each component on the matrix coefficients.<br><br />
<br />
The KKT conditions for optimization problems (3) and (5) are given by:<br><br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math><br />
\left\{<br />
\begin{array}{rl}<br />
&(A+U)X=\lambda^{\max}(A+U)X\\<br />
&U\circ X=\rho |X| \\<br />
&\text{Tr}(X)=1,\,\,\,X\succeq 0 \\<br />
&|U_{i,j}|\leq \rho,\,\,\, i,j=1,\cdots ,n<br />
\end{array} \right.<br />
</math><br />
</td><td valign=top> <br> </td></tr></table><br><br />
If the <math>\lambda^{\max}</math> in the first equation is simple (meaning it is of multiplicity 1) and <math>\rho</math> is sufficiently small, from the first equation it follows that <math>\textbf{Rank}(X)=1</math>. In fact, the form of this equation implies that all columns of <math>\,X</math> are eigenvectors of matrix <math>A+U</math> corresponding to its maximum eigenvalue. So the rank one constraint is automatically satisfied in this special case.<br />
<br />
==The Algorithm==<br />
===The Main Loop===<br />
The algorithm should iteratively create the semidefinite program (4) and solve it to obtain the next most important sparse principle component. At each iteration, we should first obtain the solution <math>x</math> of the corresponding problem of form (1), if <math>X</math> is the optimal solution of the optimization problem. That will be straightforward if <math>X</math> is of rank 1, but since we have dropped the rank constraint, this may not be true and in those cases we need to obtain the ''dominant'' eigenvalue of <math>X</math> by the methods that are known in the literature; for example we can use the power method which efficiently provides us with the largest eigenvectors of a matrix. Note, however, that in this case the resulting vectors are not guaranteed to be as sparse as the matrix itself.<br />
After obtaining a (hopefully) sparse vector <math>x</math> we replace the matrix <math>A</math> with <math>A-(x_1^TAx_1)x_1x_1^T</math> and repeat the above steps to obtain the next sparse component values.<br />
<br />
The question then is "when to stop?". Two approaches are proposed. First, at each iteration <math>i</math>, for all <math>i<j</math>, we include the constraint <math>x_i^TXx_i=0</math> to make sure that each principal component we compute is orthogonal to the previous ones. Then the procedure stops after <math>n</math> steps automatically (there will be no solution to the <math>n+1</math>'th problem). <br />
<br />
The other way is stop as soon as all members of <math>A</math> get less than <math>\rho</math>, because at that point elements of <math>A</math> will be less than the noise value <math>\rho</math>.<br />
<br />
=== A Smoothing Technique ===<br />
<br />
The numerical difficulties arising in large scale semidefinite programs stem from two distinct origins. <br />
<br />
I) Memory issue: beyond a certain problem size n, it becomes essentially impossible to form and store any second order information (Hessian) on the problem, which is the key to the numerical efficiency of interior-point SDP solvers. <br />
<br />
II) Smoothness issue: the constraint <math> X \geq 0 </math> is not smooth, hence the number of iterations required to solve problem<br />
<br />
===Solving the Semidefinite Problem===<br />
The cardinality constraint in the formulation (or its corresponding term in the penalized form) introduces a quadratic number of terms in the problem. This makes it practically impossible to use interior-point method to solve the problem for large values of the input dimension. So we need to use other existing methods for solving the problem, but then, there will be a matter of speed. Denoting the required accuracy by <math>\epsilon</math>, we can expect an interior-point-based program to converge after <math>\textstyle O(F(n)\log\frac1{\epsilon})</math> iterations, for some function <math>F</math> of input size <math>n</math>. Here we will manage to solve the problem using <math>\textstyle O(\frac{F(n)}{\epsilon})</math> iterations using a first-order scheme, for some function <math>F</math> of input size.<br />
<br />
First-order method can solve a problem after <math>\textstyle O(\frac1{\epsilon})</math> iterations, for some function <math>F</math> of input size, if the problem satisfy a ''smoothness'' constraint. But in our case, <math>X\succeq 0</math> is not ''smooth'' and as a result, an application of first-order method to the our problem will result in an algorithm stopping after <math>\textstyle O(\frac1{\epsilon^2})</math> iterations, for some function <math>F</math> of input size, which is too slow. To address this problem, we consider the formulation (5) and then we define a smooth approximation of the function <math>\lambda^{\max}</math>.<br />
<br />
To come up with a smooth approximation of our goal function, we define <math>f_{\mu}(X)=\mu\log\textbf{Tr}(e^{\frac X\mu})</math>. Then, one can verify that <math>\lambda^{\max}(X)\leq f_\mu(X)\leq\lambda^{\max}(X)+\mu\log n</math> and so, for <math>\textstyle\mu=\frac{\epsilon}{\log n}</math>, <math>f_{\mu}</math> is a smooth approximation of <math>\lambda^{\max}</math> with an additive error of <math>\epsilon</math>.<br />
This way we obtain a scheme for solving the program in <math>\textstyle\frac d{\epsilon}\sqrt{\log d}</math> iterations, each taking <math>O(d^3)</math> time.<br />
<br />
==Experimental Results==<br />
Each point in figures below corresponds to an experiment on 500 genes. The points are pre-clustered to 4 clusters based some prior knowledge. The top three principal components are computed using each of PCA and Sparse PCA methods, and the points are plotted in the bases defined by these <br />
three components. For the PCA, each principal component is a combination of all 500 variables (corresponding to 500 genes) while in sparse PCA each involves variables corresponding to at most 6 genes.<br />
[[Image:Spca-g.jpg|thumb|900px|center|Figure 1. Distribution of gene expression data in the PCA vs. Sparse PCA. The point colors are based on an pre-computed independent clustering.]]<br />
<br />
In the next figure, the left diagram compares the cumulative number of non-zero elements in principal components in three methods: SPCA, the method we explaind with <math>k=5</math>, and the method we explaind with <math>k=6</math>. In the right diagram, the cumulative percentage of total variance explained by the first principle components resulted from PCA, SPCA (dashed) and the method we explained with <math>k=5</math> and <math>k=6</math> (solid lines). <br />
[[Image:Spca-a.jpg|thumb|900px|center|Figure 2. Cumulative cardinality and total percentage of total variance explained by the first principle components resulted from PCA, SPCA (dashed) and the method we explained with <math>k=5</math> and <math>k=6</math> (solid lines).]]</div>Amirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=sparse_PCA&diff=3860sparse PCA2009-08-07T04:15:54Z<p>Amir: /* A Smoothing Technique */</p>
<hr />
<div>==Introduction==<br />
In PCA, Given <math>n</math> observations on <math>d</math> variables (or in other words <math>n</math> <math>d</math>-dimensional data points), our goal is to find directions in the space of the data set that correspond to the directions with biggest variance in the input data. In practice each of the <math>d</math> variables has its own special meaning and it may be desirable to come up with some directions, as principal components, each of which is a combination of just a few of these variables. This makes the directions more interpretable and meaningful. But this is not something that usually happens as the original result of PCA method. Each of resulting directions from PCA in most cases is a linear combination of all variable with no zero coefficients. <br />
<br />
To address the above concerns we add a sparsity constraint to the PCA problem, which makes the PCA problem much harder to solve. That's because we have just added a combinatorial constraint to optimization problem. This paper is showing us how to find directions in the data space with maximum variance that have a limited number of non-zero elements. In other words, this helps us to perform feature selection, by selecting a subset of features in each direction.<br />
<br />
==Contribution==<br />
In this paper, a direct approach (called DSPCA) that improves the sparsity of the principle components is presented. This is done in 2 stages. First, incorporating a sparsity criterion in the PCA formulation. Second, forming a convex relation of the problem that is a semidefinite program. For small problems, semidifinite programs can be solved via general purpose interior-point methods. However, these methods can not be used for high dimensional problems. In this case, a saddle point problem can express our particular problem. For this kind of problems, smoothing argument algorithms combined with an optimal first-order smooth minimization algorithm offer a significant reduction in computational time and therefore can be used instead of generic interior point SDP solvers.<br />
<br />
== Notation ==<br />
The following notations are used in this note.<br />
<br />
<math>S^n \,</math> is the set of symmetric matrices of size <math>n \,</math>.<br /><br />
<math> \textbf{1} \,</math> is a column vector of ones.<br /><br />
<math> \textbf{Card}(x) \, </math> denotes the cardinality (number of non-zero elements) of a vector <math>x \, </math>.<br /><br />
<math> \textbf{Card}(X) \, </math> denotes the cardinality (number of non-zero elements) of a matrix <math>X \, </math>.<br /><br />
For <math> X \in S^n \, </math>, <math> X \succeq 0 \, </math> means <math>X \,</math> is positive semi-definite.<br /><br />
<math>|X| \,</math> is the matrix whose elements are the absolute values of the elements of <math> X \, </math>.<br />
<br />
==Problem Formulation==<br />
<br />
Given the covariance matrix <math>A</math>, the problem can be written as:<br><br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& x^TAx\\<br />
\textrm{subject\ to}& ||x||_2=1\\<br />
&\textbf{Card}(x)\leq k<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (1) </td></tr></table><br />
<br />
The cardinality constraint makes this problem hard (NP-hard) and we are looking for a convex and efficient relaxation.<br /><br />
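To make the combinatorial nature of formulation (1) concrete, the following sketch (my own illustration, not the paper's method) solves it exactly by enumerating every support of size at most <math>k</math>; on each support the problem reduces to a maximum-eigenvalue computation on a principal submatrix of <math>A</math>. This is feasible only for tiny <math>d</math>, which is exactly why a convex relaxation is needed:<br />

```python
import itertools
import numpy as np

# Exact (brute-force) solution of formulation (1): enumerate all supports
# of size <= k; on a fixed support, max x^T A x over unit x is the largest
# eigenvalue of the corresponding principal submatrix of A.
def sparse_pca_bruteforce(A, k):
    d = A.shape[0]
    best_val, best_x = -np.inf, None
    for size in range(1, k + 1):
        for support in itertools.combinations(range(d), size):
            idx = list(support)
            sub = A[np.ix_(idx, idx)]          # principal submatrix on this support
            vals, vecs = np.linalg.eigh(sub)   # ascending eigenvalues
            if vals[-1] > best_val:
                best_val = vals[-1]
                best_x = np.zeros(d)
                best_x[idx] = vecs[:, -1]      # unit vector supported on idx
    return best_val, best_x

rng = np.random.default_rng(1)
B = rng.normal(size=(5, 5))
A = B @ B.T                                    # a small covariance-like matrix
val, x = sparse_pca_bruteforce(A, k=2)         # best 2-sparse direction
```

The number of supports grows combinatorially in <math>d</math> and <math>k</math>, which reflects the NP-hardness noted above.<br />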
<br />
Defining <math>X=xx^T</math>, the above formula can be rewritten as<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \textbf{Tr}(AX)\\<br />
\textrm{subject\ to}&\textbf{Tr}(X)=1\\<br />
&\textbf{Card}(X)\leq k^2\\<br />
&X\succeq 0, \textbf{Rank}(X)=1\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (2) </td></tr></table><br />
<br />
The conditions <math>X\succeq 0</math> and <math>\textbf{Rank}(X)=1</math> in formula (2) guarantee that <math>X</math> can be written as <math>xx^T</math> for some <math>x</math>. But this formulation should be relaxed before it can be solved efficiently, because the constraints <math>\textbf{Card}(X)\leq k^2</math> and <math>\textbf{Rank}(X)=1</math> are not convex. So we replace the cardinality constraint with a weaker one, <math>\textbf{1}^T|X|\textbf 1\leq k</math>, and drop the rank constraint. So we get:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \textbf{Tr}(AX)\\<br />
\textrm{subject\ to}&\textbf{Tr}(X)=1\\<br />
&\textbf{1}^T|X|\textbf 1\leq k\\<br />
&X\succeq 0\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> &nbsp; </td></tr></table><br />
<br />
The above semidefinite relaxation can even be generalised to a non-square matrix <math> A \in \mathbb{R}^{m\times n}</math> as follows:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \textbf{Tr}(AX^{12})\\<br />
\textrm{subject\ to}&\textbf{Tr}(X^{ii})=1\\<br />
&\textbf{1}^T|X^{ii}|\textbf 1\leq k_i, i=1,2\\<br />
&\textbf{1}^T|X^{12}|\textbf 1\leq \sqrt{k_1k_2}\\<br />
&X\succeq 0\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> &nbsp; </td></tr></table><br />
<br />
in the variable <math> X \in S^{m+n} </math> with blocks <math> X^{ij} </math> for <math>i,j=1,2</math>.<br />
<br />
We then turn the modified cardinality constraint into a penalty term in the objective function, with some positive factor <math>\rho</math>, obtaining a semidefinite form of the problem:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \textbf{Tr}(AX)-\rho\textbf{1}^T|X|\textbf 1\\<br />
\textrm{subject\ to}&\textbf{Tr}(X)=1\\<br />
&X\succeq 0\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (3) </td></tr></table><br />
<br />
where <math> \rho </math> controls the penalty magnitude.<br />
<br />
The objective function can be rewritten as <math>\textbf{Tr}(AX)-\rho\textbf{1}^T|X|\textbf 1=\min_{|U_{ij}|\leq\rho}\textbf{Tr}((A+U)X)</math>, so problem (3) is equivalent to:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \min_{|U_{ij}|\leq\rho}\textbf{Tr}(X(A+U))\\<br />
\textrm{subject\ to}&\textbf{Tr}(X)=1\\<br />
&X\succeq 0\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (4) </td></tr></table><br />
<br />
or equivalently, due to convexity:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{lll}<br />
\textrm{minimize}& \lambda^{\max}(A+U)\\<br />
\textrm{subject\ to}&|U_{ij}|\leq\rho,&i,j=1\cdots,n\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (5) </td></tr></table><br />
<br />
where <math>\lambda^{\max}(M)</math> is the largest eigenvalue of the matrix <math>M</math>.<br />
<br />
The problem in formulation (5) can be seen as computing a robust version of the maximum eigenvalue: it is the least possible value of the maximum eigenvalue of <math>A</math>, given that each of its entries can be perturbed by a noise of magnitude at most <math>\rho</math>. In other words, it is a worst-case maximum eigenvalue computation with bounded componentwise noise of intensity <math>\rho</math> on the matrix coefficients.<br><br />
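The identity used to pass from the penalized problem (3) to the saddle-point form (4) can be checked numerically: for a fixed feasible <math>X</math>, the inner minimum over <math>|U_{ij}|\leq\rho</math> is attained at the worst-case perturbation <math>U=-\rho\,\textrm{sign}(X)</math> (the specific numbers below are my own illustration):<br />

```python
import numpy as np

# Check that Tr(AX) - rho * 1^T |X| 1  ==  min_{|U_ij|<=rho} Tr((A+U)X),
# where the minimizing perturbation is U = -rho * sign(X) elementwise.
rng = np.random.default_rng(2)
n, rho = 4, 0.3
A = rng.normal(size=(n, n)); A = (A + A.T) / 2   # symmetric "covariance"
v = rng.normal(size=n); v /= np.linalg.norm(v)
X = np.outer(v, v)                               # feasible X = v v^T, Tr(X) = 1

lhs = np.trace(A @ X) - rho * np.abs(X).sum()    # penalized objective of (3)
U_star = -rho * np.sign(X)                       # worst-case noise matrix
rhs = np.trace((A + U_star) @ X)                 # inner objective of (4) at U*
print(abs(lhs - rhs))                            # ~0: the two agree
```

Since <math>\textbf{Tr}(UX)=\sum_{ij}U_{ij}X_{ij}</math> for symmetric <math>X</math>, choosing each <math>U_{ij}</math> to oppose the sign of <math>X_{ij}</math> at magnitude <math>\rho</math> gives exactly the <math>-\rho\textbf{1}^T|X|\textbf 1</math> penalty.<br />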
<br />
The KKT conditions for optimization problems (3) and (5) are given by:<br><br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math><br />
\left\{<br />
\begin{array}{rl}<br />
&(A+U)X=\lambda^{\max}(A+U)X\\<br />
&U\circ X=\rho |X| \\<br />
&\text{Tr}(X)=1,\,\,\,X\succeq 0 \\<br />
&|U_{i,j}|\leq \rho,\,\,\, i,j=1,\cdots ,n<br />
\end{array} \right.<br />
</math><br />
</td><td valign=top> <br> </td></tr></table><br><br />
If the eigenvalue <math>\lambda^{\max}</math> in the first equation is simple (of multiplicity 1) and <math>\rho</math> is sufficiently small, the first equation implies that <math>\textbf{Rank}(X)=1</math>: its form says that every column of <math>\,X</math> is an eigenvector of <math>A+U</math> corresponding to the maximum eigenvalue. So the rank-one constraint is automatically satisfied in this special case.<br />
<br />
==The Algorithm==<br />
===The Main Loop===<br />
The algorithm iteratively creates the semidefinite program (4) and solves it to obtain the next most important sparse principal component. At each iteration, given the optimal solution <math>X</math> of the semidefinite program, we first have to recover a solution <math>x</math> of the corresponding problem of form (1). That is straightforward if <math>X</math> has rank 1, but since we have dropped the rank constraint this may not be the case, and then we need to compute the ''dominant'' eigenvector of <math>X</math> by methods known in the literature; for example, we can use the power method, which efficiently provides the leading eigenvector of a matrix. Note, however, that in this case the resulting vector is not guaranteed to be as sparse as the matrix itself.<br />
After obtaining a (hopefully) sparse vector <math>x_1</math>, we replace the matrix <math>A</math> with <math>A-(x_1^TAx_1)x_1x_1^T</math> and repeat the above steps to obtain the next sparse component.<br />
<br />
The question then is "when to stop?". Two approaches are proposed. First, at each iteration <math>i</math> we can include, for all <math>j<i</math>, the constraint <math>x_j^TXx_j=0</math>, to make sure that each principal component we compute is orthogonal to the previous ones. The procedure then stops automatically after <math>n</math> steps (there is no solution to the <math>(n+1)</math>'th problem). <br />
<br />
The other way is to stop as soon as all entries of <math>A</math> become smaller than <math>\rho</math>, because at that point the elements of <math>A</math> are below the noise level <math>\rho</math>.<br />
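The main loop can be sketched as follows. Since a full SDP solver is out of scope here, the per-step solve is replaced by a crude stand-in (hard-thresholding the leading eigenvector to <math>k</math> entries); this heuristic is my own simplification, not the paper's DSPCA solver, but the deflation and the second stopping rule are as described above:<br />

```python
import numpy as np

# Deflation loop for extracting several sparse components.
# The SDP solve of each subproblem is approximated by truncating the
# leading eigenvector to its k largest-magnitude coefficients.
def sparse_components(A, k, n_components, rho):
    A = A.copy()
    components = []
    for _ in range(n_components):
        if np.all(np.abs(A) < rho):        # stop: everything is below the noise level
            break
        vals, vecs = np.linalg.eigh(A)
        v = vecs[:, -1]                    # leading eigenvector (stand-in for the SDP)
        keep = np.argsort(np.abs(v))[-k:]  # keep the k largest coefficients
        x = np.zeros_like(v)
        x[keep] = v[keep]
        x /= np.linalg.norm(x)             # renormalize to a unit vector
        components.append(x)
        A -= (x @ A @ x) * np.outer(x, x)  # deflate: A <- A - (x^T A x) x x^T
    return components

rng = np.random.default_rng(3)
B = rng.normal(size=(8, 8))
comps = sparse_components(B @ B.T, k=3, n_components=2, rho=1e-6)
```

Each returned component is a unit vector with at most <math>k</math> non-zero entries, and each deflation step removes the variance already explained.<br />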
<br />
=== A Smoothing Technique ===<br />
<br />
The numerical difficulties arising in large-scale semidefinite programs stem from two distinct sources. <br />
<br />
I) Memory issue: beyond a certain problem size <math>n</math>, it becomes essentially impossible to form and store any second-order information (the Hessian) on the problem, which is the key to the numerical efficiency of interior-point SDP solvers. <br />
<br />
II) Smoothness issue: the constraint <math> X \succeq 0 </math> is not smooth, hence the number of iterations required to solve the problem to accuracy <math>\epsilon</math> with a first-order method grows as <math>\textstyle O(\frac1{\epsilon^2})</math> unless the problem is first smoothed.<br />
<br />
===Solving the Semidefinite Problem===<br />
The cardinality constraint in the formulation (or its corresponding term in the penalized form) introduces a quadratic number of terms in the problem. This makes it practically impossible to use interior-point methods to solve the problem for large input dimensions, so we have to use other methods, and then speed becomes the issue. Denoting the required accuracy by <math>\epsilon</math>, we can expect an interior-point-based program to converge after <math>\textstyle O(F(n)\log\frac1{\epsilon})</math> iterations, for some function <math>F</math> of the input size <math>n</math>. Here we will manage to solve the problem in <math>\textstyle O(\frac{F(n)}{\epsilon})</math> iterations using a first-order scheme.<br />
<br />
A first-order method can solve a problem to accuracy <math>\epsilon</math> in <math>\textstyle O(\frac1{\epsilon})</math> iterations if the problem satisfies a ''smoothness'' condition. But in our case the constraint <math>X\succeq 0</math> is not ''smooth'', and as a result a direct application of a first-order method to our problem yields an algorithm that stops only after <math>\textstyle O(\frac1{\epsilon^2})</math> iterations, which is too slow. To address this problem, we consider formulation (5) and define a smooth approximation of the function <math>\lambda^{\max}</math>.<br />
<br />
To come up with a smooth approximation of our objective function, we define <math>f_{\mu}(X)=\mu\log\textbf{Tr}(e^{\frac X\mu})</math>. One can verify that <math>\lambda^{\max}(X)\leq f_\mu(X)\leq\lambda^{\max}(X)+\mu\log n</math>, so for <math>\textstyle\mu=\frac{\epsilon}{\log n}</math>, <math>f_{\mu}</math> is a smooth approximation of <math>\lambda^{\max}</math> with an additive error of at most <math>\epsilon</math>.<br />
This way we obtain a scheme that solves the program in <math>\textstyle O(\frac d{\epsilon}\sqrt{\log d})</math> iterations, each taking <math>O(d^3)</math> time.<br />
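The sandwich bound on the smooth approximation can be verified numerically. Since <math>\textbf{Tr}(e^{X/\mu})=\sum_i e^{\lambda_i/\mu}</math>, the computation below works directly with the eigenvalues (the random matrix and the value of <math>\mu</math> are my own illustration):<br />

```python
import numpy as np

# Verify lambda_max(M) <= f_mu(M) <= lambda_max(M) + mu * log(n),
# where f_mu(M) = mu * log Tr(exp(M / mu)) = mu * log sum_i exp(lambda_i / mu).
rng = np.random.default_rng(4)
n, mu = 6, 0.1
M = rng.normal(size=(n, n)); M = (M + M.T) / 2   # random symmetric matrix
eigvals = np.linalg.eigvalsh(M)                  # ascending eigenvalues
lam_max = eigvals[-1]
# subtract lam_max/mu inside the exponentials for numerical stability
f_mu = mu * np.log(np.exp((eigvals - lam_max) / mu).sum()) + lam_max
print(lam_max <= f_mu <= lam_max + mu * np.log(n))   # True
```

The lower bound holds because the sum of exponentials dominates its largest term, and the upper bound because the sum has at most <math>n</math> terms, each no larger than <math>e^{\lambda^{\max}/\mu}</math>.<br />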
<br />
==Experimental Results==<br />
Each point in the figures below corresponds to an experiment on 500 genes. The points are pre-clustered into 4 clusters based on some prior knowledge. The top three principal components are computed using each of PCA and sparse PCA, and the points are plotted in the bases defined by these <br />
three components. For PCA, each principal component is a combination of all 500 variables (corresponding to the 500 genes), while in sparse PCA each involves variables corresponding to at most 6 genes.<br />
[[Image:Spca-g.jpg|thumb|900px|center|Figure 1. Distribution of the gene expression data under PCA vs. sparse PCA. The point colors are based on a pre-computed independent clustering.]]<br />
<br />
In the next figure, the left diagram compares the cumulative number of non-zero elements in the principal components for three methods: SPCA, and the method we explained with <math>k=5</math> and with <math>k=6</math>. The right diagram shows the cumulative percentage of total variance explained by the first principal components resulting from PCA, SPCA (dashed) and the method we explained with <math>k=5</math> and <math>k=6</math> (solid lines). <br />
[[Image:Spca-a.jpg|thumb|900px|center|Figure 2. Cumulative cardinality and cumulative percentage of total variance explained by the first principal components resulting from PCA, SPCA (dashed) and the method we explained with <math>k=5</math> and <math>k=6</math> (solid lines).]]</div>
<hr />
<div>==Introduction==<br />
In PCA, Given <math>n</math> observations on <math>d</math> variables (or in other words <math>n</math> <math>d</math>-dimensional data points), our goal is to find directions in the space of the data set that correspond to the directions with biggest variance in the input data. In practice each of the <math>d</math> variables has its own special meaning and it may be desirable to come up with some directions, as principal components, each of which is a combination of just a few of these variables. This makes the directions more interpretable and meaningful. But this is not something that usually happens as the original result of PCA method. Each of resulting directions from PCA in most cases is a linear combination of all variable with no zero coefficients. <br />
<br />
To address the above concerns we add a sparsity constraint to the PCA problem, which makes the PCA problem much harder to solve. That's because we have just added a combinatorial constraint to optimization problem. This paper is showing us how to find directions in the data space with maximum variance that have a limited number of non-zero elements. In other words, this helps us to perform feature selection, by selecting a subset of features in each direction.<br />
<br />
==Contribution==<br />
In this paper, a direct approach (called DSPCA) that improves the sparsity of the principle components is presented. This is done in 2 stages. First, incorporating a sparsity criterion in the PCA formulation. Second, forming a convex relation of the problem that is a semidefinite program. For small problems, semidifinite programs can be solved via general purpose interior-point methods. However, these methods can not be used for high dimensional problems. In this case, a saddle point problem can express our particular problem. For this kind of problems, smoothing argument algorithms combined with an optimal first-order smooth minimization algorithm offer a significant reduction in computational time and therefore can be used instead of generic interior point SDP solvers.<br />
<br />
== Notation ==<br />
The following notations are used in this note.<br />
<br />
<math>S^n \,</math> is the set of symmetric matrices of size <math>n \,</math>.<br /><br />
<math> \textbf{1} \,</math> is a column vector of ones.<br /><br />
<math> \textbf{Card}(x) \, </math> denotes the cardinality (number of non-zero elements) of a vector <math>x \, </math><br /><br />
<math> \textbf{Card}(X) \, </math> denotes the cardinality (number of non-zero elements) of a matrix <math>X \, </math><br /><br />
For <math> X \in S^n \, </math>, <math> X \succeq 0 \, </math> means <math>X \,</math> is positive semi-definite.<br /><br />
<math>|X| \,</math> is the matrix whose elements are the absolute values of the elements of <math> X \, </math><br />
<br />
==Problem Formulation==<br />
<br />
Given the covariance matrix <math>A</math>, the problem can be written as:<br><br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& x^TAx\\<br />
\textrm{subject\ to}& ||x||_2=1\\<br />
&\textbf{Card}(x)\leq k<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (1) </td></tr></table><br />
<br />
The cardinality constraint makes this problem hard (NP-hard) and we are looking for a convex and efficient relaxation.<br /><br />
<br />
Defining <math>X=x^Tx</math>, the above formula can be rewritten as<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \textbf{Tr}(AX)\\<br />
\textrm{subject\ to}&\textbf{Tr}(X)=1\\<br />
&\textbf{Card}(X)\leq k^2\\<br />
&X\succeq 0, \textbf{Rank}(X)=1\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (2) </td></tr></table><br />
<br />
The conditions <math>X\succeq 0</math> and <math>\textbf{Rank}(X)=1</math> in formula 2 guarantees that <math>X</math> can be written as <math>x^Tx</math>, for some <math>x</math>. But this formulation should be relaxed before it can be solved efficiently, because the constraintS <math>\textbf{Card}(X)\leq k^2</math> and <math>\textbf{Rank}(X)=1</math> are not convex. So we replace the cardinality constraint with a weaker one: <math>\textbf{1}^T|X|\textbf 1\leq k</math>. We also drop the rank constraint. So we get:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \textbf{Tr}(AX)\\<br />
\textrm{subject\ to}&\textbf{Tr}(X)=1\\<br />
&\textbf{1}^T|X|\textbf 1\leq k\\<br />
&X\succeq 0\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> &nbsp; </td></tr></table><br />
<br />
The above semidefinite relaxation can even be generalised to a non square matrix <math> A \in R^{mxn}</math> as follows:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \textbf{Tr}(AX^{12})\\<br />
\textrm{subject\ to}&\textbf{Tr}(X^{ii})=1\\<br />
&\textbf{1}^T|X^{ii}|\textbf 1\leq k_i, i=1,2\\<br />
&\textbf{1}^T|X^{12}|\textbf 1\leq \sqrt{k_1k_2}\\<br />
&X\succeq 0\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> &nbsp; </td></tr></table><br />
<br />
in the variable <math> X \in S^{m+n} </math> with blocks <math> X^{ij} </math> for i,j=1,2.<br />
<br />
We then change the modified cardinality constraint to a penalty term in the goal function with some positive factor <math>\rho</math>. So we get a semidefinite form of the problem:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \textbf{Tr}(AX)-\rho\textbf{1}^T|X|\textbf 1\\<br />
\textrm{subject\ to}&\textbf{Tr}(X)=1\\<br />
&X\succeq 0\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (3) </td></tr></table><br />
<br />
where, <math> \rho </math> controls the penalty magnitude.<br />
<br />
The goal function can be rewritten as <math>\textbf{Tr}(AX)-\rho\textbf{1}^T|X|\textbf 1=\min_{|U_{ij}|\leq\rho}\textbf{Tr}((A+U)X)</math>. So the problem (3) is equivalent to:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \min_{|U_{ij}|\leq\rho}\textbf{Tr}(X(A+U))\\<br />
\textrm{subject\ to}&\textbf{Tr}(X)=1\\<br />
&X\succeq 0\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (4) </td></tr></table><br />
<br />
or equivalently, due to convexity:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{lll}<br />
\textrm{minimize}& \lambda^{\max}(A+U)\\<br />
\textrm{subject\ to}&|U_{ij}|\leq\rho,&i,j=1\cdots,n\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (5) </td></tr></table><br />
<br />
where <math>\lambda^{\max}(M)</math> is the largest eigenvalue of the matrix <math>M</math>.<br />
<br />
The problem in formulation (5) can be seen as computing a robust version of the maximum eigenvalue: it gives the least possible value of the maximum eigenvalue of <math>A+U</math> when each entry of the matrix may be perturbed by at most the noise value <math>\rho</math>. In other words, it is a worst-case maximum eigenvalue computation under component-wise bounded noise of intensity <math>\rho</math> on the matrix coefficients.<br><br />
<br />
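The identity <math>\textbf{Tr}(AX)-\rho\textbf{1}^T|X|\textbf 1=\min_{|U_{ij}|\leq\rho}\textbf{Tr}((A+U)X)</math> underlying the equivalence of (3) and (4) can be checked numerically: for symmetric <math>X</math> the inner minimum is attained at <math>U^*=-\rho\,\textrm{sign}(X)</math>. A small NumPy sketch (an illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho = 8, 0.3

A = rng.standard_normal((n, n)); A = (A + A.T) / 2  # symmetric "data" matrix
X = rng.standard_normal((n, n)); X = (X + X.T) / 2  # an arbitrary symmetric X

# Penalized objective of problem (3).
penalized = np.trace(A @ X) - rho * np.abs(X).sum()

# The inner minimum over |U_ij| <= rho is attained at U* = -rho * sign(X),
# since Tr(UX) = sum_ij U_ij X_ij for symmetric X.
U_star = -rho * np.sign(X)
assert np.isclose(np.trace((A + U_star) @ X), penalized)

# Any other feasible U can only give a larger value of Tr((A+U)X).
for _ in range(100):
    U = rho * rng.uniform(-1, 1, size=(n, n))
    assert np.trace((A + U) @ X) >= penalized - 1e-9
```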
The KKT conditions for optimization problems (3) and (5) are given by:<br><br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math><br />
\left\{<br />
\begin{array}{rl}<br />
&(A+U)X=\lambda^{\max}(A+U)X\\<br />
&U\circ X=\rho |X| \\<br />
&\text{Tr}(X)=1,\,\,\,X\succeq 0 \\<br />
&|U_{i,j}|\leq \rho,\,\,\, i,j=1,\cdots ,n<br />
\end{array} \right.<br />
</math><br />
</td><td valign=top> <br> </td></tr></table><br><br />
If <math>\lambda^{\max}</math> in the first equation is simple (i.e., of multiplicity 1) and <math>\rho</math> is sufficiently small, it follows from the first equation that <math>\textbf{Rank}(X)=1</math>. Indeed, the form of this equation implies that every column of <math>\,X</math> is an eigenvector of the matrix <math>A+U</math> corresponding to its maximum eigenvalue, so in this special case the rank-one constraint is automatically satisfied.<br />
<br />
==The Algorithm==<br />
===The Main Loop===<br />
The algorithm iteratively forms the semidefinite program (4) and solves it to obtain the next most important sparse principal component. At each iteration, if <math>X</math> is the optimal solution of the semidefinite program, we first need to recover a solution <math>x</math> of the corresponding problem of form (1). That is straightforward if <math>X</math> has rank 1, but since we have dropped the rank constraint this may not hold, and in those cases we need to extract the ''dominant'' eigenvector of <math>X</math> by methods known in the literature; for example, the power method efficiently provides the leading eigenvector of a matrix. Note, however, that in this case the resulting vector is not guaranteed to be as sparse as the matrix itself.<br />
After obtaining a (hopefully) sparse vector <math>x</math>, we replace the matrix <math>A</math> with <math>A-(x^TAx)xx^T</math> and repeat the above steps to obtain the next sparse component.<br />
<br />
The question then is when to stop. Two approaches are proposed. First, at each iteration <math>j</math> we include, for all <math>i<j</math>, the constraint <math>x_i^TXx_i=0</math> to make sure that each principal component we compute is orthogonal to the previous ones. The procedure then stops automatically after <math>n</math> steps (there will be no solution to the <math>(n+1)</math>'th problem). <br />
<br />
The other approach is to stop as soon as every entry of <math>A</math> is smaller than <math>\rho</math>, because at that point the remaining elements of <math>A</math> are below the noise level <math>\rho</math>.<br />
<br />
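The main loop above can be sketched as follows. Note that the SDP step is replaced here by a simple ''truncated power iteration'' heuristic (repeatedly multiply by <math>A</math> and keep only the <math>k</math> largest-magnitude coordinates); this is a cheap stand-in used to make the skeleton runnable, not an implementation of the DSPCA relaxation, and the function names are hypothetical:

```python
import numpy as np

def sparse_pc(A, k, iters=200, seed=0):
    """Heuristic stand-in for the SDP step: truncated power iteration
    keeping the k largest-magnitude coordinates at every step."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[0])
    for _ in range(iters):
        x = A @ x
        if k < len(x):
            x[np.argsort(np.abs(x))[:-k]] = 0.0  # zero all but top k entries
        x /= np.linalg.norm(x)
    return x

def sparse_pca_loop(A, k, rho, max_components=None):
    """Outer loop: extract a sparse component, deflate A, and stop once
    every entry of A is below the noise level rho (second stopping rule)."""
    A = A.copy()
    if max_components is None:
        max_components = A.shape[0]
    components = []
    while np.abs(A).max() >= rho and len(components) < max_components:
        x = sparse_pc(A, k)
        components.append(x)
        A = A - (x @ A @ x) * np.outer(x, x)  # deflation step
    return components

# Example on a small random covariance matrix.
rng = np.random.default_rng(2)
B = rng.standard_normal((50, 10))
A = B.T @ B / 50                    # a 10x10 sample covariance matrix
comps = sparse_pca_loop(A, k=3, rho=0.1)
```

Each returned component is a unit vector with at most <math>k</math> non-zero entries; the loop terminates either by the noise-level rule or after at most <math>n</math> deflations.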
===Solving the Semidefinite Problem===<br />
The cardinality constraint in the formulation (or its corresponding term in the penalized form) introduces a quadratic number of terms into the problem. This makes it practically impossible to use interior-point methods to solve the problem for large input dimensions. So we need to use other methods, but then speed becomes the issue. Denoting the required accuracy by <math>\epsilon</math>, we can expect an interior-point-based solver to converge after <math>\textstyle O(F(n)\log\frac1{\epsilon})</math> iterations, for some function <math>F</math> of the input size <math>n</math>. Here we will manage to solve the problem in <math>\textstyle O(\frac{F(n)}{\epsilon})</math> iterations using a first-order scheme.<br />
<br />
First-order methods can solve a problem in <math>\textstyle O(\frac1{\epsilon})</math> iterations if the problem satisfies a ''smoothness'' condition. But in our case the objective <math>\lambda^{\max}</math> is not ''smooth'', and as a result a direct application of a first-order method to our problem yields an algorithm that stops only after <math>\textstyle O(\frac1{\epsilon^2})</math> iterations, which is too slow. To address this problem, we consider formulation (5) and define a smooth approximation of the function <math>\lambda^{\max}</math>.<br />
<br />
To come up with a smooth approximation of our goal function, we define <math>f_{\mu}(X)=\mu\log\textbf{Tr}(e^{\frac X\mu})</math>. Then, one can verify that <math>\lambda^{\max}(X)\leq f_\mu(X)\leq\lambda^{\max}(X)+\mu\log n</math> and so, for <math>\textstyle\mu=\frac{\epsilon}{\log n}</math>, <math>f_{\mu}</math> is a smooth approximation of <math>\lambda^{\max}</math> with an additive error of <math>\epsilon</math>.<br />
This way we obtain a scheme that solves the program in <math>\textstyle O(\frac d{\epsilon}\sqrt{\log d})</math> iterations, each taking <math>O(d^3)</math> time.<br />
<br />
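The sandwich bound <math>\lambda^{\max}(X)\leq f_\mu(X)\leq\lambda^{\max}(X)+\mu\log n</math> is easy to verify numerically, using the fact that <math>\textbf{Tr}(e^{X/\mu})=\sum_i e^{\lambda_i/\mu}</math> over the eigenvalues of <math>X</math>. A NumPy sketch (an illustration, computed stably by shifting by <math>\lambda^{\max}</math>):

```python
import numpy as np

def f_mu(X, mu):
    """Smooth surrogate f_mu(X) = mu * log Tr(exp(X/mu)), evaluated via
    the eigenvalues of the symmetric matrix X, with a shift by the largest
    eigenvalue so the exponentials never overflow."""
    lams = np.linalg.eigvalsh(X)   # eigenvalues in ascending order
    lmax = lams[-1]
    # mu*log(sum exp(l/mu)) = lmax + mu*log(sum exp((l - lmax)/mu))
    return lmax + mu * np.log(np.exp((lams - lmax) / mu).sum())

rng = np.random.default_rng(3)
n = 30
X = rng.standard_normal((n, n)); X = (X + X.T) / 2
lmax = np.linalg.eigvalsh(X)[-1]

eps = 1e-2
mu = eps / np.log(n)               # the choice mu = eps / log n from the text
approx = f_mu(X, mu)
assert lmax <= approx <= lmax + mu * np.log(n) + 1e-12
assert approx - lmax <= eps + 1e-12   # additive error at most eps
```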
==Experimental Results==<br />
Each point in the figures below corresponds to an experiment on 500 genes. The points are pre-clustered into 4 clusters based on some prior knowledge. The top three principal components are computed using each of PCA and sparse PCA, and the points are plotted in the bases defined by these three components. For PCA, each principal component is a combination of all 500 variables (corresponding to the 500 genes), while in sparse PCA each component involves variables corresponding to at most 6 genes.<br />
[[Image:Spca-g.jpg|thumb|900px|center|Figure 1. Distribution of gene expression data under PCA vs. sparse PCA. The point colors are based on a pre-computed independent clustering.]]<br />
<br />
In the next figure, the left diagram compares the cumulative number of non-zero elements in the principal components for three methods: SPCA, and the method we explained with <math>k=5</math> and with <math>k=6</math>. The right diagram shows the cumulative percentage of total variance explained by the first principal components obtained from PCA, SPCA (dashed) and the method we explained with <math>k=5</math> and <math>k=6</math> (solid lines). <br />
[[Image:Spca-a.jpg|thumb|900px|center|Figure 2. Cumulative cardinality and cumulative percentage of total variance explained by the first principal components obtained from PCA, SPCA (dashed) and the method we explained with <math>k=5</math> and <math>k=6</math> (solid lines).]]</div>
<hr />
<div>==Introduction==<br />
In PCA, Given <math>n</math> observations on <math>d</math> variables (or in other words <math>n</math> <math>d</math>-dimensional data points), our goal is to find directions in the space of the data set that correspond to the directions with biggest variance in the input data. In practice each of the <math>d</math> variables has its own special meaning and it may be desirable to come up with some directions, as principal components, each of which is a combination of just a few of these variables. This makes the directions more interpretable and meaningful. But this is not something that usually happens as the original result of PCA method. Each of resulting directions from PCA in most cases is a linear combination of all variable with no zero coefficients. <br />
<br />
To address the above concerns we add a sparsity constraint to the PCA problem, which makes the PCA problem much harder to solve. That's because we have just added a combinatorial constraint to optimization problem. This paper is showing us how to find directions in the data space with maximum variance that have a limited number of non-zero elements. In other words, this helps us to perform feature selection, by selecting a subset of features in each direction.<br />
<br />
==Contribution==<br />
In this paper, a direct approach (called DSPCA) that improves the sparsity of the principle components is presented. This is done in 2 stages. First, incorporating a sparsity criterion in the PCA formulation. Second, forming a convex relation of the problem that is a semidefinite program. For small problems, semidifinite programs can be solved via general purpose interior-point methods. However, these methods can not be used for high dimensional problems. In this case, a saddle point problem can express our particular problem. For this kind of problems, smoothing argument algorithms combined with an optimal first-order smooth minimization algorithm offer a significant reduction in computational time and therefore can be used instead of generic interior point SDP solvers.<br />
<br />
== Notation ==<br />
The following notations are used in this note.<br />
<br />
<math>S^n \,</math> is the set of symmetric matrices of size <math>n \,</math>.<br /><br />
<math> \textbf{1} \,</math> is a column vector of ones.<br /><br />
<math> \textbf{Card}(x) \, </math> denotes the cardinality (number of non-zero elements) of a vector <math>x \, </math><br /><br />
<math> \textbf{Card}(X) \, </math> denotes the cardinality (number of non-zero elements) of a matrix <math>X \, </math><br /><br />
For <math> X \in S^n \, </math>, <math> X \succeq 0 \, </math> means <math>X \,</math> is positive semi-definite.<br /><br />
<math>|X| \,</math> is the matrix whose elements are the absolute values of the elements of <math> X \, </math><br />
<br />
==Problem Formulation==<br />
<br />
Given the covariance matrix <math>A</math>, the problem can be written as:<br><br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& x^TAx\\<br />
\textrm{subject\ to}& ||x||_2=1\\<br />
&\textbf{Card}(x)\leq k<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (1) </td></tr></table><br />
<br />
The cardinality constraint makes this problem hard (NP-hard) and we are looking for a convex and efficient relaxation.<br /><br />
<br />
Defining <math>X=x^Tx</math>, the above formula can be rewritten as<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \textbf{Tr}(AX)\\<br />
\textrm{subject\ to}&\textbf{Tr}(X)=1\\<br />
&\textbf{Card}(X)\leq k^2\\<br />
&X\succeq 0, \textbf{Rank}(X)=1\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (2) </td></tr></table><br />
<br />
The conditions <math>X\succeq 0</math> and <math>\textbf{Rank}(X)=1</math> in formula 2 guarantees that <math>X</math> can be written as <math>x^Tx</math>, for some <math>x</math>. But this formulation should be relaxed before it can be solved efficiently, because the constraintS <math>\textbf{Card}(X)\leq k^2</math> and <math>\textbf{Rank}(X)=1</math> are not convex. So we replace the cardinality constraint with a weaker one: <math>\textbf{1}^T|X|\textbf 1\leq k</math>. We also drop the rank constraint. So we get:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \textbf{Tr}(AX)\\<br />
\textrm{subject\ to}&\textbf{Tr}(X)=1\\<br />
&\textbf{1}^T|X|\textbf 1\leq k\\<br />
&X\succeq 0\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> &nbsp; </td></tr></table><br />
<br />
The above semidefinite relaxation can even be generalised to a non square matrix <math> A \in R^{mxn}</math> as follows:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \textbf{Tr}(AX^{12})\\<br />
\textrm{subject\ to}&\textbf{Tr}(X^{ii})=1\\<br />
&\textbf{1}^T|X^{ii}|\textbf 1\leq k_i, i=1,2\\<br />
&\textbf{1}^T|X^{12}|\textbf 1\leq \sqrt{k_1k_2}\\<br />
&X\succeq 0\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> &nbsp; </td></tr></table><br />
<br />
in the variable <math> X \in S^{m+n} </math> with blocks <math> X^{ij} </math> for i,j=1,2.<br />
<br />
We then change the modified cardinality constraint to a penalty term in the goal function with some positive factor <math>\rho</math>. So we get a semidefinite form of the problem:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \textbf{Tr}(AX)-\rho\textbf{1}^T|X|\textbf 1\\<br />
\textrm{subject\ to}&\textbf{Tr}(X)=1\\<br />
&X\succeq 0\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (3) </td></tr></table><br />
<br />
where, <math> \rho </math> controls the penalty magnitude.<br />
<br />
The goal function can be rewritten as <math>\textbf{Tr}(AX)-\rho\textbf{1}^T|X|\textbf 1=\min_{|U_{ij}|\leq\rho}\textbf{Tr}((A+U)X)</math>. So the problem (3) is equivalent to:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \min_{|U_{ij}|\leq\rho}\textbf{Tr}(X(A+U))\\<br />
\textrm{subject\ to}&\textbf{Tr}(X)=1\\<br />
&X\succeq 0\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (4) </td></tr></table><br />
<br />
or equivalently, due to convexity:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{lll}<br />
\textrm{minimize}& \lambda^{\max}(A+U)\\<br />
\textrm{subject\ to}&|U_{ij}|\leq\rho,&i,j=1\cdots,n\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (5) </td></tr></table><br />
<br />
where <math>\lambda^{\max}(M)</math> is the largest eigenvalue of the matrix <math>M</math>.<br />
<br />
The problem as described in formulation (5) can be seen as computing a robust version of maximum eigenvalue: it is the least possible value of maximum eigenvalue, given that each element can be changed by at most noise value <math>\rho</math>. Also, it corresponds to a worst-case maximum eigenvalue computation with a bounded noise of intensity <math>\rho</math> in each component on the matrix coefficients.<br><br />
<br />
The KKT conditions for optimization problems (3) and (5) are given by:<br><br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math><br />
\left\{<br />
\begin{array}{rl}<br />
&(A+U)X=\lambda^{\max}(A+U)X\\<br />
&U\circ X=\rho |X| \\<br />
&\text{Tr}(X)=1,\,\,\,X\succeq 0 \\<br />
&|U_{i,j}|\leq \rho,\,\,\, i,j=1,\cdots ,n<br />
\end{array} \right.<br />
</math><br />
</td><td valign=top> <br> </td></tr></table><br><br />
If the <math>\lambda^{\max}</math> in the first equation is simple (meaning it is of multiplicity 1) and <math>\rho</math> is sufficiently small, from the first equation it follows that <math>\textbf{Rank}(X)=1</math>. In fact, the form of this equation implies that all columns of <math>\,X</math> are eigenvectors of matrix <math>A+U</math> corresponding to its maximum eigenvalue. So the rank one constraint is automatically satisfied in this special case.<br />
<br />
==The Algorithm==<br />
===The Main Loop===<br />
The algorithm should iteratively create the semidefinite program (4) and solve it to obtain the next most important sparse principle component. At each iteration, we should first obtain the solution <math>x</math> of the corresponding problem of form (1), if <math>X</math> is the optimal solution of the optimization problem. That will be straightforward if <math>X</math> is of rank 1, but since we have dropped the rank constraint, this may not be true and in those cases we need to obtain the ''dominant'' eigenvalue of <math>X</math> by the methods that are known in the literature; for example we can use the power method which efficiently provides us with the largest eigenvectors of a matrix. Note, however, that in this case the resulting vectors are not guaranteed to be as sparse as the matrix itself.<br />
After obtaining a (hopefully) sparse vector <math>x</math> we replace the matrix <math>A</math> with <math>A-(x_1^TAx_1)x_1x_1^T</math> and repeat the above steps to obtain the next sparse component values.<br />
<br />
The question then is "when to stop?". Two approaches are proposed. First, at each iteration <math>i</math>, for all <math>i<j</math>, we include the constraint <math>x_i^TXx_i=0</math> to make sure that each principal component we compute is orthogonal to the previous ones. Then the procedure stops after <math>n</math> steps automatically (there will be no solution to the <math>n+1</math>'th problem). <br />
<br />
The other way is stop as soon as all members of <math>A</math> get less than <math>\rho</math>, because at that point elements of <math>A</math> will be less than the noise value <math>\rho</math>.<br />
<br />
===Solving the Semidefinite Problem===<br />
The cardinality constraint in the formulation (or its corresponding term in the penalized form) introduces a quadratic number of terms in the problem. This makes it practically impossible to use interior-point method to solve the problem for large values of the input dimension. So we need to use other existing methods for solving the problem, but then, there will be a matter of speed. Denoting the required accuracy by <math>\epsilon</math>, we can expect an interior-point-based program to converge after <math>\textstyle O(F(n)\log\frac1{\epsilon})</math> iterations, for some function <math>F</math> of input size <math>n</math>. Here we will manage to solve the problem using <math>\textstyle O(\frac{F(n)}{\epsilon})</math> iterations using a first-order scheme, for some function <math>F</math> of input size.<br />
<br />
First-order method can solve a problem after <math>\textstyle O(\frac1{\epsilon})</math> iterations, for some function <math>F</math> of input size, if the problem satisfy a ''smoothness'' constraint. But in our case, <math>X\succeq 0</math> is not ''smooth'' and as a result, an application of first-order method to the our problem will result in an algorithm stopping after <math>\textstyle O(\frac1{\epsilon^2})</math> iterations, for some function <math>F</math> of input size, which is too slow. To address this problem, we consider the formulation (5) and then we define a smooth approximation of the function <math>\lambda^{\max}</math>.<br />
<br />
To come up with a smooth approximation of our goal function, we define <math>f_{\mu}(X)=\mu\log\textbf{Tr}(e^{\frac X\mu})</math>. Then, one can verify that <math>\lambda^{\max}(X)\leq f_\mu(X)\leq\lambda^{\max}(X)+\mu\log n</math> and so, for <math>\textstyle\mu=\frac{\epsilon}{\log n}</math>, <math>f_{\mu}</math> is a smooth approximation of <math>\lambda^{\max}</math> with an additive error of <math>\epsilon</math>.<br />
This way we obtain a scheme for solving the program in <math>\textstyle\frac d{\epsilon}\sqrt{\log d}</math> iterations, each taking <math>O(d^3)</math> time.<br />
<br />
==Experimental Results==<br />
Each point in figures below corresponds to an experiment on 500 genes. The points are pre-clustered to 4 clusters based some prior knowledge. The top three principal components are computed using each of PCA and Sparse PCA methods, and the points are plotted in the bases defined by these <br />
three components. For the PCA, each principal component is a combination of all 500 variables (corresponding to 500 genes) while in sparse PCA each involves variables corresponding to at most 6 genes.<br />
[[Image:Spca-g.jpg|thumb|900px|center|Figure 1. Distribution of gene expression data in the PCA vs. Sparse PCA. The point colors are based on an pre-computed independent clustering.]]<br />
<br />
In the next figure, the left diagram compares the cumulative number of non-zero elements in principal components in three methods: SPCA, the method we explaind with <math>k=5</math>, and the method we explaind with <math>k=6</math>. In the right diagram, the cumulative percentage of total variance explained by the first principle components resulted from PCA, SPCA (dashed) and the method we explained with <math>k=5</math> and <math>k=6</math> (solid lines). <br />
[[Image:Spca-a.jpg|thumb|900px|center|Figure 2. Cumulative cardinality and total percentage of total variance explained by the first principle components resulted from PCA, SPCA (dashed) and the method we explained with <math>k=5</math> and <math>k=6</math> (solid lines).]]</div>Amirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=sparse_PCA&diff=3856sparse PCA2009-08-07T03:32:08Z<p>Amir: /* Contribution */</p>
<hr />
<div>==Introduction==<br />
In PCA, Given <math>n</math> observations on <math>d</math> variables (or in other words <math>n</math> <math>d</math>-dimensional data points), our goal is to find directions in the space of the data set that correspond to the directions with biggest variance in the input data. In practice each of the <math>d</math> variables has its own special meaning and it may be desirable to come up with some directions, as principal components, each of which is a combination of just a few of these variables. This makes the directions more interpretable and meaningful. But this is not something that usually happens as the original result of PCA method. Each of resulting directions from PCA in most cases is a linear combination of all variable with no zero coefficients. <br />
<br />
To address the above concerns we add a sparsity constraint to the PCA problem, which makes the PCA problem much harder to solve. That's because we have just added a combinatorial constraint to optimization problem. This paper is showing us how to find directions in the data space with maximum variance that have a limited number of non-zero elements. In other words, this helps us to perform feature selection, by selecting a subset of features in each direction.<br />
<br />
==Contribution==<br />
In this paper, a direct approach (called DSPCA) that improves the sparsity of the principle components is presented. This is done in 2 stages. First, incorporating a sparsity criterion in the PCA formulation. Second, forming a convex relation of the problem that is a semidefinite program. For small problems, semidifinite programs can be solved via general purpose interior-point methods. However, these methods can not be used for high dimensional problems. In this case, a saddle point problem can express our particular problem. For this kind of problems, smoothing argument algorithms combined with an optimal first-order smooth minimization algorithm offer a significant reduction in computational time and therefore can be used instead of generic interior point SDP solvers.<br />
<br />
== Notation ==<br />
The following notations are used in this note.<br />
<br />
<math>S^n \,</math> is the set of symmetric matrices of size <math>n \,</math>.<br /><br />
<math> \textbf{1} \,</math> is a column vector of ones.<br /><br />
<math> \textbf{Card}(x) \, </math> denotes the cardinality (number of non-zero elements) of a vector <math>x \, </math><br /><br />
<math> \textbf{Card}(X) \, </math> denotes the cardinality (number of non-zero elements) of a matrix <math>X \, </math><br /><br />
For <math> X \in S^n \, </math>, <math> X \succeq 0 \, </math> means <math>X \,</math> is positive semi-definite.<br /><br />
<math>|X| \,</math> is the matrix whose elements are the absolute values of the elements of <math> X \, </math><br />
<br />
==Problem Formulation==<br />
<br />
Given the covariance matrix <math>A</math>, the problem can be written as:<br><br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& x^TAx\\<br />
\textrm{subject\ to}& ||x||_2=1\\<br />
&\textbf{Card}(x)\leq k<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (1) </td></tr></table><br />
<br />
The cardinality constraint makes this problem hard (NP-hard) and we are looking for a convex and efficient relaxation.<br /><br />
<br />
Defining <math>X=x^Tx</math>, the above formula can be rewritten as<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \textbf{Tr}(AX)\\<br />
\textrm{subject\ to}&\textbf{Tr}(X)=1\\<br />
&\textbf{Card}(X)\leq k^2\\<br />
&X\succeq 0, \textbf{Rank}(X)=1\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (2) </td></tr></table><br />
<br />
The conditions <math>X\succeq 0</math> and <math>\textbf{Rank}(X)=1</math> in formula 2 guarantees that <math>X</math> can be written as <math>x^Tx</math>, for some <math>x</math>. But this formulation should be relaxed before it can be solved efficiently, because the constraintS <math>\textbf{Card}(X)\leq k^2</math> and <math>\textbf{Rank}(X)=1</math> are not convex. So we replace the cardinality constraint with a weaker one: <math>\textbf{1}^T|X|\textbf 1\leq k</math>. We also drop the rank constraint. So we get:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \textbf{Tr}(AX)\\<br />
\textrm{subject\ to}&\textbf{Tr}(X)=1\\<br />
&\textbf{1}^T|X|\textbf 1\leq k\\<br />
&X\succeq 0\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> &nbsp; </td></tr></table><br />
<br />
The above semidefinite relaxation can even be generalised to a non square matrix <math> A \in R^{mxn}</math> as follows:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \textbf{Tr}(AX^{12})\\<br />
\textrm{subject\ to}&\textbf{Tr}(X^{ii})=1\\<br />
&\textbf{1}^T|X^{ii}|\textbf 1\leq k_i, i=1,2\\<br />
&\textbf{1}^T|X^{12}|\textbf 1\leq \sqrt{k_1k_2}\\<br />
&X\succeq 0\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> &nbsp; </td></tr></table><br />
<br />
in the variable <math> X \in S^{m+n} </math> with blocks <math> X^{ij} </math> for i,j=1,2.<br />
<br />
We then change the modified cardinality constraint to a penalty term in the goal function with some positive factor <math>\rho</math>. So we get a semidefinite form of the problem:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \textbf{Tr}(AX)-\rho\textbf{1}^T|X|\textbf 1\\<br />
\textrm{subject\ to}&\textbf{Tr}(X)=1\\<br />
&X\succeq 0\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (3) </td></tr></table><br />
<br />
where, <math> \rho </math> controls the penalty magnitude.<br />
<br />
The goal function can be rewritten as <math>\textbf{Tr}(AX)-\rho\textbf{1}^T|X|\textbf 1=\min_{|U_{ij}|\leq\rho}\textbf{Tr}((A+U)X)</math>. So the problem (3) is equivalent to:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{ll}<br />
\textrm{maximize}& \min_{|U_{ij}|\leq\rho}\textbf{Tr}(X(A+U))\\<br />
\textrm{subject\ to}&\textbf{Tr}(X)=1\\<br />
&X\succeq 0\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (4) </td></tr></table><br />
<br />
or equivalently, due to convexity:<br />
<br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math> <br />
\begin{array}{lll}<br />
\textrm{minimize}& \lambda^{\max}(A+U)\\<br />
\textrm{subject\ to}&|U_{ij}|\leq\rho,&i,j=1\cdots,n\\<br />
\end{array}<br />
</math><br />
</td><td valign=top> <br> (5) </td></tr></table><br />
<br />
where <math>\lambda^{\max}(M)</math> is the largest eigenvalue of the matrix <math>M</math>.<br />
<br />
The problem as described in formulation (5) can be seen as computing a robust version of maximum eigenvalue: it is the least possible value of maximum eigenvalue, given that each element can be changed by at most noise value <math>\rho</math>. Also, it corresponds to a worst-case maximum eigenvalue computation with a bounded noise of intensity <math>\rho</math> in each component on the matrix coefficients.<br><br />
<br />
The KKT conditions for optimization problems (3) and (5) are given by:<br><br />
<table width=95%><tr><td width=45%>&nbsp;</td><td width=45%><br />
<math><br />
\left\{<br />
\begin{array}{rl}<br />
&(A+U)X=\lambda^{\max}(A+U)X\\<br />
&U\circ X=\rho |X| \\<br />
&\text{Tr}(X)=1,\,\,\,X\succeq 0 \\<br />
&|U_{i,j}|\leq \rho,\,\,\, i,j=1,\cdots ,n<br />
\end{array} \right.<br />
</math><br />
</td><td valign=top> <br> </td></tr></table><br><br />
If the <math>\lambda^{\max}</math> in the first equation is simple (meaning it is of multiplicity 1) and <math>\rho</math> is sufficiently small, from the first equation it follows that <math>\textbf{Rank}(X)=1</math>. In fact, the form of this equation implies that all columns of <math>\,X</math> are eigenvectors of matrix <math>A+U</math> corresponding to its maximum eigenvalue. So the rank one constraint is automatically satisfied in this special case.<br />
<br />
==The Algorithm==<br />
===The Main Loop===<br />
The algorithm should iteratively create the semidefinite program (4) and solve it to obtain the next most important sparse principle component. At each iteration, if <math>X</math> is the optimal solution of the optimization problem, we first need to obtain the solution <math>x</math> of the corresponding problem of form (1). That will be straightforward if <math>X</math> is of rank 1, but since we have dropped the rank constraint, this may not be true and in those cases we need to obtain the ''dominant'' eigenvalue of <math>X</math> by methods that are known in the literature; for example using the power method which efficiently provides us with the largest eigenvectors of a matrix. Note, however, that in this case the resulting vectors are not guaranteed to be as sparse as the matrix itself.<br><br />
After obtaining a (hopefully) sparse vector <math>x</math> we replace the matrix <math>A</math> with <math>A-(x_1^TAx_1)x_1x_1^T</math> and repeat the above steps to obtain next sparse component values.<br />
<br />
The question then is when to stop. Two approaches are proposed. First is that at each iteration <math>i</math>, for all <math>i<j</math>, we include the constraint <math>x_i^TXx_i=0</math> to make sure each principal component we compute is orthogonal to the previous ones. Then the procedure stops after <math>n</math> steps automatically (there will be no solution to the <math>n+1</math>'th problem). The other way is stop as soon as all members of <math>A</math> get less than <math>\rho</math>, because at that point elements of <math>A</math> will be less than the noise value <math>\rho</math>.<br />
<br />
===Solving the Semidefinite Problem===<br />
The cardinality constraint in the formulation (or its corresponding term in the penalized form) introduces a quadratic number of terms in the problem. This makes it practically impossible to use interior-point method to solve the problem for large values of the input dimension. So we need to use other existing methods for solving the problem, but then, there will be a matter of speed. Denoting the required accuracy by <math>\epsilon</math>, we can expect an interior-point-based program to converge after <math>\textstyle O(F(n)\log\frac1{\epsilon})</math> iterations, for some function <math>F</math> of input size <math>n</math>. Here we will manage to solve the problem using <math>\textstyle O(\frac{F(n)}{\epsilon})</math> iterations using a first-order scheme, for some function <math>F</math> of input size.<br />
<br />
A first-order method can solve a problem in <math>\textstyle O(\frac1{\epsilon})</math> iterations if the problem satisfies a ''smoothness'' condition. But in our case the objective <math>\lambda^{\max}</math> is not ''smooth'', and as a result a direct application of a first-order method to our problem yields an algorithm that stops only after <math>\textstyle O(\frac1{\epsilon^2})</math> iterations, which is too slow. To address this problem, we consider formulation (5) and define a smooth approximation of the function <math>\lambda^{\max}</math>.<br />
<br />
To come up with a smooth approximation of our goal function, we define <math>f_{\mu}(X)=\mu\log\textbf{Tr}(e^{\frac X\mu})</math>. Then, one can verify that <math>\lambda^{\max}(X)\leq f_\mu(X)\leq\lambda^{\max}(X)+\mu\log n</math> and so, for <math>\textstyle\mu=\frac{\epsilon}{\log n}</math>, <math>f_{\mu}</math> is a smooth approximation of <math>\lambda^{\max}</math> with an additive error of <math>\epsilon</math>.<br />
This way we obtain a scheme that solves the program in <math>\textstyle O(\frac d{\epsilon}\sqrt{\log d})</math> iterations, each taking <math>O(d^3)</math> time.<br />
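As a sanity check on the bound above, <math>f_\mu</math> can be evaluated through the eigenvalues of a symmetric <math>X</math>, since <math>\textbf{Tr}(e^{X/\mu})=\sum_i e^{\lambda_i/\mu}</math>; shifting by the largest eigenvalue avoids overflow. This NumPy sketch is our own illustration, not code from the paper.

```python
import numpy as np

def f_mu(X, mu):
    """Smooth approximation f_mu(X) = mu * log Tr(exp(X / mu)).
    For symmetric X, Tr exp(X/mu) = sum_i exp(lambda_i / mu),
    so we work with eigenvalues, shifted by the max for stability."""
    lam = np.linalg.eigvalsh(X)
    m = lam.max()
    return m + mu * np.log(np.exp((lam - m) / mu).sum())
```

For any symmetric <math>n\times n</math> matrix this value lies between <math>\lambda^{\max}(X)</math> and <math>\lambda^{\max}(X)+\mu\log n</math>, which is exactly why <math>\textstyle\mu=\frac{\epsilon}{\log n}</math> gives an additive error of at most <math>\epsilon</math>.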
<br />
==Experimental Results==<br />
Each point in the figures below corresponds to an experiment on 500 genes. The points are pre-clustered into 4 clusters based on some prior knowledge. The top three principal components are computed using each of the PCA and Sparse PCA methods, and the points are plotted in the bases defined by these three components. For PCA, each principal component is a combination of all 500 variables (corresponding to 500 genes), while in sparse PCA each involves variables corresponding to at most 6 genes.<br />
[[Image:Spca-g.jpg|thumb|900px|center|Figure 1. Distribution of gene expression data in PCA vs. Sparse PCA. The point colors are based on a pre-computed independent clustering.]]<br />
<br />
In the next figure, the left diagram compares the cumulative number of non-zero elements in the principal components for three methods: SPCA, and the method we explained with <math>k=5</math> and with <math>k=6</math>. The right diagram shows the cumulative percentage of total variance explained by the first principal components obtained from PCA, SPCA (dashed) and the method we explained with <math>k=5</math> and <math>k=6</math> (solid lines). <br />
[[Image:Spca-a.jpg|thumb|900px|center|Figure 2. Cumulative cardinality and percentage of total variance explained by the first principal components obtained from PCA, SPCA (dashed) and the method we explained with <math>k=5</math> and <math>k=6</math> (solid lines).]]</div>Amir
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_a_Nonlinear_Embedding_by_Preserving_Class_Neighborhood_Structure&diff=3848learning a Nonlinear Embedding by Preserving Class Neighborhood Structure2009-08-05T18:25:17Z<p>Amir: /* Experiments */</p>
<hr />
<div>=Introduction=<br />
The paper <ref>Salakhutdinov, R., & Hinton, G. E. (2007). Learning a nonlinear embedding by preserving class neighbourhood structure. AI and Statistics.</ref> presented here describes a method to learn a nonlinear transformation from the input space to a low-dimensional<br />
feature space in which K-nearest neighbour classification performs well. Since the performance of algorithms such as K-nearest neighbours (KNN) depends directly on how distances are computed, the main objective of the proposed algorithm is to learn a good similarity measure that can provide insight into how high-dimensional data is organized. The nonlinear transformation is learned by pre-training and fine-tuning a multilayer neural network. The authors also show how to further enhance the performance of the non-linear transformation using unlabeled data. Experimental results on a widely used version of the MNIST handwritten digit recognition task show that the proposed algorithm achieves a much lower error rate than SVM or standard backpropagation. The method also uses the dimensions that are not needed for nearest-neighbour classification to explicitly represent<br />
transformations of the digits that do not affect their identity.<br />
<br />
=Background and Related Work=<br />
Learning a similarity measure (or distance metric) over the input space <math> {\mathbf X} </math> is an important task in machine learning, and is closely related to the feature extraction problem.<br />
A distance metric <math> \mathbf D </math> (e. g. Euclidean) measures the similarity (in the feature space) between two input vectors <math> {\mathbf x}^a, {\mathbf x}^b \in {\mathbf X} </math> by computing <math> \mathbf D[{\mathbf f}(x^a|W),{\mathbf f}(x^b|W)]</math>, where <math> {\mathbf f}(x|W)</math> represents the mapping function from input vector <math> {\mathbf X} </math> to feature space <math> {\mathbf Y} </math> parametrized by <math> {\mathbf W} </math>.<br />
Previous work studied this problem where <math> \mathbf D </math> is the Euclidean distance and <br />
<math> {\mathbf f} </math> is simple linear projection, i.e. <math> {\mathbf f}(x|W)=Wx </math>.<br />
For example, Linear discriminant analysis (LDA) learns the matrix <math> W </math> that minimizes the ratio of within-class distances to between-class distances. <br />
<br />
Globerson and Roweis <ref> A. Globerson and S. T. Roweis. Metric learning by collapsing<br />
classes. In NIPS, 2005 </ref> proposed a method for learning the matrix <math> W </math> such that input vectors from the same class are mapped to a tight cluster. Weinberger et al. <ref> K. Q.Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In NIPS, 2005. </ref> learned <math> W </math> with the goals of both making the K-nearest neighbours belong to the same class and making examples from different classes be separated by a large margin. All these methods rely on a linear transformation, which has a limited number of parameters and thus cannot model higher-order correlations between the<br />
original data dimensions.<br />
<br />
=Proposed Method=<br />
In this paper, the authors show that a nonlinear transformation function, with many more<br />
parameters, enables us to discover low-dimensional representations of high-dimensional data<br />
that perform much better than existing linear methods provided the dataset is large enough to allow for the parameters to be estimated. Regarding the digit recognition application considered in the paper and adopting a probabilistic approach, one can learn the non-linear transformation by maximizing the log probability of the pairs that occur in the training set. The probability distribution over all possible pairs of images <math> \mathbf x^a, \mathbf x^b </math>, is defined using the squared distances between their codes, <math> {\mathbf f}(x^a),{\mathbf f}(x^b) </math>:<br />
<br />
<br> <center> <math> \mathbf p(x^a,x^b)= \frac{\exp(-||f(x^a)-f(x^b)||^2)}{\sum_{k<l} \exp(-||f(x^k)-f(x^l)||^2)}</math> </center><br />
<br />
This formulation is quadratic in the number of training cases and is an attempt to model the structure in the pairings, not the structure in the individual images or the mutual information between the code vectors; it would therefore require a very large number of pairs to train the large number of parameters. An alternative approach used here is based on a recently discovered, effective unsupervised algorithm for training a multi-layer, non-linear ”encoder” network that transforms the input data vector <math> \mathbf x </math> into a low-dimensional feature representation <math> \mathbf f(x|W) </math>, capturing<br />
a lot of the structure in the input data <ref> G. E. Hinton and R. R. Salakhutdinov. Reducing the<br />
dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006. </ref>. <br />
<br />
The proposed algorithm performs two steps. First, the recently discovered unsupervised<br />
algorithm is used in a pre-training stage, i.e. to initialize the parameter vector<br />
<math> W </math> that defines the mapping from input vectors to their low-dimensional representation. Next, the initial parameters are fine-tuned by performing gradient descent on the objective function defined by the Neighbourhood Component Analysis (NCA) method <ref> J. Goldberger, S. T. Roweis, G. E. Hinton, and Ruslan Salakhutdinov. Neighbourhood components analysis. In<br />
NIPS, 2004 </ref>. The resulting learning algorithm is a non-linear transformation<br />
of the input space optimized to make KNN perform well in the low-dimensional feature<br />
space.<br />
<br />
== Neighborhood Component Analysis ==<br />
<br />
Assume a given set of N labeled training cases <math> (x^a,c^a),\ a=1, 2, 3, \ldots, N </math>, where <math> x^a \in R^d </math> and <math> c^a \in \{1,2, \ldots, C\} </math>.<br />
For each training vector <math> \mathbf x^a </math>, the probability that point<br />
<math> \mathbf a </math> selects one of its neighbours <math> \mathbf b </math> in the transformed<br />
feature space is defined as:<br />
<br />
<br> <center> <math> p_{ab}=\frac{exp(-d_{ab})}{\sum_{z \neq a} exp(-d_{az})} </math> </center><br />
<br />
Assuming Euclidean distance metric we have:<br />
<br />
<br> <center> <math> \mathbf d_{ab}=||f(x^a|W)-f(x^b|W) ||^2</math> </center><br />
<br />
<br />
If <math> \mathbf f(x|W)=Wx </math> is constrained to be a linear transformation, we get linear NCA. However, here authors define <math> \mathbf f(x|W) </math> using a multi-layer, nonlinear neural network, parametrized by the weight vector <math> W </math>.<br />
The probability that point <math> a </math> belongs to class <math> k </math> depends on the relative proximity of all other data points that belong to class <math> k </math>, i.e. <br />
<br />
<br> <center> <math> \mathbf p(c^a=k)=\sum_{b:c^b=k} p_{ab} </math> </center><br />
<br />
The NCA goal is to maximize the expected number of correctly classified points on the training data:<br />
<br />
<br> <center> <math> \mathbf O_{NCA}= \sum_{a=1}^{N} \sum_{b:c^a=c^b} p_{ab}</math> </center><br />
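For concreteness, the selection probabilities <math> p_{ab} </math> and the objective <math> O_{NCA} </math> can be evaluated directly once the codes <math> f(x|W) </math> have been computed. Below is a minimal NumPy sketch (our illustration, not the authors' implementation):<br />

```python
import numpy as np

def nca_objective(codes, labels):
    """O_NCA evaluated on already-transformed codes f(x|W).

    codes  : (N, d) array of low-dimensional codes
    labels : (N,)   array of class labels
    """
    # Pairwise squared Euclidean distances d_ab = ||f(x^a) - f(x^b)||^2
    diff = codes[:, None, :] - codes[None, :, :]
    d = np.sum(diff ** 2, axis=2)
    # p_ab = exp(-d_ab) / sum_{z != a} exp(-d_az), with p_aa = 0
    e = np.exp(-d)
    np.fill_diagonal(e, 0.0)
    p = e / e.sum(axis=1, keepdims=True)
    # O_NCA = sum of p_ab over pairs (a, b) with c^a = c^b
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)
    return np.sum(p[same])
```

With two tight, well-separated classes, each point selects a same-class neighbour with probability close to 1, so the objective approaches N.<br />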
<br />
In order to maximize the above objective function, we need to compute its derivative with respect to the vector <math> W </math>; by the chain rule, for the <math> a^{th} </math> training case,<br />
<br />
<br> <center> <math> \mathbf \frac{\partial O_{NCA}}{\partial W} = \mathbf \frac{\partial O_{NCA}}{\partial f(x^a|W)} \mathbf \frac{\partial f(x^a|W)}{\partial W} </math> </center><br />
<br />
where<br />
<br />
<br> <center> <math> \mathbf \frac{\partial O_{NCA}}{\partial f(x^a|W)}= -2 [\sum_{b:c^a=c^b} p_{ab}d_{ab} - \sum_{b:c^a=c^b} p_{ab} [\sum_{z \neq a} p_{az}d_{az}]] + 2[\sum_{l:c^l=c^a} p_{la}d_{la} - \sum_{l \neq a} p_{la}d_{la}[\sum_{q:c^l=c^q} p_{lq}] ] </math> </center><br />
<br />
and <math> \mathbf \frac{\partial f(x^a|W)}{\partial W} </math> is computed using the standard backpropagation algorithm.<br />
<br />
For a more detailed discussion on NCA click on the link below:<br />
<br />
[[Neighbourhood Components Analysis|Neighbourhood Components Analysis]]<br />
<br />
== A short introduction to Restricted Boltzmann Machines (RBM) ==<br />
The pre-training step in the paper models binary data using a Restricted Boltzmann Machine(RBM) in Section 3.1. The presentation there, although readily understandable by readers who are already familiar with RBM, is not easy to absorb by less informed readers. This short introduction serves to fill this gap.<br />
<br />
=== Organization of this short introduction ===<br />
We will first explain the notion of Boltzmann Machine and then explain the advantage of using Restricted Boltzmann Machines.<br />
<br />
=== Boltzmann Machine ===<br />
"A Boltzmann machine is a network of symmetrically connected, neuron-like units that make stochastic decisions about whether to be on or off."<ref>http://www.scholarpedia.org/article/Boltzmann_machine</ref> <br />
<br />
[[File:BM.jpg]]<br />
<br />
The above figure is a snapshot of a Boltzmann machine consisting of five interconnected neurons where each neuron has a binary state of either on or off. The interconnection in a Boltzmann machine is formally given by some weight coefficients <math>w_{ij} \,</math> and "symmetrically connected" means that <math>w_{ij} = w_{ji} \,</math>. Each neuron also has a bias coefficient <math>b_i\,</math>. As time evolves, the neurons update their binary states according to a stochastic updating rule, which will be defined below.<br />
<br />
====Stochastic dynamics of a Boltzmann Machine ====<br />
When neuron ''i'' is given the opportunity to update its binary state, it first computes its ''total input'', <math>z_i \,</math>, by summing its own bias and the weights of its adjacent neurons that are on. Formally, <math> z_i = b_i + \sum_j s_j w_{ij} \,</math> where <math>s_j\,</math> is 1 if neuron ''j'' is on and 0 if neuron ''j'' is off. Neuron ''i'' then turns on with the probability <math>prob(s_i=1) = \frac{1}{1 + e^{-z_i}} \,</math>.<br />
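The stochastic updating rule above amounts to a few lines of code; the following NumPy sketch (our illustration, not from the paper) updates a single neuron:<br />

```python
import numpy as np

def update_unit(i, s, W, b, rng):
    """One stochastic update of neuron i in a Boltzmann machine.

    s : (n,) vector of 0/1 states; W : (n, n) symmetric weights with
    zero diagonal; b : (n,) biases; rng : a NumPy random Generator.
    """
    z = b[i] + W[i] @ s               # total input z_i = b_i + sum_j s_j w_ij
    p_on = 1.0 / (1.0 + np.exp(-z))   # logistic turn-on probability
    s[i] = 1 if rng.random() < p_on else 0
    return s
```

A unit with a very large positive total input turns on with probability essentially 1, and with a very large negative total input it stays off.<br />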
<br />
One important and interesting property of a Boltzmann Machine is that if the neurons are updated sequentially in any order that does not depend on their total inputs, the network will eventually reach a stationary distribution (called the Boltzmann distribution), in the sense that the probabilities of state vectors <math>s=\{s_i\}\,</math> converge to a stationary distribution. Defining the energy of a Boltzmann Machine (at a particular state vector <math>s\,</math>) by <math>E(s) = -\sum_{i<j} w_{ij} \, s_i \, s_j - \sum_i b_i \, s_i</math>, the Boltzmann distribution can be written as<br />
<br />
<math> P(s) = \frac{e^{-E(s)}}{\sum_{v} e^{-E(v)}} \,</math> where the sum in the denominator runs over all possible state vectors <math>v\,</math>.<br />
<br />
==== Learning in Boltzmann Machine ====<br />
The training data is represented in a Boltzmann Machine as state vectors. Learning in Boltzmann Machine then consists of finding the bias and weights coefficients that define a Boltzmann distribution in which those state vectors have high probabilities. Computational algorithms can be derived by differentiating the above formula and applying a gradient method.<br />
<br />
===== Training with hidden units and visible units =====<br />
Learning becomes more interesting if some of the neuron-like units are "visible" (meaning that their states can be observed) and the other units are "hidden" (meaning that their states cannot be observed). The hidden units act as latent variables which the Boltzmann Machine uses to model distributions over the visible state vectors. It is remarkable that the learning rule of Boltzmann Machines remains unchanged when there are hidden units.<br />
<br />
===== Restricted Boltzmann Machine =====<br />
In a restricted Boltzmann Machine, every connection is between a visible unit and a hidden unit. This makes the states of the hidden units conditionally independent given a state of the visible units. Figure 2 in the paper, reproduced below, is an illustration of a Restricted Boltzmann Machine.<br />
<br />
[[File:Example.jpg]]<br />
<br />
== Pre-training step ==<br />
<br />
The purpose of this step is to learn the (initial) weights for an adaptive, multi-layer, non-linear encoder network that transforms the input data vector <math> x </math> into its low-dimensional<br />
feature representation <math> \mathbf f(x|W) </math>. Indeed, the pretraining step should find a good region from which to start the subsequent fine-tuning step.<br />
<br />
===Modeling binary data ===<br />
<br />
The MNIST dataset contains binary images, modeled here using Restricted Boltzmann Machines (RBMs) <ref> Y. Freund and D. Haussler. Unsupervised learning of distributions on binary vectors using two layer networks. In Advances in Neural Information Processing Systems 4, pages 912–919, San Mateo, CA., 1992. Morgan Kaufmann </ref>, which are a type of stochastic recurrent neural network.<br />
<br />
The visible stochastic binary input vector <math> \mathbf x </math> and hidden<br />
stochastic binary feature vector <math> \mathbf h </math> are modeled by products of conditional Bernoulli distributions:<br />
<br />
<br> <center> <math> \mathbf p(h_j=1|x)= \sigma(b_j+\sum_{i}W_{ij}x_i) </math> </center><br />
<br />
<br> <center> <math> \mathbf p(x_i=1|h)= \sigma(b_i+\sum_{j}W_{ij}h_j) </math> </center><br />
<br />
Where <math> \sigma(z) = \frac{1}{1+exp(-z)} </math> is the logistic function and <math> \mathbf W_{ij} </math> is a symmetric interaction term between input <math> \mathbf i </math> and feature <math>\mathbf j </math>, and <math> b_i\ and \ b_j </math> are biases. <br />
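These two conditionals are just matrix products passed through the logistic function; a small NumPy sketch (illustrative only):<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbm_conditionals(x, h, W, b_vis, b_hid):
    """Conditional Bernoulli probabilities of a binary RBM.

    x : (d,) visible vector, h : (k,) hidden vector,
    W : (d, k) weights, b_vis : (d,) and b_hid : (k,) biases.
    """
    p_h_given_x = sigmoid(b_hid + x @ W)   # p(h_j = 1 | x)
    p_x_given_h = sigmoid(b_vis + W @ h)   # p(x_i = 1 | h)
    return p_h_given_x, p_x_given_h
```

With all weights and biases at zero, every unit is on with probability 0.5, as expected from the logistic function.<br />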
<br />
<br />
The marginal distribution over visible vector <math> \mathbf x </math> is then:<br />
<br />
<br> <center> <math> \mathbf p(x)=\sum_{h} \frac{exp(-E(h,x))}{\sum_{u,g} exp(-E(u,g))} </math> </center><br />
<br />
Where <math> \mathbf E(x,h) </math> is an energy term defined as below:<br />
<br />
<br> <center> <math> \mathbf E(x,h) = - \sum_{i} b_ix_i - \sum_{j} b_jh_j - \sum_{i,j} x_ih_jW_{ij} </math> </center><br />
<br />
Finally, the parameter update term to perform gradient ascent<br />
in the log-likelihood is <br />
<br />
<br> <center> <math> \mathbf \Delta W_{ij}=\epsilon \frac{\partial log p(x)}{\partial W_{ij}} = \epsilon (<x_i,h_j>_{data} - <x_i,h_j>_{model}) </math> </center><br />
<br />
With <math> \mathbf \epsilon </math> being the learning rate, <math> \mathbf <.>_{data} </math> denoting the frequency with which input <math> \mathbf i </math> and feature <math> \mathbf j </math> are on together when the features<br />
are being driven by the observed data from the training set, and <math> \mathbf <.>_{model} </math> being the corresponding expectation with respect to the distribution defined by the model.<br />
<br />
As exact computation of <math> \mathbf <.>_{model} </math> is intractable, the 1-step contrastive divergence algorithm is used to circumvent this burden <ref> G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1711–1800, 2002</ref>:<br />
<br />
<br> <center> <math> \mathbf \Delta W_{ij}= \epsilon (<x_i,h_j>_{data} - <x_i,h_j>_{recon}) </math> </center><br />
<br />
Where <math> \mathbf <.>_{recon} </math> is the frequency with which input <math> \mathbf i </math> and feature <math> \mathbf j </math> are on together after stochastically activating features and reconstructing binary data. A simplified version of the above equation is used as the learning rule for the biases.<br />
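A single CD-1 weight update might look as follows; this sketch uses mean-field probabilities for the reconstruction and recomputed hidden activations, a common practical choice, and is not the authors' code:<br />

```python
import numpy as np

def cd1_update(X, W, b_vis, b_hid, eps, rng):
    """One contrastive-divergence (CD-1) weight update for a binary RBM.

    X : (n, d) batch of binary data; W : (d, k) weights; returns updated W.
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    # Positive phase: hidden features driven by the data
    ph_data = sigmoid(b_hid + X @ W)                      # (n, k)
    h = (rng.random(ph_data.shape) < ph_data).astype(float)  # stochastic features
    # Reconstruct the visible units, then recompute hidden probabilities
    px_recon = sigmoid(b_vis + h @ W.T)                   # (n, d)
    ph_recon = sigmoid(b_hid + px_recon @ W)              # (n, k)
    # <x_i h_j>_data - <x_i h_j>_recon, averaged over the batch
    grad = (X.T @ ph_data - px_recon.T @ ph_recon) / X.shape[0]
    return W + eps * grad
```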
<br />
===Modeling real value data===<br />
By generalizing RBMs to exponential family distributions, Welling et al. could model images with real-valued pixels by using visible units that have a Gaussian distribution whose mean is determined by the hidden units:<br />
<math>p(x_i=x|h) = \frac{1}{\sqrt{2\pi}\,\sigma_i}\exp\left(-\frac{(x-b_i-\sigma_i\sum_j{h_{j}w_{ij}})^2}{2\sigma_i^2}\right)</math> <br /><br />
<math>p(h_j=1|x)=\sigma(b_j+\sum_i{W_{ij}\frac{x_i}{\sigma_i}})</math><br />
<br />
=== Greedy recursive pretraining ===<br />
<br />
This greedy, layer-by-layer training can be repeated several<br />
times to learn a deep, hierarchical model in which each<br />
layer of features captures strong high-order correlations between<br />
the activities of features in the layer below:<br />
<br />
1. Learn the parameters <math> \mathbf W^1 </math> of a Bernoulli or Gaussian<br />
model.<br />
<br />
2. Freeze the parameters of the lower-level model and use<br />
the activation probabilities of the binary features, when<br />
they are being driven by training data, as the data for<br />
training the next layer of binary features.<br />
<br />
3. Freeze the parameters <math> \mathbf W^2 </math> that define the second layer of<br />
features and use the activation probabilities of those<br />
features as data for training the third layer of features.<br />
<br />
4. Proceed recursively for as many layers as desired.<br />
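Steps 1–4 above can be sketched as a loop that trains one RBM per layer and feeds its hidden activation probabilities up as data for the next layer. Here `train_rbm` is a hypothetical stand-in for the contrastive-divergence training of a single layer:<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, rng):
    """Stand-in for CD training of one RBM; returns (W, b_hid).
    A real implementation would run contrastive divergence here."""
    W = 0.01 * rng.standard_normal((data.shape[1], n_hidden))
    b_hid = np.zeros(n_hidden)
    return W, b_hid

def pretrain(data, layer_sizes, rng):
    """Greedy layer-wise pretraining: freeze each trained layer and pass
    its activation probabilities up as the data for the next layer."""
    weights = []
    for n_hidden in layer_sizes:            # e.g. [500, 500, 2000, 30]
        W, b_hid = train_rbm(data, n_hidden, rng)
        weights.append((W, b_hid))
        data = sigmoid(b_hid + data @ W)    # activation probabilities
    return weights
```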
<br />
== Fine-tuning step ==<br />
<br />
For fine-tuning model parameters using the NCA objective function (introduced above) the method of<br />
conjugate gradients is used. To determine an adequate number of epochs and<br />
avoid overfitting, only a fraction of the training data is used for fine-tuning, and then performance is tested on the remaining validation data.<br />
<br />
==Regularized Nonlinear NCA==<br />
<br />
In many applications, a large supply of unlabeled data is readily available but the amount of labeled data (which typically requires expert knowledge to produce) is very limited. In order to take advantage of the information in the unlabeled data to enhance nonlinear NCA performance, a regularized NCA framework is proposed. Once the pretraining step for the individual RBMs at each level of the network is done, one can replace the stochastic activities of the binary<br />
features by deterministic, real-valued probabilities. This allows for backpropagating through the entire network to fine-tune the weights for optimal reconstruction of the data.<br />
<br />
Performing such training step does not require labeled data and produces low-dimensional codes<br />
that are good at reconstructing the input data vectors, and tend to preserve class neighbourhood structure. Accordingly, a new objective function <math> C </math> is formulated combining NCA and autoencoder objective functions:<br />
<br />
<br> <center> <math> \mathbf C=\lambda O_{NCA}+(1-\lambda)(-E) </math> </center><br />
<br />
Where <math> \mathbf E </math> is the reconstruction error and <math> \mathbf \lambda </math> is a trade-off parameter.<br />
<br />
For example assuming a semi-supervised learning setting, consider having a set of <math> \mathbf N_l </math> labeled training data <math> \mathbf (x^l,c^l) </math>, and a set of <math> \mathbf N_u </math> unlabeled training data <math> \mathbf x^u </math>. The overall objective function would then be as below:<br />
<br />
<br> <center> <math> \mathbf O=\lambda \frac{1}{N_l} \sum_{l=1}^{N_l} \sum_{b:c^l=c^b} p_{lb}+ <br />
(1-\lambda) \frac{1}{N} \sum_{n=1}^{N} -E^n</math> </center><br />
<br />
Where <math> \mathbf N=N_l+N_u </math> and <math> \mathbf E^n </math> is the reconstruction error for the input <math> \mathbf x^n </math>.<br />
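As a small numerical illustration of how the two terms are weighted (the helper below is our own, hypothetical function):<br />

```python
import numpy as np

def regularized_nca(nca_term, recon_errors, n_labeled, lam):
    """O = lam * (1/N_l) * nca_term + (1 - lam) * (1/N) * sum(-E^n).

    nca_term     : scalar sum of p_lb over labeled same-class pairs
    recon_errors : (N,) reconstruction errors E^n for all N = N_l + N_u cases
    """
    n_total = len(recon_errors)
    return lam * nca_term / n_labeled + (1 - lam) * np.sum(-recon_errors) / n_total
```

Setting <math>\lambda=1</math> recovers pure NCA, while <math>\lambda=0</math> recovers the pure (negated, averaged) reconstruction error.<br />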
<br />
===Splitting codes into class-relevant and class-irrelevant parts===<br />
<br />
Accurate reconstruction of a digit image requires the code to contain information about aspects of the image such as its orientation, slant, size and stroke thickness. These aspects are not necessarily relevant to the digit class, but may inevitably contribute to the Euclidean distance between codes<br />
and thus harm classification. To eliminate this unwanted effect, a 50-dimensional code is used, but only the first 30 dimensions enter the NCA objective function calculations. The remaining<br />
20 dimensions are free to code all those aspects of an image that do not affect its class label but are important for reconstruction, and are therefore only used to compute the reconstruction errors.<br />
<br />
=Experiments=<br />
<br />
Experimental results presented for the MNIST dataset, containing 60000 training and 10000 test images of 28x28 handwritten digits, show the superior performance of the proposed algorithm (i.e. nonlinear NCA). A 28x28-500-500-2000-30 architecture is used for the multi-layer neural network. <br />
<br />
The obtained results show that nonlinear NCA achieves a minimum error rate of 1.01% using 7 nearest neighbours. This compares favorably to the best reported error rates (without using any domain-specific knowledge) of 1.6% for randomly initialized backpropagation and 1.4% for Support Vector Machines.<br />
<br />
Results also show that Regularized nonlinear NCA not only performs better than the nonlinear NCA on unlabeled data but also performs slightly better on labelled data.<br />
<br />
<center>[[File:Embedding-Fig4.JPG]]</center><br />
<br />
=Summary=<br />
It has been shown how to pretrain and fine-tune a deep nonlinear encoder network to learn a similarity metric over the input space that facilitates nearest neighbour classification.<br /><br />
The method achieved the lowest error on a version of the MNIST handwritten digit recognition task without using any domain-specific knowledge.<br /><br />
The classification accuracy remains high even with limited labeled training data.<br />
<br />
=References=<br />
<references/></div>Amirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Embedding-Fig4.JPG&diff=3847File:Embedding-Fig4.JPG2009-08-05T18:24:38Z<p>Amir: </p>
<hr />
<div></div>Amirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=convex_and_Semi_Nonnegative_Matrix_Factorization&diff=3845convex and Semi Nonnegative Matrix Factorization2009-08-05T04:44:42Z<p>Amir: /* C. Shifting mixed-sign data to nonnegative */</p>
<hr />
<div>In the paper ‘Convex and semi nonnegative matrix factorization’, Ding et al. <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization”. </ref> have proposed new NMF-like algorithms for mixed-sign data, called Semi NMF and Convex NMF. They also show that a kernel form of NMF can be obtained by ‘kernelizing’ Convex NMF. They explore the connection between NMF algorithms and K-means clustering to show that these NMF algorithms can be used for clustering in addition to matrix approximation. These new variants thereby broaden the application areas of the NMF algorithm and also give the matrix factors better interpretability.<br />
<br />
==Introduction==<br />
Nonnegative matrix factorization (NMF) factorizes a matrix X into two matrices F and G, with the constraint that all three matrices are nonnegative, i.e. they contain only positive or zero entries and no negative entries:<br />
<math>X_+ \approx F_+{G_+}^T</math><br />
where ,<math> X \in {\mathbb R}^{p \times n}</math> , <math> F \in {\mathbb R}^{p \times k}</math> , <math> G \in {\mathbb R}^{n \times k}</math><br />
<br />
The least square objective function of NMF is:<br />
<math> \mathbf {E(F,G) = \|X-FG^T\|^2}</math><br />
<br />
It has been shown that this is an NP-hard problem and that the objective is convex in F alone or in G alone, but not in F and G simultaneously <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>. Also, the factors F and G are not always sparse, and many different sparsification schemes have been applied to NMF.<br />
<br />
==Semi NMF==<br />
In semi NMF, the matrix G is constrained to be nonnegative whereas the data matrix X and the basis vectors of F are unconstrained, that is:<br />
<br />
<math>X_{\pm} \approx F_{\pm}{G_+}^T</math><br />
<br />
The authors were motivated toward this kind of factorization by K-means clustering. The objective function of K-means can be written in the form of a matrix approximation as follows:<br />
<br />
<math> J_{K-means} = \sum_{i=1}^n \sum_{k=1}^K g_{ik}||x_i-f_k||^2=||X-FG^T||^2 </math> <br />
<br />
where, X is a mixed sign data matrix , F represents cluster centroids having both positive and negative entries and G represents cluster indicators having nonnegative entries.<br />
<br />
The K-means clustering objective function can thus be viewed as Semi NMF matrix approximation with a relaxed constraint on G: rather than being restricted to the binary values {0, 1}, G is allowed to range over <math>(0, \infty)</math>.<br />
<br />
==Convex NMF==<br />
While Semi NMF imposes no constraint on the basis matrix F, Convex NMF restricts the columns of F to be convex combinations of the columns of the data matrix X, such that:<br />
<br />
<math> F=(f_1, \cdots , f_k)</math><br />
<br />
<math> f_l=w_{1l}x_1+ \cdots + w_{nl}x_n = Xw_l = XW</math> such that,<br />
<math> w_{ij}>0</math> <math>\forall i,j </math> <br />
<br />
In this factorization each column of matrix F is a weighted sum of certain data points. This implies that we can think of F as weighted cluster centroids.<br />
<br />
Convex NMF has the form:<br />
<math> X_{\pm} \approx X_{\pm}W_+{G_+}^T</math><br />
<br />
As each column of F is considered to represent a weighted cluster centroid, the convex-combination constraint <math> \sum _{i=1}^n w_{il} = 1 </math> should be satisfied, but the authors do not actually enforce this.<br />
<br />
==Algorithms==<br />
The algorithms for these variants of NMF are based on iterative updating algorithms proposed for the original NMF, in which the factors are alternatively updated until convergence <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>. At each iteration of algorithm, the value for F or G is found by multiplying its current value by some factor. In <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>, they prove that by repeatedly applying these multiplicative update rules, the quality of approximation smoothly improves. That is, the update rule guarantees convergence to a locally optimal matrix factorization. In this paper, the same approach has been used by authors to present the algorithms for Semi NMF and Convex NMF.<br />
<br />
===Algorithm for Semi NMF===<br />
<br />
As already stated, the factors for semi NMF are computed by using an iterative updating algorithm that alternatively updates F and G till convergence is reached.<br />
<br />
*'''Step 1''': Initialize G<br />
**Obtain cluster indicators by K means clustering. <br />
*'''Step 2''': Update F, fixing G using the rule:<br />
<math>\mathbf{ F = XG(G^TG)^{-1}} </math><br />
<br />
*'''Step 3''': Update G, fixing F using the rule:<br />
<math> G_{ik} \leftarrow G_{ik} \sqrt{\frac {{(X^TF)^+}_{ik} + [G(F^TF)^-]_{ik}}{{(X^TF)^-}_{ik} + [G(F^TF)^+]_{ik}}}</math><br />
<br />
where, the positive and negative parts of a matrix are separated as:<br />
<math> {A_{ik}}^{+}=(|A_{ik}|+A_{ik})/2 </math> , <math> {A_{ik}}^{-}=(|A_{ik}|- A_{ik})/2 </math><br />
<br />
and, <math> A_{ik}= {A_{ik}}^{+} - {A_{ik}}^{-} </math><br />
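The three steps above can be sketched in NumPy as follows; this is an illustrative sketch, not the authors' code, and G is initialized randomly rather than by K-means for brevity:<br />

```python
import numpy as np

def semi_nmf(X, k, n_iter=200, seed=0):
    """Semi NMF via the alternating updates above.

    X : (p, n) mixed-sign data; returns F (p, k) unconstrained and
    G (n, k) nonnegative, with X approximately F @ G.T.
    """
    rng = np.random.default_rng(seed)
    G = rng.random((X.shape[1], k)) + 0.1        # nonnegative init
    pos = lambda A: (np.abs(A) + A) / 2          # A^+
    neg = lambda A: (np.abs(A) - A) / 2          # A^-
    for _ in range(n_iter):
        # Step 2: F = X G (G^T G)^{-1}
        F = X @ G @ np.linalg.inv(G.T @ G)
        # Step 3: multiplicative update for G
        XtF, FtF = X.T @ F, F.T @ F
        num = pos(XtF) + G @ neg(FtF)
        den = neg(XtF) + G @ pos(FtF) + 1e-12    # avoid division by zero
        G = G * np.sqrt(num / den)
    return F, G
```

Since the F-step projects X onto the subspace spanned by G and the G-step is monotonically non-increasing, the final residual never exceeds <math>\|X\|</math>.<br />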
<br />
<br><br />
'''Theorem 1:''' (A) The update rule for F gives the optimal solution to the <math> min_F \|X - FG^T\|^2 </math>, while G is fixed. (B) When F is fixed, the residual <math> \|X - FG^T\|^2 </math> decreases monotonically under the update rule for G.<br />
<br />
'''Proof:'''<br />
<br />
(We do not prove the entire theorem here, but discuss the main parts.)<br />
<br />
The objective function for semi NMF is:<br />
<math> J=\|X - FG^T\|^2= Tr(X^TX - 2X^TFG^T + GF^TFG^T) </math>.<br />
<br />
(A).The problem is unconstrained and the solution for F is trivial, given by:<br />
<math>dJ/dF = -2XG + 2FG^TG = 0</math><br />
<br>Therefore, <math> F = XG(G^TG)^{-1} </math><br />
<br />
(B). This is a constrained problem with an inequality constraint, and it is solved using Lagrange multipliers; the solution produced by the update rule must satisfy the KKT conditions at convergence, which establishes the correctness of the solution. Secondly, the update rule must cause the solution to converge. In the paper, correctness and convergence of the update rule are proved as follows:<br />
<br />
<br><br />
<br />
(i)'''Correctness of solution:'''<br />
<br />
Lagrange function is: <math> L(G) = Tr (-2X^TFG^T + GF^TFG^T - \Beta G^T) </math> <br />
<br> where, <math> \Beta_{ij}</math> are the Lagrange multipliers enforcing the non negativity constraint on G.<br />
<br>Therefore, <math> \frac {\part L}{\part G}= -2X^TF + 2GF^TF - \Beta = 0 </math> <br />
<br> From complementary slackness condition, <math> (-2X^TF + 2GF^TF)_{ik}G_{ik} = \Beta_{ik}G_{ik} = 0. </math> <br />
<br> The above equation must be satisfied at convergence.<br />
<br> The update rule for G can be reduced to: <br />
<math> (-2X^TF + 2GF^TF)_{ik}{G_{ik}}^2 = 0 </math> at convergence.<br />
<br> Both equations are identical and therefore the update rule satisfies the KKT fixed point condition.<br />
<br><br />
<br />
<br />
(ii)'''Convergence of the solution given by update rule:'''<br />
<br />
The authors used an auxiliary function approach to prove convergence, as done in <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>.<br />
<br />
'''Definition of auxiliary function''': A function <math>G(h,h')</math> is called an auxiliary function of <math>F(h)</math> if the conditions <math> G(h,h') \ge F(h) </math> and <math> G(h,h) = F(h) </math> are satisfied. <br />
<br />
The auxiliary function is a useful concept because of the following lemma:<br />
<br><br />
<br />
'''Lemma:''' If G is an auxiliary function, then F is nonincreasing under the update <math>\mathbf{ h^{t+1} = \arg \min_h G(h,h^t)} </math><br />
<br />
[[File:auxiliary.jpeg|left|thumb|800px|Figure 1]]<br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
Adapted from <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”.</ref>.<br />
<br />
<br> That is, minimizing the auxiliary function <math> G(h,h^t) \ge F(h) </math> guarantees that <math> F(h^{t+1}) \le F(h^t) </math> for <math> \mathbf {h^{n+1} = \arg \min_h G(h, h^t) }</math> <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”.</ref>.<br />
<br />
Therefore, the authors of the paper found an auxiliary function and its global minimum for the cost function of Semi NMF.<br />
<br />
The cost function for Semi NMF can be written as: <br />
<math> \mathbf {J(H) = Tr (-2H^TB^{+} + 2H^TB^{-} + HA^{+}H^T - HA^{-}H^T)} </math> where <math> A = F^TF , B = X^TF , H = G </math>. <br />
<br />
The auxiliary function of J (H) is: <br><br />
<math> Z(H,H') = -\sum_{ik}2{B_{ik}}^{+}H'_{ik}(1+ \log \frac {H_{ik}}{H'_{ik}}) + \sum_{ik} {B^-}_{ik} \frac {{H^2}_{ik}+{{H'}^2}_{ik}}{{H'}_{ik}} + \sum_{ik} \frac {(H'A^{+})_{ik}{H^2}_{ik}}{{H'}_{ik}} - \sum_{ik} {A_{kl}}^{-}{H'}_{ik}{H'}_{il} (1+ \log \frac {H_{ik}H_{il}}{H'_{ik}H'_{il}}) </math> <br />
<br />
Z (H, H') is convex in H and its global minimum is:<br><br />
<math> H_{ik} = arg \min_H Z(H,H') = H'_{ik}\sqrt {\frac {{B_{ik}}^{+} + (H'A^{-})_{ik}}{{B_{ik}}^{-} + (H'A^{+})_{ik}}} </math><br />
<br />
(The derivation of auxiliary function and its minimum can be found in the paper <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization” </ref>.)<br />
<br />
===Algorithm for Convex NMF===<br />
Here, again the factors G and W are computed iteratively by alternative updating until convergence.<br />
*'''Step 1''': Initialize G and W. There are two ways in which the initialization can be done.<br />
**'''K means clustering''': When K means clustering is done on the data set, cluster indicators <math> H = (h_1, \cdots , h_K) </math>are obtained. Then G is initialized to be equal to H. Then cluster centroids can be computed from H, as <math>\mathbf {f_k = Xh_k / n_k} </math> or <math> F=XH{D_n}^{-1}</math> where <math> D_n = diag (n_1, \cdots, n_K) </math>. And as, in convex NMF: <math>F = XW </math> , we get <math> W=H{D_n}^{-1}</math> <br />
**'''Previous NMF or Semi NMF solution''': The factor G is known in this case and a least square solution for W is obtained by solving <math> X=XWG^T</math>. Therefore, <math> W=G(G^TG)^{-1} </math><br />
<br />
*'''Step 2''': Update G, while fixing W using the rule<br />
<math> G_{ik} \leftarrow G_{ik} \sqrt{\frac {[(X^TX)^+W]_{ik} + [GW^T(X^TX)^-W]_{ik}} {[(X^TX)^-W]_{ik} + [GW^T(X^TX)^+W]_{ik}} } </math><br />
*'''Step 3''': Update W, while fixing G using the rule<br />
<math> W_{ik} \leftarrow W_{ik} \sqrt{\frac {[(X^TX)^+G]_{ik} + [(X^TX)^-WG^TG]_{ik}} {[(X^TX)^-G]_{ik} + [(X^TX)^+WG^TG]_{ik}} } </math><br />
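The two multiplicative updates above can be sketched as follows; again a random nonnegative initialization is used here for brevity, whereas the paper initializes from K-means or a previous (Semi) NMF solution:<br />

```python
import numpy as np

def convex_nmf(X, k, n_iter=200, seed=0):
    """Convex NMF, X ~ X W G^T, via the multiplicative updates above.

    X : (p, n) mixed-sign data; returns nonnegative W (n, k), G (n, k).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.random((n, k)) + 0.1
    G = rng.random((n, k)) + 0.1
    K = X.T @ X                                   # only X^T X is needed
    Kp, Kn = (np.abs(K) + K) / 2, (np.abs(K) - K) / 2   # K = K^+ - K^-
    eps = 1e-12                                   # avoid division by zero
    for _ in range(n_iter):
        G = G * np.sqrt((Kp @ W + G @ (W.T @ Kn @ W)) /
                        (Kn @ W + G @ (W.T @ Kp @ W) + eps))
        W = W * np.sqrt((Kp @ G + Kn @ W @ (G.T @ G)) /
                        (Kn @ G + Kp @ W @ (G.T @ G) + eps))
    return W, G
```

Note that the updates involve X only through <math>X^TX</math>, which is what makes the kernelized version of Convex NMF possible.<br />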
<br />
The objective function to be minimized for convex NMF is:<br />
<br />
<math> \mathbf {J=\|X-XWG^T\|^2= Tr(X^TX- 2G^TX^TXW + W^TX^TXWG^TG)} </math>.<br />
<br />
'''Theorem 2:''' Fixing W, under the update rule for G, (A) the residual <math>\|X - XWG^T \|^2 </math> decreases monotonically (non-increasing), and (B) the solution converges to a KKT fixed point.<br />
<br />
The correctness and convergence of these rules is demonstrated in a manner similar to Semi NMF by replacing F=XW.<br />
<br />
'''Theorem 3:''' Fixing G, under the update rule for W, (A) the residual <math>\|X - XWG^T \|^2 </math> decreases monotonically (non-increasing), and (B) the solution converges to a KKT fixed point.<br />
<br />
The correctness is demonstrated by minimizing the objective function with respect to W and obtaining the KKT fixed point condition:<br />
<br />
<math> \mathbf {(-X^TXG + X^TXWG^TG)_{ik}W_{ik} = 0 }</math><br />
<br />
<br> At convergence, the update rule for W can be shown to satisfy:<br />
<br />
<math>\mathbf { (-X^TXG + X^TXWG^TG)_{ik}{W_{ik}}^2 = 0 }</math><br />
<br />
<br> Therefore, the update rule for W satisfies the KKT condition.<br><br />
<br />
Convergence of these rules is demonstrated in a manner similar to Semi NMF by finding an auxiliary function and its global minimum.<br />
<br />
==Sparsity of Convex NMF==<br />
<br />
NMF has been shown to learn a parts-based representation and therefore tends to have sparse factors. There is, however, no means of controlling the degree of sparseness, and many sparsification methods have been applied to NMF in order to obtain a better parts-based representation <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization” </ref> , <ref name='Simon D. H' > Simon D. H, Ding C, and He X; “On the Equivalence of Nonnegative Matrix Factorization and Spectral Clustering” </ref>. In contrast, the authors of this paper show that the factors of Convex NMF are naturally sparse.<br />
<br />
<br> The convex NMF problem can be written as:<br />
<br />
<math> \min_{W,G \ge 0}||X-XWG^T||^2 = ||X(I-WG^T)||^2= Tr (I-GW^T)X^TX(I-WG^T) </math><br />
<br />
<br> by SVD of <math> X </math> we have <math> X = U \Sigma V^T</math> and thus, <math> X^TX = \sum_k {\sigma _k}^2v_k{v_k}^T.</math><br />
<br />
<br> Therefore, <math> \min_{W,G \ge 0} Tr (I-GW^T)X^TX(I-WG^T) = \sum_k {\sigma_k}^2||{v_k}^T(I-WG^T)||^2 </math> s.t. <math>W \in {\mathbb R_+}^{n \times k} </math> , <math>G \in {\mathbb R_+}^{n \times k}</math><br />
<br />
They use the following Lemma to show that the above optimization problem gives sparse W and G.<br />
<br />
<br>'''Lemma:''' The solution of the optimization problem <math> \min_{W,G \ge 0}||I-WG^T||^2 </math> s.t. <math>W, G \in {\mathbb R_+}^{n \times K}</math> is given by W = G = any K distinct columns of the identity, i.e. basis vectors <math>e_k</math> with <math> (e_k)_{i \ne k} = 0 </math> , <math> (e_k)_{i = k} = 1 </math><br />
<br />
<br> According to this Lemma, the solution to <math> \min_{W,G \ge 0}\|I - WG^T\|^2 </math> is given by the sparsest possible rank-K matrices W and G.<br />
<br />
In the above equation, we can write: <math> \| I - WG^T \|^2 = \sum_k \|{e_k}^T (I - WG^T)\|^2 </math>.<br />
<br />
Therefore, the projection of <math> ( I - WG^T ) </math> onto the principal components carries more weight (the large <math>{\sigma_k}^2</math>) while its projection onto the non-principal components carries less. This implies that the factors W and G are sparse in the principal component subspace and less sparse in the non-principal component subspace.<br />
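The Lemma admits a quick numerical check (illustrative only; the dimensions n = 5, K = 3 are arbitrary):

```python
import numpy as np

# Check the Lemma: for min ||I - W G^T||^2 with W, G >= 0 of rank K < n,
# taking W = G = K distinct columns of the identity attains the minimum.
n, K = 5, 3
E = np.eye(n)[:, :K]                          # columns e_1, ..., e_K
resid = np.linalg.norm(np.eye(n) - E @ E.T) ** 2
# E E^T zeroes out the first K diagonal entries of I, leaving n - K ones
assert np.isclose(resid, n - K)
```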
<br />
==Kernel NMF==<br />
Consider a mapping <math> \phi </math> that maps a point to a higher dimensional feature space, such that <math> \phi: x_i \rightarrow \phi(x_i)</math>. The factors for a kernel form of NMF or semi NMF, <math> \phi (X) \approx FG^T </math>, would be difficult to compute, as we would need to know the mapping <math>\phi </math> explicitly.<br />
<br />
This difficulty is overcome in convex NMF, which takes the form <math> \phi (X) \approx \phi (X) WG^T </math>, so the objective to be minimized becomes<br />
<br> <math> \|\phi (X)-\phi(X)WG^T\|^2 = Tr (K-2G^TKW+W^TKWG^TG) </math> where <math> K = \phi^T(X)\phi(X) </math> is the kernel.<br />
<br />
Also, the update rules for the convex NMF algorithm (discussed above) depend only on <math> X^TX </math> and therefore convex NMF can be '''kernelized'''.<br />
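Since the updates use only <math> X^TX </math>, replacing it by an arbitrary kernel matrix K gives the kernelized algorithm. A sketch under the same assumptions as before (random initialization, illustrative function names):

```python
import numpy as np

def pos(A): return (np.abs(A) + A) / 2.0
def neg(A): return (np.abs(A) - A) / 2.0

def kernel_convex_nmf(K, k, n_iter=200, eps=1e-9, seed=0):
    """Convex NMF driven only by a kernel matrix K = phi(X)^T phi(X)."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    G = rng.random((n, k)) + 0.1
    W = rng.random((n, k)) + 0.1
    Kp, Kn = pos(K), neg(K)        # K simply replaces X^T X in the updates
    for _ in range(n_iter):
        G *= np.sqrt((Kp @ W + G @ (W.T @ Kn @ W)) /
                     (Kn @ W + G @ (W.T @ Kp @ W) + eps))
        W *= np.sqrt((Kp @ G + Kn @ W @ (G.T @ G)) /
                     (Kn @ G + Kp @ W @ (G.T @ G) + eps))
    return W, G

def kernel_objective(K, W, G):
    """Tr(K) - 2 Tr(G^T K W) + Tr(W^T K W G^T G) = ||phi(X) - phi(X) W G^T||^2."""
    return np.trace(K) - 2 * np.trace(G.T @ K @ W) + np.trace(W.T @ K @ W @ G.T @ G)
```

With a linear kernel K = X^T X this reduces exactly to the convex NMF algorithm above.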
<br />
==Cluster NMF==<br />
<br />
If the factor G is taken to contain posterior cluster probabilities, then F, which represents the cluster centroids, is given as:<br />
<br> <math> \mathbf {f_k = Xg_k / n_k} </math> or <math> F = XG{D_n}^{-1}</math> where <math> D_n = diag (n_1, \cdots, n_K) </math>.<br />
<br>Therefore, the factorization becomes <math> X \approx XG{D_n}^{-1}G^T </math>, or simply <math> X \approx X G G^T </math>, since NMF is invariant to diagonal rescaling.<br />
<br />
This factorization is called Cluster NMF as it has the same degree of freedom as in any standard clustering problem, which is G (cluster indicator).<br />
<br />
==Relationship between NMF (its variants) and K means clustering==<br />
<br />
NMF and all its variants discussed above can be interpreted as K means clustering by imposing the additional constraint <math> G^TG=I </math>; together with nonnegativity this means each row of G has exactly one nonzero element, i.e. each data point belongs to exactly one cluster.<br />
<br />
'''Theorem:''' G-orthogonal NMF, Semi NMF, Convex NMF, Cluster NMF and Kernel NMF are all relaxations of K means clustering.<br />
<br />
'''Proof:'''<br />
<br />
In all the above five cases of NMF, it can be shown that the objective function can be reduced to:<br />
<math> \mathbf {J = Tr(X^TX -G^TKG)} </math> when <math> G^TG = I </math> and where <math> K = X^TX </math> or <math> K = \phi^T(X)\phi(X) </math>. As the first term is a constant, the minimization problem actually becomes: <br><br />
<math> \max_{G^TG = I} Tr(G^TKG) </math><br />
<br />
The above objective function is the same as the objective function for kernel K means clustering <ref name='Simon D. H'> Simon D. H, Ding C, and He X; “On the Equivalence of Nonnegative Matrix Factorization and Spectral Clustering” </ref>.<br />
<br />
<br> Even without the orthogonality constraint, these NMF algorithms can be considered to be '''soft''' versions of K means clustering. That is each data point can be considered to fractionally belong to more than one cluster.<br />
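The relaxed problem <math> \max_{G^TG = I} Tr(G^TKG) </math> is solved, by the Ky Fan theorem, by taking the columns of G to be the top K eigenvectors of K. A small NumPy illustration (matrix sizes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
K = A @ A.T                        # a symmetric positive semidefinite "kernel"
vals, vecs = np.linalg.eigh(K)     # eigenvalues in ascending order
G = vecs[:, -3:]                   # top-3 eigenvectors: the relaxed optimum
opt = np.trace(G.T @ K @ G)
assert np.isclose(opt, vals[-3:].sum())
# any other orthonormal G' attains a value no larger
Q, _ = np.linalg.qr(rng.normal(size=(6, 3)))
assert np.trace(Q.T @ K @ Q) <= opt + 1e-8
```

In practice the relaxed G is then discretized (for instance by clustering its rows) to recover hard cluster assignments.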
<br />
==General properties of NMF algorithms==<br />
*They converge to a local minimum, not a global minimum.<br />
*NMF factors are invariant to rescaling i.e. degree of freedom of diagonal rescaling is always present.<br />
*Convergence rate of multiplicative algorithms is first order.<br />
*Many different ways to initialize NMF. Here, the relationship between NMF and relaxed K means clustering is used.<br />
<br />
==Experimental Results==<br />
<br />
The authors present experimental results on a synthetic data set to show that the factors given by Convex NMF resemble cluster centroids more closely than those given by Semi NMF, although Semi NMF is more accurate than Convex NMF. They also compare NMF, convex NMF and semi NMF with K means clustering on real datasets, and conclude that all of these matrix factorizations give better clustering accuracy than K means on all of the datasets studied.<br />
<br />
=== A. Synthetic dataset ===<br />
One of the main goals here is to show that the Convex-NMF variants may provide subspace factorizations with more interpretable factors than those obtained by other NMF variants (or PCA). In particular, we expect that in some cases the factor F will be interpretable as containing cluster representatives (centroids) and G will be interpretable as encoding cluster indicators. <br />
<center>[[File:Convex-Fig1.JPG]]</center><br />
In Figure 1, we randomly generate four two-dimensional datasets with three clusters each. Computing both the Semi-NMF and Convex-NMF factorizations, we display the resulting F factors. We see that the Semi-NMF factors tend to lie distant from the cluster centroids. On the other hand, the Convex-NMF factors almost always lie within the clusters.<br />
<br />
=== B. Real life datasets ===<br />
The data sets which were used are: Ionosphere and Wave from the UCI repository, the document datasets URCS, WebkB4, Reuters (using a subset of the data collection which includes the 10 most frequent categories), WebAce and a dataset which contains 1367 log messages collected from several different machines with different operating systems at the School of Computer Science at Florida International University. The log messages are grouped into 9 categories: configuration, connection, create, dependency, other, report, request, start, and stop. Stop words were removed using a standard stop list. The top 1000 words were selected based on frequencies.<br />
<br />
<center>[[File:Convex-Table1.JPG]]</center><br />
<br />
The results are shown in Table I. We derived these results by averaging over 10 runs for each dataset and algorithm. Clustering accuracy was computed using the known class labels in the following way: The confusion matrix is first computed. The columns and rows are then reordered so as to maximize the sum of the diagonal. This sum is taken as a measure of the accuracy: it represents the percentage of data points correctly clustered under the optimized permutation. To measure the sparsity of G in the experiments, the average of each column of G was computed and all elements below 0.001 times the average were set to zero. We report the number of the remaining nonzero elements as a percentage of the total number of elements. (Thus small values of this measure correspond to large sparsity). We can observe that: <br />
<br />
1. The principal empirical result indicates that all of the matrix factorization models are better than K-means on all of the datasets. This suggests that the NMF family is competitive with K-means for the purposes of clustering. <br />
<br />
2. On most of the nonnegative datasets, NMF gives somewhat better accuracy than Semi-NMF and Convex-NMF (with WebKb4 the exception). The differences are modest, however, suggesting that the more highly-constrained Semi-NMF and Convex-NMF may be worthwhile options if interpretability is viewed as a goal of a data analysis. <br />
<br />
3. On the datasets containing both positive and negative values (where NMF is not applicable) the Semi-NMF results are better in terms of accuracy than the Convex-NMF results. <br />
<br />
4. In general, Convex-NMF solutions are sparse, while Semi-NMF solutions are not. <br />
<br />
5. Convex-NMF solutions are generally significantly more orthogonal than Semi-NMF solutions.<br />
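The accuracy measure described above (reordering the confusion matrix to maximize its diagonal sum) is a linear assignment problem. A sketch assuming SciPy is available; the function name is illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, pred_labels, K):
    """Accuracy under the best one-to-one matching of predicted to true clusters."""
    C = np.zeros((K, K), dtype=int)           # confusion matrix
    for t, p in zip(true_labels, pred_labels):
        C[t, p] += 1
    row, col = linear_sum_assignment(-C)      # permutation maximizing the diagonal sum
    return C[row, col].sum() / len(true_labels)
```

For example, `clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0], 2)` returns 1.0, since the labeling of the clusters is immaterial.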
<br />
<br />
=== C. Shifting mixed-sign data to nonnegative ===<br />
<br />
In this section, the mixed-sign data were made nonnegative by adding the smallest constant such that all entries become nonnegative, and the experiments were repeated on the shifted Wave and Ionosphere data. For Wave, the accuracy decreases to 0.503 from 0.590 for Semi-NMF and to 0.5297 from 0.5738 for Convex-NMF; the sparsity measure (fraction of nonzeros) increases to 0.586 from 0.498 for Convex-NMF. For Ionosphere, the accuracy decreases to 0.647 from 0.729 for Semi-NMF and to 0.618 from 0.6877 for Convex-NMF; the sparsity measure increases to 0.829 from 0.498 for Convex-NMF. <br />
<br />
<center>[[File:Convex-Fig2.JPG]]</center><br />
<br />
In short, the shifting approach does not appear to provide a satisfactory alternative.<br />
<br />
=== D. Flexibility of NMF ===<br />
In general NMF almost always performs better than K-means in terms of clustering accuracy while also providing a matrix approximation. This could be due to the flexibility of matrix factorization as compared to the rigid spherical clusters that the K-means objective function attempts to capture. When the data distribution is far from spherical, NMF may have advantages. Figure 2 gives an example. The dataset consists of two parallel rods in 3D space containing 200 data points. The two central axes of the rods are 0.3 apart, and the rods have diameter 0.1 and length 1. As seen in the figure, K-means gives a poor clustering, while NMF yields a good clustering. The bottom panel of Figure 2 shows the differences in the columns of G (each column is normalized so that <math>\sum_i g_k(i) = 1</math>). The mis-clustered points have small differences. Note that NMF is initialized randomly for the different runs. The stability of the solution over multiple runs was investigated; the results indicate that NMF converges to solutions F and G that are very similar across runs; moreover, the resulting discretized cluster indicators were identical.<br />
<br />
==Conclusion==<br />
In this paper: <br />
*A number of new NMF-like algorithms have been proposed, extending the range of applications of NMF.<br />
*They deal with mixed sign data.<br />
*The connection between NMF (its variants) and K means clustering was analyzed.<br />
*The matrix factors are shown to have convenient interpretation in terms of clustering.<br />
<br />
==References==<br />
<references/></div>
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=convex_and_Semi_Nonnegative_Matrix_Factorization&diff=3844
convex and Semi Nonnegative Matrix Factorization (Amir, 2009-08-05T04:36:28Z)
<p>Amir: /* B. Real life datasets */</p>
<hr />
<div>In the paper ‘Convex and semi non negative matrix factorization’, Ding et al <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization”. </ref> propose new NMF-like algorithms for mixed sign data, called Semi NMF and Convex NMF. They also show that a kernel form of NMF can be obtained by ‘kernelizing’ convex NMF. They explore the connection between NMF algorithms and K means clustering to show that these NMF algorithms can be used for clustering in addition to matrix approximation. These new variants thereby broaden the application areas of the NMF algorithm and also provide better interpretability of the matrix factors.<br />
<br />
==Introduction==<br />
Nonnegative matrix factorization (NMF) factorizes a matrix X into two matrices F and G, with the constraint that all three matrices are nonnegative, i.e. they contain only positive values or zero:<br />
<math>X_+ \approx F_+{G_+}^T</math><br />
where <math> X \in {\mathbb R}^{p \times n}</math> , <math> F \in {\mathbb R}^{p \times k}</math> , <math> G \in {\mathbb R}^{n \times k}</math><br />
<br />
The least square objective function of NMF is:<br />
<math> \mathbf {E(F,G) = \|X-FG^T\|^2}</math><br />
<br />
It has been shown that this is an NP-hard problem, convex in F alone or in G alone but not in F and G simultaneously <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>. Also, the factors F and G are not always sparse, and many different sparsification schemes have been applied to NMF.<br />
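For reference, the original multiplicative updates of Lee and Seung can be sketched as follows (an illustrative sketch; the iteration count, random initialization and `eps` safeguard are assumptions):

```python
import numpy as np

def nmf(X, k, n_iter=300, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for X_+ ~ F_+ G_+^T (X must be nonnegative)."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    F = rng.random((p, k)) + 0.1
    G = rng.random((n, k)) + 0.1
    for _ in range(n_iter):
        F *= (X @ G) / (F @ (G.T @ G) + eps)       # keeps F nonnegative
        G *= (X.T @ F) / (G @ (F.T @ F) + eps)     # keeps G nonnegative
    return F, G
```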
<br />
==Semi NMF==<br />
In semi NMF, the matrix G is constrained to be nonnegative whereas the data matrix X and the basis vectors of F are unconstrained, that is:<br />
<br />
<math>X_{\pm} \approx F_{\pm}{G_+}^T</math><br />
<br />
This kind of factorization is motivated by K means clustering, whose objective function can be written as a matrix approximation as follows:<br />
<br />
<math> J_{K-means} = \sum_{i=1}^n \sum_{k=1}^K g_{ik}||x_i-f_k||^2=||X-FG^T||^2 </math> <br />
<br />
where, X is a mixed sign data matrix , F represents cluster centroids having both positive and negative entries and G represents cluster indicators having nonnegative entries.<br />
<br />
The K means clustering objective can thus be viewed as Semi NMF matrix approximation with a relaxed constraint on G: instead of the binary indicator values {0, 1}, the entries of G are allowed to take any nonnegative value.<br />
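The identity <math> J_{K-means} = \|X-FG^T\|^2 </math> for a hard indicator matrix G can be verified numerically (toy data and an arbitrary assignment, for illustration only):

```python
import numpy as np

# Verify that the K-means objective equals ||X - F G^T||^2 when
# G is a hard cluster-indicator matrix and F holds the centroids.
rng = np.random.default_rng(1)
X = rng.normal(size=(2, 6))               # 6 points in R^2, one per column
labels = np.array([0, 0, 1, 1, 2, 2])     # an arbitrary assignment, K = 3
K = 3
G = np.zeros((6, K))
G[np.arange(6), labels] = 1.0
F = np.column_stack([X[:, labels == j].mean(axis=1) for j in range(K)])

J_kmeans = sum(np.sum((X[:, i] - F[:, labels[i]]) ** 2) for i in range(6))
J_matrix = np.linalg.norm(X - F @ G.T) ** 2
assert np.isclose(J_kmeans, J_matrix)
```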
<br />
==Convex NMF==<br />
In Semi NMF there is no constraint imposed on the basis matrix F. In Convex NMF, by contrast, the columns of F are restricted to be convex combinations of the columns of the data matrix X:<br />
<br />
<math> F=(f_1, \cdots , f_k)</math><br />
<br />
<math> f_l=w_{1l}x_1+ \cdots + w_{nl}x_n = Xw_l </math>, i.e. <math> F = XW </math>, such that<br />
<math> w_{ij} \ge 0</math> <math>\forall i,j </math> <br />
<br />
In this factorization each column of matrix F is a weighted sum of certain data points. This implies that we can think of F as weighted cluster centroids.<br />
<br />
Convex NMF has the form:<br />
<math> X_{\pm} \approx X_{\pm}W_+{G_+}^T</math><br />
<br />
For the columns of F to be genuine convex combinations (weighted cluster centroids), the weights should also satisfy <math> \sum _{i=1}^n w_{il} = 1 </math>; the authors, however, do not actually impose this constraint.<br />
<br />
==Algorithms==<br />
The algorithms for these variants of NMF are based on iterative updating algorithms proposed for the original NMF, in which the factors are alternatively updated until convergence <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>. At each iteration of algorithm, the value for F or G is found by multiplying its current value by some factor. In <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>, they prove that by repeatedly applying these multiplicative update rules, the quality of approximation smoothly improves. That is, the update rule guarantees convergence to a locally optimal matrix factorization. In this paper, the same approach has been used by authors to present the algorithms for Semi NMF and Convex NMF.<br />
<br />
===Algorithm for Semi NMF===<br />
<br />
As already stated, the factors for semi NMF are computed by using an iterative updating algorithm that alternatively updates F and G till convergence is reached.<br />
<br />
*'''Step 1''': Initialize G<br />
**Obtain cluster indicators by K means clustering. <br />
*'''Step 2''': Update F, fixing G using the rule:<br />
<math>\mathbf{ F = XG(G^TG)^{-1}} </math><br />
<br />
*'''Step 3''': Update G, fixing F using the rule:<br />
<math> G_{ik} \leftarrow G_{ik} \sqrt{\frac {{(X^TF)^+}_{ik} + [G(F^TF)^-]_{ik}}{{(X^TF)^-}_{ik} + [G(F^TF)^+]_{ik}}}</math><br />
<br />
where, the positive and negative parts of a matrix are separated as:<br />
<math> {A_{ik}}^{+}=(|A_{ik}|+A_{ik})/2 </math> , <math> {A_{ik}}^{-}=(|A_{ik}|- A_{ik})/2 </math><br />
<br />
and, <math> A_{ik}= {A_{ik}}^{+} - {A_{ik}}^{-} </math><br />
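Steps 1 to 3 can be sketched in NumPy as follows (an illustrative sketch: a random nonnegative initialization stands in for the K means initialization of Step 1, and a small ridge term guards the matrix inverse):

```python
import numpy as np

def pos(A): return (np.abs(A) + A) / 2.0   # A^+
def neg(A): return (np.abs(A) - A) / 2.0   # A^-

def semi_nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Alternating updates for X ~ F G^T with G >= 0 and F unconstrained."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    G = rng.random((n, k)) + 0.1   # Step 1 (the paper uses K-means indicators)
    for _ in range(n_iter):
        # Step 2: F = X G (G^T G)^{-1}, the least-squares optimum for F
        F = X @ G @ np.linalg.inv(G.T @ G + eps * np.eye(k))
        # Step 3: multiplicative update for G
        XtF, FtF = X.T @ F, F.T @ F
        G *= np.sqrt((pos(XtF) + G @ neg(FtF)) /
                     (neg(XtF) + G @ pos(FtF) + eps))
    return F, G
```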
<br />
<br><br />
'''Theorem 1:''' (A) The update rule for F gives the optimal solution to <math> \min_F \|X - FG^T\|^2 </math> while G is fixed. (B) When F is fixed, the residual <math> \|X - FG^T\|^2 </math> decreases monotonically under the update rule for G.<br />
<br />
'''Proof:'''<br />
<br />
(Not going to prove the entire theorem but discuss the main parts)<br />
<br />
The objective function for semi NMF is:<br />
<math> J=\|X - FG^T\|^2= Tr(X^TX - 2X^TFG^T + GF^TFG^T) </math>.<br />
<br />
(A).The problem is unconstrained and the solution for F is trivial, given by:<br />
<math>dJ/dF = -2XG + 2FG^TG = 0</math><br />
<br>Therefore, <math> F = XG(G^TG)^{-1} </math><br />
<br />
(B). This is a constrained problem with an inequality (nonnegativity) constraint. It is solved using Lagrange multipliers, and the solution given by the update rule must satisfy the KKT conditions at convergence; this establishes its correctness. Secondly, the update rule must cause the solution to converge. In the paper, the correctness and convergence of the update rule are proved as follows:<br />
<br />
<br><br />
<br />
(i)'''Correctness of solution:'''<br />
<br />
The Lagrange function is: <math> L(G) = Tr (-2X^TFG^T + GF^TFG^T - \beta G^T) </math> <br />
<br> where <math> \beta_{ik}</math> are the Lagrange multipliers enforcing the nonnegativity constraint on G.<br />
<br>Therefore, <math> \frac {\part L}{\part G}= -2X^TF + 2GF^TF - \beta = 0 </math> <br />
<br> From the complementary slackness condition, <math> (-2X^TF + 2GF^TF)_{ik}G_{ik} = \beta_{ik}G_{ik} = 0 </math> <br />
<br> must be satisfied at convergence.<br />
<br> At a fixed point, the update rule for G gives: <br />
<math> (-2X^TF + 2GF^TF)_{ik}{G_{ik}}^2 = 0 </math> at convergence.<br />
<br> The two conditions are equivalent (one is the other multiplied by <math>G_{ik}</math>), and therefore the update rule satisfies the KKT fixed point condition.<br />
<br><br />
<br />
<br />
(ii)'''Convergence of the solution given by update rule:'''<br />
<br />
The authors used an auxiliary function approach to prove convergence, as done in <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>.<br />
<br />
'''Definition of auxiliary function''': A function G(h,h') is called an auxiliary function of F(h) if the conditions <math> G (h,h') \ge F(h) </math> and <math> G (h,h) = F(h) </math> are satisfied. <br />
<br />
The auxiliary function is a useful concept because of the following lemma:<br />
<br><br />
<br />
'''Lemma:''' If G is an auxiliary function, then F is nonincreasing under the update <math>\mathbf{ h^{t+1} = \arg \min_h G(h,h^t)} </math><br />
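The lemma follows by chaining the two defining properties at the minimizer <math> h^{t+1} = \arg \min_h G(h,h^t) </math>:<br />
<math> F(h^{t+1}) \le G(h^{t+1},h^t) \le G(h^t,h^t) = F(h^t) </math><br />
where the first inequality holds because G upper-bounds F, the second because <math>h^{t+1}</math> minimizes <math>G(\cdot,h^t)</math>, and the final equality is the condition <math>G(h,h)=F(h)</math>.<br />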
<br />
[[File:auxiliary.jpeg|left|thumb|800px|Figure 1]]<br />
Adapted from <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”.</ref>.<br />
<br />
<br> That is, minimizing the auxiliary function <math> G(h,h^t) \ge F(h) </math> guarantees that <math> F(h^{t+1}) \le F(h^t) </math> for <math> \mathbf {h^{t+1} = \arg \min_h G(h, h^t) }</math> <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”.</ref>.<br />
<br />
Therefore the authors of the paper, found an auxiliary function and its global minimum for the cost function of Semi NMF.<br />
<br />
The cost function for Semi NMF can be written as: <br />
<math> \mathbf {J(H) = Tr (-2H^TB^{+} + 2H^TB^{-} + HA^{+}H^T - HA^{-}H^T)} </math> where <math> A = F^TF , B = X^TF , H = G </math>. <br />
<br />
The auxiliary function of J (H) is: <br><br />
<math> Z(H,H') = -\sum_{ik}2{B_{ik}}^{+}H'_{ik}(1+ \log \frac {H_{ik}}{H'_{ik}}) + \sum_{ik} {B^-}_{ik} \frac {{H^2}_{ik}+{{H'}^2}_{ik}}{{H'}_{ik}} + \sum_{ik} \frac {(H'A^{+})_{ik}{H^2}_{ik}}{{H'}_{ik}} - \sum_{ik} {A_{kl}}^{-}{H'}_{ik}{H'}_{il} (1+ \log \frac {H_{ik}H_{il}}{H'_{ik}H'_{il}}) </math> <br />
<br />
Z (H, H') is convex in H and its global minimum is:<br><br />
<math> H_{ik} = \arg \min_H Z(H,H') = H'_{ik}\sqrt {\frac {{B_{ik}}^{+} + (H'A^{-})_{ik}}{{B_{ik}}^{-} + (H'A^{+})_{ik}}} </math><br />
<br />
(The derivation of auxiliary function and its minimum can be found in the paper <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization” </ref>.)<br />
<br />
===Algorithm for Convex NMF===<br />
Here, again the factors G and W are computed iteratively by alternative updating until convergence.<br />
*'''Step 1''': Initialize G and W. There are two ways in which the initialization can be done.<br />
**'''K means clustering''': When K means clustering is done on the data set, cluster indicators <math> H = (h_1, \cdots , h_K) </math>are obtained. Then G is initialized to be equal to H. Then cluster centroids can be computed from H, as <math>\mathbf {f_k = Xh_k / n_k} </math> or <math> F=XH{D_n}^{-1}</math> where <math> D_n = diag (n_1, \cdots, n_K) </math>. And as, in convex NMF: <math>F = XW </math> , we get <math> W=H{D_n}^{-1}</math> <br />
**'''Previous NMF or Semi NMF solution''': The factor G is known in this case and a least square solution for W is obtained by solving <math> X=XWG^T</math>. Therefore, <math> W=G(G^TG)^{-1} </math><br />
<br />
*'''Step 2''': Update G, while fixing W using the rule<br />
<math> G_{ik} \leftarrow G_{ik} \sqrt{\frac {[(X^TX)^+W]_{ik} + [GW^T(X^TX)^-W]_{ik}} {[(X^TX)^-W]_{ik} + [GW^T(X^TX)^+W]_{ik}} } </math><br />
*'''Step 3''': Update W, while fixing G using the rule<br />
<math> W_{ik} \leftarrow W_{ik} \sqrt{\frac {[(X^TX)^+G]_{ik} + [(X^TX)^-WG^TG]_{ik}} {[(X^TX)^-G]_{ik} + [(X^TX)^+WG^TG]_{ik}} } </math><br />
<br />
The objective function to be minimized for convex NMF is:<br />
<br />
<math> \mathbf {J=\|X-XWG^T\|^2= Tr(X^TX- 2G^TX^TXW + W^TX^TXWG^TG)} </math>.<br />
<br />
'''Theorem 2:''' Fixing W, under the update rule for G, (A) the residual <math>\|X - XWG^T \|^2 </math> decreases monotonically (non-increasing), and (B) the solution converges to a KKT fixed point.<br />
<br />
The correctness and convergence of these rules is demonstrated in a manner similar to Semi NMF by replacing F=XW.<br />
<br />
'''Theorem 3:''' Fixing G, under the update rule for W, (A) the residual <math>\|X - XWG^T \|^2 </math> decreases monotonically (non-increasing), and (B) the solution converges to a KKT fixed point.<br />
<br />
The correctness is demonstrated by minimizing the objective function with respect to W and then obtaining KKT fixed point condition as:<br />
<br />
<math> \mathbf {(-X^TXG + X^TXWG^TG)_{ik}W_{ik} = 0 }</math><br />
<br />
<br> At convergence, the update rule for W can be shown to satisfy:<br />
<br />
<math>\mathbf { (-X^TXG + X^TXWG^TG)_{ik}{W_{ik}}^2 = 0 }</math><br />
<br />
<br> Therefore, the update rule for W satisfies KKT condition.<br><br />
<br />
Convergence of these rules is demonstrated in a manner similar to Semi NMF by finding an auxiliary function and its global minimum.<br />
<br />
==Sparsity of Convex NMF==<br />
<br />
NMF is shown to learn parts based representation and therefore has sparse factors. But there is no means to control the degree of sparseness and many sparsification methods have been applied to NMF in order to obtain better parts based representation <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization” </ref> , <ref name='Simon D. H' > Simon D. H, Ding C, and He X; “On the Equivalence of Nonnegative Matrix Factorization and Spectral Clustering” </ref>. However, in contrast the authors of this paper show that factors of Convex NMF are naturally sparse.<br />
<br />
<br> The convex NMF problem can be written as:<br />
<br />
<math> \min_{W,G \ge 0}||X-XWG^T||^2 = ||X(I-WG^T)||^2= Tr (I-GW^T)X^TX(I-WG^T) </math><br />
<br />
<br> by SVD of <math> X </math> we have <math> X = U \Sigma V^T</math> and thus, <math> X^TX = \sum_k {\sigma _k}^2v_k{v_k}^T.</math><br />
<br />
<br> Therefore, <math> \min_{W,G \ge 0} Tr (I-GW^T)X^TX(I-WG^T) = \sum_k {\sigma_k}^2||{v_k}^T(I-WG^T)||^2 </math> s.t. <math>W \in {\mathbb R_+}^{n \times k} </math> , <math>G \in {\mathbb R_+}^{n \times k}</math><br />
<br />
They use the following Lemma to show that the above optimization problem gives sparse W and G.<br />
<br />
<br>'''Lemma:''' The solution of <math> \min_{W,G \ge 0}||I-WG^T||^2 </math> s.t. <math>W, G \in {\mathbb R_+}^{n \times K}</math> optimization problem is given by W = G = any K columns of (e1,…,eK), where ek is a basis vector. <math> (e_k)_{i \ne k} = 0 </math> , <math> (e_k)_{i = k} = 1 </math><br />
<br />
<br> According to this Lemma, the solution to <math> \min_{W,G \ge 0}\|I - WG^T\|^2 </math> are the sparsest possible rank-K matrices W and G.<br />
<br />
In the above equation, we can write: <math> \| I - WG^T \|^2 = \sum_k \|{e_k}^T (I - WG^T)\|^2 </math>.<br />
<br />
Therfore, projection of <math> ( I - WG^T ) </math> onto the principal components has more weight while its projection on non principal components has less weight. This implies that factors W and G are sparse in the principal component subspace and less sparse in the non principal component subspace.<br />
<br />
==Kernel NMF==<br />
Consider a mapping <math> \phi </math> that maps a point to a higher dimensional feature space, such that <math> \phi: x_i \rightarrow \phi(x_i)</math>. The factors for the kernel form of NMF or semi NMF : <math> \phi (X) = FG^T </math> would be difficult to compute as we need to know the mapping <math>\phi </math> explicitly.<br />
<br />
This difficulty is overcome in the convex NMF, as it has the form: <math> \phi: (X) = \phi (X) WG^T </math> and therefore the objective to be minimized becomes,<br />
<br> <math> \|\phi (X)-\phi(X)WG^T\|^2 = Tr (K-2G^TKW+W^TKWG^TG) </math> where <math> K = \phi^T(X)\phi(X) </math> is the kernel.<br />
<br />
Also, the update rules for the convex NMF algorithm (discussed above) depend only on <math> X^TX </math> and therefore convex NMF can be '''kernelized'''.<br />
<br />
==Cluster NMF==<br />
<br />
The factor G is considered to contain posterior cluster probabilities, then F, which represents cluster centroids is given as:<br />
<br> <math> \mathbf {f_k = Xg_k / n_k} </math> or <math> F = XG{D_n}^{-1}</math> where <math> D_n = diag (n_1, \cdots, n_K) </math>.<br />
<br>Therefore, the factorization becomes, <math> X = XG{D_n}^{-1}G^T </math> or <math> X = X G G^T </math>. This is because NMF is invariant to diagonal rescaling.<br />
<br />
This factorization is called Cluster NMF as it has the same degree of freedom as in any standard clustering problem, which is G (cluster indicator).<br />
<br />
==Relationship between NMF (its variants) and K means clustering==<br />
<br />
NMF and all its variants discussed above can be interpreted as K means clustering by imposing an additional constraint <math> G^TG=I </math>, that is in each row of G there is only one nonzero element, which implies each data point can belong to only one cluster.<br />
<br />
'''Theorem:''' G-orthogonal NMF, Semi NMF, Convex NMF, Cluster NMF and Kernel NMF are all relaxations of K means clustering.<br />
<br />
'''Proof:'''<br />
<br />
In all the above five cases of NMF, it can be shown that the objective function can be reduced to:<br />
<math> \mathbf {J = Tr(X^TX -G^TKG)} </math> when <math> G^TG = I </math> and where <math> K = X^TX </math> or <math> K = \phi^T(X)\phi(X) </math>. As the first term is a constant, the minimization problem actually becomes: <br><br />
<math> \max_{G^TG = I} Tr(G^TKG) </math><br />
<br />
The above objective function is the same as the objective function for kernel K means clustering <ref name='Simon D. H'> Simon D. H, Ding C, and He X; “On the Equivalence of Nonnegative Matrix Factorization and Spectral Clustering” </ref>.<br />
<br />
<br> Even without the orthogonality constraint, these NMF algorithms can be considered to be '''soft''' versions of K means clustering. That is each data point can be considered to fractionally belong to more than one cluster.<br />
<br />
==General properties of NMF algorithms==<br />
*Converge to local minimum and not global minimum.<br />
*NMF factors are invariant to rescaling i.e. degree of freedom of diagonal rescaling is always present.<br />
*Convergence rate of multiplicative algorithms is first order.<br />
*Many different ways to initialize NMF. Here, the relationship between NMF and relaxed K means clustering is used.<br />
<br />
==Experimental Results==<br />
<br />
The authors have presented experimental results on synthetic data set to show that factors given by Convex NMF more closely resemble cluster centroids than those given by Semi NMF. However, semi NMF results are better in terms of accuracy than convex NMF. They have even compared the results of NMF, convex NMF and semi NMF with K means clustering on real dataset. They conclude that all of these matrix factorizations give better results than K means on all of the datasets they studied in terms of clustering accuracy.<br />
<br />
=== A. Synthetic dataset ===<br />
One of the main goals in here is to show that the Convex-NMF variants may provide subspace factorizations that have more interpretable factors than those obtained by other NMF variants (or PCA). Particularly we expect that in some cases the factor F will be interpretable as containing<br />
cluster representatives (centroids) and G will be interpretable as encoding cluster indicators. <br />
<center>[[File:Convex-Fig1.JPG]]</center><br />
In Figure 1, we randomly generate four two-dimensional datasets with three clusters each. Computing both the Semi-NMF and Convex-NMF factorizations, we display the resulting F factors. We see that the Semi-NMF factors tend to lie distant from the cluster centroids. On the other hand, the Convex-NMF factors almost always lie within the clusters.<br />
<br />
=== B. Real life datasets ===<br />
The data sets which were used are: Ionosphere and Wave from the UCI repository, the document datasets URCS, WebkB4, Reuters (using a subset of the data collection which includes the 10 most frequent categories), WebAce and a dataset which contains 1367 log messages collected from several different machines with different operating systems at the School of Computer Science at Florida International University. The log messages are grouped into 9 categories: configuration, connection, create, dependency, other, report, request, start, and stop. Stop words were removed using a standard stop list. The top 1000 words were selected based on frequencies.<br />
<br />
<center>[[File:Convex-Table1.JPG]]</center><br />
<br />
The results are shown in Table I. We derived these results by averaging over 10 runs for each dataset and algorithm. Clustering accuracy was computed using the known class labels in the following way: The confusion matrix is first computed. The columns and rows are then reordered so as to maximize the sum of the diagonal. This sum is taken as a measure of the accuracy: it represents the percentage of data points correctly clustered under the optimized permutation. To measure the sparsity of G in the experiments, the average of each column of G was computed and all elements below 0.001 times the average were set to zero. We report the number of the remaining nonzero elements as a percentage of the total number of elements. (Thus small values of this measure correspond to large sparsity). We can observe that: <br />
<br />
1. The principal empirical result is that all of the matrix factorization models outperform K-means on all of the datasets; that is, the NMF family is competitive with K-means for the purposes of clustering. <br />
<br />
2. On most of the nonnegative datasets, NMF gives somewhat better accuracy than Semi-NMF and Convex-NMF (with WebKB4 the exception). The differences are modest, however, suggesting that the more highly constrained Semi-NMF and Convex-NMF may be worthwhile options if interpretability is viewed as a goal of the data analysis. <br />
<br />
3. On the datasets containing both positive and negative values (where NMF is not applicable), the Semi-NMF results are better in terms of accuracy than the Convex-NMF results. <br />
<br />
4. In general, Convex-NMF solutions are sparse, while Semi-NMF solutions are not. <br />
<br />
5. Convex-NMF solutions are generally significantly more orthogonal than Semi-NMF solutions.<br />
<br />
<br />
=== C. Shifting mixed-sign data to nonnegative ===<br />
<br />
In this section, the mixed-sign data were shifted to be nonnegative by adding the smallest constant that makes all entries nonnegative, and experiments were performed on the shifted Wave and Ionosphere data. For Wave, the accuracy decreases to 0.503 from 0.590 for Semi-NMF and to 0.5297 from 0.5738 for Convex-NMF, while the sparsity measure increases to 0.586 from 0.498 for Convex-NMF (i.e., G becomes less sparse). For Ionosphere, the accuracy decreases to 0.647 from 0.729 for Semi-NMF and to 0.618 from 0.6877 for Convex-NMF, and the sparsity measure increases to 0.829 from 0.498 for Convex-NMF. <br />
<br />
<center>[[File:Convex-Fig2.JPG]]</center><br />
<br />
In short, the shifting approach does not appear to provide a satisfactory alternative.<br />
<br />
==Conclusion==<br />
In this paper: <br />
*A number of new NMF-like algorithms are proposed that extend the applications of NMF.<br />
*These algorithms handle mixed-sign data.<br />
*The connection between NMF (and its variants) and K-means clustering is analyzed.<br />
*The matrix factors are shown to have a convenient interpretation in terms of clustering.<br />
<br />
==References==<br />
<references/></div>Amirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Convex-Fig2.JPG&diff=3843File:Convex-Fig2.JPG2009-08-05T04:34:03Z<p>Amir: </p>
<hr />
<div></div>Amirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Convex-Table1.JPG&diff=3842File:Convex-Table1.JPG2009-08-05T04:23:56Z<p>Amir: uploaded a new version of "File:Convex-Table1.JPG"</p>
<hr />
<div></div>Amirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=convex_and_Semi_Nonnegative_Matrix_Factorization&diff=3841convex and Semi Nonnegative Matrix Factorization2009-08-05T04:20:40Z<p>Amir: /* A. Synthetic dataset */</p>
<hr />
<div>In the paper “Convex and Semi-Nonnegative Matrix Factorization”, Ding et al. <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization”. </ref> propose new NMF-like algorithms for mixed-sign data, called Semi NMF and Convex NMF. They also show that a kernel form of NMF can be obtained by “kernelizing” Convex NMF, and they explore the connection between NMF algorithms and K-means clustering to show that these NMF algorithms can be used for clustering in addition to matrix approximation. These new variants thereby broaden the application areas of the NMF algorithm and provide better interpretability of the matrix factors.<br />
<br />
==Introduction==<br />
Nonnegative matrix factorization (NMF) factorizes a matrix X into two matrices F and G, under the constraint that all three matrices are nonnegative, i.e., they contain only positive or zero entries:<br />
<math>X_+ \approx F_+{G_+}^T</math><br />
where ,<math> X \in {\mathbb R}^{p \times n}</math> , <math> F \in {\mathbb R}^{p \times k}</math> , <math> G \in {\mathbb R}^{n \times k}</math><br />
<br />
The least square objective function of NMF is:<br />
<math> \mathbf {E(F,G) = \|X-FG^T\|^2}</math><br />
<br />
It has been shown that this is an NP-hard problem, and that the objective is convex in F alone or in G alone, but not in F and G jointly <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref> Also, the factors F and G are not always sparse, and many different sparsification schemes have been applied to NMF.<br />
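For concreteness, the classical multiplicative updates of Lee and Seung for this least-squares objective can be sketched in NumPy (a minimal sketch, not code from either paper; the function name and the random initialization are our own assumptions):<br />

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative updates for X ~ F G^T with X, F, G all nonnegative."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    F = rng.random((p, k)) + eps   # nonnegative random initialization (an assumption)
    G = rng.random((n, k)) + eps
    for _ in range(n_iter):
        # each factor is multiplied elementwise by a nonnegative ratio,
        # so nonnegativity is preserved at every step
        F *= (X @ G) / (F @ (G.T @ G) + eps)
        G *= (X.T @ F) / (G @ (F.T @ F) + eps)
    return F, G
```

Under these updates the residual <math>\|X-FG^T\|^2</math> is non-increasing, which is the monotone-convergence property the variants below inherit.<br />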
<br />
==Semi NMF==<br />
In semi NMF, the matrix G is constrained to be nonnegative whereas the data matrix X and the basis vectors of F are unconstrained, that is:<br />
<br />
<math>X_{\pm} \approx F_{\pm}{G_+}^T</math><br />
<br />
This kind of factorization is motivated by K-means clustering: the K-means objective function can be written as a matrix approximation as follows:<br />
<br />
<math> J_{K-means} = \sum_{i=1}^n \sum_{k=1}^K g_{ik}||x_i-f_k||^2=||X-FG^T||^2 </math> <br />
<br />
where X is a mixed-sign data matrix, F contains the cluster centroids (with both positive and negative entries), and G contains the nonnegative cluster indicators.<br />
<br />
The K-means clustering objective can thus be viewed as a Semi NMF matrix approximation with a relaxed constraint on G: instead of being a binary indicator matrix with entries in {0, 1}, G is allowed to take any nonnegative value.<br />
<br />
==Convex NMF==<br />
While Semi NMF imposes no constraint on the basis matrix F, Convex NMF restricts the columns of F to be convex combinations of the columns of the data matrix X:<br />
<br />
<math> F=(f_1, \cdots , f_k)</math><br />
<br />
<math> f_l=w_{1l}x_1+ \cdots + w_{nl}x_n = Xw_l</math>, i.e., <math> F = XW</math>, such that<br />
<math> w_{ij} \ge 0</math> <math>\forall i,j </math> <br />
<br />
In this factorization each column of matrix F is a weighted sum of certain data points. This implies that we can think of F as weighted cluster centroids.<br />
<br />
Convex NMF has the form:<br />
<math> X_{\pm} \approx X_{\pm}W_+{G_+}^T</math><br />
<br />
Since F is considered to represent weighted cluster centroids, each column of W should additionally satisfy the convex-combination constraint <math> \sum _{i=1}^n w_{il} = 1 </math>, although the authors do not actually state this.<br />
<br />
==Algorithms==<br />
The algorithms for these variants of NMF are based on the iterative updating algorithms proposed for the original NMF, in which the factors are alternately updated until convergence <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>. At each iteration, the value of F or G is found by multiplying its current value by some factor. In <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>, it is proved that repeatedly applying these multiplicative update rules monotonically improves the quality of the approximation; that is, the update rules guarantee convergence to a locally optimal matrix factorization. In this paper, the authors use the same approach to derive algorithms for Semi NMF and Convex NMF.<br />
<br />
===Algorithm for Semi NMF===<br />
<br />
As already stated, the factors for Semi NMF are computed using an iterative algorithm that alternately updates F and G until convergence is reached.<br />
<br />
*'''Step 1''': Initialize G<br />
**Obtain cluster indicators by K means clustering. <br />
*'''Step 2''': Update F, fixing G using the rule:<br />
<math>\mathbf{ F = XG(G^TG)^{-1}} </math><br />
<br />
*'''Step 3''': Update G, fixing F using the rule:<br />
<math> G_{ik} \leftarrow G_{ik} \sqrt{\frac {{(X^TF)^+}_{ik} + [G(F^TF)^-]_{ik}}{{(X^TF)^-}_{ik} + [G(F^TF)^+]_{ik}}}</math><br />
<br />
where, the positive and negative parts of a matrix are separated as:<br />
<math> {A_{ik}}^{+}=(|A_{ik}|+A_{ik})/2 </math> , <math> {A_{ik}}^{-}=(|A_{ik}|- A_{ik})/2 </math><br />
<br />
and, <math> A_{ik}= {A_{ik}}^{+} - {A_{ik}}^{-} </math><br />
<br />
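The alternating updates above can be sketched in NumPy (an illustrative sketch, not the authors' code; the function name `semi_nmf` and the random nonnegative initialization, which stands in for the K-means initialization of Step 1, are our own assumptions):<br />

```python
import numpy as np

def pos(A):
    return (np.abs(A) + A) / 2.0   # elementwise positive part A^+

def neg(A):
    return (np.abs(A) - A) / 2.0   # elementwise negative part A^-

def semi_nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    # nonnegative random init (the paper uses K-means cluster indicators here)
    G = np.abs(rng.standard_normal((n, k)))
    for _ in range(n_iter):
        # Step 2: exact least-squares solution F = X G (G^T G)^{-1}
        F = X @ G @ np.linalg.pinv(G.T @ G)
        # Step 3: multiplicative update for G, which keeps G nonnegative
        XtF = X.T @ F
        FtF = F.T @ F
        num = pos(XtF) + G @ neg(FtF)
        den = neg(XtF) + G @ pos(FtF)
        G *= np.sqrt(num / (den + eps))
    return F, G
```

Because Step 2 solves the least-squares problem for F exactly, the residual <math>\|X - FG^T\|</math> never exceeds <math>\|X\|</math>, and Theorem 1 below guarantees it decreases monotonically.<br />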
<br><br />
'''Theorem 1:''' (A) The update rule for F gives the optimal solution to <math> \min_F \|X - FG^T\|^2 </math> while G is fixed. (B) When F is fixed, the residual <math> \|X - FG^T\|^2 </math> decreases monotonically under the update rule for G.<br />
<br />
'''Proof:'''<br />
<br />
(We do not prove the entire theorem here, but discuss its main parts.)<br />
<br />
The objective function for semi NMF is:<br />
<math> J=\|X - FG^T\|^2= Tr(X^TX - 2X^TFG^T + GF^TFG^T) </math>.<br />
<br />
(A) The problem in F is unconstrained, so the solution is obtained by setting the gradient to zero:<br />
<math>dJ/dF = -2XG + 2FG^TG = 0</math><br />
<br>Therefore, <math> F = XG(G^TG)^{-1} </math><br />
<br />
(B) This is a constrained problem with an inequality (nonnegativity) constraint. It is solved using Lagrange multipliers, and the solution given by the update rule must satisfy the KKT conditions at convergence; this establishes the correctness of the solution. Second, the update rule must cause the solution to converge. In the paper, correctness and convergence of the update rule are proved as follows:<br />
<br />
<br><br />
<br />
(i)'''Correctness of solution:'''<br />
<br />
The Lagrange function is: <math> L(G) = Tr (-2X^TFG^T + GF^TFG^T - \beta G^T) </math> <br />
<br> where <math> \beta_{ik}</math> are the Lagrange multipliers enforcing the nonnegativity constraint on G.<br />
<br>Therefore, <math> \frac {\partial L}{\partial G}= -2X^TF + 2GF^TF - \beta = 0 </math> <br />
<br> From the complementary slackness condition, <math> (-2X^TF + 2GF^TF)_{ik}G_{ik} = \beta_{ik}G_{ik} = 0. </math> <br />
<br> The above equation must be satisfied at convergence.<br />
<br> The update rule for G can be reduced to: <br />
<math> (-2X^TF + 2GF^TF)_{ik}{G_{ik}}^2 = 0 </math> at convergence.<br />
<br> The two conditions are equivalent at convergence (the second is the first multiplied by <math>G_{ik}</math>), and therefore the update rule satisfies the KKT fixed-point condition.<br />
<br><br />
<br />
<br />
(ii)'''Convergence of the solution given by update rule:'''<br />
<br />
The authors used an auxiliary function approach to prove convergence, as done in <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>.<br />
<br />
'''Definition of an auxiliary function''': A function G(h,h') is called an auxiliary function of F(h) if the conditions <math> G(h,h') \ge F(h) </math> and <math> G(h,h) = F(h) </math> are satisfied. <br />
<br />
The auxiliary function is a useful concept because of the following lemma:<br />
<br><br />
<br />
'''Lemma:''' If G is an auxiliary function, then F is nonincreasing under the update <math>\mathbf{ h^{t+1} = \arg \min_h G(h,h^t)} </math><br />
<br />
[[File:auxiliary.jpeg|left|thumb|800px|Figure 1]]<br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
Adapted from <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”.</ref>.<br />
<br />
<br> That is, minimizing the auxiliary function <math> G(h,h^t) \ge F(h) </math> guarantees that <math> F(h^{t+1}) \le F(h^t) </math> for <math> \mathbf {h^{t+1} = \arg \min_h G(h, h^t) }</math> <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”.</ref>.<br />
<br />
Therefore, the authors find an auxiliary function for the Semi NMF cost function and compute its global minimum.<br />
<br />
The cost function for Semi NMF can be written as: <br />
<math> \mathbf {J(H) = Tr (-2H^TB^{+} + 2H^TB^{-} + HA^{+}H^T + HA^{-}H^T)} </math> where <math> A = F^TF , B = X^TF , H = G </math>. <br />
<br />
The auxiliary function of J (H) is: <br><br />
<math> Z(H,H') = -\sum_{ik}2{B_{ik}}^{+}H'_{ik}(1+ \log \frac {H_{ik}}{H'_{ik}}) + \sum_{ik} {B^-}_{ik} \frac {{H^2}_{ik}+{{H'}^2}_{ik}}{{H'}_{ik}} + \sum_{ik} \frac {(H'A^{+})_{ik}{H^2}_{ik}}{{H'}_{ik}} - \sum_{ik} {A_{kl}}^{-}{H'}_{ik}{H'}_{il} (1+ \log \frac {H_{ik}H_{il}}{H'_{ik}H'_{il}}) </math> <br />
<br />
Z (H, H') is convex in H and its global minimum is:<br><br />
<math> H_{ik} = \arg \min_H Z(H,H') = H'_{ik}\sqrt {\frac {{B_{ik}}^{+} + (H'A^{-})_{ik}}{{B_{ik}}^{-} + (H'A^{+})_{ik}}} </math><br />
<br />
(The derivation of auxiliary function and its minimum can be found in the paper <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization” </ref>.)<br />
<br />
===Algorithm for Convex NMF===<br />
Here, again, the factors G and W are computed iteratively by alternate updating until convergence.<br />
*'''Step 1''': Initialize G and W. There are two ways in which the initialization can be done.<br />
**'''K means clustering''': When K means clustering is done on the data set, cluster indicators <math> H = (h_1, \cdots , h_K) </math>are obtained. Then G is initialized to be equal to H. Then cluster centroids can be computed from H, as <math>\mathbf {f_k = Xh_k / n_k} </math> or <math> F=XH{D_n}^{-1}</math> where <math> D_n = diag (n_1, \cdots, n_K) </math>. And as, in convex NMF: <math>F = XW </math> , we get <math> W=H{D_n}^{-1}</math> <br />
**'''Previous NMF or Semi NMF solution''': The factor G is known in this case and a least square solution for W is obtained by solving <math> X=XWG^T</math>. Therefore, <math> W=G(G^TG)^{-1} </math><br />
<br />
*'''Step 2''': Update G, while fixing W using the rule<br />
<math> G_{ik} \leftarrow G_{ik} \sqrt{\frac {[(X^TX)^+W]_{ik} + [GW^T(X^TX)^-W]_{ik}} {[(X^TX)^-W]_{ik} + [GW^T(X^TX)^+W]_{ik}} } </math><br />
*'''Step 3''': Update W, while fixing G using the rule<br />
<math> W_{ik} \leftarrow W_{ik} \sqrt{\frac {[(X^TX)^+G]_{ik} + [(X^TX)^-WG^TG]_{ik}} {[(X^TX)^-G]_{ik} + [(X^TX)^+WG^TG]_{ik}} } </math><br />
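A sketch of these alternating updates in NumPy (illustrative code under our own naming assumptions, not the authors' implementation; random nonnegative initialization replaces the K-means-based initialization of Step 1, and `eps` guards against division by zero):<br />

```python
import numpy as np

def convex_nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = np.abs(rng.standard_normal((n, k)))
    G = np.abs(rng.standard_normal((n, k)))
    # the updates touch X only through K = X^T X, split into positive
    # and negative parts; this is what later allows kernelization
    K = X.T @ X
    Kp = (np.abs(K) + K) / 2.0
    Kn = (np.abs(K) - K) / 2.0
    for _ in range(n_iter):
        # Step 2: update G with W fixed
        G *= np.sqrt((Kp @ W + G @ (W.T @ Kn @ W)) /
                     (Kn @ W + G @ (W.T @ Kp @ W) + eps))
        # Step 3: update W with G fixed
        GtG = G.T @ G
        W *= np.sqrt((Kp @ G + Kn @ W @ GtG) /
                     (Kn @ G + Kp @ W @ GtG + eps))
    return W, G
```

The approximation is then <math>X \approx XWG^T</math>, with both W and G nonnegative throughout.<br />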
<br />
The objective function to be minimized for convex NMF is:<br />
<br />
<math> \mathbf {J=\|X-XWG^T\|^2= Tr(X^TX- 2G^TX^TXW + W^TX^TXWG^TG)} </math>.<br />
<br />
'''Theorem 2:''' Fixing W, under the update rule for G, (A) the residual <math>\|X - XWG^T \|^2 </math> decreases monotonically (non-increasing), and (B) the solution converges to a KKT fixed point.<br />
<br />
The correctness and convergence of these rules is demonstrated in a manner similar to Semi NMF by replacing F=XW.<br />
<br />
'''Theorem 3:''' Fixing G, under the update rule for W, (A) the residual <math>\|X - XWG^T \|^2 </math> decreases monotonically (non-increasing), and (B) the solution converges to a KKT fixed point.<br />
<br />
The correctness is demonstrated by minimizing the objective function with respect to W and then obtaining KKT fixed point condition as:<br />
<br />
<math> \mathbf {(-X^TXG + X^TXWG^TG)_{ik}W_{ik} = 0 }</math><br />
<br />
<br> At convergence, the update rule for W can be shown to satisfy:<br />
<br />
<math>\mathbf { (-X^TXG + X^TXWG^TG)_{ik}{W_{ik}}^2 = 0 }</math><br />
<br />
<br> Therefore, the update rule for W satisfies KKT condition.<br><br />
<br />
Convergence of these rules is demonstrated in a manner similar to Semi NMF by finding an auxiliary function and its global minimum.<br />
<br />
==Sparsity of Convex NMF==<br />
<br />
NMF is known to learn a parts-based representation and therefore to have sparse factors, but it offers no means to control the degree of sparseness, and many different sparsification methods have been applied to NMF in order to obtain a better parts-based representation <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization” </ref> , <ref name='Simon D. H' > Simon D. H, Ding C, and He X; “On the Equivalence of Nonnegative Matrix Factorization and Spectral Clustering” </ref>. In contrast, the authors of this paper show that the factors of Convex NMF are naturally sparse.<br />
<br />
<br> The convex NMF problem can be written as:<br />
<br />
<math> \min_{W,G \ge 0}||X-XWG^T||^2 = ||X(I-WG^T)||^2= Tr (I-GW^T)X^TX(I-WG^T) </math><br />
<br />
<br> by SVD of <math> X </math> we have <math> X = U \Sigma V^T</math> and thus, <math> X^TX = \sum_k {\sigma _k}^2v_k{v_k}^T.</math><br />
<br />
<br> Therefore, <math> \min_{W,G \ge 0} Tr (I-GW^T)X^TX(I-WG^T) = \sum_k {\sigma_k}^2||{v_k}^T(I-WG^T)||^2 </math> s.t. <math>W \in {\mathbb R_+}^{n \times k} </math> , <math>G \in {\mathbb R_+}^{n \times k}</math><br />
<br />
They use the following Lemma to show that the above optimization problem gives sparse W and G.<br />
<br />
<br>'''Lemma:''' The solution of the optimization problem <math> \min_{W,G \ge 0}\|I-WG^T\|^2 </math> s.t. <math>W, G \in {\mathbb R_+}^{n \times K}</math> is given by W = G = any K distinct columns of <math>(e_1, \ldots ,e_n)</math>, where <math>e_k</math> is a basis vector: <math> (e_k)_{i \ne k} = 0 </math> , <math> (e_k)_{i = k} = 1 </math><br />
<br />
<br> According to this Lemma, the solutions to <math> \min_{W,G \ge 0}\|I - WG^T\|^2 </math> are the sparsest possible rank-K matrices W and G.<br />
<br />
In the above equation, we can write: <math> \| I - WG^T \|^2 = \sum_k \|{e_k}^T (I - WG^T)\|^2 </math>.<br />
<br />
Therefore, the projection of <math> ( I - WG^T ) </math> onto the principal components carries more weight, while its projection onto the non-principal components carries less. This implies that the factors W and G are sparse in the principal-component subspace and less sparse in the non-principal-component subspace.<br />
<br />
==Kernel NMF==<br />
Consider a mapping <math> \phi </math> that maps each point to a higher-dimensional feature space, <math> \phi: x_i \rightarrow \phi(x_i)</math>. The factors for a kernel form of NMF or Semi NMF, <math> \phi (X) \approx FG^T </math>, would be difficult to compute, since we would need to know the mapping <math>\phi </math> explicitly.<br />
<br />
This difficulty is overcome in Convex NMF, as it has the form <math> \phi (X) \approx \phi (X) WG^T </math>, and therefore the objective to be minimized becomes<br />
<br> <math> \|\phi (X)-\phi(X)WG^T\|^2 = Tr (K-2G^TKW+W^TKWG^TG) </math> where <math> K = \phi^T(X)\phi(X) </math> is the kernel.<br />
<br />
Also, the update rules for the convex NMF algorithm (discussed above) depend only on <math> X^TX </math> and therefore convex NMF can be '''kernelized'''.<br />
<br />
==Cluster NMF==<br />
<br />
If the factor G is considered to contain posterior cluster probabilities, then F, which represents the cluster centroids, is given as:<br />
<br> <math> \mathbf {f_k = Xg_k / n_k} </math> or <math> F = XG{D_n}^{-1}</math> where <math> D_n = diag (n_1, \cdots, n_K) </math>.<br />
<br>Therefore, the factorization becomes <math> X \approx XG{D_n}^{-1}G^T </math>, or simply <math> X \approx X G G^T </math>, since NMF is invariant to diagonal rescaling.<br />
<br />
This factorization is called Cluster NMF as it has the same degree of freedom as in any standard clustering problem, which is G (cluster indicator).<br />
<br />
==Relationship between NMF (its variants) and K means clustering==<br />
<br />
NMF and all of its variants discussed above can be interpreted as K-means clustering by imposing the additional constraint <math> G^TG=I </math>. Together with nonnegativity, this constraint means that each row of G has only one nonzero element, so each data point belongs to exactly one cluster.<br />
<br />
'''Theorem:''' G-orthogonal NMF, Semi NMF, Convex NMF, Cluster NMF and Kernel NMF are all relaxations of K means clustering.<br />
<br />
'''Proof:'''<br />
<br />
In all the above five cases of NMF, it can be shown that the objective function can be reduced to:<br />
<math> \mathbf {J = Tr(X^TX -G^TKG)} </math> when <math> G^TG = I </math> and where <math> K = X^TX </math> or <math> K = \phi^T(X)\phi(X) </math>. As the first term is a constant, the minimization problem actually becomes: <br><br />
<math> \max_{G^TG = I} Tr(G^TKG) </math><br />
<br />
The above objective function is the same as the objective function for kernel K means clustering <ref name='Simon D. H'> Simon D. H, Ding C, and He X; “On the Equivalence of Nonnegative Matrix Factorization and Spectral Clustering” </ref>.<br />
<br />
<br> Even without the orthogonality constraint, these NMF algorithms can be considered to be '''soft''' versions of K means clustering. That is each data point can be considered to fractionally belong to more than one cluster.<br />
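The reduction above can be verified numerically for an orthonormal cluster-indicator matrix G (a small sketch; the data dimensions and equal cluster sizes are illustrative assumptions):<br />

```python
import numpy as np

# With G a normalized cluster-indicator matrix (columns h_k / sqrt(n_k), so
# G^T G = I) and F = X G (G^T G)^{-1} = X G, the residual reduces to
# Tr(X^T X) - Tr(G^T K G) with K = X^T X.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 9))                        # 9 points in R^4
G = np.kron(np.eye(3), np.ones((3, 1))) / np.sqrt(3)   # 3 clusters of 3 points
K = X.T @ X

lhs = np.linalg.norm(X - X @ G @ G.T) ** 2
rhs = np.trace(K) - np.trace(G.T @ K @ G)
print(np.allclose(lhs, rhs))  # True
```

Since Tr(K) is constant, minimizing the residual over such G is exactly maximizing <math>Tr(G^TKG)</math>.<br />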
<br />
==General properties of NMF algorithms==<br />
*They converge to a local minimum, not the global minimum.<br />
*NMF factors are invariant to rescaling, i.e., a degree of freedom of diagonal rescaling is always present.<br />
*The convergence rate of the multiplicative algorithms is first order.<br />
*There are many different ways to initialize NMF. Here, the relationship between NMF and relaxed K-means clustering is used.<br />
<br />
==Experimental Results==<br />
<br />
The authors present experimental results on a synthetic dataset to show that the factors given by Convex NMF resemble cluster centroids more closely than those given by Semi NMF, although Semi NMF gives better accuracy than Convex NMF. They also compare NMF, Convex NMF, and Semi NMF against K-means clustering on real datasets, and conclude that all of these matrix factorizations outperform K-means in clustering accuracy on all of the datasets studied.<br />
<br />
=== A. Synthetic dataset ===<br />
One of the main goals here is to show that the Convex-NMF variants may provide subspace factorizations with more interpretable factors than those obtained by other NMF variants (or PCA). In particular, we expect that in some cases the factor F will be interpretable as containing cluster representatives (centroids) and G as encoding cluster indicators. <br />
<center>[[File:Convex-Fig1.JPG]]</center><br />
In Figure 1, we randomly generate four two-dimensional datasets with three clusters each. Computing both the Semi-NMF and Convex-NMF factorizations, we display the resulting F factors. We see that the Semi-NMF factors tend to lie distant from the cluster centroids. On the other hand, the Convex-NMF factors almost always lie within the clusters.<br />
<br />
=== B. Real life datasets ===<br />
The datasets used are: Ionosphere and Wave from the UCI repository; the document datasets URCS, WebKB4, Reuters (using a subset of the collection that includes the 10 most frequent categories), and WebAce; and a dataset of 1367 log messages collected from several machines with different operating systems at the School of Computer Science at Florida International University. The log messages are grouped into 9 categories: configuration, connection, create, dependency, other, report, request, start, and stop. Stop words were removed using a standard stop list, and the top 1000 words were selected by frequency.<br />
<br />
<center>[[File:Convex-Table1.JPG]]</center><br />
<br />
The results are shown in Table I. We derived these results by averaging over 10 runs for each dataset and algorithm. Clustering accuracy was computed using the known class labels in the following way: The confusion matrix is first computed. The columns and rows are then reordered so as to maximize the sum of the diagonal. This sum is taken as a measure of the accuracy: it represents the percentage of data points correctly clustered under the optimized permutation. To measure the sparsity of G in the experiments, the average of each column of G was computed and all elements below 0.001 times the average were set to zero. We report the number of the remaining nonzero elements as a percentage of the total number of elements. (Thus small values of this measure correspond to large sparsity). We can observe that: <br />
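The accuracy and sparsity measures just described can be sketched as follows (hypothetical helper functions named by us, not the authors' code; the brute-force permutation search assumes a small, equal number of clusters and classes):<br />

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(true_labels, pred_labels):
    """Reorder clusters so the diagonal sum of the confusion matrix is maximal."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    classes = np.unique(true_labels)
    clusters = np.unique(pred_labels)
    # confusion matrix: rows = predicted clusters, columns = true classes
    C = np.array([[np.sum((pred_labels == c) & (true_labels == t))
                   for t in classes] for c in clusters])
    # brute-force over column permutations (fine for a handful of clusters)
    best = max(sum(C[i, p[i]] for i in range(len(clusters)))
               for p in permutations(range(len(classes))))
    return best / len(true_labels)

def sparsity_measure(G, tol=0.001):
    """Fraction of entries of G surviving the 0.001-times-column-average cutoff."""
    G = np.asarray(G, dtype=float)
    thresholded = np.where(G < tol * G.mean(axis=0, keepdims=True), 0.0, G)
    return np.count_nonzero(thresholded) / G.size
```

Small values of `sparsity_measure` correspond to large sparsity, matching the convention in Table I.<br />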
<br />
1. The principal empirical result is that all of the matrix factorization models outperform K-means on all of the datasets; that is, the NMF family is competitive with K-means for the purposes of clustering. <br />
<br />
2. On most of the nonnegative datasets, NMF gives somewhat better accuracy than Semi-NMF and Convex-NMF (with WebKB4 the exception). The differences are modest, however, suggesting that the more highly constrained Semi-NMF and Convex-NMF may be worthwhile options if interpretability is viewed as a goal of the data analysis. <br />
<br />
3. On the datasets containing both positive and negative values (where NMF is not applicable), the Semi-NMF results are better in terms of accuracy than the Convex-NMF results. <br />
<br />
4. In general, Convex-NMF solutions are sparse, while Semi-NMF solutions are not. <br />
<br />
5. Convex-NMF solutions are generally significantly more orthogonal than Semi-NMF solutions.<br />
<br />
==Conclusion==<br />
In this paper: <br />
*A number of new NMF-like algorithms are proposed that extend the applications of NMF.<br />
*These algorithms handle mixed-sign data.<br />
*The connection between NMF (and its variants) and K-means clustering is analyzed.<br />
*The matrix factors are shown to have a convenient interpretation in terms of clustering.<br />
<br />
==References==<br />
<references/></div>Amirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Convex-Table1.JPG&diff=3839File:Convex-Table1.JPG2009-08-05T03:59:25Z<p>Amir: </p>
<hr />
<div></div>Amirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=convex_and_Semi_Nonnegative_Matrix_Factorization&diff=3838convex and Semi Nonnegative Matrix Factorization2009-08-05T03:55:05Z<p>Amir: /* Experimental Results */</p>
<hr />
<div>In the paper “Convex and Semi-Nonnegative Matrix Factorization”, Ding et al. <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization”. </ref> propose new NMF-like algorithms for mixed-sign data, called Semi NMF and Convex NMF. They also show that a kernel form of NMF can be obtained by “kernelizing” Convex NMF, and they explore the connection between NMF algorithms and K-means clustering to show that these NMF algorithms can be used for clustering in addition to matrix approximation. These new variants thereby broaden the application areas of the NMF algorithm and provide better interpretability of the matrix factors.<br />
<br />
==Introduction==<br />
Nonnegative matrix factorization (NMF), factorizes a matrix X into two matrices F and G, with the constraints that all the three matrices are non negative i.e. they contain only positive values or zero but no negative values, such as:<br />
<math>X_+ \approx F_+{G_+}^T</math><br />
where ,<math> X \in {\mathbb R}^{p \times n}</math> , <math> F \in {\mathbb R}^{p \times k}</math> , <math> G \in {\mathbb R}^{n \times k}</math><br />
<br />
The least square objective function of NMF is:<br />
<math> \mathbf {E(F,G) = \|X-FG^T\|^2}</math><br />
<br />
It has been shown that it is a NP hard problem and is convex in only F or only G but not convex in both F and G simultaneously <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref> Also, the factors F and G are not always sparse and many different sparsification schemes have been applied to NMF.<br />
<br />
==Semi NMF==<br />
In semi NMF, the matrix G is constrained to be nonnegative whereas the data matrix X and the basis vectors of F are unconstrained, that is:<br />
<br />
<math>X_{\pm} \approx F_{\pm}{G_+}^T</math><br />
<br />
They were motivated to this kind of factorization by K means clustering. The objective function of K means can be written in the form of matrix approximation as follows:<br />
<br />
<math> J_{K-means} = \sum_{i=1}^n \sum_{k=1}^K g_{ik}||x_i-f_k||^2=||X-FG^T||^2 </math> <br />
<br />
where, X is a mixed sign data matrix , F represents cluster centroids having both positive and negative entries and G represents cluster indicators having nonnegative entries.<br />
<br />
K means clustering objective function can be viewed as Semi NMF matrix approximation with relaxed constraint on G. That is G is allowed to range over values (0, 1) or (0, infinity).<br />
<br />
==Convex NMF==<br />
While in Semi NMF, there is no constraint imposed upon the basis vector F, but in Convex NMF, the columns of F are restricted to be a convex combination of columns of data matrix X, such as:<br />
<br />
<math> F=(f_1, \cdots , f_k)</math><br />
<br />
<math> f_l=w_{1l}x_1+ \cdots + w_{nl}x_n = Xw_l = XW</math> such that,<br />
<math> w_{ij}>0</math> <math>\forall i,j </math> <br />
<br />
In this factorization each column of matrix F is a weighted sum of certain data points. This implies that we can think of F as weighted cluster centroids.<br />
<br />
Convex NMF has the form:<br />
<math> X_{\pm} \approx X_{\pm}W_+{G_+}^T</math><br />
<br />
As F is considered to represent weighted cluster centroid, the constraint <math> \sum _{i=1}^n w_i = 1 </math> must be satisfied. But the authors do not actually state this.<br />
<br />
==Algorithms==<br />
The algorithms for these variants of NMF are based on iterative updating algorithms proposed for the original NMF, in which the factors are alternatively updated until convergence <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>. At each iteration of algorithm, the value for F or G is found by multiplying its current value by some factor. In <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>, they prove that by repeatedly applying these multiplicative update rules, the quality of approximation smoothly improves. That is, the update rule guarantees convergence to a locally optimal matrix factorization. In this paper, the same approach has been used by authors to present the algorithms for Semi NMF and Convex NMF.<br />
<br />
===Algorithm for Semi NMF===<br />
<br />
As already stated, the factors for semi NMF are computed by using an iterative updating algorithm that alternatively updates F and G till convergence is reached.<br />
<br />
*'''Step 1''': Initialize G<br />
**Obtain cluster indicators by K means clustering. <br />
*'''Step 2''': Update F, fixing G using the rule:<br />
<math>\mathbf{ F = XG(G^TG)^{-1}} </math><br />
<br />
*'''Step 3''': Update G, fixing F using the rule:<br />
<math> G_{ik} \leftarrow G_{ik} \sqrt{\frac {{(X^TF)^+}_{ik} + [G(F^TF)^-]_{ik}}{{(X^TF)^-}_{ik} + [G(F^TF)^+]_{ik}}}</math><br />
<br />
where, the positive and negative parts of a matrix are separated as:<br />
<math> {A_{ik}}^{+}=(|A_{ik}|+A_{ik})/2 </math> , <math> {A_{ik}}^{-}=(|A_{ik}|- A_{ik})/2 </math><br />
<br />
and, <math> A_{ik}= {A_{ik}}^{+} - {A_{ik}}^{-} </math><br />
<br />
<br><br />
'''Theorem 1:''' (A) The update rule for F gives the optimal solution to the <math> min_F \|X - FG^T\|^2 </math>, while G is fixed. (B) When F is fixed, the residual <math> \|X - FG^T\|^2 </math> decreases monotonically under the update rule for G.<br />
<br />
'''Proof:'''<br />
<br />
(Not going to prove the entire theorem but discuss the main parts)<br />
<br />
The objective function for semi NMF is:<br />
<math> J=\|X - FG^T\|^2= Tr(X^TX - 2X^TFG^T + GF^TFG^T) </math>.<br />
<br />
(A).The problem is unconstrained and the solution for F is trivial, given by:<br />
<math>dJ/dF = -2XG + 2FG^TG = 0</math><br />
<br>Therefore, <math> F = XG(G^TG)^{-1} </math><br />
<br />
(B) This is a constrained problem with an inequality (nonnegativity) constraint, so it is handled with Lagrange multipliers: the solution produced by the update rule must satisfy the KKT conditions at convergence, which establishes its correctness. Secondly, the update rule must cause the solution to converge. In the paper, the correctness and convergence of the update rule are proved as follows:<br />
<br />
<br><br />
<br />
(i)'''Correctness of solution:'''<br />
<br />
Lagrange function is: <math> L(G) = Tr (-2X^TFG^T + GF^TFG^T - \Beta G^T) </math> <br />
<br> where, <math> \Beta_{ij}</math> are the Lagrange multipliers enforcing the non negativity constraint on G.<br />
<br>Therefore, <math> \frac {\part L}{\part G}= -2X^TF + 2GF^TF - \Beta = 0 </math> <br />
<br> From complementary slackness condition, <math> (-2X^TF + 2GF^TF)_{ik}G_{ik} = \Beta_{ik}G_{ik} = 0. </math> <br />
<br> The above equation must be satisfied at convergence.<br />
<br> The update rule for G can be reduced to: <br />
<math> (-2X^TF + 2GF^TF)_{ik}{G_{ik}}^2 = 0 </math> at convergence.<br />
<br> The two conditions are equivalent: both require <math> (-2X^TF + 2GF^TF)_{ik} </math> to vanish wherever <math> G_{ik} \ne 0 </math>. Therefore the update rule satisfies the KKT fixed point condition.<br />
<br><br />
<br />
<br />
(ii)'''Convergence of the solution given by update rule:'''<br />
<br />
The authors used an auxiliary function approach to prove convergence, as done in <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>.<br />
<br />
'''Definition of auxiliary function''': A function G(h,h') is called an auxiliary function of F(h) if the conditions <math> G(h,h') \ge F(h) </math> and <math> G(h,h) = F(h) </math> are satisfied. <br />
<br />
The auxiliary function is a useful concept because of the following lemma:<br />
<br><br />
<br />
'''Lemma:''' If G is an auxiliary function, then F is nonincreasing under the update <math>\mathbf{ h^{t+1} = \arg \min_h G(h,h^t)} </math><br />
<br />
[[File:auxiliary.jpeg|left|thumb|800px|Figure 1]]<br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
Adapted from <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”.</ref>.<br />
<br />
<br> That is, minimizing the auxiliary function <math> G(h,h^t) \ge F(h) </math> guarantees that <math> F(h^{t+1}) \le F(h^t) </math> for <math> \mathbf {h^{t+1} = \arg \min_h G(h, h^t) }</math> <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”.</ref>.<br />
<br />
The authors therefore find an auxiliary function, and its global minimum, for the cost function of Semi NMF.<br />
<br />
The cost function for Semi NMF can be written as: <br />
<math> \mathbf {J(H) = Tr (-2H^TB^{+} + 2H^TB^{-} + HA^{+}H^T - HA^{-}H^T)} </math> where <math> A = F^TF , B = X^TF , H = G </math>. <br />
<br />
The auxiliary function of J (H) is: <br><br />
<math> Z(H,H') = -\sum_{ik}2{B_{ik}}^{+}H'_{ik}(1+ \log \frac {H_{ik}}{H'_{ik}}) + \sum_{ik} {B^-}_{ik} \frac {{H^2}_{ik}+{{H'}^2}_{ik}}{{H'}_{ik}} + \sum_{ik} \frac {(H'A^{+})_{ik}{H^2}_{ik}}{{H'}_{ik}} - \sum_{ikl} {A_{kl}}^{-}{H'}_{ik}{H'}_{il} (1+ \log \frac {H_{ik}H_{il}}{H'_{ik}H'_{il}}) </math> <br />
<br />
Z (H, H') is convex in H and its global minimum is:<br><br />
<math> H_{ik} = \arg \min_H Z(H,H') = H'_{ik}\sqrt {\frac {{B_{ik}}^{+} + (H'A^{-})_{ik}}{{B_{ik}}^{-} + (H'A^{+})_{ik}}} </math><br />
<br />
(The derivation of auxiliary function and its minimum can be found in the paper <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization” </ref>.)<br />
<br />
===Algorithm for Convex NMF===<br />
Here again, the factors G and W are computed by alternating updates until convergence.<br />
*'''Step 1''': Initialize G and W. There are two ways in which the initialization can be done.<br />
**'''K means clustering''': When K means clustering is performed on the data set, cluster indicators <math> H = (h_1, \cdots , h_K) </math> are obtained, and G is initialized to H. The cluster centroids can then be computed from H as <math>\mathbf {f_k = Xh_k / n_k} </math>, or <math> F=XH{D_n}^{-1}</math> where <math> D_n = diag (n_1, \cdots, n_K) </math>. Since in convex NMF <math>F = XW </math>, we get <math> W=H{D_n}^{-1}</math> <br />
**'''Previous NMF or Semi NMF solution''': The factor G is known in this case, and a least squares solution for W is obtained by solving <math> X=XWG^T</math>, giving <math> W=G(G^TG)^{-1} </math><br />
<br />
*'''Step 2''': Update G, while fixing W using the rule<br />
<math> G_{ik} \leftarrow G_{ik} \sqrt{\frac {[(X^TX)^+W]_{ik} + [GW^T(X^TX)^-W]_{ik}} {[(X^TX)^-W]_{ik} + [GW^T(X^TX)^+W]_{ik}} } </math><br />
*'''Step 3''': Update W, while fixing G using the rule<br />
<math> W_{ik} \leftarrow W_{ik} \sqrt{\frac {[(X^TX)^+G]_{ik} + [(X^TX)^-WG^TG]_{ik}} {[(X^TX)^-G]_{ik} + [(X^TX)^+WG^TG]_{ik}} } </math><br />
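These alternating steps can be sketched as follows (an illustrative NumPy sketch, not the authors' code; a random nonnegative initialization replaces the K means initialization of Step 1, and the small constant <code>eps</code> guards against division by zero):<br />

```python
import numpy as np

def pos(A):
    return (np.abs(A) + A) / 2.0  # elementwise positive part

def neg(A):
    return (np.abs(A) - A) / 2.0  # elementwise negative part

def convex_nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Alternating updates for Convex NMF: X (p x n) ~ X W G^T with W, G >= 0."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    G = np.abs(rng.standard_normal((n, k)))
    W = np.abs(rng.standard_normal((n, k)))
    XtX = X.T @ X  # the updates touch X only through X^T X
    XtXp, XtXn = pos(XtX), neg(XtX)
    for _ in range(n_iter):
        # Step 2: update G while W is fixed
        GWt = G @ W.T
        G = G * np.sqrt((XtXp @ W + GWt @ (XtXn @ W)) /
                        (XtXn @ W + GWt @ (XtXp @ W) + eps))
        # Step 3: update W while G is fixed
        GtG = G.T @ G
        W = W * np.sqrt((XtXp @ G + XtXn @ W @ GtG) /
                        (XtXn @ G + XtXp @ W @ GtG + eps))
    return W, G
```

Running more iterations should never increase the residual <math>\|X - XWG^T\|</math>, in line with Theorems 2 and 3 below.<br />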
<br />
The objective function to be minimized for convex NMF is:<br />
<br />
<math> \mathbf {J=\|X-XWG^T\|^2= Tr(X^TX- 2G^TX^TXW + W^TX^TXWG^TG)} </math>.<br />
<br />
'''Theorem 2:''' Fixing W, under the update rule for G, (A) the residual <math>\|X - XWG^T \|^2 </math> decreases monotonically (non-increasing), and (B) the solution converges to a KKT fixed point.<br />
<br />
The correctness and convergence of these rules are demonstrated in a manner similar to Semi NMF, by substituting F = XW.<br />
<br />
'''Theorem 3:''' Fixing G, under the update rule for W, (A) the residual <math>\|X - XWG^T \|^2 </math> decreases monotonically (non-increasing), and (B) the solution converges to a KKT fixed point.<br />
<br />
The correctness is demonstrated by minimizing the objective function with respect to W and then obtaining KKT fixed point condition as:<br />
<br />
<math> \mathbf {(-X^TXG + X^TXWG^TG)_{ik}W_{ik} = 0 }</math><br />
<br />
<br> At convergence, the update rule for W can be shown to satisfy:<br />
<br />
<math>\mathbf { (-X^TXG + X^TXWG^TG)_{ik}{W_{ik}}^2 = 0 }</math><br />
<br />
<br> Therefore, the update rule for W satisfies KKT condition.<br><br />
<br />
Convergence of these rules is demonstrated in a manner similar to Semi NMF by finding an auxiliary function and its global minimum.<br />
<br />
==Sparsity of Convex NMF==<br />
<br />
NMF has been shown to learn a parts-based representation and therefore tends to have sparse factors. However, there is no means to control the degree of sparseness, and many sparsification methods have been applied to NMF in order to obtain a better parts-based representation <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization” </ref> , <ref name='Simon D. H' > Simon D. H, Ding C, and He X; “On the Equivalence of Nonnegative Matrix Factorization and Spectral Clustering” </ref>. In contrast, the authors of this paper show that the factors of Convex NMF are naturally sparse.<br />
<br />
<br> The convex NMF problem can be written as:<br />
<br />
<math> \min_{W,G \ge 0}||X-XWG^T||^2 = ||X(I-WG^T)||^2= Tr (I-GW^T)X^TX(I-WG^T) </math><br />
<br />
<br> by SVD of <math> X </math> we have <math> X = U \Sigma V^T</math> and thus, <math> X^TX = \sum_k {\sigma _k}^2v_k{v_k}^T.</math><br />
<br />
<br> Therefore, <math> \min_{W,G \ge 0} Tr (I-GW^T)X^TX(I-WG^T) = \sum_k {\sigma_k}^2||{v_k}^T(I-WG^T)||^2 </math> s.t. <math>W \in {\mathbb R_+}^{n \times k} </math> , <math>G \in {\mathbb R_+}^{n \times k}</math><br />
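The SVD identity above is easy to verify numerically (an illustrative sketch; W and G here are arbitrary nonnegative matrices rather than optimized factors):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8))            # p x n data matrix
W = np.abs(rng.standard_normal((8, 2)))    # arbitrary nonnegative W
G = np.abs(rng.standard_normal((8, 2)))    # arbitrary nonnegative G

M = np.eye(8) - W @ G.T
U, s, Vt = np.linalg.svd(X, full_matrices=False)

lhs = np.linalg.norm(X @ M) ** 2
rhs = sum(s[k] ** 2 * np.linalg.norm(Vt[k] @ M) ** 2 for k in range(len(s)))
assert np.isclose(lhs, rhs)  # ||X(I-WG^T)||^2 = sum_k sigma_k^2 ||v_k^T(I-WG^T)||^2
```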
<br />
They use the following Lemma to show that the above optimization problem gives sparse W and G.<br />
<br />
<br>'''Lemma:''' The solution of the optimization problem <math> \min_{W,G \ge 0}||I-WG^T||^2 </math> s.t. <math>W, G \in {\mathbb R_+}^{n \times K}</math> is given by W = G = any K columns of <math>(e_1, \cdots , e_n)</math>, where <math>e_k</math> is the k-th standard basis vector: <math> (e_k)_{i \ne k} = 0 </math> , <math> (e_k)_{i = k} = 1 </math><br />
<br />
<br> According to this Lemma, the solution to <math> \min_{W,G \ge 0}\|I - WG^T\|^2 </math> consists of the sparsest possible rank-K matrices W and G.<br />
<br />
In the above equation, we can write: <math> \| I - WG^T \|^2 = \sum_k \|{e_k}^T (I - WG^T)\|^2 </math>.<br />
<br />
Therefore, the projection of <math> ( I - WG^T ) </math> onto the principal components carries more weight, while its projection onto the non-principal components carries less weight. This implies that the factors W and G are sparse in the principal component subspace and less sparse in the non-principal component subspace.<br />
<br />
==Kernel NMF==<br />
Consider a mapping <math> \phi </math> that maps each point to a higher dimensional feature space, <math> \phi: x_i \rightarrow \phi(x_i)</math>. The factors of the kernel form of NMF or Semi NMF, <math> \phi (X) \approx FG^T </math>, would be difficult to compute because the mapping <math>\phi </math> must be known explicitly.<br />
<br />
This difficulty is overcome by convex NMF, which has the form <math> \phi (X) \approx \phi (X) WG^T </math>, so the objective to be minimized becomes<br />
<br> <math> \|\phi (X)-\phi(X)WG^T\|^2 = Tr (K-2G^TKW+W^TKWG^TG) </math> where <math> K = \phi^T(X)\phi(X) </math> is the kernel.<br />
<br />
Also, the update rules for the convex NMF algorithm (discussed above) depend only on <math> X^TX </math> and therefore convex NMF can be '''kernelized'''.<br />
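As a sketch of this observation, the Convex NMF updates can be run against any positive semidefinite kernel matrix K in place of <math> X^TX </math> (illustrative code, not from the paper):<br />

```python
import numpy as np

def pos(A):
    return (np.abs(A) + A) / 2.0  # elementwise positive part

def neg(A):
    return (np.abs(A) - A) / 2.0  # elementwise negative part

def kernel_convex_nmf(K, k, n_iter=200, eps=1e-9, seed=0):
    """Convex NMF driven only by a kernel matrix K = phi(X)^T phi(X)."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    G = np.abs(rng.standard_normal((n, k)))
    W = np.abs(rng.standard_normal((n, k)))
    Kp, Kn = pos(K), neg(K)
    for _ in range(n_iter):
        GWt = G @ W.T
        G = G * np.sqrt((Kp @ W + GWt @ (Kn @ W)) /
                        (Kn @ W + GWt @ (Kp @ W) + eps))
        GtG = G.T @ G
        W = W * np.sqrt((Kp @ G + Kn @ W @ GtG) /
                        (Kn @ G + Kp @ W @ GtG + eps))
    return W, G
```

With the linear kernel <math> K = X^TX </math> this reduces exactly to the Convex NMF algorithm above; with, e.g., an RBF kernel it performs the factorization implicitly in the feature space.<br />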
<br />
==Cluster NMF==<br />
<br />
If the factor G is considered to contain posterior cluster probabilities, then F, which represents the cluster centroids, is given as:<br />
<br> <math> \mathbf {f_k = Xg_k / n_k} </math> or <math> F = XG{D_n}^{-1}</math> where <math> D_n = diag (n_1, \cdots, n_K) </math>.<br />
<br>Therefore, the factorization becomes <math> X \approx XG{D_n}^{-1}G^T </math>, or simply <math> X \approx X G G^T </math>, because NMF is invariant to diagonal rescaling.<br />
<br />
This factorization is called Cluster NMF because it has the same degrees of freedom as a standard clustering problem, namely the cluster indicator G.<br />
<br />
==Relationship between NMF (its variants) and K means clustering==<br />
<br />
NMF and all of the variants discussed above can be interpreted as K means clustering by imposing the additional constraint <math> G^TG=I </math>; that is, each row of G contains only one nonzero element, so each data point belongs to exactly one cluster.<br />
<br />
'''Theorem:''' G-orthogonal NMF, Semi NMF, Convex NMF, Cluster NMF and Kernel NMF are all relaxations of K means clustering.<br />
<br />
'''Proof:'''<br />
<br />
In all the above five cases of NMF, it can be shown that the objective function can be reduced to:<br />
<math> \mathbf {J = Tr(X^TX -G^TKG)} </math> when <math> G^TG = I </math> and where <math> K = X^TX </math> or <math> K = \phi^T(X)\phi(X) </math>. As the first term is a constant, the minimization problem actually becomes: <br><br />
<math> \max_{G^TG = I} Tr(G^TKG) </math><br />
<br />
The above objective function is the same as the objective function for kernel K means clustering <ref name='Simon D. H'> Simon D. H, Ding C, and He X; “On the Equivalence of Nonnegative Matrix Factorization and Spectral Clustering” </ref>.<br />
<br />
<br> Even without the orthogonality constraint, these NMF algorithms can be considered to be '''soft''' versions of K means clustering. That is each data point can be considered to fractionally belong to more than one cluster.<br />
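This reduction is easy to check numerically: for a hard assignment encoded by a normalized indicator matrix G with <math> G^TG = I </math>, the K means objective equals <math> Tr(X^TX) - Tr(G^TKG) </math> (an illustrative sketch with a fixed cluster assignment):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 12))        # p x n data matrix
labels = np.repeat(np.arange(3), 4)     # fixed assignment of 12 points to K = 3 clusters

# Normalized indicator: G[i, c] = 1/sqrt(n_c) if point i is in cluster c, so G^T G = I
G = np.zeros((12, 3))
G[np.arange(12), labels] = 1.0
G = G / np.sqrt(G.sum(axis=0, keepdims=True))
assert np.allclose(G.T @ G, np.eye(3))

# K means objective: within-cluster sum of squared distances to the centroids
J = sum(
    ((X[:, labels == c] - X[:, labels == c].mean(axis=1, keepdims=True)) ** 2).sum()
    for c in range(3)
)

K = X.T @ X  # linear kernel / Gram matrix
assert np.isclose(J, np.trace(K) - np.trace(G.T @ K @ G))
```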
<br />
==General properties of NMF algorithms==<br />
*They converge to a local minimum, not the global minimum.<br />
*NMF factors are invariant to diagonal rescaling; this degree of freedom is always present.<br />
*The convergence rate of the multiplicative algorithms is first order.<br />
*There are many ways to initialize NMF; here, the relationship between NMF and relaxed K means clustering is used.<br />
<br />
==Experimental Results==<br />
<br />
The authors present experimental results on synthetic data sets showing that the factors given by Convex NMF more closely resemble cluster centroids than those given by Semi NMF, although Semi NMF gives better accuracy than Convex NMF. They also compare NMF, Convex NMF and Semi NMF against K means clustering on real datasets, and conclude that all of these matrix factorizations give better clustering accuracy than K means on all of the datasets studied.<br />
<br />
=== A. Synthetic dataset ===<br />
One of the main goals here is to show that the Convex-NMF variants may provide subspace factorizations with more interpretable factors than those obtained by other NMF variants (or PCA). In particular, we expect that in some cases the factor F will be interpretable as containing cluster representatives (centroids) and G as encoding cluster indicators. <br />
<center>[[File:Convex-Fig1.JPG]]</center><br />
In Figure 1, we randomly generate four two-dimensional datasets with three clusters each. Computing both the Semi-NMF and Convex-NMF factorizations, we display the resulting F factors. We see that the Semi-NMF factors tend to lie distant from the cluster centroids. On the other hand, the Convex-NMF factors almost always lie within the clusters.<br />
<br />
==Conclusion==<br />
In this paper: <br />
*A number of new NMF algorithms are proposed, which extend the applications of NMF.<br />
*They deal with mixed sign data.<br />
*The connection between NMF (its variants) and K means clustering was analyzed.<br />
*The matrix factors are shown to have convenient interpretation in terms of clustering.<br />
<br />
==References==<br />
<references/></div>Amirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Convex-Fig1.JPG&diff=3837File:Convex-Fig1.JPG2009-08-05T03:52:59Z<p>Amir: </p>
<hr />
<div></div>Amirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=convex_and_Semi_Nonnegative_Matrix_Factorization&diff=3836convex and Semi Nonnegative Matrix Factorization2009-08-05T03:34:13Z<p>Amir: /* A. Synthetic dataset */</p>
<hr />
<div>In the paper ‘Convex and semi non negative matrix factorization’, Jordan et al <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization”. </ref> have proposed new NMF like algorithms on mixed sign data, called Semi NMF and Convex NMF. They also show that a kernel form of NMF can be obtained by ‘kernelizing’ convex NMF. They explore the connection between NMF algorithms and K means clustering to show that these NMF algorithms can be used for clustering in addition to matrix approximation. These new variants of algorithm thereby, broaden the application areas of NMF algorithm and also provide better interpretability to matrix factors.<br />
<br />
==Introduction==<br />
Nonnegative matrix factorization (NMF), factorizes a matrix X into two matrices F and G, with the constraints that all the three matrices are non negative i.e. they contain only positive values or zero but no negative values, such as:<br />
<math>X_+ \approx F_+{G_+}^T</math><br />
where ,<math> X \in {\mathbb R}^{p \times n}</math> , <math> F \in {\mathbb R}^{p \times k}</math> , <math> G \in {\mathbb R}^{n \times k}</math><br />
<br />
The least square objective function of NMF is:<br />
<math> \mathbf {E(F,G) = \|X-FG^T\|^2}</math><br />
<br />
It has been shown that it is a NP hard problem and is convex in only F or only G but not convex in both F and G simultaneously <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref> Also, the factors F and G are not always sparse and many different sparsification schemes have been applied to NMF.<br />
<br />
==Semi NMF==<br />
In semi NMF, the matrix G is constrained to be nonnegative whereas the data matrix X and the basis vectors of F are unconstrained, that is:<br />
<br />
<math>X_{\pm} \approx F_{\pm}{G_+}^T</math><br />
<br />
They were motivated to this kind of factorization by K means clustering. The objective function of K means can be written in the form of matrix approximation as follows:<br />
<br />
<math> J_{K-means} = \sum_{i=1}^n \sum_{k=1}^K g_{ik}||x_i-f_k||^2=||X-FG^T||^2 </math> <br />
<br />
where, X is a mixed sign data matrix , F represents cluster centroids having both positive and negative entries and G represents cluster indicators having nonnegative entries.<br />
<br />
K means clustering objective function can be viewed as Semi NMF matrix approximation with relaxed constraint on G. That is G is allowed to range over values (0, 1) or (0, infinity).<br />
<br />
==Convex NMF==<br />
While in Semi NMF, there is no constraint imposed upon the basis vector F, but in Convex NMF, the columns of F are restricted to be a convex combination of columns of data matrix X, such as:<br />
<br />
<math> F=(f_1, \cdots , f_k)</math><br />
<br />
<math> f_l=w_{1l}x_1+ \cdots + w_{nl}x_n = Xw_l = XW</math> such that,<br />
<math> w_{ij}>0</math> <math>\forall i,j </math> <br />
<br />
In this factorization each column of matrix F is a weighted sum of certain data points. This implies that we can think of F as weighted cluster centroids.<br />
<br />
Convex NMF has the form:<br />
<math> X_{\pm} \approx X_{\pm}W_+{G_+}^T</math><br />
<br />
As F is considered to represent weighted cluster centroid, the constraint <math> \sum _{i=1}^n w_i = 1 </math> must be satisfied. But the authors do not actually state this.<br />
<br />
==Algorithms==<br />
The algorithms for these variants of NMF are based on iterative updating algorithms proposed for the original NMF, in which the factors are alternatively updated until convergence <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>. At each iteration of algorithm, the value for F or G is found by multiplying its current value by some factor. In <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>, they prove that by repeatedly applying these multiplicative update rules, the quality of approximation smoothly improves. That is, the update rule guarantees convergence to a locally optimal matrix factorization. In this paper, the same approach has been used by authors to present the algorithms for Semi NMF and Convex NMF.<br />
<br />
===Algorithm for Semi NMF===<br />
<br />
As already stated, the factors for semi NMF are computed by using an iterative updating algorithm that alternatively updates F and G till convergence is reached.<br />
<br />
*'''Step 1''': Initialize G<br />
**Obtain cluster indicators by K means clustering. <br />
*'''Step 2''': Update F, fixing G using the rule:<br />
<math>\mathbf{ F = XG(G^TG)^{-1}} </math><br />
<br />
*'''Step 3''': Update G, fixing F using the rule:<br />
<math> G_{ik} \leftarrow G_{ik} \sqrt{\frac {{(X^TF)^+}_{ik} + [G(F^TF)^-]_{ik}}{{(X^TF)^-}_{ik} + [G(F^TF)^+]_{ik}}}</math><br />
<br />
where, the positive and negative parts of a matrix are separated as:<br />
<math> {A_{ik}}^{+}=(|A_{ik}|+A_{ik})/2 </math> , <math> {A_{ik}}^{-}=(|A_{ik}|- A_{ik})/2 </math><br />
<br />
and, <math> A_{ik}= {A_{ik}}^{+} - {A_{ik}}^{-} </math><br />
<br />
<br><br />
'''Theorem 1:''' (A) The update rule for F gives the optimal solution to the <math> min_F \|X - FG^T\|^2 </math>, while G is fixed. (B) When F is fixed, the residual <math> \|X - FG^T\|^2 </math> decreases monotonically under the update rule for G.<br />
<br />
'''Proof:'''<br />
<br />
(Not going to prove the entire theorem but discuss the main parts)<br />
<br />
The objective function for semi NMF is:<br />
<math> J=\|X - FG^T\|^2= Tr(X^TX - 2X^TFG^T + GF^TFG^T) </math>.<br />
<br />
(A).The problem is unconstrained and the solution for F is trivial, given by:<br />
<math>dJ/dF = -2XG + 2FG^TG = 0</math><br />
<br>Therefore, <math> F = XG(G^TG)^{-1} </math><br />
<br />
(B).This is a constraint problem having an inequality constraint. Because it is a constraint problem, solved by using Lagrange multipliers but the solution for the update rule must satisfy KKT condition at convergence. This implies the correctness of solution. Secondly, the update rule should cause the solution to converge. In the paper, correctness and convergence of update rule is proved as follows:<br />
<br />
<br><br />
<br />
(i)'''Correctness of solution:'''<br />
<br />
Lagrange function is: <math> L(G) = Tr (-2X^TFG^T + GF^TFG^T - \Beta G^T) </math> <br />
<br> where, <math> \Beta_{ij}</math> are the Lagrange multipliers enforcing the non negativity constraint on G.<br />
<br>Therefore, <math> \frac {\part L}{\part G}= -2X^TF + 2GF^TF - \Beta = 0 </math> <br />
<br> From complementary slackness condition, <math> (-2X^TF + 2GF^TF)_{ik}G_{ik} = \Beta_{ik}G_{ik} = 0. </math> <br />
<br> The above equation must be satisfied at convergence.<br />
<br> The update rule for G can be reduced to: <br />
<math> (-2X^TF + 2GF^TF)_{ik}{G_{ik}}^2 = 0 </math> at convergence.<br />
<br> Both equations are identical and therefore the update rule satisfies the KKT fixed point condition.<br />
<br><br />
<br />
<br />
(ii)'''Convergence of the solution given by update rule:'''<br />
<br />
The authors used an auxiliary function approach to prove convergence, as done in <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>.<br />
<br />
'''Definition of auxiliary function''': A function G(h,h') is called an auxiliary function of F(h) if conditions; <math> G (h,h^') \ge F(h) </math> and <math> G (h,h) = F(h) </math> are satisfied. <br />
<br />
The auxiliary function is a useful concept because of the following lemma:<br />
<br><br />
<br />
'''Lemma:''' If G is an auxiliary function, then F is nonincreasing under the update <math>\mathbf{ h^{t+1} = \arg \min_h G(h,h^t)} </math><br />
<br />
[[File:auxiliary.jpeg|left|thumb|800px|Figure 1]]<br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
Adapted from <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”.</ref>.<br />
<br />
<br> That is, minimizing the auxiliary function <math> G(h,h^t) \ge F(h) </math> guarantees that <math> F(h^{t+1}) \le F(h^t) </math> for <math> \mathbf {h^{n+1} = \arg \min_h G(h, h^t) }</math> <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”.</ref>.<br />
<br />
Therefore the authors of the paper, found an auxiliary function and its global minimum for the cost function of Semi NMF.<br />
<br />
The cost function for Semi NMF can be written as: <br />
<math> \mathbf {J(H) = Tr (-2H^TB^{+} + 2H^TB^{-} + HA^{+}H^T + HA^{-}H^T)} </math> where <math> A = F^TF , B = X^TF , H = G </math>. <br />
<br />
The auxiliary function of J (H) is: <br><br />
<math> Z(H,H') = -\sum_{ik}2{B_{ik}}^{+}H'_{ik}(1+ \log \frac {H_{ik}}{H'_{ik}}) + \sum_{ik} {B^-}_{ik} \frac {{H^2}_{ik}+{{H'}^2}_{ik}}{{H'}_{ik}} + \sum_{ik} \frac {(H'A^{+})_{ik}{H^2}_{ik}}{{H'}_{ik}} - \sum_{ik} {A_{kl}}^{-}{H'}_{ik}{H'}_{il} (1+ \log \frac {H_{ik}H_{il}}{H'_{ik}H'_{il}}) </math> <br />
<br />
Z (H, H') is convex in H and its global minimum is:<br><br />
<math> H_{ik} = arg \min_H Z(H,H') = H'_{ik}\sqrt {\frac {{B_{ik}}^{+} + (H'A^{-})_{ik}}{{B_{ik}}^{-} + (H'A^{+})_{ik}}} </math><br />
<br />
(The derivation of auxiliary function and its minimum can be found in the paper <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization” </ref>.)<br />
<br />
===Algorithm for Convex NMF===<br />
Here, again the factors G and W are computed iteratively by alternative updating until convergence.<br />
*'''Step 1''': Initialize G and W. There are two ways in which the initialization can be done.<br />
**'''K means clustering''': When K means clustering is done on the data set, cluster indicators <math> H = (h_1, \cdots , h_K) </math>are obtained. Then G is initialized to be equal to H. Then cluster centroids can be computed from H, as <math>\mathbf {f_k = Xh_k / n_k} </math> or <math> F=XH{D_n}^{-1}</math> where <math> D_n = diag (n_1, \cdots, n_K) </math>. And as, in convex NMF: <math>F = XW </math> , we get <math> W=H{D_n}^{-1}</math> <br />
**'''Previous NMF or Semi NMF solution''': The factor G is known in this case and a least square solution for W is obtained by solving <math> X=XWG^T</math>. Therefore, <math> W=G(G^TG)^{-1} </math><br />
<br />
*'''Step 2''': Update G, while fixing W using the rule<br />
<math> G_{ik} \leftarrow G_{ik} \sqrt{\frac {[(X^TX)^+W]_{ik} + [GW^T(X^TX)^-W]_{ik}} {[(X^TX)^-W]_{ik} + [GW^T(X^TX)^+W]_{ik}} } </math><br />
*'''Step 3''': Update W, while fixing G using the rule<br />
<math> W_{ik} \leftarrow W_{ik} \sqrt{\frac {[(X^TX)^+G]_{ik} + [(X^TX)^-WG^TG]_{ik}} {[(X^TX)^-G]_{ik} + [(X^TX)^+WG^TG]_{ik}} } </math><br />
<br />
The objective function to be minimized for convex NMF is:<br />
<br />
<math> \mathbf {J=\|X-XWG^T\|^2= Tr(X^TX- 2G^TX^TXW + W^TX^TXWG^TG)} </math>.<br />
<br />
'''Theorem 2:''' Fixing W, under the update rule for G, (A) the residual <math>\|X - XWG^T \|^2 </math> decreases monotonically (non-increasing), and (B) the solution converges to a KKT fixed point.<br />
<br />
The correctness and convergence of these rules is demonstrated in a manner similar to Semi NMF by replacing F=XW.<br />
<br />
'''Theorem 3:''' Fixing G, under the update rule for W, (A) the residual <math>\|X - XWG^T \|^2 </math> decreases monotonically (non-increasing), and (B) the solution converges to a KKT fixed point.<br />
<br />
The correctness is demonstrated by minimizing the objective function with respect to W and then obtaining KKT fixed point condition as:<br />
<br />
<math> \mathbf {(-X^TXG + X^TXWG^TG)_{ik}W_{ik} = 0 }</math><br />
<br />
<br> At convergence, the update rule for W can be shown to satisfy:<br />
<br />
<math>\mathbf { (-X^TXG + X^TXWG^TG)_{ik}{W_{ik}}^2 = 0 }</math><br />
<br />
<br> Therefore, the update rule for W satisfies KKT condition.<br><br />
<br />
Convergence of these rules is demonstrated in a manner similar to Semi NMF by finding an auxiliary function and its global minimum.<br />
<br />
==Sparsity of Convex NMF==<br />
<br />
NMF is shown to learn parts based representation and therefore has sparse factors. But there is no means to control the degree of sparseness and many sparsification methods have been applied to NMF in order to obtain better parts based representation <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization” </ref> , <ref name='Simon D. H' > Simon D. H, Ding C, and He X; “On the Equivalence of Nonnegative Matrix Factorization and Spectral Clustering” </ref>. However, in contrast the authors of this paper show that factors of Convex NMF are naturally sparse.<br />
<br />
<br> The convex NMF problem can be written as:<br />
<br />
<math> \min_{W,G \ge 0}||X-XWG^T||^2 = ||X(I-WG^T)||^2= Tr (I-GW^T)X^TX(I-WG^T) </math><br />
<br />
<br> by SVD of <math> X </math> we have <math> X = U \Sigma V^T</math> and thus, <math> X^TX = \sum_k {\sigma _k}^2v_k{v_k}^T.</math><br />
<br />
<br> Therefore, <math> \min_{W,G \ge 0} Tr (I-GW^T)X^TX(I-WG^T) = \sum_k {\sigma_k}^2||{v_k}^T(I-WG^T)||^2 </math> s.t. <math>W \in {\mathbb R_+}^{n \times k} </math> , <math>G \in {\mathbb R_+}^{n \times k}</math><br />
<br />
They use the following Lemma to show that the above optimization problem gives sparse W and G.<br />
<br />
<br>'''Lemma:''' The solution of <math> \min_{W,G \ge 0}||I-WG^T||^2 </math> s.t. <math>W, G \in {\mathbb R_+}^{n \times K}</math> optimization problem is given by W = G = any K columns of (e1,…,eK), where ek is a basis vector. <math> (e_k)_{i \ne k} = 0 </math> , <math> (e_k)_{i = k} = 1 </math><br />
<br />
<br> According to this Lemma, the solution to <math> \min_{W,G \ge 0}\|I - WG^T\|^2 </math> are the sparsest possible rank-K matrices W and G.<br />
<br />
In the above equation, we can write: <math> \| I - WG^T \|^2 = \sum_k \|{e_k}^T (I - WG^T)\|^2 </math>.<br />
<br />
Therfore, projection of <math> ( I - WG^T ) </math> onto the principal components has more weight while its projection on non principal components has less weight. This implies that factors W and G are sparse in the principal component subspace and less sparse in the non principal component subspace.<br />
<br />
==Kernel NMF==<br />
Consider a mapping <math> \phi </math> that maps a point to a higher dimensional feature space, such that <math> \phi: x_i \rightarrow \phi(x_i)</math>. The factors for the kernel form of NMF or semi NMF : <math> \phi (X) = FG^T </math> would be difficult to compute as we need to know the mapping <math>\phi </math> explicitly.<br />
<br />
This difficulty is overcome in the convex NMF, as it has the form: <math> \phi: (X) = \phi (X) WG^T </math> and therefore the objective to be minimized becomes,<br />
<br> <math> \|\phi (X)-\phi(X)WG^T\|^2 = Tr (K-2G^TKW+W^TKWG^TG) </math> where <math> K = \phi^T(X)\phi(X) </math> is the kernel.<br />
<br />
Also, the update rules for the convex NMF algorithm (discussed above) depend only on <math> X^TX </math> and therefore convex NMF can be '''kernelized'''.<br />
<br />
==Cluster NMF==<br />
<br />
The factor G is considered to contain posterior cluster probabilities, then F, which represents cluster centroids is given as:<br />
<br> <math> \mathbf {f_k = Xg_k / n_k} </math> or <math> F = XG{D_n}^{-1}</math> where <math> D_n = diag (n_1, \cdots, n_K) </math>.<br />
<br>Therefore, the factorization becomes, <math> X = XG{D_n}^{-1}G^T </math> or <math> X = X G G^T </math>. This is because NMF is invariant to diagonal rescaling.<br />
<br />
This factorization is called Cluster NMF as it has the same degree of freedom as in any standard clustering problem, which is G (cluster indicator).<br />
<br />
==Relationship between NMF (its variants) and K means clustering==<br />
<br />
NMF and all its variants discussed above can be interpreted as K means clustering by imposing an additional constraint <math> G^TG=I </math>, that is in each row of G there is only one nonzero element, which implies each data point can belong to only one cluster.<br />
<br />
</div>Amirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=convex_and_Semi_Nonnegative_Matrix_Factorization&diff=3835convex and Semi Nonnegative Matrix Factorization2009-08-05T03:33:52Z<p>Amir: /* Experimental Results */</p>
<hr />
<div>In the paper ‘Convex and semi nonnegative matrix factorization’, Ding, Li, and Jordan <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization”. </ref> propose new NMF-like algorithms for mixed-sign data, called Semi NMF and Convex NMF. They also show that a kernel form of NMF can be obtained by ‘kernelizing’ Convex NMF, and they explore the connection between NMF algorithms and K means clustering to show that these NMF algorithms can be used for clustering in addition to matrix approximation. These new variants thereby broaden the application areas of NMF and provide better interpretability of the matrix factors.<br />
<br />
==Introduction==<br />
Nonnegative matrix factorization (NMF) factorizes a matrix X into two matrices F and G, with the constraint that all three matrices are nonnegative, i.e. they contain only positive or zero entries and no negative entries:<br />
<math>X_+ \approx F_+{G_+}^T</math><br />
where <math> X \in {\mathbb R}^{p \times n}</math>, <math> F \in {\mathbb R}^{p \times k}</math>, <math> G \in {\mathbb R}^{n \times k}</math><br />
<br />
The least square objective function of NMF is:<br />
<math> \mathbf {E(F,G) = \|X-FG^T\|^2}</math><br />
<br />
It has been shown that this is an NP-hard problem, and that the objective is convex in F only or in G only, but not in both F and G simultaneously <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>. Also, the factors F and G are not always sparse, and many different sparsification schemes have been applied to NMF.<br />
<br />
==Semi NMF==<br />
In semi NMF, the matrix G is constrained to be nonnegative whereas the data matrix X and the basis vectors of F are unconstrained, that is:<br />
<br />
<math>X_{\pm} \approx F_{\pm}{G_+}^T</math><br />
<br />
The authors were motivated toward this kind of factorization by K means clustering. The objective function of K means can be written as a matrix approximation as follows:<br />
<br />
<math> J_{K-means} = \sum_{i=1}^n \sum_{k=1}^K g_{ik}||x_i-f_k||^2=||X-FG^T||^2 </math> <br />
<br />
where X is a mixed-sign data matrix, F represents the cluster centroids (having both positive and negative entries) and G represents the cluster indicators (having nonnegative entries).<br />
<br />
The K means clustering objective function can thus be viewed as Semi NMF matrix approximation with a relaxed constraint on G: the entries of G are relaxed from the values {0, 1} to any nonnegative values in <math>[0, \infty)</math>.<br />
<br />
==Convex NMF==<br />
While in Semi NMF, there is no constraint imposed upon the basis vector F, but in Convex NMF, the columns of F are restricted to be a convex combination of columns of data matrix X, such as:<br />
<br />
<math> F=(f_1, \cdots , f_k)</math><br />
<br />
<math> f_l=w_{1l}x_1+ \cdots + w_{nl}x_n = Xw_l </math>, i.e. <math> F = XW </math>, such that<br />
<math> w_{il} \ge 0</math> <math>\forall i,l </math> <br />
<br />
In this factorization each column of matrix F is a weighted sum of certain data points. This implies that we can think of F as weighted cluster centroids.<br />
<br />
Convex NMF has the form:<br />
<math> X_{\pm} \approx X_{\pm}W_+{G_+}^T</math><br />
<br />
As F is considered to represent weighted cluster centroids, the constraint <math> \sum _{i=1}^n w_{il} = 1 </math> (for each column <math>l</math>) should also be satisfied, but the authors do not actually impose this.<br />
<br />
==Algorithms==<br />
The algorithms for these variants of NMF are based on iterative updating algorithms proposed for the original NMF, in which the factors are alternatively updated until convergence <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>. At each iteration of algorithm, the value for F or G is found by multiplying its current value by some factor. In <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>, they prove that by repeatedly applying these multiplicative update rules, the quality of approximation smoothly improves. That is, the update rule guarantees convergence to a locally optimal matrix factorization. In this paper, the same approach has been used by authors to present the algorithms for Semi NMF and Convex NMF.<br />
<br />
===Algorithm for Semi NMF===<br />
<br />
As already stated, the factors for Semi NMF are computed by an iterative updating algorithm that alternately updates F and G until convergence is reached.<br />
<br />
*'''Step 1''': Initialize G<br />
**Obtain cluster indicators by K means clustering. <br />
*'''Step 2''': Update F, fixing G using the rule:<br />
<math>\mathbf{ F = XG(G^TG)^{-1}} </math><br />
<br />
*'''Step 3''': Update G, fixing F using the rule:<br />
<math> G_{ik} \leftarrow G_{ik} \sqrt{\frac {{(X^TF)^+}_{ik} + [G(F^TF)^-]_{ik}}{{(X^TF)^-}_{ik} + [G(F^TF)^+]_{ik}}}</math><br />
<br />
where, the positive and negative parts of a matrix are separated as:<br />
<math> {A_{ik}}^{+}=(|A_{ik}|+A_{ik})/2 </math> , <math> {A_{ik}}^{-}=(|A_{ik}|- A_{ik})/2 </math><br />
<br />
and, <math> A_{ik}= {A_{ik}}^{+} - {A_{ik}}^{-} </math><br />
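The two update steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: here G is initialized randomly rather than by K means, and a small <math>\epsilon</math> is added to the denominator for numerical stability (both are our assumptions, as is the function name).

```python
import numpy as np

def semi_nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Semi NMF sketch: X (p x n, mixed sign) ~ F G^T with G >= 0."""
    rng = np.random.default_rng(seed)
    G = np.abs(rng.standard_normal((X.shape[1], k)))  # paper initializes G via K means
    pos = lambda A: (np.abs(A) + A) / 2               # A^+  (positive part)
    neg = lambda A: (np.abs(A) - A) / 2               # A^-  (negative part)
    for _ in range(n_iter):
        F = X @ G @ np.linalg.pinv(G.T @ G)           # F = XG(G^TG)^{-1}
        XtF, FtF = X.T @ F, F.T @ F
        G *= np.sqrt((pos(XtF) + G @ neg(FtF)) /
                     (neg(XtF) + G @ pos(FtF) + eps)) # multiplicative G update
    return F, G
```

By Theorem 1, the residual <math>\|X-FG^T\|^2</math> is non-increasing over these iterations.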
<br />
<br><br />
'''Theorem 1:''' (A) The update rule for F gives the optimal solution to <math> \min_F \|X - FG^T\|^2 </math> while G is fixed. (B) When F is fixed, the residual <math> \|X - FG^T\|^2 </math> decreases monotonically under the update rule for G.<br />
<br />
'''Proof:'''<br />
<br />
(We do not reproduce the entire proof, only its main parts.)<br />
<br />
The objective function for semi NMF is:<br />
<math> J=\|X - FG^T\|^2= Tr(X^TX - 2X^TFG^T + GF^TFG^T) </math>.<br />
<br />
(A).The problem is unconstrained and the solution for F is trivial, given by:<br />
<math>dJ/dF = -2XG + 2FG^TG = 0</math><br />
<br>Therefore, <math> F = XG(G^TG)^{-1} </math><br />
<br />
(B). This is a constrained problem with an inequality constraint, solved using Lagrange multipliers; the solution given by the update rule must satisfy the KKT conditions at convergence, which establishes its correctness. Secondly, the update rule must cause the solution to converge. In the paper, the correctness and convergence of the update rule are proved as follows:<br />
<br />
<br><br />
<br />
(i)'''Correctness of solution:'''<br />
<br />
The Lagrange function is: <math> L(G) = Tr (-2X^TFG^T + GF^TFG^T - \beta G^T) </math> <br />
<br> where <math> \beta_{ij}</math> are the Lagrange multipliers enforcing the nonnegativity constraint on G.<br />
<br>Therefore, <math> \frac {\part L}{\part G}= -2X^TF + 2GF^TF - \beta = 0 </math> <br />
<br> From the complementary slackness condition, <math> (-2X^TF + 2GF^TF)_{ik}G_{ik} = \beta_{ik}G_{ik} = 0. </math> <br />
<br> The above equation must be satisfied at convergence.<br />
<br> The update rule for G can be reduced to: <br />
<math> (-2X^TF + 2GF^TF)_{ik}{G_{ik}}^2 = 0 </math> at convergence.<br />
<br> The two conditions are equivalent, since for <math>G_{ik} \ge 0</math> we have <math>G_{ik}=0</math> if and only if <math>{G_{ik}}^2=0</math>; therefore the update rule satisfies the KKT fixed point condition.<br />
<br><br />
<br />
<br />
(ii)'''Convergence of the solution given by update rule:'''<br />
<br />
The authors used an auxiliary function approach to prove convergence, as done in <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>.<br />
<br />
'''Definition of auxiliary function''': A function <math>G(h,h')</math> is called an auxiliary function of <math>F(h)</math> if the conditions <math> G (h,h') \ge F(h) </math> and <math> G (h,h) = F(h) </math> are satisfied. <br />
<br />
The auxiliary function is a useful concept because of the following lemma:<br />
<br><br />
<br />
'''Lemma:''' If G is an auxiliary function, then F is nonincreasing under the update <math>\mathbf{ h^{t+1} = \arg \min_h G(h,h^t)} </math><br />
<br />
[[File:auxiliary.jpeg|left|thumb|800px|Figure 1]]<br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
Adapted from <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”.</ref>.<br />
<br />
<br> That is, minimizing the auxiliary function <math> G(h,h^t) \ge F(h) </math> guarantees that <math> F(h^{t+1}) \le F(h^t) </math> for <math> \mathbf {h^{t+1} = \arg \min_h G(h, h^t) }</math> <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”.</ref>.<br />
<br />
Therefore, the authors of the paper found an auxiliary function and its global minimum for the cost function of Semi NMF.<br />
<br />
The cost function for Semi NMF can be written as: <br />
<math> \mathbf {J(H) = Tr (-2H^TB^{+} + 2H^TB^{-} + HA^{+}H^T + HA^{-}H^T)} </math> where <math> A = F^TF , B = X^TF , H = G </math>. <br />
<br />
The auxiliary function of J (H) is: <br><br />
<math> Z(H,H') = -\sum_{ik}2{B_{ik}}^{+}H'_{ik}(1+ \log \frac {H_{ik}}{H'_{ik}}) + \sum_{ik} {B^-}_{ik} \frac {{H^2}_{ik}+{{H'}^2}_{ik}}{{H'}_{ik}} + \sum_{ik} \frac {(H'A^{+})_{ik}{H^2}_{ik}}{{H'}_{ik}} - \sum_{ik} {A_{kl}}^{-}{H'}_{ik}{H'}_{il} (1+ \log \frac {H_{ik}H_{il}}{H'_{ik}H'_{il}}) </math> <br />
<br />
Z (H, H') is convex in H and its global minimum is:<br><br />
<math> H_{ik} = \arg \min_H Z(H,H') = H'_{ik}\sqrt {\frac {{B_{ik}}^{+} + (H'A^{-})_{ik}}{{B_{ik}}^{-} + (H'A^{+})_{ik}}} </math><br />
<br />
(The derivation of auxiliary function and its minimum can be found in the paper <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization” </ref>.)<br />
<br />
===Algorithm for Convex NMF===<br />
Here, again, the factors G and W are computed iteratively by alternately updating them until convergence.<br />
*'''Step 1''': Initialize G and W. There are two ways in which the initialization can be done.<br />
**'''K means clustering''': When K means clustering is done on the data set, cluster indicators <math> H = (h_1, \cdots , h_K) </math>are obtained. Then G is initialized to be equal to H. Then cluster centroids can be computed from H, as <math>\mathbf {f_k = Xh_k / n_k} </math> or <math> F=XH{D_n}^{-1}</math> where <math> D_n = diag (n_1, \cdots, n_K) </math>. And as, in convex NMF: <math>F = XW </math> , we get <math> W=H{D_n}^{-1}</math> <br />
**'''Previous NMF or Semi NMF solution''': The factor G is known in this case and a least square solution for W is obtained by solving <math> X=XWG^T</math>. Therefore, <math> W=G(G^TG)^{-1} </math><br />
<br />
*'''Step 2''': Update G, while fixing W using the rule<br />
<math> G_{ik} \leftarrow G_{ik} \sqrt{\frac {[(X^TX)^+W]_{ik} + [GW^T(X^TX)^-W]_{ik}} {[(X^TX)^-W]_{ik} + [GW^T(X^TX)^+W]_{ik}} } </math><br />
*'''Step 3''': Update W, while fixing G using the rule<br />
<math> W_{ik} \leftarrow W_{ik} \sqrt{\frac {[(X^TX)^+G]_{ik} + [(X^TX)^-WG^TG]_{ik}} {[(X^TX)^-G]_{ik} + [(X^TX)^+WG^TG]_{ik}} } </math><br />
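These steps can be sketched in NumPy as follows. This is a rough illustration rather than the authors' code: the random nonnegative initialization replaces the K means initialization of Step 1, and a small <math>\epsilon</math> is added to the denominators (both are our assumptions, as is the function name).

```python
import numpy as np

def convex_nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Convex NMF sketch: X ~ X W G^T with W, G >= 0."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    G = np.abs(rng.standard_normal((n, k)))           # paper initializes via K means
    W = np.abs(rng.standard_normal((n, k)))
    K = X.T @ X                                       # updates touch X only via X^T X
    Kp, Kn = (np.abs(K) + K) / 2, (np.abs(K) - K) / 2 # (X^TX)^+ and (X^TX)^-
    for _ in range(n_iter):
        G *= np.sqrt((Kp @ W + G @ (W.T @ Kn @ W)) /
                     (Kn @ W + G @ (W.T @ Kp @ W) + eps))  # Step 2
        GtG = G.T @ G
        W *= np.sqrt((Kp @ G + Kn @ W @ GtG) /
                     (Kn @ G + Kp @ W @ GtG + eps))        # Step 3
    return W, G
```

Because the updates depend on X only through <math>X^TX</math>, substituting a kernel matrix for <math>X^TX</math> yields the kernelized variant discussed below in the Kernel NMF section.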
<br />
The objective function to be minimized for convex NMF is:<br />
<br />
<math> \mathbf {J=\|X-XWG^T\|^2= Tr(X^TX- 2G^TX^TXW + W^TX^TXWG^TG)} </math>.<br />
<br />
'''Theorem 2:''' Fixing W, under the update rule for G, (A) the residual <math>\|X - XWG^T \|^2 </math> decreases monotonically (non-increasing), and (B) the solution converges to a KKT fixed point.<br />
<br />
The correctness and convergence of these rules are demonstrated in a manner similar to Semi NMF by substituting F=XW.<br />
<br />
'''Theorem 3:''' Fixing G, under the update rule for W, (A) the residual <math>\|X - XWG^T \|^2 </math> decreases monotonically (non-increasing), and (B) the solution converges to a KKT fixed point.<br />
<br />
The correctness is demonstrated by minimizing the objective function with respect to W and then obtaining KKT fixed point condition as:<br />
<br />
<math> \mathbf {(-X^TXG + X^TXWG^TG)_{ik}W_{ik} = 0 }</math><br />
<br />
<br> At convergence, the update rule for W can be shown to satisfy:<br />
<br />
<math>\mathbf { (-X^TXG + X^TXWG^TG)_{ik}{W_{ik}}^2 = 0 }</math><br />
<br />
<br> Therefore, the update rule for W satisfies KKT condition.<br><br />
<br />
Convergence of these rules is demonstrated in a manner similar to Semi NMF by finding an auxiliary function and its global minimum.<br />
<br />
==Sparsity of Convex NMF==<br />
<br />
NMF is shown to learn a parts-based representation and therefore has sparse factors. However, there is no means to control the degree of sparseness, and many sparsification methods have been applied to NMF in order to obtain a better parts-based representation <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization” </ref> , <ref name='Simon D. H' > Simon D. H, Ding C, and He X; “On the Equivalence of Nonnegative Matrix Factorization and Spectral Clustering” </ref>. In contrast, the authors of this paper show that the factors of Convex NMF are naturally sparse.<br />
<br />
<br> The convex NMF problem can be written as:<br />
<br />
<math> \min_{W,G \ge 0}||X-XWG^T||^2 = ||X(I-WG^T)||^2= Tr (I-GW^T)X^TX(I-WG^T) </math><br />
<br />
<br> by SVD of <math> X </math> we have <math> X = U \Sigma V^T</math> and thus, <math> X^TX = \sum_k {\sigma _k}^2v_k{v_k}^T.</math><br />
<br />
<br> Therefore, <math> \min_{W,G \ge 0} Tr (I-GW^T)X^TX(I-WG^T) = \sum_k {\sigma_k}^2||{v_k}^T(I-WG^T)||^2 </math> s.t. <math>W \in {\mathbb R_+}^{n \times k} </math> , <math>G \in {\mathbb R_+}^{n \times k}</math><br />
<br />
They use the following Lemma to show that the above optimization problem gives sparse W and G.<br />
<br />
<br>'''Lemma:''' The solution of the optimization problem <math> \min_{W,G \ge 0}||I-WG^T||^2 </math> s.t. <math>W, G \in {\mathbb R_+}^{n \times K}</math> is given by <math>W = G = </math> any <math>K</math> columns of <math>(e_1, \ldots, e_n)</math>, where <math>e_k</math> is a basis vector: <math> (e_k)_{i \ne k} = 0 </math>, <math> (e_k)_{i = k} = 1 </math><br />
<br />
<br> According to this Lemma, the solution to <math> \min_{W,G \ge 0}\|I - WG^T\|^2 </math> are the sparsest possible rank-K matrices W and G.<br />
<br />
In the above equation, we can write: <math> \| I - WG^T \|^2 = \sum_k \|{e_k}^T (I - WG^T)\|^2 </math>.<br />
<br />
Therefore, the projection of <math> ( I - WG^T ) </math> onto the principal components receives more weight while its projection onto the non-principal components receives less weight. This implies that the factors W and G are sparse in the principal component subspace and less sparse in the non-principal component subspace.<br />
<br />
==Kernel NMF==<br />
Consider a mapping <math> \phi </math> that maps a point to a higher dimensional feature space, such that <math> \phi: x_i \rightarrow \phi(x_i)</math>. The factors for the kernel form of NMF or semi NMF : <math> \phi (X) = FG^T </math> would be difficult to compute as we need to know the mapping <math>\phi </math> explicitly.<br />
<br />
This difficulty is overcome in Convex NMF, as it has the form <math> \phi (X) \approx \phi (X) WG^T </math> and therefore the objective to be minimized becomes,<br />
<br> <math> \|\phi (X)-\phi(X)WG^T\|^2 = Tr (K-2G^TKW+W^TKWG^TG) </math> where <math> K = \phi^T(X)\phi(X) </math> is the kernel.<br />
<br />
Also, the update rules for the convex NMF algorithm (discussed above) depend only on <math> X^TX </math> and therefore convex NMF can be '''kernelized'''.<br />
<br />
==Cluster NMF==<br />
<br />
If the factor G is considered to contain posterior cluster probabilities, then F, which represents the cluster centroids, is given as:<br />
<br> <math> \mathbf {f_k = Xg_k / n_k} </math> or <math> F = XG{D_n}^{-1}</math> where <math> D_n = diag (n_1, \cdots, n_K) </math>.<br />
<br>Therefore, the factorization becomes <math> X \approx XG{D_n}^{-1}G^T </math>, or simply <math> X \approx X G G^T </math>, because NMF is invariant to diagonal rescaling.<br />
<br />
This factorization is called Cluster NMF as it has the same degree of freedom as in any standard clustering problem, which is G (cluster indicator).<br />
<br />
==Relationship between NMF (its variants) and K means clustering==<br />
<br />
NMF and all its variants discussed above can be interpreted as K means clustering by imposing an additional constraint <math> G^TG=I </math>, that is in each row of G there is only one nonzero element, which implies each data point can belong to only one cluster.<br />
<br />
'''Theorem:''' G-orthogonal NMF, Semi NMF, Convex NMF, Cluster NMF and Kernel NMF are all relaxations of K means clustering.<br />
<br />
'''Proof:'''<br />
<br />
In all the above five cases of NMF, it can be shown that the objective function can be reduced to:<br />
<math> \mathbf {J = Tr(X^TX -G^TKG)} </math> when <math> G^TG = I </math> and where <math> K = X^TX </math> or <math> K = \phi^T(X)\phi(X) </math>. As the first term is a constant, the minimization problem actually becomes: <br><br />
<math> \max_{G^TG = I} Tr(G^TKG) </math><br />
<br />
The above objective function is the same as the objective function for kernel K means clustering <ref name='Simon D. H'> Simon D. H, Ding C, and He X; “On the Equivalence of Nonnegative Matrix Factorization and Spectral Clustering” </ref>.<br />
<br />
<br> Even without the orthogonality constraint, these NMF algorithms can be considered to be '''soft''' versions of K means clustering. That is each data point can be considered to fractionally belong to more than one cluster.<br />
<br />
==General properties of NMF algorithms==<br />
*They converge to a local minimum, not a global minimum.<br />
*NMF factors are invariant to rescaling, i.e. a degree of freedom of diagonal rescaling is always present.<br />
*The convergence rate of multiplicative algorithms is first order.<br />
*There are many different ways to initialize NMF; here, the relationship between NMF and relaxed K means clustering is used.<br />
<br />
==Experimental Results==<br />
<br />
The authors present experimental results on a synthetic data set to show that the factors given by Convex NMF more closely resemble cluster centroids than those given by Semi NMF, although Semi NMF results are better in terms of accuracy than Convex NMF. They also compare the results of NMF, Convex NMF and Semi NMF with K means clustering on real datasets, and conclude that all of these matrix factorizations give better results than K means, in terms of clustering accuracy, on all of the datasets they studied.<br />
<br />
===A. Synthetic dataset ===<br />
<br />
One main theme of our work is that the Convex-NMF variants may provide subspace factorizations<br />
that have more interpretable factors than those obtained by other NMF variants (or<br />
PCA). In particular, we expect that in some cases the factor F will be interpretable as containing<br />
cluster representatives (centroids) and G will be interpretable as encoding cluster indicators. We<br />
begin with a simple investigation of this hypothesis. In Figure 1, we randomly generate four<br />
<br />
==Conclusion==<br />
In this paper: <br />
*A number of new NMF algorithms have been proposed which extend the applications of NMF.<br />
*They deal with mixed sign data.<br />
*The connection between NMF (its variants) and K means clustering was analyzed.<br />
*The matrix factors are shown to have convenient interpretation in terms of clustering.<br />
<br />
==References==<br />
<references/></div>Amirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=kernelized_Sorting&diff=3834kernelized Sorting2009-08-05T03:27:30Z<p>Amir: /* Application */</p>
<hr />
<div>Object matching is a fundamental operation in data analysis. It typically requires the definition of a similarity measure between classes of objects to be matched. Instead, we develop an approach which is able to perform matching by requiring a similarity measure only within each of the classes. This is achieved by maximizing the dependency between matched pairs of observations by means of the Hilbert Schmidt Independence Criterion. This problem can be cast as one of maximizing a quadratic assignment problem with special structure and we present a simple algorithm for finding a locally optimal solution. <br />
<br />
==Introduction==<br />
===Problem Statement===<br />
Assume we are given two collections of documents purportedly covering the same content, written in two different languages. Can we determine the correspondence between these two sets of documents without using a dictionary?<br />
<br />
===Sorting and Matching===<br />
(Formal) problem formulation:<br />
<br />
Given two sets of observations <math> X= \{ x_{1},...,<br />
x_{m} \}\subseteq \mathcal X</math> and <math>Y=\{ y_{1},..., y_{m}\}\subseteq \mathcal Y </math><br />
<br />
Find a permutation matrix <math>\pi \in \Pi_{m}</math>,<br />
<br />
<math> \Pi_{m}:= \{ \pi \,|\, \pi \in \{0,1\}^{m \times m},\ \pi 1_{m}=1_{m},\ \pi^{T}1_{m}=1_{m}\}</math><br />
<br />
such that <math> \{ (x_{i},y_{\pi (i)})\ \text{for}\ 1 \leqslant i \leqslant m \}</math> is maximally dependent. Here <math>1_{m} \in \mathbb{R}^{m}</math> is the vector of all ones.<br />
<br />
Denote by <math>D(Z(\pi))</math> a measure of the dependence between x and y, where <math> Z(\pi) := \{ (x_{i},y_{\pi (i)})\ \text{for}\ 1 \leqslant i \leqslant m \}</math>. <br />
<br />
Then we define nonparametric sorting of X and Y as follows<br />
<br />
<math><br />
\pi^{\ast}:=\arg\max_{\pi \in \prod_{m}}D(Z(\pi)).<br />
</math><br />
<br />
==Hilbert Schmidt Independence Criterion==<br />
<br />
Let sets of observations X and Y be drawn jointly from some probability distribution <math>Pr_{xy}</math>. The Hilbert Schmidt Independence Criterion (HSIC) measures the dependence between x and y by computing the norm of the cross-covariance operator over the domain <math> \mathcal X \times \mathcal Y</math> in Hilbert Space.<br />
<br />
let <math>\mathcal {F}</math> be the Reproducing Kernel Hilbert Space (RKHS) on<br />
<math>\mathcal {X}</math> with associated kernel <math>k: \mathcal X \times \mathcal X \rightarrow<br />
\mathbb{R}</math> and feature map <math>\phi: \mathcal X \rightarrow \mathcal {F}</math>.<br />
Let <math>\mathcal {G}</math> be the RKHS on <math>\mathcal Y</math> with kernel <math>l</math> and<br />
feature map <math>\psi</math>. The cross-covariance operator <math>C_{xy}:\mathcal<br />
{G}\rightarrow \mathcal {F}</math> is defined by<br />
<br />
<math><br />
C_{xy}=\mathbb{E}_{xy}[(\phi(x)-\mu_{x})\otimes (\psi(y)-\mu_{y})],<br />
</math><br />
<br />
where <math>\mu_{x}=\mathbb{E}[\phi(x)]</math>, <math>\mu_{y}=\mathbb{E}[\psi(y)]</math>.<br />
<br />
HSIC is the square of the Hilbert-Schmidt norm of the cross covariance operator <math>\, C_{xy}</math><br />
<br />
<math><br />
D(\mathcal {F},\mathcal {G},Pr_{xy}):=\parallel C_{xy}<br />
\parallel_{HS}^{2}.<br />
</math><br />
<br />
In term of kernels, HSIC can be expressed as<br />
<br />
<math><br />
\mathbb{E}_{xx'yy'}[k(x,x')l(y,y')]+\mathbb{E}_{xx'}[k(x,x')]\mathbb{E}_{yy'}[l(y,y')]-2\mathbb{E}_{xy}[\mathbb{E}_{x'}[k(x,x')]\mathbb{E}_{y}[l(y,y')]].<br />
</math><br />
<br />
where <math>\mathbb{E}_{xx'yy'}</math> is the expectation over both <math>\ (x, y)</math> ~<br />
<math>\ Pr_{xy}</math> and an additional pair of variables <math>\ (x', y')</math> ~ <math>\ Pr_{xy}</math><br />
drawn independently according to the same law.<br />
<br />
A biased estimator of HSIC given finite sample <math>Z = \{(x_{i},<br />
y_{i})\}_{i=1}^{m}</math> drawn from <math>Pr_{xy}</math> is<br />
<br />
<math><br />
D(\mathcal {F},\mathcal {G},Z)=(m-1)^{-2}tr HKHL =<br />
(m-1)^{-2} tr \bar{K}\bar{L}<br />
</math><br />
<br />
where <math>K,L\in \mathbb{R}^{m\times m}</math> are the kernel matrices for<br />
the data and the labels respectively, <math>H_{ij}=\delta_{ij}-m^{-1}</math><br />
centers the data and the labels in feature space, <math>\bar{K}:=HKH</math> and<br />
<math>\bar{L}:=HLH</math> denote the centered versions <math>K</math> and <math>L</math> respectively.<br />
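The biased estimator above is straightforward to compute from the two kernel matrices alone; a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def hsic(K, L):
    """Biased HSIC estimate (m-1)^{-2} tr(HKHL) from kernel matrices K, L."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m   # centering matrix: H_ij = delta_ij - 1/m
    return np.trace(H @ K @ H @ L) / (m - 1) ** 2
```

For example, with linear kernels a dependent pairing of samples scores higher than a randomly shuffled one.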
<br />
Advantages of HSIC are:<br />
*Computing HSIC is simple: only the kernel matrices K and L are needed.<br />
*HSIC satisfies concentration of measure conditions, i.e. for random draws of observations from <math>Pr_{xy}</math>, HSIC provides values which are very similar.<br />
*Incorporating prior knowledge into the dependence estimation can be done via kernels.<br />
<br />
==Kernelized Sorting==<br />
===Kernelized Sorting===<br />
'''Claim: ''' The problem is equivalent to the optimization problem <br />
<math><br />
\pi^{\ast}=\arg\max_{\pi \in \Pi_{m}}[tr \bar{K}<br />
\pi^{T}\bar{L}\pi]<br />
</math><br />
<br />
'''Proof''': Firstly, we need to establish that the matrices <math>H</math> and <math>\pi</math> commute.<br />
<br />
Since <math>H</math> is a centering matrix, we can write it as <math>H=I_{m}-\frac{1}{m}1_m{1_m}^{T}</math>.<br />
<br />
Note that <math>\ H\pi=\pi H</math> iff <math>\ 1_m{1_m}^{T}\pi=\pi 1_m{1_m}^{T}</math>; this holds because <math>\pi 1_{m}=1_{m}</math> and <math>{1_m}^{T}\pi={1_m}^{T}</math> give <math>1_m{1_m}^{T}\pi=1_m{1_m}^{T}=\pi 1_m{1_m}^{T}</math>, and the result follows.<br />
<br />
Next, recall that the biased estimator of HSIC given finite sample <math>Z = \{(x_{i},<br />
y_{i})\}_{i=1}^{m}</math> drawn from <math>Pr_{xy}</math> is<br />
<br />
<math><br />
D(\mathcal {F},\mathcal {G},Z)=(m-1)^{-2}tr HKHL =<br />
(m-1)^{-2} tr \bar{K}\bar{L}<br />
</math><br />
<br />
where <math>K,L\in \mathbb{R}^{m\times m}</math> are the kernel matrices for<br />
the data and the labels respectively, i.e. <math>K=xx^{T}</math> and <math>L=yy^{T}</math>.<br />
<br />
Now, for any given pair <math>(x, y_{r})</math> between <math>X</math> and <math>Y</math>, we have <math>y_{r}=\pi y</math>.<br />
<br />
Note that <math>\pi</math> is a permutation matrix, we have <math>y=\pi^{T} y_{r}</math>, so the kernel matrix <math>L=\pi^{T}y_{r}y_{r}^{T}\pi</math>.<br />
<br />
Note that the kernel matrix <math>L_{r}=y_{r}y_{r}^{T}</math>, so the kernel matrix <math>L=\pi^{T}L_{r}\pi</math>.<br />
<br />
Note that <math>tr HKHL = tr HKHHLH </math>, since <math>H</math> is idempotent.<br />
<br />
So we have <math>tr HKHL = tr HKHHLH = tr \bar K H\pi^{T}L_{r}\pi H = tr \bar K \pi^{T}HL_{r}H\pi = tr \bar K \pi^{T}\bar L_{r}\pi </math>. <br />
<br />
Clearly, it is just our objective function.<br />
<br />
====Sorting as a special case====<br />
For general kernel matrices <math>K \,</math> and <math>L \,</math>, where <math>K_{ij}=k(x_i,x_j) \,</math> and <math>L_{ij}=l(y_i,y_j) \,</math>, the objective of the kernelized sorting problem, as explained above, is to find the permutation matrix <math> \pi \,</math> which maximizes <math>tr(\bar{K} \pi^{T}\bar{L}\pi ) = tr(HKH\pi^{T}HLH\pi)\, </math>.<br />
<br />
In the special case where the kernel functions <math>k\,</math> and <math>l\,</math> are the inner product in Euclidean space, we have <math>K=xx^{T}\,</math> and <math>L=yy^{T}\,</math>. Hence, we can rewrite the objective as <br />
<br />
<math>tr(HKH\pi^{T}HLH\pi) = tr(Hxx^{T}H\pi^{T}Hyy^{T}H\pi) = tr[Hx(Hx)^T\pi^{T}Hy(Hy)^T\pi] = tr[((Hx)^T\pi^{T}Hy) ((Hy)^T\pi Hx))]\,</math>, where the last step uses the property that trace is invariant under cyclic permutations.<br />
<br />
Note that <math>(Hx)^T\pi^{T}Hy \, </math> and <math> (Hy)^T\pi Hx = (Hx)^T\pi^{T}Hy \,</math> are scalars, therefore the objective is equal to <math> [(Hx)^T\pi (Hy)]^2 \,</math>.<br />
<br />
In the even more special case where the Euclidean space is the real line and the inner product is multiplication of real numbers, the centering matrix <math>H\,</math> merely translates the sample vector <math>y \,</math> (by the sample mean) and thus the order of <math>y \,</math> is preserved. Hence, maximizing <math> [(Hx)^T\pi (Hy)]^2 \,</math> can be solved by maximizing <math>x^T \pi y \,</math>. Under the further assumption that <math>x \,</math> is sorted ascendingly, maximizing <math> x^T \pi y \,</math> is equivalent to sorting <math>y \,</math> ascendingly, according to the Hardy–Littlewood–Pólya rearrangement inequality.<br />
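This special case can be checked numerically by brute force over all permutations (a small sanity check of our own, not from the paper; the sample values are arbitrary):

```python
import numpy as np
from itertools import permutations

x = np.array([1.0, 2.0, 3.0, 4.0])   # sorted ascendingly
y = np.array([0.3, 2.5, 1.1, 4.0])
# maximize x^T pi y over all arrangements of y
best = max(permutations(y), key=lambda p: float(x @ np.array(p)))
assert np.allclose(best, np.sort(y))  # the maximizer sorts y ascendingly
```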
<br />
===Diagonal Dominance===<br />
Replace the expectations by sums where no pairwise summation indices are identical. This leads to the objective function:<br />
<br />
<math><br />
\frac{1}{m(m-1)}\sum_{i\ne<br />
j}K_{ij}L_{ij}+\frac{1}{m^{2}(m-1)^{2}}\sum_{i\ne j,u\ne<br />
v}K_{ij}L_{uv}- \frac{2}{m(m-1)^2}\sum_{i,j\ne i,v\ne i}K_{ij}L_{iv}<br />
</math><br />
<br />
Using the <math>\bar{K}_{ij}=K_{ij}(1-\delta_{ij})</math> and<br />
<math>\bar{L}_{ij}=L_{ij}(1-\delta_{ij})</math> for kernel matrices where<br />
the main diagonal terms have been removed we arrive at the<br />
expression <math>(m-1)^{-1}tr<br />
H\bar{L}H\bar{K}</math>.<br />
<br />
===Relaxation to a constrained eigenvalue problem===<br />
An approximate solution of the problem can be obtained by solving <br><br />
<br />
<math><br />
\text{maximize}_{\eta} \left\{ \eta^{T}M\eta \right\}\ \text{subject to}\ A\eta=b<br />
</math><br />
<br />
Here the matrix <math>M=K\otimes L\in \mathbb{R}^{m^{2}\times{m^2}}</math> is<br />
given by the outer product of the constituting kernel matrices,<br />
<math>\eta \in \mathbb{R}^{m^2}</math> is a vectorized version of the<br />
permutation matrix <math>\pi</math>, and the constraints imposed by <math>A</math> and <math>b</math><br />
amount to the polytope constraints imposed by <math>\Pi_{m}</math>.<br />
<br />
===Related Work===<br />
Mutual Information is defined as, <math>I(X,Y)=h(X)+h(Y)-h(X,Y)</math>. We can<br />
approximate MI maximization by maximizing its lower bound. This then<br />
corresponds to minimizing an upper bound on the joint<br />
entropy <math>h(X,Y)</math>.<br />
<br />
The optimization problem becomes<br />
<br />
<math><br />
\pi^{\ast}=\arg\min_{\pi \in \Pi_{m}}|\log HJ(\pi)H|,<br />
</math><br />
<br />
where <math>\ J_{ij}=K_{ij}L_{\pi(i),\pi(j)}</math>. This is related to the<br />
optimization criterion proposed by Jebara(2004) in the context of<br />
aligning bags of observations by sorting via minimum volume PCA.<br />
<br />
===Multivariate Extensions===<br />
Let there be T random variables <math>x_i \in {\mathcal X}_i</math> which are jointly drawn from some distribution <math>p(x_1,...,x_T)</math>. The expectation operator with respect to the joint distribution and with respect to the product of the marginals is given by<br />
<br />
<math><br />
\mathbb{E}_{x_1,...,x_T}[\prod_{i=1}^{T}k_{i}(x_{i},\cdot)]</math> and <math>\prod_{i=1}^{T}\mathbb{E}_{x_i}[k_{i}(x_{i},\cdot)]<br />
</math><br />
<br />
respectively. Both terms are equal if and only if all random variables are independent. The squared difference between both is given by<br />
<br />
<math><br />
\mathbb{E}_{x_{i=1}^T,{x'}_{i=1}^{T}}[\prod_{i=1}^{T}k_{i}(x_{i},x_{i}^{'})]+\prod_{i=1}^{T}\mathbb{E}_{x_{i},x_{i}^{'}}[k_{i}(x_{i},x_{i}^{'})]-2\mathbb{E}_{x_{i=1}^{T}}[\prod_{i=1}^{T}\mathbb{E}_{x_{i}^{'}}[k(x_{i},x_{i}^{'})]]<br />
</math><br />
<br />
which we refer to as multiway HSIC.<br />
<br />
Denote by <math>K_{i}</math> the kernel matrix obtained from the kernel <math>k_{i}</math> on the set of observations <math>X_{i}:=\{x_{i1},...,x_{im}\}</math>, the empirical estimate is given by<br />
<br />
<math><br />
HSIC[X_{1},...,X_{T}]:=1_{m}^{T}(\bigodot_{i=1}^{T}K_{i})1_{m}+\prod_{i=1}^{T}1_{m}^{T}K_{i}1_{m}-2\cdot1_{m}^{T}(\bigodot_{i=1}^{T}K_{i}1_{m})<br />
</math><br />
<br />
where <math>\bigodot_{i=1}^{T}\ast</math> denotes the elementwise product of its arguments. To apply this to sorting we only need to define <math>T</math> permutation matrices <math>\pi_{i} \in \Pi_{m}</math> and replace the kernel matrices <math>K_{i}</math> by <math>\pi_{i}^{T}K_{i}\pi_{i}</math>.<br />
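As a concrete sketch of the empirical estimate above (a NumPy sketch under the assumption that the kernel matrices are already computed; as in the text, normalization constants are omitted):<br />

```python
import numpy as np

def multiway_hsic(kernels):
    """Empirical multiway-HSIC statistic for a list of m x m kernel
    matrices K_1, ..., K_T, implementing the three displayed terms
    (normalization constants are omitted, as in the text)."""
    m = kernels[0].shape[0]
    ones = np.ones(m)
    # 1^T (elementwise product over i of K_i) 1
    term1 = ones @ np.prod(kernels, axis=0) @ ones
    # product over i of 1^T K_i 1
    term2 = np.prod([ones @ K @ ones for K in kernels])
    # 1^T (elementwise product over i of the vectors K_i 1)
    term3 = np.sum(np.prod([K @ ones for K in kernels], axis=0))
    return term1 + term2 - 2.0 * term3
```

For <math>T=2</math> the three terms correspond to the three expectations in the pairwise HSIC expression.<br />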
<br />
==Optimization==<br />
===Convex Objective and Convex Domain===<br />
<br />
Relax <math>\pi</math> to a doubly stochastic matrix, i.e. an element of<br />
<br />
<math><br />
P_{m}:=\{\pi \in \mathbb{R}^{m \times m} \mid<br />
\pi_{ij}\geqslant 0, \sum_{i}\pi_{ij}=1, \sum_{j}\pi_{ij}=1\}<br />
</math><br />
<br />
The objective function <math>tr\, K \pi^{T}L\pi</math> is convex in <math>\pi</math>, provided<br />
that <math>K</math> and <math>L</math> are positive semidefinite.<br />
<br />
===Convex-Concave Procedure===<br />
<br />
Compute successive linear lower bounds and maximize<br />
<math><br />
\pi_{i+1}\leftarrow \arg\max_{\pi \in P_{m}}[tr<br />
\bar{K} \pi^{T}\bar{L} \pi_{i}]<br />
</math><br />
<br />
This will converge to a local maximum.<br />
<br />
Initialization is done via the sorted principal eigenvectors of <math>\bar{K}</math> and <math>\bar{L}</math>.<br />
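Since <math>tr \bar{K} \pi^{T}\bar{L} \pi_{i} = \sum_{a,b}\pi_{ab}(\bar{L}\pi_{i}\bar{K})_{ab}</math>, each update is a linear assignment problem over permutation matrices. The following is a hedged sketch (it assumes SciPy's <code>linear_sum_assignment</code> solver and, for simplicity, initializes with the identity permutation rather than the eigenvector-based initialization):<br />

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def kernelized_sort(K, L, n_iter=50):
    """Sketch of the iteration pi_{i+1} <- argmax_pi tr(Kb pi^T Lb pi_i).
    Each step maximizes <pi, Lb pi_i Kb>, a linear assignment problem."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m     # centering matrix
    Kb, Lb = H @ K @ H, H @ L @ H
    pi = np.eye(m)                          # simplistic initialization
    for _ in range(n_iter):
        M = Lb @ pi @ Kb                    # linear coefficient of the bound
        rows, cols = linear_sum_assignment(-M)  # negate cost to maximize
        new_pi = np.zeros_like(pi)
        new_pi[rows, cols] = 1.0
        if np.allclose(new_pi, pi):         # fixed point reached
            break
        pi = new_pi
    return pi
```

For positive semidefinite kernels the objective is non-decreasing along these iterates, which is what the local-maximum guarantee above relies on.<br />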
<br />
===A tentative explanation of this part===<br />
Basically, I think the optimization method used in this paper does not apply the Concave Convex Procedure exactly. As I said on Tuesday, I think it just "borrowed" the idea from the Concave Convex Procedure, since there is no concave part in this problem.<br />
<br />
According to the paper, the Concave Convex Procedure works as<br />
follows: for <math>f(x)=g(x)-h(x)</math>, where <math>g</math> is convex and <math>h</math> is concave, a<br />
lower bound can be found by<br />
<br />
<math><br />
f(x) \ge g(x_{0}) + \langle x-x_{0},\partial_{x}g(x_{0}) \rangle<br />
-h(x)<br />
</math><br />
<br />
The objective function of the kernelized sorting method can be written in the following form<br />
<br />
<math>f(\pi)=g(\pi)= tr \bar{K}<br />
\pi^{T}\bar{L}\pi</math><br />
<br />
Currently, suppose we have <math>\pi_{0}</math> and <math>g(\pi_{0}) = tr\bar{K} \pi_{0}^{T}\bar{L}\pi_{0}</math>.<br />
<br />
We know that <math>\bigtriangledown_{A} tr ABA^{T}C=CAB+C^{T}AB^{T}</math>.<br />
<br />
So <math> \bigtriangledown_{\pi} tr \bar K<br />
\pi^{T} \bar L \pi=\bigtriangledown_{\pi} tr \pi \bar K \pi^{T} \bar L = \bar L \pi \bar K+\bar L^{T} \pi \bar K^{T}</math>.<br />
Since <math>\bar K</math> and <math>\bar L</math> are symmetric matrices, we get<br />
<br />
<math>\bigtriangledown_{\pi} tr \pi \bar<br />
K \pi^{T} \bar L = 2\bar L \pi \bar K</math>.<br />
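As a sanity check on this gradient identity, a small finite-difference experiment (assuming NumPy, with <math>\pi</math> treated as an unconstrained matrix variable):<br />

```python
import numpy as np

# Finite-difference check of  d/dpi tr(pi Kb pi^T Lb) = 2 Lb pi Kb
# for symmetric Kb, Lb.
rng = np.random.default_rng(0)
m = 5
A, B = rng.standard_normal((m, m)), rng.standard_normal((m, m))
Kb, Lb = A + A.T, B + B.T                  # symmetric test matrices
pi = rng.standard_normal((m, m))

f = lambda P: np.trace(P @ Kb @ P.T @ Lb)  # the objective tr(pi Kb pi^T Lb)
grad = 2 * Lb @ pi @ Kb                    # claimed closed-form gradient

eps = 1e-6
num = np.zeros((m, m))
for i in range(m):
    for j in range(m):
        E = np.zeros((m, m))
        E[i, j] = eps
        num[i, j] = (f(pi + E) - f(pi - E)) / (2 * eps)

assert np.allclose(num, grad, atol=1e-4)   # the two gradients agree
```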
<br />
Hence, we get<br />
<math><br />
\langle \pi - \pi_{0}, \bigtriangledown_{\pi} tr \pi_{0} \bar K \pi_{0}^{T} \bar L\rangle <br />
=\langle \pi - \pi_{0}, 2\bar L \pi_{0} \bar K\rangle<br />
=2tr (\pi - \pi_{0})^{T}\bar L \pi_{0} \bar K<br />
=2tr \bar K(\pi - \pi_{0})^{T}\bar L \pi_{0}<br />
</math><br />
<br />
In this case, we can get<br />
<math><br />
f(\pi)\ge tr <br />
\bar{K} \pi_{0}^{T}\bar{L}\pi_{0} + 2tr \bar K(\pi -<br />
\pi_{0})^{T}\bar L \pi_{0} </math><br />
<br />
Maximizing this lower bound over <math>\pi</math> is unaffected by the additive constant and<br />
the factor <math>2</math>. Since<br />
<math>tr \bar{K} \pi_{0}^{T}\bar{L}\pi_{0} + tr \bar K(\pi -<br />
\pi_{0})^{T}\bar L \pi_{0} = tr \bar{K} \pi^{T}\bar{L}\pi_{0},</math><br />
<br />
maximizing the bound is equivalent to maximizing <math>tr \bar{K} \pi^{T}\bar{L}\pi_{0}</math>.<br />
<br />
So this is the quantity we maximize repeatedly, which recovers the update rule above.<br />
<br />
Actually, I think if the kernel matrices <math>K</math> and <math>L</math> are well defined and already computed, there are not too many parameters in this optimization problem, which means some stochastic gradient descent method could be applied. In fact, I think this problem is easier than the assignment problem, based on the same argument about the number of parameters. Hence, interpreting it as a TSP-like problem and using, for example, a simulated annealing algorithm is an acceptable method.<br />
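As a hedged sketch of this local-search idea, a greedy pairwise-swap hill climb on <math>tr \bar{K} \pi^{T}\bar{L}\pi</math> (a simulated annealing variant would additionally accept some worsening swaps with a temperature-dependent probability):<br />

```python
import numpy as np

def swap_hill_climb(K, L, n_sweeps=20, seed=0):
    """Greedy local search over permutations: accept any pairwise swap
    that increases tr(Kb pi^T Lb pi).  For symmetric centered kernels
    this objective equals sum_{a,b} Kb[a,b] * Lb[perm[a], perm[b]]."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    Kb, Lb = H @ K @ H, H @ L @ H
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)                        # perm[i] = pi(i)

    def objective(p):
        return np.sum(Kb * Lb[np.ix_(p, p)])

    best = objective(perm)
    for _ in range(n_sweeps):
        improved = False
        for i in range(m - 1):
            for j in range(i + 1, m):
                perm[i], perm[j] = perm[j], perm[i]  # try a swap
                val = objective(perm)
                if val > best:
                    best, improved = val, True       # keep the swap
                else:
                    perm[i], perm[j] = perm[j], perm[i]  # undo it
        if not improved:
            break
    return perm, best
```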
<br />
== Application ==<br />
<br />
Assume that we want to visualize data according to the metric structure inherent in it. More specifically, our objective is to align it<br />
according to a given template, such as a grid, a torus, or any other fixed structure. Such problems occur when presenting images or documents to a user. Most algorithms for low-dimensional object layout suffer from the problem that the low-dimensional presentation is nonuniform. This has the advantage of revealing cluster structure, but given limited screen size such a presentation is undesirable. To address this problem, we can use kernelized sorting to align objects. In this scenario the kernel matrix <math>L</math> is given by the similarity measure between the objects <math>x_i</math> that are to be aligned. The kernel <math>K</math>, on the other hand, denotes the similarity between the locations where objects are to be aligned to. For the sake of simplicity we used a Gaussian RBF kernel both between the objects to be laid out and between the positions of the grid, i.e. <math>k(x,x') = \exp(-\gamma \|x-x'\|^2)</math>. The kernel width <math>\gamma</math> was adjusted to the inverse of the typical squared distance <math>\|x-x'\|^{2}</math> such that the argument of the exponential is <math>O(1)</math>.<br />
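A hedged sketch of such a kernel with an automatically adjusted width; here <math>\gamma</math> is set to the inverse of the median squared pairwise distance, one common way of making the exponent <math>O(1)</math> (the exact scaling rule used in the paper may differ):<br />

```python
import numpy as np

def rbf_kernel_auto(X):
    """Gaussian RBF kernel k(x, x') = exp(-gamma ||x - x'||^2) with
    gamma = 1 / median of the positive squared pairwise distances."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # squared distances
    d2 = np.maximum(d2, 0.0)                        # guard tiny negatives
    gamma = 1.0 / np.median(d2[d2 > 0])
    return np.exp(-gamma * d2)
```

The same construction would be applied once to the image features (giving <math>L</math>) and once to the grid positions (giving <math>K</math>).<br />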
<br />
We obtained 284 images from http://www.flickr.com which were resized and downsampled to 40×40 pixels. We converted the images from RGB into Lab color space, yielding 40×40×3-dimensional objects. The grid, corresponding to <math>X</math>, consists of the letters 'NIPS 2008', on which the images<br />
are to be laid out. After sorting we display the images according to their matching coordinates (Figure 1).<br />
<center>[[File:Kernelized Sorting-Fig1.JPG]]</center><br />
We can see that images with similar color composition are found at proximal locations.<br />
<br />
==Summary==<br />
<br />
We generalize sorting by maximizing dependency between matched pairs of observations via HSIC.<br />
<br />
Applications of our proposed sorting algorithm range from data visualization to image, data attribute and multilingual document matching.</div>Amir
<hr />
<div></div>Amirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=kernelized_Sorting&diff=3802kernelized Sorting2009-08-03T00:38:14Z<p>Amir: /* Application */</p>
<hr />
<div>Object matching is a fundamental operation in data analysis. It typically requires the definition of a similarity measure between classes of objects to be matched. Instead, we develop an approach which is able to perform matching by requiring a similarity measure only within each of the classes. This is achieved by maximizing the dependency between matched pairs of observations by means of the Hilbert Schmidt Independence Criterion. This problem can be cast as one of maximizing a quadratic assignment problem with special structure and we present a simple algorithm for finding a locally optimal solution. <br />
<br />
==Introduction==<br />
===Problem Statement===<br />
Assume we are given two collections of documents purportedly covering the same content, written in two different languages. Can we determine the correspondence between these two sets of documents without using a dictionary?<br />
<br />
===Sorting and Matching===<br />
(Formal) problem formulation:<br />
<br />
Given two sets of observations <math> X= \{ x_{1},...,<br />
x_{m} \}\subseteq \mathcal X</math> and <math>Y=\{ y_{1},..., y_{m}\}\subseteq \mathcal Y </math><br />
<br />
Find a permutation matrix <math>\pi \in \Pi_{m}</math>,<br />
<br />
<math> \Pi_{m}:= \{ \pi | \pi \in \{0,1\}^{m \times m} where<br />
\pi 1_{m}=1_{m}, <br />
\pi^{T}1_{m}=1_{m}\}</math><br />
<br />
such that <math> \{ (x_{i},y_{\pi (i)}) for 1 \leqslant i \leqslant m \}<br />
</math> is maximally dependent. Here <math>1_{m} \in \mathbb{R}^{m}</math> is the<br />
vector of all ones.<br />
<br />
Denote by <math>D(Z(\pi))</math> a measure of the dependence between x and y, where <math> Z(\pi) := \{ (x_{i},y_{\pi (i)}) for 1 \leqslant i \leqslant m \}<br />
</math>. <br />
<br />
Then we define nonparametric sorting of X and Y as follows<br />
<br />
<math><br />
\pi^{\ast}:=\arg\max_{\pi \in \prod_{m}}D(Z(\pi)).<br />
</math><br />
<br />
==Hilbert Schmidt Independence Criterion==<br />
<br />
Let sets of observations X and Y be drawn jointly from some probability distribution <math>Pr_{xy}</math>. The Hilbert Schmidt Independence Criterion (HSIC) measures the dependence between x and y by computing the norm of the cross-covariance operator over the domain <math> \mathcal X \times \mathcal Y</math> in Hilbert Space.<br />
<br />
let <math>\mathcal {F}</math> be the Reproducing Kernel Hilbert Space (RKHS) on<br />
<math>\mathcal {X}</math> with associated kernel <math>k: \mathcal X \times \mathcal X \rightarrow<br />
\mathbb{R}</math> and feature map <math>\phi: \mathcal X \rightarrow \mathcal {F}</math>.<br />
Let <math>\mathcal {G}</math> be the RKHS on <math>\mathcal Y</math> with kernel <math>l</math> and<br />
feature map <math>\psi</math>. The cross-covariance operator <math>C_{xy}:\mathcal<br />
{G}\rightarrow \mathcal {F}</math> is defined by<br />
<br />
<math><br />
C_{xy}=\mathbb{E}_{xy}[(\phi(x)-\mu_{x})\otimes (\psi(y)-\mu_{y})],<br />
</math><br />
<br />
where <math>\mu_{x}=\mathbb{E}[\phi(x)]</math>, <math>\mu_{y}=\mathbb{E}[\psi(y)]</math>.<br />
<br />
HSIC is the square of the Hilbert-Schmidt norm of the cross covariance operator <math>\, C_{xy}</math><br />
<br />
<math><br />
D(\mathcal {F},\mathcal {G},Pr_{xy}):=\parallel C_{xy}<br />
\parallel_{HS}^{2}.<br />
</math><br />
<br />
In term of kernels, HSIC can be expressed as<br />
<br />
<math><br />
\mathbb{E}_{xx'yy'}[k(x,x')l(y,y')]+\mathbb{E}_{xx'}[k(x,x')]\mathbb{E}_{yy'}[l(y,y')]-2\mathbb{E}_{xy}[\mathbb{E}_{x'}[k(x,x')]\mathbb{E}_{y}[l(y,y')]].<br />
</math><br />
<br />
where <math>\mathbb{E}_{xx'yy'}</math> is the expectation over both <math>\ (x, y)</math> ~<br />
<math>\ Pr_{xy}</math> and an additional pair of variables <math>\ (x', y')</math> ~ <math>\ Pr_{xy}</math><br />
drawn independently according to the same law.<br />
<br />
A biased estimator of HSIC given finite sample <math>Z = \{(x_{i},<br />
y_{i})\}_{i=1}^{m}</math> drawn from <math>Pr_{xy}</math> is<br />
<br />
<math><br />
D(\mathcal {F},\mathcal {G},Z)=(m-1)^{-2}tr HKHL =<br />
(m-1)^{-2} tr \bar{K}\bar{L}<br />
</math><br />
<br />
where <math>K,L\in \mathbb{R}^{m\times m}</math> are the kernel matrices for<br />
the data and the labels respectively, <math>H_{ij}=\delta_{ij}-m^{-1}</math><br />
centers the data and the labels in feature space, <math>\bar{K}:=HKH</math> and<br />
<math>\bar{L}:=HLH</math> denote the centered versions <math>K</math> and <math>L</math> respectively.<br />
<br />
Advantages of HSIC are:<br />
<br />
Computing HSIC is simple: only the kernel matrices K and L are needed;<br />
<br />
HSIC satisfies concentration of measure conditions, i.e. for random draws of observation from <math>Pr_{xy}</math>, HSIC provides values which are very similar;<br />
<br />
Incorporating prior knowledge into the dependence estimation can be done via<br />
kernels.<br />
<br />
==Kernelized Sorting==<br />
===Kernelized Sorting===<br />
'''Claim: ''' Thr problem is equivalent to the optimization problem of <br />
<math><br />
\pi^{\ast}=\arg\max_{\pi \in \Pi_{m}}[tr \bar{K}<br />
\pi^{T}\bar{L}\pi]<br />
</math><br />
<br />
'''Proof''': Firstly, we need to establish <math>H</math> and <math>\pi</math> matrices commute.<br />
<br />
Since <math>H</math> is a centering matrix, we can write it as <math>H=I_{n}-11^{T}</math>.<br />
<br />
Actually, note that <math>\ H\pi=\pi H</math> iff <math>\ (I_{n}-11^{T})\pi=\pi (I_{n}-11^{T})</math> iff <math>\ 11^{T}\pi=\pi 11^{T}</math>, a result follows.<br />
<br />
Next, recall that the biased estimator of HSIC given finite sample <math>Z = \{(x_{i},<br />
y_{i})\}_{i=1}^{m}</math> drawn from <math>Pr_{xy}</math> is<br />
<br />
<math><br />
D(\mathcal {F},\mathcal {G},Z)=(m-1)^{-2}tr HKHL =<br />
(m-1)^{-2} tr \bar{K}\bar{L}<br />
</math><br />
<br />
where <math>K,L\in \mathbb{R}^{m\times m}</math> are the kernel matrices for<br />
the data and the labels respectively, i.e. <math>K=xx^{T}</math> and <math>L=yy^{T}</math>.<br />
<br />
Now, for any given pair <math>(x, y_{r})</math> between <math>X</math> and <math>Y</math>, we have <math>y_{r}=\pi y</math>.<br />
<br />
Note that <math>\pi</math> is a permutation matrix, we have <math>y=\pi^{T} y_{r}</math>, so the kernel matrix <math>L=\pi^{T}y_{r}y_{r}^{T}\pi</math>.<br />
<br />
Note that the kernel matrix <math>L_{r}=y_{r}y_{r}^{T}</math>, so the kernel matrix <math>L=\pi^{T}L_{r}\pi</math>.<br />
<br />
Note that <math>tr HKHL = tr HKHHLH </math>, since <math>H</math> is idempotent.<br />
<br />
So we have <math>tr HKHL = tr HKHHLH = tr \bar K H\pi^{T}L_{r}\pi H = tr \bar K \pi^{T}HL_{r}H\pi = tr \bar K \pi^{T}\bar L_{r}\pi </math>. <br />
<br />
Clearly, it is just our objective function.<br />
<br />
====Sorting as a special case====<br />
For general kernel matrices <math>K \,</math> and <math>L \,</math>, where <math>K_{ij}=k(x_i,x_j) \,</math> and <math>L_{ij}=l(x_i,x_j) \,</math>, the objective of the kernelized sorting problem, as explained above, is to find the permutation matrix <math> \pi \,</math> which maximizes <math>tr(\bar{K} \pi^{T}\bar{L}\pi ) = tr(HKH\pi^{T}HLH\pi)\, </math>.<br />
<br />
In the special case where the kernel functions <math>k\,</math> and <math>l\,</math> are the inner product in Euclidean space, we have <math>K=xx^{T}\,</math> and <math>L=yy^{T}\,</math>. Hence, we can rewrite the objective as <br />
<br />
<math>tr(HKH\pi^{T}HLH\pi) = tr(Hxx^{T}H\pi^{T}Hyy^{T}H\pi) = tr[Hx(Hx)^T\pi^{T}Hy(Hy)^T\pi] = tr[((Hx)^T\pi^{T}Hy) ((Hy)^T\pi Hx))]\,</math>, where the last step uses the property that trace is invariant under cyclic permutations.<br />
<br />
Note that <math>(Hx)^T\pi^{T}Hy \, </math> and <math> (Hy)^T\pi Hx = (Hx)^T\pi^{T}Hy \,</math> are scalars, therefore the objective is equal to <math> [(Hx)^T\pi^{T}(Hy)]^2 \,</math>.<br />
<br />
In the even more special case where the Euclidean space is the real line and the inner product is multiplication of real numbers, the centering matrix <math>H\,</math> merely translates the sample vector <math>y \,</math> (by the sample mean) and thus the order of <math>y \,</math> is preserved. Hence, maximizing <math> [(Hx)^T\pi^{T}(Hy)]^2 \,</math> can be achieved by maximizing <math>x^T \pi y \,</math> (replacing <math>\pi^{T}</math> by <math>\pi</math> is harmless, since both range over all permutation matrices). Under the further assumption that <math>x \,</math> is sorted ascendingly, maximizing <math> x^T \pi y \,</math> is equivalent to sorting <math>y \,</math> ascendingly, according to the Polya-Littlewood-Hardy inequality.<br />
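The Polya-Littlewood-Hardy argument can be checked by brute force on a small sample (an illustrative sketch, not part of the paper):<br />

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
m = 6
x = np.sort(rng.normal(size=m))      # x sorted ascendingly
y = rng.normal(size=m)

# exhaustively maximize x^T pi y over all permutations
best = max(itertools.permutations(range(m)),
           key=lambda p: x @ y[list(p)])

# the maximizing permutation arranges y in ascending order
assert np.allclose(y[list(best)], np.sort(y))
```
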
<br />
===Diagonal Dominance===<br />
To reduce the bias caused by the dominant diagonal entries of the kernel matrices, replace the expectations by sums in which no pairwise summation indices are identical. This leads to the objective function:<br />
<br />
<math><br />
\frac{1}{m(m-1)}\sum_{i\ne<br />
j}K_{ij}L_{ij}+\frac{1}{m^{2}(m-1)^{2}}\sum_{i\ne j,u\ne<br />
v}K_{ij}L_{uv}- \frac{2}{m(m-1)^2}\sum_{i,j\ne i,v\ne i}K_{ij}L_{iv}<br />
</math><br />
<br />
Using <math>\bar{K}_{ij}=K_{ij}(1-\delta_{ij})</math> and<br />
<math>\bar{L}_{ij}=L_{ij}(1-\delta_{ij})</math> for the kernel matrices whose<br />
main-diagonal terms have been removed, we arrive at the<br />
expression <math>(m-1)^{-1}tr<br />
H\bar{L}H\bar{K}</math>.<br />
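The diagonal-removed estimate is straightforward to compute; a small sketch with Gaussian RBF kernels, whose unit diagonals make the dominance concern concrete (names are ours):<br />

```python
import numpy as np

rng = np.random.default_rng(7)
m = 6
x = rng.normal(size=(m, 2))
y = rng.normal(size=(m, 2))

# RBF kernels: every diagonal entry equals 1, dominating off-diagonal terms
K = np.exp(-((x[:, None, :] - x[None, :, :]) ** 2).sum(-1))
L = np.exp(-((y[:, None, :] - y[None, :, :]) ** 2).sum(-1))

K_bar = K * (1 - np.eye(m))          # zero out the main diagonals
L_bar = L * (1 - np.eye(m))
H = np.eye(m) - np.ones((m, m)) / m
d = np.trace(H @ L_bar @ H @ K_bar) / (m - 1)
assert np.isfinite(d)
```
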
<br />
===Relaxation to a constrained eigenvalue problem===<br />
An approximate solution of the problem can be obtained by solving <br><br />
<br />
<math><br />
\text{maximize}_{\eta} \left\{ \eta^{T}M\eta \right\} \quad \text{subject to } A\eta=b<br />
</math><br />
<br />
Here the matrix <math>M=K\otimes L\in \mathbb{R}^{m^{2}\times{m^2}}</math> is<br />
given by the Kronecker product of the constituting kernel matrices,<br />
<math>\eta \in \mathbb{R}^{m^2}</math> is a vectorized version of the<br />
permutation matrix <math>\pi</math>, and the constraints imposed by <math>A</math> and <math>b</math><br />
amount to the polytope constraints imposed by <math>\Pi_{m}</math>.<br />
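The quadratic form can be made concrete: under column-major vectorization, <math>tr \bar{K}\pi^{T}\bar{L}\pi = \eta^{T}(\bar{K}\otimes\bar{L})\eta</math>, which the following sketch verifies (we use the centered matrices here; variable names are ours):<br />

```python
import numpy as np

rng = np.random.default_rng(2)
m = 4
H = np.eye(m) - np.ones((m, m)) / m
x = rng.normal(size=(m, 3))
y = rng.normal(size=(m, 2))
K_bar = H @ x @ x.T @ H
L_bar = H @ y @ y.T @ H
pi = np.eye(m)[rng.permutation(m)]

eta = pi.flatten(order="F")          # column-major vectorization of pi
M = np.kron(K_bar, L_bar)            # Kronecker product, m^2 x m^2
lhs = eta @ M @ eta
rhs = np.trace(K_bar @ pi.T @ L_bar @ pi)
assert np.isclose(lhs, rhs)
```
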
<br />
===Related Work===<br />
Mutual Information is defined as, <math>I(X,Y)=h(X)+h(Y)-h(X,Y)</math>. We can<br />
approximate MI maximization by maximizing its lower bound. This then<br />
corresponds to minimizing an upper bound on the joint<br />
entropy <math>h(X,Y)</math>.<br />
<br />
The optimization problem becomes<br />
<br />
<math><br />
\pi^{\ast}=\arg\min_{\pi \in \Pi_{m}}\log |HJ(\pi)H|,<br />
</math><br />
<br />
where <math>|\cdot|</math> denotes the determinant and <math>\ J_{ij}=K_{ij}L_{\pi(i),\pi(j)}</math>. This is related to the<br />
optimization criterion proposed by Jebara (2004) in the context of<br />
aligning bags of observations by sorting via minimum volume PCA.<br />
<br />
===Multivariate Extensions===<br />
Let there be <math>T</math> random variables <math>x_i \in {\mathcal X}_i</math> which are jointly drawn from some distribution <math>p(x_1,\ldots,x_T)</math>. The expectation operator with respect to the joint distribution and with respect to the product of the marginals is given by<br />
<br />
<math><br />
\mathbb{E}_{x_1,...,x_T}[\prod_{i=1}^{T}k_{i}(x_{i},\cdot)]</math> and <math>\prod_{i=1}^{T}\mathbb{E}_{x_i}[k_{i}(x_{i},\cdot)]<br />
</math><br />
<br />
respectively. Both terms are equal if and only if all random variables are independent. The squared difference between both is given by<br />
<br />
<math><br />
\mathbb{E}_{x_{i=1}^T,{x'}_{i=1}^{T}}[\prod_{i=1}^{T}k_{i}(x_{i},x_{i}^{'})]+\prod_{i=1}^{T}\mathbb{E}_{x_{i},x_{i}^{'}}[k_{i}(x_{i},x_{i}^{'})]-2\mathbb{E}_{x_{i=1}^{T}}[\prod_{i=1}^{T}\mathbb{E}_{x_{i}^{'}}[k(x_{i},x_{i}^{'})]]<br />
</math><br />
<br />
which we refer to as multiway HSIC.<br />
<br />
Denote by <math>K_{i}</math> the kernel matrix obtained from the kernel <math>k_{i}</math> on the set of observations <math>X_{i}:=\{x_{i1},...,x_{im}\}</math>, the empirical estimate is given by<br />
<br />
<math><br />
HSIC[X_{1},...,X_{T}]:=1_{m}^{T}(\bigodot_{i=1}^{T}K_{i})1_{m}+\prod_{i=1}^{T}1_{m}^{T}K_{i}1_{m}-2\cdot1_{m}^{T}(\bigodot_{i=1}^{T}K_{i}1_{m})<br />
</math><br />
<br />
where <math>\bigodot_{i=1}^{T}</math> denotes the elementwise product of its arguments. To apply this to sorting we only need to define <math>T</math> permutation matrices <math>\pi_{i} \in \Pi_{m}</math> and replace the kernel matrices <math>K_{i}</math> by <math>\pi_{i}^{T}K_{i}\pi_{i}</math>.<br />
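A direct transcription of the empirical estimate (normalization constants are omitted, as in the displayed formula; names are ours):<br />

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(3)
m, T = 6, 3
Ks = []
for _ in range(T):
    z = rng.normal(size=(m, 2))
    Ks.append(z @ z.T)               # one (linear) kernel matrix per variable

one = np.ones(m)
term1 = one @ reduce(np.multiply, Ks) @ one                # 1'(elementwise prod K_i)1
term2 = np.prod([one @ K @ one for K in Ks])               # prod of 1'K_i 1
term3 = one @ reduce(np.multiply, [K @ one for K in Ks])   # 1'(elementwise prod K_i 1)
hsic = term1 + term2 - 2.0 * term3
```
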
<br />
==Optimization==<br />
===Convex Objective and Convex Domain===<br />
<br />
Relax <math>\pi</math> to a doubly stochastic matrix,<br />
<br />
<math><br />
P_{m}:=\{\pi \in \mathbb{R}^{m \times m} \mid<br />
\pi_{ij}\geqslant 0,\ \sum_{i}\pi_{ij}=1,\ \sum_{j}\pi_{ij}=1\}<br />
</math><br />
<br />
The objective function <br />
<math>tr K<br />
\pi^{T}L\pi</math> is convex in <math>\pi</math>. Since a convex function attains its maximum over a compact convex set at an extreme point, and the extreme points of <math>P_{m}</math> are exactly the permutation matrices, a maximizer of the relaxed problem can still be taken to be a permutation matrix.<br />
<br />
===Convex-Concave Procedure===<br />
<br />
Compute successive linear lower bounds and maximize<br />
<math><br />
\pi_{i+1}\leftarrow \arg\max_{\pi \in P_{m}}[tr<br />
\bar{K} \pi^{T}\bar{L} \pi_{i}]<br />
</math><br />
<br />
This will converge to a local maximum.<br />
<br />
Initialization is done via the sorted principal eigenvector.<br />
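Maximizing <math>tr \bar{K}\pi^{T}\bar{L}\pi_{i}</math> over permutation matrices is a linear assignment problem with profit matrix <math>\bar{L}\pi_{i}\bar{K}</math>. A minimal sketch of the resulting iteration (function names are ours; the brute-force assignment step stands in for a proper solver such as the Hungarian method):<br />

```python
import itertools
import numpy as np

def solve_lap(C):
    # exact linear assignment by brute force (fine for small m);
    # at scale this step would use e.g. scipy.optimize.linear_sum_assignment
    m = C.shape[0]
    best = max(itertools.permutations(range(m)),
               key=lambda p: sum(C[i, p[i]] for i in range(m)))
    P = np.zeros((m, m))
    P[np.arange(m), list(best)] = 1.0
    return P

def kernelized_sort(K, L, iters=50):
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    Kb, Lb = H @ K @ H, H @ L @ H
    pi = np.eye(m)                       # a simple initialization
    for _ in range(iters):
        # maximizing tr(Kb pi^T Lb pi_i) over permutations pi is a
        # linear assignment with profit matrix Lb @ pi_i @ Kb
        pi_new = solve_lap(Lb @ pi @ Kb)
        if np.allclose(pi_new, pi):
            break                        # fixed point: local maximum reached
        pi = pi_new
    return pi

rng = np.random.default_rng(5)
x = rng.normal(size=(6, 2))
K = x @ x.T
pi = kernelized_sort(K, K)           # matching a data set to itself
assert np.allclose(pi, np.eye(6))    # recovers the identity matching
```
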
<br />
===A tentative explanation for this part===<br />
Basically, I think the optimization method used in this paper does not apply the Concave-Convex Procedure exactly. As I said on Tuesday, I think it just "borrowed" the idea from the Concave-Convex Procedure, since there is no concave part in this problem.<br />
<br />
According to the paper, the Concave-Convex Procedure works as<br />
follows: for <math>f(x)=g(x)-h(x)</math>, where <math>g</math> is convex and <math>h</math> is concave, a<br />
lower bound can be found by<br />
<br />
<math><br />
f(x) \ge g(x_{0}) + \langle x-x_{0},\partial_{x}g(x_{0}) \rangle<br />
-h(x)<br />
</math><br />
<br />
The objective function in the kernelized sorting method can be written in the following form<br />
<br />
<math>f(\pi)=g(\pi)= tr \bar{K}<br />
\pi^{T}\bar{L}\pi</math><br />
<br />
Currently, suppose we have <math>\pi_{0}</math> and <math>g(\pi_{0}) = tr\bar{K} \pi_{0}^{T}\bar{L}\pi_{0}</math>.<br />
<br />
We know that <math>\bigtriangledown_{A} tr ABA^{T}C=CAB+C^{T}AB^{T}</math>.<br />
<br />
So <math> \bigtriangledown_{\pi} tr \bar K<br />
\pi^{T} \bar L \pi=\bigtriangledown_{\pi} tr \pi \bar K \pi^{T} \bar L = \bar L \pi \bar K+\bar L^{T} \pi \bar K^{T}</math>.<br />
Since <math>\bar K</math> and <math>\bar L</math> are symmetric matrices, we get<br />
<br />
<math>\bigtriangledown_{\pi} tr \pi \bar<br />
K \pi^{T} \bar L = 2\bar L \pi \bar K</math>.<br />
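The gradient formula can be confirmed against central finite differences (a quick numerical check; all names are ours):<br />

```python
import numpy as np

rng = np.random.default_rng(6)
m = 4
A = rng.normal(size=(m, m))
Kb = A @ A.T                          # symmetric stand-in for K bar
B = rng.normal(size=(m, m))
Lb = B @ B.T                          # symmetric stand-in for L bar

P = rng.normal(size=(m, m))
grad = 2 * Lb @ P @ Kb                # claimed gradient of tr(P Kb P^T Lb)

eps = 1e-6
num = np.zeros((m, m))
for i in range(m):
    for j in range(m):
        E = np.zeros((m, m))
        E[i, j] = eps
        num[i, j] = (np.trace((P + E) @ Kb @ (P + E).T @ Lb)
                     - np.trace((P - E) @ Kb @ (P - E).T @ Lb)) / (2 * eps)
assert np.allclose(grad, num, atol=1e-4)
```
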
<br />
Hence, we get<br />
<math><br />
\langle \pi - \pi_{0}, \bigtriangledown_{\pi} tr \pi_{0} \bar K \pi_{0}^{T} \bar L\rangle <br />
=\langle \pi - \pi_{0}, 2\bar L \pi_{0} \bar K\rangle<br />
=2tr (\pi - \pi_{0})^{T}\bar L \pi_{0} \bar K<br />
=2tr \bar K(\pi - \pi_{0})^{T}\bar L \pi_{0}<br />
</math><br />
<br />
In this case, we can get<br />
<math><br />
f(\pi)\ge tr <br />
\bar{K} \pi_{0}^{T}\bar{L}\pi_{0} + 2tr \bar K(\pi -<br />
\pi_{0})^{T}\bar L \pi_{0} </math><br />
<br />
Since we only want to maximize this lower bound over <math>\pi</math>, the factor <math>2</math> and the terms that do not depend on <math>\pi</math> are irrelevant. Expanding <math>tr \bar K(\pi - \pi_{0})^{T}\bar L \pi_{0}</math> and cancelling the constant terms, the maximization reduces to that of<br />
<br />
<math>tr <br />
\bar{K} \pi^{T}\bar{L}\pi_{0}, </math><br />
<br />
which is exactly the linear lower bound maximized repeatedly in the iteration above.<br />
<br />
Actually, I think that if the kernel matrices <math>K</math> and <math>L</math> are well defined and already computed, there are not too many parameters in this optimization problem, which means some stochastic gradient descent method could be applied. In fact, I think this problem is easier than the assignment problem, based on the same argument about the number of parameters. Hence, interpreting it as a TSP-like problem and using, for example, a simulated annealing algorithm is an acceptable method.<br />
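A simulated annealing variant along these lines might look as follows (a hypothetical sketch, not the paper's method: transposition proposals with a Metropolis acceptance rule; all names are ours):<br />

```python
import math
import numpy as np

def anneal_sort(Kb, Lb, steps=2000, seed=0):
    """Simulated annealing over permutations for tr(Kb P^T Lb P)."""
    rng = np.random.default_rng(seed)
    m = Kb.shape[0]

    def score(p):
        P = np.eye(m)[p]
        return np.trace(Kb @ P.T @ Lb @ P)

    perm = rng.permutation(m)
    cur = score(perm)
    for t in range(steps):
        temp = max(1e-3, 1.0 - t / steps)        # linear cooling schedule
        i, j = rng.integers(m), rng.integers(m)
        cand = perm.copy()
        cand[[i, j]] = cand[[j, i]]              # propose a transposition
        s = score(cand)
        if s >= cur or rng.random() < math.exp((s - cur) / temp):
            perm, cur = cand, s
    return perm, cur

rng = np.random.default_rng(8)
z = rng.normal(size=(5, 2))
Kb = z @ z.T
perm, cur = anneal_sort(Kb, Kb)
assert sorted(perm.tolist()) == list(range(5))   # a valid permutation
```
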
<br />
== Application ==<br />
<br />
Assume that we may want to visualize data according to the metric structure inherent in it. More specifically, our objective is to align it <br />
according to a given template, such as a grid, a torus, or any other fixed structure. Such problems occur when presenting images or documents to a user. Most algorithms for low-dimensional object layout suffer from the problem that the low-dimensional presentation is nonuniform. This has the advantage of revealing cluster structure but, given limited screen size, such a presentation is undesirable. To address this problem, we can use kernelized sorting to align objects. In this scenario the kernel matrix <math>L</math> is given by the similarity measure between the objects <math>x_i</math> that are to be aligned. The kernel <math>K</math>, on the other hand, denotes the similarity between the locations where objects are to be aligned to. For the sake of simplicity we used a Gaussian RBF kernel both between the objects to be laid out and between the positions of the grid, i.e. <math>k(x,x') = \exp(-\gamma \|x-x'\|^2)</math>. The kernel width <math>\gamma</math> was adjusted to the inverse of <math>\|x-x'\|^2</math> such that the argument of the exponential is <math>O(1)</math>.<br />
<br />
We obtained 284 images from http://www.flickr.com, which were resized and downsampled to 40×40 pixels. We converted the images from RGB into Lab color space, yielding 40×40×3-dimensional objects.<br />
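The grid-side kernel described above can be sketched as follows (the grid size and the use of the mean squared distance to set <math>\gamma</math> are our choices for illustration):<br />

```python
import numpy as np

# layout grid positions and a Gaussian RBF kernel between them, with
# gamma set to the inverse mean squared distance so the exponent is O(1)
gx, gy = np.meshgrid(np.arange(4), np.arange(5))
grid = np.stack([gx.ravel(), gy.ravel()], axis=1).astype(float)  # 20 positions

d2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(-1)
gamma = 1.0 / d2[d2 > 0].mean()
K = np.exp(-gamma * d2)
assert K.shape == (20, 20) and np.allclose(np.diag(K), 1.0)
```
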
<br />
==Summary==<br />
<br />
We generalize sorting by maximizing dependency between matched pairs of observations via HSIC.<br />
<br />
Applications of our proposed sorting algorithm range from data visualization to image, data attribute and multilingual document matching.</div>
<hr />
<div>Object matching is a fundamental operation in data analysis. It typically requires the definition of a similarity measure between classes of objects to be matched. Instead, we develop an approach which is able to perform matching by requiring a similarity measure only within each of the classes. This is achieved by maximizing the dependency between matched pairs of observations by means of the Hilbert Schmidt Independence Criterion. This problem can be cast as one of maximizing a quadratic assignment problem with special structure and we present a simple algorithm for finding a locally optimal solution. <br />
<br />
==Introduction==<br />
===Problem Statement===<br />
Assume we are given two collections of documents purportedly covering the same content, written in two different languages. Can we determine the correspondence between these two sets of documents without using a dictionary?<br />
<br />
===Sorting and Matching===<br />
(Formal) problem formulation:<br />
<br />
Given two sets of observations <math> X= \{ x_{1},...,<br />
x_{m} \}\subseteq \mathcal X</math> and <math>Y=\{ y_{1},..., y_{m}\}\subseteq \mathcal Y </math><br />
<br />
Find a permutation matrix <math>\pi \in \Pi_{m}</math>,<br />
<br />
<math> \Pi_{m}:= \{ \pi | \pi \in \{0,1\}^{m \times m} where<br />
\pi 1_{m}=1_{m}, <br />
\pi^{T}1_{m}=1_{m}\}</math><br />
<br />
such that <math> \{ (x_{i},y_{\pi (i)}) for 1 \leqslant i \leqslant m \}<br />
</math> is maximally dependent. Here <math>1_{m} \in \mathbb{R}^{m}</math> is the<br />
vector of all ones.<br />
<br />
Denote by <math>D(Z(\pi))</math> a measure of the dependence between x and y, where <math> Z(\pi) := \{ (x_{i},y_{\pi (i)}) for 1 \leqslant i \leqslant m \}<br />
</math>. <br />
<br />
Then we define nonparametric sorting of X and Y as follows<br />
<br />
<math><br />
\pi^{\ast}:=\arg\max_{\pi \in \prod_{m}}D(Z(\pi)).<br />
</math><br />
<br />
==Hilbert Schmidt Independence Criterion==<br />
<br />
Let sets of observations X and Y be drawn jointly from some probability distribution <math>Pr_{xy}</math>. The Hilbert Schmidt Independence Criterion (HSIC) measures the dependence between x and y by computing the norm of the cross-covariance operator over the domain <math> \mathcal X \times \mathcal Y</math> in Hilbert Space.<br />
<br />
let <math>\mathcal {F}</math> be the Reproducing Kernel Hilbert Space (RKHS) on<br />
<math>\mathcal {X}</math> with associated kernel <math>k: \mathcal X \times \mathcal X \rightarrow<br />
\mathbb{R}</math> and feature map <math>\phi: \mathcal X \rightarrow \mathcal {F}</math>.<br />
Let <math>\mathcal {G}</math> be the RKHS on <math>\mathcal Y</math> with kernel <math>l</math> and<br />
feature map <math>\psi</math>. The cross-covariance operator <math>C_{xy}:\mathcal<br />
{G}\rightarrow \mathcal {F}</math> is defined by<br />
<br />
<math><br />
C_{xy}=\mathbb{E}_{xy}[(\phi(x)-\mu_{x})\otimes (\psi(y)-\mu_{y})],<br />
</math><br />
<br />
where <math>\mu_{x}=\mathbb{E}[\phi(x)]</math>, <math>\mu_{y}=\mathbb{E}[\psi(y)]</math>.<br />
<br />
HSIC is the square of the Hilbert-Schmidt norm of the cross covariance operator <math>\, C_{xy}</math><br />
<br />
<math><br />
D(\mathcal {F},\mathcal {G},Pr_{xy}):=\parallel C_{xy}<br />
\parallel_{HS}^{2}.<br />
</math><br />
<br />
In term of kernels, HSIC can be expressed as<br />
<br />
<math><br />
\mathbb{E}_{xx'yy'}[k(x,x')l(y,y')]+\mathbb{E}_{xx'}[k(x,x')]\mathbb{E}_{yy'}[l(y,y')]-2\mathbb{E}_{xy}[\mathbb{E}_{x'}[k(x,x')]\mathbb{E}_{y}[l(y,y')]].<br />
</math><br />
<br />
where <math>\mathbb{E}_{xx'yy'}</math> is the expectation over both <math>\ (x, y)</math> ~<br />
<math>\ Pr_{xy}</math> and an additional pair of variables <math>\ (x', y')</math> ~ <math>\ Pr_{xy}</math><br />
drawn independently according to the same law.<br />
<br />
A biased estimator of HSIC given finite sample <math>Z = \{(x_{i},<br />
y_{i})\}_{i=1}^{m}</math> drawn from <math>Pr_{xy}</math> is<br />
<br />
<math><br />
D(\mathcal {F},\mathcal {G},Z)=(m-1)^{-2}tr HKHL =<br />
(m-1)^{-2} tr \bar{K}\bar{L}<br />
</math><br />
<br />
where <math>K,L\in \mathbb{R}^{m\times m}</math> are the kernel matrices for<br />
the data and the labels respectively, <math>H_{ij}=\delta_{ij}-m^{-1}</math><br />
centers the data and the labels in feature space, <math>\bar{K}:=HKH</math> and<br />
<math>\bar{L}:=HLH</math> denote the centered versions <math>K</math> and <math>L</math> respectively.<br />
<br />
Advantages of HSIC are:<br />
<br />
* Computing HSIC is simple: only the kernel matrices <math>K</math> and <math>L</math> are needed;<br />
* HSIC satisfies concentration of measure conditions, i.e. for random draws of observations from <math>Pr_{xy}</math>, HSIC provides values which are very similar;<br />
* Prior knowledge can be incorporated into the dependence estimation via the choice of kernels.<br />
<br />
==Kernelized Sorting==<br />
===Kernelized Sorting===<br />
'''Claim: ''' The problem is equivalent to the optimization problem of <br />
<math><br />
\pi^{\ast}=\arg\max_{\pi \in \Pi_{m}}[tr \bar{K}<br />
\pi^{T}\bar{L}\pi]<br />
</math><br />
<br />
'''Proof''': Firstly, we need to establish that the matrices <math>H</math> and <math>\pi</math> commute.<br />
<br />
Since <math>H</math> is a centering matrix, we can write it as <math>H=I_{m}-\tfrac{1}{m}11^{T}</math>.<br />
<br />
Indeed, <math>\ H\pi=\pi H</math> iff <math>\ (I_{m}-\tfrac{1}{m}11^{T})\pi=\pi (I_{m}-\tfrac{1}{m}11^{T})</math> iff <math>\ 11^{T}\pi=\pi 11^{T}</math>. Since the rows and columns of a permutation matrix each sum to one, <math>\ 11^{T}\pi=11^{T}=\pi 11^{T}</math>, and the result follows.<br />
<br />
Next, recall that the biased estimator of HSIC given finite sample <math>Z = \{(x_{i},<br />
y_{i})\}_{i=1}^{m}</math> drawn from <math>Pr_{xy}</math> is<br />
<br />
<math><br />
D(\mathcal {F},\mathcal {G},Z)=(m-1)^{-2}tr HKHL =<br />
(m-1)^{-2} tr \bar{K}\bar{L}<br />
</math><br />
<br />
where <math>K,L\in \mathbb{R}^{m\times m}</math> are the kernel matrices for<br />
the data and the labels respectively, i.e. <math>K=xx^{T}</math> and <math>L=yy^{T}</math>.<br />
<br />
Now, suppose the observations in <math>Y</math> are rearranged by a permutation matrix <math>\pi</math>, so that the matched labels are <math>y_{r}=\pi y</math>.<br />
<br />
Note that <math>\pi</math> is a permutation matrix, we have <math>y=\pi^{T} y_{r}</math>, so the kernel matrix <math>L=\pi^{T}y_{r}y_{r}^{T}\pi</math>.<br />
<br />
Note that the kernel matrix <math>L_{r}=y_{r}y_{r}^{T}</math>, so the kernel matrix <math>L=\pi^{T}L_{r}\pi</math>.<br />
<br />
Note that <math>tr HKHL = tr HKHHLH </math>, since <math>H</math> is idempotent.<br />
<br />
So we have <math>tr HKHL = tr HKHHLH = tr \bar K H\pi^{T}L_{r}\pi H = tr \bar K \pi^{T}HL_{r}H\pi = tr \bar K \pi^{T}\bar L_{r}\pi </math>. <br />
<br />
Clearly, it is just our objective function.<br />
<br />
====Sorting as a special case====<br />
For general kernel matrices <math>K \,</math> and <math>L \,</math>, where <math>K_{ij}=k(x_i,x_j) \,</math> and <math>L_{ij}=l(y_i,y_j) \,</math>, the objective of the kernelized sorting problem, as explained above, is to find the permutation matrix <math> \pi \,</math> which maximizes <math>tr(\bar{K} \pi^{T}\bar{L}\pi ) = tr(HKH\pi^{T}HLH\pi)\, </math>.<br />
<br />
In the special case where the kernel functions <math>k\,</math> and <math>l\,</math> are the inner product in Euclidean space, we have <math>K=xx^{T}\,</math> and <math>L=yy^{T}\,</math>. Hence, we can rewrite the objective as <br />
<br />
<math>tr(HKH\pi^{T}HLH\pi) = tr(Hxx^{T}H\pi^{T}Hyy^{T}H\pi) = tr[Hx(Hx)^T\pi^{T}Hy(Hy)^T\pi] = tr[((Hx)^T\pi^{T}Hy) ((Hy)^T\pi Hx))]\,</math>, where the last step uses the property that trace is invariant under cyclic permutations.<br />
<br />
Note that <math>(Hx)^T\pi^{T}Hy \, </math> and <math> (Hy)^T\pi Hx = (Hx)^T\pi^{T}Hy \,</math> are equal scalars, therefore the objective is equal to <math> [(Hx)^T\pi^{T}Hy]^2 \,</math>.<br />
<br />
In the even more special case where the Euclidean space is the real line and the inner product is multiplication of real numbers, the centering matrix <math>H\,</math> merely translates the sample vector <math>y \,</math> (by the sample mean) and thus the order of <math>y \,</math> is preserved. Hence, maximizing <math> [(Hx)^T\pi^{T}Hy]^2 \,</math> can be solved by maximizing <math>x^T \pi y \,</math>. Under the further assumption that <math>x \,</math> is sorted ascendingly, maximizing <math> x^T \pi y \,</math> is equivalent to sorting <math>y \,</math> ascendingly, according to the Hardy-Littlewood-Polya rearrangement inequality.<br />
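The scalar special case can be checked by brute force. The sketch below uses made-up numbers; `best_matching` is a hypothetical helper that enumerates all permutations, so it is feasible only for tiny <math>m</math>:

```python
import itertools
import numpy as np

def best_matching(x, y):
    """Brute-force argmax over permutations of x . (pi y); tiny m only."""
    best_perm, best_val = None, -float("inf")
    for perm in itertools.permutations(range(len(y))):
        val = float(np.dot(x, y[list(perm)]))
        if val > best_val:
            best_perm, best_val = perm, val
    return best_perm

x = np.array([1.0, 3.0, 2.0])
y = np.array([10.0, 30.0, 20.0])
perm = best_matching(x, y)
```

The optimum pairs equal ranks of <math>x</math> and <math>y</math>, exactly as the rearrangement inequality predicts.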
<br />
===Diagonal Dominance===<br />
The biased estimate can be dominated by the diagonal entries of the kernel matrices. To reduce this bias, replace the expectations by sums in which no pairwise summation indices are identical. This leads to the objective function:<br />
<br />
<math><br />
\frac{1}{m(m-1)}\sum_{i\ne<br />
j}K_{ij}L_{ij}+\frac{1}{m^{2}(m-1)^{2}}\sum_{i\ne j,u\ne<br />
v}K_{ij}L_{uv}- \frac{2}{m(m-1)^2}\sum_{i,j\ne i,v\ne i}K_{ij}L_{iv}<br />
</math><br />
<br />
Using <math>\bar{K}_{ij}=K_{ij}(1-\delta_{ij})</math> and <math>\bar{L}_{ij}=L_{ij}(1-\delta_{ij})</math> (reusing the bar notation) for kernel matrices whose main diagonal terms have been removed, we arrive at the expression <math>(m-1)^{-1}tr H\bar{L}H\bar{K}</math>.<br />
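A minimal sketch of this bias-reduced statistic, assuming the kernel matrices <math>K</math> and <math>L</math> are already computed (the helper name is ours, not the paper's):

```python
import numpy as np

def debiased_stat(K, L):
    """(m-1)^{-1} tr(H Lbar H Kbar) with the diagonals of K and L removed."""
    m = K.shape[0]
    Kbar, Lbar = K.copy(), L.copy()
    np.fill_diagonal(Kbar, 0.0)      # Kbar_ij = K_ij (1 - delta_ij)
    np.fill_diagonal(Lbar, 0.0)
    H = np.eye(m) - np.ones((m, m)) / m
    return float(np.trace(H @ Lbar @ H @ Kbar)) / (m - 1)
```

Note the statistic is symmetric in its two arguments, as a dependence measure should be.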
<br />
===Relaxation to a constrained eigenvalue problem===<br />
An approximate solution can be obtained by solving <br><br />
<br />
<math><br />
\max_{\eta}\; \eta^{T}M\eta \quad \text{subject to} \quad A\eta=b<br />
</math><br />
<br />
Here the matrix <math>M=K\otimes L\in \mathbb{R}^{m^{2}\times{m^2}}</math> is<br />
given by the Kronecker product of the constituting kernel matrices,<br />
<math>\eta \in \mathbb{R}^{m^2}</math> is a vectorized version of the<br />
permutation matrix <math>\pi</math>, and the constraints imposed by <math>A</math> and <math>b</math><br />
amount to the polytope constraints imposed by <math>\Pi_{m}</math>.<br />
<br />
===Related Work===<br />
Mutual Information is defined as, <math>I(X,Y)=h(X)+h(Y)-h(X,Y)</math>. We can<br />
approximate MI maximization by maximizing its lower bound. This then<br />
corresponds to minimizing an upper bound on the joint<br />
entropy <math>h(X,Y)</math>.<br />
<br />
The corresponding optimization problem is<br />
<br />
<math><br />
\pi^{\ast}=\arg\min_{\pi \in \Pi_{m}}\log\left|HJ(\pi)H\right|,<br />
</math><br />
<br />
where <math>\ J_{ij}=K_{ij}L_{\pi(i),\pi(j)}</math>. This is related to the<br />
optimization criterion proposed by Jebara(2004) in the context of<br />
aligning bags of observations by sorting via minimum volume PCA.<br />
<br />
===Multivariate Extensions===<br />
Let there be <math>T</math> random variables <math>x_i \in {\mathcal X}_i</math> which are jointly drawn from some distribution <math>p(x_1,\ldots,x_T)</math>. The expectation operator with respect to the joint distribution and with respect to the product of the marginals is given by<br />
<br />
<math><br />
\mathbb{E}_{x_1,...,x_T}[\prod_{i=1}^{T}k_{i}(x_{i},\cdot)]</math> and <math>\prod_{i=1}^{T}\mathbb{E}_{x_i}[k_{i}(x_{i},\cdot)]<br />
</math><br />
<br />
respectively. Both terms are equal if and only if all random variables are independent. The squared difference between both is given by<br />
<br />
<math><br />
\mathbb{E}_{x_{1},\ldots,x_{T},x'_{1},\ldots,x'_{T}}[\prod_{i=1}^{T}k_{i}(x_{i},x_{i}^{'})]+\prod_{i=1}^{T}\mathbb{E}_{x_{i},x_{i}^{'}}[k_{i}(x_{i},x_{i}^{'})]-2\mathbb{E}_{x_{1},\ldots,x_{T}}[\prod_{i=1}^{T}\mathbb{E}_{x_{i}^{'}}[k_{i}(x_{i},x_{i}^{'})]]<br />
</math><br />
<br />
which we refer to as multiway HSIC.<br />
<br />
Denote by <math>K_{i}</math> the kernel matrix obtained from the kernel <math>k_{i}</math> on the set of observations <math>X_{i}:=\{x_{i1},...,x_{im}\}</math>, the empirical estimate is given by<br />
<br />
<math><br />
HSIC[X_{1},...,X_{T}]:=1_{m}^{T}(\bigodot_{i=1}^{T}K_{i})1_{m}+\prod_{i=1}^{T}1_{m}^{T}K_{i}1_{m}-2\cdot1_{m}^{T}(\bigodot_{i=1}^{T}K_{i}1_{m})<br />
</math><br />
<br />
where <math>\bigodot_{i=1}^{T}</math> denotes the elementwise product of its arguments. To apply this to sorting we only need to define <math>T</math> permutation matrices <math>\pi_{i} \in \Pi_{m}</math> and replace the kernel matrices <math>K_{i}</math> by <math>\pi_{i}^{T}K_{i}\pi_{i}</math>.<br />
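The empirical estimate translates directly into code. The sketch below omits any normalization, just as the display above does; `multiway_hsic` is a hypothetical helper name:

```python
import numpy as np

def multiway_hsic(Ks):
    """1^T (prod K_i) 1 + prod(1^T K_i 1) - 2 * 1^T (prod (K_i 1)),
    where the products are elementwise over the list of kernel matrices."""
    m = Ks[0].shape[0]
    one = np.ones(m)
    had = np.ones((m, m))            # elementwise (Hadamard) product of the K_i
    vec = np.ones(m)                 # elementwise product of the vectors K_i 1
    for K in Ks:
        had = had * K
        vec = vec * (K @ one)
    term1 = one @ had @ one
    term2 = np.prod([one @ K @ one for K in Ks])
    term3 = one @ vec
    return float(term1 + term2 - 2.0 * term3)
```

For <math>T=2</math> the three terms mirror the structure of the pairwise HSIC expression.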
<br />
==Optimization==<br />
===Convex Objective and Convex Domain===<br />
<br />
Relax <math>\pi</math> from a permutation matrix to a doubly stochastic matrix, an element of<br />
<br />
<math><br />
P_{m}:=\{\pi \in \mathbb{R}^{m \times m} \mid<br />
\pi_{ij}\geqslant 0,\ \textstyle\sum_{i}\pi_{ij}=1,\ \textstyle\sum_{j}\pi_{ij}=1\}.<br />
</math><br />
<br />
The objective function<br />
<math>tr \bar{K}<br />
\pi^{T}\bar{L}\pi</math> is convex in <math>\pi</math>, since <math>\bar{K}</math> and <math>\bar{L}</math> are positive semidefinite.<br />
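Iterates over the relaxed domain must lie in <math>P_{m}</math>. One standard way to map a positive matrix into (approximately) the doubly stochastic polytope is Sinkhorn-Knopp normalization; the sketch below is a generic illustration, not necessarily the exact procedure used in the paper:

```python
import numpy as np

def to_doubly_stochastic(A, iters=200):
    """Sinkhorn-Knopp: alternately normalize rows and columns of a positive matrix."""
    A = np.asarray(A, dtype=float).copy()
    for _ in range(iters):
        A /= A.sum(axis=1, keepdims=True)    # rows sum to 1
        A /= A.sum(axis=0, keepdims=True)    # columns sum to 1
    return A

P = to_doubly_stochastic(np.random.default_rng(1).uniform(0.1, 1.0, size=(5, 5)))
```

For strictly positive matrices the alternation converges geometrically, so a few hundred iterations are more than enough here.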
<br />
===Convex-Concave Procedure===<br />
<br />
Compute successive linear lower bounds and maximize<br />
<math><br />
\pi_{i+1}\leftarrow \arg\max_{\pi \in P_{m}}[tr<br />
\bar{K} \pi^{T}\bar{L} \pi_{i}]<br />
</math><br />
<br />
This will converge to a local maximum.<br />
<br />
Initialization is done via the sorted principal eigenvector.<br />
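Each update maximizes a function that is linear in <math>\pi</math>, so its optimum over the polytope is attained at a permutation matrix, and the inner step is a linear assignment problem with score matrix <math>\bar{L}\pi_{i}\bar{K}</math>. The sketch below solves that step by brute-force enumeration (a real implementation would use a proper assignment solver); the data are made up for illustration:

```python
import itertools
import numpy as np

def lap_argmax(G):
    """Permutation matrix P maximizing sum_ab P_ab G_ab (tiny m only)."""
    m = G.shape[0]
    best, best_val = None, -float("inf")
    for perm in itertools.permutations(range(m)):
        val = sum(G[perm[b], b] for b in range(m))
        if val > best_val:
            best, best_val = perm, val
    P = np.zeros((m, m))
    for b, a in enumerate(best):
        P[a, b] = 1.0
    return P

def kernelized_sorting(Kbar, Lbar, iters=5):
    """Iterate pi <- argmax_pi tr(Kbar pi^T Lbar pi_i); objective never decreases."""
    m = Kbar.shape[0]
    P = np.eye(m)
    objs = [float(np.trace(Kbar @ P.T @ Lbar @ P))]
    for _ in range(iters):
        P = lap_argmax(Lbar @ P @ Kbar)      # score matrix of the linearization
        objs.append(float(np.trace(Kbar @ P.T @ Lbar @ P)))
    return P, objs

H = np.eye(4) - np.ones((4, 4)) / 4
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([[3.0], [0.0], [2.0], [1.0]])   # a shuffled copy of x
Kbar = H @ (x @ x.T) @ H
Lbar = H @ (y @ y.T) @ H
P, objs = kernelized_sorting(Kbar, Lbar)
```

On this toy example the objective rises from its initial value and then stays at the optimum, illustrating the monotone convergence to a (here global) maximum.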
<br />
===A tentative explanation for this part===<br />
The optimization method used in this paper does not apply the Concave-Convex Procedure exactly; rather, it borrows the idea from the Concave-Convex Procedure, since there is no concave part in this objective.<br />
<br />
According to the paper, the Concave-Convex Procedure works as<br />
follows: given <math>f(x)=g(x)-h(x)</math>, where <math>g</math> is convex and <math>h</math> is concave, a<br />
lower bound can be found by<br />
<br />
<math><br />
f(x) \ge g(x_{0}) + \langle x-x_{0},\partial_{x}g(x_{0}) \rangle<br />
-h(x)<br />
</math><br />
<br />
The objective function in the kernelized sorting method can be written as<br />
<br />
<math>f(\pi)=g(\pi)= tr \bar{K}<br />
\pi^{T}\bar{L}\pi</math><br />
<br />
Now, suppose we have a current iterate <math>\pi_{0}</math>, with <math>g(\pi_{0}) = tr\bar{K} \pi_{0}^{T}\bar{L}\pi_{0}</math>.<br />
<br />
We know that <math>\bigtriangledown_{A} tr ABA^{T}C=CAB+C^{T}AB^{T}</math>.<br />
<br />
So <math> \bigtriangledown_{\pi} tr \bar K<br />
\pi^{T} \bar L \pi=\bigtriangledown_{\pi} tr \pi \bar K \pi^{T} \bar L = \bar L \pi \bar K+\bar L^{T} \pi \bar K^{T}</math>.<br />
Since <math>\bar K</math> and <math>\bar L</math> are symmetric matrices, we get<br />
<br />
<math>\bigtriangledown_{\pi} tr \pi \bar<br />
K \pi^{T} \bar L = 2\bar L \pi \bar K</math>.<br />
<br />
Hence, we get<br />
<math><br />
\langle \pi - \pi_{0}, \bigtriangledown_{\pi} tr \pi_{0} \bar K \pi_{0}^{T} \bar L\rangle <br />
=\langle \pi - \pi_{0}, 2\bar L \pi_{0} \bar K\rangle<br />
=2tr (\pi - \pi_{0})^{T}\bar L \pi_{0} \bar K<br />
=2tr \bar K(\pi - \pi_{0})^{T}\bar L \pi_{0}<br />
</math><br />
<br />
In this case, we can get<br />
<math><br />
f(\pi)\ge tr <br />
\bar{K} \pi_{0}^{T}\bar{L}\pi_{0} + 2tr \bar K(\pi -<br />
\pi_{0})^{T}\bar L \pi_{0} </math><br />
<br />
Since we only want to maximize this lower bound over <math>\pi</math>, neither the constant term nor the coefficient <math>2</math> affects the maximizer, so it suffices to maximize<br />
<math>tr \bar{K} \pi_{0}^{T}\bar{L}\pi_{0} + tr \bar K(\pi - \pi_{0})^{T}\bar L \pi_{0}, </math><br />
<br />
which simplifies to<br />
<math>tr \bar{K} \pi^{T}\bar{L}\pi_{0}. </math><br />
<br />
Maximizing this linearization repeatedly yields exactly the update rule stated above.<br />
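The matrix-calculus step behind this bound can be verified numerically. The following sketch compares the derived gradient <math>2\bar L \pi \bar K</math> with central finite differences at a random (non-permutation) point; all matrices are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4
A = rng.normal(size=(m, m)); Kbar = A + A.T          # symmetric, like a centered kernel
B = rng.normal(size=(m, m)); Lbar = B + B.T
pi = rng.normal(size=(m, m))                         # arbitrary point, not a permutation

def f(P):
    return float(np.trace(Kbar @ P.T @ Lbar @ P))

grad_analytic = 2.0 * Lbar @ pi @ Kbar               # the formula derived above
eps = 1e-6
grad_fd = np.zeros((m, m))
for a in range(m):
    for b in range(m):
        E = np.zeros((m, m)); E[a, b] = eps
        grad_fd[a, b] = (f(pi + E) - f(pi - E)) / (2.0 * eps)
```

Because the objective is quadratic in <math>\pi</math>, the central difference agrees with the analytic gradient up to floating-point rounding.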
<br />
In fact, if the kernel matrices <math>K</math> and <math>L</math> are well defined and already computed, there are not too many parameters in this optimization problem, which means a stochastic gradient descent method could be applied. By the same argument about the number of parameters, this problem is arguably easier than the general assignment problem. Hence, interpreting it as a travelling salesman problem and, for example, using a simulated annealing algorithm, is also an acceptable method.<br />
<br />
== Application ==<br />
<br />
Assume that we want to visualize data according to the metric structure inherent in it. More specifically, our objective is to align it <br />
according to a given template, such as a grid, a torus, or any other fixed structure. Such problems occur when presenting images or documents to a user. Most algorithms for low-dimensional object layout suffer from the problem that the low-dimensional presentation is nonuniform; this has the advantage of revealing cluster structure, but given limited screen size a nonuniform presentation is undesirable. To address this problem, we can use kernelized sorting to align objects. In this scenario the kernel matrix <math>L</math> is given by the similarity measure between the objects <math> x_i </math> that are to be aligned. The kernel <math>K</math>, on the other hand, denotes the similarity between the locations where objects are to be aligned to. For the sake of simplicity we used a Gaussian RBF kernel both between the objects to be laid out and between the positions of the grid, i.e. <math>k(x,x') = \exp(-\gamma \|x-x'\|^2)</math>.<br />
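Setting up the two kernel matrices for such a layout can be sketched as follows; the grid size, feature dimension, and <math>\gamma</math> values are arbitrary illustrative choices:

```python
import numpy as np

def rbf_kernel(X, gamma):
    """k(x, x') = exp(-gamma * ||x - x'||^2) for the rows of X."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

# K: similarity between grid positions; L: similarity between object features
grid = np.array([(i, j) for i in range(3) for j in range(3)], dtype=float)
K = rbf_kernel(grid, gamma=0.5)

rng = np.random.default_rng(0)
features = rng.normal(size=(9, 16))     # hypothetical object descriptors
L = rbf_kernel(features, gamma=0.1)
```

Kernelized sorting then matches the objects to grid positions by maximizing the dependence between <math>K</math> and the permuted <math>L</math>.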
<br />
==Summary==<br />
<br />
We generalize sorting by maximizing dependency between matched pairs of observations via HSIC.<br />
<br />
Applications of our proposed sorting algorithm range from data visualization to image, data attribute and multilingual document matching.</div>Amirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=visualizing_Data_using_t-SNE&diff=3794visualizing Data using t-SNE2009-08-02T21:19:32Z<p>Amir: /* Experiments with Different Data Sets */</p>
<hr />
<div>==Introduction==<br />
The paper <ref>Laurens van der Maaten, and Geoffrey Hinton. Visualizing Data using t-SNE. ''Journal of Machine Learning Research'', 9: 2579-2605, 2008</ref> introduced a new nonlinear dimensionality reduction technique that "embeds" high-dimensional data into low-dimensional space. This technique is a variation of Stochastic Neighbor Embedding (SNE) that was proposed by Hinton and Roweis in 2002 <ref>G.E. Hinton and S.T. Roweis. Stochastic Neighbor Embedding. In ''Advances in Neural Information Processing Systems'', vol. 15, pp. 833-840, Cambridge, MA, USA, 2002. The MIT Press.</ref>, where the high-dimensional Euclidean distances between datapoints are converted into conditional probabilities that describe their similarities. t-SNE, based on the same idea, aims to be easier to optimize and to solve the "crowding problem". In addition, the authors showed that t-SNE can be applied to large data sets as well, by using random walks on neighborhood graphs. The performance of t-SNE is demonstrated on a wide variety of data sets and compared with many other visualization techniques.<br />
<br />
==Stochastic Neighbor Embedding==<br />
In SNE, the high-dimensional Euclidean distances between datapoints are first converted into probabilities. The similarity of datapoint <math> \mathbf x_j </math> to datapoint <math> \mathbf x_i </math> is then represented by the conditional probability, <math> \mathbf p_{j|i} </math>, that <math> \mathbf x_i </math> would pick <math> \mathbf x_j </math> as its neighbor when neighbors are picked in proportion to their probability density under a Gaussian centered on <math> \mathbf x_i </math>. The <math> \mathbf p_{j|i} </math> is given as<br />
<br />
<br> <center> <math> \mathbf p_{j|i} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma_i ^2 )}{\sum_{k \neq i} \exp(-||x_i-x_k ||^2/ 2\sigma_i ^2 ) }</math> </center> <br />
<br />
where <math> \mathbf \sigma_i </math> is the variance of the Gaussian that is centered on <math> \mathbf x_i </math> (the summation index <math> \mathbf k </math> runs over all other datapoints), and for every <math> \mathbf x_i </math>, we set <math> \mathbf p_{i|i} = 0 </math>. It can be seen from this definition that the closer the datapoints are, the higher <math> \mathbf p_{j|i} </math> is; for widely separated datapoints, <math> \mathbf p_{j|i} </math> is almost infinitesimal. <br />
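The definition of <math> \mathbf p_{j|i} </math> can be sketched directly. Here the <math> \mathbf \sigma_i </math> are taken as given (in practice they are calibrated via the perplexity, as discussed later in this summary), and the toy points are made up:

```python
import numpy as np

def conditional_probs(X, sigmas):
    """Rows of P hold p_{j|i} as defined above, with p_{i|i} = 0."""
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # ||x_i - x_j||^2
    P = np.exp(-D / (2.0 * sigmas[:, None] ** 2))
    np.fill_diagonal(P, 0.0)                                  # p_{i|i} = 0
    return P / P.sum(axis=1, keepdims=True)

X = np.array([[0.0], [0.1], [5.0]])
P = conditional_probs(X, sigmas=np.ones(3))
```

Each row is a probability distribution over the other points, with nearby points receiving most of the mass.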
<br />
With the same idea, in the low-dimensional space, we model the similarity of map point <math> \mathbf y_j </math> to <math> \mathbf y_i </math> by the conditional probability <math> \mathbf q_{j|i} </math>, which is given by<br />
<br />
<br> <center> <math> q_{j|i} = \frac{\exp(-||y_i-y_j ||^2)}{\sum_{k \neq i} \exp(-||y_i-y_k ||^2) }</math> </center><br />
<br />
where we set the variance of the Gaussian <math> \mathbf \sigma_i </math> to be <math> \frac{1}{\sqrt{2} } </math> (a different value will only result in rescaling of the final map). And again, we set <math> \mathbf q_{i|i} = 0 </math>.<br />
<br />
If the low-dimensional map points correctly represent the high-dimensional datapoints, their conditional probabilities <math> \mathbf q_{j|i} </math> and <math> \mathbf p_{j|i} </math> should be equal. Therefore, the aim of SNE is to minimize the mismatch between <math> \mathbf q_{j|i} </math> and <math> \mathbf p_{j|i} </math>. This is achieved by minimizing the sum of Kullback-Leibler divergences (a non-symmetric measure of the difference between two probability distributions) over all datapoints. The cost function of SNE is then expressed as <br />
<br />
<br> <center> <math> C = \sum_{i} KL(P_i||Q_i) =\sum_{i}\sum_{j \neq i} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}</math> </center><br />
<br />
where <math> \mathbf P_i </math> and <math> \mathbf Q_i </math> are the conditional probability distributions over all other points for given <math> \mathbf x_i </math> and <math> \mathbf y_i </math>. Since the Kullback-Leibler divergence is asymmetric, there is a large cost for using a small <math> \mathbf q_{j|i} </math> to model a large <math> \mathbf p_{j|i} </math>, but only a small cost for using a large <math> \mathbf q_{j|i} </math> to model a small <math> \mathbf p_{j|i} </math>. Therefore, the SNE cost function focuses more on local structure. It enforces both keeping the images of nearby objects nearby and keeping the images of widely separated objects relatively far apart.<br />
<br />
The remaining parameter <math> \mathbf \sigma_i </math> here is selected by performing a binary search for the value of <math> \mathbf \sigma_i </math> that produces a <math> \mathbf P_i </math> with a fixed perplexity (a measure of the effective number of neighbors, defined as two raised to the power of the Shannon entropy of <math>P_i</math>) that is selected by the user.<br />
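That binary search can be sketched as follows. `sigma_for_perplexity` is a hypothetical helper name, the bracketing interval is an arbitrary choice, and perplexity is computed in bits, matching the <math>2^{H(P_i)}</math> definition:

```python
import numpy as np

def row_perplexity(p):
    """Perplexity 2^{H(P_i)}, with the Shannon entropy H measured in bits."""
    p = p[p > 0]
    return float(2.0 ** (-(p * np.log2(p)).sum()))

def sigma_for_perplexity(d_sq, target, iters=100):
    """Bisect on sigma_i so the row p_{.|i} has the requested perplexity.
    d_sq holds squared distances to the other points; perplexity grows with sigma."""
    lo, hi = 1e-6, 1e6
    for _ in range(iters):
        sigma = 0.5 * (lo + hi)
        p = np.exp(-d_sq / (2.0 * sigma ** 2))
        p = p / p.sum()
        if row_perplexity(p) > target:
            hi = sigma          # too diffuse: shrink the bandwidth
        else:
            lo = sigma
    return 0.5 * (lo + hi)

d_sq = np.arange(1.0, 10.0)     # squared distances to 9 neighbors (toy numbers)
sigma = sigma_for_perplexity(d_sq, target=4.0)
```

Since perplexity is monotone in <math>\sigma_i</math> (from 1 for tiny bandwidths up to the number of neighbors for huge ones), bisection converges to the calibrated value.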
<br />
To minimize the cost function, gradient descent method is used. The gradient then is given as<br />
<br />
<br> <center> <math> \frac{\partial C}{\partial y_i} = 2\sum_{j} (y_i-y_j)([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) </math> </center><br />
<br />
which is simple and has a nice physical interpretation. The gradient can be seen as the resultant force induced by a set of springs between the map point <math> \mathbf y_i </math> and all other neighbor points <math> \mathbf y_j </math>, where the force is exerted in the direction <math> \mathbf (y_i-y_j) </math> and the stiffness of the spring is <math> \mathbf ([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) </math>.<br />
<br />
==t-Distributed Stochastic Neighbor Embedding==<br />
Although SNE showed relatively good visualizations, it has two main problems: difficulty in optimization and the "crowding problem". t-Distributed Stochastic Neighbor Embedding (t-SNE), which is a variation of SNE, is aimed to alleviate these problems. The cost function of t-SNE differs from the one of SNE in two ways: (1) it uses a symmetric version of the SNE cost function, and (2) it uses a Student-t distribution instead of Gaussian to compute the conditional probability in the low-dimensional space. <br />
<br />
=== Symmetric SNE ===<br />
In symmetric SNE, instead of the sum of the Kullback-Leibler divergences between the conditional probabilities, the cost function is a single Kullback-Leibler divergence between two joint probability distributions, <math> \mathbf P </math> in the high-dimensional space and <math> \mathbf Q </math> in the low-dimensional space.<br />
<br />
In this case, the pairwise similarities of the data points in high-dimensional space is given by,<br />
<br />
<center> <math> \mathbf p_{ij} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma^2 )}{\sum_{k \neq l} \exp(-||x_k-x_l ||^2/ 2\sigma^2 ) }</math> </center><br />
<br />
and <math> \mathbf q_{ij} </math> in low-dimensional space is<br />
<br />
<center> <math> \mathbf q_{ij} = \frac{\exp(-||y_i-y_j ||^2 )}{\sum_{k \neq l} \exp(-||y_k-y_l ||^2) }</math> </center><br />
<br />
where <math> \mathbf p_{ii} </math> and <math> \mathbf q_{ii} </math> are still zero. When a high-dimensional datapoint <math> \mathbf x_i </math> is an outlier (far from all the other points), we set <math> \mathbf{p_{ij}=\frac {(p_{j|i}+p_{i|j})}{2n}} </math> to ensure that <math>\sum_{j} p_{ij} > \frac {1}{2n} </math> for all <math> \mathbf x_i </math>. This makes sure that every <math> \mathbf x_i </math> makes a significant contribution to the cost function, which is given as<br />
<br />
<center> <math> C = KL(P||Q) =\sum_{i}\sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}</math> </center><br />
<br />
As we can see, by definition, we have <math> \mathbf p_{ij} = p_{ji} </math> and <math> \mathbf q_{ij} = q_{ji} </math>. This is why it is called symmetric SNE.<br />
<br />
From the cost function, we have the gradient as simple as<br />
<br />
<center> <math> \frac{\partial C}{\partial y_i} = 4\sum_{j} (y_i-y_j)(p_{ij}-q_{ij}) </math> </center><br />
<br />
which is the main advantage of symmetric SNE.<br />
<br />
=== The Crowding Problem ===<br />
The "crowding problem" that is addressed in the paper is defined as: "the area of the two-dimensional map that is available to accommodate moderately distant datapoints will not be nearly large enough compared with the area available to accommodate nearby datapoints". This happens when the datapoints are distributed in a region on a high-dimensional manifold around <math> i </math>, and we try to model the pairwise distances from <math> i </math> to the datapoints in a two-dimensional map. For example, it is possible to have 11 datapoints that are mutually equidistant in a ten-dimensional manifold but it is not possible to model this faithfully in a two-dimensional map. Therefore, if the small distances can be modeled accurately in a map, most of the moderately distant datapoints will be too far away in the two-dimensional map. In SNE, this will result in very small attractive forces from datapoint <math> i </math> to these too-distant map points. The very large number of such forces collapses together the points in the center of the map and prevents gaps from forming between the natural clusters. This phenomenon, the crowding problem, is not specific to SNE and can be observed in other local techniques such as Sammon mapping as well.<br /><br />
According to Cook et al. (2007), adding a slight repulsion can address this problem. Using a uniform background model with a small mixing proportion, <math>\,\rho</math>, prevents <math>\,q_{ij}</math> from falling below <math>\frac{2\rho}{n(n-1)}</math>. In this technique, called UNI-SNE, <math>\,q_{ij}</math> will be larger than <math>\,p_{ij}</math> even for far-apart datapoints.<br />
<br />
=== Compensation for Mismatched Dimensionality by Mismatched Tails ===<br />
Since the crowding problem is caused by the unwanted attractive forces between map points that represent moderately dissimilar datapoints, one solution is to model these datapoints by a much larger distance in the map to eliminate the attractive forces. This can be achieved by using a probability distribution that has much heavier tails than a Gaussian to convert the distances into probabilities in the low-dimensional space. The Student t-distribution is selected because it is closely related to the Gaussian distribution, but it is much faster to evaluate computationally since it does not involve any exponential. In addition, as a heavier-tailed distribution it allows a moderate distance in the high-dimensional space to be modeled by a larger distance in the map, which eliminates the unwanted attractive forces between dissimilar datapoints.<br />
<br />
In t-SNE, Student t-distribution with one degree of freedom is employed in the low-dimensional map. Based on the symmetric SNE, the joint probabilities in high-dimensional <math> \mathbf p_{ij} </math> are still<br />
<br />
<center> <math> \mathbf{p_{ij}=\frac{(p_{j|i}+p_{i|j})}{2n}} </math> </center><br />
<br />
while the joint probabilities <math> \mathbf q_{ij} </math> are defined as <br />
<br />
<center> <math> \mathbf q_{ij} = \frac{(1 + ||y_i-y_j ||^2 )^{-1}}{\sum_{k \neq l} (1 + ||y_k-y_l ||^2 )^{-1}}</math> </center><br />
<br />
The gradient of the Kullback-Leibler divergence between <math> P </math> and the Student-t based joint probability distribution <math> Q </math> is then given by<br />
<br />
<center> <math> \frac{\partial C}{\partial y_i} = 4\sum_{j} (y_i-y_j)(p_{ij}-q_{ij})(1 + ||y_i-y_j ||^2 )^{-1} </math> </center><br />
<br />
Compared with the gradients of SNE and UNI-SNE <ref> J.A. Cook, and I. Sutskever et al.. Visualizing similarity data with a mixture of maps. ''In Proceeding of the 11<sup>th</sup> International Conference on Artificial Intelligence and Statistics'', volume 2, page, 67-74, 2007.</ref>, the t-SNE gradients introduces strong repulsions between the dissimilar datapoints that are modeled by small pairwise distance in the low-dimensional map. This well prevents the crowding problem that was mentioned above. At the same time, these repulsions do not go to infinity, which prevents the dissimilar datapoints from being too far apart. Therefore, the t-SNE models dissimilar datepoints by means of large pairwise distance, while models similar datapoints by means of small pairwise distance. This results in the good representation of both local and global structure of the high-dimensional data.<br />
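The Student-t similarities and the t-SNE gradient translate directly into code; the sketch below also includes the KL cost so the gradient formula can be checked by finite differences (toy data and seed are arbitrary):

```python
import numpy as np

def q_and_w(Y):
    """Student-t similarities q_ij and unnormalized weights w_ij for map points Y."""
    D = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    W = 1.0 / (1.0 + D)                  # (1 + ||y_i - y_j||^2)^{-1}
    np.fill_diagonal(W, 0.0)             # q_ii = 0
    return W / W.sum(), W

def tsne_cost(P, Y):
    """KL(P || Q) over the off-diagonal entries."""
    Q, _ = q_and_w(Y)
    mask = P > 0
    return float((P[mask] * np.log(P[mask] / Q[mask])).sum())

def tsne_grad(P, Y):
    """dC/dy_i = 4 sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^{-1}."""
    Q, W = q_and_w(Y)
    PQW = (P - Q) * W
    return 4.0 * np.array([(PQW[i][:, None] * (Y[i] - Y)).sum(axis=0)
                           for i in range(Y.shape[0])])

rng = np.random.default_rng(0)
m = 5
S = rng.uniform(size=(m, m)); S = S + S.T
np.fill_diagonal(S, 0.0)
P = S / S.sum()                          # symmetric joint distribution, zero diagonal
Y = rng.normal(size=(m, 2))
G = tsne_grad(P, Y)
```

The analytic gradient matches a numerical derivative of the cost, confirming the displayed formula.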
<br />
=== Optimization Methods for t-SNE ===<br />
One way to optimize the t-SNE cost function is to use a momentum term to reduce the number of required iterations. To further improve the modeling results, two tricks called "early compression" and "early exaggeration" can be used. "Early compression" forces the map points to stay close together at the early stage of the optimization so that it is easy to explore the space of possible global organizations of the data. "Early exaggeration" multiplies all the <math> \mathbf p_{ij} </math>'s by a factor <math> n>1 </math> in the initial stages of the optimization. This makes all the <math> \mathbf q_{ij} </math>'s too small to model their corresponding <math> \mathbf p_{ij} </math>'s, so that the optimization is forced to focus on large <math> \mathbf p_{ij} </math>'s. This leads to the formation of tight widely separated clusters in the map, which makes it very easy to move the clusters around for a good global organization.<br />
<br />
==Experiments with Different Data Sets==<br />
The author performed t-SNE on five data sets and compared the results with seven other non-parametric dimensional reduction techniques to evaluate t-SNE. The five data sets that were employed are: (1) the MNIST data set, (2) the Olivetti faces data set, (3) the COIL-20 data set, (4) the word-feature data set, and (5) the Netflix data set. <br />
<br />
When t-SNE was performed on the MNIST data set, it constructed a map with clear and clean separations between the different digit classes. At the same time, most of the local structure of the data was captured as well. On the other hand, Isomap and LLE provide very little insight into the class structure of the data, while the Sammon map models the classes fairly well but does not separate them clearly. <br />
<br />
We show the results of our experiments with t-SNE, Sammon mapping, Isomap, and LLE on the MNIST data set in figures 2 and 3. <br />
<br />
The results are a good indicator of the strong performance of t-SNE compared to the other techniques. In particular, Sammon mapping constructs a “ball” in which only three classes (representing the digits 0, 1, and 7) are somewhat separated from the other classes. Isomap and<br />
LLE produce solutions in which there are large overlaps between the digit classes. In contrast, t-SNE constructs a map in which the separation between the digit classes is almost perfect. Moreover, detailed inspection of the t-SNE map reveals that much of the local structure of the data (such as the orientation of the ones) is captured as well. The map produced by t-SNE contains some points that are clustered with the wrong class, but most of these points correspond to distorted digits many of which are difficult to identify.<br />
<center>[[File:T-SNE-Fig2.JPG]]</center><br />
<br />
<center>[[File:T-SNE-Fig3.JPG]]</center><br />
<br />
Figure 4 shows the results of applying t-SNE, Sammon mapping, Isomap, and LLE to the Olivetti faces data set. Similar to what we saw before, Isomap and LLE produce solutions that provide little insight into the class structure of the data. Sammon mapping models many of the members of each class fairly close together, and although none of the classes are clearly separated in the Sammon map, it produces a much better map than the other two methods. In contrast, t-SNE does a much better job of revealing the natural classes in the data. Some individuals have their ten images split into two clusters, usually because a subset of the images have the head facing in a significantly different direction, or because they have a very different expression or glasses. For these individuals, it is not clear that their ten images form a natural class when using Euclidean distance in pixel space.<br />
<br />
<center>[[File:T-SNE-Fig4.JPG]]</center><br />
<br />
Figure 5 shows the results of applying t-SNE, Sammon mapping, Isomap, and LLE to the COIL-20 data set. An interesting observation in this section is that for many of the 20 objects, t-SNE accurately represents the one-dimensional manifold of viewpoints as a closed loop. For objects which look similar from the front and the back, t-SNE distorts the loop so that the images of front and back are mapped to nearby points. For the four types of toy car in the COIL-20 data set (the four aligned “sausages” in the bottom-left of the t-SNE map), the four rotation manifolds are aligned by the orientation of the cars to capture the high similarity between different cars at the same orientation. This prevents t-SNE from keeping the four manifolds clearly separate. <br />
<br />
An interesting point shown in figure 5 is that the other three techniques are not nearly as good at cleanly separating the manifolds that correspond to very different objects. On top of that Isomap and LLE only visualize a small number of classes from the COIL-20 data set, because the data set comprises a large number of widely separated sub-manifolds that give rise to small connected components in the neighborhood graph.<br />
<br />
<center>[[File:T-SNE-Fig5.JPG]]</center><br />
<br />
==t-SNE for Large Data Sets==<br />
Due to its computational and memory complexity, it is infeasible to apply the standard version of t-SNE to large data sets (which contain more than 10,000 data points). To solve this problem, t-SNE is modified to display a random subset of landmark points in a way that uses the information of the whole data set. First, a neighborhood graph over all the data points is created for a selected number of neighbors. Then, for each selected landmark point, a random walk is defined, which starts from that landmark point and terminates as soon as it lands on another landmark point. <math> \mathbf p_{j|i} </math> denotes the fraction of random walks that start at landmark point <math> x_i </math> and terminate at landmark point <math> x_j </math>. To avoid the "short-circuits" caused by noisy datapoints, the random walk-based affinity measure integrates over all paths through the neighborhood graph. The random walk-based similarities <math> \mathbf p_{j|i} </math> can be computed by explicitly performing the random walks on the neighborhood graph, or by using an analytical solution <ref> L. Grady, 2006, Random walks for image segmentation. ''IEEE Transactions on Pattern Analysis and Machine Intelligence'', 28(11): 1768-1783, 2006. </ref>, which is more appropriate for very large data sets.<br />
<br />
==Weaknesses of t-SNE==<br />
Although t-SNE has been demonstrated to be a favorable technique for data visualization, it has three potential weaknesses. (1) The paper only focuses on data visualization using t-SNE, that is, embedding high-dimensional data into a two- or three-dimensional space. The behavior of t-SNE presented in the paper cannot readily be extrapolated to <math>d>3</math> dimensions due to the heavy tails of the Student t-distribution. (2) t-SNE might be less successful when applied to data sets with a high intrinsic dimensionality. This is a result of the local linearity assumption on the manifold that t-SNE makes by employing Euclidean distance to measure the similarity between datapoints. (3) Another major weakness of t-SNE is that its cost function is not convex. As a consequence, several optimization parameters need to be chosen, and the constructed solutions may differ each time t-SNE is run from an initial random configuration of the map points.<br />
<br />
==References==<br />
<references/></div>Amirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=visualizing_Data_using_t-SNE&diff=3793visualizing Data using t-SNE2009-08-02T21:03:16Z<p>Amir: /* Experiments with Different Data Sets */</p>
<hr />
<div>==Introduction==<br />
The paper <ref>Laurens van der Maaten and Geoffrey Hinton. Visualizing Data using t-SNE. ''Journal of Machine Learning Research'', 9: 2579-2605, 2008.</ref> introduced a new nonlinear dimensionality reduction technique that "embeds" high-dimensional data into a low-dimensional space. The technique is a variation of Stochastic Neighbor Embedding (SNE), proposed by Hinton and Roweis in 2002 <ref>G.E. Hinton and S.T. Roweis. Stochastic Neighbor Embedding. In ''Advances in Neural Information Processing Systems'', vol. 15, pp. 833-840, Cambridge, MA, USA, 2002. The MIT Press.</ref>, in which the high-dimensional Euclidean distances between datapoints are converted into conditional probabilities that describe their similarities. t-SNE, based on the same idea, aims to be easier to optimize and to solve the "crowding problem". In addition, the authors showed that t-SNE can be applied to large data sets as well, by using random walks on neighborhood graphs. The performance of t-SNE is demonstrated on a wide variety of data sets and compared with that of many other visualization techniques.<br />
<br />
==Stochastic Neighbor Embedding==<br />
In SNE, the high-dimensional Euclidean distances between datapoints are first converted into probabilities. The similarity of datapoint <math> \mathbf x_j </math> to datapoint <math> \mathbf x_i </math> is represented by the conditional probability, <math> \mathbf p_{j|i} </math>, that <math> \mathbf x_i </math> would pick <math> \mathbf x_j </math> as its neighbor when neighbors are picked in proportion to their probability density under a Gaussian centered on <math> \mathbf x_i </math>. The probability <math> \mathbf p_{j|i} </math> is given as<br />
<br />
<br> <center> <math> \mathbf p_{j|i} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma_i ^2 )}{\sum_{k \neq i} \exp(-||x_i-x_k ||^2/ 2\sigma_i ^2 ) }</math> </center> <br />
<br />
where the sum in the denominator runs over all points <math> k \neq i </math>, <math> \mathbf \sigma_i </math> is the variance of the Gaussian that is centered on <math> \mathbf x_i </math>, and for every <math> \mathbf x_i </math> we set <math> \mathbf p_{i|i} = 0 </math>. It can be seen from this definition that the closer two datapoints are, the higher <math> \mathbf p_{j|i} </math> is; for widely separated datapoints, <math> \mathbf p_{j|i} </math> is almost infinitesimal. <br />
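As a concrete illustration, the computation above can be sketched in a few lines of NumPy (the function name is mine, and the bandwidths <math> \mathbf \sigma_i </math> are assumed to be already known; their selection is discussed below):<br />

```python
import numpy as np

def conditional_probs(X, sigmas):
    """Compute p_{j|i} from Euclidean distances, one Gaussian bandwidth per point.

    X: (n, d) data matrix; sigmas: (n,) per-point bandwidths (assumed given).
    Returns an (n, n) matrix P with P[i, j] = p_{j|i} and a zero diagonal.
    """
    # squared Euclidean distances ||x_i - x_j||^2
    sq = np.sum(X**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    P = np.exp(-D / (2.0 * sigmas[:, None] ** 2))
    np.fill_diagonal(P, 0.0)           # p_{i|i} = 0 by definition
    P /= P.sum(axis=1, keepdims=True)  # normalize each row over k != i
    return P
```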
<br />
With the same idea, in the low-dimensional space, we model the similarity of map point <math> \mathbf y_j </math> to <math> \mathbf y_i </math> by the conditional probability <math> \mathbf q_{j|i} </math>, which is given by<br />
<br />
<br> <center> <math> q_{j|i} = \frac{\exp(-||y_i-y_j ||^2)}{\sum_{k \neq i} \exp(-||y_i-y_k ||^2) }</math> </center><br />
<br />
where we set the variance of the Gaussian <math> \mathbf \sigma_i </math> to be <math> \frac{1}{\sqrt{2} } </math> (a different value will only result in rescaling of the final map). And again, we set <math> \mathbf q_{i|i} = 0 </math>.<br />
<br />
If the low-dimensional map points correctly model the high-dimensional datapoints, the conditional probabilities <math> \mathbf q_{j|i} </math> and <math> \mathbf p_{j|i} </math> should be equal. Therefore, the aim of SNE is to minimize the mismatch between <math> \mathbf q_{j|i} </math> and <math> \mathbf p_{j|i} </math>. This is achieved by minimizing the sum of Kullback-Leibler divergences (a non-symmetric measure of the difference between two probability distributions) over all datapoints. The cost function of SNE is then expressed as <br />
<br />
<br> <center> <math> C = \sum_{i} KL(P_i||Q_i) =\sum_{i}\sum_{j \neq i} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}</math> </center><br />
<br />
where <math> \mathbf P_i </math> and <math> \mathbf Q_i </math> are the conditional probability distributions over all other points given <math> \mathbf x_i </math> and <math> \mathbf y_i </math>, respectively. Since the Kullback-Leibler divergence is asymmetric, there is a large cost for using a small <math> \mathbf q_{j|i} </math> to model a large <math> \mathbf p_{j|i} </math>, but only a small cost for using a large <math> \mathbf q_{j|i} </math> to model a small <math> \mathbf p_{j|i} </math>. Therefore, the SNE cost function focuses more on local structure: it both keeps the images of nearby objects nearby and keeps the images of widely separated objects relatively far apart.<br />
<br />
The remaining parameter <math> \mathbf \sigma_i </math> is selected by performing a binary search for the value of <math> \mathbf \sigma_i </math> that produces a <math> \mathbf P_i </math> with a fixed perplexity, a smooth measure of the effective number of neighbors defined as two to the power of the Shannon entropy of <math>P_i</math>, that is selected by the user.<br />
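This binary search can be sketched as follows (a simplified illustration; the variable names, bracketing interval, and iteration limit are my own choices, not the authors' code):<br />

```python
import numpy as np

def sigma_for_perplexity(dists_i, target_perplexity, tol=1e-5, max_iter=60):
    """Binary-search the bandwidth sigma_i so that the row P_i reaches a
    user-chosen perplexity, Perp(P_i) = 2^{H(P_i)}.

    dists_i: squared distances from x_i to every other point (x_i excluded).
    """
    lo, hi = 1e-10, 1e10
    for _ in range(max_iter):
        sigma = 0.5 * (lo + hi)
        p = np.exp(-dists_i / (2.0 * sigma**2))
        p /= p.sum()
        entropy = -np.sum(p * np.log2(p + 1e-12))  # Shannon entropy in bits
        perplexity = 2.0 ** entropy
        if abs(perplexity - target_perplexity) < tol:
            break
        if perplexity > target_perplexity:
            hi = sigma  # too many effective neighbors: shrink sigma
        else:
            lo = sigma  # too few effective neighbors: grow sigma
    return sigma
```

Because perplexity increases monotonically with <math> \mathbf \sigma_i </math>, bisection converges quickly.<br />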
<br />
To minimize the cost function, a gradient descent method is used. The gradient is given as<br />
<br />
<br> <center> <math> \frac{\partial C}{\partial y_i} = 2\sum_{j} (y_i-y_j)([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) </math> </center><br />
<br />
which is simple and has a nice physical interpretation: the gradient can be seen as the resultant force induced by a set of springs between the map point <math> \mathbf y_i </math> and all other map points <math> \mathbf y_j </math>, where each force is exerted in the direction <math> \mathbf (y_i-y_j) </math> and the stiffness of the spring is <math> \mathbf ([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) </math>.<br />
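The spring interpretation reads directly off the formula; the sketch below (my own illustrative implementation) evaluates this gradient given the two conditional probability matrices:<br />

```python
import numpy as np

def sne_gradient(Y, P, Q):
    """Gradient of the SNE cost w.r.t. the map points Y (the 'spring forces').

    Y: (n, m) low-dimensional map points; P, Q: (n, n) matrices with
    P[i, j] = p_{j|i} and Q[i, j] = q_{j|i} (zero diagonals).
    """
    n = Y.shape[0]
    grad = np.zeros_like(Y)
    for i in range(n):
        # stiffness of the spring between y_i and each y_j
        stiffness = (P[i] - Q[i]) + (P[:, i] - Q[:, i])
        # force along the direction (y_i - y_j), summed over all j
        grad[i] = 2.0 * np.sum(stiffness[:, None] * (Y[i] - Y), axis=0)
    return grad
```

When the map is perfect (<math> Q = P </math>), every stiffness is zero and the gradient vanishes, as expected.<br />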
<br />
==t-Distributed Stochastic Neighbor Embedding==<br />
Although SNE produces relatively good visualizations, it has two main problems: difficulty of optimization and the "crowding problem". t-Distributed Stochastic Neighbor Embedding (t-SNE), a variation of SNE, aims to alleviate these problems. The cost function of t-SNE differs from that of SNE in two ways: (1) it uses a symmetric version of the SNE cost function, and (2) it uses a Student t-distribution instead of a Gaussian to compute the probabilities in the low-dimensional space. <br />
<br />
=== Symmetric SNE ===<br />
In symmetric SNE, instead of the sum of the Kullback-Leibler divergences between the conditional probabilities, the cost function is a single Kullback-Leibler divergence between two joint probability distributions, <math> \mathbf P </math> in the high-dimensional space and <math> \mathbf Q </math> in the low-dimensional space.<br />
<br />
In this case, the pairwise similarities of the datapoints in the high-dimensional space are given by<br />
<br />
<center> <math> \mathbf p_{ij} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma^2 )}{\sum_{k \neq l} \exp(-||x_k-x_l ||^2/ 2\sigma^2 ) }</math> </center><br />
<br />
and <math> \mathbf q_{ij} </math> in low-dimensional space is<br />
<br />
<center> <math> \mathbf q_{ij} = \frac{\exp(-||y_i-y_j ||^2 )}{\sum_{k \neq l} \exp(-||y_k-y_l ||^2) }</math> </center><br />
<br />
where <math> \mathbf p_{ii} </math> and <math> \mathbf q_{ii} </math> are still zero. When a high-dimensional datapoint <math> \mathbf x_i </math> is an outlier (far from all the other points), all of its pairwise similarities <math> \mathbf p_{ij} </math> would be extremely small, so the position of its map point would barely affect the cost. To avoid this, we instead set <math> \mathbf{p_{ij}=\frac {(p_{j|i}+p_{i|j})}{2n}} </math>, which ensures that <math>\sum_{j} p_{ij} > \frac {1}{2n} </math> for all <math> \mathbf x_i </math>. This makes sure that every <math> \mathbf x_i </math> makes a significant contribution to the cost function, which is given as<br />
<br />
<center> <math> C = KL(P||Q) =\sum_{i}\sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}</math> </center><br />
<br />
As we can see, by definition, we have <math> \mathbf p_{ij} = p_{ji} </math> and <math> \mathbf q_{ij} = q_{ji} </math>. This is why it is called symmetric SNE.<br />
<br />
From the cost function, the gradient takes the simple form<br />
<br />
<center> <math> \frac{\partial C}{\partial y_i} = 4\sum_{j} (y_i-y_j)(p_{ij}-q_{ij}) </math> </center><br />
<br />
which is the main advantage of symmetric SNE.<br />
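A compact sketch of the symmetric formulation (illustrative code; the small <math>\epsilon</math> in the cost is my own numerical safeguard, not part of the method):<br />

```python
import numpy as np

def joint_from_conditionals(P_cond):
    """Symmetrize conditional probabilities: p_ij = (p_{j|i} + p_{i|j}) / (2n)."""
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)

def symmetric_sne_cost_and_grad(Y, P):
    """KL(P || Q) and its gradient for symmetric SNE with Gaussian q_ij."""
    sq = np.sum(Y**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T
    Q = np.exp(-D)
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum()                      # normalize over all pairs k != l
    eps = 1e-12
    cost = np.sum(P * np.log((P + eps) / (Q + eps)))
    # dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)
    grad = 4.0 * ((P - Q)[:, :, None] * (Y[:, None, :] - Y[None, :, :])).sum(axis=1)
    return cost, grad
```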
<br />
=== The Crowding Problem ===<br />
The "crowding problem" addressed in the paper is defined as follows: "the area of the two-dimensional map that is available to accommodate moderately distant datapoints will not be nearly large enough compared with the area available to accommodate nearby datapoints". This happens when datapoints are distributed in a region on a high-dimensional manifold around <math> i </math> and we try to model the pairwise distances from <math> i </math> to these datapoints in a two-dimensional map. For example, it is possible to have 11 datapoints that are mutually equidistant in a ten-dimensional manifold, but it is not possible to model this faithfully in a two-dimensional map. Therefore, if the small distances are modeled accurately in the map, most of the moderately distant datapoints will be placed too far away in the two-dimensional map. In SNE, this results in a very small attractive force between datapoint <math> i </math> and each of these too-distant map points. The very large number of such forces collapses the points in the center of the map and prevents gaps from forming between the natural clusters. This phenomenon, the crowding problem, is not specific to SNE and can be observed in other local techniques such as Sammon mapping as well.<br /><br />
According to Cook et al. (2007), adding a slight repulsion can address this problem. Using a uniform background model with a small mixing proportion, <math>\,\rho</math>, ensures that <math>\,q_{ij}</math> can never fall below <math>\frac{2\rho}{n(n-1)}</math>. In this technique, called UNI-SNE, <math>\,q_{ij}</math> will exceed <math>\,p_{ij}</math> even for far-apart datapoints, producing the desired slight repulsion.<br />
<br />
=== Compensation for Mismatched Dimensionality by Mismatched Tails ===<br />
Since the crowding problem is caused by the unwanted attractive forces between map points that represent moderately dissimilar datapoints, one solution is to model these datapoints by a much larger distance in the map, which eliminates the attractive forces. This can be achieved by using a probability distribution with much heavier tails than a Gaussian to convert distances into probabilities in the low-dimensional space. The Student t-distribution is selected because it is closely related to the Gaussian distribution, yet it is much faster to evaluate computationally since it does not involve an exponential. In addition, as a heavy-tailed distribution, it allows a moderate distance in the high-dimensional space to be modeled by a larger distance in the map, which eliminates the unwanted attractive forces between dissimilar datapoints.<br />
<br />
In t-SNE, a Student t-distribution with one degree of freedom is employed in the low-dimensional map. As in symmetric SNE, the joint probabilities <math> \mathbf p_{ij} </math> in the high-dimensional space are still<br />
<br />
<center> <math> \mathbf{p_{ij}=\frac{(p_{j|i}+p_{i|j})}{2n}} </math> </center><br />
<br />
while the joint probabilities <math> \mathbf q_{ij} </math> are defined as <br />
<br />
<center> <math> \mathbf q_{ij} = \frac{(1 + ||y_i-y_j ||^2 )^{-1}}{\sum_{k \neq l} (1 + ||y_k-y_l ||^2 )^{-1}}</math> </center><br />
<br />
The gradient of the Kullback-Leibler divergence between <math> P </math> and the Student-t based joint probability distribution <math> Q </math> is then given by<br />
<br />
<center> <math> \frac{\partial C}{\partial y_i} = 4\sum_{j} (y_i-y_j)(p_{ij}-q_{ij})(1 + ||y_i-y_j ||^2 )^{-1} </math> </center><br />
<br />
Compared with the gradients of SNE and UNI-SNE <ref> J. Cook, I. Sutskever, A. Mnih, and G. Hinton. Visualizing similarity data with a mixture of maps. In ''Proceedings of the 11<sup>th</sup> International Conference on Artificial Intelligence and Statistics'', volume 2, pages 67-74, 2007.</ref>, the t-SNE gradient introduces strong repulsions between dissimilar datapoints that are modeled by small pairwise distances in the low-dimensional map. This effectively prevents the crowding problem mentioned above. At the same time, these repulsions do not go to infinity, which prevents dissimilar datapoints from being pushed too far apart. Therefore, t-SNE models dissimilar datapoints by means of large pairwise distances and similar datapoints by means of small pairwise distances. This results in a good representation of both the local and the global structure of the high-dimensional data.<br />
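The heavy-tailed <math> \mathbf q_{ij} </math> and this gradient can be written in a few lines of NumPy (a sketch under my own naming; the vectorized form of the sum is a standard rearrangement of the formula above):<br />

```python
import numpy as np

def tsne_q_and_grad(Y, P):
    """Student-t (one degree of freedom) joint probabilities q_ij and the
    t-SNE gradient, given map points Y and joint probabilities P."""
    sq = np.sum(Y**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T
    inv = 1.0 / (1.0 + D)              # (1 + ||y_i - y_j||^2)^{-1}
    np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()                # normalize over all pairs k != l
    # dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^{-1}
    W = (P - Q) * inv
    grad = 4.0 * ((np.diag(W.sum(axis=1)) - W) @ Y)
    return Q, grad
```

Note that, unlike the Gaussian, evaluating <math> (1 + ||y_i-y_j||^2)^{-1} </math> needs no exponential.<br />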
<br />
=== Optimization Methods for t-SNE ===<br />
One way to optimize the t-SNE cost function is to use a momentum term to reduce the number of required iterations. To further improve the results, two tricks called "early compression" and "early exaggeration" can be used. "Early compression" forces the map points to stay close together during the early stages of the optimization, so that it is easy to explore the space of possible global organizations of the data. "Early exaggeration" multiplies all the <math> \mathbf p_{ij} </math>'s by a factor greater than 1 (the paper uses 4) in the initial stages of the optimization. This makes all the <math> \mathbf q_{ij} </math>'s too small to model their corresponding <math> \mathbf p_{ij} </math>'s, so the optimization is forced to focus on the large <math> \mathbf p_{ij} </math>'s. This leads to the formation of tight, widely separated clusters in the map, which makes it very easy to move the clusters around to reach a good global organization.<br />
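Putting the pieces together, the optimization loop with momentum and early exaggeration might look like this (a self-contained sketch; the learning rate, exaggeration factor, and momentum schedule are common defaults treated here as tunable assumptions):<br />

```python
import numpy as np

def tsne_optimize(P, n_points, dims=2, n_iter=300, eta=100.0,
                  exaggeration=4.0, exaggeration_iters=50, seed=0):
    """Gradient descent on the t-SNE cost with momentum and early exaggeration.

    P: (n, n) symmetric joint probabilities (zero diagonal, sums to 1).
    """
    rng = np.random.default_rng(seed)
    Y = 1e-4 * rng.normal(size=(n_points, dims))  # small random initial map
    velocity = np.zeros_like(Y)
    for it in range(n_iter):
        # early exaggeration: inflate all p_ij during the first iterations
        P_eff = P * exaggeration if it < exaggeration_iters else P
        sq = np.sum(Y**2, axis=1)
        inv = 1.0 / (1.0 + sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T)
        np.fill_diagonal(inv, 0.0)
        Q = inv / inv.sum()
        W = (P_eff - Q) * inv
        grad = 4.0 * ((np.diag(W.sum(axis=1)) - W) @ Y)
        momentum = 0.5 if it < 250 else 0.8       # schedule used in the paper
        velocity = momentum * velocity - eta * grad
        Y += velocity
    return Y
```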
<br />
==Experiments with Different Data Sets==<br />
The authors performed t-SNE on five data sets and compared the results with those of seven other non-parametric dimensionality reduction techniques. The five data sets employed are: (1) the MNIST data set, (2) the Olivetti faces data set, (3) the COIL-20 data set, (4) the word-feature data set, and (5) the Netflix data set. <br />
<br />
When t-SNE was performed on the MNIST data set, it constructed a map with clear and clean separations between the different digit classes. At the same time, most of the local structure of the data was captured as well. On the other hand, Isomap and LLE provide very little insight into the class structure of the data, while the Sammon map models the classes fairly well but does not separate them clearly. <br />
<br />
<center>[[File:T-SNE-Fig2.JPG]]</center><br />
<br />
<center>[[File:T-SNE-Fig3.JPG]]</center><br />
<br />
<center>[[File:T-SNE-Fig4.JPG]]</center><br />
<br />
<center>[[File:T-SNE-Fig5.JPG]]</center><br />
<br />
==t-SNE for Large Data Sets==<br />
Due to its computational and memory complexity, it is infeasible to apply the standard version of t-SNE to large data sets (containing more than roughly 10,000 datapoints). To solve this problem, t-SNE is modified to display a random subset of landmark points in a way that uses information from the whole data set. First, a neighborhood graph over all the datapoints is created using a selected number of neighbors. Then, for each landmark point, a random walk is defined that starts from that landmark point and terminates as soon as it lands on another landmark point; <math> \mathbf p_{j|i} </math> denotes the fraction of random walks starting at landmark point <math> x_i </math> that terminate at landmark point <math> x_j </math>. Because the random walk-based affinity measure integrates over all paths through the neighborhood graph, it avoids the "short-circuits" that a single noisy datapoint can cause. The random walk-based similarities <math> \mathbf p_{j|i} </math> can be computed either by explicitly performing random walks on the neighborhood graph or by using an analytical solution <ref> L. Grady. Random walks for image segmentation. ''IEEE Transactions on Pattern Analysis and Machine Intelligence'', 28(11): 1768-1783, 2006. </ref>, the latter being more appropriate for very large data sets.<br />
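One way to realize such an analytical solution is to treat the landmarks as absorbing states of a Markov chain on the neighborhood graph. The sketch below is my own formulation and naming (with rows renormalized so that <math> p_{i|i}=0 </math>), not the paper's code:<br />

```python
import numpy as np

def random_walk_similarities(A, landmarks):
    """Closed-form random-walk affinities between landmark points.

    A: (n, n) symmetric nonnegative affinity matrix of the neighborhood graph.
    landmarks: indices of the landmark points (treated as absorbing states).
    Returns P with P[a, b] = p_{b|a}, the probability that a walk started at
    landmark a is absorbed at landmark b, renormalized so that P[a, a] = 0.
    """
    n = A.shape[0]
    T = A / A.sum(axis=1, keepdims=True)   # one-step transition matrix
    landmarks = np.asarray(landmarks)
    others = np.setdiff1d(np.arange(n), landmarks)
    Qt = T[np.ix_(others, others)]         # transient -> transient
    R = T[np.ix_(others, landmarks)]       # transient -> landmark
    # absorption probabilities for walks currently at a transient node:
    # B = (I - Qt)^{-1} R  (absorbing-chain fundamental matrix)
    B = np.linalg.solve(np.eye(len(others)) - Qt, R)
    # one step out of landmark a: either straight to landmark b, or into a
    # transient node and eventually absorbed at b
    P = T[np.ix_(landmarks, landmarks)] + T[np.ix_(landmarks, others)] @ B
    np.fill_diagonal(P, 0.0)               # discard walks returning home
    P /= P.sum(axis=1, keepdims=True)
    return P
```

Solving one linear system replaces the simulation of many explicit walks, which is why the analytical route scales to very large data sets.<br />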
<br />
==Weaknesses of t-SNE==<br />
Although t-SNE has been demonstrated to be a favorable technique for data visualization, it has three potential weaknesses. (1) The paper focuses only on data visualization, that is, on embedding high-dimensional data into a two- or three-dimensional space; the behavior of t-SNE presented in the paper cannot readily be extrapolated to <math>d>3</math> dimensions because of the heavy tails of the Student t-distribution. (2) t-SNE might be less successful when applied to data sets with a high intrinsic dimensionality. This is a result of the local linearity assumption on the manifold that t-SNE makes by employing Euclidean distances to represent the similarities between datapoints. (3) Another major weakness is that the cost function is not convex. As a result, several optimization parameters need to be chosen, and the constructed solutions, which depend on these parameters, may be different each time t-SNE is run from an initial random configuration of the map points.<br />
<br />
==References==<br />
<references/></div>
<hr />
<div></div>Amirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:T-SNE-Fig4.JPG&diff=3791File:T-SNE-Fig4.JPG2009-08-02T21:02:27Z<p>Amir: </p>
<hr />
<div></div>Amirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=visualizing_Data_using_t-SNE&diff=3790visualizing Data using t-SNE2009-08-02T21:01:57Z<p>Amir: /* Experiments with Different Data Sets */</p>
<hr />
<div>==Introduction==<br />
The paper <ref>Laurens van der Maaten, and Geoffrey Hinton. Visualizing Data using t-SNE. ''Journal of Machine Learning Research'', 9: 2579-2605, 2008</ref> introduced a new nonlinear dimensionally reduction technique that "embeds" high-dimensional data into low-dimensional space. This technique is a variation of the Stochastic Neighbor embedding (SNE) that was proposed by Hinton and Roweis in 2002 <ref>G.E. Hinton and S.T. Roweis. Stochastic Neighbor embedding. In ''Advances in Neural Information Processing Systems'', vol. 15, pp, 883-840, Cambridge, MA, USA, 2002. The MIT Press.</ref>, where the high-dimensional Euclidean distances between datapoints are converted into the conditional probability to describe their similarities. t-SNE, based on the same idea, is aimed to be easier for optimization and to solve the "crowding problem". In addition, the author showed that t-SNE can be applied to large data sets as well, by using random walks on neighborhood graphs. The performance of t-SNE is demonstrated on a wide variety of data sets and compared with many other visualization techniques.<br />
<br />
==Stochastic Neighbor Embedding==<br />
In SNE, the high-dimensional Euclidean distances between datapoints is first converted into probabilities. The similarity of datapoint <math> \mathbf x_j </math> to datapoint <math> \mathbf x_i </math> is then presented by the conditional probability, <math> \mathbf p_{j|i} </math>, that <math> \mathbf x_i </math> would pick <math> \mathbf x_j </math> as its neighbor when neighbors are picked in proportion to their probability density under a Gaussian centered on <math> \mathbf x_i </math>. The <math> \mathbf p_{j|i} </math> is given as<br />
<br />
<br> <center> <math> \mathbf p_{j|i} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma_i ^2 )}{\sum_{k \neq i} \exp(-||x_i-x_k ||^2/ 2\sigma_i ^2 ) }</math> </center> <br />
<br />
where <math> \mathbf k </math> is the effective number of the local neighbors, <math> \mathbf \sigma_i </math> is the variance of the Gaussian that is centered on <math> \mathbf x_i </math>, and for every <math> \mathbf x_i </math>, we set <math> \mathbf p_{i|i} = 0 </math>. It can be seen from this definition that, the closer the datapoints are, the higher the <math> \mathbf p_{j|i} </math> is. For the widely separated datapoints, <math> \mathbf p_{j|i} </math> is almost infinitesimal. <br />
<br />
With the same idea, in the low-dimensional space, we model the similarity of map point <math> \mathbf y_j </math> to <math> \mathbf y_i </math> by the conditional probability <math> \mathbf q_{j|i} </math>, which is given by<br />
<br />
<br> <center> <math> q_{j|i} = \frac{\exp(-||y_i-y_j ||^2)}{\sum_{k \neq i} \exp(-||y_i-y_k ||^2) }</math> </center><br />
<br />
where we set the variance of the Gaussian <math> \mathbf \sigma_i </math> to be <math> \frac{1}{\sqrt{2} } </math> (a different value will only result in rescaling of the final map). And again, we set <math> \mathbf q_{i|i} = 0 </math>.<br />
<br />
If the low-dimensional map points correctly present the high-dimensional datapoints, their conditional probabilities <math> \mathbf q_{j|i} </math> and <math> \mathbf p_{j|i} </math> should be equal. Therefore, the aim of SNE is to minimize the mismatch between <math> \mathbf q_{j|i} </math> and <math> \mathbf p_{j|i} </math>. This is achieved by minimizing the sum of Kullback-leibler divergence (a non-symmetric measure of the difference between two probability distributions) over all datapoints. The cost function of SNE is then expressed as <br />
<br />
<br> <center> <math> C = \sum_{i} KL(P_i||Q_i) =\sum_{i}\sum_{j \neq i} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}</math> </center><br />
<br />
where <math> \mathbf P_i </math> and <math> \mathbf Q_i </math> are the conditional probability distribution over all other points for given <math> \mathbf x_i </math> and <math> \mathbf y_i </math>. Since the Kullback-leibler divergence is asymmetric, there is a large cost for using a small <math> \mathbf q_{j|i} </math> to model a big <math> \mathbf p_{j|i} </math>, while a small cost for using a large <math> \mathbf q_{j|i} </math> to model a small <math> \mathbf p_{j|i} </math>. Therefore, the SNE cost function focuses more on local structure. It enforces both keeping the images of nearby objects nearby and keeping the images of widely separated objects relatively far apart.<br />
<br />
The remaining parameter <math> \mathbf \sigma_i </math> here is selected by performing a binary search for the value of <math> \mathbf \sigma_i </math> that produces a <math> \mathbf P_i </math> with a fixed perplexity (a measure of the effective number of neighbors, which is related to <math> \mathbf k </math>, defined as the two to the power of Shannon entropy of <math>P_i</math>) that is selected by the user.<br />
<br />
To minimize the cost function, gradient descent method is used. The gradient then is given as<br />
<br />
<br> <center> <math> \frac{\partial C}{\partial y_i} = 2\sum_{j} (y_i-y_j)([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) </math> </center><br />
<br />
which is simple and has a nice physical interpretation. The gradient can be seen as the resultant force induced by a set of springs between the map point <math> \mathbf y_i </math> and all other neighbor points <math> \mathbf y_j </math>, where the force is exerted in the direction <math> \mathbf (y_i-y_j) </math> and the stiffness of the spring is <math> \mathbf ([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) </math>.<br />
<br />
==t-Distributed Stochastic Neighbor Embedding==<br />
Although SNE showed relatively good visualizations, it has two main problems: difficulty in optimization and the "crowding problem". t-Distributed Stochastic Neighbor Embedding (t-SNE), which is a variation of SNE, is aimed to alleviate these problems. The cost function of t-SNE differs from the one of SNE in two ways: (1) it uses a symmetric version of the SNE cost function, and (2) it uses a Student-t distribution instead of Gaussian to compute the conditional probability in the low-dimensional space. <br />
<br />
=== Symmetric SNE ===<br />
In symmetric SNE, instead of the sum of the Kullback-Leibler divergences between the conditional probabilities, the cost function is a single Kullback-Leibler divergence between two joint probability distributions, <math> \mathbf P </math> in the high-dimensional space and <math> \mathbf Q </math> in the low-dimensional space.<br />
<br />
In this case, the pairwise similarities of the data points in high-dimensional space is given by,<br />
<br />
<center> <math> \mathbf p_{ij} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma^2 )}{\sum_{k \neq l} \exp(-||x_k-x_l ||^2/ 2\sigma^2 ) }</math> </center><br />
<br />
and <math> \mathbf q_{ij} </math> in low-dimensional space is<br />
<br />
<center> <math> \mathbf q_{ij} = \frac{\exp(-||y_i-y_j ||^2 )}{\sum_{k \neq l} \exp(-||y_k-y_l ||^2) }</math> </center><br />
<br />
where <math> \mathbf p_{ii} </math> and <math> \mathbf q_{ii} </math> are still zero. When a high-dimensional datapoint <math> \mathbf x_i </math> is a outlier (far from all the other points), we set <math> \mathbf{p_{ij}=\frac {(p_{j|i}+p_{i|j})}{2n}} </math> to ensure that <math>\sum_{j} p_{ij} > \frac {1}{2n} </math> for all <math> \mathbf x_i </math>. This will make sure that all <math> \mathbf x_i </math> make significant contribution to the cost function, which is given as<br />
<br />
<center> <math> C = KL(P||Q) =\sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}</math> </center><br />
<br />
As we can see, by definition, we have <math> \mathbf p_{ij} = p_{ji} </math> and <math> \mathbf q_{ij} = q_{ji} </math>. This is why it is called symmetric SNE.<br />
<br />
From the cost function, we have the gradient as simple as<br />
<br />
<center> <math> \frac{\partial C}{\partial y_i} = 4\sum_{j} (y_i-y_j)(p_{ij}-q_{ij}) </math> </center><br />
<br />
which is the main advantage of symmetric SNE.<br />
<br />
=== The Crowding Problem ===<br />
The "crowding problem" that are addressed in the paper is defined as: "the area of the two-dimensional map that is available to accommodate moderately distant datapoints will not be nearly large enough compared with the area available to accommodate nearby datepoints". This happens when the datapoints are distributed in a region on a high-dimensional manifold around <math> i </math>, and we try to model the pairwise distances from <math> i </math> to the datapoints in a two-dimensional map. For example, it is possible to have 11 datapoints that are mutually equidistant in a ten-dimensional manifold but it is not possible to model this faithfully in a two-dimensional map. Therefore, if the small distances can be modeled accurately in a map, most of the moderately distant datapoints will be too far away in the two-dimensional map. In SNE, this will result in very small attractive force from datapoint <math> i </math> to these too-distant map points. The very large number of such forces collapses together the points in the center of the map and prevents gaps from forming between the natural clusters. This phenomena, crowding problem, is not specific to SNE and can be observed in other local techniques such as Sammon mapping as well.<br /><br />
According to Cook et al.(2007), adding a slight repulsion can address this problem. Using a uniform backgorund model with a small mixing proportion, <math>\,\rho</math>, helps <math>\,q_{ij}</math> never fall below <math>\frac{2\rho}{n(n-1)}</math>. In this technique, called UNI-SNE, <math>\,q_{ij}</math> will be larger than <math>\,p_{ij}</math> even for the far-apart datapoints.<br />
<br />
=== Compensation for Mismatched Dimensionality by Mismatched Tails ===<br />
Since the crowding problem is caused by the unwanted attractive forces between map points that present moderately dissimilar datapoints nearby, one solution is to model these datapoints by a much larger distance in the map to eliminates the attractive forces. This can be achieved by using a probability distribution that has much heavier tails than a Gaussian to convert the distances into probabilities in the low-dimensional space. Student t-distribution is selected because it is closely related to the Gaussian distribution, but it is much faster computationally since it does not involve any exponential. In addition, t-distribution as a heavier tail distribution allows a temperate distance to be modeled by a larger distance in the map that eliminates the unwanted attractive forces between dissimilar data points.<br />
<br />
In t-SNE, Student t-distribution with one degree of freedom is employed in the low-dimensional map. Based on the symmetric SNE, the joint probabilities in high-dimensional <math> \mathbf p_{ij} </math> are still<br />
<br />
<center> <math> \mathbf{p_{ij}=\frac{(p_{j|i}+p_{i|j})}{2n}} </math> </center><br />
<br />
while the joint probabilities <math> \mathbf q_{ij} </math> are defined as <br />
<br />
<center> <math> \mathbf q_{ij} = \frac{(1 + ||y_i-y_j ||^2 )^{-1}}{\sum_{k \neq l} (1 + ||y_k-y_l ||^2 )^{-1}}</math> </center><br />
<br />
The gradient of the Kullback-Leibler divergence between <math> P </math> and the Student-t based joint probability distribution <math> Q </math> is then given by<br />
<br />
<center> <math> \frac{\partial C}{\partial y_i} = 4\sum_{j} (y_i-y_j)(p_{ij}-q_{ij})(1 + ||y_i-y_j ||^2 )^{-1} </math> </center><br />
<br />
Compared with the gradients of SNE and UNI-SNE <ref> J.A. Cook, and I. Sutskever et al.. Visualizing similarity data with a mixture of maps. ''In Proceeding of the 11<sup>th</sup> International Conference on Artificial Intelligence and Statistics'', volume 2, page, 67-74, 2007.</ref>, the t-SNE gradients introduces strong repulsions between the dissimilar datapoints that are modeled by small pairwise distance in the low-dimensional map. This well prevents the crowding problem that was mentioned above. At the same time, these repulsions do not go to infinity, which prevents the dissimilar datapoints from being too far apart. Therefore, the t-SNE models dissimilar datepoints by means of large pairwise distance, while models similar datapoints by means of small pairwise distance. This results in the good representation of both local and global structure of the high-dimensional data.<br />
<br />
=== Optimization Methods for t-SNE ===<br />
One ways to optimize the t-SNE cost function is to use a momentum term to reduce the number of required iteration. To further improve the modeling results, two tricks called "early compression" and "early exaggeration" can be used. The "early compression" is to force the map points to stay close together at the early stage of the optimization so that it is easy for explore the space of possible global organizations of the data. "Early exaggeration" is to multiply all the <math> \mathbf p_{ij} </math>'s by a <math> n>1 </math> in the initial stages of the optimization. This will make all the <math> \mathbf q_{ij} </math>'s too small to model their corresponding <math> \mathbf p_{ij} </math>'s, so that the modeling are forced to focus on large <math> \mathbf p_{ij} </math>'s. This leads to the formation of tight widely separated clusters in the map, which makes it very easy to move around the clusters for a good global organization.<br />
<br />
==Experiments with Different Data Sets==<br />
==References==<br />
<references/></div>
<div>==Introduction==<br />
The paper <ref>Laurens van der Maaten, and Geoffrey Hinton. Visualizing Data using t-SNE. ''Journal of Machine Learning Research'', 9: 2579-2605, 2008</ref> introduced a new nonlinear dimensionality reduction technique that "embeds" high-dimensional data into a low-dimensional space. This technique is a variation of Stochastic Neighbor Embedding (SNE), proposed by Hinton and Roweis in 2002 <ref>G.E. Hinton and S.T. Roweis. Stochastic Neighbor embedding. In ''Advances in Neural Information Processing Systems'', vol. 15, pp. 883-840, Cambridge, MA, USA, 2002. The MIT Press.</ref>, in which the high-dimensional Euclidean distances between datapoints are converted into conditional probabilities that describe their similarities. t-SNE, based on the same idea, aims to be easier to optimize and to solve the "crowding problem". In addition, the authors showed that t-SNE can be applied to large data sets as well, by using random walks on neighborhood graphs. The performance of t-SNE is demonstrated on a wide variety of data sets and compared with many other visualization techniques.<br />
<br />
==Stochastic Neighbor Embedding==<br />
In SNE, the high-dimensional Euclidean distances between datapoints are first converted into probabilities. The similarity of datapoint <math> \mathbf x_j </math> to datapoint <math> \mathbf x_i </math> is represented by the conditional probability, <math> \mathbf p_{j|i} </math>, that <math> \mathbf x_i </math> would pick <math> \mathbf x_j </math> as its neighbor when neighbors are picked in proportion to their probability density under a Gaussian centered on <math> \mathbf x_i </math>. <math> \mathbf p_{j|i} </math> is given by<br />
<br />
<br> <center> <math> \mathbf p_{j|i} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma_i ^2 )}{\sum_{k \neq i} \exp(-||x_i-x_k ||^2/ 2\sigma_i ^2 ) }</math> </center> <br />
<br />
where <math> \mathbf \sigma_i </math> is the variance of the Gaussian that is centered on <math> \mathbf x_i </math>, the sum in the denominator runs over all other datapoints <math> \mathbf x_k </math>, and for every <math> \mathbf x_i </math> we set <math> \mathbf p_{i|i} = 0 </math>. It can be seen from this definition that the closer two datapoints are, the higher <math> \mathbf p_{j|i} </math> is; for widely separated datapoints, <math> \mathbf p_{j|i} </math> is almost zero. <br />
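To make the definition concrete, here is a small pure-Python sketch (function and variable names are ours, not from the paper) that computes <math> \mathbf p_{j|i} </math> from precomputed squared pairwise distances:<br />

```python
import math

def conditional_p(sq_dists, i, sigma_i):
    """p_{j|i}: Gaussian similarities of every x_j to x_i, normalized to sum to 1.

    sq_dists is an n x n matrix of squared Euclidean distances; p_{i|i} = 0.
    """
    n = len(sq_dists)
    w = [0.0 if j == i else math.exp(-sq_dists[i][j] / (2.0 * sigma_i ** 2))
         for j in range(n)]
    total = sum(w)
    return [x / total for x in w]

# Three collinear points at positions 0, 1 and 3 (squared distances below):
sq_dists = [[0.0, 1.0, 9.0],
            [1.0, 0.0, 4.0],
            [9.0, 4.0, 0.0]]
p = conditional_p(sq_dists, 0, sigma_i=1.0)
# The nearer point (index 1) receives the larger conditional probability.
```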
<br />
With the same idea, in the low-dimensional space, we model the similarity of map point <math> \mathbf y_j </math> to <math> \mathbf y_i </math> by the conditional probability <math> \mathbf q_{j|i} </math>, which is given by<br />
<br />
<br> <center> <math> q_{j|i} = \frac{\exp(-||y_i-y_j ||^2)}{\sum_{k \neq i} \exp(-||y_i-y_k ||^2) }</math> </center><br />
<br />
where we set the variance of the Gaussian <math> \mathbf \sigma_i </math> to <math> \frac{1}{\sqrt{2} } </math> (a different value would only result in a rescaled version of the final map). Again, we set <math> \mathbf q_{i|i} = 0 </math>.<br />
<br />
If the low-dimensional map points correctly represent the high-dimensional datapoints, the conditional probabilities <math> \mathbf q_{j|i} </math> and <math> \mathbf p_{j|i} </math> should be equal. The aim of SNE is therefore to minimize the mismatch between <math> \mathbf q_{j|i} </math> and <math> \mathbf p_{j|i} </math>. This is achieved by minimizing the sum of Kullback–Leibler divergences (a non-symmetric measure of the difference between two probability distributions) over all datapoints. The cost function of SNE is then expressed as <br />
<br />
<br> <center> <math> C = \sum_{i} KL(P_i||Q_i) =\sum_{i}\sum_{j \neq i} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}</math> </center><br />
<br />
where <math> \mathbf P_i </math> and <math> \mathbf Q_i </math> are the conditional probability distributions over all other points given <math> \mathbf x_i </math> and <math> \mathbf y_i </math>, respectively. Since the Kullback–Leibler divergence is asymmetric, there is a large cost for using a small <math> \mathbf q_{j|i} </math> to model a large <math> \mathbf p_{j|i} </math>, but only a small cost for using a large <math> \mathbf q_{j|i} </math> to model a small <math> \mathbf p_{j|i} </math>. The SNE cost function therefore focuses on preserving local structure: it keeps the images of nearby objects nearby while keeping the images of widely separated objects relatively far apart.<br />
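A minimal sketch of this cost function, illustrating the asymmetry of the Kullback–Leibler divergence (all names are ours, chosen for illustration):<br />

```python
import math

def sne_cost(P, Q):
    # C = sum_i KL(P_i || Q_i) = sum_i sum_{j != i} p_{j|i} log(p_{j|i} / q_{j|i})
    c = 0.0
    for i in range(len(P)):
        for j in range(len(P)):
            if i != j and P[i][j] > 0.0:
                c += P[i][j] * math.log(P[i][j] / Q[i][j])
    return c

# Rows are conditional distributions p_{.|i}; the diagonal is zero.
P = [[0.0, 0.9, 0.1], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
Q = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
# KL is asymmetric: sne_cost(P, Q) and sne_cost(Q, P) differ, and modeling a
# large p by a small q is penalized more heavily than the reverse.
```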
<br />
The remaining parameter <math> \mathbf \sigma_i </math> is selected by performing a binary search for the value of <math> \mathbf \sigma_i </math> that produces a <math> \mathbf P_i </math> with a fixed perplexity (a measure of the effective number of neighbors, defined as two to the power of the Shannon entropy of <math>P_i</math>) that is specified by the user.<br />
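The binary search can be sketched as follows; since perplexity increases monotonically with <math> \mathbf \sigma_i </math>, bisection converges (names and the search bounds are illustrative choices, not from the paper):<br />

```python
import math

def p_row(sq_dists_i, i, sigma):
    # Conditional probabilities p_{j|i} for one point i (p_{i|i} = 0).
    w = [0.0 if j == i else math.exp(-d / (2.0 * sigma * sigma))
         for j, d in enumerate(sq_dists_i)]
    s = sum(w)
    return [x / s for x in w]

def perplexity(p):
    # Perp(P_i) = 2^{H(P_i)}, with H the Shannon entropy measured in bits.
    h = -sum(q * math.log2(q) for q in p if q > 0.0)
    return 2.0 ** h

def find_sigma(sq_dists_i, i, target, lo=1e-2, hi=1e3, iters=64):
    # Perplexity grows monotonically with sigma_i, so bisection homes in on
    # the sigma_i whose P_i has the user-specified perplexity.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if perplexity(p_row(sq_dists_i, i, mid)) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Squared distances from x_0 to itself and to three neighbors:
row = [0.0, 1.0, 4.0, 9.0]
sigma = find_sigma(row, 0, target=2.0)
```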
<br />
To minimize the cost function, a gradient descent method is used. The gradient is given by<br />
<br />
<br> <center> <math> \frac{\partial C}{\partial y_i} = 2\sum_{j} (y_i-y_j)([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) </math> </center><br />
<br />
which is simple and has a nice physical interpretation: the gradient can be seen as the resultant force induced by a set of springs between the map point <math> \mathbf y_i </math> and all other map points <math> \mathbf y_j </math>, where each spring acts in the direction <math> \mathbf (y_i-y_j) </math> and has stiffness <math> \mathbf ([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) </math>.<br />
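The spring interpretation translates directly into code; a minimal sketch (names ours) with P and Q given as matrices of conditional probabilities:<br />

```python
def sne_gradient(Y, P, Q):
    # dC/dy_i = 2 * sum_j ([p_{j|i} - q_{j|i}] + [p_{i|j} - q_{i|j}]) * (y_i - y_j)
    n, dim = len(Y), len(Y[0])
    grad = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            stiffness = (P[i][j] - Q[i][j]) + (P[j][i] - Q[j][i])
            for d in range(dim):
                grad[i][d] += 2.0 * stiffness * (Y[i][d] - Y[j][d])
    return grad

# When Q matches P exactly, every spring is at rest and the gradient vanishes.
Y = [[0.0, 0.0], [1.0, 0.0]]
P = [[0.0, 1.0], [1.0, 0.0]]
g = sne_gradient(Y, P, P)
```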
<br />
==t-Distributed Stochastic Neighbor Embedding==<br />
Although SNE produces reasonably good visualizations, it has two main problems: it is difficult to optimize, and it suffers from the "crowding problem". t-Distributed Stochastic Neighbor Embedding (t-SNE), a variation of SNE, aims to alleviate these problems. The cost function of t-SNE differs from that of SNE in two ways: (1) it uses a symmetric version of the SNE cost function, and (2) it uses a Student t-distribution instead of a Gaussian to compute the similarities between map points in the low-dimensional space. <br />
<br />
=== Symmetric SNE ===<br />
In symmetric SNE, instead of the sum of the Kullback-Leibler divergences between the conditional probabilities, the cost function is a single Kullback-Leibler divergence between two joint probability distributions, <math> \mathbf P </math> in the high-dimensional space and <math> \mathbf Q </math> in the low-dimensional space.<br />
<br />
In this case, the pairwise similarities of the datapoints in the high-dimensional space are given by<br />
<br />
<center> <math> \mathbf p_{ij} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma^2 )}{\sum_{k \neq l} \exp(-||x_k-x_l ||^2/ 2\sigma^2 ) }</math> </center><br />
<br />
and <math> \mathbf q_{ij} </math> in low-dimensional space is<br />
<br />
<center> <math> \mathbf q_{ij} = \frac{\exp(-||y_i-y_j ||^2 )}{\sum_{k \neq l} \exp(-||y_k-y_l ||^2) }</math> </center><br />
<br />
where <math> \mathbf p_{ii} </math> and <math> \mathbf q_{ii} </math> are still zero. When a high-dimensional datapoint <math> \mathbf x_i </math> is an outlier (far from all the other points), all of its pairwise similarities <math> \mathbf p_{ij} </math> would be extremely small under the definition above, so the position of its map point would have almost no influence on the cost function. To avoid this, we instead set <math> \mathbf{p_{ij}=\frac {(p_{j|i}+p_{i|j})}{2n}} </math>, which ensures that <math>\sum_{j} p_{ij} > \frac {1}{2n} </math> for all <math> \mathbf x_i </math>. This makes sure that every <math> \mathbf x_i </math> makes a significant contribution to the cost function, which is given as<br />
<br />
<center> <math> C = KL(P||Q) =\sum_{i}\sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}}</math> </center><br />
<br />
As we can see, by definition, we have <math> \mathbf p_{ij} = p_{ji} </math> and <math> \mathbf q_{ij} = q_{ji} </math>. This is why it is called symmetric SNE.<br />
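The symmetrized joint probabilities can be computed from the conditional ones as follows (a sketch; names ours):<br />

```python
def joint_p(cond_p):
    # p_ij = (p_{j|i} + p_{i|j}) / (2n): symmetric by construction, sums to 1
    # over all pairs, and guarantees sum_j p_ij > 1/(2n) even for outliers x_i.
    n = len(cond_p)
    return [[(cond_p[i][j] + cond_p[j][i]) / (2.0 * n) for j in range(n)]
            for i in range(n)]

# Rows of cond_p are conditional distributions p_{.|i} (each row sums to 1).
cond_p = [[0.0, 0.6, 0.4], [0.7, 0.0, 0.3], [0.2, 0.8, 0.0]]
P = joint_p(cond_p)
```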
<br />
From this cost function, we obtain a gradient that is as simple as<br />
<br />
<center> <math> \frac{\partial C}{\partial y_i} = 4\sum_{j} (y_i-y_j)(p_{ij}-q_{ij}) </math> </center><br />
<br />
which is the main advantage of symmetric SNE.<br />
<br />
=== The Crowding Problem ===<br />
The "crowding problem" that is addressed in the paper is defined as: "the area of the two-dimensional map that is available to accommodate moderately distant datapoints will not be nearly large enough compared with the area available to accommodate nearby datapoints". This happens when the datapoints are distributed in a region on a high-dimensional manifold around <math> i </math>, and we try to model the pairwise distances from <math> i </math> to the datapoints in a two-dimensional map. For example, it is possible to have 11 datapoints that are mutually equidistant in a ten-dimensional manifold, but it is not possible to model this faithfully in a two-dimensional map. Therefore, if the small distances are modeled accurately in the map, most of the moderately distant datapoints have to be placed much too far away in the two-dimensional map. In SNE, this results in very small attractive forces from datapoint <math> i </math> to these too-distant map points. The very large number of such forces collapses the points in the center of the map and prevents gaps from forming between the natural clusters. This phenomenon, the crowding problem, is not specific to SNE and can be observed in other local techniques such as Sammon mapping as well.<br /><br />
According to Cook et al. (2007), adding a slight repulsion can address this problem. Using a uniform background model with a small mixing proportion, <math>\,\rho</math>, ensures that <math>\,q_{ij}</math> can never fall below <math>\frac{2\rho}{n(n-1)}</math>. In this technique, called UNI-SNE, <math>\,q_{ij}</math> will be larger than <math>\,p_{ij}</math> even for far-apart datapoints.<br />
<br />
=== Compensation for Mismatched Dimensionality by Mismatched Tails ===<br />
Since the crowding problem is caused by the unwanted attractive forces between map points that represent moderately dissimilar datapoints, one solution is to model these datapoints by a much larger distance in the map, which eliminates the attractive forces. This can be achieved by using a probability distribution that has much heavier tails than a Gaussian to convert distances into probabilities in the low-dimensional space. The Student t-distribution is selected because it is closely related to the Gaussian distribution, but it is much faster to evaluate computationally since it does not involve an exponential. In addition, as a heavy-tailed distribution, it allows a moderate distance in the high-dimensional space to be modeled by a larger distance in the map, which eliminates the unwanted attractive forces between dissimilar datapoints.<br />
<br />
In t-SNE, a Student t-distribution with one degree of freedom is employed in the low-dimensional map. As in symmetric SNE, the joint probabilities <math> \mathbf p_{ij} </math> in the high-dimensional space are still<br />
<br />
<center> <math> \mathbf{p_{ij}=\frac{(p_{j|i}+p_{i|j})}{2n}} </math> </center><br />
<br />
while the joint probabilities <math> \mathbf q_{ij} </math> are defined as <br />
<br />
<center> <math> \mathbf q_{ij} = \frac{(1 + ||y_i-y_j ||^2 )^{-1}}{\sum_{k \neq l} (1 + ||y_k-y_l ||^2 )^{-1}}</math> </center><br />
<br />
The gradient of the Kullback-Leibler divergence between <math> P </math> and the Student-t based joint probability distribution <math> Q </math> is then given by<br />
<br />
<center> <math> \frac{\partial C}{\partial y_i} = 4\sum_{j} (y_i-y_j)(p_{ij}-q_{ij})(1 + ||y_i-y_j ||^2 )^{-1} </math> </center><br />
<br />
Compared with the gradients of SNE and UNI-SNE <ref> J.A. Cook, I. Sutskever, et al. Visualizing similarity data with a mixture of maps. In ''Proceedings of the 11<sup>th</sup> International Conference on Artificial Intelligence and Statistics'', volume 2, pages 67-74, 2007.</ref>, the t-SNE gradient introduces strong repulsions between dissimilar datapoints that are modeled by small pairwise distances in the low-dimensional map. This prevents the crowding problem mentioned above. At the same time, these repulsions do not go to infinity, which prevents dissimilar datapoints from being pushed too far apart. Therefore, t-SNE models dissimilar datapoints by means of large pairwise distances and similar datapoints by means of small pairwise distances, which results in a good representation of both the local and the global structure of the high-dimensional data.<br />
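Putting the Student-t similarities and this gradient together gives a compact sketch (names ours; with P = Q the gradient is exactly zero, which gives a simple sanity check):<br />

```python
def tsne_q(Y):
    # q_ij under a Student t-distribution with one degree of freedom.
    n = len(Y)
    num = [[0.0] * n for _ in range(n)]  # num[i][j] = (1 + ||y_i - y_j||^2)^-1
    for i in range(n):
        for j in range(n):
            if i != j:
                sq = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
                num[i][j] = 1.0 / (1.0 + sq)
    z = sum(map(sum, num))
    return [[v / z for v in row] for row in num], num

def tsne_gradient(Y, P):
    # dC/dy_i = 4 * sum_j (p_ij - q_ij) (y_i - y_j) (1 + ||y_i - y_j||^2)^-1
    Q, num = tsne_q(Y)
    n, dim = len(Y), len(Y[0])
    return [[4.0 * sum((P[i][j] - Q[i][j]) * (Y[i][d] - Y[j][d]) * num[i][j]
                       for j in range(n)) for d in range(dim)] for i in range(n)]

Y = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
Q, _ = tsne_q(Y)
g = tsne_gradient(Y, Q)   # with P = Q the map is at an optimum: zero gradient
```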
<br />
=== Optimization Methods for t-SNE ===<br />
One way to optimize the t-SNE cost function is to use a momentum term to reduce the number of required iterations. To further improve the results, two tricks called "early compression" and "early exaggeration" can be used. "Early compression" forces the map points to stay close together during the early stages of the optimization, so that it is easy to explore the space of possible global organizations of the data. "Early exaggeration" multiplies all the <math> \mathbf p_{ij} </math>'s by a factor greater than 1 in the initial stages of the optimization. This makes all the <math> \mathbf q_{ij} </math>'s too small to model their corresponding <math> \mathbf p_{ij} </math>'s, so that the optimization is forced to focus on the large <math> \mathbf p_{ij} </math>'s. This leads to the formation of tight, widely separated clusters in the map, which makes it very easy to move the clusters around for a good global organization.<br />
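Both tricks amount to a few lines in the update loop. A sketch (names, learning rate, momentum and the exaggeration factor are illustrative choices; the plain gradient-descent sign convention is used here):<br />

```python
def momentum_step(Y, Y_prev, grad, eta=100.0, alpha=0.5):
    # y(t+1) = y(t) - eta * dC/dy + alpha * (y(t) - y(t-1))
    new = [[Y[i][d] - eta * grad[i][d] + alpha * (Y[i][d] - Y_prev[i][d])
            for d in range(len(Y[0]))] for i in range(len(Y))]
    return new, Y   # the returned pair is (y(t+1), y(t)) for the next call

def exaggerate(P, factor=4.0):
    # "Early exaggeration": scale every p_ij up during the first iterations,
    # so the optimization first concentrates on modeling the large p_ij's.
    return [[factor * p for p in row] for row in P]

Y, Y_prev = [[1.0]], [[1.0]]
Y, Y_prev = momentum_step(Y, Y_prev, [[0.002]], eta=100.0, alpha=0.5)
# One step from rest: 1.0 - 100 * 0.002 + 0 = 0.8
```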
<br />
==Experiments with Different Data Sets==<br />
The authors performed t-SNE on five data sets and compared the results with seven other non-parametric dimensionality reduction techniques to evaluate t-SNE. The five data sets that were employed are: (1) the MNIST data set, (2) the Olivetti faces data set, (3) the COIL-20 data set, (4) the word-feature data set, and (5) the Netflix data set. <br />
<br />
When t-SNE was applied to the MNIST data set, it constructed a map with clear and clean separations between the different digit classes, while capturing most of the local structure of the data as well. In contrast, Isomap and LLE provide very little insight into the class structure of the data, and the Sammon map models the classes fairly well but does not separate them clearly. <br />
<br />
<center>[[File:T-SNE-Fig2.JPG]]</center><br />
<br />
<center>[[File:T-SNE-Fig3.JPG]]</center><br />
<br />
==t-SNE for Large Data Sets==<br />
Due to its computational and memory complexity, it is infeasible to apply the standard version of t-SNE to large data sets (those containing more than 10,000 data points). To solve this problem, t-SNE is modified to display a random subset of landmark points in a way that uses the information of the whole data set. First, a neighborhood graph over all the data points is created using a selected number of neighbors. Then, for each of the selected landmark points, random walks are defined that start from that landmark point and terminate as soon as they land on another landmark point. <math> \mathbf p_{j|i} </math> denotes the fraction of random walks starting at landmark point <math> x_i </math> that terminate at landmark point <math> x_j </math>. To avoid the "short-circuits" caused by noisy datapoints, the random walk-based affinity measure integrates over all paths through the neighborhood graph. The random walk-based similarities <math> \mathbf p_{j|i} </math> can be computed either by explicitly performing the random walks on the neighborhood graph, or by using an analytical solution <ref> L. Grady. Random walks for image segmentation. ''IEEE Transactions on Pattern Analysis and Machine Intelligence'', 28(11): 1768-1783, 2006. </ref>, which is more appropriate for very large data sets.<br />
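A Monte-Carlo version of this affinity measure is straightforward to sketch (names ours; it assumes every walk eventually reaches another landmark, i.e. the neighborhood graph is connected):<br />

```python
import random

def random_walk_p(neighbors, landmarks, i, walks=2000, seed=0):
    """Estimate p_{j|i} as the fraction of random walks on the neighborhood
    graph that start at landmark i and first hit another landmark j."""
    rng = random.Random(seed)
    others = set(landmarks) - {i}
    counts = dict.fromkeys(others, 0)
    for _ in range(walks):
        node = i
        while node not in others:          # walk until another landmark is hit
            node = rng.choice(neighbors[node])
        counts[node] += 1
    return {j: c / float(walks) for j, c in counts.items()}

# Chain graph 0 - 1 - 2 with landmarks {0, 2}: every walk from 0 ends at 2.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
p = random_walk_p(neighbors, [0, 2], 0)
```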
<br />
==Weaknesses of t-SNE==<br />
Although t-SNE has been demonstrated to be a favorable technique for data visualization, it has three potential weaknesses. (1) The paper only focuses on data visualization using t-SNE, that is, embedding high-dimensional data into a two- or three-dimensional space. The behavior of t-SNE presented in the paper cannot readily be extrapolated to d>3 dimensions because of the heavy tails of the Student t-distribution. (2) t-SNE might be less successful when applied to data sets with a high intrinsic dimensionality. This is a result of the local linearity assumption on the manifold that t-SNE makes by employing Euclidean distances to represent the similarities between datapoints. (3) Another major weakness of t-SNE is that its cost function is not convex. As a result, several optimization parameters need to be chosen, and the constructed solutions, which depend on these parameters, may differ each time t-SNE is run from an initial random configuration of the map points.<br />
<br />
==References==<br />
<references/></div>
<hr />
<div>==Introduction==<br />
The paper <ref>Laurens van der Maaten, and Geoffrey Hinton. Visualizing Data using t-SNE. ''Journal of Machine Learning Research'', 9: 2579-2605, 2008</ref> introduced a new nonlinear dimensionally reduction technique that "embeds" high-dimensional data into low-dimensional space. This technique is a variation of the Stochastic Neighbor embedding (SNE) that was proposed by Hinton and Roweis in 2002 <ref>G.E. Hinton and S.T. Roweis. Stochastic Neighbor embedding. In ''Advances in Neural Information Processing Systems'', vol. 15, pp, 883-840, Cambridge, MA, USA, 2002. The MIT Press.</ref>, where the high-dimensional Euclidean distances between datapoints are converted into the conditional probability to describe their similarities. t-SNE, based on the same idea, is aimed to be easier for optimization and to solve the "crowding problem". In addition, the author showed that t-SNE can be applied to large data sets as well, by using random walks on neighborhood graphs. The performance of t-SNE is demonstrated on a wide variety of data sets and compared with many other visualization techniques.<br />
<br />
==Stochastic Neighbor Embedding==<br />
In SNE, the high-dimensional Euclidean distances between datapoints is first converted into probabilities. The similarity of datapoint <math> \mathbf x_j </math> to datapoint <math> \mathbf x_i </math> is then presented by the conditional probability, <math> \mathbf p_{j|i} </math>, that <math> \mathbf x_i </math> would pick <math> \mathbf x_j </math> as its neighbor when neighbors are picked in proportion to their probability density under a Gaussian centered on <math> \mathbf x_i </math>. The <math> \mathbf p_{j|i} </math> is given as<br />
<br />
<br> <center> <math> \mathbf p_{j|i} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma_i ^2 )}{\sum_{k \neq i} \exp(-||x_i-x_k ||^2/ 2\sigma_i ^2 ) }</math> </center> <br />
<br />
where <math> \mathbf k </math> is the effective number of the local neighbors, <math> \mathbf \sigma_i </math> is the variance of the Gaussian that is centered on <math> \mathbf x_i </math>, and for every <math> \mathbf x_i </math>, we set <math> \mathbf p_{i|i} = 0 </math>. It can be seen from this definition that, the closer the datapoints are, the higher the <math> \mathbf p_{j|i} </math> is. For the widely separated datapoints, <math> \mathbf p_{j|i} </math> is almost infinitesimal. <br />
<br />
With the same idea, in the low-dimensional space, we model the similarity of map point <math> \mathbf y_j </math> to <math> \mathbf y_i </math> by the conditional probability <math> \mathbf q_{j|i} </math>, which is given by<br />
<br />
<br> <center> <math> q_{j|i} = \frac{\exp(-||y_i-y_j ||^2)}{\sum_{k \neq i} \exp(-||y_i-y_k ||^2) }</math> </center><br />
<br />
where we set the variance of the Gaussian <math> \mathbf \sigma_i </math> to be <math> \frac{1}{\sqrt{2} } </math> (a different value will only result in rescaling of the final map). And again, we set <math> \mathbf q_{i|i} = 0 </math>.<br />
<br />
If the low-dimensional map points correctly present the high-dimensional datapoints, their conditional probabilities <math> \mathbf q_{j|i} </math> and <math> \mathbf p_{j|i} </math> should be equal. Therefore, the aim of SNE is to minimize the mismatch between <math> \mathbf q_{j|i} </math> and <math> \mathbf p_{j|i} </math>. This is achieved by minimizing the sum of Kullback-leibler divergence (a non-symmetric measure of the difference between two probability distributions) over all datapoints. The cost function of SNE is then expressed as <br />
<br />
<br> <center> <math> C = \sum_{i} KL(P_i||Q_i) =\sum_{i}\sum_{j \neq i} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}</math> </center><br />
<br />
where <math> \mathbf P_i </math> and <math> \mathbf Q_i </math> are the conditional probability distribution over all other points for given <math> \mathbf x_i </math> and <math> \mathbf y_i </math>. Since the Kullback-leibler divergence is asymmetric, there is a large cost for using a small <math> \mathbf q_{j|i} </math> to model a big <math> \mathbf p_{j|i} </math>, while a small cost for using a large <math> \mathbf q_{j|i} </math> to model a small <math> \mathbf p_{j|i} </math>. Therefore, the SNE cost function focuses more on local structure. It enforces both keeping the images of nearby objects nearby and keeping the images of widely separated objects relatively far apart.<br />
<br />
The remaining parameter <math> \mathbf \sigma_i </math> here is selected by performing a binary search for the value of <math> \mathbf \sigma_i </math> that produces a <math> \mathbf P_i </math> with a fixed perplexity (a measure of the effective number of neighbors, which is related to <math> \mathbf k </math>, defined as the two to the power of Shannon entropy of <math>P_i</math>) that is selected by the user.<br />
<br />
To minimize the cost function, gradient descent method is used. The gradient then is given as<br />
<br />
<br> <center> <math> \frac{\partial C}{\partial y_i} = 2\sum_{j} (y_i-y_j)([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) </math> </center><br />
<br />
which is simple and has a nice physical interpretation. The gradient can be seen as the resultant force induced by a set of springs between the map point <math> \mathbf y_i </math> and all other neighbor points <math> \mathbf y_j </math>, where the force is exerted in the direction <math> \mathbf (y_i-y_j) </math> and the stiffness of the spring is <math> \mathbf ([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) </math>.<br />
<br />
==t-Distributed Stochastic Neighbor Embedding==<br />
Although SNE showed relatively good visualizations, it has two main problems: difficulty in optimization and the "crowding problem". t-Distributed Stochastic Neighbor Embedding (t-SNE), which is a variation of SNE, is aimed to alleviate these problems. The cost function of t-SNE differs from the one of SNE in two ways: (1) it uses a symmetric version of the SNE cost function, and (2) it uses a Student-t distribution instead of Gaussian to compute the conditional probability in the low-dimensional space. <br />
<br />
=== Symmetric SNE ===<br />
In symmetric SNE, instead of the sum of the Kullback-Leibler divergences between the conditional probabilities, the cost function is a single Kullback-Leibler divergence between two joint probability distributions, <math> \mathbf P </math> in the high-dimensional space and <math> \mathbf Q </math> in the low-dimensional space.<br />
<br />
In this case, the pairwise similarities of the data points in high-dimensional space is given by,<br />
<br />
<center> <math> \mathbf p_{ij} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma^2 )}{\sum_{k \neq l} \exp(-||x_k-x_l ||^2/ 2\sigma^2 ) }</math> </center><br />
<br />
and <math> \mathbf q_{ij} </math> in low-dimensional space is<br />
<br />
<center> <math> \mathbf q_{ij} = \frac{\exp(-||y_i-y_j ||^2 )}{\sum_{k \neq l} \exp(-||y_k-y_l ||^2) }</math> </center><br />
<br />
where <math> \mathbf p_{ii} </math> and <math> \mathbf q_{ii} </math> are still zero. When a high-dimensional datapoint <math> \mathbf x_i </math> is a outlier (far from all the other points), we set <math> \mathbf{p_{ij}=\frac {(p_{j|i}+p_{i|j})}{2n}} </math> to ensure that <math>\sum_{j} p_{ij} > \frac {1}{2n} </math> for all <math> \mathbf x_i </math>. This will make sure that all <math> \mathbf x_i </math> make significant contribution to the cost function, which is given as<br />
<br />
<center> <math> C = KL(P||Q) =\sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}</math> </center><br />
<br />
As we can see, by definition, we have <math> \mathbf p_{ij} = p_{ji} </math> and <math> \mathbf q_{ij} = q_{ji} </math>. This is why it is called symmetric SNE.<br />
<br />
From the cost function, we have the gradient as simple as<br />
<br />
<center> <math> \frac{\partial C}{\partial y_i} = 4\sum_{j} (y_i-y_j)(p_{ij}-q_{ij}) </math> </center><br />
<br />
which is the main advantage of symmetric SNE.<br />
<br />
=== The Crowding Problem ===<br />
The "crowding problem" that are addressed in the paper is defined as: "the area of the two-dimensional map that is available to accommodate moderately distant datapoints will not be nearly large enough compared with the area available to accommodate nearby datepoints". This happens when the datapoints are distributed in a region on a high-dimensional manifold around <math> i </math>, and we try to model the pairwise distances from <math> i </math> to the datapoints in a two-dimensional map. For example, it is possible to have 11 datapoints that are mutually equidistant in a ten-dimensional manifold but it is not possible to model this faithfully in a two-dimensional map. Therefore, if the small distances can be modeled accurately in a map, most of the moderately distant datapoints will be too far away in the two-dimensional map. In SNE, this will result in very small attractive force from datapoint <math> i </math> to these too-distant map points. The very large number of such forces collapses together the points in the center of the map and prevents gaps from forming between the natural clusters. This phenomena, crowding problem, is not specific to SNE and can be observed in other local techniques such as Sammon mapping as well.<br /><br />
According to Cook et al.(2007), adding a slight repulsion can address this problem. Using a uniform backgorund model with a small mixing proportion, <math>\,\rho</math>, helps <math>\,q_{ij}</math> never fall below <math>\frac{2\rho}{n(n-1)}</math>. In this technique, called UNI-SNE, <math>\,q_{ij}</math> will be larger than <math>\,p_{ij}</math> even for the far-apart datapoints.<br />
<br />
=== Compensation for Mismatched Dimensionality by Mismatched Tails ===<br />
Since the crowding problem is caused by the unwanted attractive forces between map points that present moderately dissimilar datapoints nearby, one solution is to model these datapoints by a much larger distance in the map to eliminates the attractive forces. This can be achieved by using a probability distribution that has much heavier tails than a Gaussian to convert the distances into probabilities in the low-dimensional space. Student t-distribution is selected because it is closely related to the Gaussian distribution, but it is much faster computationally since it does not involve any exponential. In addition, t-distribution as a heavier tail distribution allows a temperate distance to be modeled by a larger distance in the map that eliminates the unwanted attractive forces between dissimilar data points.<br />
<br />
In t-SNE, Student t-distribution with one degree of freedom is employed in the low-dimensional map. Based on the symmetric SNE, the joint probabilities in high-dimensional <math> \mathbf p_{ij} </math> are still<br />
<br />
<center> <math> \mathbf{p_{ij}=\frac{(p_{j|i}+p_{i|j})}{2n}} </math> </center><br />
<br />
while the joint probabilities <math> \mathbf q_{ij} </math> are defined as <br />
<br />
<center> <math> \mathbf q_{ij} = \frac{(1 + ||y_i-y_j ||^2 )^{-1}}{\sum_{k \neq l} (1 + ||y_k-y_l ||^2 )^{-1}}</math> </center><br />
<br />
The gradient of the Kullback-Leibler divergence between <math> P </math> and the Student-t based joint probability distribution <math> Q </math> is then given by<br />
<br />
<center> <math> \frac{\partial C}{\partial y_i} = 4\sum_{j} (y_i-y_j)(p_{ij}-q_{ij})(1 + ||y_i-y_j ||^2 )^{-1} </math> </center><br />
<br />
Compared with the gradients of SNE and UNI-SNE <ref> J.A. Cook, and I. Sutskever et al.. Visualizing similarity data with a mixture of maps. ''In Proceeding of the 11<sup>th</sup> International Conference on Artificial Intelligence and Statistics'', volume 2, page, 67-74, 2007.</ref>, the t-SNE gradients introduces strong repulsions between the dissimilar datapoints that are modeled by small pairwise distance in the low-dimensional map. This well prevents the crowding problem that was mentioned above. At the same time, these repulsions do not go to infinity, which prevents the dissimilar datapoints from being too far apart. Therefore, the t-SNE models dissimilar datepoints by means of large pairwise distance, while models similar datapoints by means of small pairwise distance. This results in the good representation of both local and global structure of the high-dimensional data.<br />
<br />
=== Optimization Methods for t-SNE ===<br />
One ways to optimize the t-SNE cost function is to use a momentum term to reduce the number of required iteration. To further improve the modeling results, two tricks called "early compression" and "early exaggeration" can be used. The "early compression" is to force the map points to stay close together at the early stage of the optimization so that it is easy for explore the space of possible global organizations of the data. "Early exaggeration" is to multiply all the <math> \mathbf p_{ij} </math>'s by a <math> n>1 </math> in the initial stages of the optimization. This will make all the <math> \mathbf q_{ij} </math>'s too small to model their corresponding <math> \mathbf p_{ij} </math>'s, so that the modeling are forced to focus on large <math> \mathbf p_{ij} </math>'s. This leads to the formation of tight widely separated clusters in the map, which makes it very easy to move around the clusters for a good global organization.<br />
<br />
==Experiments with Different Data Sets==<br />
The authors performed t-SNE on five data sets and compared the results with those of seven other non-parametric dimensionality reduction techniques in order to evaluate t-SNE. The five data sets employed are: (1) the MNIST data set, (2) the Olivetti faces data set, (3) the COIL-20 data set, (4) the word-feature data set, and (5) the Netflix data set. <br />
<br />
When t-SNE was performed on the MNIST data set, it constructed a map with clear and clean separations between the different digit classes. At the same time, most of the local structure of the data is captured as well. On the other hand, Isomap and LLE provide very little insight into the class structure of the data, while the Sammon map models the classes fairly well but does not separate them clearly. <br />
<br />
<center>[[File:T-SNE-Fig2.JPG]]</center><br />
<br />
==t-SNE for Large Data Sets==<br />
Due to its computational and memory complexity, it is infeasible to apply the standard version of t-SNE to large data sets (containing more than roughly 10,000 data points). To solve this problem, t-SNE is modified to display a random subset of landmark points in a way that uses the information of the whole data set. First, a neighborhood graph over all the data points is created for a selected number of neighbors. Then, for each landmark point, a random walk is defined which starts from that landmark point and terminates as soon as it lands on another landmark point. <math> \mathbf p_{j|i} </math> denotes the fraction of random walks that start at landmark point <math> x_i </math> and terminate at landmark point <math> x_j </math>. To avoid the "short-circuits" caused by noisy datapoints, the random walk-based affinity measure integrates over all paths through the neighborhood graph. The random walk-based similarities <math> \mathbf p_{j|i} </math> can be computed by explicitly performing the random walks on the neighborhood graph, or by using an analytical solution <ref> L. Grady. Random walks for image segmentation. ''IEEE Transactions on Pattern Analysis and Machine Intelligence'', 28(11):1768-1783, 2006. </ref>, which is more appropriate for very large data sets.<br />
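The explicit random-walk computation of <math> \mathbf p_{j|i} </math> can be sketched as follows (an illustrative simulation, not the authors' implementation; walks step uniformly to one of the k nearest neighbours and stop at the first landmark other than the start):<br />

```python
import numpy as np

def random_walk_affinities(X, landmarks, k=5, n_walks=1000, rng=None):
    """Estimate p_{j|i}: the fraction of random walks that start at landmark i
    and terminate at landmark j, walking over a k-nearest-neighbour graph.

    X         : (n, d) data points
    landmarks : list of indices of the selected landmark points
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # k-nearest-neighbour lists (each point excluded from its own list)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)
    nbrs = np.argsort(d2, axis=1)[:, :k]
    is_landmark = np.zeros(n, dtype=bool)
    is_landmark[landmarks] = True
    col = {p: a for a, p in enumerate(landmarks)}
    P = np.zeros((len(landmarks), len(landmarks)))
    for i in landmarks:
        for _ in range(n_walks):
            node = i
            while True:
                node = rng.choice(nbrs[node])        # uniform step to a neighbour
                if is_landmark[node] and node != i:  # stop at another landmark
                    break
            P[col[i], col[node]] += 1.0 / n_walks
    return P
```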
<br />
==Weaknesses of t-SNE==<br />
Although t-SNE has been demonstrated to be a favorable technique for data visualization, it has three potential weaknesses. (1) The paper only focuses on data visualization using t-SNE, that is, embedding high-dimensional data into a two- or three-dimensional space. This behavior of t-SNE cannot readily be extrapolated to d>3 dimensions due to the heavy tails of the Student t-distribution. (2) t-SNE might be less successful when applied to data sets with a high intrinsic dimensionality. This is a result of the local linearity assumption on the manifold that t-SNE makes by employing Euclidean distances to represent the similarity between datapoints. (3) Another major weakness of t-SNE is that its cost function is not convex. As a consequence, several optimization parameters need to be chosen, and the constructed solutions may differ each time t-SNE is run from an initial random configuration of the map points.<br />
<br />
==References==<br />
<references/></div>Amirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:T-SNE-Fig2.JPG&diff=3785File:T-SNE-Fig2.JPG2009-08-02T20:56:55Z<p>Amir: </p>
<hr />
<div></div>Amirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=relevant_Component_Analysis&diff=3784relevant Component Analysis2009-08-02T18:46:29Z<p>Amir: /* Experimental Results: Application to Clustering */</p>
<hr />
<div>== First paper: Shental ''et al.'', 2002 <ref>N. Shental, T. Hertz, D. Weinshall, and M. Pavel, "Adjustment Learning and Relevant Component Analysis," Proc. European Conference on Computer Vision (ECCV), 2002, pp. 776-790.</ref> ==<br />
<br />
Irrelevant data variability often causes difficulties in classification and clustering tasks. For example, when data variability is dominated by environment conditions, such as global illumination, nearest-neighbour classification in the original feature space may be very unreliable. The goal of Relevant Component Analysis (RCA) is to find a transformation that amplifies relevant variability and suppresses irrelevant variability.<br />
<br />
:: ''Definition of irrelevant variability:'' We say that data variability is correlated with a specific task "if the removal of this variability from the data deteriorates (on average) the results of clustering or retrieval" [1]. Variability is irrelevant if it is "maintained in the data" but "not correlated with the specific task" [1].<br />
<br />
To achieve this goal, Shental ''et al.'' introduced the idea of ''chunklets'' – "small sets of data points, in which the class label is constant, but unknown" [1]. As we will see, chunklets allow irrelevant variability to be suppressed without needing fully labelled training data. Since the data come unlabelled, the chunklets "must be defined naturally by the data": for example, in speaker identification, "short utterances of speech are likely to come from a single speaker" [1]. The authors coin the term ''adjustment learning'' to describe learning using chunklets; adjustment learning can be viewed as falling somewhere between unsupervised learning and supervised learning.<br />
<br />
Relevant Component Analysis tries to find a linear transformation W of the feature space such that the effect of irrelevant variability is reduced in the transformed space. That is, we wish to rescale the feature space and reduce the weights of irrelevant directions. The main premise of RCA is that we can reduce irrelevant variability by reducing the within-class variability. Intuitively, a direction which exhibits high variability among samples of the same class is unlikely to be useful for classification or clustering. <br />
<br />
RCA assumes that the class covariances are all equal. If we allow this assumption, it makes sense to rescale the feature space using a whitening transformation based on the common class covariance Σ. This gives the familiar transformation W = VΛ<sup>-1/2</sup>, where V and Λ can be found by the singular value decomposition of Σ.<br />
<br />
With labelled data estimating Σ is straightforward, but in RCA labelled data is not available and an approximation is calculated using chunklets. The ''chunklet scatter matrix'' is calculated by<br />
<br />
:: <math>S_{ch} = \frac{1}{|\Omega|}\sum_{n=1}^N|H_n|Cov(H_n)</math><br />
<br />
where |Ω| is the size of the data set, H<sub>n</sub> is the nth chunklet, |H<sub>n</sub>| is the size of the nth chunklet, and N is the number of chunklets.<br />
<br />
Intuitively, this is a weighted average of the chunklet covariances, with weight proportional to the size of the chunklet. Each chunklet is expected to provide a good approximation of its class mean regardless of its size, but size still matters: the larger a chunklet, the more reliable its approximation of the class mean.<br />
<br />
The steps of the RCA algorithm are as follows:<br />
<br />
:: "1. Calculate S<sub>ch</sub>... Let r denote its effective rank (the number of singular values of S<sub>ch</sub> which are significantly larger than 0).<br />
:: 2. Compute the total covariance (scatter) matrix of the original data S<sub>T</sub>, and project the data using PCA to its r largest dimensions.<br />
:: 3. Project S<sub>ch</sub> onto the reduced dimensional space, and compute the corresponding whitening transformation W.<br />
:: 4. Apply W to the original data (in the reduced space)." [1]<br /><br />
Those directions in which the data variability is mainly due to within-class variability are irrelevant for classification, and the computed W assigns lower weight to these directions.<br />
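The four steps can be sketched in a few lines of numpy (a minimal illustration, not the authors' code; a symmetric variant of the whitening transform is used, which whitens the chunklet scatter equally well):<br />

```python
import numpy as np

def rca(X, chunklets, r=None):
    """Relevant Component Analysis.

    X         : (n, d) data matrix
    chunklets : list of index arrays, each a set of points sharing an (unknown) class
    r         : effective rank kept in the PCA step (default: keep all dimensions)
    Returns the transformed data and the whitening transformation W.
    """
    n, d = X.shape
    # Step 1: chunklet scatter matrix S_ch (weighted average of chunklet covariances)
    S_ch = np.zeros((d, d))
    for H in chunklets:
        Xh = X[H] - X[H].mean(axis=0)
        S_ch += Xh.T @ Xh            # equals |H_n| * Cov(H_n)
    S_ch /= n
    # Step 2: project the data onto its r leading principal components
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    A = Vt[: (d if r is None else r)]
    Xr = Xc @ A.T
    # Step 3: whitening transformation from the projected chunklet scatter
    lam, V = np.linalg.eigh(A @ S_ch @ A.T)
    W = V @ np.diag(lam ** -0.5) @ V.T
    # Step 4: apply W to the data in the reduced space
    return Xr @ W.T, W
```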
<br />
'''Experimental Results: Face Recognition'''<br />
<br />
The authors demonstrated the performance of RCA for the task of face recognition using the yaleA database. The database contains 155 face images of 15 people; lighting conditions and facial expression are varied across images. RCA is compared with the Eigenface method (based on PCA) and the Fisherface method (based on Fisher’s Linear Discriminant) for both nearest neighbour classification and clustering-based classification. In this dataset, the data is not naturally divided into chunklets, so the authors randomly sample chunklets given the ground-truth class (for example, if an individual is represented in 10 images, two chunklets may be formed by randomly partitioning the images into two groups of 5 images.) <br />
<br />
For nearest neighbour classification, RCA outperforms Eigenface but does slightly worse than Fisherface. For clustering, RCA performs better than Eigenface and comparably to Fisherface. The authors pointed out that these experimental results are encouraging as Fisherface is a supervised method.<br />
<br />
In <ref> M. Sorci, G. Antonini, and J.-P. Thiran, "Fisher's discriminant and relevant component analysis for static facial expression classification."</ref>, it is shown that, in the context of a facial expression recognition framework, RCA in combination with FLD yields a better classifier than RCA alone, with results comparable to an SVM.<br />
<br />
'''Experimental Results: Surveillance'''<br />
<br />
In a second experiment, the authors used surveillance video footage divided into discrete clips in which a single person is featured. The same person can appear in multiple clips, and the task was to retrieve all clips in which a query person appears. A colour histogram is used to represent a person. Sources of irrelevant variation include reflections, occlusions, and illumination. In this experiment, the data does come naturally in chunklets: each clip features a single person, so frames in the same clip form a chunklet. Figure 7 in the paper shows the results of k-nearest neighbour classification (not reproduced here for copyright reasons).<br />
<br />
== Second Paper: Bar-Hillel ''et al.'', 2003 <ref> A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall, "Learning Distance Functions using Equivalence Relations," Proc. International Conference on Machine Learning (ICML), 2003, pp. 11-18. </ref> ==<br />
<br />
In a subsequent work [2], Bar-Hillel ''et al.'' described how RCA can be shown to optimize an information theoretic criterion, and compared the performance of RCA with the approach proposed by Xing ''et al.'' [3].<br />
<br />
'''Information Maximization'''<br />
<br />
According to information theory, "when an input X is transformed into a new representation Y, we should seek to maximize the mutual information I(X, Y) between X and Y under suitable constraints" [2]. In adjustment learning, the objective can be taken to be keeping chunklet points close to each other in the transformed space. More formally:<br />
<br />
::<math>\max_{f \in F}I(X,Y) \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||y_{ji} - m_j^y||^2 \le K</math><br />
<br />
where f is a transformation function, m<sub>j</sub><sup>y</sup> is the mean of chunklet j in the transformed space, p is the total number of chunklet points, and K is a constant.<br />
<br />
To maximize I(X,Y), we can simply maximize the entropy of Y, H(Y). This is because I(X,Y) = H(Y) – H(Y|X), and H(Y|X) is constant since the transformation is deterministic. Intuitively, since the transformation is deterministic there is no uncertainty in Y if X is known. <br />
<br />
Now we would like to express H(Y) in terms of H(X). If the transformation is invertible, we have p<sub>y</sub>(y) = p<sub>x</sub>(x) / |J(x)|, where J(x) is the Jacobian of the transformation. Therefore,<br />
<br />
::<math><br />
\begin{align}<br />
H(Y) & = -\int_y p(y)\log p(y)\, dy \\<br />
& = -\int_x p(x) \log \frac{p(x)}{|J(x)|} \, dx \\<br />
& = H(X) + \langle \log |J(x)| \rangle_x<br />
\end{align}<br />
</math><br />
<br />
Assuming a linear transformation Y = AX, the Jacobian determinant is simply the constant |A|. So to maximize I(X,Y) we can maximize H(Y), and maximizing H(Y) amounts to maximizing |A|. Hence, the optimization objective can be written as<br />
<br />
::<math>\max_A |A| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_{A^tA} \le K</math><br />
<br />
This can also be expressed in terms of the Mahalanobis distance matrix B = A<sup>t</sup>A as follows, noting that log |A| = (1/2) log |B|.<br />
<br />
::<math>\max_B |B| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \le K , \quad B > 0</math><br />
<br />
The solution to this problem is <math>B = \tfrac{K}{D} \hat{C}^{-1}</math>, where <math>\hat{C}</math> is the chunklet scatter matrix calculated in Step 1 of RCA and <math>D</math> is the dimensionality of the data. Thus, RCA gives the optimal Mahalanobis distance matrix up to a scale factor.<br />
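A brief sketch of why this is the solution (filling in the Lagrangian step): writing the constraint as <math>\frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B = tr(B\hat{C}) \le K</math>, the problem becomes maximizing <math>\log|B|</math> subject to <math>tr(B\hat{C}) \le K</math>. Setting the gradient of the Lagrangian <math>\log|B| - \lambda \, tr(B\hat{C})</math> with respect to <math>B</math> to zero gives <math>B^{-1} = \lambda\hat{C}</math>, so <math>B \propto \hat{C}^{-1}</math>; the active constraint then fixes the scale factor.<br />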
<br />
<br />
'''Within-Chunklet Distance Minimization'''<br />
<br />
In addition, RCA minimizes the sum of within-chunklet squared distances. If we consider the optimization problem<br />
<br />
::<math>\min_B \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \quad s.t. \quad |B| \ge 1</math> <br />
<br />
then it can be shown that RCA once again gives the optimal Mahalanobis distance matrix up to a scale factor. This property suggests a natural comparison with Xing ''et al.''’s method, which similarly learns a distance metric based on similarity side information. Xing ''et al.''’s method assumes side information in the form of pairwise similarities and dissimilarities, and seeks to optimize<br />
<br />
::<math>\min_B \sum_{(x_1,x_2) \in S} ||x_1 - x_2||^2_B \quad s.t. \sum_{(x_1,x_2) \in D} ||x_1 - x_2||_B \ge 1 , \quad B \ge 0 </math><br />
<br />
where S contains similar pairs and D contains dissimilar pairs. Compared with the preceding optimization problem, if all chunklets have size 2 (i.e., the chunklets are just pairwise similarities), the objective functions are the same up to a scale factor.<br />
<br />
The authors compared the clustering performance of RCA with Xing ''et al.''’s method <ref> E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information", Advances in Neural Information Processing Systems, 2002. </ref> using six of the UC Irvine datasets. Clustering performance was measured using a normalized accuracy score defined as<br />
<br />
::<math>\sum_{i > j}\frac{1 \lbrace 1 \lbrace c_i = c_j \rbrace = 1 \lbrace \hat{c}_i = \hat{c}_j \rbrace \rbrace}{0.5m(m-1)}</math><br />
<br />
where 1{ } is the indicator function, <math>\hat{c}</math> is the assigned cluster, and c is the true cluster. The score may be interpreted as the probability of correctly assigning two randomly drawn points.<br />
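For illustration, the score can be computed directly from the two label vectors (a small numpy sketch with names of our choosing):<br />

```python
import numpy as np

def pairwise_accuracy(c_true, c_pred):
    """Fraction of point pairs (i, j), i > j, on which the predicted and true
    clusterings agree about being in the same cluster or in different clusters.
    """
    c_true = np.asarray(c_true)
    c_pred = np.asarray(c_pred)
    same_true = c_true[:, None] == c_true[None, :]
    same_pred = c_pred[:, None] == c_pred[None, :]
    agree = same_true == same_pred
    iu = np.triu_indices(len(c_true), k=1)   # each pair counted once
    return agree[iu].mean()                  # equals the sum over 0.5 m (m - 1)
```

Note that the score is invariant to a relabelling of the predicted clusters, since it only compares same/different decisions.<br />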
<br />
Overall, RCA yielded an improvement over regular K-means and showed comparable performance to Xing ''et al.''’s method, however RCA is more computationally efficient as it works with closed-form expressions while Xing ''et al.''’s method requires iterative gradient descent.<br />
<br />
== Suggestions/Critique ==<br />
<br />
* RCA makes effective use of limited side information in the form of chunklets, however in most applications the data does not naturally come in chunklets. Indeed, in the face recognition experiments, the authors had to make use of prior information to artificially create chunklets. It may be useful if the authors provided additional examples of applications where data is naturally partitioned into chunklets, to further motivate the applicability of RCA.<br />
<br />
* RCA also assumes equal class covariances, which might limit its performance on many real-world datasets.<br />
<br />
* In the UC Irvine experiments, RCA shows similar performance to Xing ''et al.''’s method, but the authors noted that RCA is more computationally efficient. While they make a sensible logical argument (iterative gradient descent tends to be computationally expensive), providing experimental running times may help support and quantify this claim.<br />
<br />
<br />
====Why Equal Variances for Chunklets====<br />
<br />
In [2] the authors suppose that <math> C_{m} </math> is the random variable describing the distribution of the data in class <math> m </math>; then, assuming equal class covariances, they calculate <math> S_{ch} </math> as mentioned above.<br><br />
<br />
Further, suppose that the data in class <math> m </math> depend on another source of variation <math> G </math> besides the class characteristics (<math> G </math> can be a global variation or a sensor characteristic). Now the random variable for the <math> m </math>th class is <math> X=C_{m}+G </math>, where the global effect <math> G </math> is the same for all classes, <math> G </math> is independent of <math> C_{m} </math>, and the global variation is larger than the class variation (<math> \Sigma_{m}<\Sigma_{G} </math>). <br><br />
<br />
In this situation the covariance of class <math> m </math> is <math> \Sigma_{m}+\Sigma_{G} </math>, which by assumption is dominated by <math> \Sigma_{G} </math>. Hence all class covariances are approximately equal to <math> \Sigma_{G} </math>, which justifies the equal-covariance assumption.<br><br />
<br />
== Kernel RCA==<br />
<br />
Although RCA has significant computational and technical advantages, there are situations in real problems that RCA fails to handle, i.e., there are some restrictions on RCA: <br><br />
<br />
(i) RCA only considers linear transformations, and fails for nonlinear ones (even simple ones);<br><br />
(ii) since RCA acts in the input space, its number of parameters depends on the dimensionality of the feature vectors;<br><br />
(iii) RCA requires a vectorial representation of the data, which may not be natural for some kinds of data, such as protein sequences.<br><br />
<br />
To overcome these restrictions, Tsang and colleagues (2005)<ref> I. W. Tsang, P. M. Cheung, and J. T. Kwok, "Kernel Relevant Component Analysis for Distance Metric Learning," International Joint Conference on Neural Networks (IJCNN), Montreal, Canada, July 31 - August 4, 2005. </ref> suggested using kernels in RCA and showed how RCA can be kernelized.<br />
<br />
===Kernelizing RCA===<br />
For <math>k</math> given chunklets, each containing <math>n_{i}</math> patterns <math>\left\{x_{i,1},...,x_{i,n_{i}} \right\}</math>, the covariance matrix of the centered patterns is:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\bar{x}_{i}\right)\left(x_{i,j}-\bar{x}_{i}\right)^{'} </math><br />
<br />
and the associated whitening transform is<br />
<br />
<math>x\stackrel{}{\rightarrow}C^{-\frac{1}{2}}x </math><br />
<br />
Now let <math>X=\left[x_{1,1},x_{1,2},...,x_{1,n_{1}},...,x_{k,1},...,x_{k,n_{k}} \right]</math> be the matrix whose columns are all <math>n</math> patterns; then C can be written as:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\frac{1}{n_{i}}X1_{i}\right)\left(x_{i,j}-\frac{1}{n_{i}}X1_{i}\right)^{'} </math><br />
<br />
where <math>1_{i}</math> is an <math>n \times 1</math> vector such that:<br />
<br />
<math> [1_{i}]_{j}= \left\{\begin{matrix} <br />
1 & \text{pattern } j \in \text{chunklet } i \\ <br />
0 & \text{otherwise} \end{matrix}\right.</math><br />
<br />
and <math>I_{i}=diag\left(1_{i}\right)</math>.<br />
<br />
Using the above notation, C can be simplified to the form <math>C=\frac{1}{n}XHX^{'}</math><br />
<br />
where <math> H=\sum_{i=1}^{k}\left(I_{i}-\frac{1}{n_{i}}1_{i}1_{i}^{'}\right)</math><br />
<br />
To deal with possible singularity of <math>C</math>, for a small <math> \epsilon </math> let <math>\hat{C}=C+\epsilon I</math>; the inverse of <math>\hat{C}</math> is then (by the Woodbury identity)<br />
<br />
<math>\hat{C}^{-1}=\frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}X^{'}</math><br />
<br />
Therefore the inner product between the transformed <math>x</math> and <math>y</math> is<br />
<br />
<math> \left(\hat{C}^{-\frac{1}{2}}x\right)^{'} \left(\hat{C}^{-\frac{1}{2}}y\right)= x^{'} \hat{C}^{-1} y= x^{'} \left( \frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}X^{'} \right) y </math><br />
<br />
Now if RCA operates in a feature space <math> \mathcal{F}</math> with corresponding kernel <math> l </math>, then the inner product between the nonlinear transformations <math> \varphi (x)</math> and <math> \varphi (y)</math> after running RCA in <math> \mathcal{F}</math> is:<br />
<br />
<math> \tilde{l}(x,y)=\frac{1}{\epsilon}l(x,y)-l_{x}^{'} \left( \frac{1}{n \epsilon^{2}}H \left( I+\frac{1}{n \epsilon}LH \right)^{-1} \right) l_{y} </math><br />
<br />
where <math>L=\left[ l(x_{i},x_{j}) \right]_{ij}</math>, <math> l_{x}=\left[ l(x_{1,1},x),...,l(x_{k,n_{k}},x) \right]^{'}</math><br />
and <math> l_{y}=\left[ l(x_{1,1},y),...,l(x_{k,n_{k}},y) \right]^{'}</math><br />
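As a sanity check of the formula for <math> \tilde{l} </math>, it can be computed from kernel evaluations alone; with a linear kernel it must agree with <math> x^{'}\hat{C}^{-1}y </math> computed directly in input space. A minimal numpy sketch (names are ours, assuming every pattern belongs to some chunklet):<br />

```python
import numpy as np

def kernel_rca(L, chunklet_ids, eps=1e-3):
    """Transformed kernel after kernel RCA.

    L            : (n, n) kernel matrix on the training patterns
    chunklet_ids : length-n array, chunklet_ids[j] = i if pattern j is in chunklet i
    Returns ktilde(lx, ly, lxy), the RCA inner product between two new points
    given their kernel vectors lx, ly against the training set and lxy = l(x, y).
    """
    n = L.shape[0]
    # H = sum_i (I_i - (1/n_i) 1_i 1_i')
    H = np.zeros((n, n))
    for i in np.unique(chunklet_ids):
        ones = (np.asarray(chunklet_ids) == i).astype(float)
        H += np.diag(ones) - np.outer(ones, ones) / ones.sum()
    # middle matrix (1/(n eps^2)) H (I + (1/(n eps)) L H)^(-1)
    M = H @ np.linalg.inv(np.eye(n) + L @ H / (n * eps)) / (n * eps ** 2)
    return lambda lx, ly, lxy: lxy / eps - lx @ M @ ly
```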
<br />
== Experimental Results: Application to Clustering == <br />
<br />
The main goal of this method is to utilize side information in the form of equivalence relations to improve the performance of unsupervised learning techniques. To test the RCA algorithm described above, and to allow a comparison with the results of Xing et al., six data sets from the UC Irvine repository were used, the same ones used in <ref> E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information", Advances in Neural Information Processing Systems, 2002. </ref>. As in that paper, a set S of pairwise similarity constraints is given; with this side information, the following clustering algorithms were compared:<br />
<br />
1. K-means using the default Euclidean metric (i.e. using no side information) .<br />
<br />
2. Constrained K-means: K-means subject to points <math> \mathbf{(x_i,x_j) \in S } </math> always being assigned to the same cluster (Wagstaff et al. ,2001).<br />
<br />
3. Constrained K-means + metric proposed by (Xing et al., 2002): Constrained K-means using the distance metric proposed in (Xing et al., 2002), which is learned from S.<br />
<br />
4. Constrained K-means + RCA: Constrained K-means using the RCA distance metric learned from S.<br />
<br />
5. EM: Expectation Maximization of a Gaussian Mixture model (using no side-information).<br />
<br />
6. Constrained EM: EM using side-information in the form of equivalence constraints (Hertz et al., 2002; Shental et al., 2003), when using RCA distance metric as an initial metric. <br />
<br />
Following (Xing et al., 2002), a normalized accuracy score is used to evaluate the partitions obtained by the different clustering algorithms listed above. More specifically, in the case of 2-cluster data the accuracy measure can be written as:<br />
<br />
<center><math>\mathbf{\sum_{i>j}\frac{ 1\{1 \{c_i=c_j\}=1\{\hat{c_i}=\hat{c_j}\}\}} {0.5m(m-1)} }</math></center><br />
<br />
where <math>\mathbf{1\{.\}}</math> is the indicator function, <math>\mathbf{\{\hat{c_i}\}_{i=1}^m}</math> is the cluster to which point <math> \mathbf{x_i} </math> is assigned by the clustering algorithm, and <math> \mathbf{c_i} </math> is the "correct" or desired assignment. The above score can be regarded as the probability that the algorithm's assignment <math> \mathbf{\hat{c}} </math> of two randomly drawn points <math> \mathbf{x_i} </math> and <math> \mathbf{x_j} </math> agrees with the "true" assignment <math> \mathbf{c} </math>.<br />
<br />
<center>[[File:UC Irvive data results.JPG]]</center><br />
<br />
Following (Xing et al., 2002), the method was tested under two conditions:<br />
<br />
I) Using "little" side-information <math> \mathbf{S} </math> <br />
II) Using "much" side-information.<br />
<br />
In all experiments K-means was run with multiple restarts. The results of all the algorithms described above are reported for the two conditions of "little" and "much" side-information.<br />
<br />
As can be seen clearly in the results, using RCA as a distance measure significantly improves the results over the original K-means algorithm. Compared to (Xing et al., 2002), RCA achieves similar accuracy; however, the RCA metric is obtained by a single efficient closed-form computation, whereas the method presented in (Xing et al., 2002) requires gradient descent and iterative projections.<br />
<br />
== References ==<br />
<references/></div>Amirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=relevant_Component_Analysis&diff=3783relevant Component Analysis2009-08-02T18:33:03Z<p>Amir: /* Experimental Results: Application to Clustering */</p>
<hr />
<div>
<br />
In addition, RCA minimizes the sum of within-chunklet squared distances. If we consider the optimization problem<br />
<br />
::<math>\min_B \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \quad s.t. \quad |B| \ge 1</math> <br />
<br />
then it can be shown that RCA once again gives the optimal Mahalanobis distance matrix up to a scale factor. This property suggests a natural comparison with Xing ''et al.''’s method, which similarly learns a distance metric based on similarity side information. Xing ''et al.''’s method assumes side information in the form of pairwise similarities and dissimilarities, and seeks to optimize<br />
<br />
::<math>\min_B \sum_{(x_1,x_2) \in S} ||x_1 - x_2||^2_B \quad s.t. \sum_{(x_1,x_2) \in D} ||x_1 - x_2||_B \ge 1 , \quad B \ge 0 </math><br />
<br />
where S contains similar pairs and D contains dissimilar pairs. Comparing to the preceding optimization problem, if all chunklets have size 2 (i.e. the chunklets are just pairwise similarities), the objective function is the same up to a scale factor.<br />
<br />
The authors compared the clustering performance of RCA with Xing ''et al.''’s method <ref> E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information", Advances in Neural Information Processing Systems, 2002. </ref> using six of the UC Irvine datasets. Clustering performance was measured using a normalized accuracy score defined as<br />
<br />
::<math>\sum_{i > j}\frac{1 \lbrace 1 \lbrace c_i = c_j \rbrace = 1 \lbrace \hat{c}_i = \hat{c}_j \rbrace \rbrace}{0.5m(m-1)}</math><br />
<br />
where 1{ } is the indicator function, <math>\hat{c}</math> is the assigned cluster, and c is the true cluster. The score may be interpreted as the probability of correctly assigning two randomly drawn points.<br />
<br />
Overall, RCA yielded an improvement over regular K-means and showed comparable performance to Xing ''et al.''’s method, however RCA is more computationally efficient as it works with closed-form expressions while Xing ''et al.''’s method requires iterative gradient descent.<br />
<br />
== Suggestions/Critique ==<br />
<br />
* RCA makes effective use of limited side information in the form of chunklets, however in most applications the data does not naturally come in chunklets. Indeed, in the face recognition experiments, the authors had to make use of prior information to artificially create chunklets. It may be useful if the authors provided additional examples of applications where data is naturally partitioned into chunklets, to further motivate the applicability of RCA.<br />
<br />
* RCA also assumes equal class covariances, which might limit its performance on many real-world datasets.<br />
<br />
* In the UC Irvine experiments, RCA shows similar performance to Xing ''et al.''’s method, but the authors noted that RCA is more computationally efficient. While they make a sensible logical argument (iterative gradient descent tends to be computationally expensive), providing experimental running times may help support and quantify this claim.<br />
<br />
<br />
====Why Equal Variances for Chanklets ====<br />
<br />
In [2] authors suppose that <math> C_{m} </math> is the random variable which shows distribution of data in class <math> m </math> and then, assuming equality for class variances they calculate <math> S_{ch} </math> as it was mentioned above.<br><br />
<br />
Further, suppose that data in class <math> m </math> are dependent on another source of variation <math> G </math> besides the class characteristics (<math> G </math> can be global variation or sensor characteristics). Now the random variable for <math> m </math>th class is <math> X=C_{m}+G </math>, where global impact (<math> G </math>) is the same for all classes, <math> G </math> is independent of <math> C_{m} </math> and global variation is larger than class variation (<math> \Sigma_{m}<\Sigma_{G} </math>). <br><br />
<br />
In this situation variance for class <math> m </math> will be <math> \Sigma_{m}+\Sigma_{G} </math>, but by assumption it will be dominated by <math> \Sigma_{G} </math>. This result brings us back to the case <math> \Sigma_{m}=\Sigma_{G} </math> for all classes again.<br><br />
<br />
== Kernel RCA==<br />
<br />
Although RCA, computationally and technically, has significant advantages, there are some kind of situations for real problems that RCA fails to deal with them, i.e there are some restrictions along with RCA. <br><br />
<br />
(i)- RCA only considers linear transformations and fails for nonlinear transformations (even for simple ones)<br><br />
(ii)- since RCA acts in the input space, its number of parameters depends on the dimensionality of the feature vectors<br><br />
(iii)- RCA requires the vectorial representation of data, which may not be possible for some kind of data to be naturally in this form; like protein sequences.<br><br />
<br />
To overcome this restrictions Tesang and colleagues (2005)<ref> Tsang, I. W. and Colleagues; Kernel Relevant Component Analysis For Distance Metric Learning. International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 2005 </ref> suggested to use kernel in RCA and showed how one can kernelize RCA.<br />
<br />
===Kernelizing RCA===<br />
For <math>k</math> given chunklets, each containing <math>n_{i}</math> patterns <math>\left\{x_{i,1},...,x_{i,n_{i}} \right\}</math> the covariance matrix of centered patterns is as follow:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\bar{x}_{i}\right)\left(x_{i,j}-\bar{x}_{i}\right)^{'} </math><br />
<br />
and the associated whitening transform is as<br />
<br />
<math>x\stackrel{}{\rightarrow}C^{-\frac{1}{2}}x </math><br />
<br />
Now let <math>\left\{x_{1,1},x_{1,2},...,x_{1,n_{1}},...,x_{k,1},...,x_{k,n_{k}} \right\}</math> then C can be written as:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\frac{1}{n_{i}}X1{i}\right)\left(x_{1,i}-\frac{1}{n_{i}}X1{i}\right)^{'} </math><br />
<br />
where <math>1_{i}</math> is <math>n \times 1</math> vector such that:<br />
<br />
<math> [1_{i}]_{j}= \left\{\begin{matrix} <br />
1 & \text{patern} j \in \text{chunklet} i \\ <br />
0 & \text{otherwise} \end{matrix}\right.</math><br />
<br />
and <math>I_{i}=diag\left(1_{i}\right)</math>.<br />
<br />
using the above notations C can be simplified to the form <math>C=\frac{1}{n}XHX^{'}</math><br />
<br />
where <math> H=\sum_{i=1}^{k}\left(I_{i}-\frac{1}{n_{i}}1_{i}1_{i}^{'}\right)</math><br />
<br />
for the issue of non-singularity, for small <math> \epsilon </math> let <math>\hat{C}=C+\epsilon I</math> then we can find the inverse of <math>\hat{C}</math> which is <br />
<br />
<math>\hat{C}^{-1}=\frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(X^{'}XH \right)^{-1}X^{'}</math><br />
<br />
Therefor the the inner product of transformed <math>x</math> and <math>y</math> is <br />
<br />
<math> \left(\hat{C}^{-\frac{1}{2}}x\right)^{'} \left(\hat{C}^{-\frac{1}{2}}x\right)= x^{'} \hat{C}^{-1} y= x^{'} \left( \frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(X^{'}XH \right)^{-1}X^{'} \right) y </math><br />
<br />
Now if RCA operates in feature <math> \mathcal{F}</math> with corresponding kernel <math> l </math> then the inner product between nonlinear transformations <math> \varphi (x)</math> and <math> \varphi (y)</math> after running RCA in <math> \mathcal{F}</math> is:<br />
<br />
<math> \tilde{l}(x,y)=\frac{1}{\epsilon}l(x,y)-l_{x}^{'} \left( \frac{1}{n \epsilon^{2}}H \left( I+\frac{1}{n \epsilon}LH \right)^{-1} \right) l_{x} </math><br />
<br />
where <math>L=\left[ l(x_{i},x_{j}) \right]_{ij}</math>, <math> l_{x}=\left[ l(x_{1,1},x),...,l(x_{k,n_{k}},x) \right]^{'}</math><br />
and <math> l_{y}=\left[ l(x_{1,1},y),...,l(x_{k,n_{k}},y) \right]^{'}</math><br />
<br />
== Experimental Results: Application to Clustering == <br />
<br />
The main goal of this method is to utilize the side information in the form of equivalence relations to improve the performance of unsupervised learning techniques. To test the proposed the above RCA algorithm and for the sake of comparison of our results by Xing et al. we used six data sets from UC Irvine repository which were used in <ref> E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information", Advances in Neural Information Processing Systems, 2002. </ref>. Similar to what they have in their paper, we are given as set S of pairwise similarity constraints; Having this data set, we performed the following clustering algorithms:<br />
<br />
1. K-means using the default Euclidean metric (i.e. using no side information) .<br />
<br />
2. Constrained K-means: K-means subject to points <math> \mathbf{(x_i,x_j) \in S } </math> always being assigned to the same cluster (Wagstaff et al. ,2001).<br />
<br />
3. Constrained K-means + metric proposed by (Xing et al., 2002): Constrained K-means using the distance metric proposed in (Xing et al., 2002), which is learned from S.<br />
<br />
4. Constrained K-means + RCA: Constrained K-means using the RCA distance metric learned from S.<br />
<br />
5. EM: Expectation Maximization of a Gaussian Mixture model (using no side-information).<br />
<br />
6. Constrained EM: EM using side-information in the form of equivalence constraints (Hertz et al., 2002; Shental et al., 2003), when using RCA distance metric as an initial metric. <br />
<br />
Following (Xing et al., 2002) a normalized accuracy score is used to evaluate the partitions obtained by the different clustering algorithms which we pointed out in the above six methods. More specifically, in the case of 2-cluster data the accuracy measure used can be written as:<br />
<br />
<center><math>\mathbf{\sum_{i>j}\frac{ 1\{1 \{c_i=c_j\}=1\{\hat{c_i}=\hat{c_j}\}\}} {0.5m(m-1)} }</math></center><br />
<br />
where <math>\mathbf{1\{.\}}</math> is the indicator function, <math>\mathbf{\{\hat{c_i}\}_{i=1}^m}</math> is the cluster to which point <math> \mathbf{x_i} </math> is assigned by the clustering algorithm, and <math> \mathbf{c_i} </math> is the "correct" or desired assignment. The above score can be regarded as computing the probability that the algorithm's assignment <math> \mathbf{\hat{c}} </math> of two randomly drawn points <math> \mathbf{x_i} </math> and <math> \mathbf{x_j} </math> agrees with the "true" asignment <math> \mathbf{c} </math>.<br />
<br />
<center>[[File:UC Irvive data results.JPG]]</center><br />
<br />
== References ==<br />
<references/></div>Amirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=relevant_Component_Analysis&diff=3782relevant Component Analysis2009-08-02T18:31:40Z<p>Amir: /* Experimental Results: Application to Clustering */</p>
<hr />
<div>== First paper: Shental ''et al.'', 2002 <ref>N. Shental, T. Hertz, D. Weinshall, and M. Pavel, "Adjustment Learning and Relevant Component Analysis," Proc. European Conference on Computer Vision (ECCV), 2002, pp. 776-790.</ref> ==<br />
<br />
Irrelevant data variability often causes difficulties in classification and clustering tasks. For example, when data variability is dominated by environmental conditions, such as global illumination, nearest-neighbour classification in the original feature space may be very unreliable. The goal of Relevant Component Analysis (RCA) is to find a transformation that amplifies relevant variability and suppresses irrelevant variability.<br />
<br />
:: ''Definition of irrelevant variability:'' We say that data variability is correlated with a specific task "if the removal of this variability from the data deteriorates (on average) the results of clustering or retrieval" [1]. Variability is irrelevant if it is "maintained in the data" but "not correlated with the specific task" [1].<br />
<br />
To achieve this goal, Shental ''et al.'' introduced the idea of ''chunklets'' – "small sets of data points, in which the class label is constant, but unknown" [1]. As we will see, chunklets allow irrelevant variability to be suppressed without needing fully labelled training data. Since the data come unlabelled, the chunklets "must be defined naturally by the data": for example, in speaker identification, "short utterances of speech are likely to come from a single speaker" [1]. The authors coin the term ''adjustment learning'' to describe learning using chunklets; adjustment learning can be viewed as falling somewhere between unsupervised learning and supervised learning.<br />
<br />
Relevant Component Analysis tries to find a linear transformation W of the feature space such that the effect of irrelevant variability is reduced in the transformed space. That is, we wish to rescale the feature space and reduce the weights of irrelevant directions. The main premise of RCA is that we can reduce irrelevant variability by reducing the within-class variability. Intuitively, a direction which exhibits high variability among samples of the same class is unlikely to be useful for classification or clustering. <br />
<br />
RCA assumes that the class covariances are all equal. If we allow this assumption, it makes sense to rescale the feature space using a whitening transformation based on the common class covariance Σ. This gives the familiar transformation W = VΛ<sup>-1/2</sup>, where V and Λ can be found by the singular value decomposition of Σ.<br />
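As a quick numerical illustration of this whitening step (a sketch with made-up data, not the authors' code), the transform W = VΛ<sup>-1/2</sup> maps the data so that its sample covariance becomes the identity:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated 2-D data standing in for one class.
X = rng.standard_normal((500, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])

# Whitening transform W = V Lambda^{-1/2}; for a symmetric positive
# semi-definite Sigma the eigendecomposition coincides with the SVD.
Sigma = np.cov(X, rowvar=False)
Lam, V = np.linalg.eigh(Sigma)            # Sigma = V diag(Lam) V'
W = V @ np.diag(Lam ** -0.5)

Y = X @ W                                 # rescaled data
print(np.round(np.cov(Y, rowvar=False), 6))  # identity matrix
```

Since the rows of <code>X</code> are samples, the covariance of <code>X @ W</code> is W'ΣW = Λ<sup>-1/2</sup>V'ΣVΛ<sup>-1/2</sup> = I.<br />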
<br />
With labelled data, estimating Σ is straightforward; in RCA, however, labelled data are not available, so an approximation is calculated using chunklets. The ''chunklet scatter matrix'' is calculated by<br />
<br />
:: <math>S_{ch} = \frac{1}{|\Omega|}\sum_{n=1}^N|H_n|Cov(H_n)</math><br />
<br />
where |Ω| is the size of the data set, H<sub>n</sub> is the nth chunklet, |H<sub>n</sub>| is the size of the nth chunklet, and N is the number of chunklets.<br />
<br />
Intuitively, this is a weighted average of the chunklet covariances, with weight proportional to the size of the chunklet. Each chunklet's mean serves as an approximation of its class mean regardless of the chunklet's size, but size still matters: the larger the chunklet, the more reliable this approximation of the class mean, so larger chunklets receive more weight.<br />
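In code, the chunklet scatter matrix is a few lines of NumPy (an illustrative sketch; here Cov(H<sub>n</sub>) is taken as the maximum-likelihood covariance, so |H<sub>n</sub>|Cov(H<sub>n</sub>) is just the sum of centered outer products, |Ω| is taken as the total number of chunklet points, and the random chunklets are made up):<br />

```python
import numpy as np

def chunklet_scatter(chunklets):
    """S_ch = (1/|Omega|) * sum_n |H_n| * Cov(H_n) over a list of
    (|H_n|, d) arrays, one array of points per chunklet."""
    p = sum(len(H) for H in chunklets)        # |Omega|: total number of points
    d = chunklets[0].shape[1]
    S = np.zeros((d, d))
    for H in chunklets:
        Z = H - H.mean(axis=0)                # center each chunklet on its mean
        S += Z.T @ Z                          # = |H_n| * Cov_ML(H_n)
    return S / p

rng = np.random.default_rng(1)
chunklets = [rng.standard_normal((n, 3)) for n in (4, 6, 5)]
S_ch = chunklet_scatter(chunklets)
print(S_ch.shape)  # (3, 3)
```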
<br />
The steps of the RCA algorithm are as follows:<br />
<br />
:: "1. Calculate S<sub>ch</sub>... Let r denote its effective rank (the number of singular values of S<sub>ch</sub> which are significantly larger than 0).<br />
:: 2. Compute the total covariance (scatter) matrix of the original data S<sub>T</sub>, and project the data using PCA to its r largest dimensions.<br />
:: 3. Project S<sub>ch</sub> onto the reduced dimensional space, and compute the corresponding whitening transformation W.<br />
:: 4. Apply W to the original data (in the reduced space)." [1]<br /><br />
Directions in which the data variability is due mostly to within-class variability are irrelevant for classification, and the computed W assigns lower weight to these directions.<br />
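Putting the four steps together, RCA can be sketched as follows (a hypothetical helper, not the authors' implementation; the rank threshold <code>tol</code> and the index-array chunklet format are our assumptions):<br />

```python
import numpy as np

def rca(X, chunklets, tol=1e-6):
    """RCA sketch. X: (n, d) data array; chunklets: list of index arrays."""
    # Step 1: chunklet scatter matrix S_ch and its effective rank r.
    n, d = X.shape
    S_ch = np.zeros((d, d))
    p = sum(len(idx) for idx in chunklets)
    for idx in chunklets:
        Z = X[idx] - X[idx].mean(axis=0)
        S_ch += Z.T @ Z
    S_ch /= p
    sv = np.linalg.svd(S_ch, compute_uv=False)
    r = int(np.sum(sv > tol * sv[0]))         # singular values clearly above 0
    # Step 2: total scatter S_T; project the data onto its r largest dimensions.
    Xc = X - X.mean(axis=0)
    lam, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
    A = V[:, ::-1][:, :r]                     # top-r principal directions
    # Step 3: project S_ch into the reduced space and whiten it.
    lam2, V2 = np.linalg.eigh(A.T @ S_ch @ A)
    W = V2 @ np.diag(lam2 ** -0.5)
    # Step 4: apply W to the data in the reduced space.
    return (Xc @ A) @ W

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 5))
chunklets = [np.arange(0, 4), np.arange(4, 9), np.arange(9, 14), np.arange(14, 20)]
Y = rca(X, chunklets)
print(Y.shape)  # (30, 5): S_ch has full rank here, so no dimension is dropped
```

After the transform, the chunklet scatter of the returned data equals the identity, which is exactly the whitening property the algorithm aims for.<br />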
<br />
'''Experimental Results: Face Recognition'''<br />
<br />
The authors demonstrated the performance of RCA for the task of face recognition using the Yale A database. The database contains 155 face images of 15 people; lighting conditions and facial expression are varied across images. RCA is compared with the Eigenface method (based on PCA) and the Fisherface method (based on Fisher’s Linear Discriminant) for both nearest neighbour classification and clustering-based classification. In this dataset, the data is not naturally divided into chunklets, so the authors randomly sample chunklets given the ground-truth class (for example, if an individual is represented in 10 images, two chunklets may be formed by randomly partitioning the images into two groups of 5 images).<br />
<br />
For nearest neighbour classification, RCA outperforms Eigenface but does slightly worse than Fisherface. For clustering, RCA performs better than Eigenface and comparably to Fisherface. The authors pointed out that these experimental results are encouraging as Fisherface is a supervised method.<br />
<br />
In <ref> M. Sorci, G. Antonini, and J.-P. Thiran, "Fisher's discriminant and relevant component analysis for static facial expression classification."</ref>, it is shown that, in a facial expression recognition framework, RCA in combination with FLD yields a better classifier than RCA alone, with results comparable to an SVM.<br />
<br />
'''Experimental Results: Surveillance'''<br />
<br />
In a second experiment, the authors used surveillance video footage divided into discrete clips in which a single person is featured. The same person can appear in multiple clips, and the task was to retrieve all clips in which a query person appears. A colour histogram is used to represent a person. Sources of irrelevant variation include reflections, occlusions, and illumination. In this experiment, the data does come naturally in chunklets: each clip features a single person, so frames in the same clip form a chunklet. Figure 7 in the paper shows the results of k-nearest neighbour classification (not reproduced here for copyright reasons).<br />
<br />
== Second Paper: Bar-Hillel ''et al.'', 2003 <ref> A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall, "Learning Distance Functions using Equivalence Relations," Proc. International Conference on Machine Learning (ICML), 2003, pp. 11-18. </ref> ==<br />
<br />
In a subsequent work [2], Bar-Hillel ''et al.'' described how RCA can be shown to optimize an information theoretic criterion, and compared the performance of RCA with the approach proposed by Xing ''et al.'' [3].<br />
<br />
'''Information Maximization'''<br />
<br />
According to information theory, "when an input X is transformed into a new representation Y, we should seek to maximize the mutual information I(X, Y) between X and Y under suitable constraints" [2]. In adjustment learning, we can think of the objective to be to keep chunklet points close to each other in the transformed space. More formally:<br />
<br />
::<math>\max_{f \in F}I(X,Y) \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||y_{ji} - m_j^y||^2 \le K</math><br />
<br />
where f is a transformation function, m<sub>j</sub><sup>y</sup> is the mean of chunklet j in the transformed space, p is the total number of chunklet points, and K is a constant.<br />
<br />
To maximize I(X,Y), we can simply maximize the entropy of Y, H(Y). This is because I(X,Y) = H(Y) – H(Y|X), and H(Y|X) is constant since the transformation is deterministic. Intuitively, since the transformation is deterministic there is no uncertainty in Y if X is known. <br />
<br />
Now we would like to express H(Y) in terms of H(X). If the transformation is invertible, we have p<sub>y</sub>(y) = p<sub>x</sub>(x) / |J(x)|, where J(x) is the Jacobian of the transformation. Therefore,<br />
<br />
::<math><br />
\begin{align}<br />
H(Y) & = -\int_y p(y)\log p(y)\, dy \\<br />
& = -\int_x p(x) \log \frac{p(x)}{|J(x)|} \, dx \\<br />
& = H(X) + \langle \log |J(x)| \rangle_x<br />
\end{align}<br />
</math><br />
<br />
Assuming a linear transformation Y = AX, the Jacobian determinant is simply the constant |A|. So to maximize I(X,Y), we can maximize H(Y), and maximizing H(Y) amounts to maximizing |A|. Hence, the optimization objective can be updated as<br />
<br />
::<math>\max_A |A| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_{A^tA} \le K</math><br />
<br />
This can also be expressed in terms of the Mahalanobis distance matrix B = A<sup>t</sup>A as follows, noting that log |A| = (1/2) log |B|.<br />
<br />
::<math>\max_B |B| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \le K , \quad B > 0</math><br />
<br />
The solution to this problem is <math>B = \tfrac{K}{N} \hat{C}^{-1}</math>, where <math>\hat{C}</math> is the chunklet scatter matrix calculated in Step 1 of RCA. Thus, RCA gives the optimal Mahalanobis distance matrix up to a scale factor.<br />
<br />
<br />
'''Within-Chunklet Distance Minimization'''<br />
<br />
In addition, RCA minimizes the sum of within-chunklet squared distances. If we consider the optimization problem<br />
<br />
::<math>\min_B \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \quad s.t. \quad |B| \ge 1</math> <br />
<br />
then it can be shown that RCA once again gives the optimal Mahalanobis distance matrix up to a scale factor. This property suggests a natural comparison with Xing ''et al.''’s method, which similarly learns a distance metric based on similarity side information. Xing ''et al.''’s method assumes side information in the form of pairwise similarities and dissimilarities, and seeks to optimize<br />
<br />
::<math>\min_B \sum_{(x_1,x_2) \in S} ||x_1 - x_2||^2_B \quad s.t. \sum_{(x_1,x_2) \in D} ||x_1 - x_2||_B \ge 1 , \quad B \ge 0 </math><br />
<br />
where S contains similar pairs and D contains dissimilar pairs. Compared with the preceding optimization problem, if all chunklets have size 2 (i.e. the chunklets are just pairwise similarities), the objective function is the same up to a scale factor.<br />
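This equivalence is easy to check numerically (a made-up example; B is an arbitrary positive definite matrix): for size-2 chunklets, the within-chunklet objective is exactly 1/(2p) times Xing ''et al.''’s similarity objective, since for a pair with midpoint m, ||x<sub>1</sub> - m||²<sub>B</sub> + ||x<sub>2</sub> - m||²<sub>B</sub> = (1/2)||x<sub>1</sub> - x<sub>2</sub>||²<sub>B</sub>.<br />

```python
import numpy as np

rng = np.random.default_rng(3)
pairs = [rng.standard_normal((2, 3)) for _ in range(5)]  # five size-2 chunklets
B = np.eye(3) + 0.1 * np.ones((3, 3))                    # some Mahalanobis matrix
p = 2 * len(pairs)                                       # total chunklet points

def maha(v, B):
    return float(v @ B @ v)

# Within-chunklet objective: (1/p) * sum_j sum_i ||x_ji - m_j||^2_B.
lhs = sum(maha(x - H.mean(axis=0), B) for H in pairs for x in H) / p

# Xing et al.'s similarity objective over the same pairs.
rhs = sum(maha(H[0] - H[1], B) for H in pairs)

print(lhs / rhs)  # approx 0.05 = 1/(2p), independent of the data and of B
```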
<br />
The authors compared the clustering performance of RCA with Xing ''et al.''’s method <ref> E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information", Advances in Neural Information Processing Systems, 2002. </ref> using six of the UC Irvine datasets. Clustering performance was measured using a normalized accuracy score defined as<br />
<br />
::<math>\sum_{i > j}\frac{1 \lbrace 1 \lbrace c_i = c_j \rbrace = 1 \lbrace \hat{c}_i = \hat{c}_j \rbrace \rbrace}{0.5m(m-1)}</math><br />
<br />
where 1{ } is the indicator function, <math>\hat{c}</math> is the assigned cluster, and c is the true cluster. The score may be interpreted as the probability of correctly assigning two randomly drawn points.<br />
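The score is straightforward to implement (a sketch; the function name is ours): for every unordered pair of points, check whether the predicted partition agrees with the true one about "same cluster vs. different cluster", and average.<br />

```python
import numpy as np

def normalized_accuracy(c_true, c_pred):
    """Fraction of point pairs (i > j) on which the predicted and true
    partitions agree about 'same cluster vs. different cluster'."""
    c_true = np.asarray(c_true)
    c_pred = np.asarray(c_pred)
    m = len(c_true)
    same_true = c_true[:, None] == c_true[None, :]
    same_pred = c_pred[:, None] == c_pred[None, :]
    iu = np.triu_indices(m, k=1)          # the 0.5*m*(m-1) unordered pairs
    return float(np.mean(same_true[iu] == same_pred[iu]))

# Label permutations do not matter -- only the induced partition does.
print(normalized_accuracy([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```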
<br />
Overall, RCA yielded an improvement over regular K-means and showed comparable performance to Xing ''et al.''’s method; however, RCA is more computationally efficient, as it works with closed-form expressions while Xing ''et al.''’s method requires iterative gradient descent.<br />
<br />
== Suggestions/Critique ==<br />
<br />
* RCA makes effective use of limited side information in the form of chunklets, however in most applications the data does not naturally come in chunklets. Indeed, in the face recognition experiments, the authors had to make use of prior information to artificially create chunklets. It may be useful if the authors provided additional examples of applications where data is naturally partitioned into chunklets, to further motivate the applicability of RCA.<br />
<br />
* RCA also assumes equal class covariances, which might limit its performance on many real-world datasets.<br />
<br />
* In the UC Irvine experiments, RCA shows similar performance to Xing ''et al.''’s method, but the authors noted that RCA is more computationally efficient. While they make a sensible logical argument (iterative gradient descent tends to be computationally expensive), providing experimental running times may help support and quantify this claim.<br />
<br />
<br />
====Why Equal Variances for Chunklets ====<br />
<br />
In [2], the authors let <math> C_{m} </math> be the random variable describing the distribution of data in class <math> m </math>; then, assuming equal class covariances, they calculate <math> S_{ch} </math> as described above.<br><br />
<br />
Further, suppose that the data in class <math> m </math> depend on another source of variation <math> G </math> besides the class characteristics (<math> G </math> may be global variation or sensor characteristics). The random variable for the <math> m </math>th class is then <math> X=C_{m}+G </math>, where the global component <math> G </math> is the same for all classes, <math> G </math> is independent of <math> C_{m} </math>, and the global variation is larger than the class variation (<math> \Sigma_{m}<\Sigma_{G} </math>). <br><br />
<br />
In this situation the covariance of class <math> m </math> is <math> \Sigma_{m}+\Sigma_{G} </math>, which by assumption is dominated by <math> \Sigma_{G} </math>. The class covariances are therefore all approximately equal to <math> \Sigma_{G} </math>, which justifies the equal-covariance assumption.<br><br />
<br />
== Kernel RCA==<br />
<br />
Although RCA has significant computational and technical advantages, there are situations arising in real problems that it cannot handle; that is, RCA comes with some restrictions. <br><br />
<br />
(i) RCA considers only linear transformations, and fails for nonlinear ones (even simple ones);<br><br />
(ii) since RCA acts in the input space, its number of parameters depends on the dimensionality of the feature vectors;<br><br />
(iii) RCA requires a vectorial representation of the data, which some kinds of data, such as protein sequences, do not naturally admit.<br><br />
<br />
To overcome these restrictions, Tsang and colleagues (2005)<ref> Tsang, I. W. and colleagues; Kernel Relevant Component Analysis for Distance Metric Learning. International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 2005 </ref> suggested introducing kernels into RCA and showed how RCA can be kernelized.<br />
<br />
===Kernelizing RCA===<br />
For <math>k</math> given chunklets, each containing <math>n_{i}</math> patterns <math>\left\{x_{i,1},...,x_{i,n_{i}} \right\}</math>, the covariance matrix of the centered patterns is as follows:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\bar{x}_{i}\right)\left(x_{i,j}-\bar{x}_{i}\right)^{'} </math><br />
<br />
and the associated whitening transform is as<br />
<br />
<math>x\stackrel{}{\rightarrow}C^{-\frac{1}{2}}x </math><br />
<br />
Now let <math>X=\left[x_{1,1},x_{1,2},...,x_{1,n_{1}},...,x_{k,1},...,x_{k,n_{k}} \right]</math> be the matrix whose columns are all <math>n</math> patterns; then <math>C</math> can be written as:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\frac{1}{n_{i}}X1_{i}\right)\left(x_{i,j}-\frac{1}{n_{i}}X1_{i}\right)^{'} </math><br />
<br />
where <math>1_{i}</math> is the <math>n \times 1</math> indicator vector such that:<br />
<br />
<math> [1_{i}]_{j}= \left\{\begin{matrix} 1 & \text{pattern } j \in \text{chunklet } i \\ 0 & \text{otherwise} \end{matrix}\right.</math><br />
<br />
and <math>I_{i}=diag\left(1_{i}\right)</math>.<br />
<br />
Using the above notation, <math>C</math> can be simplified to the form <math>C=\frac{1}{n}XHX^{'}</math><br />
<br />
where <math> H=\sum_{i=1}^{k}\left(I_{i}-\frac{1}{n_{i}}1_{i}1_{i}^{'}\right)</math><br />
<br />
To avoid singularity, for small <math> \epsilon </math> let <math>\hat{C}=C+\epsilon I</math>; the inverse of <math>\hat{C}</math> is then<br />
<br />
<math>\hat{C}^{-1}=\frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}X^{'}</math><br />
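This closed form follows from a push-through (Woodbury-type) matrix identity, <math>(\epsilon I + MN)^{-1} = \frac{1}{\epsilon}I - \frac{1}{\epsilon}M(\epsilon I + NM)^{-1}N</math> with <math>M=\frac{1}{n}XH</math>, <math>N=X^{'}</math>, and can be verified numerically on made-up data (a sketch; sizes and <math>\epsilon</math> are arbitrary):<br />

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, eps = 12, 4, 0.1
X = rng.standard_normal((d, n))               # columns are the n patterns

# Build H = sum_i (I_i - (1/n_i) 1_i 1_i') for three chunklets of size 4.
H = np.zeros((n, n))
for idx in (range(0, 4), range(4, 8), range(8, 12)):
    one = np.zeros(n)
    one[list(idx)] = 1.0
    H += np.diag(one) - np.outer(one, one) / one.sum()

C_hat = X @ H @ X.T / n + eps * np.eye(d)     # C + eps*I

# Closed-form inverse: (1/eps)I - (1/(n eps^2)) X H (I + (1/(n eps)) X'X H)^{-1} X'.
inv = (np.eye(d) / eps
       - X @ H @ np.linalg.inv(np.eye(n) + X.T @ X @ H / (n * eps)) @ X.T
       / (n * eps ** 2))

print(np.allclose(inv @ C_hat, np.eye(d)))    # True
```

The inverse on the right-hand side only involves an <math>n \times n</math> system built from inner products of patterns, which is what makes the kernelization below possible.<br />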
<br />
Therefore, the inner product between the transformed <math>x</math> and <math>y</math> is<br />
<br />
<math> \left(\hat{C}^{-\frac{1}{2}}x\right)^{'} \left(\hat{C}^{-\frac{1}{2}}y\right)= x^{'} \hat{C}^{-1} y= x^{'} \left( \frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}X^{'} \right) y </math><br />
<br />
Now if RCA operates in a feature space <math> \mathcal{F}</math> with corresponding kernel <math> l </math>, then the inner product between the nonlinear transformations <math> \varphi (x)</math> and <math> \varphi (y)</math> after running RCA in <math> \mathcal{F}</math> is:<br />
<br />
<math> \tilde{l}(x,y)=\frac{1}{\epsilon}l(x,y)-l_{x}^{'} \left( \frac{1}{n \epsilon^{2}}H \left( I+\frac{1}{n \epsilon}LH \right)^{-1} \right) l_{y} </math><br />
<br />
where <math>L=\left[ l(x_{i},x_{j}) \right]_{ij}</math>, <math> l_{x}=\left[ l(x_{1,1},x),...,l(x_{k,n_{k}},x) \right]^{'}</math><br />
and <math> l_{y}=\left[ l(x_{1,1},y),...,l(x_{k,n_{k}},y) \right]^{'}</math><br />
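As a sanity check (a sketch with made-up data): with the linear kernel l(x,y) = x'y, so that L = X'X, the kernelized inner product must agree with the input-space quantity <math>x^{'}\hat{C}^{-1}y</math>:<br />

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, eps = 10, 3, 0.2
X = rng.standard_normal((d, n))               # columns are the n patterns

# H for two chunklets of size 5 each.
H = np.zeros((n, n))
for idx in (range(0, 5), range(5, 10)):
    one = np.zeros(n)
    one[list(idx)] = 1.0
    H += np.diag(one) - np.outer(one, one) / one.sum()

L = X.T @ X                                   # Gram matrix of the linear kernel
x, y = rng.standard_normal(d), rng.standard_normal(d)
lx, ly = X.T @ x, X.T @ y                     # l_x and l_y

# Kernelized RCA inner product: (1/eps) l(x,y) - l_x' [(1/(n eps^2)) H (I + (1/(n eps)) L H)^{-1}] l_y.
ktilde = (x @ y / eps
          - lx @ (H @ np.linalg.inv(np.eye(n) + L @ H / (n * eps))
                  / (n * eps ** 2)) @ ly)

C_hat = X @ H @ X.T / n + eps * np.eye(d)
print(np.isclose(ktilde, x @ np.linalg.inv(C_hat) @ y))  # True
```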
<br />
== Experimental Results: Application to Clustering == <br />
<br />
The main goal of this method is to utilize side information, in the form of equivalence relations, to improve the performance of unsupervised learning techniques. To test the RCA algorithm described above, and to compare the results with those of Xing et al., six data sets from the UC Irvine repository were used, the same data sets used in <ref> E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information", Advances in Neural Information Processing Systems, 2002. </ref>. As in that paper, a set S of pairwise similarity constraints is given; with this data set, the following clustering algorithms were performed:<br />
<br />
1. K-means using the default Euclidean metric (i.e. using no side information).<br />
<br />
2. Constrained K-means: K-means subject to points <math> \mathbf{(x_i,x_j) \in S } </math> always being assigned to the same cluster (Wagstaff et al., 2001).<br />
<br />
3. Constrained K-means + metric proposed by (Xing et al., 2002): Constrained K-means using the distance metric proposed in (Xing et al., 2002), which is learned from S.<br />
<br />
4. Constrained K-means + RCA: Constrained K-means using the RCA distance metric learned from S.<br />
<br />
5. EM: Expectation Maximization of a Gaussian Mixture model (using no side-information).<br />
<br />
6. Constrained EM: EM using side-information in the form of equivalence constraints (Hertz et al., 2002; Shental et al., 2003), using the RCA distance metric as the initial metric. <br />
<br />
Following Xing et al. (2002), a normalized accuracy score is used to evaluate the partitions obtained by the six clustering algorithms listed above. More specifically, in the case of 2-cluster data, the accuracy measure can be written as:<br />
<br />
<center><math>\mathbf{\sum_{i>j}\frac{ 1\{1\{c_i=c_j\}=1\{\hat{c}_i=\hat{c}_j\}\} }{0.5m(m-1)} }</math></center><br />
<br />
where <math>\mathbf{1\{.\}}</math> is the indicator function, <math>\mathbf{\{\hat{c}_i\}_{i=1}^m}</math> is the cluster to which point <math> \mathbf{x_i} </math> is assigned by the clustering algorithm, and <math> \mathbf{c_i} </math> is the "correct" or desired assignment. The above score can be regarded as the probability that the algorithm's assignment <math> \mathbf{\hat{c}} </math> of two randomly drawn points <math> \mathbf{x_i} </math> and <math> \mathbf{x_j} </math> agrees with the "true" assignment <math> \mathbf{c} </math>.<br />
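The score above is straightforward to compute directly from the pair counts; a minimal sketch (function name and example labelings are illustrative):<br />

```python
import numpy as np

def normalized_accuracy(c_true, c_pred):
    """Normalized accuracy score of (Xing et al., 2002) for 2-cluster data:
    the fraction of point pairs (i > j) on which the predicted partition
    agrees with the true one about whether the two points share a cluster."""
    c_true = np.asarray(c_true)
    c_pred = np.asarray(c_pred)
    m = len(c_true)
    agree = 0
    for i in range(m):
        for j in range(i):
            same_true = c_true[i] == c_true[j]
            same_pred = c_pred[i] == c_pred[j]
            agree += int(same_true == same_pred)
    return agree / (0.5 * m * (m - 1))

# A relabeling of the clusters leaves the score unchanged, as it should:
assert normalized_accuracy([0, 0, 1, 1], [1, 1, 0, 0]) == 1.0
```

Because the score only compares co-membership of pairs, it is invariant to permuting cluster labels, which is why it suits unsupervised evaluation.<br />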
<br />
[[File:UC Irvive data results.JPG]]<br />
<br />
== References ==<br />
<references/></div>
<br />
In this situation variance for class <math> m </math> will be <math> \Sigma_{m}+\Sigma_{G} </math>, but by assumption it will be dominated by <math> \Sigma_{G} </math>. This result brings us back to the case <math> \Sigma_{m}=\Sigma_{G} </math> for all classes again.<br><br />
<br />
== Kernel RCA==<br />
<br />
Although RCA, computationally and technically, has significant advantages, there are some kind of situations for real problems that RCA fails to deal with them, i.e there are some restrictions along with RCA. <br><br />
<br />
(i)- RCA only considers linear transformations and fails for nonlinear transformations (even for simple ones)<br><br />
(ii)- since RCA acts in the input space, its number of parameters depends on the dimensionality of the feature vectors<br><br />
(iii)- RCA requires the vectorial representation of data, which may not be possible for some kind of data to be naturally in this form; like protein sequences.<br><br />
<br />
To overcome this restrictions Tesang and colleagues (2005)<ref> Tsang, I. W. and Colleagues; Kernel Relevant Component Analysis For Distance Metric Learning. International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 2005 </ref> suggested to use kernel in RCA and showed how one can kernelize RCA.<br />
<br />
===Kernelizing RCA===<br />
For <math>k</math> given chunklets, each containing <math>n_{i}</math> patterns <math>\left\{x_{i,1},...,x_{i,n_{i}} \right\}</math> the covariance matrix of centered patterns is as follow:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\bar{x}_{i}\right)\left(x_{i,j}-\bar{x}_{i}\right)^{'} </math><br />
<br />
and the associated whitening transform is as<br />
<br />
<math>x\stackrel{}{\rightarrow}C^{-\frac{1}{2}}x </math><br />
<br />
Now let <math>\left\{x_{1,1},x_{1,2},...,x_{1,n_{1}},...,x_{k,1},...,x_{k,n_{k}} \right\}</math> then C can be written as:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\frac{1}{n_{i}}X1{i}\right)\left(x_{1,i}-\frac{1}{n_{i}}X1{i}\right)^{'} </math><br />
<br />
where <math>1_{i}</math> is <math>n \times 1</math> vector such that:<br />
<br />
<math> [1_{i}]_{j}= \left\{\begin{matrix} <br />
1 & \text{patern} j \in \text{chunklet} i \\ <br />
0 & \text{otherwise} \end{matrix}\right.</math><br />
<br />
and <math>I_{i}=diag\left(1_{i}\right)</math>.<br />
<br />
using the above notations C can be simplified to the form <math>C=\frac{1}{n}XHX^{'}</math><br />
<br />
where <math> H=\sum_{i=1}^{k}\left(I_{i}-\frac{1}{n_{i}}1_{i}1_{i}^{'}\right)</math><br />
<br />
for the issue of non-singularity, for small <math> \epsilon </math> let <math>\hat{C}=C+\epsilon I</math> then we can find the inverse of <math>\hat{C}</math> which is <br />
<br />
<math>\hat{C}^{-1}=\frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(X^{'}XH \right)^{-1}X^{'}</math><br />
<br />
Therefor the the inner product of transformed <math>x</math> and <math>y</math> is <br />
<br />
<math> \left(\hat{C}^{-\frac{1}{2}}x\right)^{'} \left(\hat{C}^{-\frac{1}{2}}x\right)= x^{'} \hat{C}^{-1} y= x^{'} \left( \frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(X^{'}XH \right)^{-1}X^{'} \right) y </math><br />
<br />
Now if RCA operates in feature <math> \mathcal{F}</math> with corresponding kernel <math> l </math> then the inner product between nonlinear transformations <math> \varphi (x)</math> and <math> \varphi (y)</math> after running RCA in <math> \mathcal{F}</math> is:<br />
<br />
<math> \tilde{l}(x,y)=\frac{1}{\epsilon}l(x,y)-l_{x}^{'} \left( \frac{1}{n \epsilon^{2}}H \left( I+\frac{1}{n \epsilon}LH \right)^{-1} \right) l_{x} </math><br />
<br />
where <math>L=\left[ l(x_{i},x_{j}) \right]_{ij}</math>, <math> l_{x}=\left[ l(x_{1,1},x),...,l(x_{k,n_{k}},x) \right]^{'}</math><br />
and <math> l_{y}=\left[ l(x_{1,1},y),...,l(x_{k,n_{k}},y) \right]^{'}</math><br />
<br />
== Experimental Results: Application to Clustering == <br />
<br />
The main goal of this method is to utilize the side information in the form of equivalence relations to improve the performance of unsupervised learning techniques. To test the proposed the above RCA algorithm and for the sake of comparison of our results by Xing et al. we used six data sets from UC Irvine repository which were used in <ref> E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information", Advances in Neural Information Processing Systems, 2002. </ref>. Similar to what they have in their paper, we are given as set S of pairwise similarity constraints; Having this data set, we performed the following clustering algorithms:<br />
<br />
1. K-means using the default Euclidean metric (i.e. using no side information) .<br />
<br />
2. Constrained K-means: K-means subject to points <math> \mathbf{(x_i,x_j) \in S } </math> always being assigned to the same cluster (Wagstaff et al. ,2001).<br />
<br />
3. Constrained K-means + metric proposed by (Xing et al., 2002): Constrained K-means using the distance metric proposed in (Xing et al., 2002), which is learned from S.<br />
<br />
4. Constrained K-means + RCA: Constrained K-means using the RCA distance metric learned from S.<br />
<br />
5. EM: Expectation Maximization of a Gaussian Mixture model (using no side-information).<br />
<br />
6. Constrained EM: EM using side-information in the form of equivalence constraints (Hertz et al., 2002; Shental et al., 2003), when using RCA distance metric as an initial metric. <br />
<br />
Following (Xing et al., 2002) a normalized accuracy score is used to evaluate the partitions obtained by the different clustering algorithms which we pointed out in the above six methods. More specifically, in the case of 2-cluster data the accuracy measure used can be written as:<br />
<br />
<center><math>\mathbf{\sum_{i>j}\frac{ 1\{1 \{c_i=c_j\}=1\{\hat{c_i}=\hat{c_j}\}\}} {0.5m(m-1)} }</math></center><br />
<br />
where <math>\mathbf{1\{.\}}</math> is the indicator function, <math>\mathbf{\{\hat{c_i}\}_{i=1}^m}</math> is the cluster to which point <math> \mathbf{x_i} </math> is assigned by the clustering algorithm, and <math> \mathbf{c_i} </math> is the "correct" or desired assignment. The above score can be regarded as computing the probability that the algorithm's assignment <math> \mathbf{\hat{c}}<\math> of two randomly drawn points <math> \mathbf{x_i} </math> and <math> \mathbf{x_j} </math> agrees with the "true" asignment <math> \mathbf{c} </math>.<br />
<br />
== References ==<br />
<references/></div>Amirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=relevant_Component_Analysis&diff=3778relevant Component Analysis2009-08-02T18:26:23Z<p>Amir: /* Experimental Results: Application to Clustering */</p>
<hr />
<div>== First paper: Shental ''et al.'', 2002 <ref>N. Shental, T. Hertz, D. Weinshall, and M. Pavel, "Adjustment Learning and Relevant Component Analysis," Proc. European Conference on Computer Vision (ECCV), 2002, pp. 776-790.</ref> ==<br />
<br />
Irrelevant data variability often causes difficulties in classification and clustering tasks. For example, when data variability is dominated by environmental conditions, such as global illumination, nearest-neighbour classification in the original feature space may be very unreliable. The goal of Relevant Component Analysis (RCA) is to find a transformation that amplifies relevant variability and suppresses irrelevant variability.<br />
<br />
:: ''Definition of irrelevant variability:'' We say that data variability is correlated with a specific task "if the removal of this variability from the data deteriorates (on average) the results of clustering or retrieval" [1]. Variability is irrelevant if it is "maintained in the data" but "not correlated with the specific task" [1].<br />
<br />
To achieve this goal, Shental ''et al.'' introduced the idea of ''chunklets'' – "small sets of data points, in which the class label is constant, but unknown" [1]. As we will see, chunklets allow irrelevant variability to be suppressed without needing fully labelled training data. Since the data come unlabelled, the chunklets "must be defined naturally by the data": for example, in speaker identification, "short utterances of speech are likely to come from a single speaker" [1]. The authors coin the term ''adjustment learning'' to describe learning using chunklets; adjustment learning can be viewed as falling somewhere between unsupervised learning and supervised learning.<br />
<br />
Relevant Component Analysis tries to find a linear transformation W of the feature space such that the effect of irrelevant variability is reduced in the transformed space. That is, we wish to rescale the feature space and reduce the weights of irrelevant directions. The main premise of RCA is that we can reduce irrelevant variability by reducing the within-class variability. Intuitively, a direction which exhibits high variability among samples of the same class is unlikely to be useful for classification or clustering. <br />
<br />
RCA assumes that the class covariances are all equal. If we allow this assumption, it makes sense to rescale the feature space using a whitening transformation based on the common class covariance Σ. This gives the familiar transformation W = VΛ<sup>-1/2</sup>, where V and Λ can be found by the singular value decomposition of Σ.<br />
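As a sketch, the whitening transformation can be computed from the eigendecomposition of an estimated covariance; here hypothetical random data stands in for samples whose covariance plays the role of Σ:<br />

```python
import numpy as np

# Whitening sketch: random data stands in for the common class covariance Sigma.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.0, 0.3, 0.5]])
Sigma = np.cov(X, rowvar=False)

# W = V Lambda^{-1/2}, with V, Lambda from the eigendecomposition of Sigma.
lam, V = np.linalg.eigh(Sigma)
W = V @ np.diag(lam ** -0.5)

# In the transformed space the covariance becomes the identity.
Y = X @ W
print(np.allclose(np.cov(Y, rowvar=False), np.eye(3), atol=1e-8))  # True
```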
<br />
With labelled data, estimating Σ is straightforward; in RCA, however, labels are not available, so Σ is approximated using chunklets. The ''chunklet scatter matrix'' is calculated by<br />
<br />
:: <math>S_{ch} = \frac{1}{|\Omega|}\sum_{n=1}^N|H_n|Cov(H_n)</math><br />
<br />
where |Ω| is the size of the data set, H<sub>n</sub> is the nth chunklet, |H<sub>n</sub>| is the size of the nth chunklet, and N is the number of chunklets.<br />
<br />
Intuitively, this is a weighted average of the chunklet covariances, with weight proportional to the size of the chunklet. Each chunklet is expected to give a reasonable approximation of its class mean regardless of its size, but size still matters: larger chunklets estimate the class mean more reliably, which justifies the size-proportional weighting.<br />
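A minimal numpy sketch of this computation (the chunklets here are hypothetical random data, and the maximum-likelihood covariance estimator is assumed):<br />

```python
import numpy as np

# Chunklet scatter matrix S_ch (hypothetical random chunklets for illustration).
rng = np.random.default_rng(1)
chunklets = [rng.normal(size=(5, 3)), rng.normal(size=(4, 3)), rng.normal(size=(6, 3))]

def chunklet_scatter(chunklets):
    # Note: here the normalizer counts only chunklet points; the paper
    # normalizes by the full dataset size |Omega|.
    n_total = sum(len(H) for H in chunklets)
    d = chunklets[0].shape[1]
    S = np.zeros((d, d))
    for H in chunklets:
        centered = H - H.mean(axis=0)      # subtract the chunklet mean
        S += centered.T @ centered         # = |H_n| * Cov(H_n) with the MLE covariance
    return S / n_total

S_ch = chunklet_scatter(chunklets)
print(S_ch.shape)  # (3, 3)
```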
<br />
The steps of the RCA algorithm are as follows:<br />
<br />
:: "1. Calculate S<sub>ch</sub>... Let r denote its effective rank (the number of singular values of S<sub>ch</sub> which are significantly larger than 0).<br />
:: 2. Compute the total covariance (scatter) matrix of the original data S<sub>T</sub>, and project the data using PCA to its r largest dimensions.<br />
:: 3. Project S<sub>ch</sub> onto the reduced dimensional space, and compute the corresponding whitening transformation W.<br />
:: 4. Apply W to the original data (in the reduced space)." [1]<br /><br />
Those directions in which the data variability is mostly within-class variability are irrelevant for classification, and the computed W assigns lower weight to these directions.<br />
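The four steps can be sketched as follows (hypothetical data and chunklet index sets; the rank threshold is an illustrative assumption):<br />

```python
import numpy as np

# A minimal sketch of the four RCA steps on hypothetical data.
def rca(X, chunklet_index_sets, tol=1e-10):
    # Step 1: chunklet scatter matrix S_ch and its effective rank r.
    n, d = X.shape
    S_ch = np.zeros((d, d))
    for idx in chunklet_index_sets:
        H = X[idx] - X[idx].mean(axis=0)
        S_ch += H.T @ H
    S_ch /= n
    r = np.sum(np.linalg.svd(S_ch, compute_uv=False) > tol)

    # Step 2: PCA of the total covariance S_T, keeping the r largest directions.
    Xc = X - X.mean(axis=0)
    S_T = Xc.T @ Xc / n
    _, _, Vt = np.linalg.svd(S_T)
    P = Vt[:r].T                      # d x r projection
    X_red = Xc @ P

    # Step 3: project S_ch into the reduced space and whiten it.
    S_red = P.T @ S_ch @ P
    lam, V = np.linalg.eigh(S_red)
    W = V @ np.diag(lam ** -0.5)

    # Step 4: apply W to the projected data.
    return X_red @ W

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
chunks = [np.arange(0, 5), np.arange(5, 10), np.arange(10, 15)]
Y = rca(X, chunks)       # in Y-space the chunklet scatter is the identity
```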
<br />
'''Experimental Results: Face Recognition'''<br />
<br />
The authors demonstrated the performance of RCA for the task of face recognition using the yaleA database. The database contains 155 face images of 15 people; lighting conditions and facial expression are varied across images. RCA is compared with the Eigenface method (based on PCA) and the Fisherface method (based on Fisher’s Linear Discriminant) for both nearest neighbour classification and clustering-based classification. In this dataset, the data is not naturally divided into chunklets, so the authors randomly sample chunklets given the ground-truth class (for example, if an individual is represented in 10 images, two chunklets may be formed by randomly partitioning the images into two groups of 5 images).<br />
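The random partitioning into chunklets can be sketched as follows (the function name, chunklet size, and seed are illustrative assumptions):<br />

```python
import random

# Sketch of the chunklet sampling used in the experiments: for each class,
# randomly partition its sample indices into chunklets of a chosen size.
def sample_chunklets(indices_by_class, chunklet_size, seed=0):
    rng = random.Random(seed)
    chunklets = []
    for cls, idx in indices_by_class.items():
        idx = list(idx)
        rng.shuffle(idx)
        for start in range(0, len(idx) - chunklet_size + 1, chunklet_size):
            chunklets.append(idx[start:start + chunklet_size])
    return chunklets

# e.g. 10 images of one person -> two chunklets of 5 images, as in the text
sampled = sample_chunklets({"person_a": range(10)}, chunklet_size=5)
print([len(c) for c in sampled])  # [5, 5]
```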
<br />
For nearest neighbour classification, RCA outperforms Eigenface but does slightly worse than Fisherface. For clustering, RCA performs better than Eigenface and comparably to Fisherface. The authors pointed out that these experimental results are encouraging as Fisherface is a supervised method.<br />
<br />
In <ref> M. Sorci,G. Antonini, and Jean-Philippe Thiran, "Fisher's discriminant and relevant component analysis for static facial expression classification."</ref>, it is shown that, in a facial expression recognition framework, RCA combined with Fisher's Linear Discriminant (FLD) yields a better classifier than RCA alone, with results comparable to an SVM.<br />
<br />
'''Experimental Results: Surveillance'''<br />
<br />
In a second experiment, the authors used surveillance video footage divided into discrete clips in which a single person is featured. The same person can appear in multiple clips, and the task was to retrieve all clips in which a query person appears. A colour histogram is used to represent a person. Sources of irrelevant variation include reflections, occlusions, and illumination. In this experiment, the data does come naturally in chunklets: each clip features a single person, so frames in the same clip form a chunklet. Figure 7 in the paper shows the results of k-nearest neighbour classification (not reproduced here for copyright reasons).<br />
<br />
== Second Paper: Bar-Hillel ''et al.'', 2003 <ref> A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall, "Learning Distance Functions using Equivalence Relations," Proc. International Conference on Machine Learning (ICML), 2003, pp. 11-18. </ref> ==<br />
<br />
In a subsequent work [2], Bar-Hillel ''et al.'' described how RCA can be shown to optimize an information theoretic criterion, and compared the performance of RCA with the approach proposed by Xing ''et al.'' [3].<br />
<br />
'''Information Maximization'''<br />
<br />
According to information theory, "when an input X is transformed into a new representation Y, we should seek to maximize the mutual information I(X, Y) between X and Y under suitable constraints" [2]. In adjustment learning, we can think of the objective to be to keep chunklet points close to each other in the transformed space. More formally:<br />
<br />
::<math>\max_{f \in F}I(X,Y) \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||y_{ji} - m_j^y||^2 \le K</math><br />
<br />
where f is a transformation function, m<sub>j</sub><sup>y</sup> is the mean of chunklet j in the transformed space, p is the total number of chunklet points, and K is a constant.<br />
<br />
To maximize I(X,Y), we can simply maximize the entropy of Y, H(Y). This is because I(X,Y) = H(Y) – H(Y|X), and H(Y|X) is constant since the transformation is deterministic. Intuitively, since the transformation is deterministic there is no uncertainty in Y if X is known. <br />
<br />
Now we would like to express H(Y) in terms of H(X). If the transformation is invertible, we have p<sub>y</sub>(y) = p<sub>x</sub>(x) / |J(x)|, where J(x) is the Jacobian of the transformation. Therefore,<br />
<br />
::<math><br />
\begin{align}<br />
H(Y) & = -\int_y p(y)\log p(y)\, dy \\<br />
& = -\int_x p(x) \log \frac{p(x)}{|J(x)|} \, dx \\<br />
& = H(X) + \langle \log |J(x)| \rangle_x<br />
\end{align}<br />
</math><br />
<br />
Assuming a linear transformation Y = AX, the Jacobian determinant is simply the constant |A|. So to maximize I(X,Y), we can maximize H(Y), and maximizing H(Y) amounts to maximizing |A|. Hence, the optimization objective can be updated as<br />
<br />
::<math>\max_A |A| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_{A^tA} \le K</math><br />
<br />
This can also be expressed in terms of the Mahalanobis distance matrix B = A<sup>t</sup>A as follows, noting that log |A| = (1/2) log |B|.<br />
<br />
::<math>\max_B |B| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \le K , \quad B > 0</math><br />
<br />
The solution to this problem is <math>B = \tfrac{K}{N} \hat{C}^{-1}</math>, where <math>\hat{C}</math> is the chunklet scatter matrix calculated in Step 1 of RCA. Thus, RCA gives the optimal Mahalanobis distance matrix up to a scale factor.<br />
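A sketch of why this is the solution: the constraint can be rewritten as a trace, after which a standard Lagrangian argument applies (here N is taken to be the dimension of the data, which is what fixes the scale factor):<br />

```latex
% The constraint (1/p) \sum_j \sum_i ||x_{ji} - m_j||_B^2 \le K
% can be written as tr(B \hat{C}) \le K. Maximize log|B| with multiplier \lambda:
\mathcal{L}(B,\lambda) = \log|B| - \lambda\left(\operatorname{tr}(B\hat{C}) - K\right)
% Using d(\log|B|)/dB = B^{-1} and d(\operatorname{tr}(B\hat{C}))/dB = \hat{C}:
\frac{\partial \mathcal{L}}{\partial B} = B^{-1} - \lambda\hat{C} = 0
\quad\Rightarrow\quad B = \tfrac{1}{\lambda}\,\hat{C}^{-1}
% At the optimum the constraint is tight:
\operatorname{tr}(B\hat{C}) = \tfrac{N}{\lambda} = K
\quad\Rightarrow\quad B = \tfrac{K}{N}\,\hat{C}^{-1}
```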
<br />
<br />
'''Within-Chunklet Distance Minimization'''<br />
<br />
In addition, RCA minimizes the sum of within-chunklet squared distances. If we consider the optimization problem<br />
<br />
::<math>\min_B \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \quad s.t. \quad |B| \ge 1</math> <br />
<br />
then it can be shown that RCA once again gives the optimal Mahalanobis distance matrix up to a scale factor. This property suggests a natural comparison with Xing ''et al.''’s method, which similarly learns a distance metric based on similarity side information. Xing ''et al.''’s method assumes side information in the form of pairwise similarities and dissimilarities, and seeks to optimize<br />
<br />
::<math>\min_B \sum_{(x_1,x_2) \in S} ||x_1 - x_2||^2_B \quad s.t. \sum_{(x_1,x_2) \in D} ||x_1 - x_2||_B \ge 1 , \quad B \ge 0 </math><br />
<br />
where S contains similar pairs and D contains dissimilar pairs. Comparing to the preceding optimization problem, if all chunklets have size 2 (i.e. the chunklets are just pairwise similarities), the objective function is the same up to a scale factor.<br />
<br />
The authors compared the clustering performance of RCA with Xing ''et al.''’s method <ref> E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information", Advances in Neural Information Processing Systems, 2002. </ref> using six of the UC Irvine datasets. Clustering performance was measured using a normalized accuracy score defined as<br />
<br />
::<math>\sum_{i > j}\frac{1 \lbrace 1 \lbrace c_i = c_j \rbrace = 1 \lbrace \hat{c}_i = \hat{c}_j \rbrace \rbrace}{0.5m(m-1)}</math><br />
<br />
where 1{ } is the indicator function, <math>\hat{c}</math> is the assigned cluster, and c is the true cluster. The score may be interpreted as the probability of correctly assigning two randomly drawn points.<br />
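This score is straightforward to compute; a minimal sketch (the function name is ours):<br />

```python
import numpy as np

# Normalized accuracy score: fraction of point pairs on which the predicted
# same-cluster/different-cluster relation agrees with the true one.
def pairwise_accuracy(c_true, c_pred):
    c_true = np.asarray(c_true)
    c_pred = np.asarray(c_pred)
    m = len(c_true)
    same_true = c_true[:, None] == c_true[None, :]
    same_pred = c_pred[:, None] == c_pred[None, :]
    iu = np.triu_indices(m, k=1)       # each unordered pair counted once
    return np.mean(same_true[iu] == same_pred[iu])

# Relabelling the clusters leaves the score unchanged; disagreement lowers it.
print(pairwise_accuracy([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(pairwise_accuracy([0, 0, 1, 1], [0, 1, 0, 1]))  # 2/6 ≈ 0.33
```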
<br />
Overall, RCA yielded an improvement over regular K-means and showed comparable performance to Xing ''et al.''’s method; however, RCA is more computationally efficient, as it uses closed-form expressions while Xing ''et al.''’s method requires iterative gradient descent.<br />
<br />
== Suggestions/Critique ==<br />
<br />
* RCA makes effective use of limited side information in the form of chunklets, however in most applications the data does not naturally come in chunklets. Indeed, in the face recognition experiments, the authors had to make use of prior information to artificially create chunklets. It may be useful if the authors provided additional examples of applications where data is naturally partitioned into chunklets, to further motivate the applicability of RCA.<br />
<br />
* RCA also assumes equal class covariances, which might limit its performance on many real-world datasets.<br />
<br />
* In the UC Irvine experiments, RCA shows similar performance to Xing ''et al.''’s method, but the authors noted that RCA is more computationally efficient. While they make a sensible logical argument (iterative gradient descent tends to be computationally expensive), providing experimental running times may help support and quantify this claim.<br />
<br />
<br />
====Why Equal Variances for Chunklets====<br />
<br />
In [2], the authors suppose that <math> C_{m} </math> is the random variable describing the distribution of data in class <math> m </math>; then, assuming equal class variances, they calculate <math> S_{ch} </math> as described above.<br><br />
<br />
Further, suppose that the data in class <math> m </math> depend on another source of variation <math> G </math> besides the class characteristics (<math> G </math> can be global variation or sensor characteristics). Now the random variable for the <math> m </math>th class is <math> X=C_{m}+G </math>, where the global component <math> G </math> is the same for all classes, <math> G </math> is independent of <math> C_{m} </math>, and the global variation is larger than the class variation (<math> \Sigma_{m}<\Sigma_{G} </math>). <br><br />
<br />
In this situation, the variance for class <math> m </math> is <math> \Sigma_{m}+\Sigma_{G} </math>, which by assumption is dominated by <math> \Sigma_{G} </math>. The total variance is therefore approximately <math> \Sigma_{G} </math> for every class, which recovers the equal-class-variance assumption.<br><br />
<br />
== Kernel RCA==<br />
<br />
Although RCA has significant computational and technical advantages, there are situations arising in real problems that it cannot handle; that is, RCA comes with some restrictions. <br><br />
<br />
(i) RCA only considers linear transformations, and fails for nonlinear ones (even simple ones);<br><br />
(ii) since RCA acts in the input space, its number of parameters depends on the dimensionality of the feature vectors;<br><br />
(iii) RCA requires a vectorial representation of the data, which some kinds of data, such as protein sequences, do not naturally have.<br><br />
<br />
To overcome these restrictions, Tsang and colleagues (2005)<ref> Tsang, I. W. and Colleagues; Kernel Relevant Component Analysis For Distance Metric Learning. International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 2005 </ref> suggested using kernels in RCA and showed how RCA can be kernelized.<br />
<br />
===Kernelizing RCA===<br />
For <math>k</math> given chunklets, the <math>i</math>th containing <math>n_{i}</math> patterns <math>\left\{x_{i,1},...,x_{i,n_{i}} \right\}</math>, the covariance matrix of the centered patterns is as follows:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\bar{x}_{i}\right)\left(x_{i,j}-\bar{x}_{i}\right)^{'} </math><br />
<br />
and the associated whitening transform is<br />
<br />
<math>x\stackrel{}{\rightarrow}C^{-\frac{1}{2}}x </math><br />
<br />
Now let <math>X=\left[x_{1,1},x_{1,2},...,x_{1,n_{1}},...,x_{k,1},...,x_{k,n_{k}} \right]</math> be the matrix whose columns are all <math>n</math> patterns; then C can be written as:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\frac{1}{n_{i}}X1_{i}\right)\left(x_{i,j}-\frac{1}{n_{i}}X1_{i}\right)^{'} </math><br />
<br />
where <math>1_{i}</math> is an <math>n \times 1</math> indicator vector such that:<br />
<br />
<math> [1_{i}]_{j}= \left\{\begin{matrix} <br />
1 & \text{pattern } j \in \text{chunklet } i \\ <br />
0 & \text{otherwise} \end{matrix}\right.</math><br />
<br />
and <math>I_{i}=diag\left(1_{i}\right)</math>.<br />
<br />
Using the above notation, C can be simplified to the form <math>C=\frac{1}{n}XHX^{'}</math><br />
<br />
where <math> H=\sum_{i=1}^{k}\left(I_{i}-\frac{1}{n_{i}}1_{i}1_{i}^{'}\right)</math><br />
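This identity is easy to verify numerically (random data; chunklets taken as contiguous column blocks for illustration):<br />

```python
import numpy as np

# Check that C = (1/n) X H X' equals the centered within-chunklet covariance.
rng = np.random.default_rng(3)
d, sizes = 3, [4, 5, 6]
n = sum(sizes)
X = rng.normal(size=(d, n))                 # columns are the patterns x_{i,j}

# Direct computation from the definition.
C_direct = np.zeros((d, d))
start = 0
for ni in sizes:
    block = X[:, start:start + ni]
    centered = block - block.mean(axis=1, keepdims=True)
    C_direct += centered @ centered.T
    start += ni
C_direct /= n

# Via H = sum_i (I_i - (1/n_i) 1_i 1_i').
H = np.zeros((n, n))
start = 0
for ni in sizes:
    one = np.zeros(n)
    one[start:start + ni] = 1.0
    H += np.diag(one) - np.outer(one, one) / ni
    start += ni

C_matrix = X @ H @ X.T / n
print(np.allclose(C_direct, C_matrix))  # True
```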
<br />
To avoid singularity, for a small <math> \epsilon > 0 </math> let <math>\hat{C}=C+\epsilon I</math>; the inverse of <math>\hat{C}</math> is then <br />
<br />
<math>\hat{C}^{-1}=\frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}X^{'}</math><br />
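This follows from the matrix inversion lemma (Woodbury identity); note the <math> \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1} </math> factor, analogous to the kernel expression below. A quick numerical check on random data:<br />

```python
import numpy as np

# Numerically verify the matrix-inversion-lemma form of C_hat^{-1}.
rng = np.random.default_rng(4)
d, n, eps = 3, 8, 0.1
X = rng.normal(size=(d, n))
# H for two chunklets of size 4 each.
H = np.zeros((n, n))
for start, ni in [(0, 4), (4, 4)]:
    one = np.zeros(n); one[start:start + ni] = 1.0
    H += np.diag(one) - np.outer(one, one) / ni

C_hat = X @ H @ X.T / n + eps * np.eye(d)
lhs = np.linalg.inv(C_hat)
rhs = (np.eye(d) / eps
       - X @ H @ np.linalg.inv(np.eye(n) + X.T @ X @ H / (n * eps)) @ X.T
         / (n * eps**2))
print(np.allclose(lhs, rhs))  # True
```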
<br />
Therefore, the inner product of the transformed <math>x</math> and <math>y</math> is <br />
<br />
<math> \left(\hat{C}^{-\frac{1}{2}}x\right)^{'} \left(\hat{C}^{-\frac{1}{2}}y\right)= x^{'} \hat{C}^{-1} y= x^{'} \left( \frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}X^{'} \right) y </math><br />
<br />
Now if RCA operates in a feature space <math> \mathcal{F}</math> with corresponding kernel <math> l </math>, then the inner product between the nonlinear transformations <math> \varphi (x)</math> and <math> \varphi (y)</math> after running RCA in <math> \mathcal{F}</math> is:<br />
<br />
<math> \tilde{l}(x,y)=\frac{1}{\epsilon}l(x,y)-l_{x}^{'} \left( \frac{1}{n \epsilon^{2}}H \left( I+\frac{1}{n \epsilon}LH \right)^{-1} \right) l_{y} </math><br />
<br />
where <math>L=\left[ l(x_{i},x_{j}) \right]_{ij}</math>, <math> l_{x}=\left[ l(x_{1,1},x),...,l(x_{k,n_{k}},x) \right]^{'}</math><br />
and <math> l_{y}=\left[ l(x_{1,1},y),...,l(x_{k,n_{k}},y) \right]^{'}</math><br />
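As a sanity check, with a linear kernel <math> l(x,y)=x^{'}y </math> the kernelized expression should reduce to the linear-case inner product <math> x^{'}\hat{C}^{-1}y </math> (random data; the final factor is taken as <math> l_{y} </math>):<br />

```python
import numpy as np

# Kernel RCA with a linear kernel must match linear RCA's inner product.
rng = np.random.default_rng(5)
d, n, eps = 3, 8, 0.1
X = rng.normal(size=(d, n))
H = np.zeros((n, n))
for start, ni in [(0, 4), (4, 4)]:
    one = np.zeros(n); one[start:start + ni] = 1.0
    H += np.diag(one) - np.outer(one, one) / ni

x, y = rng.normal(size=d), rng.normal(size=d)
L = X.T @ X                                  # kernel matrix on the patterns
l_x, l_y = X.T @ x, X.T @ y                  # kernel evaluations against x, y

kernel_val = (x @ y / eps
              - l_x @ (H @ np.linalg.inv(np.eye(n) + L @ H / (n * eps))
                       / (n * eps**2)) @ l_y)
C_hat = X @ H @ X.T / n + eps * np.eye(d)
linear_val = x @ np.linalg.inv(C_hat) @ y
print(np.isclose(kernel_val, linear_val))  # True
```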
<br />
== Experimental Results: Application to Clustering == <br />
<br />
The main goal of this method is to utilize side information in the form of equivalence relations to improve the performance of unsupervised learning techniques. To test the RCA algorithm described above, and to allow comparison with Xing et al., six data sets from the UC Irvine repository were used, the same ones as in <ref> E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information", Advances in Neural Information Processing Systems, 2002. </ref>. As in that paper, a set S of pairwise similarity constraints is given; with this data, the following clustering algorithms were run:<br />
<br />
1. K-means using the default Euclidean metric (i.e. using no side information).<br />
<br />
2. Constrained K-means: K-means subject to points <math> \mathbf{(x_i,x_j) \in S } </math> always being assigned to the same cluster (Wagstaff et al., 2001).<br />
<br />
3. Constrained K-means + metric proposed by (Xing et al., 2002): Constrained K-means using the distance metric proposed in (Xing et al., 2002), which is learned from S.<br />
<br />
4. Constrained K-means + RCA: Constrained K-means using the RCA distance metric learned from S.<br />
<br />
5. EM: Expectation Maximization of a Gaussian Mixture model (using no side-information).<br />
<br />
6. Constrained EM: EM using side-information in the form of equivalence constraints (Hertz et al., 2002; Shental et al., 2003), using the RCA distance metric as the initial metric. <br />
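Using a learned Mahalanobis matrix B inside K-means (as in options 3 and 4) is equivalent to mapping the data through <math> B^{1/2} </math> and clustering with the ordinary Euclidean metric; a minimal check of this equivalence (hypothetical B and points):<br />

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(3, 3))
B = A.T @ A + np.eye(3)                    # a positive-definite metric matrix
x1, x2 = rng.normal(size=3), rng.normal(size=3)

# ||x1 - x2||_B^2 = (x1 - x2)' B (x1 - x2)
d_mahal = (x1 - x2) @ B @ (x1 - x2)

# Equivalently, map points through B^{1/2} and use the Euclidean distance.
lam, V = np.linalg.eigh(B)
B_half = V @ np.diag(np.sqrt(lam)) @ V.T
d_eucl = np.sum((B_half @ (x1 - x2)) ** 2)
print(np.isclose(d_mahal, d_eucl))  # True
```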
<br />
Following Xing et al. (2002), a normalized accuracy score is used to evaluate the partitions obtained by the six clustering algorithms listed above. More specifically, in the case of 2-cluster data the accuracy measure can be written as:<br />
<br />
<center><math>\mathbf{\sum_{i>j}\frac{ 1\{1 \{c_i=c_j\}=1\{\hat{c_i}=\hat{c_j}\}\}} {0.5m(m-1)} }</math></center><br />
<br />
where <math>\mathbf{1\{.\}}</math> is the indicator function, <math>\mathbf{\{\hat{c_i}\}_{i=1}^m}</math> is the cluster to which point <math> \mathbf{x_i} </math> is assigned by the clustering algorithm, and <math> \mathbf{c_i} </math> is the "correct" or desired assignment. The above score can be regarded as computing the probability that the algorithm's assignment <math> \mathbf{\hat{c}} </math> of two randomly drawn points <math> \mathbf{x_i} </math> and <math> \mathbf{x_j} </math> agrees with the "true" assignment <math> \mathbf{c} </math>.<br />
<br />
== References ==<br />
<references/></div>
<hr />
<div>== First paper: Shental ''et al.'', 2002 <ref>N. Shental, T. Hertz, D. Weinshall, and M. Pavel, "Adjustment Learning and Relevant Component Analysis," Proc. European Conference on Computer Vision (ECCV), 2002, pp. 776-790.</ref> ==<br />
<br />
Irrelevant data variability often causes difficulties in classification and clustering tasks. For example, when data variability is dominated by environment conditions, such as global illumination, nearest-neighbour classification in the original feature space may be very unreliable. The goal of Relevant Component Analysis (RCA) is to find a transformation that amplifies relevant variability and suppresses irrelevant variability.<br />
<br />
:: ''Definition of irrelevant variability:'' We say that data variability is correlated with a specific task "if the removal of this variability from the data deteriorates (on average) the results of clustering or retrieval" [1]. Variability is irrelevant if it is "maintained in the data" but "not correlated with the specific task" [1].<br />
<br />
To achieve this goal, Shental ''et al.'' introduced the idea of ''chunklets'' – "small sets of data points, in which the class label is constant, but unknown" [1]. As we will see, chunklets allow irrelevant variability to be suppressed without needing fully labelled training data. Since the data come unlabelled, the chunklets "must be defined naturally by the data": for example, in speaker identification, "short utterances of speech are likely to come from a single speaker" [1]. The authors coin the term ''adjustment learning'' to describe learning using chunklets; adjustment learning can be viewed as falling somewhere between unsupervised learning and supervised learning.<br />
<br />
Relevant Component Analysis tries to find a linear transformation W of the feature space such that the effect of irrelevant variability is reduced in the transformed space. That is, we wish to rescale the feature space and reduce the weights of irrelevant directions. The main premise of RCA is that we can reduce irrelevant variability by reducing the within-class variability. Intuitively, a direction which exhibits high variability among samples of the same class is unlikely to be useful for classification or clustering. <br />
<br />
RCA assumes that the class covariances are all equal. If we allow this assumption, it makes sense to rescale the feature space using a whitening transformation based on the common class covariance Σ. This gives the familiar transformation W = VΛ<sup>-1/2</sup>, where V and Λ can be found by the singular value decomposition of Σ.<br />
<br />
With labelled data, estimating Σ is straightforward; in RCA, however, labelled data is not available, and an approximation is calculated using chunklets. The ''chunklet scatter matrix'' is calculated by<br />
<br />
:: <math>S_{ch} = \frac{1}{|\Omega|}\sum_{n=1}^N|H_n|Cov(H_n)</math><br />
<br />
where |Ω| is the size of the data set, H<sub>n</sub> is the nth chunklet, |H<sub>n</sub>| is the size of the nth chunklet, and N is the number of chunklets.<br />
<br />
Intuitively, this is a weighted average of the chunklet covariances, with weight proportional to the size of the chunklet. Each chunklet's mean serves as an approximation of its class mean regardless of the chunklet's size; size still matters, however, because the larger the chunklet, the more reliably its mean approximates the class mean, which is why larger chunklets receive more weight.<br />
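The chunklet scatter matrix above can be computed directly (a hypothetical numpy sketch; the helper name `chunklet_scatter` is not from the paper):<br />

```python
import numpy as np

def chunklet_scatter(chunklets, n_total):
    """S_ch = (1/|Omega|) * sum_n |H_n| * Cov(H_n), with |Omega| = n_total."""
    d = chunklets[0].shape[1]
    S = np.zeros((d, d))
    for H in chunklets:
        # |H_n| * Cov(H_n), with the ML (biased) covariance, equals the
        # within-chunklet scatter sum_i (x - m_n)(x - m_n)'
        S += len(H) * np.cov(H, rowvar=False, bias=True)
    return S / n_total

rng = np.random.default_rng(1)
chunks = [rng.normal(size=(5, 2)), rng.normal(size=(3, 2))]
S_ch = chunklet_scatter(chunks, n_total=10)   # e.g. a data set of 10 points
```

Equivalently, S<sub>ch</sub> is the total within-chunklet scatter divided by the data set size.<br />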
<br />
The steps of the RCA algorithm are as follows:<br />
<br />
:: "1. Calculate S<sub>ch</sub>... Let r denote its effective rank (the number of singular values of S<sub>ch</sub> which are significantly larger than 0).<br />
:: 2. Compute the total covariance (scatter) matrix of the original data S<sub>T</sub>, and project the data using PCA to its r largest dimensions.<br />
:: 3. Project S<sub>ch</sub> onto the reduced dimensional space, and compute the corresponding whitening transformation W.<br />
:: 4. Apply W to the original data (in the reduced space)." [1]<br /><br />
Those directions in which the data variability is due to within-class variability are irrelevant for classification, and the computed W assigns lower weight to these directions.<br />
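Putting the four steps together, a minimal numpy sketch of the algorithm might look as follows (the function `rca` and its normalization choices are illustrative assumptions, not the authors' code; |Ω| is taken here as the total number of chunklet points):<br />

```python
import numpy as np

def rca(X, chunklets, tol=1e-6):
    """Illustrative RCA: X is (n, d); chunklets is a list of index arrays into X."""
    p = sum(len(idx) for idx in chunklets)            # total chunklet points
    # Step 1: chunklet scatter matrix S_ch and its effective rank r
    S_ch = sum(len(idx) * np.cov(X[idx], rowvar=False, bias=True)
               for idx in chunklets) / p
    r = int(np.sum(np.linalg.eigvalsh(S_ch) > tol))
    # Step 2: PCA of the total scatter matrix; keep the r largest dimensions
    Xc = X - X.mean(axis=0)
    _, V_T = np.linalg.eigh(np.cov(Xc, rowvar=False, bias=True))
    P = V_T[:, -r:]                                   # eigh sorts ascending; top r
    X_red = Xc @ P
    # Step 3: project S_ch onto the reduced space and compute the whitening W
    S_red = P.T @ S_ch @ P
    vals, V = np.linalg.eigh(S_red)
    W = V @ np.diag(vals ** -0.5) @ V.T
    # Step 4: apply W to the data in the reduced space
    return X_red @ W.T

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
chunklets = [np.arange(0, 5), np.arange(5, 9), np.arange(9, 14)]
Y = rca(X, chunklets)
```

After the transform, the within-chunklet scatter of the output is (approximately) the identity matrix, i.e. the irrelevant directions have been rescaled down.<br />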
<br />
'''Experimental Results: Face Recognition'''<br />
<br />
The authors demonstrated the performance of RCA for the task of face recognition using the yaleA database. The database contains 155 face images of 15 people; lighting conditions and facial expression are varied across images. RCA is compared with the Eigenface method (based on PCA) and the Fisherface method (based on Fisher’s Linear Discriminant) for both nearest neighbour classification and clustering-based classification. In this dataset, the data is not naturally divided into chunklets, so the authors randomly sample chunklets given the ground-truth class (for example, if an individual is represented in 10 images, two chunklets may be formed by randomly partitioning the images into two groups of 5 images.) <br />
<br />
For nearest neighbour classification, RCA outperforms Eigenface but does slightly worse than Fisherface. For clustering, RCA performs better than Eigenface and comparably to Fisherface. The authors pointed out that these experimental results are encouraging as Fisherface is a supervised method.<br />
<br />
In <ref> M. Sorci,G. Antonini, and Jean-Philippe Thiran, "Fisher's discriminant and relevant component analysis for static facial expression classification."</ref>, it is shown that, in the context of a facial expression recognition framework, RCA in combination with FLD yields a better classifier than RCA alone; the combination achieves results comparable to an SVM.<br />
<br />
'''Experimental Results: Surveillance'''<br />
<br />
In a second experiment, the authors used surveillance video footage divided into discrete clips in which a single person is featured. The same person can appear in multiple clips, and the task was to retrieve all clips in which a query person appears. A colour histogram is used to represent a person. Sources of irrelevant variation include reflections, occlusions, and illumination. In this experiment, the data does come naturally in chunklets: each clip features a single person, so frames in the same clip form a chunklet. Figure 7 in the paper shows the results of k-nearest neighbour classification (not reproduced here for copyright reasons).<br />
<br />
== Second Paper: Bar-Hillel ''et al.'', 2003 <ref> A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall, "Learning Distance Functions using Equivalence Relations," Proc. International Conference on Machine Learning (ICML), 2003, pp. 11-18. </ref> ==<br />
<br />
In a subsequent work [2], Bar-Hillel ''et al.'' described how RCA can be shown to optimize an information theoretic criterion, and compared the performance of RCA with the approach proposed by Xing ''et al.'' [3].<br />
<br />
'''Information Maximization'''<br />
<br />
According to information theory, "when an input X is transformed into a new representation Y, we should seek to maximize the mutual information I(X, Y) between X and Y under suitable constraints" [2]. In adjustment learning, we can think of the objective to be to keep chunklet points close to each other in the transformed space. More formally:<br />
<br />
::<math>\max_{f \in F}I(X,Y) \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||y_{ji} - m_j^y||^2 \le K</math><br />
<br />
where f is a transformation function, m<sub>j</sub><sup>y</sup> is the mean of chunklet j in the transformed space, p is the total number of chunklet points, and K is a constant.<br />
<br />
To maximize I(X,Y), we can simply maximize the entropy of Y, H(Y). This is because I(X,Y) = H(Y) – H(Y|X), and H(Y|X) is constant since the transformation is deterministic. Intuitively, since the transformation is deterministic there is no uncertainty in Y if X is known. <br />
<br />
Now we would like to express H(Y) in terms of H(X). If the transformation is invertible, we have p<sub>y</sub>(y) = p<sub>x</sub>(x) / |J(x)|, where J(x) is the Jacobian of the transformation. Therefore,<br />
<br />
::<math><br />
\begin{align}<br />
H(Y) & = -\int_y p(y)\log p(y)\, dy \\<br />
& = -\int_x p(x) \log \frac{p(x)}{|J(x)|} \, dx \\<br />
& = H(X) + \langle \log |J(x)| \rangle_x<br />
\end{align}<br />
</math><br />
<br />
Assuming a linear transformation Y = AX, the Jacobian J(x) is simply the constant matrix A, so |J(x)| = |A|. So to maximize I(X,Y), we can maximize H(Y), and maximizing H(Y) amounts to maximizing |A|. Hence, the optimization objective can be updated as<br />
<br />
::<math>\max_A |A| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_{A^tA} \le K</math><br />
<br />
This can also be expressed in terms of the Mahalanobis distance matrix B = A<sup>t</sup>A as follows, noting that log |A| = (1/2) log |B|.<br />
<br />
::<math>\max_B |B| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \le K , \quad B > 0</math><br />
<br />
The solution to this problem is <math>B = \tfrac{K}{N} \hat{C}^{-1}</math>, where <math>\hat{C}</math> is the chunklet scatter matrix calculated in Step 1 of RCA. Thus, RCA gives the optimal Mahalanobis distance matrix up to a scale factor.<br />
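The learned matrix is then used as a Mahalanobis metric, <math>||x-y||^{2}_{B}=(x-y)^{'}B(x-y)</math>. A small sketch with toy numbers (not from the paper) showing how <math>B\propto\hat{C}^{-1}</math> downweights high-variance, irrelevant directions:<br />

```python
import numpy as np

def mahalanobis_sq(x, y, B):
    """Squared Mahalanobis distance ||x - y||_B^2 = (x - y)' B (x - y)."""
    d = np.asarray(x) - np.asarray(y)
    return float(d @ B @ d)

# Toy chunklet covariance: the first axis carries large within-class variability
C_hat = np.diag([4.0, 0.25])
B = np.linalg.inv(C_hat)   # the scale factor is omitted; it rescales all distances equally
d_noisy = mahalanobis_sq([2.0, 0.0], [0.0, 0.0], B)   # squared Euclidean gap = 4
d_clean = mahalanobis_sq([0.0, 1.0], [0.0, 0.0], B)   # squared Euclidean gap = 1
```

Under the Euclidean metric the first pair is four times farther apart; under B the ordering reverses, because distance along the noisy axis is discounted.<br />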
<br />
<br />
'''Within-Chunklet Distance Minimization'''<br />
<br />
In addition, RCA minimizes the sum of within-chunklet squared distances. If we consider the optimization problem<br />
<br />
::<math>\min_B \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \quad s.t. \quad |B| \ge 1</math> <br />
<br />
then it can be shown that RCA once again gives the optimal Mahalanobis distance matrix up to a scale factor. This property suggests a natural comparison with Xing ''et al.''’s method, which similarly learns a distance metric based on similarity side information. Xing ''et al.''’s method assumes side information in the form of pairwise similarities and dissimilarities, and seeks to optimize<br />
<br />
::<math>\min_B \sum_{(x_1,x_2) \in S} ||x_1 - x_2||^2_B \quad s.t. \sum_{(x_1,x_2) \in D} ||x_1 - x_2||_B \ge 1 , \quad B \ge 0 </math><br />
<br />
where S contains similar pairs and D contains dissimilar pairs. Comparing to the preceding optimization problem, if all chunklets have size 2 (i.e. the chunklets are just pairwise similarities), the objective function is the same up to a scale factor.<br />
<br />
The authors compared the clustering performance of RCA with Xing ''et al.''’s method <ref> E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information", Advances in Neural Information Processing Systems, 2002. </ref> using six of the UC Irvine datasets. Clustering performance was measured using a normalized accuracy score defined as<br />
<br />
::<math>\sum_{i > j}\frac{1 \lbrace 1 \lbrace c_i = c_j \rbrace = 1 \lbrace \hat{c}_i = \hat{c}_j \rbrace \rbrace}{0.5m(m-1)}</math><br />
<br />
where 1{ } is the indicator function, <math>\hat{c}</math> is the assigned cluster, and c is the true cluster. The score may be interpreted as the probability of correctly assigning two randomly drawn points.<br />
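This pairwise score is easy to compute directly (a short sketch; `pairwise_accuracy` is an illustrative helper name):<br />

```python
from itertools import combinations

def pairwise_accuracy(c_true, c_pred):
    """Fraction of pairs (i, j), i > j, on which the two partitions agree
    about 'same cluster' vs. 'different cluster'."""
    m = len(c_true)
    agree = sum((c_true[i] == c_true[j]) == (c_pred[i] == c_pred[j])
                for i, j in combinations(range(m), 2))
    return agree / (0.5 * m * (m - 1))
```

Because only co-membership of pairs is compared, relabelling the clusters does not change the score.<br />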
<br />
Overall, RCA yielded an improvement over regular K-means and showed comparable performance to Xing ''et al.''’s method. However, RCA is more computationally efficient: it works with closed-form expressions, while Xing ''et al.''’s method requires iterative gradient descent.<br />
<br />
== Suggestions/Critique ==<br />
<br />
* RCA makes effective use of limited side information in the form of chunklets; however, in most applications the data does not naturally come in chunklets. Indeed, in the face recognition experiments, the authors had to make use of prior information to artificially create chunklets. It would be useful if the authors provided additional examples of applications where data is naturally partitioned into chunklets, to further motivate the applicability of RCA.<br />
<br />
* RCA also assumes equal class covariances, which might limit its performance on many real-world datasets.<br />
<br />
* In the UC Irvine experiments, RCA shows similar performance to Xing ''et al.''’s method, but the authors noted that RCA is more computationally efficient. While they make a sensible logical argument (iterative gradient descent tends to be computationally expensive), providing experimental running times may help support and quantify this claim.<br />
<br />
<br />
====Why Equal Covariances for Chunklets ====<br />
<br />
In [2], the authors suppose that <math> C_{m} </math> is the random variable describing the distribution of data in class <math> m </math>; then, assuming equal class covariances, they calculate <math> S_{ch} </math> as described above.<br><br />
<br />
Further, suppose that the data in class <math> m </math> depend on another source of variation <math> G </math> besides the class characteristics (<math> G </math> could be global variation or sensor characteristics). The random variable for the <math> m </math>th class is then <math> X=C_{m}+G </math>, where the global impact <math> G </math> is the same for all classes, <math> G </math> is independent of <math> C_{m} </math>, and the global variation is larger than the class variation (<math> \Sigma_{m}<\Sigma_{G} </math>). <br><br />
<br />
In this situation the covariance for class <math> m </math> is <math> \Sigma_{m}+\Sigma_{G} </math>, which by assumption is dominated by <math> \Sigma_{G} </math>. Hence the class covariances are all approximately equal to <math> \Sigma_{G} </math>, which justifies the equal-covariance assumption.<br><br />
<br />
== Kernel RCA==<br />
<br />
Although RCA has significant computational and technical advantages, there are situations arising in real problems that RCA cannot handle; that is, RCA comes with some restrictions. <br><br />
<br />
(i) RCA only considers linear transformations, and fails for nonlinear transformations (even simple ones);<br><br />
(ii) since RCA acts in the input space, its number of parameters depends on the dimensionality of the feature vectors;<br><br />
(iii) RCA requires a vectorial representation of the data, which may not be natural for some kinds of data, such as protein sequences.<br><br />
<br />
To overcome these restrictions, Tsang and colleagues (2005)<ref> Tsang, I. W. and Colleagues; Kernel Relevant Component Analysis For Distance Metric Learning. International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 2005 </ref> suggested using a kernel in RCA and showed how RCA can be kernelized.<br />
<br />
===Kernelizing RCA===<br />
For <math>k</math> given chunklets, each containing <math>n_{i}</math> patterns <math>\left\{x_{i,1},...,x_{i,n_{i}} \right\}</math>, the covariance matrix of the centered patterns is as follows:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\bar{x}_{i}\right)\left(x_{i,j}-\bar{x}_{i}\right)^{'} </math><br />
<br />
and the associated whitening transform is as<br />
<br />
<math>x\stackrel{}{\rightarrow}C^{-\frac{1}{2}}x </math><br />
<br />
Now let <math>X=\left[x_{1,1},x_{1,2},...,x_{1,n_{1}},...,x_{k,1},...,x_{k,n_{k}} \right]</math> be the matrix whose columns are all <math>n</math> patterns; then C can be written as:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\frac{1}{n_{i}}X1_{i}\right)\left(x_{i,j}-\frac{1}{n_{i}}X1_{i}\right)^{'} </math><br />
<br />
where <math>1_{i}</math> is <math>n \times 1</math> vector such that:<br />
<br />
<math> [1_{i}]_{j}= \left\{\begin{matrix} <br />
1 & \text{pattern } j \in \text{chunklet } i \\ <br />
0 & \text{otherwise} \end{matrix}\right.</math><br />
<br />
and <math>I_{i}=diag\left(1_{i}\right)</math>.<br />
<br />
Using the above notation, C can be simplified to the form <math>C=\frac{1}{n}XHX^{'}</math><br />
<br />
where <math> H=\sum_{i=1}^{k}\left(I_{i}-\frac{1}{n_{i}}1_{i}1_{i}^{'}\right)</math><br />
<br />
To deal with the issue of singularity, for small <math> \epsilon </math> let <math>\hat{C}=C+\epsilon I</math>; then the inverse of <math>\hat{C}</math> can be written in closed form as<br />
<br />
<math>\hat{C}^{-1}=\frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}X^{'}</math><br />
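This is a Sherman–Morrison–Woodbury-type identity; note that the <math>\left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}</math> factor matches the <math>\left(I+\frac{1}{n \epsilon}LH \right)^{-1}</math> term in the kernel expression below, since <math>L=X^{'}X</math> for a linear kernel. A quick numerical check (an illustrative numpy sketch; `chunklet_H` is a hypothetical helper):<br />

```python
import numpy as np

def chunklet_H(n, sizes):
    """H = sum_i (I_i - (1/n_i) 1_i 1_i') for consecutive chunklets of the given sizes."""
    H, start = np.zeros((n, n)), 0
    for ni in sizes:
        one = np.zeros(n)
        one[start:start + ni] = 1.0
        H += np.diag(one) - np.outer(one, one) / ni
        start += ni
    return H

rng = np.random.default_rng(0)
n, d, eps = 6, 4, 0.1
X = rng.normal(size=(d, n))                  # columns are the n patterns
H = chunklet_H(n, [3, 3])                    # two chunklets of 3 patterns each
C_hat = X @ H @ X.T / n + eps * np.eye(d)
# Closed form: (1/eps) I - (1/(n eps^2)) X H (I + (1/(n eps)) X'X H)^(-1) X'
closed = (np.eye(d) / eps
          - X @ H @ np.linalg.inv(np.eye(n) + X.T @ X @ H / (n * eps)) @ X.T
          / (n * eps ** 2))
```

The closed form avoids inverting a d×d matrix in feature space, which is what makes the kernel version below possible.<br />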
<br />
Therefore, the inner product of the transformed <math>x</math> and <math>y</math> is <br />
<br />
<math> \left(\hat{C}^{-\frac{1}{2}}x\right)^{'} \left(\hat{C}^{-\frac{1}{2}}y\right)= x^{'} \hat{C}^{-1} y= x^{'} \left( \frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}X^{'} \right) y </math><br />
<br />
Now if RCA operates in a feature space <math> \mathcal{F}</math> with corresponding kernel <math> l </math>, then the inner product between the nonlinear transformations <math> \varphi (x)</math> and <math> \varphi (y)</math> after running RCA in <math> \mathcal{F}</math> is:<br />
<br />
<math> \tilde{l}(x,y)=\frac{1}{\epsilon}l(x,y)-l_{x}^{'} \left( \frac{1}{n \epsilon^{2}}H \left( I+\frac{1}{n \epsilon}LH \right)^{-1} \right) l_{y} </math><br />
<br />
where <math>L=\left[ l(x_{i},x_{j}) \right]_{ij}</math>, <math> l_{x}=\left[ l(x_{1,1},x),...,l(x_{k,n_{k}},x) \right]^{'}</math><br />
and <math> l_{y}=\left[ l(x_{1,1},y),...,l(x_{k,n_{k}},y) \right]^{'}</math><br />
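For a linear kernel <math>l(x,y)=x^{'}y</math> (so that <math>L=X^{'}X</math> and <math>l_{x}=X^{'}x</math>), the kernelized expression must reproduce <math>x^{'}\hat{C}^{-1}y</math>. A sketch verifying this (helper names are illustrative, not from the paper):<br />

```python
import numpy as np

def chunklet_H(n, sizes):
    """H = sum_i (I_i - (1/n_i) 1_i 1_i') for consecutive chunklets."""
    H, start = np.zeros((n, n)), 0
    for ni in sizes:
        one = np.zeros(n)
        one[start:start + ni] = 1.0
        H += np.diag(one) - np.outer(one, one) / ni
        start += ni
    return H

def kernel_rca_dot(l_xy, l_x, l_y, L, H, eps):
    """l~(x,y) = (1/eps) l(x,y) - l_x' [(1/(n eps^2)) H (I + (1/(n eps)) L H)^(-1)] l_y"""
    n = L.shape[0]
    M = H @ np.linalg.inv(np.eye(n) + L @ H / (n * eps)) / (n * eps ** 2)
    return l_xy / eps - l_x @ M @ l_y

rng = np.random.default_rng(0)
n, d, eps = 6, 4, 0.1
X = rng.normal(size=(d, n))                 # columns are the training patterns
H = chunklet_H(n, [3, 3])
L = X.T @ X                                 # linear-kernel Gram matrix
x, y = rng.normal(size=d), rng.normal(size=d)
got = kernel_rca_dot(x @ y, X.T @ x, X.T @ y, L, H, eps)
expected = x @ np.linalg.inv(X @ H @ X.T / n + eps * np.eye(d)) @ y
```

Only kernel evaluations appear in `kernel_rca_dot`, so the same computation carries over to nonlinear kernels where the feature map is never formed explicitly.<br />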
<br />
== Experimental Results: Application to Clustering == <br />
<br />
The main goal of this method is to utilize side information in the form of equivalence relations to improve the performance of unsupervised learning techniques. To test the RCA algorithm described above, and to allow comparison with the results of Xing et al., six data sets from the UC Irvine repository were used, the same data sets as in <ref> E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information", Advances in Neural Information Processing Systems, 2002. </ref>. As in that paper, a set S of pairwise similarity constraints is given; with this side information, the following clustering algorithms were performed:<br />
<br />
1. K-means using the default Euclidean metric (i.e. using no side information).<br />
<br />
2. Constrained K-means: K-means subject to points <math> \mathbf{(x_i,x_j) \in S } </math> always being assigned to the same cluster (Wagstaff et al. ,2001).<br />
<br />
3. Constrained K-means + metric proposed by (Xing et al., 2002): Constrained K-means using the distance metric proposed in (Xing et al., 2002), which is learned from S.<br />
<br />
4. Constrained K-means + RCA: Constrained K-means using the RCA distance metric learned from S.<br />
<br />
5. EM: Expectation Maximization of a Gaussian Mixture model (using no side-information).<br />
<br />
6. Constrained EM: EM using side-information in the form of equivalence constraints (Hertz et al., 2002; Shental et al., 2003), when using RCA distance metric as an initial metric. <br />
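A minimal sketch of the "Constrained K-means" idea (method 2 above): points related by similarity constraints are grouped, and each group is assigned as a whole to its jointly nearest centre. This is an illustrative implementation under simplifying assumptions (constraints given as chunklets of indices; naive random initialization), not the code of Wagstaff et al.:<br />

```python
import numpy as np

def constrained_kmeans(X, chunklets, k, iters=50, seed=0):
    """K-means in which all points of a chunklet share one cluster assignment."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)].copy()
    constrained = {i for c in chunklets for i in c}
    # Unconstrained points behave as singleton groups
    groups = [list(c) for c in chunklets] + [[i] for i in range(len(X))
                                             if i not in constrained]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        for g in groups:
            # Assign the whole group to the centre minimising its total squared distance
            costs = ((X[g][:, None, :] - centres[None, :, :]) ** 2).sum(axis=(0, 2))
            labels[g] = int(np.argmin(costs))
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return labels
```

The only change from plain K-means is the group-wise assignment step, which enforces the constraint that points in S always end up in the same cluster.<br />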
<br />
Following (Xing et al., 2002), a normalized accuracy score is used to evaluate the partitions obtained by the six clustering algorithms listed above. More specifically, in the case of 2-cluster data the accuracy measure used can be written as:<br />
<br />
<center><math>\mathbf{\sum_{i>j}\frac{ 1\{1 \{c_i=c_j\}=1\{\hat{c_i}=\hat{c_j}\}\}} {0.5m(m-1)} }</math></center><br />
<br />
where <math>\mathbf{1\{.\}}</math> is the indicator function, <math>\mathbf{\{\hat{c_i}\}_{i=1}^m}</math> is the cluster to which point <math>x_i</math> is assigned by the clustering algorithm, and <math>\mathbf{\{c_i\}_{i=1}^m}</math> are the true cluster labels.<br />
<br />
== References ==<br />
<references/></div>
<br />
where <math>L=\left[ l(x_{i},x_{j}) \right]_{ij}</math>, <math> l_{x}=\left[ l(x_{1,1},x),...,l(x_{k,n_{k}},x) \right]^{'}</math><br />
and <math> l_{y}=\left[ l(x_{1,1},y),...,l(x_{k,n_{k}},y) \right]^{'}</math><br />
<br />
== Experimental Results: Application to Clustering == <br />
<br />
The main goal of this method is to utilize the side information in the form of equivalence relations to improve the performance of unsupervised learning techniques. To test the proposed the above RCA algorithm and for the sake of comparison of our results by Xing et al. we used six data sets from UC Irvine repository which were used in <ref> E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information", Advances in Neural Information Processing Systems, 2002. </ref>. Similar to what they have in their paper, we are given as set S of pairwise similarity constraints; Having this data set, we performed the following clustering algorithms:<br />
<br />
1. K-means using the default Euclidean metric (i.e. using no side information) .<br />
<br />
2. Constrained K-means: K-means subject to points <math> \mathbf{(x_i,x_j) \in S } </math> always being assigned to the same cluster (Wagstaff et al. ,2001).<br />
<br />
3. Constrained K-means + metric proposed by (Xing et al., 2002): Constrained K-means using the distance metric proposed in (Xing et al., 2002), which is learned from S.<br />
<br />
4. Constrained K-means + RCA: Constrained K-means using the RCA distance metric learned from S.<br />
<br />
5. EM: Expectation Maximization of a Gaussian Mixture model (using no side-information).<br />
<br />
6. Constrained EM: EM using side-information in the form of equivalence constraints (Hertz et al., 2002; Shental et al., 2003), when using RCA distance metric as an initial metric. <br />
<br />
Following (Xing et al., 2002) a normalized accuracy score is used to evaluate the partitions obtained by the different clustering algorithms which we pointed out in the above six methods. More specifically, in the case of 2-cluster data the accuracy measure used can be written as:<br />
<br />
<center><math>\mathbf{\sum_{i>j}\frac{ 1\{1 \{c_i=c_j\}=1\{\hat{c_i}=\hat{c_j}\}\}} {0.5m(m-1)} }</math></center><br />
<br />
== References ==<br />
<references/></div>Amirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=relevant_Component_Analysis&diff=3775relevant Component Analysis2009-08-02T18:14:55Z<p>Amir: /* Experimental Results: Application to Clustering */</p>
<hr />
<div>== First paper: Shental ''et al.'', 2002 <ref>N. Shental, T. Hertz, D. Weinshall, and M. Pavel, "Adjustment Learning and Relevant Component Analysis," Proc. European Conference on Computer Vision (ECCV), 2002, pp. 776-790.</ref> ==<br />
<br />
Irrelevant data variability often causes difficulties in classification and clustering tasks. For example, when data variability is dominated by environmental conditions, such as global illumination, nearest-neighbour classification in the original feature space may be very unreliable. The goal of Relevant Component Analysis (RCA) is to find a transformation that amplifies relevant variability and suppresses irrelevant variability.<br />
<br />
:: ''Definition of irrelevant variability:'' We say that data variability is correlated with a specific task "if the removal of this variability from the data deteriorates (on average) the results of clustering or retrieval" [1]. Variability is irrelevant if it is "maintained in the data" but "not correlated with the specific task" [1].<br />
<br />
To achieve this goal, Shental ''et al.'' introduced the idea of ''chunklets'' – "small sets of data points, in which the class label is constant, but unknown" [1]. As we will see, chunklets allow irrelevant variability to be suppressed without needing fully labelled training data. Since the data come unlabelled, the chunklets "must be defined naturally by the data": for example, in speaker identification, "short utterances of speech are likely to come from a single speaker" [1]. The authors coin the term ''adjustment learning'' to describe learning using chunklets; adjustment learning can be viewed as falling somewhere between unsupervised learning and supervised learning.<br />
<br />
Relevant Component Analysis tries to find a linear transformation W of the feature space such that the effect of irrelevant variability is reduced in the transformed space. That is, we wish to rescale the feature space and reduce the weights of irrelevant directions. The main premise of RCA is that we can reduce irrelevant variability by reducing the within-class variability. Intuitively, a direction which exhibits high variability among samples of the same class is unlikely to be useful for classification or clustering. <br />
<br />
RCA assumes that the class covariances are all equal. If we allow this assumption, it makes sense to rescale the feature space using a whitening transformation based on the common class covariance Σ. This gives the familiar transformation W = VΛ<sup>-1/2</sup>, where V and Λ can be found by the singular value decomposition of Σ.<br />
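As a small illustration (a sketch, not from the papers; the covariance matrix is hypothetical), the whitening transform W = VΛ<sup>-1/2</sup> can be computed via an eigendecomposition:<br />

```python
import numpy as np

def whitening_transform(sigma):
    """Compute W = V Lambda^{-1/2}, so that W' Sigma W = I."""
    vals, vecs = np.linalg.eigh(sigma)   # eigendecomposition of symmetric Sigma
    return vecs @ np.diag(vals ** -0.5)

# hypothetical common class covariance
sigma = np.array([[4.0, 1.0],
                  [1.0, 2.0]])
W = whitening_transform(sigma)
# in the whitened space the class covariance becomes the identity
print(np.allclose(W.T @ sigma @ W, np.eye(2)))  # True
```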
<br />
With labelled data estimating Σ is straightforward, but in RCA labelled data is not available and an approximation is calculated using chunklets. The ''chunklet scatter matrix'' is calculated by<br />
<br />
:: <math>S_{ch} = \frac{1}{|\Omega|}\sum_{n=1}^N|H_n|Cov(H_n)</math><br />
<br />
where |Ω| is the size of the data set, H<sub>n</sub> is the nth chunklet, |H<sub>n</sub>| is the size of the nth chunklet, and N is the number of chunklets.<br />
<br />
Intuitively, this is a weighted average of the chunklet covariances, with weight proportional to the size of the chunklet. Each chunklet, regardless of its size, is meant to approximate the mean of its class; the weighting reflects the fact that a larger chunklet estimates its class mean more reliably.<br />
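This weighted average is straightforward to compute; the sketch below (the chunklets are hypothetical) uses the fact that |H<sub>n</sub>|Cov(H<sub>n</sub>), with the MLE covariance, is just the scatter of the chunklet about its own mean:<br />

```python
import numpy as np

def chunklet_scatter(chunklets):
    """S_ch = (1/|Omega|) * sum_n |H_n| * Cov(H_n), with MLE covariances.

    chunklets: list of (n_i x d) arrays, one array per chunklet.
    """
    n_total = sum(len(h) for h in chunklets)
    d = chunklets[0].shape[1]
    s = np.zeros((d, d))
    for h in chunklets:
        centered = h - h.mean(axis=0)   # subtract the chunklet mean
        s += centered.T @ centered      # equals |H_n| * Cov(H_n)
    return s / n_total

# two hypothetical chunklets of two points each
chunks = [np.array([[0.0, 0.0], [2.0, 0.0]]),
          np.array([[1.0, 1.0], [1.0, 3.0]])]
S_ch = chunklet_scatter(chunks)
```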
<br />
The steps of the RCA algorithm are as follows:<br />
<br />
:: "1. Calculate S<sub>ch</sub>... Let r denote its effective rank (the number of singular values of S<sub>ch</sub> which are significantly larger than 0).<br />
:: 2. Compute the total covariance (scatter) matrix of the original data S<sub>T</sub>, and project the data using PCA to its r largest dimensions.<br />
:: 3. Project S<sub>ch</sub> onto the reduced dimensional space, and compute the corresponding whitening transformation W.<br />
:: 4. Apply W to the original data (in the reduced space)." [1]<br /><br />
Directions in which the data variability is mostly within-class variability are irrelevant for classification, and the computed W assigns lower weight to these directions.<br />
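The four steps above can be sketched as follows (a schematic NumPy implementation; the rank tolerance and the test data are illustrative assumptions, not taken from the paper):<br />

```python
import numpy as np

def rca(X, chunklets, rank_tol=1e-8):
    """Schematic RCA: PCA down to the effective rank of S_ch, then whitening.

    X: (n x d) data matrix; chunklets: list of (n_i x d) arrays.
    Returns the transformed data and the overall transformation matrix.
    """
    n_pts = sum(len(h) for h in chunklets)

    # Step 1: chunklet scatter matrix S_ch and its effective rank r
    S_ch = sum((h - h.mean(0)).T @ (h - h.mean(0)) for h in chunklets) / n_pts
    r = np.linalg.matrix_rank(S_ch, tol=rank_tol)

    # Step 2: PCA of the total scatter, keeping the r leading directions
    Xc = X - X.mean(0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:r].T                         # (d x r) projection matrix

    # Step 3: project S_ch and compute the whitening transform W = V Lambda^{-1/2}
    vals, vecs = np.linalg.eigh(P.T @ S_ch @ P)
    W = vecs @ np.diag(vals ** -0.5)

    # Step 4: apply W to the data in the reduced space
    return Xc @ P @ W, P @ W
```

After the transform, the chunklet scatter of the transformed data is (up to numerical precision) the identity, i.e. within-class directions have been rescaled to unit variance.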
<br />
'''Experimental Results: Face Recognition'''<br />
<br />
The authors demonstrated the performance of RCA for the task of face recognition using the yaleA database. The database contains 155 face images of 15 people; lighting conditions and facial expression are varied across images. RCA is compared with the Eigenface method (based on PCA) and the Fisherface method (based on Fisher’s Linear Discriminant) for both nearest neighbour classification and clustering-based classification. In this dataset, the data is not naturally divided into chunklets, so the authors randomly sample chunklets given the ground-truth class (for example, if an individual is represented in 10 images, two chunklets may be formed by randomly partitioning the images into two groups of 5 images).<br />
<br />
For nearest neighbour classification, RCA outperforms Eigenface but does slightly worse than Fisherface. For clustering, RCA performs better than Eigenface and comparably to Fisherface. The authors pointed out that these experimental results are encouraging as Fisherface is a supervised method.<br />
<br />
In <ref> M. Sorci,G. Antonini, and Jean-Philippe Thiran, "Fisher's discriminant and relevant component analysis for static facial expression classification."</ref>, it is shown that, in a facial expression recognition framework, RCA combined with Fisher's Linear Discriminant (FLD) yields a better classifier than RCA alone, with results comparable to an SVM.<br />
<br />
'''Experimental Results: Surveillance'''<br />
<br />
In a second experiment, the authors used surveillance video footage divided into discrete clips in which a single person is featured. The same person can appear in multiple clips, and the task was to retrieve all clips in which a query person appears. A colour histogram is used to represent a person. Sources of irrelevant variation include reflections, occlusions, and illumination. In this experiment, the data does come naturally in chunklets: each clip features a single person, so frames from the same clip form a chunklet. Figure 7 in the paper shows the results of k-nearest neighbour classification (not reproduced here for copyright reasons).<br />
<br />
== Second Paper: Bar-Hillel ''et al.'', 2003 <ref> A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall, "Learning Distance Functions using Equivalence Relations," Proc. International Conference on Machine Learning (ICML), 2003, pp. 11-18. </ref> ==<br />
<br />
In a subsequent work [2], Bar-Hillel ''et al.'' described how RCA can be shown to optimize an information theoretic criterion, and compared the performance of RCA with the approach proposed by Xing ''et al.'' [3].<br />
<br />
'''Information Maximization'''<br />
<br />
According to information theory, "when an input X is transformed into a new representation Y, we should seek to maximize the mutual information I(X, Y) between X and Y under suitable constraints" [2]. In adjustment learning, we can think of the objective to be to keep chunklet points close to each other in the transformed space. More formally:<br />
<br />
::<math>\max_{f \in F}I(X,Y) \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||y_{ji} - m_j^y||^2 \le K</math><br />
<br />
where f is a transformation function, m<sub>j</sub><sup>y</sup> is the mean of chunklet j in the transformed space, p is the total number of chunklet points, and K is a constant.<br />
<br />
To maximize I(X,Y), we can simply maximize the entropy of Y, H(Y). This is because I(X,Y) = H(Y) – H(Y|X), and H(Y|X) is constant since the transformation is deterministic. Intuitively, since the transformation is deterministic there is no uncertainty in Y if X is known. <br />
<br />
Now we would like to express H(Y) in terms of H(X). If the transformation is invertible, we have p<sub>y</sub>(y) = p<sub>x</sub>(x) / |J(x)|, where J(x) is the Jacobian of the transformation. Therefore,<br />
<br />
::<math><br />
\begin{align}<br />
H(Y) & = -\int_y p(y)\log p(y)\, dy \\<br />
& = -\int_x p(x) \log \frac{p(x)}{|J(x)|} \, dx \\<br />
& = H(X) + \langle \log |J(x)| \rangle_x<br />
\end{align}<br />
</math><br />
<br />
Assuming a linear transformation Y = AX, the Jacobian determinant is simply the constant |A|. So to maximize I(X,Y), we can maximize H(Y), and maximizing H(Y) amounts to maximizing |A|. Hence, the optimization objective can be updated as<br />
<br />
::<math>\max_A |A| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_{A^tA} \le K</math><br />
<br />
This can also be expressed in terms of the Mahalanobis distance matrix B = A<sup>t</sup>A as follows, noting that log |A| = (1/2) log |B|.<br />
<br />
::<math>\max_B |B| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \le K , \quad B > 0</math><br />
<br />
The solution to this problem is <math>B = \tfrac{K}{N} \hat{C}^{-1}</math>, where <math>\hat{C}</math> is the chunklet scatter matrix calculated in Step 1 of RCA. Thus, RCA gives the optimal Mahalanobis distance matrix up to a scale factor.<br />
<br />
<br />
'''Within-Chunklet Distance Minimization'''<br />
<br />
In addition, RCA minimizes the sum of within-chunklet squared distances. If we consider the optimization problem<br />
<br />
::<math>\min_B \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \quad s.t. \quad |B| \ge 1</math> <br />
<br />
then it can be shown that RCA once again gives the optimal Mahalanobis distance matrix up to a scale factor. This property suggests a natural comparison with Xing ''et al.''’s method, which similarly learns a distance metric based on similarity side information. Xing ''et al.''’s method assumes side information in the form of pairwise similarities and dissimilarities, and seeks to optimize<br />
<br />
::<math>\min_B \sum_{(x_1,x_2) \in S} ||x_1 - x_2||^2_B \quad s.t. \sum_{(x_1,x_2) \in D} ||x_1 - x_2||_B \ge 1 , \quad B \ge 0 </math><br />
<br />
where S contains similar pairs and D contains dissimilar pairs. Comparing to the preceding optimization problem, if all chunklets have size 2 (i.e. the chunklets are just pairwise similarities), the objective function is the same up to a scale factor.<br />
<br />
The authors compared the clustering performance of RCA with Xing ''et al.''’s method <ref> E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information", Advances in Neural Information Processing Systems, 2002. </ref> using six of the UC Irvine datasets. Clustering performance was measured using a normalized accuracy score defined as<br />
<br />
::<math>\sum_{i > j}\frac{1 \lbrace 1 \lbrace c_i = c_j \rbrace = 1 \lbrace \hat{c}_i = \hat{c}_j \rbrace \rbrace}{0.5m(m-1)}</math><br />
<br />
where 1{ } is the indicator function, <math>\hat{c}</math> is the assigned cluster, and c is the true cluster. The score may be interpreted as the probability of correctly assigning two randomly drawn points.<br />
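The score can be computed directly by counting pair agreements; the small sketch below (with hypothetical cluster labels) implements it:<br />

```python
import numpy as np

def normalized_accuracy(c, c_hat):
    """Fraction of point pairs on which c and c_hat agree about co-membership.

    Equals sum_{i>j} 1{1{c_i=c_j} = 1{c^_i=c^_j}} / (0.5 * m * (m - 1)).
    """
    c, c_hat = np.asarray(c), np.asarray(c_hat)
    same_true = c[:, None] == c[None, :]          # 1{c_i = c_j}
    same_pred = c_hat[:, None] == c_hat[None, :]  # 1{c^_i = c^_j}
    iu = np.triu_indices(len(c), k=1)             # each unordered pair once
    return np.mean(same_true[iu] == same_pred[iu])

score = normalized_accuracy([0, 0, 1, 1], [0, 1, 0, 1])
```

Averaging over the upper-triangular pairs is equivalent to dividing the pair count by 0.5m(m-1), since that is exactly the number of unordered pairs.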
<br />
Overall, RCA yielded an improvement over regular K-means and showed comparable performance to Xing ''et al.''’s method, however RCA is more computationally efficient as it works with closed-form expressions while Xing ''et al.''’s method requires iterative gradient descent.<br />
<br />
== Suggestions/Critique ==<br />
<br />
* RCA makes effective use of limited side information in the form of chunklets, however in most applications the data does not naturally come in chunklets. Indeed, in the face recognition experiments, the authors had to make use of prior information to artificially create chunklets. It may be useful if the authors provided additional examples of applications where data is naturally partitioned into chunklets, to further motivate the applicability of RCA.<br />
<br />
* RCA also assumes equal class covariances, which might limit its performance on many real-world datasets.<br />
<br />
* In the UC Irvine experiments, RCA shows similar performance to Xing ''et al.''’s method, but the authors noted that RCA is more computationally efficient. While they make a sensible logical argument (iterative gradient descent tends to be computationally expensive), providing experimental running times may help support and quantify this claim.<br />
<br />
<br />
====Why Equal Variances for Chunklets====<br />
<br />
In [2] the authors suppose that <math> C_{m} </math> is the random variable describing the distribution of data in class <math> m </math>; then, assuming equal class covariances, they calculate <math> S_{ch} </math> as described above.<br><br />
<br />
Further, suppose that data in class <math> m </math> are dependent on another source of variation <math> G </math> besides the class characteristics (<math> G </math> can be global variation or sensor characteristics). Now the random variable for <math> m </math>th class is <math> X=C_{m}+G </math>, where global impact (<math> G </math>) is the same for all classes, <math> G </math> is independent of <math> C_{m} </math> and global variation is larger than class variation (<math> \Sigma_{m}<\Sigma_{G} </math>). <br><br />
<br />
In this situation the covariance for class <math> m </math> is <math> \Sigma_{m}+\Sigma_{G} </math>, which by assumption is dominated by <math> \Sigma_{G} </math>. Since <math> \Sigma_{G} </math> is common to all classes, the total covariances are approximately equal across classes, so the equal-covariance assumption approximately holds.<br><br />
<br />
== Kernel RCA==<br />
<br />
Although RCA has significant computational and technical advantages, there are real-world situations that it fails to handle; that is, RCA comes with some restrictions:<br><br />
<br />
(i) RCA only considers linear transformations, and fails for nonlinear ones (even simple ones);<br><br />
(ii) since RCA acts in the input space, its number of parameters depends on the dimensionality of the feature vectors;<br><br />
(iii) RCA requires a vectorial representation of the data, which some kinds of data, such as protein sequences, do not naturally have.<br><br />
<br />
To overcome these restrictions, Tsang and colleagues (2005)<ref> Tsang, I. W. and Colleagues; Kernel Relevant Component Analysis For Distance Metric Learning. International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 2005 </ref> suggested using a kernel in RCA and showed how RCA can be kernelized.<br />
<br />
===Kernelizing RCA===<br />
For <math>k</math> given chunklets, each containing <math>n_{i}</math> patterns <math>\left\{x_{i,1},...,x_{i,n_{i}} \right\}</math>, the covariance matrix of the centered patterns is as follows:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\bar{x}_{i}\right)\left(x_{i,j}-\bar{x}_{i}\right)^{'} </math><br />
<br />
and the associated whitening transform is<br />
<br />
<math>x\stackrel{}{\rightarrow}C^{-\frac{1}{2}}x </math><br />
<br />
Now let <math>X=\left[x_{1,1},x_{1,2},...,x_{1,n_{1}},...,x_{k,1},...,x_{k,n_{k}} \right]</math> be the matrix whose columns are all <math>n</math> patterns; then <math>C</math> can be written as:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\frac{1}{n_{i}}X1_{i}\right)\left(x_{i,j}-\frac{1}{n_{i}}X1_{i}\right)^{'} </math><br />
<br />
where <math>1_{i}</math> is an <math>n \times 1</math> vector such that:<br />
<br />
<math> [1_{i}]_{j}= \left\{\begin{matrix} <br />
1 & \text{pattern } j \in \text{chunklet } i \\ <br />
0 & \text{otherwise} \end{matrix}\right.</math><br />
<br />
and <math>I_{i}=diag\left(1_{i}\right)</math>.<br />
<br />
Using the above notation, <math>C</math> can be simplified to the form <math>C=\frac{1}{n}XHX^{'}</math><br />
<br />
where <math> H=\sum_{i=1}^{k}\left(I_{i}-\frac{1}{n_{i}}1_{i}1_{i}^{'}\right)</math><br />
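This identity can be checked numerically; the sketch below (with random hypothetical chunklets, patterns stored as the columns of X) compares the chunklet-centered form of C with the factored form XHX'/n:<br />

```python
import numpy as np

rng = np.random.RandomState(0)
# two hypothetical chunklets with patterns as columns, stacked into X (d x n)
chunks = [rng.randn(2, 3), rng.randn(2, 4)]
X = np.hstack(chunks)
n = X.shape[1]

# direct form: sum of chunklet-centered scatters, divided by n
C_direct = sum((h - h.mean(1, keepdims=True)) @ (h - h.mean(1, keepdims=True)).T
               for h in chunks) / n

# factored form: C = (1/n) X H X' with H = sum_i (I_i - (1/n_i) 1_i 1_i')
H = np.zeros((n, n))
start = 0
for h in chunks:
    ni = h.shape[1]
    one = np.zeros((n, 1))
    one[start:start + ni] = 1.0                   # indicator vector 1_i
    H += np.diag(one.ravel()) - one @ one.T / ni  # I_i - (1/n_i) 1_i 1_i'
    start += ni
C_factored = X @ H @ X.T / n

print(np.allclose(C_direct, C_factored))  # True
```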
<br />
To avoid singularity, for a small <math> \epsilon>0 </math> let <math>\hat{C}=C+\epsilon I</math>; then, by the matrix inversion lemma, the inverse of <math>\hat{C}</math> is<br />
<br />
<math>\hat{C}^{-1}=\frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}X^{'}</math><br />
<br />
Therefore the inner product of the transformed <math>x</math> and <math>y</math> is<br />
<br />
<math> \left(\hat{C}^{-\frac{1}{2}}x\right)^{'} \left(\hat{C}^{-\frac{1}{2}}y\right)= x^{'} \hat{C}^{-1} y= x^{'} \left( \frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}X^{'} \right) y </math><br />
<br />
Now if RCA operates in a feature space <math> \mathcal{F}</math> with corresponding kernel <math> l </math>, then the inner product between the nonlinear transformations <math> \varphi (x)</math> and <math> \varphi (y)</math> after running RCA in <math> \mathcal{F}</math> is:<br />
<br />
<math> \tilde{l}(x,y)=\frac{1}{\epsilon}l(x,y)-l_{x}^{'} \left( \frac{1}{n \epsilon^{2}}H \left( I+\frac{1}{n \epsilon}LH \right)^{-1} \right) l_{y} </math><br />
<br />
where <math>L=\left[ l(x_{i},x_{j}) \right]_{ij}</math>, <math> l_{x}=\left[ l(x_{1,1},x),...,l(x_{k,n_{k}},x) \right]^{'}</math><br />
and <math> l_{y}=\left[ l(x_{1,1},y),...,l(x_{k,n_{k}},y) \right]^{'}</math><br />
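As a sanity check (a sketch, not the authors' code; data, kernel, and <math> \epsilon </math> are illustrative), with a linear kernel the transformed kernel must reproduce <math>x^{'}\hat{C}^{-1}y</math> computed explicitly:<br />

```python
import numpy as np

def kernel_rca(L, H, n, eps):
    """Return l~(x, y) as a function of (l_x, l_y, l(x, y))."""
    M = H @ np.linalg.inv(np.eye(n) + L @ H / (n * eps)) / (n * eps ** 2)
    return lambda l_x, l_y, k_xy: k_xy / eps - l_x @ M @ l_y

rng = np.random.RandomState(0)
d, n, eps = 3, 5, 0.5
X = rng.randn(d, n)                       # hypothetical patterns as columns
one = np.ones((n, 1))
H = np.eye(n) - one @ one.T / n           # a single chunklet holding all patterns
l_tilde = kernel_rca(X.T @ X, H, n, eps)  # linear kernel: L = X'X

x, y = rng.randn(d), rng.randn(d)
C_hat = X @ H @ X.T / n + eps * np.eye(d)
got = l_tilde(X.T @ x, X.T @ y, x @ y)
expected = x @ np.linalg.inv(C_hat) @ y
print(np.isclose(got, expected))  # True
```

For a nonlinear kernel the same function applies unchanged, with L the kernel matrix over the chunklet patterns and l<sub>x</sub>, l<sub>y</sub> the kernel evaluations against the new points.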
<br />
== Experimental Results: Application to Clustering == <br />
<br />
The main goal of this method is to utilize side information in the form of equivalence relations to improve the performance of unsupervised learning techniques. To test the RCA algorithm described above, and to allow comparison with the results of Xing et al., six data sets from the UC Irvine repository were used, the same data sets used in <ref> E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information", Advances in Neural Information Processing Systems, 2002. </ref>. As in that paper, a set S of pairwise similarity constraints is given; with this side information, the following clustering algorithms were performed:<br />
<br />
1. K-means using the default Euclidean metric (i.e. using no side information).<br />
<br />
2. Constrained K-means: K-means subject to points <math> \mathbf{(x_i,x_j) \in S } </math> always being assigned to the same cluster (Wagstaff et al. ,2001).<br />
<br />
3. Constrained K-means + metric proposed by (Xing et al., 2002): Constrained K-means using the distance metric proposed in (Xing et al., 2002), which is learned from S.<br />
<br />
4. Constrained K-means + RCA: Constrained K-means using the RCA distance metric learned from S.<br />
<br />
5. EM: Expectation Maximization of a Gaussian Mixture model (using no side-information).<br />
<br />
6. Constrained EM: EM using side-information in the form of equivalence constraints (Hertz et al., 2002; Shental et al., 2003), using the RCA distance metric as the initial metric.<br />
<br />
Following (Xing et al., 2002), a normalized accuracy score is used to evaluate the partitions obtained by the six clustering algorithms listed above. More specifically, in the case of 2-cluster data the accuracy measure can be written as:<br />
<br />
<center><math>\mathbf{\sum_{i>j}\frac{1\{1\{c_i=c_j\}=1\{\hat{c}_i=\hat{c}_j\}\}}{0.5m(m-1)}}</math></center><br />
<br />
== References ==<br />
<references/></div>Amirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=relevant_Component_Analysis&diff=3774relevant Component Analysis2009-08-02T18:13:36Z<p>Amir: /* Experimental Results: Application to Clustering */</p>
<hr />
<div>== First paper: Shental ''et al.'', 2002 <ref>N. Shental, T. Hertz, D. Weinshall, and M. Pavel, "Adjustment Learning and Relevant Component Analysis," Proc. European Conference on Computer Vision (ECCV), 2002, pp. 776-790.</ref> ==<br />
<br />
Irrelevant data variability often causes difficulties in classification and clustering tasks. For example, when data variability is dominated by environment conditions, such as global illumination, nearest-neighbour classification in the original feature space may be very unreliable. The goal of Relevant Component Analysis (RCA) is to find a transformation that amplifies relevant variability and suppresses irrelevant variability.<br />
<br />
:: ''Definition of irrelevant variability:'' We say that data variability is correlated with a specific task "if the removal of this variability from the data deteriorates (on average) the results of clustering or retrieval" [1]. Variability is irrelevant if it is "maintained in the data" but "not correlated with the specific task" [1].<br />
<br />
To achieve this goal, Shental ''et al.'' introduced the idea of ''chunklets'' – "small sets of data points, in which the class label is constant, but unknown" [1]. As we will see, chunklets allow irrelevant variability to be suppressed without needing fully labelled training data. Since the data come unlabelled, the chunklets "must be defined naturally by the data": for example, in speaker identification, "short utterances of speech are likely to come from a single speaker" [1]. The authors coin the term ''adjustment learning'' to describe learning using chunklets; adjustment learning can be viewed as falling somewhere between unsupervised learning and supervised learning.<br />
<br />
Relevant Component Analysis tries to find a linear transformation W of the feature space such that the effect of irrelevant variability is reduced in the transformed space. That is, we wish to rescale the feature space and reduce the weights of irrelevant directions. The main premise of RCA is that we can reduce irrelevant variability by reducing the within-class variability. Intuitively, a direction which exhibits high variability among samples of the same class is unlikely to be useful for classification or clustering. <br />
<br />
RCA assumes that the class covariances are all equal. If we allow this assumption, it makes sense to rescale the feature space using a whitening transformation based on the common class covariance Σ. This gives the familiar transformation W = VΛ<sup>-1/2</sup>, where V and Λ can be found by the singular value decomposition of Σ.<br />
<br />
With labelled data estimating Σ is straightforward, but in RCA labelled data is not available and an approximation is calculated using chunklets. The ''chunklet scatter matrix'' is calculated by<br />
<br />
:: <math>S_{ch} = \frac{1}{|\Omega|}\sum_{n=1}^N|H_n|Cov(H_n)</math><br />
<br />
where |Ω| is the size of the data set, H<sub>n</sub> is the nth chunklet, |H<sub>n</sub>| is the size of the nth chunklet, and N is the number of chunklets.<br />
<br />
Intuitively, this is a weighted average of the chunklet covariances, with weight proportional to the size of the chunklet. Here, we are seeking a chunklet which makes a good mean value approximation of a class, regardless of the chunklet’s size. However the size matters as any increasing in the size would increase the likelihood of well-done approximation of the class mean.<br />
<br />
The steps of the RCA algorithm are as follows:<br />
<br />
:: "1. Calculate S<sub>ch</sub>... Let r denote its effective rank (the number of singular values of S<sub>ch</sub> which are significantly larger than 0).<br />
:: 2. Compute the total covariance (scatter) matrix of the original data S<sub>T</sub>, and project the data using PCA to its r largest dimensions.<br />
:: 3. Project S<sub>ch</sub> onto the reduced dimensional space, and compute the corresponding whitening transformation W.<br />
:: 4. Apply W to the original data (in the reduced space)." [1]<br /><br />
Those directions in which the data variability is due to class variability are irrelevant for classification and the computed W assigns lower weight to these directions.<br />
<br />
'''Experimental Results: Face Recognition'''<br />
<br />
The authors demonstrated the performance of RCA for the task of face recognition using the yaleA database. The database contains 155 face images of 15 people; lighting conditions and facial expression are varied across images. RCA is compared with the Eigenface method (based on PCA) and the Fisherface method (based on Fisher’s Linear Discriminant) for both nearest neighbour classification and clustering-based classification. In this dataset, the data is not naturally divided into chunklets, so the authors randomly sample chunklets given the ground-truth class (for example, if an individual is represented in 10 images, two chunklets may be formed by randomly partitioning the images into two groups of 5 images.) <br />
<br />
For nearest neighbour classification, RCA outperforms Eigenface but does slightly worse than Fisherface. For clustering, RCA performs better than Eigenface and comparably to Fisherface. The authors pointed out that these experimental results are encouraging as Fisherface is a supervised method.<br />
<br />
In <ref> M. Sorci,G. Antonini, and Jean-Philippe Thiran, "Fisher's discriminant and relevant component analysis for static facial expression classification."</ref>, it's shown that RCA in combination with FLD results in better classifier in the context of facial expression recognition framework as compared to RCA alone. This combination has results comparable to SVM.<br />
<br />
'''Experimental Results: Surveillance'''<br />
<br />
In a second experiment, the authors used surveillance video footage divided into discrete clips in which a single person is featured. The same person can appear in multiple clips, and the task was to retrieve all clips in which a query person appears. A colour histogram is used to represent a person. Sources of irrelevant variation include reflections, occlusions, and illumination. In this experiment, the data does come naturally in chunklets: each clip features a single person, so frames in the same clip from a chunklet. Figure 7 in the paper shows the results of k-nearest neighbour classification (not reproduced here for copyright reasons).<br />
<br />
== Second Paper: Bar-Hillel ''et al.'', 2003 <ref> A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall, "Learning Distance Functions using Equivalence Relations," Proc. International Conference on Machine Learning (ICML), 2003, pp. 11-18. </ref> ==<br />
<br />
In a subsequent work [2], Bar-Hillel ''et al.'' described how RCA can be shown to optimize an information theoretic criterion, and compared the performance of RCA with the approach proposed by Xing ''et al.'' [3].<br />
<br />
'''Information Maximization'''<br />
<br />
According to information theory, "when an input X is transformed into a new representation Y, we should seek to maximize the mutual information I(X, Y) between X and Y under suitable constraints" [2]. In adjustment learning, we can think of the objective to be to keep chunklet points close to each other in the transformed space. More formally:<br />
<br />
::<math>\max_{f \in F}I(X,Y) \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||y_{ji} - m_j^y||^2 \le K</math><br />
<br />
where f is a transformation function, m<sub>j</sub><sup>y</sup> is the mean of chunklet j in the transformed space, p is the total number of chunklet points, and K is a constant.<br />
<br />
To maximize I(X,Y), we can simply maximize the entropy of Y, H(Y). This is because I(X,Y) = H(Y) – H(Y|X), and H(Y|X) is constant since the transformation is deterministic. Intuitively, since the transformation is deterministic there is no uncertainty in Y if X is known. <br />
<br />
Now we would like to express H(Y) in terms of H(X). If the transformation is invertible, we have p<sub>y</sub>(y) = p<sub>x</sub>(x) / |J(x)|, where J(x) is the Jacobian of the transformation. Therefore,<br />
<br />
::<math><br />
\begin{align}<br />
H(Y) & = -\int_y p(y)\log p(y)\, dy \\<br />
& = -\int_x p(x) \log \frac{p(x)}{|J(x)|} \, dx \\<br />
& = H(X) + \langle \log |J(x)| \rangle_x<br />
\end{align}<br />
</math><br />
<br />
Assuming a linear transformation Y = AX, the Jacobian is simply equal to the constant |A|. So to maximize I(X,Y), we can maximize H(Y), and maximizing H(Y) amounts to maximizing |A|. Hence, the optimization objective can be updated as<br />
<br />
::<math>\max_A |A| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_{A^tA} \le K</math><br />
<br />
This can also be expressed in terms of the Mahalanobis distance matrix B = A<sup>t</sup>A as follows, noting that log |A| = (1/2) log |B|.<br />
<br />
::<math>\max_B |B| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \le K , \quad B > 0</math><br />
<br />
The solution to this problem is <math>B = \tfrac{K}{N} \hat{C}^{-1}</math>, where <math>\hat{C}</math> is the chunklet scatter matrix calculated in Step 1 of RCA. Thus, RCA gives the optimal Mahalanobis distance matrix up to a scale factor.<br />
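The closed-form solution can be sketched in a few lines of NumPy; the chunklet data and the function name below are illustrative assumptions, not the authors' code.<br />

```python
import numpy as np

def rca_metric(chunklets):
    """Estimate the chunklet scatter matrix C_hat and the RCA
    Mahalanobis matrix B (up to scale) from a list of chunklets.

    chunklets: list of (n_i, d) arrays; points in the same array are
    believed to share an (unknown) class label.
    """
    p = sum(len(ch) for ch in chunklets)   # total number of chunklet points
    d = chunklets[0].shape[1]
    C_hat = np.zeros((d, d))
    for ch in chunklets:                   # size-weighted average of
        centered = ch - ch.mean(axis=0)    # within-chunklet covariances
        C_hat += centered.T @ centered
    C_hat /= p
    B = np.linalg.inv(C_hat)               # optimal metric up to scale
    return C_hat, B

# Two small chunklets in 2-D (synthetic data)
rng = np.random.default_rng(0)
chunklets = [rng.normal(size=(5, 2)), rng.normal(size=(4, 2))]
C_hat, B = rca_metric(chunklets)
```

Applying the whitening transform <math>W=\hat{C}^{-1/2}</math> to the data is then equivalent to measuring Euclidean distances under the learned metric.<br />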
<br />
<br />
'''Within-Chunklet Distance Minimization'''<br />
<br />
In addition, RCA minimizes the sum of within-chunklet squared distances. If we consider the optimization problem<br />
<br />
::<math>\min_B \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \quad s.t. \quad |B| \ge 1</math> <br />
<br />
then it can be shown that RCA once again gives the optimal Mahalanobis distance matrix up to a scale factor. This property suggests a natural comparison with Xing ''et al.''’s method, which similarly learns a distance metric based on similarity side information. Xing ''et al.''’s method assumes side information in the form of pairwise similarities and dissimilarities, and seeks to optimize<br />
<br />
::<math>\min_B \sum_{(x_1,x_2) \in S} ||x_1 - x_2||^2_B \quad s.t. \sum_{(x_1,x_2) \in D} ||x_1 - x_2||_B \ge 1 , \quad B \ge 0 </math><br />
<br />
where S contains similar pairs and D contains dissimilar pairs. Comparing to the preceding optimization problem, if all chunklets have size 2 (i.e. the chunklets are just pairwise similarities), the objective function is the same up to a scale factor.<br />
<br />
The authors compared the clustering performance of RCA with Xing ''et al.''’s method <ref> E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information", Advances in Neural Information Processing Systems, 2002. </ref> using six of the UC Irvine datasets. Clustering performance was measured using a normalized accuracy score defined as<br />
<br />
::<math>\sum_{i > j}\frac{1 \lbrace 1 \lbrace c_i = c_j \rbrace = 1 \lbrace \hat{c}_i = \hat{c}_j \rbrace \rbrace}{0.5m(m-1)}</math><br />
<br />
where 1{ } is the indicator function, <math>\hat{c}</math> is the assigned cluster, and c is the true cluster. The score may be interpreted as the probability of correctly assigning two randomly drawn points.<br />
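The score is straightforward to compute; a minimal sketch (the function name is ours):<br />

```python
def normalized_accuracy(c_true, c_pred):
    """Fraction of point pairs (i > j) on which the true and predicted
    clusterings agree about being in the same cluster."""
    m = len(c_true)
    agree = sum(
        (c_true[i] == c_true[j]) == (c_pred[i] == c_pred[j])
        for i in range(m) for j in range(i)
    )
    return agree / (0.5 * m * (m - 1))

normalized_accuracy([0, 0, 1, 1], [0, 0, 1, 1])   # perfect agreement -> 1.0
```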
<br />
Overall, RCA yielded an improvement over regular K-means and showed comparable performance to Xing ''et al.''’s method; however, RCA is more computationally efficient, as it works with closed-form expressions while Xing ''et al.''’s method requires iterative gradient descent.<br />
<br />
== Suggestions/Critique ==<br />
<br />
* RCA makes effective use of limited side information in the form of chunklets; however, in most applications the data does not naturally come in chunklets. Indeed, in the face recognition experiments, the authors had to make use of prior information to artificially create chunklets. It would be useful if the authors provided additional examples of applications where data is naturally partitioned into chunklets, to further motivate the applicability of RCA.<br />
<br />
* RCA also assumes equal class covariances, which might limit its performance on many real-world datasets.<br />
<br />
* In the UC Irvine experiments, RCA shows similar performance to Xing ''et al.''’s method, but the authors noted that RCA is more computationally efficient. While they make a sensible logical argument (iterative gradient descent tends to be computationally expensive), providing experimental running times may help support and quantify this claim.<br />
<br />
<br />
====Why Equal Variances for Chunklets====<br />
<br />
In [2] the authors suppose that <math> C_{m} </math> is the random variable describing the distribution of data in class <math> m </math>, and then, assuming equal class covariances, they calculate <math> S_{ch} </math> as mentioned above.<br><br />
<br />
Further, suppose that data in class <math> m </math> depend on another source of variation <math> G </math> besides the class characteristics (<math> G </math> can be global variation or sensor characteristics). The random variable for the <math> m </math>th class is then <math> X=C_{m}+G </math>, where the global effect <math> G </math> is the same for all classes, <math> G </math> is independent of <math> C_{m} </math>, and the global variation is larger than the class variation (<math> \Sigma_{m}<\Sigma_{G} </math>). <br><br />
<br />
In this situation the covariance of class <math> m </math> is <math> \Sigma_{m}+\Sigma_{G} </math>, which by assumption is dominated by <math> \Sigma_{G} </math>. The observed class covariances are therefore all approximately equal to <math> \Sigma_{G} </math>, which justifies the equal-covariance assumption.<br><br />
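A small numerical illustration of this argument (the covariance values are made up for the example): when <math> \Sigma_{G} </math> dominates, the observed class covariances <math> \Sigma_{m}+\Sigma_{G} </math> are nearly indistinguishable.<br />

```python
import numpy as np

# Two classes with different small covariances, plus a shared dominant
# global source G (illustrative numbers, not from the paper).
Sigma_1 = np.diag([0.1, 0.2])      # class-1 covariance
Sigma_2 = np.diag([0.3, 0.1])      # class-2 covariance
Sigma_G = np.diag([10.0, 12.0])    # global covariance, dominates both

total_1 = Sigma_1 + Sigma_G        # observed covariance of class 1
total_2 = Sigma_2 + Sigma_G        # observed covariance of class 2

# The relative difference between the observed class covariances is
# small, so treating them as equal is a mild approximation.
rel_diff = np.linalg.norm(total_1 - total_2) / np.linalg.norm(Sigma_G)
```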
<br />
== Kernel RCA==<br />
<br />
Although RCA has significant computational and technical advantages, there are situations in real problems that it cannot handle; that is, RCA comes with some restrictions. <br><br />
<br />
(i)- RCA only considers linear transformations and fails for nonlinear transformations (even simple ones);<br><br />
(ii)- since RCA acts in the input space, its number of parameters depends on the dimensionality of the feature vectors;<br><br />
(iii)- RCA requires a vectorial representation of the data, which may not be natural for some kinds of data, such as protein sequences.<br><br />
<br />
To overcome these restrictions, Tsang and colleagues (2005)<ref> Tsang, I. W. and Colleagues; Kernel Relevant Component Analysis For Distance Metric Learning. International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 2005 </ref> suggested using kernels in RCA and showed how RCA can be kernelized.<br />
<br />
===Kernelizing RCA===<br />
For <math>k</math> given chunklets, each containing <math>n_{i}</math> patterns <math>\left\{x_{i,1},...,x_{i,n_{i}} \right\}</math>, the covariance matrix of the centered patterns is as follows:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\bar{x}_{i}\right)\left(x_{i,j}-\bar{x}_{i}\right)^{'} </math><br />
<br />
and the associated whitening transform is<br />
<br />
<math>x \rightarrow C^{-\frac{1}{2}}x </math><br />
<br />
Now let <math>X=\left[x_{1,1},x_{1,2},...,x_{1,n_{1}},...,x_{k,1},...,x_{k,n_{k}} \right]</math> be the matrix whose columns are the <math>n</math> patterns; then C can be written as:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\frac{1}{n_{i}}X1_{i}\right)\left(x_{i,j}-\frac{1}{n_{i}}X1_{i}\right)^{'} </math><br />
<br />
where <math>1_{i}</math> is an <math>n \times 1</math> indicator vector such that:<br />
<br />
<math> [1_{i}]_{j}= \left\{\begin{matrix} <br />
1 & \text{pattern } j \in \text{chunklet } i \\ <br />
0 & \text{otherwise} \end{matrix}\right.</math><br />
<br />
and <math>I_{i}=diag\left(1_{i}\right)</math>.<br />
<br />
Using the above notation, C can be simplified to the form <math>C=\frac{1}{n}XHX^{'}</math><br />
<br />
where <math> H=\sum_{i=1}^{k}\left(I_{i}-\frac{1}{n_{i}}1_{i}1_{i}^{'}\right)</math><br />
<br />
To deal with possible singularity, for small <math> \epsilon </math> let <math>\hat{C}=C+\epsilon I</math>; the inverse of <math>\hat{C}</math> is then <br />
<br />
<math>\hat{C}^{-1}=\frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}X^{'}</math><br />
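This closed form is a Woodbury-type matrix-inversion identity, and it can be checked numerically against direct inversion; the chunklet sizes and data below are hypothetical.<br />

```python
import numpy as np

rng = np.random.default_rng(1)
d, eps = 3, 0.1
sizes = [4, 3, 2]                     # chunklet sizes; n = 9 patterns
n = sum(sizes)
X = rng.normal(size=(d, n))           # patterns as columns, chunklets contiguous

# H = sum_i (I_i - (1/n_i) 1_i 1_i'), block-diagonal over the chunklets
H = np.zeros((n, n))
start = 0
for n_i in sizes:
    idx = slice(start, start + n_i)
    H[idx, idx] = np.eye(n_i) - np.ones((n_i, n_i)) / n_i
    start += n_i

C_hat = X @ H @ X.T / n + eps * np.eye(d)

# Structured inverse:
# (1/eps)I - (1/(n eps^2)) X H (I + (1/(n eps)) X'X H)^{-1} X'
inner = np.linalg.inv(np.eye(n) + (X.T @ X @ H) / (n * eps))
C_inv = np.eye(d) / eps - X @ H @ inner @ X.T / (n * eps**2)
```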
<br />
Therefore the inner product between the transformed <math>x</math> and <math>y</math> is <br />
<br />
<math> \left(\hat{C}^{-\frac{1}{2}}x\right)^{'} \left(\hat{C}^{-\frac{1}{2}}y\right)= x^{'} \hat{C}^{-1} y= x^{'} \left( \frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}X^{'} \right) y </math><br />
<br />
Now if RCA operates in a feature space <math> \mathcal{F}</math> with corresponding kernel <math> l </math>, then the inner product between the nonlinear transformations <math> \varphi (x)</math> and <math> \varphi (y)</math> after running RCA in <math> \mathcal{F}</math> is:<br />
<br />
<math> \tilde{l}(x,y)=\frac{1}{\epsilon}l(x,y)-l_{x}^{'} \left( \frac{1}{n \epsilon^{2}}H \left( I+\frac{1}{n \epsilon}LH \right)^{-1} \right) l_{y} </math><br />
<br />
where <math>L=\left[ l(x_{i},x_{j}) \right]_{ij}</math>, <math> l_{x}=\left[ l(x_{1,1},x),...,l(x_{k,n_{k}},x) \right]^{'}</math><br />
and <math> l_{y}=\left[ l(x_{1,1},y),...,l(x_{k,n_{k}},y) \right]^{'}</math><br />
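As a sanity check, with the linear kernel <math> l(x,y)=x^{'}y </math> the kernelized expression should reproduce the input-space value <math> x^{'}\hat{C}^{-1}y </math>; a sketch with made-up data:<br />

```python
import numpy as np

rng = np.random.default_rng(2)
d, eps = 3, 0.1
sizes = [4, 3]                        # two hypothetical chunklets
n = sum(sizes)
X = rng.normal(size=(d, n))           # patterns as columns

H = np.zeros((n, n))                  # H as defined above, block-diagonal
start = 0
for n_i in sizes:
    idx = slice(start, start + n_i)
    H[idx, idx] = np.eye(n_i) - np.ones((n_i, n_i)) / n_i
    start += n_i

x, y = rng.normal(size=d), rng.normal(size=d)

# Kernel-RCA inner product with the linear kernel l(u, v) = u'v
L = X.T @ X                           # kernel matrix
l_x, l_y = X.T @ x, X.T @ y           # kernel evaluations against x and y
inner = np.linalg.inv(np.eye(n) + L @ H / (n * eps))
l_tilde = x @ y / eps - l_x @ (H @ inner / (n * eps**2)) @ l_y

# Direct input-space computation x' C_hat^{-1} y
C_hat = X @ H @ X.T / n + eps * np.eye(d)
direct = x @ np.linalg.inv(C_hat) @ y
```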
<br />
== Experimental Results: Application to Clustering == <br />
<br />
The main goal of this method is to utilize side information in the form of equivalence relations to improve the performance of unsupervised learning techniques. To test the RCA algorithm described above, and to allow comparison with Xing et al., the authors used six data sets from the UC Irvine repository which were used in <ref> E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information", Advances in Neural Information Processing Systems, 2002. </ref>. As in that paper, a set S of pairwise similarity constraints is given; with this data set, the following clustering algorithms were performed:<br />
<br />
1. K-means using the default Euclidean metric (i.e. using no side information).<br />
<br />
2. Constrained K-means: K-means subject to points <math> \mathbf{(x_i,x_j) \in S } </math> always being assigned to the same cluster (Wagstaff et al., 2001).<br />
<br />
3. Constrained K-means + metric proposed by (Xing et al., 2002): Constrained K-means using the distance metric proposed in (Xing et al., 2002), which is learned from S.<br />
<br />
4. Constrained K-means + RCA: Constrained K-means using the RCA distance metric learned from S.<br />
<br />
5. EM: Expectation Maximization of a Gaussian Mixture model (using no side-information).<br />
<br />
6. Constrained EM: EM using side-information in the form of equivalence constraints (Hertz et al., 2002; Shental et al., 2003), using the RCA distance metric as the initial metric. <br />
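The same-cluster constraint used in algorithms 2, 3 and 4 amounts to assigning each group of must-linked points jointly; a minimal sketch of one assignment step (the data, grouping, and names are illustrative, not from the cited papers):<br />

```python
import numpy as np

def constrained_assign(X, centers, groups):
    """One assignment step of constrained K-means: each group of
    must-linked points is assigned jointly to the centre minimizing the
    summed squared distance (singleton groups behave as plain K-means).

    X: (m, d) data; centers: (k, d); groups: index lists covering X.
    """
    labels = np.empty(len(X), dtype=int)
    for g in groups:
        # summed squared distance of the whole group to each centre
        cost = ((X[g][:, None, :] - centers[None, :, :]) ** 2).sum(axis=(0, 2))
        labels[g] = cost.argmin()
    return labels

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [2.4, 2.4]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
# points 2 and 4 are constrained to share a cluster; alone, point 4 is
# closer to centre 0, but the constraint pulls it to centre 1 with point 2
labels = constrained_assign(X, centers, [[0], [1], [2, 4], [3]])
```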
<br />
Following (Xing et al., 2002), a normalized accuracy score is used to evaluate the partitions obtained by the six clustering algorithms listed above. More specifically, in the case of 2-cluster data the accuracy measure used can be written as:<br />
<br />
<center><math>\sum_{i > j}\frac{1 \lbrace 1 \lbrace c_i = c_j \rbrace = 1 \lbrace \hat{c}_i = \hat{c}_j \rbrace \rbrace}{0.5m(m-1)}</math></center><br />
<br />
== References ==<br />
<references/></div>
<hr />
<div>== First paper: Shental ''et al.'', 2002 <ref>N. Shental, T. Hertz, D. Weinshall, and M. Pavel, "Adjustment Learning and Relevant Component Analysis," Proc. European Conference on Computer Vision (ECCV), 2002, pp. 776-790.</ref> ==<br />
<br />
Irrelevant data variability often causes difficulties in classification and clustering tasks. For example, when data variability is dominated by environment conditions, such as global illumination, nearest-neighbour classification in the original feature space may be very unreliable. The goal of Relevant Component Analysis (RCA) is to find a transformation that amplifies relevant variability and suppresses irrelevant variability.<br />
<br />
:: ''Definition of irrelevant variability:'' We say that data variability is correlated with a specific task "if the removal of this variability from the data deteriorates (on average) the results of clustering or retrieval" [1]. Variability is irrelevant if it is "maintained in the data" but "not correlated with the specific task" [1].<br />
<br />
To achieve this goal, Shental ''et al.'' introduced the idea of ''chunklets'' – "small sets of data points, in which the class label is constant, but unknown" [1]. As we will see, chunklets allow irrelevant variability to be suppressed without needing fully labelled training data. Since the data come unlabelled, the chunklets "must be defined naturally by the data": for example, in speaker identification, "short utterances of speech are likely to come from a single speaker" [1]. The authors coin the term ''adjustment learning'' to describe learning using chunklets; adjustment learning can be viewed as falling somewhere between unsupervised learning and supervised learning.<br />
<br />
Relevant Component Analysis tries to find a linear transformation W of the feature space such that the effect of irrelevant variability is reduced in the transformed space. That is, we wish to rescale the feature space and reduce the weights of irrelevant directions. The main premise of RCA is that we can reduce irrelevant variability by reducing the within-class variability. Intuitively, a direction which exhibits high variability among samples of the same class is unlikely to be useful for classification or clustering. <br />
<br />
RCA assumes that the class covariances are all equal. If we allow this assumption, it makes sense to rescale the feature space using a whitening transformation based on the common class covariance Σ. This gives the familiar transformation W = VΛ<sup>-1/2</sup>, where V and Λ can be found by the singular value decomposition of Σ.<br />
<br />
With labelled data estimating Σ is straightforward, but in RCA labelled data is not available and an approximation is calculated using chunklets. The ''chunklet scatter matrix'' is calculated by<br />
<br />
:: <math>S_{ch} = \frac{1}{|\Omega|}\sum_{n=1}^N|H_n|Cov(H_n)</math><br />
<br />
where |Ω| is the size of the data set, H<sub>n</sub> is the nth chunklet, |H<sub>n</sub>| is the size of the nth chunklet, and N is the number of chunklets.<br />
<br />
Intuitively, this is a weighted average of the chunklet covariances, with weight proportional to the size of the chunklet. Ideally each chunklet, regardless of its size, provides a good approximation of its class mean; in practice, larger chunklets estimate the class mean more reliably, which motivates the size-proportional weighting.<br />
<br />
The steps of the RCA algorithm are as follows:<br />
<br />
:: "1. Calculate S<sub>ch</sub>... Let r denote its effective rank (the number of singular values of S<sub>ch</sub> which are significantly larger than 0).<br />
:: 2. Compute the total covariance (scatter) matrix of the original data S<sub>T</sub>, and project the data using PCA to its r largest dimensions.<br />
:: 3. Project S<sub>ch</sub> onto the reduced dimensional space, and compute the corresponding whitening transformation W.<br />
:: 4. Apply W to the original data (in the reduced space)." [1]<br /><br />
Directions in which the data variability is due to within-class variability are irrelevant for classification, and the computed W assigns lower weight to these directions.<br /><br />
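As a concrete illustration, the four steps can be sketched in NumPy (a minimal sketch, not the authors' code; the `rca` helper name and the effective-rank threshold are our assumptions):<br />

```python
import numpy as np

def rca(X, chunklets, r=None):
    """Minimal RCA sketch. X: (n_samples, d) data matrix;
    chunklets: list of index arrays, each assumed to be single-class."""
    n = X.shape[0]
    # Step 1: chunklet scatter matrix, a size-weighted average of chunklet covariances
    S_ch = sum(len(idx) * np.cov(X[idx].T, bias=True) for idx in chunklets) / n
    if r is None:
        s = np.linalg.svd(S_ch, compute_uv=False)
        r = int((s > 1e-8 * s[0]).sum())   # effective rank (threshold is a choice)
    # Step 2: PCA of the total scatter; keep the r largest dimensions
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Xr = Xc @ Vt[:r].T
    # Step 3: project S_ch and compute the whitening transform W = V Lambda^{-1/2}
    S_r = Vt[:r] @ S_ch @ Vt[:r].T
    lam, V = np.linalg.eigh(S_r)           # assumes the projected S_ch is positive definite
    W = V @ np.diag(lam ** -0.5)
    # Step 4: apply W to the data in the reduced space
    return Xr @ W, W
```

After this transformation, the chunklet scatter recomputed in the transformed space is (up to floating point) the identity, which is exactly what the whitening step is designed to achieve.<br />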
<br />
'''Experimental Results: Face Recognition'''<br />
<br />
The authors demonstrated the performance of RCA for the task of face recognition using the Yale A database. The database contains 155 face images of 15 people; lighting conditions and facial expression are varied across images. RCA is compared with the Eigenface method (based on PCA) and the Fisherface method (based on Fisher’s Linear Discriminant) for both nearest neighbour classification and clustering-based classification. In this dataset, the data is not naturally divided into chunklets, so the authors randomly sample chunklets given the ground-truth class (for example, if an individual is represented in 10 images, two chunklets may be formed by randomly partitioning the images into two groups of 5 images).<br />
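This chunklet-sampling procedure can be mimicked with a small helper (hypothetical, for illustration only): each class's samples are randomly partitioned into a fixed number of chunklets.<br />

```python
import numpy as np

def sample_chunklets(labels, n_chunklets_per_class=2, rng=None):
    """Form chunklets by randomly partitioning each class's samples,
    mimicking the experimental setup described above."""
    rng = np.random.default_rng(rng)
    chunklets = []
    for c in np.unique(labels):
        # shuffle this class's indices, then split them into equal-sized chunklets
        idx = rng.permutation(np.flatnonzero(labels == c))
        chunklets.extend(np.array_split(idx, n_chunklets_per_class))
    return chunklets
```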
<br />
For nearest neighbour classification, RCA outperforms Eigenface but does slightly worse than Fisherface. For clustering, RCA performs better than Eigenface and comparably to Fisherface. The authors pointed out that these experimental results are encouraging as Fisherface is a supervised method.<br />
<br />
In <ref> M. Sorci,G. Antonini, and Jean-Philippe Thiran, "Fisher's discriminant and relevant component analysis for static facial expression classification."</ref>, it is shown that, in the context of a facial expression recognition framework, RCA in combination with FLD results in a better classifier than RCA alone. This combination achieves results comparable to SVM.<br />
<br />
'''Experimental Results: Surveillance'''<br />
<br />
In a second experiment, the authors used surveillance video footage divided into discrete clips in which a single person is featured. The same person can appear in multiple clips, and the task was to retrieve all clips in which a query person appears. A colour histogram is used to represent a person. Sources of irrelevant variation include reflections, occlusions, and illumination. In this experiment, the data does come naturally in chunklets: each clip features a single person, so frames in the same clip form a chunklet. Figure 7 in the paper shows the results of k-nearest neighbour classification (not reproduced here for copyright reasons).<br />
<br />
== Second Paper: Bar-Hillel ''et al.'', 2003 <ref> A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall, "Learning Distance Functions using Equivalence Relations," Proc. International Conference on Machine Learning (ICML), 2003, pp. 11-18. </ref> ==<br />
<br />
In a subsequent work [2], Bar-Hillel ''et al.'' described how RCA can be shown to optimize an information theoretic criterion, and compared the performance of RCA with the approach proposed by Xing ''et al.'' [3].<br />
<br />
'''Information Maximization'''<br />
<br />
According to information theory, "when an input X is transformed into a new representation Y, we should seek to maximize the mutual information I(X, Y) between X and Y under suitable constraints" [2]. In adjustment learning, we can think of the objective to be to keep chunklet points close to each other in the transformed space. More formally:<br />
<br />
::<math>\max_{f \in F}I(X,Y) \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||y_{ji} - m_j^y||^2 \le K</math><br />
<br />
where f is a transformation function, m<sub>j</sub><sup>y</sup> is the mean of chunklet j in the transformed space, p is the total number of chunklet points, and K is a constant.<br />
<br />
To maximize I(X,Y), we can simply maximize the entropy of Y, H(Y). This is because I(X,Y) = H(Y) – H(Y|X), and H(Y|X) is constant since the transformation is deterministic. Intuitively, since the transformation is deterministic there is no uncertainty in Y if X is known. <br />
<br />
Now we would like to express H(Y) in terms of H(X). If the transformation is invertible, we have p<sub>y</sub>(y) = p<sub>x</sub>(x) / |J(x)|, where J(x) is the Jacobian of the transformation. Therefore,<br />
<br />
::<math><br />
\begin{align}<br />
H(Y) & = -\int_y p(y)\log p(y)\, dy \\<br />
& = -\int_x p(x) \log \frac{p(x)}{|J(x)|} \, dx \\<br />
& = H(X) + \langle \log |J(x)| \rangle_x<br />
\end{align}<br />
</math><br />
<br />
Assuming a linear transformation Y = AX, the Jacobian determinant is simply the constant |A|. So to maximize I(X,Y), we can maximize H(Y), and maximizing H(Y) amounts to maximizing |A|. Hence, the optimization objective can be updated as<br />
<br />
::<math>\max_A |A| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_{A^tA} \le K</math><br />
<br />
This can also be expressed in terms of the Mahalanobis distance matrix B = A<sup>t</sup>A as follows, noting that log |A| = (1/2) log |B|.<br />
<br />
::<math>\max_B |B| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \le K , \quad B > 0</math><br />
<br />
The solution to this problem is <math>B = \tfrac{K}{N} \hat{C}^{-1}</math>, where <math>\hat{C}</math> is the chunklet scatter matrix calculated in Step 1 of RCA. Thus, RCA gives the optimal Mahalanobis distance matrix up to a scale factor.<br />
<br />
<br />
'''Within-Chunklet Distance Minimization'''<br />
<br />
In addition, RCA minimizes the sum of within-chunklet squared distances. If we consider the optimization problem<br />
<br />
::<math>\min_B \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \quad s.t. \quad |B| \ge 1</math> <br />
<br />
then it can be shown that RCA once again gives the optimal Mahalanobis distance matrix up to a scale factor. This property suggests a natural comparison with Xing ''et al.''’s method, which similarly learns a distance metric based on similarity side information. Xing ''et al.''’s method assumes side information in the form of pairwise similarities and dissimilarities, and seeks to optimize<br />
<br />
::<math>\min_B \sum_{(x_1,x_2) \in S} ||x_1 - x_2||^2_B \quad s.t. \sum_{(x_1,x_2) \in D} ||x_1 - x_2||_B \ge 1 , \quad B \ge 0 </math><br />
<br />
where S contains similar pairs and D contains dissimilar pairs. Comparing to the preceding optimization problem, if all chunklets have size 2 (i.e. the chunklets are just pairwise similarities), the objective function is the same up to a scale factor.<br />
<br />
The authors compared the clustering performance of RCA with Xing ''et al.''’s method <ref> E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information", Advances in Neural Information Processing Systems, 2002. </ref> using six of the UC Irvine datasets. Clustering performance was measured using a normalized accuracy score defined as<br />
<br />
::<math>\sum_{i > j}\frac{1 \lbrace 1 \lbrace c_i = c_j \rbrace = 1 \lbrace \hat{c}_i = \hat{c}_j \rbrace \rbrace}{0.5m(m-1)}</math><br />
<br />
where 1{ } is the indicator function, <math>\hat{c}</math> is the assigned cluster, and c is the true cluster. The score may be interpreted as the probability of correctly assigning two randomly drawn points.<br />
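The normalized accuracy score can be computed directly from the two label vectors; a short sketch (the function name is ours):<br />

```python
import numpy as np

def pairwise_accuracy(c_true, c_pred):
    """Normalized clustering accuracy: fraction of point pairs on which the
    predicted partition agrees with the true one about 'same cluster or not'."""
    c_true = np.asarray(c_true)
    c_pred = np.asarray(c_pred)
    m = len(c_true)
    same_true = c_true[:, None] == c_true[None, :]
    same_pred = c_pred[:, None] == c_pred[None, :]
    iu = np.triu_indices(m, k=1)        # the 0.5*m*(m-1) distinct pairs
    return np.mean(same_true[iu] == same_pred[iu])
```

Note that the score is invariant to permutations of the cluster labels, since only "same cluster or not" comparisons enter the computation.<br />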
<br />
Overall, RCA yielded an improvement over regular K-means and showed comparable performance to Xing ''et al.''’s method; however, RCA is more computationally efficient, as it works with closed-form expressions while Xing ''et al.''’s method requires iterative gradient descent.<br />
<br />
== Suggestions/Critique ==<br />
<br />
* RCA makes effective use of limited side information in the form of chunklets, however in most applications the data does not naturally come in chunklets. Indeed, in the face recognition experiments, the authors had to make use of prior information to artificially create chunklets. It may be useful if the authors provided additional examples of applications where data is naturally partitioned into chunklets, to further motivate the applicability of RCA.<br />
<br />
* RCA also assumes equal class covariances, which might limit its performance on many real-world datasets.<br />
<br />
* In the UC Irvine experiments, RCA shows similar performance to Xing ''et al.''’s method, but the authors noted that RCA is more computationally efficient. While they make a sensible logical argument (iterative gradient descent tends to be computationally expensive), providing experimental running times may help support and quantify this claim.<br />
<br />
<br />
==== Why Equal Variances for Chunklets ====<br />
<br />
In [2] the authors suppose that <math> C_{m} </math> is the random variable describing the distribution of data in class <math> m </math>; then, assuming equal class covariances, they calculate <math> S_{ch} </math> as described above.<br><br />
<br />
Further, suppose that data in class <math> m </math> depend on another source of variation <math> G </math> besides the class characteristics (<math> G </math> can be global variation or sensor characteristics). Now the random variable for the <math> m </math>th class is <math> X=C_{m}+G </math>, where the global impact (<math> G </math>) is the same for all classes, <math> G </math> is independent of <math> C_{m} </math>, and the global variation is larger than the class variation (<math> \Sigma_{m}<\Sigma_{G} </math>). <br><br />
<br />
In this situation the covariance of class <math> m </math> is <math> \Sigma_{m}+\Sigma_{G} </math>, which by assumption is dominated by <math> \Sigma_{G} </math>. The class covariances are therefore approximately equal across classes, which justifies the equal-covariance assumption.<br><br />
<br />
== Kernel RCA==<br />
<br />
Although RCA has significant computational and technical advantages, there are situations in real problems that it cannot handle; that is, RCA comes with some restrictions. <br><br />
<br />
(i) RCA only considers linear transformations and fails for nonlinear ones (even simple ones);<br><br />
(ii) since RCA acts in the input space, the number of parameters depends on the dimensionality of the feature vectors;<br><br />
(iii) RCA requires a vectorial representation of the data, which may not be natural for some kinds of data, such as protein sequences.<br><br />
<br />
To overcome these restrictions, Tsang and colleagues (2005)<ref> Tsang, I. W. and Colleagues; Kernel Relevant Component Analysis For Distance Metric Learning. International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 2005 </ref> suggested using kernels in RCA and showed how RCA can be kernelized.<br />
<br />
===Kernelizing RCA===<br />
For <math>k</math> given chunklets, each containing <math>n_{i}</math> patterns <math>\left\{x_{i,1},...,x_{i,n_{i}} \right\}</math>, the covariance matrix of the centered patterns is as follows:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\bar{x}_{i}\right)\left(x_{i,j}-\bar{x}_{i}\right)^{'} </math><br />
<br />
and the associated whitening transform is as<br />
<br />
<math>x\stackrel{}{\rightarrow}C^{-\frac{1}{2}}x </math><br />
<br />
Now let <math>X=\left[x_{1,1},x_{1,2},...,x_{1,n_{1}},...,x_{k,1},...,x_{k,n_{k}} \right]</math> be the matrix whose columns are the patterns; then C can be written as:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\frac{1}{n_{i}}X1_{i}\right)\left(x_{i,j}-\frac{1}{n_{i}}X1_{i}\right)^{'} </math><br />
<br />
where <math>1_{i}</math> is an <math>n \times 1</math> vector such that:<br />
<br />
<math> [1_{i}]_{j}= \left\{\begin{matrix} <br />
1 & \text{pattern } j \in \text{chunklet } i \\ <br />
0 & \text{otherwise} \end{matrix}\right.</math><br />
<br />
and <math>I_{i}=diag\left(1_{i}\right)</math>.<br />
<br />
Using the above notation, C can be simplified to the form <math>C=\frac{1}{n}XHX^{'}</math><br />
<br />
where <math> H=\sum_{i=1}^{k}\left(I_{i}-\frac{1}{n_{i}}1_{i}1_{i}^{'}\right)</math><br />
<br />
To handle the possible singularity of <math>C</math>, for small <math> \epsilon > 0 </math> let <math>\hat{C}=C+\epsilon I</math>; then the inverse of <math>\hat{C}</math> is <br />
<br />
<math>\hat{C}^{-1}=\frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}X^{'}</math><br />
<br />
Therefore the inner product of the transformed <math>x</math> and <math>y</math> is <br />
<br />
<math> \left(\hat{C}^{-\frac{1}{2}}x\right)^{'} \left(\hat{C}^{-\frac{1}{2}}y\right)= x^{'} \hat{C}^{-1} y= x^{'} \left( \frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}X^{'} \right) y </math><br />
<br />
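The closed-form inverse of <math>\hat{C}</math> follows from the push-through (Woodbury-type) matrix identity, which expresses the inverse entirely through the Gram matrix <math>X^{'}X</math>; this is what makes kernelization possible. A small numeric check of the identity (the random data and chunklet structure are our assumptions):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, eps = 12, 4, 0.01
X = rng.standard_normal((d, n))                     # patterns as columns of X
# build H = sum_i (I_i - (1/n_i) 1_i 1_i') for three chunklets of size 4
H = np.zeros((n, n))
for idx in (np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)):
    one = np.zeros(n); one[idx] = 1.0
    H[idx, idx] += 1.0
    H -= np.outer(one, one) / len(idx)
C_hat = X @ H @ X.T / n + eps * np.eye(d)           # regularized covariance
# push-through form: (1/eps) I - (1/(n eps^2)) X H (I + (1/(n eps)) X'XH)^{-1} X'
inv_pt = (np.eye(d) / eps
          - X @ H @ np.linalg.solve(np.eye(n) + X.T @ X @ H / (n * eps),
                                    X.T) / (n * eps ** 2))
assert np.allclose(inv_pt, np.linalg.inv(C_hat))    # matches direct inversion
```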
Now if RCA operates in a feature space <math> \mathcal{F}</math> with corresponding kernel <math> l </math>, then the inner product between the nonlinear transformations <math> \varphi (x)</math> and <math> \varphi (y)</math> after running RCA in <math> \mathcal{F}</math> is:<br />
<br />
<math> \tilde{l}(x,y)=\frac{1}{\epsilon}l(x,y)-l_{x}^{'} \left( \frac{1}{n \epsilon^{2}}H \left( I+\frac{1}{n \epsilon}LH \right)^{-1} \right) l_{y} </math><br />
<br />
where <math>L=\left[ l(x_{i},x_{j}) \right]_{ij}</math>, <math> l_{x}=\left[ l(x_{1,1},x),...,l(x_{k,n_{k}},x) \right]^{'}</math><br />
and <math> l_{y}=\left[ l(x_{1,1},y),...,l(x_{k,n_{k}},y) \right]^{'}</math><br />
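For a linear kernel <math>l(x,y)=x^{'}y</math>, the kernelized expression should coincide with the explicit feature-space computation <math>x^{'}\hat{C}^{-1}y</math>, which gives a direct sanity check. A sketch (the helper names are ours):<br />

```python
import numpy as np

def chunklet_H(n, chunklets):
    """H = sum_i (I_i - (1/n_i) 1_i 1_i') built from the chunklet index sets."""
    H = np.zeros((n, n))
    for idx in chunklets:
        idx = np.asarray(idx)
        one = np.zeros(n); one[idx] = 1.0
        H[idx, idx] += 1.0
        H -= np.outer(one, one) / len(idx)
    return H

def kernel_rca_inner(L, l_x, l_y, l_xy, H, n, eps):
    """tilde_l(x, y): L is the kernel matrix on the training patterns,
    l_x and l_y the kernel vectors of x and y against them, l_xy = l(x, y)."""
    M = H @ np.linalg.inv(np.eye(n) + L @ H / (n * eps)) / (n * eps ** 2)
    return l_xy / eps - l_x @ M @ l_y
```

Only kernel evaluations enter the computation, so the same code applies to any kernel; the linear-kernel case merely makes the result checkable against the explicit whitening transform.<br />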
<br />
== Experimental Results: Application to Clustering == <br />
<br />
The main goal of this method is to utilize side information, in the form of equivalence relations, to improve the performance of unsupervised learning techniques. To test the RCA algorithm described above and to compare it with Xing ''et al.'', the authors used six data sets from the UC Irvine repository, the same ones used in <ref> E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information", Advances in Neural Information Processing Systems, 2002. </ref>. As in that paper, a set S of pairwise similarity constraints is given; with this data, the following clustering algorithms were compared:<br />
<br />
1. K-means using the default Euclidean metric (i.e. using no side information).<br />
<br />
2. Constrained K-means: K-means subject to points <math>(x_i,x_j) \in S</math> always being assigned to the same cluster (Wagstaff et al., 2001).<br />
<br />
3. Constrained K-means + metric proposed by (Xing et al., 2002): Constrained K-means using the distance metric proposed in (Xing et al., 2002), which is learned from S.<br />
<br />
4. Constrained K-means + RCA: Constrained K-means using the RCA distance metric learned from S.<br />
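Since RCA consumes chunklets while the side information here comes as pairwise constraints in S, similar pairs must first be merged into chunklets by transitive closure, e.g. (a,b) and (b,c) yield the chunklet {a,b,c}. A sketch of this preprocessing step (the helper name is ours):<br />

```python
def pairs_to_chunklets(pairs, n):
    """Merge pairwise 'same class' constraints into chunklets via
    transitive closure, implemented with a simple union-find."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    # singletons carry no equivalence information, so drop them
    return [g for g in groups.values() if len(g) > 1]
```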
<br />
== References ==<br />
<references/></div>Amir