convex and Semi Nonnegative Matrix Factorization (statwiki revision of 2009-08-14)
<hr />
<div>In the paper ‘Convex and Semi-Nonnegative Matrix Factorization’, Ding et al. <ref name='Ding C'> Ding C, Li T, and Jordan M. I; “Convex and semi nonnegative matrix factorization”. </ref> propose new NMF-like algorithms for mixed-sign data, called Semi NMF and Convex NMF. They also show that a kernel form of NMF can be obtained by ‘kernelizing’ Convex NMF. They explore the connection between NMF algorithms and K means clustering to show that these NMF algorithms can be used for clustering in addition to matrix approximation. These new variants thereby broaden the application areas of NMF and also make the matrix factors more interpretable.<br />
<br />
==Introduction==<br />
Nonnegative matrix factorization (NMF) factorizes a matrix X into two matrices F and G, with the constraint that all three matrices are nonnegative, i.e. they contain only positive or zero entries and no negative entries:<br />
<math>X_+ \approx F_+{G_+}^T</math><br />
where <math> X \in {\mathbb R}^{p \times n}</math> , <math> F \in {\mathbb R}^{p \times k}</math> , <math> G \in {\mathbb R}^{n \times k}</math><br />
<br />
The least squares objective function of NMF is:<br />
<math> \mathbf {E(F,G) = \|X-FG^T\|^2}</math><br />
<br />
It has been shown that this is an NP-hard problem: the objective is convex in F alone or in G alone, but not in F and G simultaneously <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>. Also, the factors F and G are not always sparse, and many different sparsification schemes have been applied to NMF.<br />
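As a concrete illustration (not from the paper), the standard multiplicative update rules of Lee and Seung can be sketched in NumPy; the function name, the random initialization and the small epsilon guard are our own choices:<br />

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-9):
    """Lee-Seung multiplicative updates for X ~ F @ G.T, all factors nonnegative."""
    rng = np.random.default_rng(0)
    p, n = X.shape
    F = rng.random((p, k))
    G = rng.random((n, k))
    for _ in range(n_iter):
        F *= (X @ G) / (F @ G.T @ G + eps)    # update F, fixing G
        G *= (X.T @ F) / (G @ F.T @ F + eps)  # update G, fixing F
    return F, G
```

Because each factor is multiplied by a nonnegative ratio, nonnegativity is preserved automatically at every step.<br />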
<br />
==Semi NMF==<br />
In Semi NMF, the matrix G is constrained to be nonnegative, whereas the data matrix X and the basis vectors F are unconstrained, that is:<br />
<br />
<math>X_{\pm} \approx F_{\pm}{G_+}^T</math><br />
<br />
This factorization was motivated by K means clustering. The objective function of K means can be written as a matrix approximation as follows:<br />
<br />
<math> J_{K-means} = \sum_{i=1}^n \sum_{k=1}^K g_{ik}||x_i-f_k||^2=||X-FG^T||^2 </math> <br />
<br />
where X is a mixed-sign data matrix, F represents the cluster centroids (having both positive and negative entries) and G represents the cluster indicators (having nonnegative entries).<br />
<br />
The Semi NMF approximation can thus be viewed as the K means objective with a relaxed constraint on G: instead of being restricted to the binary values {0, 1}, the entries of G are allowed to range over <math>(0, \infty)</math>.<br />
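The identity between the K means objective and the matrix approximation <math>||X-FG^T||^2</math> can be checked numerically; the data and cluster assignments below are hypothetical, chosen only for illustration:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 12))                              # p=2 features, n=12 mixed-sign points
labels = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2, 0, 1, 2])  # hypothetical assignments, K=3
G = np.eye(3)[labels]                                     # n x K binary cluster indicator
F = (X @ G) / G.sum(axis=0)                               # p x K centroids: f_k = mean of cluster k
J_kmeans = sum(np.linalg.norm(X[:, i] - F[:, labels[i]])**2 for i in range(12))
J_matrix = np.linalg.norm(X - F @ G.T)**2                 # same value, written as ||X - F G^T||^2
```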
<br />
==Convex NMF==<br />
While Semi NMF imposes no constraint on the basis vectors F, Convex NMF restricts the columns of F to be convex combinations of the columns of the data matrix X, such that:<br />
<br />
<math> F=(f_1, \cdots , f_k)</math><br />
<br />
<math> f_l=w_{1l}x_1+ \cdots + w_{nl}x_n = Xw_l</math>, i.e. <math> F = XW</math>, such that<br />
<math> w_{il} \ge 0</math> <math>\forall i,l </math> <br />
<br />
In this factorization each column of matrix F is a weighted sum of certain data points. This implies that we can think of F as weighted cluster centroids.<br />
<br />
Convex NMF has the form:<br />
<math> X_{\pm} \approx X_{\pm}W_+{G_+}^T</math><br />
<br />
For F to represent weighted cluster centroids (convex combinations in the strict sense), each column of W should also satisfy <math> \sum _{i=1}^n w_{il} = 1 </math>; the authors, however, do not explicitly state this constraint.<br />
<br />
==SVD, Convex-NMF and Semi-NMF Comparison==<br />
Considering G and F as the results of factorizing X through SVD, Convex-NMF and Semi-NMF, it can be shown that:<br />
the Semi-NMF and Convex-NMF factorizations give clustering results identical to K-means clustering;<br /><br />
Convex-NMF gives sharper indicators of the clustering;<br /><br />
<math>\,F_{cnvx}</math> is close to <math>\,C_{Kmeans}</math>, whereas <math>\,F_{semi}</math> is not. The intuition behind this is that the restrictions on F can have large effects on the subspace factorization;<br /><br />
the larger residual values <math>\,\|X-FG^T\|</math> for Convex-NMF intuitively say that the more highly constrained the factorization (Convex-NMF), the greater the degradation in accuracy.<br />
<br />
==Algorithms==<br />
The algorithms for these variants of NMF are based on the iterative updating algorithms proposed for the original NMF, in which the factors are alternately updated until convergence <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>. At each iteration, the value of F or G is found by multiplying its current value by some factor. In <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>, it is proved that repeatedly applying these multiplicative update rules smoothly improves the quality of the approximation; that is, the update rules guarantee convergence to a locally optimal matrix factorization. In this paper, the authors use the same approach to derive the algorithms for Semi NMF and Convex NMF.<br />
<br />
===Algorithm for Semi NMF===<br />
<br />
As already stated, the factors for Semi NMF are computed using an iterative algorithm that alternately updates F and G until convergence is reached.<br />
<br />
*'''Step 1''': Initialize G<br />
**Obtain cluster indicators by K means clustering. <br />
*'''Step 2''': Update F, fixing G using the rule:<br />
<math>\mathbf{ F = XG(G^TG)^{-1}} </math><br />
<br />
*'''Step 3''': Update G, fixing F using the rule:<br />
<math> G_{ik} \leftarrow G_{ik} \sqrt{\frac {{(X^TF)^+}_{ik} + [G(F^TF)^-]_{ik}}{{(X^TF)^-}_{ik} + [G(F^TF)^+]_{ik}}}</math><br />
<br />
where, the positive and negative parts of a matrix are separated as:<br />
<math> {A_{ik}}^{+}=(|A_{ik}|+A_{ik})/2 </math> , <math> {A_{ik}}^{-}=(|A_{ik}|- A_{ik})/2 </math><br />
<br />
and, <math> A_{ik}= {A_{ik}}^{+} - {A_{ik}}^{-} </math><br />
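A minimal NumPy sketch of these two update steps (our own translation of the rules above; the epsilon guard against division by zero is an implementation choice, and in practice G would be initialized from K means as in Step 1):<br />

```python
import numpy as np

def pos(A): return (np.abs(A) + A) / 2   # elementwise positive part A+
def neg(A): return (np.abs(A) - A) / 2   # elementwise negative part A-

def semi_nmf(X, G, n_iter=200, eps=1e-9):
    """Semi-NMF: X (mixed sign) ~ F @ G.T with G >= 0.
    G should be initialized outside, e.g. from K-means cluster indicators."""
    G = G.copy()
    for _ in range(n_iter):
        F = X @ G @ np.linalg.inv(G.T @ G)             # Step 2: exact solve for F
        XtF, FtF = X.T @ F, F.T @ F
        G *= np.sqrt((pos(XtF) + G @ neg(FtF) + eps) /
                     (neg(XtF) + G @ pos(FtF) + eps))  # Step 3: multiplicative update
    return F, G
```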
<br />
<br><br />
'''Theorem 1:''' (A) The update rule for F gives the optimal solution to <math> \min_F \|X - FG^T\|^2 </math> while G is fixed. (B) When F is fixed, the residual <math> \|X - FG^T\|^2 </math> decreases monotonically under the update rule for G.<br />
<br />
'''Proof:'''<br />
<br />
(We will not prove the entire theorem, but discuss its main parts.)<br />
<br />
The objective function for semi NMF is:<br />
<math> J=\|X - FG^T\|^2= Tr(X^TX - 2X^TFG^T + GF^TFG^T) </math>.<br />
<br />
(A). The subproblem in F is unconstrained and its solution is found by setting the gradient to zero:<br />
<math>\partial J/ \partial F = -2XG + 2FG^TG = 0</math><br />
<br>Therefore, <math> F = XG(G^TG)^{-1} </math><br />
<br />
(B). This subproblem is constrained: G must satisfy an inequality (nonnegativity) constraint, so it is handled using Lagrange multipliers. The solution given by the update rule must satisfy the KKT conditions at convergence, which establishes its correctness; secondly, the update rule must cause the solution to converge. In the paper, the correctness and convergence of the update rule are proved as follows:<br />
<br />
<br><br />
<br />
(i)'''Correctness of solution:'''<br />
<br />
The Lagrange function is: <math> L(G) = Tr (-2X^TFG^T + GF^TFG^T - \beta G^T) </math> <br />
<br> where <math> \beta_{ij}</math> are the Lagrange multipliers enforcing the nonnegativity constraint on G.<br />
<br>Therefore, <math> \frac {\partial L}{\partial G}= -2X^TF + 2GF^TF - \beta = 0 </math> <br />
<br> From the complementary slackness condition, <math> (-2X^TF + 2GF^TF)_{ik}G_{ik} = \beta_{ik}G_{ik} = 0. </math> <br />
<br> The above equation must be satisfied at convergence.<br />
<br> The update rule for G can be reduced to: <br />
<math> (-2X^TF + 2GF^TF)_{ik}{G_{ik}}^2 = 0 </math> at convergence.<br />
<br> The two conditions have identical fixed points (each requires either <math>(-2X^TF + 2GF^TF)_{ik} = 0</math> or <math>G_{ik} = 0</math>), and therefore the update rule satisfies the KKT fixed point condition.<br />
<br><br />
<br />
<br />
(ii)'''Convergence of the solution given by update rule:'''<br />
<br />
The authors used an auxiliary function approach to prove convergence, as done in <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>.<br />
<br />
'''Definition of auxiliary function''': A function G(h,h') is called an auxiliary function of F(h) if the conditions <math> G (h,h') \ge F(h) </math> and <math> G (h,h) = F(h) </math> are satisfied. <br />
<br />
The auxiliary function is a useful concept because of the following lemma:<br />
<br><br />
<br />
'''Lemma:''' If G is an auxiliary function, then F is nonincreasing under the update <math>\mathbf{ h^{t+1} = \arg \min_h G(h,h^t)} </math><br />
<br />
[[File:auxiliary.jpeg|left|thumb|800px|Figure 1]]<br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
Adapted from <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”.</ref>.<br />
<br />
<br> That is, minimizing the auxiliary function <math> G(h,h^t) \ge F(h) </math> guarantees that <math> F(h^{t+1}) \le F(h^t) </math> for <math> \mathbf {h^{t+1} = \arg \min_h G(h, h^t) }</math> <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”.</ref>.<br />
<br />
Therefore, the authors find an auxiliary function and its global minimum for the cost function of Semi NMF.<br />
<br />
The cost function for Semi NMF can be written as: <br />
<math> \mathbf {J(H) = Tr (-2H^TB^{+} + 2H^TB^{-} + HA^{+}H^T - HA^{-}H^T)} </math> where <math> A = F^TF , B = X^TF , H = G </math>. <br />
<br />
The auxiliary function of J (H) is: <br><br />
<math> Z(H,H') = -\sum_{ik}2{B_{ik}}^{+}H'_{ik}(1+ \log \frac {H_{ik}}{H'_{ik}}) + \sum_{ik} {B^-}_{ik} \frac {{H^2}_{ik}+{{H'}^2}_{ik}}{{H'}_{ik}} + \sum_{ik} \frac {(H'A^{+})_{ik}{H^2}_{ik}}{{H'}_{ik}} - \sum_{ikl} {A_{kl}}^{-}{H'}_{ik}{H'}_{il} (1+ \log \frac {H_{ik}H_{il}}{H'_{ik}H'_{il}}) </math> <br />
<br />
Z (H, H') is convex in H and its global minimum is:<br><br />
<math> H_{ik} = \arg \min_H Z(H,H') = H'_{ik}\sqrt {\frac {{B_{ik}}^{+} + (H'A^{-})_{ik}}{{B_{ik}}^{-} + (H'A^{+})_{ik}}} </math><br />
<br />
(The derivation of auxiliary function and its minimum can be found in the paper <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization” </ref>.)<br />
<br />
===Algorithm for Convex NMF===<br />
Here again, the factors G and W are computed iteratively by alternate updating until convergence.<br />
*'''Step 1''': Initialize G and W. There are two ways in which the initialization can be done.<br />
**'''K means clustering''': When K means clustering is performed on the data set, cluster indicators <math> H = (h_1, \cdots , h_K) </math> are obtained, and G is initialized to H. The cluster centroids can then be computed from H as <math>\mathbf {f_k = Xh_k / n_k} </math> or <math> F=XH{D_n}^{-1}</math> where <math> D_n = diag (n_1, \cdots, n_K) </math>. And since, in Convex NMF, <math>F = XW </math>, we get <math> W=H{D_n}^{-1}</math> <br />
**'''Previous NMF or Semi NMF solution''': The factor G is known in this case, and a least squares solution for W is obtained by solving <math> X=XWG^T</math>, giving <math> W=G(G^TG)^{-1} </math><br />
<br />
*'''Step 2''': Update G, while fixing W using the rule<br />
<math> G_{ik} \leftarrow G_{ik} \sqrt{\frac {[(X^TX)^+W]_{ik} + [GW^T(X^TX)^-W]_{ik}} {[(X^TX)^-W]_{ik} + [GW^T(X^TX)^+W]_{ik}} } </math><br />
*'''Step 3''': Update W, while fixing G using the rule<br />
<math> W_{ik} \leftarrow W_{ik} \sqrt{\frac {[(X^TX)^+G]_{ik} + [(X^TX)^-WG^TG]_{ik}} {[(X^TX)^-G]_{ik} + [(X^TX)^+WG^TG]_{ik}} } </math><br />
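The two update steps can be sketched as follows (again our own minimal NumPy translation, with a hypothetical epsilon guard; note that only <math>X^TX</math> enters the updates):<br />

```python
import numpy as np

def pos(A): return (np.abs(A) + A) / 2   # elementwise positive part A+
def neg(A): return (np.abs(A) - A) / 2   # elementwise negative part A-

def convex_nmf(X, W, G, n_iter=200, eps=1e-9):
    """Convex-NMF: X ~ X @ W @ G.T with W, G >= 0."""
    W, G = W.copy(), G.copy()
    K = X.T @ X                           # only X^T X enters the updates
    Kp, Kn = pos(K), neg(K)
    for _ in range(n_iter):
        G *= np.sqrt((Kp @ W + G @ (W.T @ Kn @ W) + eps) /
                     (Kn @ W + G @ (W.T @ Kp @ W) + eps))
        GtG = G.T @ G
        W *= np.sqrt((Kp @ G + Kn @ W @ GtG + eps) /
                     (Kn @ G + Kp @ W @ GtG + eps))
    return W, G
```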
<br />
The objective function to be minimized for convex NMF is:<br />
<br />
<math> \mathbf {J=\|X-XWG^T\|^2= Tr(X^TX- 2G^TX^TXW + W^TX^TXWG^TG)} </math>.<br />
<br />
'''Theorem 2:''' Fixing W, under the update rule for G, (A) the residual <math>\|X - XWG^T \|^2 </math> decreases monotonically (non-increasing), and (B) the solution converges to a KKT fixed point.<br />
<br />
The correctness and convergence of these rules are demonstrated in a manner similar to Semi NMF by substituting F = XW.<br />
<br />
'''Theorem 3:''' Fixing G, under the update rule for W, (A) the residual <math>\|X - XWG^T \|^2 </math> decreases monotonically (non-increasing), and (B) the solution converges to a KKT fixed point.<br />
<br />
The correctness is demonstrated by minimizing the objective function with respect to W and then obtaining the KKT fixed point condition:<br />
<br />
<math> \mathbf {(-X^TXG + X^TXWG^TG)_{ik}W_{ik} = 0 }</math><br />
<br />
<br> At convergence, the update rule for W can be shown to satisfy:<br />
<br />
<math>\mathbf { (-X^TXG + X^TXWG^TG)_{ik}{W_{ik}}^2 = 0 }</math><br />
<br />
<br> Therefore, the update rule for W satisfies the KKT condition.<br><br />
<br />
Convergence of these rules is demonstrated in a manner similar to Semi NMF by finding an auxiliary function and its global minimum.<br />
<br />
==Sparsity of Convex NMF==<br />
<br />
NMF has been shown to learn a parts-based representation and therefore tends to have sparse factors. However, there is no means to control the degree of sparseness, and many sparsification methods have been applied to NMF in order to obtain better parts-based representations <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization” </ref> , <ref name='Simon D. H' > Simon D. H, Ding C, and He X; “On the Equivalence of Nonnegative Matrix Factorization and Spectral Clustering” </ref>. In contrast, the authors of this paper show that the factors of Convex NMF are naturally sparse.<br />
<br />
<br> The convex NMF problem can be written as:<br />
<br />
<math> \min_{W,G \ge 0}||X-XWG^T||^2 = ||X(I-WG^T)||^2= Tr[(I-GW^T)X^TX(I-WG^T)] </math><br />
<br />
<br> by SVD of <math> X </math> we have <math> X = U \Sigma V^T</math> and thus, <math> X^TX = \sum_k {\sigma _k}^2v_k{v_k}^T.</math><br />
<br />
<br> Therefore, <math> \min_{W,G \ge 0} Tr[(I-GW^T)X^TX(I-WG^T)] = \sum_k {\sigma_k}^2||{v_k}^T(I-WG^T)||^2 </math> s.t. <math>W \in {\mathbb R_+}^{n \times k} </math> , <math>G \in {\mathbb R_+}^{n \times k}</math><br />
<br />
They use the following Lemma to show that the above optimization problem gives sparse W and G.<br />
<br />
<br>'''Lemma:''' The solution of the optimization problem <math> \min_{W,G \ge 0}||I-WG^T||^2 </math> s.t. <math>W, G \in {\mathbb R_+}^{n \times K}</math> is given by W = G = any K distinct columns of the identity matrix, i.e. <math> W = G = (e_{i_1}, \cdots , e_{i_K}) </math>, where <math> e_k </math> is a standard basis vector: <math> (e_k)_{i \ne k} = 0 </math> , <math> (e_k)_{i = k} = 1 </math><br />
<br />
<br> According to this Lemma, the solution to <math> \min_{W,G \ge 0}\|I - WG^T\|^2 </math> are the sparsest possible rank-K matrices W and G.<br />
<br />
In the above equation, we can write: <math> \| I - WG^T \|^2 = \sum_{i=1}^n \|{e_i}^T (I - WG^T)\|^2 </math>.<br />
<br />
Therefore, the projection of <math> ( I - WG^T ) </math> onto the principal components carries more weight, while its projection onto the non-principal components carries less weight. This implies that the factors W and G are sparse in the principal-component subspace and less sparse in the non-principal-component subspace.<br />
<br />
==Kernel NMF==<br />
Consider a mapping <math> \phi </math> that maps each point to a higher dimensional feature space, <math> \phi: x_i \rightarrow \phi(x_i)</math>. The factors for the kernel form of NMF or Semi NMF, <math> \phi (X) = FG^T </math>, would be difficult to compute, as we would need to know the mapping <math>\phi </math> explicitly.<br />
<br />
This difficulty is overcome in Convex NMF, as it has the form <math> \phi (X) \approx \phi (X) WG^T </math>, and therefore the objective to be minimized becomes,<br />
<br> <math> \|\phi (X)-\phi(X)WG^T\|^2 = Tr (K-2G^TKW+W^TKWG^TG) </math> where <math> K = \phi^T(X)\phi(X) </math> is the kernel.<br />
<br />
Also, the update rules for the convex NMF algorithm (discussed above) depend only on <math> X^TX </math> and therefore convex NMF can be '''kernelized'''.<br />
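A sketch of the kernelized updates, with a hypothetical RBF kernel matrix standing in for <math> X^TX </math> (since this kernel is elementwise nonnegative, its negative part vanishes):<br />

```python
import numpy as np

def pos(A): return (np.abs(A) + A) / 2   # elementwise positive part A+
def neg(A): return (np.abs(A) - A) / 2   # elementwise negative part A-

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 25))                       # 25 mixed-sign points in R^3
# Hypothetical RBF kernel between the columns of X, replacing X.T @ X:
sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
K = np.exp(-sq)
Kp, Kn = pos(K), neg(K)                            # Kn = 0 here since K >= 0
W = rng.random((25, 2)) + 0.1
G = rng.random((25, 2)) + 0.1
for _ in range(100):                               # same updates as Convex NMF
    G *= np.sqrt((Kp @ W + G @ (W.T @ Kn @ W) + 1e-9) /
                 (Kn @ W + G @ (W.T @ Kp @ W) + 1e-9))
    GtG = G.T @ G
    W *= np.sqrt((Kp @ G + Kn @ W @ GtG + 1e-9) /
                 (Kn @ G + Kp @ W @ GtG + 1e-9))
```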
<br />
==Cluster NMF==<br />
<br />
If the factor G is considered to contain posterior cluster probabilities, then F, which represents the cluster centroids, is given as:<br />
<br> <math> \mathbf {f_k = Xg_k / n_k} </math> or <math> F = XG{D_n}^{-1}</math> where <math> D_n = diag (n_1, \cdots, n_K) </math>.<br />
<br>Therefore, the factorization becomes <math> X \approx XG{D_n}^{-1}G^T </math> or, since NMF is invariant to diagonal rescaling, simply <math> X \approx X G G^T </math>.<br />
<br />
This factorization is called Cluster NMF as it has the same degree of freedom as in any standard clustering problem, which is G (cluster indicator).<br />
<br />
==Relationship between NMF (its variants) and K means clustering==<br />
<br />
NMF and all its variants discussed above can be interpreted as K means clustering by imposing the additional orthogonality constraint <math> G^TG=I </math>; together with nonnegativity, this forces each row of G to have exactly one nonzero element, which implies each data point can belong to only one cluster.<br />
<br />
'''Theorem:''' G-orthogonal NMF, Semi NMF, Convex NMF, Cluster NMF and Kernel NMF are all relaxations of K means clustering.<br />
<br />
'''Proof:'''<br />
<br />
In all the above five cases of NMF, it can be shown that the objective function can be reduced to:<br />
<math> \mathbf {J = Tr(X^TX -G^TKG)} </math> when <math> G^TG = I </math> and where <math> K = X^TX </math> or <math> K = \phi^T(X)\phi(X) </math>. As the first term is a constant, the minimization problem actually becomes: <br><br />
<math> \max_{G^TG = I} Tr(G^TKG) </math><br />
<br />
The above objective function is the same as the objective function for kernel K means clustering <ref name='Simon D. H'> Simon D. H, Ding C, and He X; “On the Equivalence of Nonnegative Matrix Factorization and Spectral Clustering” </ref>.<br />
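If the nonnegativity of G is also dropped, leaving only <math> G^TG = I </math>, the maximizer of <math> Tr(G^TKG) </math> is given by the top k eigenvectors of K (the Ky Fan theorem); a quick numerical check of this relaxed problem (our own illustration):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 15))
K = X.T @ X                       # Gram matrix of the data, n x n
k = 2
vals, vecs = np.linalg.eigh(K)    # eigenvalues in ascending order
G = vecs[:, -k:]                  # top-k eigenvectors form an orthonormal G
# Tr(G^T K G) attains its maximum: the sum of the k largest eigenvalues.
assert np.isclose(np.trace(G.T @ K @ G), vals[-k:].sum())
```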
<br />
<br> Even without the orthogonality constraint, these NMF algorithms can be considered to be '''soft''' versions of K means clustering. That is each data point can be considered to fractionally belong to more than one cluster.<br />
<br />
==General properties of NMF algorithms==<br />
*The algorithms converge to a local minimum, not the global minimum.<br />
*NMF factors are invariant to rescaling, i.e. a degree of freedom of diagonal rescaling is always present.<br />
*The convergence rate of the multiplicative algorithms is first order.<br />
*There are many different ways to initialize NMF. Here, the relationship between NMF and relaxed K means clustering is used.<br />
<br />
==Experimental Results==<br />
<br />
The authors present experimental results on a synthetic data set to show that the factors given by Convex NMF resemble cluster centroids more closely than those given by Semi NMF, although the Semi NMF results are better than Convex NMF in terms of accuracy. They also compare the results of NMF, Convex NMF and Semi NMF with K means clustering on real datasets, and conclude that all of these matrix factorizations give better clustering accuracy than K means on all of the datasets they studied.<br />
<br />
=== A. Synthetic dataset ===<br />
One of the main goals here is to show that the Convex-NMF variants may provide subspace factorizations with more interpretable factors than those obtained by other NMF variants (or PCA). In particular, we expect that in some cases the factor F will be interpretable as containing cluster representatives (centroids) and G will be interpretable as encoding cluster indicators. <br />
<center>[[File:Convex-Fig1.JPG]]</center><br />
In Figure 1, we randomly generate four two-dimensional datasets with three clusters each. Computing both the Semi-NMF and Convex-NMF factorizations, we display the resulting F factors. We see that the Semi-NMF factors tend to lie distant from the cluster centroids. On the other hand, the Convex-NMF factors almost always lie within the clusters.<br />
<br />
=== B. Real life datasets ===<br />
The data sets which were used are: Ionosphere and Wave from the UCI repository, the document datasets URCS, WebkB4, Reuters (using a subset of the data collection which includes the 10 most frequent categories), WebAce and a dataset which contains 1367 log messages collected from several different machines with different operating systems at the School of Computer Science at Florida International University. The log messages are grouped into 9 categories: configuration, connection, create, dependency, other, report, request, start, and stop. Stop words were removed using a standard stop list. The top 1000 words were selected based on frequencies.<br />
<br />
<center>[[File:Convex-Table1.JPG]]</center><br />
<br />
The results are shown in Table I. We derived these results by averaging over 10 runs for each dataset and algorithm. Clustering accuracy was computed using the known class labels in the following way: The confusion matrix is first computed. The columns and rows are then reordered so as to maximize the sum of the diagonal. This sum is taken as a measure of the accuracy: it represents the percentage of data points correctly clustered under the optimized permutation. To measure the sparsity of G in the experiments, the average of each column of G was computed and all elements below 0.001 times the average were set to zero. We report the number of the remaining nonzero elements as a percentage of the total number of elements. (Thus small values of this measure correspond to large sparsity). We can observe that: <br />
<br />
1. The main empirical result indicates that all of the matrix factorization models are better than K-means on all of the datasets, which suggests that the NMF family is competitive with K-means for the purposes of clustering. <br />
<br />
2. On most of the nonnegative datasets, NMF gives somewhat better accuracy than Semi-NMF and Convex-NMF (with WebKb4 the exception). The differences are modest, however, suggesting that the more highly-constrained Semi-NMF and Convex-NMF may be worthwhile options if interpretability is viewed as a goal of the data analysis. <br />
<br />
3. On the datasets containing both positive and negative values (where NMF is not applicable), the Semi-NMF results are better in terms of accuracy than the Convex-NMF results. <br />
<br />
4. In general, Convex-NMF solutions are sparse, while Semi-NMF solutions are not. <br />
<br />
5. Convex-NMF solutions are generally significantly more orthogonal than Semi-NMF solutions.<br />
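The accuracy measure described above can be sketched as follows (our own helper; a brute-force search over label permutations, which is adequate for small numbers of clusters):<br />

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(true_labels, pred_labels):
    """Accuracy under the best relabeling of clusters: build the confusion
    matrix, then reorder its columns to maximize the diagonal sum."""
    K = int(max(true_labels.max(), pred_labels.max())) + 1
    C = np.zeros((K, K), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        C[t, p] += 1
    # Brute force over permutations; use the Hungarian algorithm for larger K.
    best = max(sum(C[i, perm[i]] for i in range(K)) for perm in permutations(range(K)))
    return best / len(true_labels)
```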
<br />
<br />
=== C. Shifting mixed-sign data to nonnegative ===<br />
<br />
In this section, the mixed-sign data were shifted to be nonnegative by adding the smallest constant that makes all entries nonnegative, and experiments were performed on the shifted Wave and Ionosphere data. For Wave, the accuracy decreases to 0.503 from 0.590 for Semi-NMF and to 0.5297 from 0.5738 for Convex-NMF, while the sparsity increases to 0.586 from 0.498 for Convex-NMF. For Ionosphere, the accuracy decreases to 0.647 from 0.729 for Semi-NMF and to 0.618 from 0.6877 for Convex-NMF, while the sparsity increases to 0.829 from 0.498 for Convex-NMF. <br />
<br />
<center>[[File:Convex-Fig2.JPG]]</center><br />
<br />
In short, the shifting approach does not appear to provide a satisfactory alternative.<br />
<br />
=== D. Flexibility of NMF ===<br />
In general, NMF almost always performs better than K-means in terms of clustering accuracy while also providing a matrix approximation. This could be due to the flexibility of matrix factorization compared to the rigid spherical clusters that the K-means objective function attempts to capture. When the data distribution is far from spherical, NMF may have advantages. Figure 2 gives an example. The dataset consists of two parallel rods in 3D space containing 200 data points. The two central axes of the rods are 0.3 apart, and the rods have diameter 0.1 and length 1. As seen in the figure, K-means gives a poor clustering, while NMF yields a good clustering. The bottom panel of Figure 2 shows the differences in the columns of G (each column is normalized so that <math> \sum_i g_k(i) = 1 </math>); the mis-clustered points have small differences. Note that NMF is initialized randomly for the different runs. The stability of the solution over multiple runs was investigated; the results indicate that NMF converges to solutions F and G that are very similar across runs, and moreover the resulting discretized cluster indicators were identical.<br />
<br />
==Conclusion==<br />
In this paper: <br />
*A number of new NMF algorithms have been proposed, which extend the applications of NMF.<br />
*They deal with mixed sign data.<br />
*The connection between NMF (its variants) and K means clustering was analyzed.<br />
*The matrix factors are shown to have convenient interpretation in terms of clustering.<br />
<br />
==References==<br />
<references/></div>
<hr />
<div>In the paper ‘Convex and semi non negative matrix factorization’, Jordan et al <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization”. </ref> have proposed new NMF like algorithms on mixed sign data, called Semi NMF and Convex NMF. They also show that a kernel form of NMF can be obtained by ‘kernelizing’ convex NMF. They explore the connection between NMF algorithms and K means clustering to show that these NMF algorithms can be used for clustering in addition to matrix approximation. These new variants of algorithm thereby, broaden the application areas of NMF algorithm and also provide better interpretability to matrix factors.<br />
<br />
==Introduction==<br />
Nonnegative matrix factorization (NMF), factorizes a matrix X into two matrices F and G, with the constraints that all the three matrices are non negative i.e. they contain only positive values or zero but no negative values, such as:<br />
<math>X_+ \approx F_+{G_+}^T</math><br />
where ,<math> X \in {\mathbb R}^{p \times n}</math> , <math> F \in {\mathbb R}^{p \times k}</math> , <math> G \in {\mathbb R}^{n \times k}</math><br />
<br />
The least square objective function of NMF is:<br />
<math> \mathbf {E(F,G) = \|X-FG^T\|^2}</math><br />
<br />
It has been shown that it is a NP hard problem and is convex in only F or only G but not convex in both F and G simultaneously <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref> Also, the factors F and G are not always sparse and many different sparsification schemes have been applied to NMF.<br />
<br />
==Semi NMF==<br />
In semi NMF, the matrix G is constrained to be nonnegative whereas the data matrix X and the basis vectors of F are unconstrained, that is:<br />
<br />
<math>X_{\pm} \approx F_{\pm}{G_+}^T</math><br />
<br />
They were motivated to this kind of factorization by K means clustering. The objective function of K means can be written in the form of matrix approximation as follows:<br />
<br />
<math> J_{K-means} = \sum_{i=1}^n \sum_{k=1}^K g_{ik}||x_i-f_k||^2=||X-FG^T||^2 </math> <br />
<br />
where, X is a mixed sign data matrix , F represents cluster centroids having both positive and negative entries and G represents cluster indicators having nonnegative entries.<br />
<br />
K means clustering objective function can be viewed as Semi NMF matrix approximation with relaxed constraint on G. That is G is allowed to range over values (0, 1) or (0, infinity).<br />
<br />
==Convex NMF==<br />
While in Semi NMF, there is no constraint imposed upon the basis vector F, but in Convex NMF, the columns of F are restricted to be a convex combination of columns of data matrix X, such as:<br />
<br />
<math> F=(f_1, \cdots , f_k)</math><br />
<br />
<math> f_l=w_{1l}x_1+ \cdots + w_{nl}x_n = Xw_l = XW</math> such that,<br />
<math> w_{ij}>0</math> <math>\forall i,j </math> <br />
<br />
In this factorization each column of matrix F is a weighted sum of certain data points. This implies that we can think of F as weighted cluster centroids.<br />
<br />
Convex NMF has the form:<br />
<math> X_{\pm} \approx X_{\pm}W_+{G_+}^T</math><br />
<br />
As F is considered to represent weighted cluster centroid, the constraint <math> \sum _{i=1}^n w_i = 1 </math> must be satisfied. But the authors do not actually state this.<br />
<br />
==SVD, Convex-NMF and Semi-NMF Comparison==<br />
Considering G and F as the result of matrix factorization through SVD, Convex-NMF, and semi-NMF factorizatrion, It can be shown that <br />
Semi-NMF and Convex-NMF factorizations gives clustering results identical to the K-means clustering.<br /><br />
Sharper indicators of the clustering is given by Convex-NMF.<br /><br />
<math>\,F_{cnvx}</math> is close to <math>\,C_{Kmeans}</math>, however, <math>\,F_{semi}</math>is not. The intuition behind this is taht F can have large effects on subspace factorization<br /><br />
Getting larger residual values, <math>\,\|X-FG^T\|</math> for Convex-NMF intuitively says that the more highly constrained (Convex-NMF), the more degredation in accuracy.<br />
<br />
==Algorithms==<br />
The algorithms for these variants of NMF are based on iterative updating algorithms proposed for the original NMF, in which the factors are alternatively updated until convergence <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>. At each iteration of algorithm, the value for F or G is found by multiplying its current value by some factor. In <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>, they prove that by repeatedly applying these multiplicative update rules, the quality of approximation smoothly improves. That is, the update rule guarantees convergence to a locally optimal matrix factorization. In this paper, the same approach has been used by authors to present the algorithms for Semi NMF and Convex NMF.<br />
<br />
===Algorithm for Semi NMF===<br />
<br />
As already stated, the factors for semi NMF are computed by using an iterative updating algorithm that alternatively updates F and G till convergence is reached.<br />
<br />
*'''Step 1''': Initialize G<br />
**Obtain cluster indicators by K means clustering. <br />
*'''Step 2''': Update F, fixing G using the rule:<br />
<math>\mathbf{ F = XG(G^TG)^{-1}} </math><br />
<br />
*'''Step 3''': Update G, fixing F using the rule:<br />
<math> G_{ik} \leftarrow G_{ik} \sqrt{\frac {{(X^TF)^+}_{ik} + [G(F^TF)^-]_{ik}}{{(X^TF)^-}_{ik} + [G(F^TF)^+]_{ik}}}</math><br />
<br />
where, the positive and negative parts of a matrix are separated as:<br />
<math> {A_{ik}}^{+}=(|A_{ik}|+A_{ik})/2 </math> , <math> {A_{ik}}^{-}=(|A_{ik}|- A_{ik})/2 </math><br />
<br />
and, <math> A_{ik}= {A_{ik}}^{+} - {A_{ik}}^{-} </math><br />
<br />
<br><br />
'''Theorem 1:''' (A) The update rule for F gives the optimal solution to the <math> min_F \|X - FG^T\|^2 </math>, while G is fixed. (B) When F is fixed, the residual <math> \|X - FG^T\|^2 </math> decreases monotonically under the update rule for G.<br />
<br />
'''Proof:'''<br />
<br />
(Not going to prove the entire theorem but discuss the main parts)<br />
<br />
The objective function for semi NMF is:<br />
<math> J=\|X - FG^T\|^2= Tr(X^TX - 2X^TFG^T + GF^TFG^T) </math>.<br />
<br />
(A).The problem is unconstrained and the solution for F is trivial, given by:<br />
<math>dJ/dF = -2XG + 2FG^TG = 0</math><br />
<br>Therefore, <math> F = XG(G^TG)^{-1} </math><br />
<br />
(B) This is a constrained problem with an inequality constraint, so it is solved using Lagrange multipliers; the solution given by the update rule must satisfy the KKT conditions at convergence, which establishes its correctness. Secondly, the update rule must cause the solution to converge. In the paper, the correctness and convergence of the update rule are proved as follows:<br />
<br />
<br><br />
<br />
(i)'''Correctness of solution:'''<br />
<br />
The Lagrange function is: <math> L(G) = Tr (-2X^TFG^T + GF^TFG^T - \beta G^T) </math> <br />
<br> where <math> \beta_{ik}</math> are the Lagrange multipliers enforcing the non negativity constraint on G.<br />
<br>Therefore, <math> \frac {\partial L}{\partial G}= -2X^TF + 2GF^TF - \beta = 0 </math> <br />
<br> From the complementary slackness condition, <math> (-2X^TF + 2GF^TF)_{ik}G_{ik} = \beta_{ik}G_{ik} = 0. </math> <br />
<br> This equation must be satisfied at convergence.<br />
<br> At convergence, the update rule for G can be shown to satisfy: <br />
<math> (-2X^TF + 2GF^TF)_{ik}{G_{ik}}^2 = 0. </math><br />
<br> The two equations are equivalent: for <math>G_{ik} > 0</math> both reduce to the same condition, and both hold when <math>G_{ik} = 0</math>. Therefore the update rule satisfies the KKT fixed point condition.<br />
<br><br />
<br />
<br />
(ii)'''Convergence of the solution given by update rule:'''<br />
<br />
The authors used an auxiliary function approach to prove convergence, as done in <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>.<br />
<br />
'''Definition of auxiliary function''': A function G(h,h') is called an auxiliary function of F(h) if the conditions <math> G(h,h') \ge F(h) </math> and <math> G(h,h) = F(h) </math> are satisfied. <br />
<br />
The auxiliary function is a useful concept because of the following lemma:<br />
<br><br />
<br />
'''Lemma:''' If G is an auxiliary function, then F is nonincreasing under the update <math>\mathbf{ h^{t+1} = \arg \min_h G(h,h^t)} </math><br />
<br />
[[File:auxiliary.jpeg|left|thumb|800px|Figure 1]]<br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
Adapted from <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”.</ref>.<br />
<br />
<br> That is, minimizing the auxiliary function <math> G(h,h^t) \ge F(h) </math> guarantees that <math> F(h^{t+1}) \le F(h^t) </math> for <math> \mathbf {h^{t+1} = \arg \min_h G(h, h^t) }</math> <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”.</ref>.<br />
<br />
Accordingly, the authors found an auxiliary function and its global minimum for the cost function of Semi NMF.<br />
<br />
The cost function for Semi NMF can be written as: <br />
<math> \mathbf {J(H) = Tr (-2H^TB^{+} + 2H^TB^{-} + HA^{+}H^T - HA^{-}H^T)} </math> where <math> A = F^TF , B = X^TF , H = G </math>. <br />
<br />
The auxiliary function of J (H) is: <br><br />
<math> Z(H,H') = -\sum_{ik}2{B_{ik}}^{+}H'_{ik}(1+ \log \frac {H_{ik}}{H'_{ik}}) + \sum_{ik} {B^-}_{ik} \frac {{H^2}_{ik}+{{H'}^2}_{ik}}{{H'}_{ik}} + \sum_{ik} \frac {(H'A^{+})_{ik}{H^2}_{ik}}{{H'}_{ik}} - \sum_{ikl} {A_{kl}}^{-}{H'}_{ik}{H'}_{il} (1+ \log \frac {H_{ik}H_{il}}{H'_{ik}H'_{il}}) </math> <br />
<br />
Z (H, H') is convex in H and its global minimum is:<br><br />
<math> H_{ik} = \arg \min_H Z(H,H') = H'_{ik}\sqrt {\frac {{B_{ik}}^{+} + (H'A^{-})_{ik}}{{B_{ik}}^{-} + (H'A^{+})_{ik}}} </math><br />
<br />
(The derivation of auxiliary function and its minimum can be found in the paper <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization” </ref>.)<br />
<br />
===Algorithm for Convex NMF===<br />
Here, the factors G and W are again computed by iterative alternate updating until convergence.<br />
*'''Step 1''': Initialize G and W. There are two ways in which the initialization can be done.<br />
**'''K means clustering''': When K means clustering is performed on the dataset, cluster indicators <math> H = (h_1, \cdots , h_K) </math> are obtained and G is initialized to H. The cluster centroids can then be computed from H as <math>\mathbf {f_k = Xh_k / n_k} </math> or <math> F=XH{D_n}^{-1}</math> where <math> D_n = diag (n_1, \cdots, n_K) </math>. Since in convex NMF <math>F = XW </math>, we get <math> W=H{D_n}^{-1}</math> <br />
**'''Previous NMF or Semi NMF solution''': The factor G is known in this case and a least squares solution for W is obtained by solving <math> X=XWG^T</math>. Therefore, <math> W=G(G^TG)^{-1} </math><br />
<br />
*'''Step 2''': Update G, while fixing W using the rule<br />
<math> G_{ik} \leftarrow G_{ik} \sqrt{\frac {[(X^TX)^+W]_{ik} + [GW^T(X^TX)^-W]_{ik}} {[(X^TX)^-W]_{ik} + [GW^T(X^TX)^+W]_{ik}} } </math><br />
*'''Step 3''': Update W, while fixing G using the rule<br />
<math> W_{ik} \leftarrow W_{ik} \sqrt{\frac {[(X^TX)^+G]_{ik} + [(X^TX)^-WG^TG]_{ik}} {[(X^TX)^-G]_{ik} + [(X^TX)^+WG^TG]_{ik}} } </math><br />
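The steps above can be sketched in NumPy. This is a sketch under the assumption of a random nonnegative initialization rather than the K means based one; <code>pos</code>/<code>neg</code> denote the elementwise positive/negative parts.<br />

```python
import numpy as np

def pos(A):
    return (np.abs(A) + A) / 2   # elementwise positive part

def neg(A):
    return (np.abs(A) - A) / 2   # elementwise negative part

def convex_nmf(X, k, n_iter=300, eps=1e-9, seed=0):
    """Convex NMF sketch: X ~ X W G^T with W >= 0, G >= 0."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    G = rng.random((n, k)) + 0.1
    W = rng.random((n, k)) + 0.1
    Kp, Kn = pos(X.T @ X), neg(X.T @ X)   # the updates use only X^T X
    for _ in range(n_iter):
        # Step 2: update G with W fixed
        G = G * np.sqrt((Kp @ W + G @ (W.T @ Kn @ W)) /
                        (Kn @ W + G @ (W.T @ Kp @ W) + eps))
        # Step 3: update W with G fixed
        GtG = G.T @ G
        W = W * np.sqrt((Kp @ G + Kn @ W @ GtG) /
                        (Kn @ G + Kp @ W @ GtG + eps))
    return W, G

X = np.random.default_rng(1).standard_normal((5, 30))
W, G = convex_nmf(X, 3)
err0 = np.linalg.norm(X) ** 2
err = np.linalg.norm(X - X @ W @ G.T) ** 2
```

Note that X itself enters the updates only through <math>X^TX</math>, which is what later makes the kernelization possible.<br />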
<br />
The objective function to be minimized for convex NMF is:<br />
<br />
<math> \mathbf {J=\|X-XWG^T\|^2= Tr(X^TX- 2G^TX^TXW + W^TX^TXWG^TG)} </math>.<br />
<br />
'''Theorem 2:''' Fixing W, under the update rule for G, (A) the residual <math>\|X - XWG^T \|^2 </math> decreases monotonically (non-increasing), and (B) the solution converges to a KKT fixed point.<br />
<br />
The correctness and convergence of these rules are demonstrated in a manner similar to Semi NMF by substituting <math>F=XW</math>.<br />
<br />
'''Theorem 3:''' Fixing G, under the update rule for W, (A) the residual <math>\|X - XWG^T \|^2 </math> decreases monotonically (non-increasing), and (B) the solution converges to a KKT fixed point.<br />
<br />
The correctness is demonstrated by minimizing the objective function with respect to W and obtaining the KKT fixed point condition:<br />
<br />
<math> \mathbf {(-X^TXG + X^TXWG^TG)_{ik}W_{ik} = 0 }</math><br />
<br />
<br> At convergence, the update rule for W can be shown to satisfy:<br />
<br />
<math>\mathbf { (-X^TXG + X^TXWG^TG)_{ik}{W_{ik}}^2 = 0 }</math><br />
<br />
<br> Therefore, the update rule for W satisfies the KKT condition.<br><br />
<br />
Convergence of these rules is demonstrated in a manner similar to Semi NMF by finding an auxiliary function and its global minimum.<br />
<br />
==Sparsity of Convex NMF==<br />
<br />
NMF has been shown to learn a parts based representation and therefore to have sparse factors. But there is no means to control the degree of sparseness, and many sparsification methods have been applied to NMF in order to obtain a better parts based representation <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization” </ref> , <ref name='Simon D. H' > Simon D. H, Ding C, and He X; “On the Equivalence of Nonnegative Matrix Factorization and Spectral Clustering” </ref>. In contrast, the authors of this paper show that the factors of Convex NMF are naturally sparse.<br />
<br />
<br> The convex NMF problem can be written as:<br />
<br />
<math> \min_{W,G \ge 0}||X-XWG^T||^2 = ||X(I-WG^T)||^2= Tr [(I-GW^T)X^TX(I-WG^T)] </math><br />
<br />
<br> by SVD of <math> X </math> we have <math> X = U \Sigma V^T</math> and thus, <math> X^TX = \sum_k {\sigma _k}^2v_k{v_k}^T.</math><br />
<br />
<br> Therefore, <math> \min_{W,G \ge 0} Tr [(I-GW^T)X^TX(I-WG^T)] = \sum_k {\sigma_k}^2||{v_k}^T(I-WG^T)||^2 </math> s.t. <math>W \in {\mathbb R_+}^{n \times k} </math> , <math>G \in {\mathbb R_+}^{n \times k}</math><br />
<br />
They use the following Lemma to show that the above optimization problem gives sparse W and G.<br />
<br />
<br>'''Lemma:''' The solution of the optimization problem <math> \min_{W,G \ge 0}||I-WG^T||^2 </math> s.t. <math>W, G \in {\mathbb R_+}^{n \times K}</math> is given by W = G = any K columns of <math>(e_1, \cdots , e_n)</math>, where <math>e_k</math> is a basis vector: <math> (e_k)_{i \ne k} = 0 </math> , <math> (e_k)_{i = k} = 1 </math><br />
<br />
<br> According to this Lemma, the solutions to <math> \min_{W,G \ge 0}\|I - WG^T\|^2 </math> are the sparsest possible rank-K matrices W and G.<br />
<br />
In the above equation, we can write: <math> \| I - WG^T \|^2 = \sum_k \|{e_k}^T (I - WG^T)\|^2 </math>.<br />
<br />
Therefore, the projection of <math> ( I - WG^T ) </math> onto the principal components carries more weight, while its projection onto the non principal components carries less weight. This implies that the factors W and G are sparse in the principal component subspace and less sparse in the non principal component subspace.<br />
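The Lemma can be checked numerically: choosing W = G = K basis-vector columns makes <math>WG^T</math> a diagonal 0/1 matrix, so the residual is exactly <math>n-K</math> (a small sketch):<br />

```python
import numpy as np

n, K = 6, 2
E = np.eye(n)
W = G = E[:, :K]   # K columns chosen from the basis vectors e_1, ..., e_n
# W G^T has ones on the first K diagonal entries and zeros elsewhere,
# so I - W G^T has exactly n - K nonzero (unit) entries
residual = np.linalg.norm(np.eye(n) - W @ G.T, 'fro') ** 2
```

Any denser nonnegative W and G can only match, not beat, this residual, which is why the optimal factors are the sparsest ones.<br />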
<br />
==Kernel NMF==<br />
Consider a mapping <math> \phi </math> that maps a point to a higher dimensional feature space, such that <math> \phi: x_i \rightarrow \phi(x_i)</math>. The factors for the kernel form of NMF or semi NMF, <math> \phi (X) \approx FG^T </math>, would be difficult to compute, as we would need to know the mapping <math>\phi </math> explicitly.<br />
<br />
This difficulty is overcome in convex NMF, as it has the form <math> \phi (X) \approx \phi (X) WG^T </math>; therefore the objective to be minimized becomes<br />
<br> <math> \|\phi (X)-\phi(X)WG^T\|^2 = Tr (K-2G^TKW+W^TKWG^TG) </math> where <math> K = \phi^T(X)\phi(X) </math> is the kernel.<br />
<br />
Also, the update rules for the convex NMF algorithm (discussed above) depend only on <math> X^TX </math>, and therefore convex NMF can be '''kernelized''' by replacing <math> X^TX </math> with <math> K </math>.<br />
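A sketch of this kernelization: the convex NMF updates, rewritten to take an arbitrary kernel matrix K in place of <math>X^TX</math>. The RBF kernel below is only an illustrative choice, and the random nonnegative initialization again stands in for a clustering-based one.<br />

```python
import numpy as np

def pos(A):
    return (np.abs(A) + A) / 2

def neg(A):
    return (np.abs(A) - A) / 2

def kernel_convex_nmf(K, k, n_iter=300, eps=1e-9, seed=0):
    """Convex NMF driven only by K = phi(X)^T phi(X); X itself is never needed."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    G = rng.random((n, k)) + 0.1
    W = rng.random((n, k)) + 0.1
    Kp, Kn = pos(K), neg(K)
    for _ in range(n_iter):
        G = G * np.sqrt((Kp @ W + G @ (W.T @ Kn @ W)) /
                        (Kn @ W + G @ (W.T @ Kp @ W) + eps))
        GtG = G.T @ G
        W = W * np.sqrt((Kp @ G + Kn @ W @ GtG) /
                        (Kn @ G + Kp @ W @ GtG + eps))
    return W, G

X = np.random.default_rng(2).standard_normal((3, 30))
sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
K = np.exp(-sq)                  # RBF kernel K_ij = exp(-||x_i - x_j||^2)
W, G = kernel_convex_nmf(K, 2)
```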
<br />
==Cluster NMF==<br />
<br />
If the factor G is considered to contain posterior cluster probabilities, then F, which represents the cluster centroids, is given by:<br />
<br> <math> \mathbf {f_k = Xg_k / n_k} </math> or <math> F = XG{D_n}^{-1}</math> where <math> D_n = diag (n_1, \cdots, n_K) </math>.<br />
<br>Therefore, the factorization becomes <math> X \approx XG{D_n}^{-1}G^T </math> or, absorbing the diagonal rescaling into G (NMF is invariant to diagonal rescaling), <math> X \approx X G G^T </math>.<br />
<br />
This factorization is called Cluster NMF as it has the same degree of freedom as in any standard clustering problem, which is G (cluster indicator).<br />
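The centroid computation <math> F = XG{D_n}^{-1}</math> is easy to verify on hard cluster indicators (a tiny sketch with hypothetical labels):<br />

```python
import numpy as np

labels = np.array([0, 0, 1, 1, 1])          # hypothetical hard assignments
n, K = labels.size, 2
G = np.zeros((n, K))
G[np.arange(n), labels] = 1.0               # cluster indicator matrix
X = np.arange(10.0).reshape(2, 5)           # p x n data matrix

Dn_inv = np.diag(1.0 / G.sum(axis=0))       # D_n = diag(n_1, ..., n_K)
F = X @ G @ Dn_inv                          # columns are the cluster means
```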
<br />
==Relationship between NMF (its variants) and K means clustering==<br />
<br />
NMF and all its variants discussed above can be interpreted as K means clustering by imposing the additional orthogonality constraint <math> G^TG=I </math>; that is, each row of G has only one nonzero element, which implies that each data point can belong to only one cluster.<br />
<br />
'''Theorem:''' G-orthogonal NMF, Semi NMF, Convex NMF, Cluster NMF and Kernel NMF are all relaxations of K means clustering.<br />
<br />
'''Proof:'''<br />
<br />
In all the above five cases of NMF, it can be shown that the objective function can be reduced to:<br />
<math> \mathbf {J = Tr(X^TX -G^TKG)} </math> when <math> G^TG = I </math> and where <math> K = X^TX </math> or <math> K = \phi^T(X)\phi(X) </math>. As the first term is a constant, the minimization problem actually becomes: <br><br />
<math> \max_{G^TG = I} Tr(G^TKG) </math><br />
<br />
The above objective function is the same as the objective function for kernel K means clustering <ref name='Simon D. H'> Simon D. H, Ding C, and He X; “On the Equivalence of Nonnegative Matrix Factorization and Spectral Clustering” </ref>.<br />
<br />
<br> Even without the orthogonality constraint, these NMF algorithms can be considered '''soft''' versions of K means clustering; that is, each data point may fractionally belong to more than one cluster.<br />
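If the nonnegativity constraint is dropped as well, <math> \max_{G^TG = I} Tr(G^TKG) </math> is solved exactly by the top K eigenvectors of K (the Ky Fan theorem), which is the usual spectral relaxation; a quick numerical check:<br />

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((4, 20))
K = X.T @ X                        # linear kernel matrix
k = 3

vals, vecs = np.linalg.eigh(K)     # eigenvalues in ascending order
G = vecs[:, -k:]                   # top-k eigenvectors, so G^T G = I
best = vals[-k:].sum()             # maximal value of Tr(G^T K G)

Q, _ = np.linalg.qr(rng.standard_normal((20, k)))  # some other orthonormal G
other = np.trace(Q.T @ K @ Q)      # can never exceed `best`
```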
<br />
==General properties of NMF algorithms==<br />
*They converge to a local minimum, not necessarily the global minimum.<br />
*NMF factors are invariant to rescaling, i.e. the degree of freedom of a diagonal rescaling is always present.<br />
*The convergence rate of the multiplicative update algorithms is first order.<br />
*There are many different ways to initialize NMF. Here, the relationship between NMF and relaxed K means clustering is used.<br />
<br />
==Experimental Results==<br />
<br />
The authors present experimental results on a synthetic dataset to show that the factors given by Convex NMF resemble cluster centroids more closely than those given by Semi NMF, although the Semi NMF results are better in terms of accuracy. They also compare NMF, convex NMF and semi NMF with K means clustering on real datasets, and conclude that all of these matrix factorizations give better clustering accuracy than K means on all of the datasets they studied.<br />
<br />
=== A. Synthetic dataset ===<br />
One of the main goals here is to show that the Convex-NMF variants may provide subspace factorizations with more interpretable factors than those obtained by other NMF variants (or PCA). In particular, we expect that in some cases the factor F will be interpretable as containing cluster representatives (centroids) and G as encoding cluster indicators. <br />
<center>[[File:Convex-Fig1.JPG]]</center><br />
In Figure 1, we randomly generate four two-dimensional datasets with three clusters each. Computing both the Semi-NMF and Convex-NMF factorizations, we display the resulting F factors. We see that the Semi-NMF factors tend to lie distant from the cluster centroids. On the other hand, the Convex-NMF factors almost always lie within the clusters.<br />
<br />
=== B. Real life datasets ===<br />
The datasets used are: Ionosphere and Wave from the UCI repository; the document datasets URCS, WebkB4, Reuters (using a subset of the data collection which includes the 10 most frequent categories) and WebAce; and a dataset containing 1367 log messages collected from several different machines with different operating systems at the School of Computer Science at Florida International University. The log messages are grouped into 9 categories: configuration, connection, create, dependency, other, report, request, start, and stop. Stop words were removed using a standard stop list, and the top 1000 words were selected based on frequencies.<br />
<br />
<center>[[File:Convex-Table1.JPG]]</center><br />
<br />
The results are shown in Table I, averaged over 10 runs for each dataset and algorithm. Clustering accuracy was computed using the known class labels as follows: the confusion matrix is first computed, and its columns and rows are then reordered so as to maximize the sum of the diagonal. This sum is taken as the measure of accuracy: it represents the percentage of data points correctly clustered under the optimized permutation. To measure the sparsity of G, the average of each column of G was computed and all elements below 0.001 times the average were set to zero; the number of remaining nonzero elements is reported as a percentage of the total number of elements (thus small values of this measure correspond to high sparsity). We can observe that: <br />
<br />
1. The main empirical result indicates that all of the matrix factorization models are better than K-means on all of the datasets; that is, the NMF family is competitive with K-means for the purposes of clustering. <br />
<br />
2. On most of the nonnegative datasets, NMF gives somewhat better accuracy than Semi-NMF and Convex-NMF (with WebKb4 the exception). The differences are modest, however, suggesting that the more highly constrained Semi-NMF and Convex-NMF may be worthwhile options if interpretability is viewed as a goal of the data analysis. <br />
<br />
3. On the datasets containing both positive and negative values (where NMF is not applicable), the Semi-NMF results are better in terms of accuracy than the Convex-NMF results. <br />
<br />
4. In general, Convex-NMF solutions are sparse, while Semi-NMF solutions are not. <br />
<br />
5. Convex-NMF solutions are generally significantly more orthogonal than Semi-NMF solutions.<br />
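One way to implement the accuracy measure described above (reordering the confusion matrix to maximize its diagonal sum) is via the Hungarian algorithm; a sketch assuming scipy is available:<br />

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, pred_labels):
    """Fraction correctly clustered under the best cluster-to-class matching."""
    k = int(max(true_labels.max(), pred_labels.max())) + 1
    confusion = np.zeros((k, k), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        confusion[t, p] += 1
    # Hungarian algorithm on -confusion maximizes the diagonal sum
    rows, cols = linear_sum_assignment(-confusion)
    return confusion[rows, cols].sum() / len(true_labels)

true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
pred = np.array([2, 2, 2, 0, 0, 1, 1, 1])   # same partition, permuted labels
acc = clustering_accuracy(true, pred)
```

Because cluster labels are arbitrary, a perfect clustering with permuted labels still scores 1.0 under this measure.<br />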
<br />
<br />
=== C. Shifting mixed-sign data to nonnegative ===<br />
<br />
In this section the data were made nonnegative by adding the smallest constant that makes all entries nonnegative, and experiments were performed on the Wave and Ionosphere data shifted in this way. For Wave, the accuracy decreases to 0.503 from 0.590 for Semi-NMF and to 0.5297 from 0.5738 for Convex-NMF; the sparsity increases to 0.586 from 0.498 for Convex-NMF. For Ionosphere, the accuracy decreases to 0.647 from 0.729 for Semi-NMF and to 0.618 from 0.6877 for Convex-NMF; the sparsity increases to 0.829 from 0.498 for Convex-NMF. <br />
<br />
<center>[[File:Convex-Fig2.JPG]]</center><br />
<br />
In short, the shifting approach does not appear to provide a satisfactory alternative.<br />
<br />
=== D. Flexibility of NMF ===<br />
In general NMF almost always performs better than K-means in terms of clustering accuracy while also providing a matrix approximation. This could be due to the flexibility of matrix factorization compared with the rigid spherical clusters that the K-means objective function attempts to capture; when the data distribution is far from spherical clusters, NMF may have advantages. Figure 2 gives an example. The dataset consists of two parallel rods in 3D space containing 200 data points. The two central axes of the rods are 0.3 apart, and the rods have diameter 0.1 and length 1. As seen in the figure, K-means gives a poor clustering, while NMF yields a good clustering. The bottom panel of Figure 2 shows the differences between the columns of G (each column is normalized so that <math>\sum_i g_k(i) = 1</math>); the mis-clustered points have small differences. Note that NMF is initialized randomly for the different runs. The stability of the solution over multiple runs was investigated; the results indicate that NMF converges to solutions F and G that are very similar across runs, and the resulting discretized cluster indicators were identical.<br />
<br />
==Conclusion==<br />
In this paper: <br />
*A number of new NMF-like algorithms are proposed, which extend the applications of NMF.<br />
*These algorithms deal with mixed sign data.<br />
*The connection between NMF (and its variants) and K means clustering was analyzed.<br />
*The matrix factors are shown to have a convenient interpretation in terms of clustering.<br />
<br />
==References==<br />
<references/></div>
<hr />
<div>In the paper ‘Convex and semi non negative matrix factorization’, Jordan et al <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization”. </ref> have proposed new NMF like algorithms on mixed sign data, called Semi NMF and Convex NMF. They also show that a kernel form of NMF can be obtained by ‘kernelizing’ convex NMF. They explore the connection between NMF algorithms and K means clustering to show that these NMF algorithms can be used for clustering in addition to matrix approximation. These new variants of algorithm thereby, broaden the application areas of NMF algorithm and also provide better interpretability to matrix factors.<br />
<br />
==Introduction==<br />
Nonnegative matrix factorization (NMF), factorizes a matrix X into two matrices F and G, with the constraints that all the three matrices are non negative i.e. they contain only positive values or zero but no negative values, such as:<br />
<math>X_+ \approx F_+{G_+}^T</math><br />
where ,<math> X \in {\mathbb R}^{p \times n}</math> , <math> F \in {\mathbb R}^{p \times k}</math> , <math> G \in {\mathbb R}^{n \times k}</math><br />
<br />
The least square objective function of NMF is:<br />
<math> \mathbf {E(F,G) = \|X-FG^T\|^2}</math><br />
<br />
It has been shown that it is a NP hard problem and is convex in only F or only G but not convex in both F and G simultaneously <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref> Also, the factors F and G are not always sparse and many different sparsification schemes have been applied to NMF.<br />
<br />
==Semi NMF==<br />
In semi NMF, the matrix G is constrained to be nonnegative whereas the data matrix X and the basis vectors of F are unconstrained, that is:<br />
<br />
<math>X_{\pm} \approx F_{\pm}{G_+}^T</math><br />
<br />
They were motivated to this kind of factorization by K means clustering. The objective function of K means can be written in the form of matrix approximation as follows:<br />
<br />
<math> J_{K-means} = \sum_{i=1}^n \sum_{k=1}^K g_{ik}||x_i-f_k||^2=||X-FG^T||^2 </math> <br />
<br />
where, X is a mixed sign data matrix , F represents cluster centroids having both positive and negative entries and G represents cluster indicators having nonnegative entries.<br />
<br />
K means clustering objective function can be viewed as Semi NMF matrix approximation with relaxed constraint on G. That is G is allowed to range over values (0, 1) or (0, infinity).<br />
<br />
==Convex NMF==<br />
While in Semi NMF, there is no constraint imposed upon the basis vector F, but in Convex NMF, the columns of F are restricted to be a convex combination of columns of data matrix X, such as:<br />
<br />
<math> F=(f_1, \cdots , f_k)</math><br />
<br />
<math> f_l=w_{1l}x_1+ \cdots + w_{nl}x_n = Xw_l = XW</math> such that,<br />
<math> w_{ij}>0</math> <math>\forall i,j </math> <br />
<br />
In this factorization each column of matrix F is a weighted sum of certain data points. This implies that we can think of F as weighted cluster centroids.<br />
<br />
Convex NMF has the form:<br />
<math> X_{\pm} \approx X_{\pm}W_+{G_+}^T</math><br />
<br />
As F is considered to represent weighted cluster centroid, the constraint <math> \sum _{i=1}^n w_i = 1 </math> must be satisfied. But the authors do not actually state this.<br />
<br />
==SVD, Convex-NMF and Semi-NMF Comparison==<br />
Considering G and F as the result of matrix factorization through SVD, Convex-NMF, and semi-NMF factorizatrion, It can be shown that <br />
Semi-NMF and Convex-NMF factorizations gives clustering results identical to the K-means clustering.<br /><br />
Sharper indicators of the clustering is given by Convex-NMF.<br /><br />
<math>\,F_{cnvx}</math> is close to <math>\,C_{Kmeans}</math>, however, <math>\,F_{semi}</math>is not. The intuition behind this is taht F can have large effects on subspace factorization<br /><br />
Getting larger residual values, <math>\,\|X-FG^T\|</math> for Convex-NMF comes up with the fact that more highly constrained Convex-NMF<br />
<br />
==Algorithms==<br />
The algorithms for these variants of NMF are based on iterative updating algorithms proposed for the original NMF, in which the factors are alternatively updated until convergence <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>. At each iteration of algorithm, the value for F or G is found by multiplying its current value by some factor. In <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>, they prove that by repeatedly applying these multiplicative update rules, the quality of approximation smoothly improves. That is, the update rule guarantees convergence to a locally optimal matrix factorization. In this paper, the same approach has been used by authors to present the algorithms for Semi NMF and Convex NMF.<br />
<br />
===Algorithm for Semi NMF===<br />
<br />
As already stated, the factors for semi NMF are computed by using an iterative updating algorithm that alternatively updates F and G till convergence is reached.<br />
<br />
*'''Step 1''': Initialize G<br />
**Obtain cluster indicators by K means clustering. <br />
*'''Step 2''': Update F, fixing G using the rule:<br />
<math>\mathbf{ F = XG(G^TG)^{-1}} </math><br />
<br />
*'''Step 3''': Update G, fixing F using the rule:<br />
<math> G_{ik} \leftarrow G_{ik} \sqrt{\frac {{(X^TF)^+}_{ik} + [G(F^TF)^-]_{ik}}{{(X^TF)^-}_{ik} + [G(F^TF)^+]_{ik}}}</math><br />
<br />
where, the positive and negative parts of a matrix are separated as:<br />
<math> {A_{ik}}^{+}=(|A_{ik}|+A_{ik})/2 </math> , <math> {A_{ik}}^{-}=(|A_{ik}|- A_{ik})/2 </math><br />
<br />
and, <math> A_{ik}= {A_{ik}}^{+} - {A_{ik}}^{-} </math><br />
<br />
<br><br />
'''Theorem 1:''' (A) The update rule for F gives the optimal solution to the <math> min_F \|X - FG^T\|^2 </math>, while G is fixed. (B) When F is fixed, the residual <math> \|X - FG^T\|^2 </math> decreases monotonically under the update rule for G.<br />
<br />
'''Proof:'''<br />
<br />
(Not going to prove the entire theorem but discuss the main parts)<br />
<br />
The objective function for semi NMF is:<br />
<math> J=\|X - FG^T\|^2= Tr(X^TX - 2X^TFG^T + GF^TFG^T) </math>.<br />
<br />
(A).The problem is unconstrained and the solution for F is trivial, given by:<br />
<math>dJ/dF = -2XG + 2FG^TG = 0</math><br />
<br>Therefore, <math> F = XG(G^TG)^{-1} </math><br />
<br />
(B).This is a constraint problem having an inequality constraint. Because it is a constraint problem, solved by using Lagrange multipliers but the solution for the update rule must satisfy KKT condition at convergence. This implies the correctness of solution. Secondly, the update rule should cause the solution to converge. In the paper, correctness and convergence of update rule is proved as follows:<br />
<br />
<br><br />
<br />
(i)'''Correctness of solution:'''<br />
<br />
Lagrange function is: <math> L(G) = Tr (-2X^TFG^T + GF^TFG^T - \Beta G^T) </math> <br />
<br> where, <math> \Beta_{ij}</math> are the Lagrange multipliers enforcing the non negativity constraint on G.<br />
<br>Therefore, <math> \frac {\part L}{\part G}= -2X^TF + 2GF^TF - \Beta = 0 </math> <br />
<br> From complementary slackness condition, <math> (-2X^TF + 2GF^TF)_{ik}G_{ik} = \Beta_{ik}G_{ik} = 0. </math> <br />
<br> The above equation must be satisfied at convergence.<br />
<br> The update rule for G can be reduced to: <br />
<math> (-2X^TF + 2GF^TF)_{ik}{G_{ik}}^2 = 0 </math> at convergence.<br />
<br> Both equations are identical and therefore the update rule satisfies the KKT fixed point condition.<br />
<br><br />
<br />
<br />
(ii)'''Convergence of the solution given by update rule:'''<br />
<br />
The authors used an auxiliary function approach to prove convergence, as done in <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>.<br />
<br />
'''Definition of auxiliary function''': A function G(h,h') is called an auxiliary function of F(h) if conditions; <math> G (h,h^') \ge F(h) </math> and <math> G (h,h) = F(h) </math> are satisfied. <br />
<br />
The auxiliary function is a useful concept because of the following lemma:<br />
<br><br />
<br />
'''Lemma:''' If G is an auxiliary function, then F is nonincreasing under the update <math>\mathbf{ h^{t+1} = \arg \min_h G(h,h^t)} </math><br />
<br />
[[File:auxiliary.jpeg|left|thumb|800px|Figure 1]]<br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
Adapted from <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”.</ref>.<br />
<br />
<br> That is, minimizing the auxiliary function <math> G(h,h^t) \ge F(h) </math> guarantees that <math> F(h^{t+1}) \le F(h^t) </math> for <math> \mathbf {h^{n+1} = \arg \min_h G(h, h^t) }</math> <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”.</ref>.<br />
<br />
Therefore the authors of the paper, found an auxiliary function and its global minimum for the cost function of Semi NMF.<br />
<br />
The cost function for Semi NMF can be written as: <br />
<math> \mathbf {J(H) = Tr (-2H^TB^{+} + 2H^TB^{-} + HA^{+}H^T + HA^{-}H^T)} </math> where <math> A = F^TF , B = X^TF , H = G </math>. <br />
<br />
The auxiliary function of J (H) is: <br><br />
<math> Z(H,H') = -\sum_{ik}2{B_{ik}}^{+}H'_{ik}(1+ \log \frac {H_{ik}}{H'_{ik}}) + \sum_{ik} {B^-}_{ik} \frac {{H^2}_{ik}+{{H'}^2}_{ik}}{{H'}_{ik}} + \sum_{ik} \frac {(H'A^{+})_{ik}{H^2}_{ik}}{{H'}_{ik}} - \sum_{ik} {A_{kl}}^{-}{H'}_{ik}{H'}_{il} (1+ \log \frac {H_{ik}H_{il}}{H'_{ik}H'_{il}}) </math> <br />
<br />
Z (H, H') is convex in H and its global minimum is:<br><br />
<math> H_{ik} = arg \min_H Z(H,H') = H'_{ik}\sqrt {\frac {{B_{ik}}^{+} + (H'A^{-})_{ik}}{{B_{ik}}^{-} + (H'A^{+})_{ik}}} </math><br />
<br />
(The derivation of auxiliary function and its minimum can be found in the paper <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization” </ref>.)<br />
<br />
===Algorithm for Convex NMF===<br />
Here, again the factors G and W are computed iteratively by alternative updating until convergence.<br />
*'''Step 1''': Initialize G and W. There are two ways in which the initialization can be done.<br />
**'''K means clustering''': When K means clustering is done on the data set, cluster indicators <math> H = (h_1, \cdots , h_K) </math>are obtained. Then G is initialized to be equal to H. Then cluster centroids can be computed from H, as <math>\mathbf {f_k = Xh_k / n_k} </math> or <math> F=XH{D_n}^{-1}</math> where <math> D_n = diag (n_1, \cdots, n_K) </math>. And as, in convex NMF: <math>F = XW </math> , we get <math> W=H{D_n}^{-1}</math> <br />
**'''Previous NMF or Semi NMF solution''': The factor G is known in this case and a least square solution for W is obtained by solving <math> X=XWG^T</math>. Therefore, <math> W=G(G^TG)^{-1} </math><br />
<br />
*'''Step 2''': Update G, while fixing W using the rule<br />
<math> G_{ik} \leftarrow G_{ik} \sqrt{\frac {[(X^TX)^+W]_{ik} + [GW^T(X^TX)^-W]_{ik}} {[(X^TX)^-W]_{ik} + [GW^T(X^TX)^+W]_{ik}} } </math><br />
*'''Step 3''': Update W, while fixing G using the rule<br />
<math> W_{ik} \leftarrow W_{ik} \sqrt{\frac {[(X^TX)^+G]_{ik} + [(X^TX)^-WG^TG]_{ik}} {[(X^TX)^-G]_{ik} + [(X^TX)^+WG^TG]_{ik}} } </math><br />
<br />
The objective function to be minimized for convex NMF is:<br />
<br />
<math> \mathbf {J=\|X-XWG^T\|^2= Tr(X^TX- 2G^TX^TXW + W^TX^TXWG^TG)} </math>.<br />
<br />
'''Theorem 2:''' Fixing W, under the update rule for G, (A) the residual <math>\|X - XWG^T \|^2 </math> decreases monotonically (non-increasing), and (B) the solution converges to a KKT fixed point.<br />
<br />
The correctness and convergence of these rules is demonstrated in a manner similar to Semi NMF by replacing F=XW.<br />
<br />
'''Theorem 3:''' Fixing G, under the update rule for W, (A) the residual <math>\|X - XWG^T \|^2 </math> decreases monotonically (non-increasing), and (B) the solution converges to a KKT fixed point.<br />
<br />
The correctness is demonstrated by minimizing the objective function with respect to W and then obtaining KKT fixed point condition as:<br />
<br />
<math> \mathbf {(-X^TXG + X^TXWG^TG)_{ik}W_{ik} = 0 }</math><br />
<br />
<br> At convergence, the update rule for W can be shown to satisfy:<br />
<br />
<math>\mathbf { (-X^TXG + X^TXWG^TG)_{ik}{W_{ik}}^2 = 0 }</math><br />
<br />
<br> Therefore, the update rule for W satisfies KKT condition.<br><br />
<br />
Convergence of these rules is demonstrated in a manner similar to Semi NMF by finding an auxiliary function and its global minimum.<br />
<br />
==Sparsity of Convex NMF==<br />
<br />
NMF is shown to learn parts based representation and therefore has sparse factors. But there is no means to control the degree of sparseness and many sparsification methods have been applied to NMF in order to obtain better parts based representation <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization” </ref> , <ref name='Simon D. H' > Simon D. H, Ding C, and He X; “On the Equivalence of Nonnegative Matrix Factorization and Spectral Clustering” </ref>. However, in contrast the authors of this paper show that factors of Convex NMF are naturally sparse.<br />
<br />
<br> The convex NMF problem can be written as:<br />
<br />
<math> \min_{W,G \ge 0}\|X-XWG^T\|^2 = \|X(I-WG^T)\|^2 = Tr\left[(I-GW^T)X^TX(I-WG^T)\right] </math><br />
<br />
<br> By the SVD of <math> X </math> we have <math> X = U \Sigma V^T</math> and thus <math> X^TX = \sum_k {\sigma _k}^2v_k{v_k}^T.</math><br />
<br />
<br> Therefore, <math> \min_{W,G \ge 0} Tr\left[(I-GW^T)X^TX(I-WG^T)\right] = \sum_k {\sigma_k}^2\|{v_k}^T(I-WG^T)\|^2 </math> s.t. <math>W \in {\mathbb R_+}^{n \times k} </math> , <math>G \in {\mathbb R_+}^{n \times k}</math><br />
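The weighted decomposition above holds for any W and G, not only at the minimum; a quick numerical check (the dimensions are arbitrary illustrative choices):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))           # mixed-sign data, p=5, n=8
W = rng.random((8, 3))                    # arbitrary nonnegative W and G
G = rng.random((8, 3))

M = np.eye(8) - W @ G.T                   # I - W G^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)

lhs = np.linalg.norm(X @ M, 'fro') ** 2   # ||X (I - W G^T)||^2
rhs = sum(s[k] ** 2 * np.linalg.norm(Vt[k] @ M) ** 2 for k in range(len(s)))
assert np.isclose(lhs, rhs)               # sum_k sigma_k^2 ||v_k^T (I - W G^T)||^2
```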
<br />
They use the following Lemma to show that the above optimization problem gives sparse W and G.<br />
<br />
<br>'''Lemma:''' The solution of the optimization problem <math> \min_{W,G \ge 0}\|I-WG^T\|^2 </math> s.t. <math>W, G \in {\mathbb R_+}^{n \times K}</math> is given by W = G = any K columns of <math>(e_1, \cdots , e_n)</math>, where <math>e_k</math> is the k-th standard basis vector: <math> (e_k)_{i \ne k} = 0 </math> , <math> (e_k)_{i = k} = 1 </math><br />
<br />
<br> According to this Lemma, the solutions to <math> \min_{W,G \ge 0}\|I - WG^T\|^2 </math> are the sparsest possible rank-K matrices W and G.<br />
<br />
For comparison, in the optimization problem of the Lemma we can write: <math> \| I - WG^T \|^2 = \sum_k \|{e_k}^T (I - WG^T)\|^2 </math>, with every direction weighted equally.<br />
<br />
Therefore, in the Convex NMF objective, the projection of <math> ( I - WG^T ) </math> onto the principal components receives more weight (large <math>{\sigma_k}^2</math>) while its projection onto the non-principal components receives less weight. This implies that the factors W and G are sparsest in the principal component subspace and less sparse in the non-principal component subspace.<br />
<br />
==Kernel NMF==<br />
Consider a mapping <math> \phi </math> that maps each point to a higher dimensional feature space, <math> \phi: x_i \rightarrow \phi(x_i)</math>. The factors for the kernel form of NMF or Semi NMF, <math> \phi (X) \approx FG^T </math>, would be difficult to compute, as we would need to know the mapping <math>\phi </math> explicitly.<br />
<br />
This difficulty is overcome in Convex NMF, which has the form <math> \phi (X) \approx \phi (X) WG^T </math>; therefore the objective to be minimized becomes<br />
<br> <math> \|\phi (X)-\phi(X)WG^T\|^2 = Tr (K-2G^TKW+W^TKWG^TG) </math> where <math> K = \phi^T(X)\phi(X) </math> is the kernel.<br />
<br />
Also, the update rules for the convex NMF algorithm (discussed above) depend only on <math> X^TX </math> and therefore convex NMF can be '''kernelized'''.<br />
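As a sketch of this kernelization, the updates can be run directly on a kernel matrix; the RBF kernel, random initialization and function names below are illustrative assumptions, not the paper's setup.<br />

```python
import numpy as np

def kernel_convex_nmf(K, k, n_iter=200, eps=1e-9, seed=0):
    """Kernel convex NMF sketch: the convex-NMF multiplicative updates
    with X^T X replaced by a symmetric n x n kernel matrix K."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    W, G = rng.random((n, k)), rng.random((n, k))
    Kp = (np.abs(K) + K) / 2              # K^+  (positive part)
    Kn = (np.abs(K) - K) / 2              # K^-  (negative part)
    for _ in range(n_iter):
        G *= np.sqrt((Kp @ W + G @ (W.T @ Kn @ W) + eps) /
                     (Kn @ W + G @ (W.T @ Kp @ W) + eps))
        GtG = G.T @ G
        W *= np.sqrt((Kp @ G + Kn @ W @ GtG + eps) /
                     (Kn @ G + Kp @ W @ GtG + eps))
    return W, G

def kernel_objective(K, W, G):
    """Tr(K - 2 G^T K W + W^T K W G^T G), the kernelized residual."""
    return (np.trace(K) - 2 * np.trace(G.T @ K @ W)
            + np.trace(W.T @ K @ W @ G.T @ G))

# RBF kernel K = phi(X)^T phi(X) on 20 random 2-D points.
pts = np.random.default_rng(1).standard_normal((20, 2))
sq = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq)
W, G = kernel_convex_nmf(K, k=2)
```

The objective is nonnegative for a positive semidefinite kernel, and the updates should not increase it relative to the initialization.<br />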
<br />
==Cluster NMF==<br />
<br />
If the factor G is considered to contain posterior cluster probabilities, then F, which represents the cluster centroids, is given as:<br />
<br> <math> \mathbf {f_k = Xg_k / n_k} </math> or <math> F = XG{D_n}^{-1}</math> where <math> D_n = diag (n_1, \cdots, n_K) </math>.<br />
<br>Therefore, the factorization becomes <math> X \approx XG{D_n}^{-1}G^T </math> or, absorbing the diagonal factor into G, <math> X \approx X G G^T </math>, since NMF is invariant to diagonal rescaling.<br />
<br />
This factorization is called Cluster NMF as it has the same degrees of freedom as a standard clustering problem, namely the cluster indicator G.<br />
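A small worked example of the centroid formula and the Cluster NMF form, using a made-up one-dimensional dataset with a hard cluster indicator:<br />

```python
import numpy as np

# p = 1, n = 6: two well-separated groups of points.
X = np.array([[0., 1., 2., 10., 11., 12.]])

# Hard cluster indicator (rows one-hot), K = 2 clusters.
G = np.array([[1., 0.], [1., 0.], [1., 0.],
              [0., 1.], [0., 1.], [0., 1.]])

n_k = G.sum(axis=0)                        # cluster sizes (n_1, n_2) = (3, 3)
F = X @ G @ np.diag(1.0 / n_k)             # f_k = X g_k / n_k  ->  [[1., 11.]]

# Absorbing D_n^{-1/2} into G gives the Cluster NMF form X ~= X G' G'^T,
# which replaces every point by its cluster centroid.
G_scaled = G / np.sqrt(n_k)
approx = X @ G_scaled @ G_scaled.T         # [[1, 1, 1, 11, 11, 11]]
```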
<br />
==Relationship between NMF (its variants) and K means clustering==<br />
<br />
NMF and all its variants discussed above can be interpreted as K means clustering by imposing the additional constraint <math> G^TG=I </math>; that is, each row of G has only one nonzero element, which implies each data point can belong to only one cluster.<br />
<br />
'''Theorem:''' G-orthogonal NMF, Semi NMF, Convex NMF, Cluster NMF and Kernel NMF are all relaxations of K means clustering.<br />
<br />
'''Proof:'''<br />
<br />
In all the above five cases of NMF, it can be shown that the objective function can be reduced to:<br />
<math> \mathbf {J = Tr(X^TX -G^TKG)} </math> when <math> G^TG = I </math> and where <math> K = X^TX </math> or <math> K = \phi^T(X)\phi(X) </math>. As the first term is a constant, the minimization problem actually becomes: <br><br />
<math> \max_{G^TG = I} Tr(G^TKG) </math><br />
<br />
The above objective function is the same as the objective function for kernel K means clustering <ref name='Simon D. H'> Simon D. H, Ding C, and He X; “On the Equivalence of Nonnegative Matrix Factorization and Spectral Clustering” </ref>.<br />
<br />
<br> Even without the orthogonality constraint, these NMF algorithms can be considered '''soft''' versions of K means clustering; that is, each data point can fractionally belong to more than one cluster.<br />
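This reduction can be checked numerically: for an orthonormal hard cluster indicator G (so <math>G^TG = I</math>) and the optimal <math>F = XG</math>, the residual equals <math> Tr(X^TX) - Tr(G^TKG) </math>. A sketch with made-up sizes:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 9))            # p = 4, n = 9

# Orthonormal hard indicator: 3 clusters of 3 points each, G^T G = I.
H = np.kron(np.eye(3), np.ones((3, 1)))    # one-hot cluster membership
G = H / np.sqrt(H.sum(axis=0))             # scale columns by 1/sqrt(n_k)
assert np.allclose(G.T @ G, np.eye(3))

K = X.T @ X
F = X @ G                                  # optimal F when G^T G = I
J = np.linalg.norm(X - F @ G.T, 'fro') ** 2
assert np.isclose(J, np.trace(K) - np.trace(G.T @ K @ G))
```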
<br />
==General properties of NMF algorithms==<br />
*They converge to a local minimum, not the global minimum.<br />
*NMF factors are invariant to diagonal rescaling, i.e. this degree of freedom is always present.<br />
*The convergence rate of the multiplicative algorithms is first order.<br />
*There are many ways to initialize NMF; here, the relationship between NMF and relaxed K means clustering is used.<br />
<br />
==Experimental Results==<br />
<br />
The authors present experimental results on a synthetic data set to show that the factors given by Convex NMF resemble cluster centroids more closely than those given by Semi NMF, although Semi NMF gives better accuracy than Convex NMF. They also compare NMF, Convex NMF and Semi NMF with K means clustering on real datasets, and conclude that all of these matrix factorizations give better clustering accuracy than K means on all of the datasets they studied.<br />
<br />
=== A. Synthetic dataset ===<br />
One of the main goals here is to show that the Convex-NMF variants may provide subspace factorizations with more interpretable factors than those obtained by other NMF variants (or PCA). In particular, we expect that in some cases the factor F will be interpretable as containing cluster representatives (centroids) and G as encoding cluster indicators. <br />
<center>[[File:Convex-Fig1.JPG]]</center><br />
In Figure 1, we randomly generate four two-dimensional datasets with three clusters each. Computing both the Semi-NMF and Convex-NMF factorizations, we display the resulting F factors. We see that the Semi-NMF factors tend to lie distant from the cluster centroids. On the other hand, the Convex-NMF factors almost always lie within the clusters.<br />
<br />
=== B. Real life datasets ===<br />
The data sets used are: Ionosphere and Wave from the UCI repository; the document datasets URCS, WebKB4, Reuters (using a subset of the data collection which includes the 10 most frequent categories) and WebAce; and a dataset which contains 1367 log messages collected from several different machines with different operating systems at the School of Computer Science at Florida International University. The log messages are grouped into 9 categories: configuration, connection, create, dependency, other, report, request, start, and stop. Stop words were removed using a standard stop list, and the top 1000 words were selected based on frequencies.<br />
<br />
<center>[[File:Convex-Table1.JPG]]</center><br />
<br />
The results are shown in Table I. We derived these results by averaging over 10 runs for each dataset and algorithm. Clustering accuracy was computed using the known class labels in the following way: The confusion matrix is first computed. The columns and rows are then reordered so as to maximize the sum of the diagonal. This sum is taken as a measure of the accuracy: it represents the percentage of data points correctly clustered under the optimized permutation. To measure the sparsity of G in the experiments, the average of each column of G was computed and all elements below 0.001 times the average were set to zero. We report the number of the remaining nonzero elements as a percentage of the total number of elements. (Thus small values of this measure correspond to large sparsity). We can observe that: <br />
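The accuracy and sparsity measures just described can be sketched as follows; the exhaustive search over label permutations (feasible for small K) stands in for the paper's row/column reordering, and the function names are my own.<br />

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(true_labels, pred_labels, K):
    """Accuracy as described in the text: form the K x K confusion
    matrix, then take the label permutation maximizing its trace."""
    C = np.zeros((K, K), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        C[t, p] += 1
    best = max(sum(C[i, perm[i]] for i in range(K))
               for perm in permutations(range(K)))
    return best / len(true_labels)

def sparsity_measure(G, tol=0.001):
    """Fraction of entries of G kept after zeroing those below
    tol * (column average); small values mean a sparser G."""
    kept = G >= tol * G.mean(axis=0, keepdims=True)
    return kept.sum() / G.size

# Predicted labels 0/1 are swapped relative to the truth, yet the
# optimal permutation recovers perfect accuracy.
acc = clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0], K=2)   # 1.0
```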
<br />
1. Our principal empirical result is that all of the matrix factorization models outperform K-means on all of the datasets, indicating that the NMF family is competitive with K-means for the purposes of clustering. <br />
<br />
2. On most of the nonnegative datasets, NMF gives somewhat better accuracy than Semi-NMF and Convex-NMF (WebKB4 being the exception). The differences are modest, however, suggesting that the more highly-constrained Semi-NMF and Convex-NMF may be worthwhile options if interpretability is viewed as a goal of the data analysis. <br />
<br />
3. On the datasets containing both positive and negative values (where NMF is not applicable) the Semi-NMF results are better in terms<br />
of accuracy than the Convex-NMF results. <br />
<br />
4. In general, Convex-NMF solutions are sparse, while Semi-NMF solutions are not. <br />
<br />
5. Convex-NMF solutions are generally significantly more orthogonal than Semi-NMF solutions.<br />
<br />
<br />
=== C. Shifting mixed-sign data to nonnegative ===<br />
<br />
In this section, the mixed-sign data were made nonnegative by adding the smallest constant that makes all entries nonnegative, and experiments were performed on the Wave and Ionosphere data shifted in this way. For Wave, the accuracy decreases from 0.590 to 0.503 for Semi-NMF and from 0.5738 to 0.5297 for Convex-NMF; the sparsity increases from 0.498 to 0.586 for Convex-NMF. For Ionosphere, the accuracy decreases from 0.729 to 0.647 for Semi-NMF and from 0.6877 to 0.618 for Convex-NMF; the sparsity increases from 0.498 to 0.829 for Convex-NMF. <br />
<br />
<center>[[File:Convex-Fig2.JPG]]</center><br />
<br />
In short, the shifting approach does not appear to provide a satisfactory alternative.<br />
<br />
=== D. Flexibility of NMF ===<br />
In general, NMF almost always performs better than K-means in terms of clustering accuracy while also providing a matrix approximation. This could be due to the flexibility of matrix factorization compared to the rigid spherical clusters that the K-means objective function attempts to capture. When the data distribution is far from spherical, NMF may have advantages. Figure 2 gives an example. The dataset consists of two parallel rods in 3D space containing 200 data points. The central axes of the rods are 0.3 apart, and the rods have diameter 0.1 and length 1. As seen in the figure, K-means gives a poor clustering, while NMF yields a good clustering. The bottom panel of Figure 2 shows the differences between the columns of G (each column is normalized so that <math>\sum_i g_k(i) = 1</math>). The mis-clustered points have small differences. Note that NMF is initialized randomly for the different runs. The stability of the solution over multiple runs was investigated; the results indicate that NMF converges to solutions F and G that are very similar across runs, and the resulting discretized cluster indicators were identical.<br />
<br />
==Conclusion==<br />
In this paper: <br />
*A number of new NMF-like algorithms have been proposed, which extend the applications of NMF.<br />
*They deal with mixed sign data.<br />
*The connection between NMF (its variants) and K means clustering was analyzed.<br />
*The matrix factors are shown to have convenient interpretation in terms of clustering.<br />
<br />
==References==<br />
<references/></div>
<hr />
<div>
<br />
<center>[[File:Convex-Fig2.JPG]]</center><br />
<br />
In short, the shifting approach does not appear to provide a satisfactory alternative.<br />
<br />
=== D. Flexibility of NMF ===<br />
In general NMF almost always performs better than K-means in terms of clustering accuracy while providing a matrix approximation. This could be due to the flexibility of matrix factorization as compared to the rigid spherical clusters that the K-means clustering objective function attempts to capture. When the data distribution is far from a spherical clustering, NMF may have advantages. Figure 2 gives an example. The dataset consists of two parallel rods in 3D space containing 200 data points. The two central axes of the rods are 0.3 apart and they have diameter 0.1 and length 1. As seen in the figure, K-means gives a poor clustering, while NMF yields a good clustering. The bottom panel of Figure 2 shows the differences in the columns of G (each column is normalized to Pi gk(i) = 1). The mis-clustered points have small differences. Note that NMF is initialized randomly for the different runs. The stability of the solution over multiple runs was investigated; The results indicate that NMF converges to solutions F and G that are very similar across runs; moreover, the resulting discretized cluster indicators were identical.<br />
<br />
==Conclusion==<br />
In this paper: <br />
*Number of new NMF algorithms has been proposed which tend to extend the applications of the NMF.<br />
*They deal with mixed sign data.<br />
*The connection between NMF (its variants) and K means clustering was analyzed.<br />
*The matrix factors are shown to have convenient interpretation in terms of clustering.<br />
<br />
==References==<br />
<references/></div>Myakhavehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=convex_and_Semi_Nonnegative_Matrix_Factorization&diff=3901convex and Semi Nonnegative Matrix Factorization2009-08-14T22:27:09Z<p>Myakhave: /* Convex NMF */</p>
<hr />
<div>In the paper ‘Convex and semi non negative matrix factorization’, Jordan et al <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization”. </ref> have proposed new NMF like algorithms on mixed sign data, called Semi NMF and Convex NMF. They also show that a kernel form of NMF can be obtained by ‘kernelizing’ convex NMF. They explore the connection between NMF algorithms and K means clustering to show that these NMF algorithms can be used for clustering in addition to matrix approximation. These new variants of algorithm thereby, broaden the application areas of NMF algorithm and also provide better interpretability to matrix factors.<br />
<br />
==Introduction==<br />
Nonnegative matrix factorization (NMF), factorizes a matrix X into two matrices F and G, with the constraints that all the three matrices are non negative i.e. they contain only positive values or zero but no negative values, such as:<br />
<math>X_+ \approx F_+{G_+}^T</math><br />
where ,<math> X \in {\mathbb R}^{p \times n}</math> , <math> F \in {\mathbb R}^{p \times k}</math> , <math> G \in {\mathbb R}^{n \times k}</math><br />
<br />
The least square objective function of NMF is:<br />
<math> \mathbf {E(F,G) = \|X-FG^T\|^2}</math><br />
<br />
It has been shown that it is a NP hard problem and is convex in only F or only G but not convex in both F and G simultaneously <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref> Also, the factors F and G are not always sparse and many different sparsification schemes have been applied to NMF.<br />
<br />
==Semi NMF==<br />
In semi NMF, the matrix G is constrained to be nonnegative whereas the data matrix X and the basis vectors of F are unconstrained, that is:<br />
<br />
<math>X_{\pm} \approx F_{\pm}{G_+}^T</math><br />
<br />
The motivation for this factorization comes from K means clustering. The objective function of K means can be written in the form of a matrix approximation as follows:<br />
<br />
<math> J_{K-means} = \sum_{i=1}^n \sum_{k=1}^K g_{ik}||x_i-f_k||^2=||X-FG^T||^2 </math> <br />
<br />
where, X is a mixed sign data matrix , F represents cluster centroids having both positive and negative entries and G represents cluster indicators having nonnegative entries.<br />
<br />
The K means clustering objective function can thus be viewed as a Semi NMF matrix approximation with a relaxed constraint on G: instead of binary cluster indicators, the entries of G are allowed to range over (0, 1) or over (0, infinity).<br />
<br />
==Convex NMF==<br />
While in Semi NMF, there is no constraint imposed upon the basis vector F, but in Convex NMF, the columns of F are restricted to be a convex combination of columns of data matrix X, such as:<br />
<br />
<math> F=(f_1, \cdots , f_k)</math><br />
<br />
<math> f_l=w_{1l}x_1+ \cdots + w_{nl}x_n = Xw_l</math>, or in matrix form <math>F = XW</math>, such that<br />
<math> w_{ij} \ge 0</math> <math>\forall i,j </math> <br />
<br />
In this factorization each column of matrix F is a weighted sum of certain data points. This implies that we can think of F as weighted cluster centroids.<br />
<br />
Convex NMF has the form:<br />
<math> X_{\pm} \approx X_{\pm}W_+{G_+}^T</math><br />
<br />
Since each column of F represents a weighted cluster centroid, one would also expect the convexity constraint <math> \sum _{i=1}^n w_{il} = 1 </math> to hold for each column <math>l</math>; the authors, however, do not state this constraint explicitly.<br />
<br />
==SVD, Convex-NMF and Semi-NMF comparison==<br />
Considering G and F as the results of factorizing a matrix by SVD, Convex-NMF, and Semi-NMF, it can be shown that:<br />
*Semi-NMF and Convex-NMF factorizations give clustering results identical to those of K-means clustering.<br />
*Convex-NMF gives sharper indicators of the clustering.<br />
*<math>\,F_{cnvx}</math> is close to <math>\,C_{Kmeans}</math>, whereas <math>\,F_{semi}</math> is not. The intuition behind this is that F can have a large effect on the subspace factorization.<br />
<br />
==Algorithms==<br />
The algorithms for these variants of NMF are based on the iterative updating algorithms proposed for the original NMF, in which the factors are alternately updated until convergence <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>. At each iteration of the algorithm, the value of F or G is updated by multiplying its current value by some factor. In <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>, it is proved that by repeatedly applying these multiplicative update rules, the quality of the approximation smoothly improves; that is, the update rules guarantee convergence to a locally optimal matrix factorization. In this paper, the authors use the same approach to derive algorithms for Semi NMF and Convex NMF.<br />
<br />
===Algorithm for Semi NMF===<br />
<br />
As already stated, the factors for semi NMF are computed by using an iterative updating algorithm that alternatively updates F and G till convergence is reached.<br />
<br />
*'''Step 1''': Initialize G<br />
**Obtain cluster indicators by K means clustering. <br />
*'''Step 2''': Update F, fixing G using the rule:<br />
<math>\mathbf{ F = XG(G^TG)^{-1}} </math><br />
<br />
*'''Step 3''': Update G, fixing F using the rule:<br />
<math> G_{ik} \leftarrow G_{ik} \sqrt{\frac {{(X^TF)^+}_{ik} + [G(F^TF)^-]_{ik}}{{(X^TF)^-}_{ik} + [G(F^TF)^+]_{ik}}}</math><br />
<br />
where, the positive and negative parts of a matrix are separated as:<br />
<math> {A_{ik}}^{+}=(|A_{ik}|+A_{ik})/2 </math> , <math> {A_{ik}}^{-}=(|A_{ik}|- A_{ik})/2 </math><br />
<br />
and, <math> A_{ik}= {A_{ik}}^{+} - {A_{ik}}^{-} </math><br />
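The update steps above can be sketched in code. The sketch below is illustrative rather than the authors' implementation: it initializes G randomly instead of by K means clustering, and the small constant <code>eps</code> is a hypothetical regularizer added for numerical stability.<br />

```python
import numpy as np

def semi_nmf(X, k, n_iter=100, eps=1e-9, seed=0):
    """Semi-NMF sketch: X (p x n, mixed sign) ~ F G^T with G >= 0."""
    rng = np.random.default_rng(seed)
    G = np.abs(rng.standard_normal((X.shape[1], k)))  # random init; the paper uses K-means
    pos = lambda A: (np.abs(A) + A) / 2               # A^+  (positive part)
    neg = lambda A: (np.abs(A) - A) / 2               # A^-  (negative part)
    for _ in range(n_iter):
        # Step 2: F = X G (G^T G)^{-1}, the least-squares optimum for fixed G
        F = X @ G @ np.linalg.inv(G.T @ G + eps * np.eye(k))
        # Step 3: multiplicative update for G (monotonically decreases ||X - F G^T||^2)
        XtF, FtF = X.T @ F, F.T @ F
        G *= np.sqrt((pos(XtF) + G @ neg(FtF)) / (neg(XtF) + G @ pos(FtF) + eps))
    return F, G
```

By Theorem 1, the residual <math>\|X - FG^T\|^2</math> should be non-increasing across iterations.<br />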
<br />
<br><br />
'''Theorem 1:''' (A) The update rule for F gives the optimal solution to the <math> min_F \|X - FG^T\|^2 </math>, while G is fixed. (B) When F is fixed, the residual <math> \|X - FG^T\|^2 </math> decreases monotonically under the update rule for G.<br />
<br />
'''Proof:'''<br />
<br />
(Not going to prove the entire theorem but discuss the main parts)<br />
<br />
The objective function for semi NMF is:<br />
<math> J=\|X - FG^T\|^2= Tr(X^TX - 2X^TFG^T + GF^TFG^T) </math>.<br />
<br />
(A).The problem is unconstrained and the solution for F is trivial, given by:<br />
<math>dJ/dF = -2XG + 2FG^TG = 0</math><br />
<br>Therefore, <math> F = XG(G^TG)^{-1} </math><br />
<br />
(B). This is a constrained problem with an inequality constraint, so it is solved using Lagrange multipliers; the solution given by the update rule must satisfy the KKT conditions at convergence, which establishes its correctness. In addition, the update rule must cause the solution to converge. In the paper, the correctness and convergence of the update rule are proved as follows:<br />
<br />
<br><br />
<br />
(i)'''Correctness of solution:'''<br />
<br />
The Lagrangian is: <math> L(G) = Tr (-2X^TFG^T + GF^TFG^T - \beta G^T) </math> <br />
<br> where <math> \beta_{ij}</math> are the Lagrange multipliers enforcing the nonnegativity constraint on G.<br />
<br>Therefore, <math> \frac {\part L}{\part G}= -2X^TF + 2GF^TF - \beta = 0 </math> <br />
<br> From the complementary slackness condition, <math> (-2X^TF + 2GF^TF)_{ik}G_{ik} = \beta_{ik}G_{ik} = 0. </math> <br />
<br> The above equation must be satisfied at convergence.<br />
<br> The update rule for G can be reduced to: <br />
<math> (-2X^TF + 2GF^TF)_{ik}{G_{ik}}^2 = 0 </math> at convergence.<br />
<br> Both equations are identical and therefore the update rule satisfies the KKT fixed point condition.<br />
<br><br />
<br />
<br />
(ii)'''Convergence of the solution given by update rule:'''<br />
<br />
The authors used an auxiliary function approach to prove convergence, as done in <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”. </ref>.<br />
<br />
'''Definition of auxiliary function''': A function G(h,h') is called an auxiliary function of F(h) if the conditions <math> G (h,h') \ge F(h) </math> and <math> G (h,h) = F(h) </math> are satisfied. <br />
<br />
The auxiliary function is a useful concept because of the following lemma:<br />
<br><br />
<br />
'''Lemma:''' If G is an auxiliary function, then F is nonincreasing under the update <math>\mathbf{ h^{t+1} = \arg \min_h G(h,h^t)} </math><br />
<br />
[[File:auxiliary.jpeg|left|thumb|800px|Figure 1]]<br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
<br><br />
Adapted from <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”.</ref>.<br />
<br />
<br> That is, minimizing the auxiliary function <math> G(h,h^t) \ge F(h) </math> guarantees that <math> F(h^{t+1}) \le F(h^t) </math> for <math> \mathbf {h^{t+1} = \arg \min_h G(h, h^t) }</math> <ref name='S. S Lee'> Lee S. S and Seung S. H; “Algorithms for Non-negative Matrix Factorization”.</ref>.<br />
<br />
Therefore, the authors of the paper found an auxiliary function and its global minimum for the cost function of Semi NMF.<br />
<br />
The cost function for Semi NMF can be written as: <br />
<math> \mathbf {J(H) = Tr (-2H^TB^{+} + 2H^TB^{-} + HA^{+}H^T + HA^{-}H^T)} </math> where <math> A = F^TF , B = X^TF , H = G </math>. <br />
<br />
The auxiliary function of J (H) is: <br><br />
<math> Z(H,H') = -\sum_{ik}2{B_{ik}}^{+}H'_{ik}(1+ \log \frac {H_{ik}}{H'_{ik}}) + \sum_{ik} {B^-}_{ik} \frac {{H^2}_{ik}+{{H'}^2}_{ik}}{{H'}_{ik}} + \sum_{ik} \frac {(H'A^{+})_{ik}{H^2}_{ik}}{{H'}_{ik}} - \sum_{ik} {A_{kl}}^{-}{H'}_{ik}{H'}_{il} (1+ \log \frac {H_{ik}H_{il}}{H'_{ik}H'_{il}}) </math> <br />
<br />
Z (H, H') is convex in H and its global minimum is:<br><br />
<math> H_{ik} = arg \min_H Z(H,H') = H'_{ik}\sqrt {\frac {{B_{ik}}^{+} + (H'A^{-})_{ik}}{{B_{ik}}^{-} + (H'A^{+})_{ik}}} </math><br />
<br />
(The derivation of auxiliary function and its minimum can be found in the paper <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization” </ref>.)<br />
<br />
===Algorithm for Convex NMF===<br />
Here, again, the factors G and W are computed iteratively by alternating updates until convergence.<br />
*'''Step 1''': Initialize G and W. There are two ways in which the initialization can be done.<br />
**'''K means clustering''': When K means clustering is done on the data set, cluster indicators <math> H = (h_1, \cdots , h_K) </math>are obtained. Then G is initialized to be equal to H. Then cluster centroids can be computed from H, as <math>\mathbf {f_k = Xh_k / n_k} </math> or <math> F=XH{D_n}^{-1}</math> where <math> D_n = diag (n_1, \cdots, n_K) </math>. And as, in convex NMF: <math>F = XW </math> , we get <math> W=H{D_n}^{-1}</math> <br />
**'''Previous NMF or Semi NMF solution''': The factor G is known in this case and a least square solution for W is obtained by solving <math> X=XWG^T</math>. Therefore, <math> W=G(G^TG)^{-1} </math><br />
<br />
*'''Step 2''': Update G, while fixing W using the rule<br />
<math> G_{ik} \leftarrow G_{ik} \sqrt{\frac {[(X^TX)^+W]_{ik} + [GW^T(X^TX)^-W]_{ik}} {[(X^TX)^-W]_{ik} + [GW^T(X^TX)^+W]_{ik}} } </math><br />
*'''Step 3''': Update W, while fixing G using the rule<br />
<math> W_{ik} \leftarrow W_{ik} \sqrt{\frac {[(X^TX)^+G]_{ik} + [(X^TX)^-WG^TG]_{ik}} {[(X^TX)^-G]_{ik} + [(X^TX)^+WG^TG]_{ik}} } </math><br />
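A minimal sketch of these alternating updates; it uses a random nonnegative initialization instead of the K means initialization of Step 1, and a hypothetical small constant <code>eps</code> for numerical stability:<br />

```python
import numpy as np

def convex_nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Convex-NMF sketch: X ~ X W G^T with W, G >= 0; uses X only through K = X^T X."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    G = np.abs(rng.standard_normal((n, k)))                   # random init; the paper uses K-means
    W = np.abs(G @ np.linalg.inv(G.T @ G + eps * np.eye(k)))  # clipped least-squares init for W
    K = X.T @ X
    Kp, Kn = (np.abs(K) + K) / 2, (np.abs(K) - K) / 2         # (X^T X)^+ and (X^T X)^-
    for _ in range(n_iter):
        # Step 2: update G with W fixed
        G *= np.sqrt((Kp @ W + G @ (W.T @ Kn @ W)) /
                     (Kn @ W + G @ (W.T @ Kp @ W) + eps))
        # Step 3: update W with G fixed
        GtG = G.T @ G
        W *= np.sqrt((Kp @ G + Kn @ W @ GtG) /
                     (Kn @ G + Kp @ W @ GtG + eps))
    return W, G
```

By Theorems 2 and 3 below, the residual <math>\|X - XWG^T\|^2</math> should be non-increasing across iterations.<br />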
<br />
The objective function to be minimized for convex NMF is:<br />
<br />
<math> \mathbf {J=\|X-XWG^T\|^2= Tr(X^TX- 2G^TX^TXW + W^TX^TXWG^TG)} </math>.<br />
<br />
'''Theorem 2:''' Fixing W, under the update rule for G, (A) the residual <math>\|X - XWG^T \|^2 </math> decreases monotonically (non-increasing), and (B) the solution converges to a KKT fixed point.<br />
<br />
The correctness and convergence of these rules is demonstrated in a manner similar to Semi NMF by replacing F=XW.<br />
<br />
'''Theorem 3:''' Fixing G, under the update rule for W, (A) the residual <math>\|X - XWG^T \|^2 </math> decreases monotonically (non-increasing), and (B) the solution converges to a KKT fixed point.<br />
<br />
The correctness is demonstrated by minimizing the objective function with respect to W and then obtaining KKT fixed point condition as:<br />
<br />
<math> \mathbf {(-X^TXG + X^TXWG^TG)_{ik}W_{ik} = 0 }</math><br />
<br />
<br> At convergence, the update rule for W can be shown to satisfy:<br />
<br />
<math>\mathbf { (-X^TXG + X^TXWG^TG)_{ik}{W_{ik}}^2 = 0 }</math><br />
<br />
<br> Therefore, the update rule for W satisfies KKT condition.<br><br />
<br />
Convergence of these rules is demonstrated in a manner similar to Semi NMF by finding an auxiliary function and its global minimum.<br />
<br />
==Sparsity of Convex NMF==<br />
<br />
NMF has been shown to learn a parts-based representation and therefore tends to have sparse factors. However, there is no means to control the degree of sparseness, and many sparsification methods have been applied to NMF in order to obtain a better parts-based representation <ref name='Ding C'> Ding C, Li. T, and Jordan I. M; “Convex and semi nonnegative matrix factorization” </ref> , <ref name='Simon D. H' > Simon D. H, Ding C, and He X; “On the Equivalence of Nonnegative Matrix Factorization and Spectral Clustering” </ref>. In contrast, the authors of this paper show that the factors of Convex NMF are naturally sparse.<br />
<br />
<br> The convex NMF problem can be written as:<br />
<br />
<math> \min_{W,G \ge 0}||X-XWG^T||^2 = ||X(I-WG^T)||^2= Tr (I-GW^T)X^TX(I-WG^T) </math><br />
<br />
<br> by SVD of <math> X </math> we have <math> X = U \Sigma V^T</math> and thus, <math> X^TX = \sum_k {\sigma _k}^2v_k{v_k}^T.</math><br />
<br />
<br> Therefore, <math> \min_{W,G \ge 0} Tr (I-GW^T)X^TX(I-WG^T) = \sum_k {\sigma_k}^2||{v_k}^T(I-WG^T)||^2 </math> s.t. <math>W \in {\mathbb R_+}^{n \times k} </math> , <math>G \in {\mathbb R_+}^{n \times k}</math><br />
<br />
They use the following Lemma to show that the above optimization problem gives sparse W and G.<br />
<br />
<br>'''Lemma:''' The solution of the optimization problem <math> \min_{W,G \ge 0}||I-WG^T||^2 </math> s.t. <math>W, G \in {\mathbb R_+}^{n \times K}</math> is given by W = G = any K distinct columns of <math>(e_1, \cdots ,e_n)</math>, where <math>e_k</math> is a basis vector: <math> (e_k)_{i \ne k} = 0 </math> , <math> (e_k)_{i = k} = 1 </math><br />
<br />
<br> According to this Lemma, the solution to <math> \min_{W,G \ge 0}\|I - WG^T\|^2 </math> are the sparsest possible rank-K matrices W and G.<br />
<br />
In the above equation, we can write: <math> \| I - WG^T \|^2 = \sum_k \|{e_k}^T (I - WG^T)\|^2 </math>.<br />
<br />
Therefore, the projection of <math> ( I - WG^T ) </math> onto the principal components carries more weight, while its projection onto the non-principal components carries less weight. This implies that the factors W and G are sparse in the principal component subspace and less sparse in the non-principal component subspace.<br />
<br />
==Kernel NMF==<br />
Consider a mapping <math> \phi </math> that maps a point to a higher dimensional feature space, such that <math> \phi: x_i \rightarrow \phi(x_i)</math>. The factors for the kernel form of NMF or semi NMF : <math> \phi (X) = FG^T </math> would be difficult to compute as we need to know the mapping <math>\phi </math> explicitly.<br />
<br />
This difficulty is overcome in convex NMF, as it has the form <math> \phi (X) \approx \phi (X) WG^T </math>, and therefore the objective to be minimized becomes<br />
<br> <math> \|\phi (X)-\phi(X)WG^T\|^2 = Tr (K-2G^TKW+W^TKWG^TG) </math> where <math> K = \phi^T(X)\phi(X) </math> is the kernel.<br />
<br />
Also, the update rules for the convex NMF algorithm (discussed above) depend only on <math> X^TX </math> and therefore convex NMF can be '''kernelized'''.<br />
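Since the updates access the data only through the Gram matrix, substituting any kernel matrix K for <math>X^TX</math> kernelizes the algorithm. A sketch under illustrative assumptions (an RBF kernel and a random nonnegative initialization, neither prescribed by the paper):<br />

```python
import numpy as np

def kernel_convex_nmf(K, k, n_iter=200, eps=1e-9, seed=0):
    """Convex-NMF updates driven by an arbitrary kernel Gram matrix K (n x n)."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    W = np.abs(rng.standard_normal((n, k)))           # random nonnegative init (illustrative)
    G = np.abs(rng.standard_normal((n, k)))
    Kp, Kn = (np.abs(K) + K) / 2, (np.abs(K) - K) / 2  # K^+ and K^-
    for _ in range(n_iter):
        G *= np.sqrt((Kp @ W + G @ (W.T @ Kn @ W)) / (Kn @ W + G @ (W.T @ Kp @ W) + eps))
        GtG = G.T @ G
        W *= np.sqrt((Kp @ G + Kn @ W @ GtG) / (Kn @ G + Kp @ W @ GtG + eps))
    return W, G

def rbf_gram(X, gamma=1.0):
    """RBF kernel Gram matrix for points stored as the columns of X."""
    sq = (X * X).sum(axis=0)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X.T @ X))
```

The kernelized objective <math>Tr (K-2G^TKW+W^TKWG^TG)</math> should be non-increasing under these updates.<br />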
<br />
==Cluster NMF==<br />
<br />
If the factor G is considered to contain posterior cluster probabilities, then F, which represents the cluster centroids, is given as:<br />
<br> <math> \mathbf {f_k = Xg_k / n_k} </math> or <math> F = XG{D_n}^{-1}</math> where <math> D_n = diag (n_1, \cdots, n_K) </math>.<br />
<br>Therefore, the factorization becomes <math> X \approx XG{D_n}^{-1}G^T </math>, or, since NMF is invariant to diagonal rescaling, <math> X \approx X G G^T </math>.<br />
<br />
This factorization is called Cluster NMF as it has the same degree of freedom as in any standard clustering problem, which is G (cluster indicator).<br />
<br />
==Relationship between NMF (its variants) and K means clustering==<br />
<br />
NMF and all of its variants discussed above can be interpreted as K means clustering by imposing the additional constraint <math> G^TG=I </math>; that is, each row of G has only one nonzero element, which implies that each data point can belong to only one cluster.<br />
<br />
'''Theorem:''' G-orthogonal NMF, Semi NMF, Convex NMF, Cluster NMF and Kernel NMF are all relaxations of K means clustering.<br />
<br />
'''Proof:'''<br />
<br />
In all the above five cases of NMF, it can be shown that the objective function can be reduced to:<br />
<math> \mathbf {J = Tr(X^TX -G^TKG)} </math> when <math> G^TG = I </math> and where <math> K = X^TX </math> or <math> K = \phi^T(X)\phi(X) </math>. As the first term is a constant, the minimization problem actually becomes: <br><br />
<math> \max_{G^TG = I} Tr(G^TKG) </math><br />
<br />
The above objective function is the same as the objective function for kernel K means clustering <ref name='Simon D. H'> Simon D. H, Ding C, and He X; “On the Equivalence of Nonnegative Matrix Factorization and Spectral Clustering” </ref>.<br />
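This reduction can be checked numerically: encoding a hard assignment in G with entries <math>1/\sqrt{n_k}</math> for the members of cluster k (so that <math>G^TG = I</math>), the K means objective equals <math>Tr(X^TX) - Tr(G^TKG)</math>. A small sketch with a made-up assignment:<br />

```python
import numpy as np

# Hard-assignment check of J_kmeans = Tr(X^T X) - Tr(G^T K G) with K = X^T X.
# The data and cluster assignment below are made up for illustration.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 12))              # p = 2 features, n = 12 points (columns)
labels = np.repeat([0, 1, 2], 4)              # hypothetical hard assignment into 3 clusters
n_k = np.bincount(labels)
G = np.zeros((12, 3))
G[np.arange(12), labels] = 1.0 / np.sqrt(n_k[labels])   # normalized indicator: G^T G = I

centroids = np.stack([X[:, labels == c].mean(axis=1) for c in range(3)], axis=1)
j_kmeans = sum(np.linalg.norm(X[:, i] - centroids[:, labels[i]])**2 for i in range(12))

K = X.T @ X
j_trace = np.trace(K) - np.trace(G.T @ K @ G)
```

Here <code>j_kmeans</code> and <code>j_trace</code> agree, illustrating why maximizing <math>Tr(G^TKG)</math> over orthonormal nonnegative G is a relaxation of K means.<br />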
<br />
<br> Even without the orthogonality constraint, these NMF algorithms can be considered to be '''soft''' versions of K means clustering. That is each data point can be considered to fractionally belong to more than one cluster.<br />
<br />
==General properties of NMF algorithms==<br />
*They converge to a local minimum, not necessarily the global minimum.<br />
*NMF factors are invariant to rescaling, i.e. a degree of freedom corresponding to diagonal rescaling is always present.<br />
*The convergence rate of the multiplicative algorithms is first order.<br />
*There are many different ways to initialize NMF; here, the relationship between NMF and relaxed K means clustering is used.<br />
<br />
==Experimental Results==<br />
<br />
The authors have presented experimental results on synthetic data set to show that factors given by Convex NMF more closely resemble cluster centroids than those given by Semi NMF. However, semi NMF results are better in terms of accuracy than convex NMF. They have even compared the results of NMF, convex NMF and semi NMF with K means clustering on real dataset. They conclude that all of these matrix factorizations give better results than K means on all of the datasets they studied in terms of clustering accuracy.<br />
<br />
=== A. Synthetic dataset ===<br />
One of the main goals here is to show that the Convex-NMF variants may provide subspace factorizations with more interpretable factors than those obtained by other NMF variants (or PCA). In particular, we expect that in some cases the factor F will be interpretable as containing cluster representatives (centroids) and G will be interpretable as encoding cluster indicators. <br />
<center>[[File:Convex-Fig1.JPG]]</center><br />
In Figure 1, we randomly generate four two-dimensional datasets with three clusters each. Computing both the Semi-NMF and Convex-NMF factorizations, we display the resulting F factors. We see that the Semi-NMF factors tend to lie distant from the cluster centroids. On the other hand, the Convex-NMF factors almost always lie within the clusters.<br />
<br />
=== B. Real life datasets ===<br />
The data sets used are: Ionosphere and Wave from the UCI repository; the document datasets URCS, WebKB4, Reuters (using a subset of the data collection which includes the 10 most frequent categories), and WebAce; and a dataset which contains 1367 log messages collected from several different machines with different operating systems at the School of Computer Science at Florida International University. The log messages are grouped into 9 categories: configuration, connection, create, dependency, other, report, request, start, and stop. Stop words were removed using a standard stop list. The top 1000 words were selected based on frequencies.<br />
<br />
<center>[[File:Convex-Table1.JPG]]</center><br />
<br />
The results are shown in Table I. We derived these results by averaging over 10 runs for each dataset and algorithm. Clustering accuracy was computed using the known class labels in the following way: The confusion matrix is first computed. The columns and rows are then reordered so as to maximize the sum of the diagonal. This sum is taken as a measure of the accuracy: it represents the percentage of data points correctly clustered under the optimized permutation. To measure the sparsity of G in the experiments, the average of each column of G was computed and all elements below 0.001 times the average were set to zero. We report the number of the remaining nonzero elements as a percentage of the total number of elements. (Thus small values of this measure correspond to large sparsity). We can observe that: <br />
<br />
1. Our principal empirical result indicates that all of the matrix factorization models outperform K-means on all of the datasets, suggesting that the NMF family is competitive with K-means for the purposes of clustering. <br />
<br />
2. On most of the nonnegative datasets, NMF gives somewhat better accuracy than Semi-NMF and Convex-NMF (with WebKB4 the exception). The differences are modest, however, suggesting that the more highly-constrained Semi-NMF and Convex-NMF may be worthwhile options if interpretability is viewed as a goal of the data analysis. <br />
<br />
3. On the datasets containing both positive and negative values (where NMF is not applicable) the Semi-NMF results are better in terms<br />
of accuracy than the Convex-NMF results. <br />
<br />
4. In general, Convex-NMF solutions are sparse, while Semi-NMF solutions are not. <br />
<br />
5. Convex-NMF solutions are generally significantly more orthogonal than Semi-NMF solutions.<br />
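The accuracy measure described above (reordering the confusion matrix to maximize its diagonal) can be sketched as follows; the brute-force search over permutations is an illustrative choice that is adequate for a small number of clusters:<br />

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(true_labels, pred_labels):
    """Fraction of points correctly clustered under the best matching of cluster labels."""
    k = int(max(true_labels.max(), pred_labels.max())) + 1
    C = np.zeros((k, k), dtype=int)            # confusion matrix
    for t, p in zip(true_labels, pred_labels):
        C[t, p] += 1
    # Reorder columns (try every permutation) to maximize the diagonal sum
    best = max(sum(C[i, perm[i]] for i in range(k)) for perm in permutations(range(k)))
    return best / len(true_labels)
```

For example, a prediction that merely swaps the two cluster labels still scores an accuracy of 1.<br />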
<br />
<br />
=== C. Shifting mixed-sign data to nonnegative ===<br />
<br />
In this section the mixed-sign data were shifted to be nonnegative by adding the smallest constant that makes all entries nonnegative, and experiments were performed on the Wave and Ionosphere data shifted in this way. For Wave, the accuracy decreases to 0.503 from 0.590 for Semi-NMF and decreases to 0.5297 from 0.5738 for Convex-NMF. The sparsity increases to 0.586 from 0.498 for Convex-NMF. For Ionosphere, the accuracy decreases to 0.647 from 0.729 for Semi-NMF and decreases to 0.618 from 0.6877 for Convex-NMF. The sparsity increases to 0.829 from 0.498 for Convex-NMF. <br />
<br />
<center>[[File:Convex-Fig2.JPG]]</center><br />
<br />
In short, the shifting approach does not appear to provide a satisfactory alternative.<br />
<br />
=== D. Flexibility of NMF ===<br />
In general, NMF almost always performs better than K-means in terms of clustering accuracy while also providing a matrix approximation. This could be due to the flexibility of matrix factorization as compared to the rigid spherical clusters that the K-means objective function attempts to capture. When the data distribution is far from spherical, NMF may have advantages. Figure 2 gives an example. The dataset consists of two parallel rods in 3D space containing 200 data points. The central axes of the rods are 0.3 apart, and the rods have diameter 0.1 and length 1. As seen in the figure, K-means gives a poor clustering, while NMF yields a good clustering. The bottom panel of Figure 2 shows the differences in the columns of G (each column is normalized so that <math>\sum_i g_k(i) = 1</math>). The mis-clustered points have small differences. Note that NMF is initialized randomly for the different runs. The stability of the solution over multiple runs was investigated; the results indicate that NMF converges to solutions F and G that are very similar across runs; moreover, the resulting discretized cluster indicators were identical.<br />
<br />
==Conclusion==<br />
In this paper: <br />
*A number of new NMF algorithms have been proposed that extend the applications of NMF.<br />
*They deal with mixed sign data.<br />
*The connection between NMF (its variants) and K means clustering was analyzed.<br />
*The matrix factors are shown to have convenient interpretation in terms of clustering.<br />
<br />
==References==<br />
<references/></div>Myakhavehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=maximum-Margin_Matrix_Factorization&diff=3900maximum-Margin Matrix Factorization2009-08-14T01:09:46Z<p>Myakhave: /* A short discussion on Loss Function for classification */</p>
<hr />
<div>== Problem Definition ==<br />
<br />
Assume Y is an <math>n \times m</math> matrix containing n user preferences about m movies, such that <math>y_{ij} = +1</math> if user i likes movie j, and <math>y_{ij} = -1</math> if he/she dislikes it. Due to the lack of knowledge about the users' opinions, Y is partially observable: it has some <math>\pm1</math> entries while the other cells are unknown. The main goal is to find a matrix X that preserves the knowledge in Y and predicts the value of its unknown cells.<br />
<br />
Predicting the unknown values in this problem is possible because the rows and columns of Y are assumed to be related to each other. One can quantify this ''relation'' by the rank of X; in other words, the rank of X indicates the number of ''features'' affecting the values in Y. Therefore, minimizing the number of features, i.e. the rank of X, is equivalent to finding a ''simple relation'' in the given knowledge of Y. In addition, keeping the number of features low is one way to avoid over-fitting in the prediction process. If the rank of X were allowed to equal the rank of Y, we would have the trivial solution X = Y.<br />
<br />
== Hinge Loss Function ==<br />
<br />
If Y is fully observed and there are no unknowns, our goal is to find X as a simple (low-rank) representation of Y. By choosing the sum-squared error loss function, one can easily use the SVD technique to minimize the loss and find X. Although this solution is suitable for lowering the rank, it does not work when there are unknowns in Y (due to the problem of multiple local minima). In addition, one may consider other loss functions instead of SSE (although SSE is ''convex'' and behaves nicely in optimization problems). For example, in this problem '''Hinge loss''' works well, especially considering the fact that the values in Y are restricted to <math>\pm1</math>. Here, hinge loss is defined as: <br />
<br />
<math>\textrm{Hinge}(X | Y) = \displaystyle \sum_{(i,j) \in S} \max (0, 1 - x_{ij} \cdot y_{ij})</math><br />
(where <math>\, S = \{(i, j) | y_{ij}</math> is known<math>\}</math>)<br />
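As a concrete illustration with made-up entries (the matrices below are hypothetical, not taken from the paper), the hinge loss over the observed cells can be computed as:<br />

```python
import numpy as np

def hinge_loss(X, Y, observed):
    """Sum of max(0, 1 - x_ij * y_ij) over the observed entries of Y."""
    return float(np.maximum(0.0, 1.0 - X * Y)[observed].sum())

# Hypothetical 2 x 2 example: the (1, 1) entry of Y is unknown.
X = np.array([[0.5, -2.0], [3.0, 0.2]])       # predictions
Y = np.array([[1, -1], [1, -1]])              # observed preferences
observed = np.array([[True, True], [True, False]])
```

Only the entry with margin <math>x_{ij} \cdot y_{ij} < 1</math> contributes; confidently correct predictions cost nothing.<br />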
<br />
== Prediction Using Matrix Factorization ==<br />
<br />
Another way to tackle this problem is to use '''Matrix Factorization'''. In matrix factorization one tries to find a prediction <math>X = UV^T</math> such that the rank of each factor (U and V) is low. In addition, U and V are meaningful in this problem: the i<sub>th</sub> row of U indicates the importance (weights) of the features for the i<sub>th</sub> user, and similarly the j<sub>th</sub> row of V characterizes the j<sub>th</sub> movie in terms of its features. Here, lowering the rank of X amounts to lowering the rank of its factors, but unfortunately the rank optimization problem is not ''convex''.<br />
<br />
=== A short discussion on Loss Function for classification===<br />
<br />
Collaborative prediction, as described in the problem definition, involves classifying each entry of a matrix into either 1 or -1. Consider the notation in the last paragraph and suppose that we are to predict the j<sub>th</sub> column of the matrix X. Suppose further that the matrix U is fixed; the classification problem then boils down to finding the j<sub>th</sub> row of V which gives the ''optimal prediction''. Denoting the j<sub>th</sub> row of V by the vector <math>v</math>, the matrix entry <math>X_{ij}</math> by the real number <math>x</math>, and the i<sub>th</sub> row of U by the vector <math>u</math>, the classification problem can be rephrased as finding the weight vector <math>v</math> and predicting the value of <math>x</math> (which is either 1 or -1) by the dot-product function <math>f(v)=vu^T</math>. Usually, we specify a threshold value (for example, 0) and classify <math>x</math> as follows: if <math>f(v)</math> is greater than the threshold, classify <math>x</math> as 1; otherwise classify <math>x</math> as -1.<br />
<br />
There are many ways to measure how well the above classification scheme performs (in the training stage). The most natural is to count the incorrect classifications: cases where the true value is 1 but the scheme gives -1, or vice versa. However, this natural performance measure gives rise to an intractable optimization problem. To make the optimization more tractable, several popular ''proxy loss functions'' are used in its place, including the log loss, the squared loss and the hinge loss.<br />
<br />
A comparison of these three proxy loss functions is available at http://hunch.net/?p=547. That article advises against using the log loss, since its optimization can be unstable.<br />
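A small numerical sketch of how these proxies relate to the zero-one loss, written as functions of the margin <math>m = f(v) \cdot y</math> (this parameterization, and the base-2 log loss, are illustrative conventions, not fixed by the text above):

```python
import numpy as np

# Losses as functions of the margin m = f(v) * y; a margin >= 1
# corresponds to a confidently correct prediction.
def zero_one(m): return (m <= 0).astype(float)
def hinge(m):    return np.maximum(0.0, 1.0 - m)
def squared(m):  return (1.0 - m) ** 2
def log_loss(m): return np.log2(1.0 + np.exp(-m))

margins = np.array([-2.0, 0.0, 0.5, 1.0, 3.0])
h, z, l = hinge(margins), zero_one(margins), log_loss(margins)
# Hinge upper-bounds the zero-one loss and is exactly 0 once m >= 1;
# log loss never reaches 0, so it keeps penalizing correct predictions.
```

The fact that the hinge loss is flat beyond a margin of 1 is what makes it attractive for the <math>\pm1</math> setting of this problem.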
<br />
== Frobenius Norm ==<br />
<br />
Instead of the rank of each factor, we can use the '''Frobenius Norm''' of each factor to address this problem; this norm is closely related to the rank of the matrix. Moreover, the resulting optimization problem is convex, so common optimization techniques apply. The Frobenius norm is defined as:<br />
<br />
<math>\|X\|_{F}^2 = \sum x_{ij}^2 = Tr(XX^T) = \sum \lambda_i^2</math> <br />
(where <math>Tr()</math> is the ''Trace Function'' and <math>\lambda</math>s are singular values of X)<br />
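The three expressions in this definition can be checked numerically (a numpy sketch on an arbitrary toy matrix):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

fro_sq_entries = (X ** 2).sum()                                  # sum of x_ij^2
fro_sq_trace   = np.trace(X @ X.T)                               # Tr(X X^T)
fro_sq_sv      = (np.linalg.svd(X, compute_uv=False) ** 2).sum() # sum of lambda_i^2
# All three agree: the squared Frobenius norm of this X is 91.
```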
<br />
== Notion of Margin in Collaborative Filtering ==<br />
<br />
Assume one of the factors is fixed and the goal is to predict the other. This is the familiar problem of linear prediction, or SVM. If U is fixed, predicting each column of X amounts to solving an SVM over U to find a row of V. Recall that in SVM, maximizing the margin means minimizing the norm of the linear separator <math>\|\beta\|^2</math>. Therefore, predicting Y with maximum margin is equivalent to minimizing the norm of the factor V, i.e. <math>\|V\|_{F}^2</math>. The problem here, however, is to predict both U and V together (which is called '''collaborative filtering'''); at each step one factor can be assumed fixed, so minimizing the norm of the other factor gives the maximum margin for that prediction.<br />
<br />
== The Optimization Problem and Trace Norm==<br />
<br />
So far it has been shown that the optimization problem is to find the factors with minimum norm such that the prediction has a low loss:<br />
<br />
<math>\displaystyle \min_{X = UV^T} (\|U\|_{F}^2 + \|V\|_{F}^2) + c \cdot \textrm{Hinge}(X|Y)</math><br />
<br />
This optimization problem is difficult because neither the objective function nor the constraint <math>X = UV^T</math> is linear. The following lemma reshapes the problem into one that is easier to solve:<br />
<br />
'''Lemma 1:''' <br /><br />
<math>\displaystyle \min_{X = UV^T} \frac{1}{2}(\|U\|_{F}^2 + \|V\|_{F}^2) = <br />
\displaystyle \min_{X = UV^T} (\|U\|_{F} \cdot \|V\|_{F}) = \|X\|_{T} </math><br />
<br />
where <math>\|X\|_{T}</math> is the '''Trace Norm''' of X and is defined as: <br /><br />
<math> \|X\|_{T} = \sum |\lambda_i| = Tr(\sqrt{XX^T})</math> <br /><br />
In addition, by using SVD of <math>X = A \Sigma B^T</math>, one can see that both <math>U = A \sqrt{\Sigma}</math> and <br />
<math>V = B\sqrt{\Sigma}</math> satisfy this lemma.<br />
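Lemma 1 and its SVD-based witness can be verified numerically (numpy sketch; the random test matrix is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))

A, s, Bt = np.linalg.svd(X, full_matrices=False)  # X = A diag(s) B^T
U = A * np.sqrt(s)       # A sqrt(Sigma): scale the columns of A
V = Bt.T * np.sqrt(s)    # B sqrt(Sigma)

trace_norm = s.sum()                                  # ||X||_T
half_sum = 0.5 * ((U ** 2).sum() + (V ** 2).sum())    # (||U||_F^2 + ||V||_F^2) / 2
product  = np.linalg.norm(U) * np.linalg.norm(V)      # ||U||_F * ||V||_F
# Both quantities coincide with the trace norm, and U V^T reconstructs X.
```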
<br />
Based on Lemma 1, the optimization problem can be reformulated as:<br />
<br />
<math>\displaystyle \min_X \|X\|_{T} + c \cdot \textrm{Hinge}(X|Y)</math><br />
<br />
== Relation Between Rank And Trace Norm ==<br />
<br />
In the literature, the notion of trace norm has been widely used instead of dealing with the rank of matrices. The next theorem explains the relation between the rank and the trace norm:<br />
<br />
'''Theorem 1:'''<br /><br />
The convex envelope (smallest convex bounding function) of the rank function, on matrices with unit spectral norm, is the trace norm function. <br /><br />
The ''spectral norm'' of a matrix is its largest singular value.<br />
<br />
In addition, it can easily be shown that the Frobenius norm, trace norm and rank of a matrix are related as follows:<br />
<br />
<math>\forall X: \|X\|_{F} \leq \|X\|_{T} \leq \sqrt{Rank(X)} \cdot \|X\|_{F}</math><br />
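This chain of inequalities is easy to verify numerically (numpy sketch; the low-rank test matrix is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 8))  # rank <= 3

s = np.linalg.svd(X, compute_uv=False)
fro = np.sqrt((s ** 2).sum())      # ||X||_F
tr  = s.sum()                      # ||X||_T
r   = np.linalg.matrix_rank(X)

# ||X||_F <= ||X||_T <= sqrt(rank(X)) * ||X||_F
holds = bool(fro <= tr <= np.sqrt(r) * fro)
```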
<br />
Based on this relation, and the fact that the trace norm is a convex function, it can be shown:<br />
<br />
<math>\{X \,|\, \|X\|_{T} \leq \alpha\} = conv \{uv^T \,|\, u \in \Re^n , v \in \Re^m , \|u\| = \|v\| = \sqrt{\alpha}\}</math><br />
<br />
which shows that the set of matrices with bounded trace norm is convex, with the lowest-rank matrices lying on its boundary. Thus, this optimization problem amounts to searching a convex set to optimize a convex function.<br />
<br />
== Soft Margin Optimization ==<br />
<br />
Just as in SVM there may be no linear separator satisfying all the constraints (the classes are not linearly separable), there exists some Y for which no factorization preserves all the knowledge in Y. The soft-margin SVM solution applies here as well: one can add slack variables to the loss function:<br />
<br />
<math>\displaystyle \min_X \|X\|_{T} + c \cdot \displaystyle \sum_{(i,j) \in S} \xi_{ij}</math> <br /><br />
'''s.t.''' <math>\forall (i,j) \in S: \xi_{ij} \geq 0</math> ''' , ''' <math>x_{ij} \cdot y_{ij} \geq 1 - \xi_{ij}</math><br />
<br />
== Using Semi-Definite Programming ==<br />
<br />
This optimization problem is now easy to understand, and it is convex with linear constraints. To solve it, it should be reformulated as one of the standard convex optimization problems. The next lemma shows how ''semi-definite programming'' can be used:<br />
<br />
'''Lemma 2:'''<br /><br />
<math>\forall X \in \Re^{n \times m}</math> and <math>t \in \Re : \|X\|_{T} \leq t \Longleftrightarrow \exists A, B</math> '''s.t.''' <br />
<math> M = \left( \begin{array}{cc} A & X \\ X^T & B \\ \end{array} \right) \succeq 0</math> and <math>Tr(M) \leq 2t</math><br />
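One direction of the lemma can be checked numerically: with the SVD <math>X = A \Sigma B^T</math>, taking the diagonal blocks <math>A \Sigma A^T</math> and <math>B \Sigma B^T</math> makes M positive semi-definite with <math>Tr(M) = 2\|X\|_{T}</math> (numpy sketch; this particular choice of blocks is the standard witness suggested by Lemma 1):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 3))

# SVD X = A diag(s) B^T; build candidate diagonal blocks for M.
A, s, Bt = np.linalg.svd(X, full_matrices=False)
top = A @ np.diag(s) @ A.T      # plays the role of block "A" in the lemma
bot = Bt.T @ np.diag(s) @ Bt    # plays the role of block "B"

M = np.block([[top, X], [X.T, bot]])
min_eig = np.linalg.eigvalsh(M).min()   # >= 0 up to round-off: M is PSD
trace_M = np.trace(M)                   # equals 2 * sum(s) = 2 * ||X||_T
```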
<br />
Therefore the last optimization problem can be formulated as a linear, convex SDP optimization problem:<br />
<br />
<math>\displaystyle \min_{M \succeq 0} Tr(M) + c \cdot \displaystyle \sum_{(i,j) \in S} \xi_{ij}</math> <br /><br />
'''s.t.''' <math>M = \left( \begin{array}{cc} A & X \\ X^T & B \\ \end{array} \right)\succeq 0</math> <br />
'''and''' <math>\forall (i,j) \in S: \xi_{ij} \geq 0</math>''' , ''' <math>x_{ij} \cdot y_{ij} \geq 1 - \xi_{ij}</math><br />
<br />
which has a dual form with a simple structure. Also the prediction can be done using a solution to the dual problem directly.<br />
<br />
== Experiments ==<br />
<br />
Preliminary experiments were performed on a subset of the 100K MovieLens dataset, consisting of the 100 users and 100 movies with the most ratings, using CSDP to solve the resulting SDPs. The ratings are on a discrete scale of one through five, and both generalizations of the hinge loss above were tried, allowing per-user thresholds. WLRA and K-medians served as the baseline learners. <br />
The data was split into four sets. For each of these four test sets, the remaining sets were used to compute a 3-fold cross-validation (CV) error for each method (WLRA, K-medians, trace norm and max-norm MMMF with immediate-threshold and all-threshold hinge loss) over a range of parameters (rank for WLRA, number of centers for K-medians, slack cost for MMMF). For each of the four splits, the two MMMF learners with lowest CV ZOE (zero-one error) and MAE (mean absolute error) and the two baseline learners with lowest CV ZOE and MAE were selected, and their error measured on the held-out test data. <br />
[[File:Paer4-Table1.JPG]]<br />
<br />
Table 1 lists these CV and test errors, along with the average test error across all four test sets. On average, and on three of the four test sets, MMMF achieves lower MAE than the baseline learners; on all four test sets, MMMF achieves lower ZOE.<br />
<br />
== Limitation ==<br />
It is unrealistic to assume that the observed entries are sampled uniformly; for example, users tend to rate items they like. In fact, under an uncontrolled sampling distribution, low error is guaranteed only on the items the user chose to rate, not on the unrated items the prediction is meant to cover.</div>
Table 1 lists these CV and test errors, and the average test error across all four test sets. On average and on three of the four test sets, MMMF achieves lower MAE than the Baseline learners; on all four of the test sets, MMMF achieves lower ZOE than the Baseline learners.<br />
<br />
== Limitation ==<br />
It is unrealistic that observed entries are assumed to be uniformly sampled. For example, Users tend to rate items they like. In fact, allowing an uncontrolled sampling distribution would guarantee low error on items the user likes, but not on items he would really like based on our prediction.</div>Myakhavehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=maximum-Margin_Matrix_Factorization&diff=3897maximum-Margin Matrix Factorization2009-08-13T23:23:39Z<p>Myakhave: /* Experiments */</p>
<hr />
<div>== Problem Definition ==<br />
<br />
Assume Y is an <math>n \times m</math> matrix containing n user preferences about m movies, such that <math>y_{ij} = +1</math> if user i likes movie j, and <math>y_{ij} = -1</math> if he/she dislikes it. Due to the lack of knowledge about the users' opinions, Y is only partially observable: some cells contain <math>\pm1</math> while the others are unknown. The main goal is to find a matrix X that preserves the knowledge in Y and predicts the values of its unknown cells.<br />
<br />
Predicting the unknown values in this problem is possible because the rows and columns of Y are assumed to be related to each other. One can quantify this ''relation'' by the rank of X; in other words, the rank of X indicates the number of ''features'' affecting the values in Y. Therefore, minimizing the number of features, i.e. the rank of X, amounts to finding a ''simple relation'' in the given knowledge of Y. In addition, keeping the number of features low is one way to avoid over-fitting in the prediction process. If the rank of X is allowed to equal the rank of Y, we get the trivial solution X = Y.<br />
<br />
== Hinge Loss Function ==<br />
<br />
If Y has been fully observed and there are no unknowns, our goal is to find X as a simple (low-rank) representation of Y. By choosing the sum-squared error (SSE) loss function, one can easily use the SVD technique to minimize the loss and find X. Although this solution is suitable for lowering the rank, it does not work when there are unknowns in Y (the problem then has many local minima). One may also consider loss functions other than SSE (even though SSE is ''convex'' and behaves nicely in optimization problems). In this problem the '''hinge loss''' works well, especially considering that the values in Y are restricted to <math>\pm1</math>. Here the hinge loss is defined as: <br />
<br />
<math>\textrm{Hinge}(X | Y) = \displaystyle \sum_{(i,j) \in S} \max (0, 1 - x_{ij} \cdot y_{ij})</math><br />
(where <math>\, S = \{(i, j) | y_{ij}</math> is known<math>\}</math>)<br />
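As a small illustration (a Python/NumPy sketch of our own; the convention of marking unknown cells of Y with 0 is ours, not the paper's), the hinge loss over the observed set S can be computed as:<br />

```python
import numpy as np

def hinge_loss(X, Y, observed):
    """Hinge(X | Y) = sum over observed (i, j) of max(0, 1 - x_ij * y_ij)."""
    margins = 1.0 - X[observed] * Y[observed]
    return np.maximum(0.0, margins).sum()

# Toy example: Y holds +-1 ratings; 0 marks an unknown cell (our convention).
Y = np.array([[ 1, -1,  0],
              [-1,  1,  1],
              [ 0,  0, -1]])
observed = Y != 0                        # the set S of known entries
X = np.array([[ 2.0, -0.5, 0.3],
              [-1.2,  0.8, 1.5],
              [ 0.1,  0.4, 0.2]])
loss = hinge_loss(X, Y, observed)        # 0.5 + 0.2 + 1.2 = 1.9
```

Only entries whose prediction falls short of margin 1 contribute to the loss; confidently correct entries (such as <math>x_{11}=2.0</math> above) contribute nothing.<br />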
<br />
== Prediction Using Matrix Factorization ==<br />
<br />
Another way to tackle this problem is '''Matrix Factorization'''. In matrix factorization one tries to find a prediction <math>X = UV^T</math> such that the rank of each factor (U and V) is low. Moreover, U and V are meaningful in this problem: the i<sub>th</sub> row of U indicates the importance (weights) of the features for the i<sub>th</sub> user, and similarly the j<sub>th</sub> row of V characterizes the j<sub>th</sub> movie in terms of the same features. Here, lowering the rank of X is equivalent to lowering the rank of its factors; unfortunately, the main problem is that this rank optimization problem is not ''convex''.<br />
<br />
=== A short discussion on Loss Function for classification===<br />
<br />
Collaborative prediction, as described in the problem definition, involves classifying each entry of a matrix as either 1 or -1. Consider the notation of the last paragraph and suppose that we are to predict the j<sub>th</sub> column of the matrix X. Suppose further that the matrix U is fixed; then the classification problem boils down to finding the j<sub>th</sub> row of V which gives the ''optimal prediction''. Denoting the j<sub>th</sub> row of V by the vector <math>v</math>, the matrix entry <math>X_{ij}</math> by the real number <math>x</math>, and the i<sub>th</sub> row of U by the vector <math>u</math>, the classification problem can be rephrased as finding the weight vector <math>v</math> and predicting the value of <math>x</math> (which is either 1 or -1) by the dot product function <math>f(v)=vu^T</math>. Usually, we specify a threshold value (for example, 0) and classify <math>x</math> as follows: if <math>f(v)</math> is greater than the threshold, classify <math>x</math> as 1; otherwise classify <math>x</math> as -1.<br />
<br />
There are many ways to measure how well the above classification scheme performs (in the training stage). The most natural way is to count the number of incorrect classifications: cases where the true value is 1 but the classification scheme gives -1, or vice versa. However, this natural performance measure gives rise to a very intractable optimization problem. To make the optimization more tractable, several popular ''proxy loss functions'' are used in its place to measure the performance of the classification scheme. These proxy loss functions include the log loss, the squared loss, and the hinge loss.<br />
<br />
A comparison of these three proxy loss functions is available at http://hunch.net/?p=547.<br />
<br />
== Frobenius Norm ==<br />
<br />
Instead of using the rank of each factor, we can use '''Frobenius Norm''' of each factor to address this problem; in fact, this norm is closely related to the rank of matrix. Moreover, in this way the optimization problem will be convex, and one may get the benefit of using common optimization techniques. The Frobenius norm is defined as:<br />
<br />
<math>\|X\|_{F}^2 = \sum x_{ij}^2 = Tr(XX^T) = \sum \lambda_i^2</math> <br />
(where <math>Tr()</math> is the ''Trace Function'' and <math>\lambda</math>s are singular values of X)<br />
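These three expressions for the squared Frobenius norm (sum of squared entries, trace form, and sum of squared singular values) can be checked numerically; a NumPy sketch, not from the paper:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))

fro_sq_entries  = (X ** 2).sum()                                   # sum of x_ij^2
fro_sq_trace    = np.trace(X @ X.T)                                # Tr(X X^T)
fro_sq_singular = (np.linalg.svd(X, compute_uv=False) ** 2).sum()  # sum of lambda_i^2
```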
<br />
== Notion of Margin in Collaborative Filtering ==<br />
<br />
Assume one of the factors is fixed and the goal is to predict the other factor. This is the famous problem of linear prediction, or SVM. Assume U is fixed; then predicting each column of X is, in fact, solving an SVM over U to find a row of V. Recall that in SVM, to maximize the margin, the norm of the linear separator <math>\|\beta\|^2</math> should be minimized. Therefore, predicting Y with the maximum margin is equivalent to minimizing the norm of the factor V, i.e. <math>\|V\|_{F}^2</math>. The problem here, however, is to predict both U and V together (which is called '''collaborative filtering'''); at each step one factor can be assumed fixed, so minimizing the norm of the other factor gives the maximum margin for this prediction.<br />
<br />
== The Optimization Problem and Trace Norm==<br />
<br />
So far it has been shown that the optimization problem is to find the factors with minimum norm such that the prediction has a low loss:<br />
<br />
<math>\displaystyle \min_{X = UV^T} (\|U\|_{F}^2 + \|V\|_{F}^2) + c \cdot \textrm{Hinge}(X|Y)</math><br />
<br />
This optimization problem is difficult because the factorization constraint <math>X = UV^T</math> is not convex. The following lemma helps to reshape the optimization problem into a form that is easier to solve:<br />
<br />
'''Lemma 1:''' <br /><br />
<math>\displaystyle \min_{X = UV^T} \frac{1}{2}(\|U\|_{F}^2 + \|V\|_{F}^2) = <br />
\displaystyle \min_{X = UV^T} (\|U\|_{F} \cdot \|V\|_{F}) = \|X\|_{T} </math><br />
<br />
where <math>\|X\|_{T}</math> is the '''Trace Norm''' of X and is defined as: <br /><br />
<math> \|X\|_{T} = \sum |\lambda_i| = Tr(\sqrt{XX^T})</math> <br /><br />
In addition, by using SVD of <math>X = A \Sigma B^T</math>, one can see that both <math>U = A \sqrt{\Sigma}</math> and <br />
<math>V = B\sqrt{\Sigma}</math> satisfy this lemma.<br />
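Lemma 1 can be verified numerically: building <math>U = A\sqrt{\Sigma}</math> and <math>V = B\sqrt{\Sigma}</math> from the SVD recovers X and attains the trace norm (a NumPy sketch of our own):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 4))

# SVD X = A Sigma B^T; set U = A sqrt(Sigma), V = B sqrt(Sigma).
A, s, Bt = np.linalg.svd(X, full_matrices=False)
U = A * np.sqrt(s)                 # scale the columns of A by sqrt of singular values
V = Bt.T * np.sqrt(s)              # scale the columns of B likewise

trace_norm = s.sum()                                   # ||X||_T = sum of singular values
half_sum = 0.5 * ((U ** 2).sum() + (V ** 2).sum())     # (1/2)(||U||_F^2 + ||V||_F^2)
recovers_X = np.allclose(U @ V.T, X)
```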
<br />
Based on Lemma 1, the optimization problem can be reformulated as:<br />
<br />
<math>\displaystyle \min_X \|X\|_{T} + c \cdot \textrm{Hinge}(X|Y)</math><br />
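As an aside (not from the paper, which solves this via SDP below), trace-norm-regularized objectives of this form are commonly attacked with proximal methods, because the proximal operator of <math>\tau\|\cdot\|_{T}</math> simply soft-thresholds the singular values. A minimal sketch of that operator:<br />

```python
import numpy as np

def svt(X, tau):
    """Singular-value soft-thresholding: the proximal operator of tau * ||.||_T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(4)
X = rng.standard_normal((5, 5))
Z = svt(X, 1.0)

trace_norm_before = np.linalg.svd(X, compute_uv=False).sum()
trace_norm_after = np.linalg.svd(Z, compute_uv=False).sum()
```

Each singular value is shrunk by at most <math>\tau</math> (and clipped at zero), so the operator both reduces the trace norm and tends to lower the rank.<br />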
<br />
== Relation Between Rank And Trace Norm ==<br />
<br />
In the literature, the notion of trace norm has been widely used instead of dealing with the rank of matrices. The next theorem explains the relation between the rank and the trace norm:<br />
<br />
'''Theorem 1:'''<br /><br />
The convex envelope (smallest convex bounding function) of the rank function, on matrices with unit spectral norm, is the trace norm function. <br /><br />
''Spectral norm'' of a matrix is its largest singular value (for a symmetric matrix, the largest absolute value of its eigenvalues).<br />
<br />
In addition, it can easily be shown that the Frobenius norm, the trace norm, and the rank of a matrix are related as follows:<br />
<br />
<math>\forall X: \|X\|_{F} \leq \|X\|_{T} \leq \sqrt{Rank(X)} \cdot \|X\|_{F}</math><br />
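This chain of inequalities is easy to confirm on a random low-rank matrix (a NumPy sketch of our own):<br />

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 4)) @ rng.standard_normal((4, 5))   # rank at most 4

s = np.linalg.svd(X, compute_uv=False)
fro = np.sqrt((s ** 2).sum())            # ||X||_F
tr = s.sum()                             # ||X||_T
rank = np.linalg.matrix_rank(X)
```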
<br />
Based on this relation, and the fact that the trace norm is a convex function, it can be shown:<br />
<br />
<math>\{X \mid \|X\|_{T} \leq \alpha\} = conv \{uv^T \mid u \in \Re^n , v \in \Re^m , |u| = |v| = \sqrt{\alpha}\}</math><br />
<br />
which shows that the set of matrices with a bounded trace norm is convex; moreover, the extreme points of this set are the rank-one matrices <math>uv^T</math>, so the lowest-rank matrices lie on its boundary. Thus, this optimization problem is in fact the process of searching a convex set to optimize a convex function.<br />
<br />
== Soft Margin Optimization ==<br />
<br />
Just as in SVM the classes may not be linearly separable, so that no linear separator satisfies all the constraints, there may exist some Y for which no factorization preserves all the knowledge in Y. The same remedy as in soft-margin SVM can be used here: add slack variables <math>\xi_{ij}</math> to the loss function:<br />
<br />
<math>\displaystyle \min_X \|X\|_{T} + c \cdot \displaystyle \sum_{(i,j) \in S} \xi_{ij}</math> <br /><br />
'''s.t.''' <math>\forall (i,j) \in S: \xi_{ij} \geq 0</math> ''' , ''' <math>x_{ij} \cdot y_{ij} \geq 1 - \xi_{ij}</math><br />
<br />
== Using Semi-Definite Programming ==<br />
<br />
This optimization problem is now easy to understand, and it is convex with linear constraints. To solve it, it should be reformulated as one of the standard convex optimization problems. The next lemma shows how ''Semi-Definite Programming'' (SDP) can be used to solve this optimization:<br />
<br />
'''Lemma 2:'''<br /><br />
<math>\forall X \in \Re^{n \times m}</math> and <math>t \in \Re : \|X\|_{T} \leq t \Longleftrightarrow \exists A, B</math> '''s.t.''' <br />
<math> M = \left( \begin{array}{cc} A & X \\ X^T & B \\ \end{array} \right) \succeq 0</math> and <math>Tr(M) \leq 2t</math><br />
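The forward direction of Lemma 2 can be checked by constructing A and B from the SVD <math>X = U\Sigma V^T</math>: taking <math>A = U\Sigma U^T</math> and <math>B = V\Sigma V^T</math> gives <math>M \succeq 0</math> with <math>Tr(M) = 2\|X\|_{T}</math> (a numerical sketch of our own; this construction is standard, not quoted from the paper):<br />

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((4, 3))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
A = U @ np.diag(s) @ U.T
B = V @ np.diag(s) @ V.T

M = np.block([[A, X], [X.T, B]])          # the block matrix of Lemma 2
min_eig = np.linalg.eigvalsh(M).min()     # >= 0 up to round-off, so M is PSD
trace_norm = s.sum()
```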
<br />
Therefore the last optimization problem can be formulated as a convex SDP with a linear objective:<br />
<br />
<math>\displaystyle \min_{M \succeq 0} Tr(M) + c \cdot \displaystyle \sum_{(i,j) \in S} \xi_{ij}</math> <br /><br />
'''s.t.''' <math>M = \left( \begin{array}{cc} A & X \\ X^T & B \\ \end{array} \right)\succeq 0</math> <br />
'''and''' <math>\forall (i,j) \in S: \xi_{ij} \geq 0</math>''' , ''' <math>x_{ij} \cdot y_{ij} \geq 1 - \xi_{ij}</math><br />
<br />
which has a dual form with a simple structure; moreover, predictions can be made directly from a solution to the dual problem.<br />
<br />
== Experiments ==<br />
<br />
Preliminary experiments were performed on a subset of the 100K MovieLens dataset, consisting of the 100 users and 100 movies with the most ratings; CSDP was used to solve the resulting SDPs. The ratings are on a discrete scale of one through five, and both generalizations of the hinge loss above were tested, allowing per-user thresholds. WLRA and K-Medians served as the "baseline" learners. <br />
The data was split into four sets. For each of these four test sets, the remaining sets were used to calculate a 3-fold cross-validation (CV) error for each method (WLRA, K-Medians, and trace-norm and max-norm MMMF with immediate-threshold and all-threshold hinge loss) over a range of parameters (rank for WLRA, number of centers for K-Medians, slack cost for MMMF). For each of the four splits, the two MMMF learners with the lowest CV zero-one error (ZOE) and mean absolute error (MAE) and the two baseline learners with the lowest CV ZOE and MAE were selected, and their error was measured on the held-out test data. <br />
[[File:Paer4-Table1.JPG]]<br />
<br />
Table 1 lists these CV and test errors, and the average test error across all four test sets. On average and on three of the four test sets, MMMF achieves lower MAE than the Baseline learners; on all four of the test sets, MMMF achieves lower ZOE than the Baseline learners.<br />
<br />
== Limitation ==<br />
The assumption that the observed entries are sampled uniformly is unrealistic; for example, users tend to rate items they like. In fact, allowing an arbitrary sampling distribution would only guarantee low error on the kinds of items the user tends to rate (items he likes), but not on the items he would really like based on our prediction.</div>Myakhavehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=relevant_Component_Analysis&diff=3634relevant Component Analysis2009-07-29T01:16:52Z<p>Myakhave: /* First paper: Shental et al., 2002 N. Shental, T. Hertz, D. Weinshall, and M. Pavel, "Adjustment Learning and Relevant Component Analysis," Proc. European Conference on Computer Vision (ECCV), 2002, pp. 776-790. */</p>
<hr />
<div>== First paper: Shental ''et al.'', 2002 <ref>N. Shental, T. Hertz, D. Weinshall, and M. Pavel, "Adjustment Learning and Relevant Component Analysis," Proc. European Conference on Computer Vision (ECCV), 2002, pp. 776-790.</ref> ==<br />
<br />
Irrelevant data variability often causes difficulties in classification and clustering tasks. For example, when data variability is dominated by environment conditions, such as global illumination, nearest-neighbour classification in the original feature space may be very unreliable. The goal of Relevant Component Analysis (RCA) is to find a transformation that amplifies relevant variability and suppresses irrelevant variability.<br />
<br />
:: ''Definition of irrelevant variability:'' We say that data variability is correlated with a specific task "if the removal of this variability from the data deteriorates (on average) the results of clustering or retrieval" [1]. Variability is irrelevant if it is "maintained in the data" but "not correlated with the specific task" [1].<br />
<br />
To achieve this goal, Shental ''et al.'' introduced the idea of ''chunklets'' – "small sets of data points, in which the class label is constant, but unknown" [1]. As we will see, chunklets allow irrelevant variability to be suppressed without needing fully labelled training data. Since the data come unlabelled, the chunklets "must be defined naturally by the data": for example, in speaker identification, "short utterances of speech are likely to come from a single speaker" [1]. The authors coin the term ''adjustment learning'' to describe learning using chunklets; adjustment learning can be viewed as falling somewhere between unsupervised learning and supervised learning.<br />
<br />
Relevant Component Analysis tries to find a linear transformation W of the feature space such that the effect of irrelevant variability is reduced in the transformed space. That is, we wish to rescale the feature space and reduce the weights of irrelevant directions. The main premise of RCA is that we can reduce irrelevant variability by reducing the within-class variability. Intuitively, a direction which exhibits high variability among samples of the same class is unlikely to be useful for classification or clustering. <br />
<br />
RCA assumes that the class covariances are all equal. If we allow this assumption, it makes sense to rescale the feature space using a whitening transformation based on the common class covariance Σ. This gives the familiar transformation W = VΛ<sup>-1/2</sup>, where V and Λ can be found by the singular value decomposition of Σ.<br />
<br />
With labelled data estimating Σ is straightforward, but in RCA labelled data is not available and an approximation is calculated using chunklets. The ''chunklet scatter matrix'' is calculated by<br />
<br />
:: <math>S_{ch} = \frac{1}{|\Omega|}\sum_{n=1}^N|H_n|Cov(H_n)</math><br />
<br />
where |Ω| is the size of the data set, H<sub>n</sub> is the nth chunklet, |H<sub>n</sub>| is the size of the nth chunklet, and N is the number of chunklets.<br />
<br />
Intuitively, this is a weighted average of the chunklet covariances, with weight proportional to the size of the chunklet. Each chunklet is expected to approximate the mean of its class regardless of its size; however, size does matter, since a larger chunklet yields a more reliable estimate of the class mean.<br />
<br />
The steps of the RCA algorithm are as follows:<br />
<br />
:: "1. Calculate S<sub>ch</sub>... Let r denote its effective rank (the number of singular values of S<sub>ch</sub> which are significantly larger than 0).<br />
:: 2. Compute the total covariance (scatter) matrix of the original data S<sub>T</sub>, and project the data using PCA to its r largest dimensions.<br />
:: 3. Project S<sub>ch</sub> onto the reduced dimensional space, and compute the corresponding whitening transformation W.<br />
:: 4. Apply W to the original data (in the reduced space)." [1]<br /><br />
Those directions in which the data variability is due to class variability are irrelevant for classification and the computed W assigns lower weight to these directions.<br />
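The four steps above can be sketched in a few lines of Python/NumPy (an illustrative implementation of our own; the effective-rank threshold and the toy chunklets are assumptions for the demo):<br />

```python
import numpy as np

def rca(X, chunklets, tol=1e-8):
    """RCA sketch. X is (N, d); chunklets is a list of row-index arrays."""
    N, d = X.shape
    # Step 1: chunklet scatter S_ch = (1/|Omega|) * sum_n |H_n| * Cov(H_n).
    S_ch = np.zeros((d, d))
    for idx in chunklets:
        Hc = X[idx] - X[idx].mean(axis=0)
        S_ch += Hc.T @ Hc                  # |H_n| * Cov(H_n): the 1/|H_n| cancels
    S_ch /= N
    # Effective rank r: singular values significantly larger than 0.
    svals = np.linalg.svd(S_ch, compute_uv=False)
    r = int((svals > tol * svals.max()).sum())
    # Step 2: PCA of the total scatter, keeping the r largest dimensions.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:r].T
    # Step 3: whitening transformation W of the projected chunklet scatter.
    evals, evecs = np.linalg.eigh(P.T @ S_ch @ P)
    W = evecs @ np.diag(evals ** -0.5)
    # Step 4: apply W to the original data in the reduced space.
    return (Xc @ P) @ W, P, W

# Toy data: 30 points in 3 dimensions, three chunklets of 5 points each.
rng = np.random.default_rng(5)
X = rng.standard_normal((30, 3))
chunklets = [np.arange(0, 5), np.arange(5, 10), np.arange(10, 15)]
X_rca, P, W = rca(X, chunklets)

# Sanity check: the chunklet scatter of the transformed data is the identity.
S_after = np.zeros((X_rca.shape[1], X_rca.shape[1]))
for idx in chunklets:
    Hc = X_rca[idx] - X_rca[idx].mean(axis=0)
    S_after += Hc.T @ Hc
S_after /= X.shape[0]
```

The final check mirrors the intent of the algorithm: after whitening, within-chunklet directions carry unit variance, so high-variability (irrelevant) directions have been down-weighted.<br />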
<br />
'''Experimental Results: Face Recognition'''<br />
<br />
The authors demonstrated the performance of RCA on the task of face recognition using the YaleA database. The database contains 155 face images of 15 people; lighting conditions and facial expressions vary across images. RCA is compared with the Eigenface method (based on PCA) and the Fisherface method (based on Fisher's Linear Discriminant) for both nearest-neighbour classification and clustering-based classification. In this dataset, the data is not naturally divided into chunklets, so the authors randomly sample chunklets given the ground-truth class (for example, if an individual is represented in 10 images, two chunklets may be formed by randomly partitioning the images into two groups of 5 images.) <br />
<br />
For nearest neighbour classification, RCA outperforms Eigenface but does slightly worse than Fisherface. For clustering, RCA performs better than Eigenface and comparably to Fisherface. The authors pointed out that these experimental results are encouraging as Fisherface is a supervised method.<br />
<br />
In <ref> M. Sorci, G. Antonini, and J.-P. Thiran, "Fisher's discriminant and relevant component analysis for static facial expression classification."</ref>, it is shown that, in a facial expression recognition framework, RCA combined with FLD yields a better classifier than RCA alone, with results comparable to SVM.<br />
<br />
'''Experimental Results: Surveillance'''<br />
<br />
In a second experiment, the authors used surveillance video footage divided into discrete clips in which a single person is featured. The same person can appear in multiple clips, and the task was to retrieve all clips in which a query person appears. A colour histogram is used to represent a person. Sources of irrelevant variation include reflections, occlusions, and illumination. In this experiment, the data does come naturally in chunklets: each clip features a single person, so frames in the same clip form a chunklet. Figure 7 in the paper shows the results of k-nearest neighbour classification (not reproduced here for copyright reasons).<br />
<br />
== Second Paper: Bar-Hillel ''et al.'', 2003 <ref> A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall, "Learning Distance Functions using Equivalence Relations," Proc. International Conference on Machine Learning (ICML), 2003, pp. 11-18. </ref> ==<br />
<br />
In a subsequent work [2], Bar-Hillel ''et al.'' described how RCA can be shown to optimize an information theoretic criterion, and compared the performance of RCA with the approach proposed by Xing ''et al.'' [3].<br />
<br />
'''Information Maximization'''<br />
<br />
According to information theory, "when an input X is transformed into a new representation Y, we should seek to maximize the mutual information I(X, Y) between X and Y under suitable constraints" [2]. In adjustment learning, we can think of the objective to be to keep chunklet points close to each other in the transformed space. More formally:<br />
<br />
::<math>\max_{f \in F}I(X,Y) \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||y_{ji} - m_j^y||^2 \le K</math><br />
<br />
where f is a transformation function, m<sub>j</sub><sup>y</sup> is the mean of chunklet j in the transformed space, p is the total number of chunklet points, and K is a constant.<br />
<br />
To maximize I(X,Y), we can simply maximize the entropy of Y, H(Y). This is because I(X,Y) = H(Y) – H(Y|X), and H(Y|X) is constant since the transformation is deterministic. Intuitively, since the transformation is deterministic there is no uncertainty in Y if X is known. <br />
<br />
Now we would like to express H(Y) in terms of H(X). If the transformation is invertible, we have p<sub>y</sub>(y) = p<sub>x</sub>(x) / |J(x)|, where J(x) is the Jacobian of the transformation. Therefore,<br />
<br />
::<math><br />
\begin{align}<br />
H(Y) & = -\int_y p(y)\log p(y)\, dy \\<br />
& = -\int_x p(x) \log \frac{p(x)}{|J(x)|} \, dx \\<br />
& = H(X) + \langle \log |J(x)| \rangle_x<br />
\end{align}<br />
</math><br />
<br />
Assuming a linear transformation Y = AX, the Jacobian determinant is simply the constant |A|. So to maximize I(X,Y), we can maximize H(Y), and maximizing H(Y) amounts to maximizing |A|. Hence, the optimization objective can be updated as<br />
<br />
::<math>\max_A |A| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_{A^tA} \le K</math><br />
<br />
This can also be expressed in terms of the Mahalanobis distance matrix B = A<sup>t</sup>A as follows, noting that log |A| = (1/2) log |B|.<br />
<br />
::<math>\max_B |B| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \le K , \quad B > 0</math><br />
<br />
The solution to this problem is <math>B = \tfrac{K}{N} \hat{C}^{-1}</math>, where <math>\hat{C}</math> is the chunklet scatter matrix calculated in Step 1 of RCA. Thus, RCA gives the optimal Mahalanobis distance matrix up to a scale factor.<br />
<br />
<br />
'''Within-Chunklet Distance Minimization'''<br />
<br />
In addition, RCA minimizes the sum of within-chunklet squared distances. If we consider the optimization problem<br />
<br />
::<math>\min_B \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \quad s.t. \quad |B| \ge 1</math> <br />
<br />
then it can be shown that RCA once again gives the optimal Mahalanobis distance matrix up to a scale factor. This property suggests a natural comparison with Xing ''et al.''’s method, which similarly learns a distance metric based on similarity side information. Xing ''et al.''’s method assumes side information in the form of pairwise similarities and dissimilarities, and seeks to optimize<br />
<br />
::<math>\min_B \sum_{(x_1,x_2) \in S} ||x_1 - x_2||^2_B \quad s.t. \sum_{(x_1,x_2) \in D} ||x_1 - x_2||_B \ge 1 , \quad B \ge 0 </math><br />
<br />
where S contains similar pairs and D contains dissimilar pairs. Comparing to the preceding optimization problem, if all chunklets have size 2 (i.e. the chunklets are just pairwise similarities), the objective function is the same up to a scale factor.<br />
<br />
The authors compared the clustering performance of RCA with Xing ''et al.''’s method <ref> E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information", Advances in Neural Information Processing Systems, 2002. </ref> using six of the UC Irvine datasets. Clustering performance was measured using a normalized accuracy score defined as<br />
<br />
::<math>\sum_{i > j}\frac{1 \lbrace 1 \lbrace c_i = c_j \rbrace = 1 \lbrace \hat{c}_i = \hat{c}_j \rbrace \rbrace}{0.5m(m-1)}</math><br />
<br />
where 1{ } is the indicator function, <math>\hat{c}</math> is the assigned cluster, and c is the true cluster. The score may be interpreted as the probability of correctly assigning two randomly drawn points.<br />
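This normalized accuracy score can be implemented directly from its definition (a NumPy sketch of our own):<br />

```python
import numpy as np

def pairwise_accuracy(c_true, c_pred):
    """Fraction of the 0.5*m*(m-1) point pairs on which the two partitions agree."""
    c_true = np.asarray(c_true)
    c_pred = np.asarray(c_pred)
    m = len(c_true)
    same_true = c_true[:, None] == c_true[None, :]
    same_pred = c_pred[:, None] == c_pred[None, :]
    iu = np.triu_indices(m, k=1)                 # each unordered pair counted once
    return (same_true[iu] == same_pred[iu]).sum() / (0.5 * m * (m - 1))

perfect = pairwise_accuracy([0, 0, 1, 1], [1, 1, 0, 0])   # label names do not matter
partial = pairwise_accuracy([0, 0, 1, 1], [0, 1, 0, 1])
```

Note the score is invariant to a relabelling of the clusters, since it only compares whether pairs are grouped together.<br />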
<br />
Overall, RCA yielded an improvement over regular K-means and showed comparable performance to Xing ''et al.''’s method, however RCA is more computationally efficient as it works with closed-form expressions while Xing ''et al.''’s method requires iterative gradient descent.<br />
<br />
== Suggestions/Critique ==<br />
<br />
* RCA makes effective use of limited side information in the form of chunklets, however in most applications the data does not naturally come in chunklets. Indeed, in the face recognition experiments, the authors had to make use of prior information to artificially create chunklets. It may be useful if the authors provided additional examples of applications where data is naturally partitioned into chunklets, to further motivate the applicability of RCA.<br />
<br />
* RCA also assumes equal class covariances, which might limit its performance on many real-world datasets.<br />
<br />
* In the UC Irvine experiments, RCA shows similar performance to Xing ''et al.''’s method, but the authors noted that RCA is more computationally efficient. While they make a sensible logical argument (iterative gradient descent tends to be computationally expensive), providing experimental running times may help support and quantify this claim.<br />
<br />
<br />
====Why Equal Variances for Chunklets ====<br />
<br />
In [2] the authors suppose that <math> C_{m} </math> is the random variable describing the distribution of the data in class <math> m </math>; then, assuming equal class variances, they calculate <math> S_{ch} </math> as mentioned above.<br><br />
<br />
Further, suppose that the data in class <math> m </math> depend on another source of variation <math> G </math> besides the class characteristics (<math> G </math> can be a global variation or a sensor characteristic). Now the random variable for the <math> m </math>th class is <math> X=C_{m}+G </math>, where the global effect (<math> G </math>) is the same for all classes, <math> G </math> is independent of <math> C_{m} </math>, and the global variation is larger than the class variation (<math> \Sigma_{m}<\Sigma_{G} </math>). <br><br />
<br />
In this situation the covariance of class <math> m </math> is <math> \Sigma_{m}+\Sigma_{G} </math>, which by assumption is dominated by <math> \Sigma_{G} </math>. This brings us back to the case of (approximately) equal covariances for all classes.<br><br />
<br />
== Kernel RCA==<br />
<br />
Although RCA has significant computational and technical advantages, there are situations in real problems that it cannot handle; that is, RCA comes with some restrictions. <br><br />
<br />
(i) RCA considers only linear transformations and fails for nonlinear ones (even simple ones);<br><br />
(ii) since RCA acts in the input space, its number of parameters depends on the dimensionality of the feature vectors;<br><br />
(iii) RCA requires a vectorial representation of the data, which may not be natural for some kinds of data, such as protein sequences.<br><br />
<br />
To overcome these restrictions, Tsang and colleagues (2005)<ref> Tsang, I. W. and Colleagues; Kernel Relevant Component Analysis For Distance Metric Learning. International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 2005 </ref> suggested using kernels in RCA and showed how RCA can be kernelized.<br />
<br />
===Kernelizing RCA===<br />
For <math>k</math> given chunklets, each containing <math>n_{i}</math> patterns <math>\left\{x_{i,1},...,x_{i,n_{i}} \right\}</math> the covariance matrix of centered patterns is as follow:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\bar{x}_{i}\right)\left(x_{i,j}-\bar{x}_{i}\right)^{'} </math><br />
<br />
and the associated whitening transform is as<br />
<br />
<math>x\stackrel{}{\rightarrow}C^{-\frac{1}{2}}x </math><br />
<br />
Now let <math>X = \left[x_{1,1},x_{1,2},...,x_{1,n_{1}},...,x_{k,1},...,x_{k,n_{k}} \right]</math> be the matrix whose columns are the patterns; then C can be written as:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\frac{1}{n_{i}}X1_{i}\right)\left(x_{i,j}-\frac{1}{n_{i}}X1_{i}\right)^{'} </math><br />
<br />
where <math>1_{i}</math> is <math>n \times 1</math> vector such that:<br />
<br />
<math> [1_{i}]_{j}= \left\{\begin{matrix} <br />
1 & \text{pattern } j \in \text{chunklet } i \\ <br />
0 & \text{otherwise} \end{matrix}\right.</math><br />
<br />
and <math>I_{i}=diag\left(1_{i}\right)</math>.<br />
<br />
Using the above notation, C can be simplified to the form <math>C=\frac{1}{n}XHX^{'}</math><br />
<br />
where <math> H=\sum_{i=1}^{k}\left(I_{i}-\frac{1}{n_{i}}1_{i}1_{i}^{'}\right)</math><br />
<br />
To address the issue of singularity, for small <math> \epsilon </math> let <math>\hat{C}=C+\epsilon I</math>; then the inverse of <math>\hat{C}</math> is <br />
<br />
<math>\hat{C}^{-1}=\frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}X^{'}</math><br />
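This closed form, <math>\hat{C}^{-1}=\frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}X^{'}</math>, is an instance of the Woodbury (push-through) identity and can be confirmed against a direct matrix inversion (a numerical sketch of our own; the toy chunklet layout is an assumption):<br />

```python
import numpy as np

rng = np.random.default_rng(6)
d, n = 4, 9
X = rng.standard_normal((d, n))           # columns x_{i,j} are the patterns

# H = sum_i (I_i - (1/n_i) 1_i 1_i'), here for three chunklets of three points each.
H = np.zeros((n, n))
for idx in [np.arange(0, 3), np.arange(3, 6), np.arange(6, 9)]:
    one = np.zeros(n)
    one[idx] = 1.0
    H += np.diag(one) - np.outer(one, one) / len(idx)

eps = 0.1
C_hat = X @ H @ X.T / n + eps * np.eye(d)            # C + eps * I
direct = np.linalg.inv(C_hat)
woodbury = (np.eye(d) / eps
            - (X @ H) @ np.linalg.inv(np.eye(n) + (X.T @ X) @ H / (n * eps))
              @ X.T / (n * eps ** 2))
```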
<br />
Therefore the inner product of the transformed <math>x</math> and <math>y</math> is <br />
<br />
<math> \left(\hat{C}^{-\frac{1}{2}}x\right)^{'} \left(\hat{C}^{-\frac{1}{2}}y\right)= x^{'} \hat{C}^{-1} y= x^{'} \left( \frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}X^{'} \right) y </math><br />
<br />
Now if RCA operates in feature <math> \mathcal{F}</math> with corresponding kernel <math> l </math> then the inner product between nonlinear transformations <math> \varphi (x)</math> and <math> \varphi (y)</math> after running RCA in <math> \mathcal{F}</math> is:<br />
<br />
<math> \tilde{l}(x,y)=\frac{1}{\epsilon}l(x,y)-l_{x}^{'} \left( \frac{1}{n \epsilon^{2}}H \left( I+\frac{1}{n \epsilon}LH \right)^{-1} \right) l_{x} </math><br />
<br />
where <math>L=\left[ l(x_{i},x_{j}) \right]_{ij}</math>, <math> l_{x}=\left[ l(x_{1,1},x),...,l(x_{k,n_{k}},x) \right]^{'}</math><br />
and <math> l_{y}=\left[ l(x_{1,1},y),...,l(x_{k,n_{k}},y) \right]^{'}</math><br />
<br />
== References ==<br />
<references/></div>

relevant Component Analysis (revision by Myakhave, 2009-07-29T01:10:34Z)
<hr />
<div>== First paper: Shental ''et al.'', 2002 <ref>N. Shental, T. Hertz, D. Weinshall, and M. Pavel, "Adjustment Learning and Relevant Component Analysis," Proc. European Conference on Computer Vision (ECCV), 2002, pp. 776-790.</ref> ==<br />
<br />
Irrelevant data variability often causes difficulties in classification and clustering tasks. For example, when data variability is dominated by environmental conditions, such as global illumination, nearest-neighbour classification in the original feature space may be very unreliable. The goal of Relevant Component Analysis (RCA) is to find a transformation that amplifies relevant variability and suppresses irrelevant variability.<br />
<br />
:: ''Definition of irrelevant variability:'' We say that data variability is correlated with a specific task "if the removal of this variability from the data deteriorates (on average) the results of clustering or retrieval" [1]. Variability is irrelevant if it is "maintained in the data" but "not correlated with the specific task" [1].<br />
<br />
To achieve this goal, Shental ''et al.'' introduced the idea of ''chunklets'' – "small sets of data points, in which the class label is constant, but unknown" [1]. As we will see, chunklets allow irrelevant variability to be suppressed without needing fully labelled training data. Since the data come unlabelled, the chunklets "must be defined naturally by the data": for example, in speaker identification, "short utterances of speech are likely to come from a single speaker" [1]. The authors coin the term ''adjustment learning'' to describe learning using chunklets; adjustment learning can be viewed as falling somewhere between unsupervised learning and supervised learning.<br />
<br />
Relevant Component Analysis tries to find a linear transformation W of the feature space such that the effect of irrelevant variability is reduced in the transformed space. That is, we wish to rescale the feature space and reduce the weights of irrelevant directions. The main premise of RCA is that we can reduce irrelevant variability by reducing the within-class variability. Intuitively, a direction which exhibits high variability among samples of the same class is unlikely to be useful for classification or clustering. <br />
<br />
RCA assumes that the class covariances are all equal. If we allow this assumption, it makes sense to rescale the feature space using a whitening transformation based on the common class covariance Σ. This gives the familiar transformation W = VΛ<sup>-1/2</sup>, where V and Λ can be found by the singular value decomposition of Σ.<br />
<br />
With labelled data, estimating Σ is straightforward; in RCA, however, labelled data is not available, so an approximation is calculated using chunklets. The ''chunklet scatter matrix'' is calculated by<br />
<br />
:: <math>S_{ch} = \frac{1}{|\Omega|}\sum_{n=1}^N|H_n|Cov(H_n)</math><br />
<br />
where |Ω| is the size of the data set, H<sub>n</sub> is the nth chunklet, |H<sub>n</sub>| is the size of the nth chunklet, and N is the number of chunklets.<br />
<br />
Intuitively, this is a weighted average of the chunklet covariances, with weight proportional to the size of the chunklet. A "good" chunklet is one whose mean approximates the mean of its class well, and size matters here: as the size of a chunklet increases, so does the likelihood that its mean correctly approximates the class mean.<br />
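As a concrete illustration, the chunklet scatter matrix and the whitening transform W = VΛ<sup>-1/2</sup> can be sketched in a few lines of NumPy (the data and chunklet sizes below are made up for illustration):<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: three chunklets of sizes 3, 4 and 5 in R^2.
chunklets = [rng.normal(size=(3, 2)), rng.normal(size=(4, 2)), rng.normal(size=(5, 2))]

def chunklet_scatter(chunklets):
    """S_ch = (1/|Omega|) * sum_n |H_n| * Cov(H_n), where Cov is the
    maximum-likelihood (divide-by-|H_n|) covariance of chunklet H_n."""
    n_total = sum(len(H) for H in chunklets)
    d = chunklets[0].shape[1]
    S = np.zeros((d, d))
    for H in chunklets:
        centered = H - H.mean(axis=0)
        S += centered.T @ centered  # equals |H_n| * Cov(H_n)
    return S / n_total

S_ch = chunklet_scatter(chunklets)

# Whitening transform W = V * diag(lambda)^(-1/2) from the eigendecomposition of S_ch.
lam, V = np.linalg.eigh(S_ch)
W = V @ np.diag(lam ** -0.5)
```

After whitening (applying W on the right to row vectors), the chunklet scatter of the transformed data is the identity, which is exactly the sense in which within-class variability has been rescaled away.<br />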
<br />
<br />
The steps of the RCA algorithm are as follows:<br />
<br />
:: "1. Calculate S<sub>ch</sub>... Let r denote its effective rank (the number of singular values of S<sub>ch</sub> which are significantly larger than 0).<br />
:: 2. Compute the total covariance (scatter) matrix of the original data S<sub>T</sub>, and project the data using PCA to its r largest dimensions.<br />
:: 3. Project S<sub>ch</sub> onto the reduced dimensional space, and compute the corresponding whitening transformation W.<br />
:: 4. Apply W to the original data (in the reduced space)." [1]<br /><br />
Those directions in which the data variability is mainly within-class variability are irrelevant for classification, and the computed W assigns lower weight to these directions.<br />
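The four steps above can be sketched as follows (a non-authoritative NumPy sketch; the function name, rank tolerance, data and chunklet index sets are assumptions made for illustration):<br />

```python
import numpy as np

def rca(X, chunklets, rank_tol=1e-8):
    """X: (N, d) data; chunklets: list of index arrays sharing an (unknown) label.
    Returns the transformed data and the overall transformation matrix."""
    N, d = X.shape
    # Step 1: chunklet scatter matrix S_ch and its effective rank r.
    S_ch = np.zeros((d, d))
    for idx in chunklets:
        centered = X[idx] - X[idx].mean(axis=0)
        S_ch += centered.T @ centered
    S_ch /= sum(len(idx) for idx in chunklets)
    r = np.linalg.matrix_rank(S_ch, tol=rank_tol)
    # Step 2: total scatter S_T; project data onto its r leading PCA directions.
    Xc = X - X.mean(axis=0)
    S_T = Xc.T @ Xc / N
    _, V_T = np.linalg.eigh(S_T)
    P = V_T[:, ::-1][:, :r]          # top-r principal directions
    X_red = Xc @ P
    # Step 3: project S_ch into the reduced space and compute the whitening W.
    lam, V = np.linalg.eigh(P.T @ S_ch @ P)
    W = V @ np.diag(lam ** -0.5)
    # Step 4: apply W to the data in the reduced space.
    return X_red @ W, P @ W

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
chunklets = [np.arange(0, 5), np.arange(5, 12), np.arange(12, 20)]
Y, A = rca(X, chunklets)
```

In the transformed data the within-chunklet scatter is whitened, so directions that were dominated by within-class variability no longer dominate distances.<br />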
<br />
'''Experimental Results: Face Recognition'''<br />
<br />
The authors demonstrated the performance of RCA for the task of face recognition using the yaleA database. The database contains 155 face images of 15 people; lighting conditions and facial expression are varied across images. RCA is compared with the Eigenface method (based on PCA) and the Fisherface method (based on Fisher’s Linear Discriminant) for both nearest neighbour classification and clustering-based classification. In this dataset, the data is not naturally divided into chunklets, so the authors randomly sample chunklets given the ground-truth class (for example, if an individual is represented in 10 images, two chunklets may be formed by randomly partitioning the images into two groups of 5 images.) <br />
<br />
For nearest neighbour classification, RCA outperforms Eigenface but does slightly worse than Fisherface. For clustering, RCA performs better than Eigenface and comparably to Fisherface. The authors pointed out that these experimental results are encouraging as Fisherface is a supervised method.<br />
<br />
In <ref> M. Sorci, G. Antonini, and Jean-Philippe Thiran, "Fisher's discriminant and relevant component analysis for static facial expression classification."</ref>, it is shown that, in a facial expression recognition framework, RCA in combination with FLD yields a better classifier than RCA alone, with results comparable to an SVM.<br />
<br />
'''Experimental Results: Surveillance'''<br />
<br />
In a second experiment, the authors used surveillance video footage divided into discrete clips in which a single person is featured. The same person can appear in multiple clips, and the task was to retrieve all clips in which a query person appears. A colour histogram is used to represent a person. Sources of irrelevant variation include reflections, occlusions, and illumination. In this experiment, the data does come naturally in chunklets: each clip features a single person, so frames in the same clip form a chunklet. Figure 7 in the paper shows the results of k-nearest neighbour classification (not reproduced here for copyright reasons).<br />
<br />
== Second Paper: Bar-Hillel ''et al.'', 2003 <ref> A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall, "Learning Distance Functions using Equivalence Relations," Proc. International Conference on Machine Learning (ICML), 2003, pp. 11-18. </ref> ==<br />
<br />
In a subsequent work [2], Bar-Hillel ''et al.'' described how RCA can be shown to optimize an information theoretic criterion, and compared the performance of RCA with the approach proposed by Xing ''et al.'' [3].<br />
<br />
'''Information Maximization'''<br />
<br />
According to information theory, "when an input X is transformed into a new representation Y, we should seek to maximize the mutual information I(X, Y) between X and Y under suitable constraints" [2]. In adjustment learning, we can think of the objective to be to keep chunklet points close to each other in the transformed space. More formally:<br />
<br />
::<math>\max_{f \in F}I(X,Y) \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||y_{ji} - m_j^y||^2 \le K</math><br />
<br />
where f is a transformation function, m<sub>j</sub><sup>y</sup> is the mean of chunklet j in the transformed space, p is the total number of chunklet points, and K is a constant.<br />
<br />
To maximize I(X,Y), we can simply maximize the entropy of Y, H(Y). This is because I(X,Y) = H(Y) – H(Y|X), and H(Y|X) is constant since the transformation is deterministic. Intuitively, since the transformation is deterministic there is no uncertainty in Y if X is known. <br />
<br />
Now we would like to express H(Y) in terms of H(X). If the transformation is invertible, we have p<sub>y</sub>(y) = p<sub>x</sub>(x) / |J(x)|, where J(x) is the Jacobian of the transformation. Therefore,<br />
<br />
::<math><br />
\begin{align}<br />
H(Y) & = -\int_y p(y)\log p(y)\, dy \\<br />
& = -\int_x p(x) \log \frac{p(x)}{|J(x)|} \, dx \\<br />
& = H(X) + \langle \log |J(x)| \rangle_x<br />
\end{align}<br />
</math><br />
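For an invertible linear map Y = AX the Jacobian term is the constant log |det A|, and for Gaussian X the differential entropies have a closed form, so the identity H(Y) = H(X) + ⟨log |J(x)|⟩ can be checked numerically (the covariance and transformation below are illustrative values only):<br />

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of N(0, cov): 0.5 * log((2*pi*e)^d * det(cov))."""
    d = cov.shape[0]
    return 0.5 * np.log(((2 * np.pi * np.e) ** d) * np.linalg.det(cov))

Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # hypothetical Cov(X)
A = np.array([[3.0, 1.0], [0.0, 2.0]])        # linear map Y = A X
H_X = gaussian_entropy(Sigma)
H_Y = gaussian_entropy(A @ Sigma @ A.T)       # Cov(Y) = A Sigma A'
gain = H_Y - H_X                              # should equal log|det A|
```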
<br />
Assuming a linear transformation Y = AX, the Jacobian is simply equal to the constant |A|. So to maximize I(X,Y), we can maximize H(Y), and maximizing H(Y) amounts to maximizing |A|. Hence, the optimization objective can be updated as<br />
<br />
::<math>\max_A |A| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_{A^tA} \le K</math><br />
<br />
This can also be expressed in terms of the Mahalanobis distance matrix B = A<sup>t</sup>A as follows, noting that log |A| = (1/2) log |B|.<br />
<br />
::<math>\max_B |B| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \le K , \quad B > 0</math><br />
<br />
The solution to this problem is <math>B = \tfrac{K}{N} \hat{C}^{-1}</math>, where <math>\hat{C}</math> is the chunklet scatter matrix calculated in Step 1 of RCA. Thus, RCA gives the optimal Mahalanobis distance matrix up to a scale factor.<br />
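Since the constraint can be rewritten as tr(B Ĉ) ≤ K, the claim that B ∝ Ĉ<sup>-1</sup> maximizes |B| is easy to probe numerically. The sketch below scales the candidate so the trace constraint is tight; the proportionality constant K/d used here reflects the scale-factor freedom noted above, and the data are hypothetical:<br />

```python
import numpy as np

rng = np.random.default_rng(2)
d, K = 3, 1.0
C_hat = np.cov(rng.normal(size=(d, 50)))   # stand-in for the chunklet scatter matrix

# The constraint (1/p) * sum ||x_ji - m_j||_B^2 <= K equals tr(B @ C_hat) <= K.
# Candidate optimum: B proportional to C_hat^{-1}, scaled so the constraint is tight.
B_star = (K / d) * np.linalg.inv(C_hat)

best_other = -np.inf
for _ in range(200):
    M = rng.normal(size=(d, d))
    B = M @ M.T                            # random positive semidefinite matrix
    B *= K / np.trace(B @ C_hat)           # rescale onto the constraint boundary
    best_other = max(best_other, np.linalg.det(B))
```

No randomly drawn feasible B attains a larger determinant than the candidate proportional to Ĉ<sup>-1</sup>.<br />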
<br />
<br />
'''Within-Chunklet Distance Minimization'''<br />
<br />
In addition, RCA minimizes the sum of within-chunklet squared distances. If we consider the optimization problem<br />
<br />
::<math>\min_B \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \quad s.t. \quad |B| \ge 1</math> <br />
<br />
then it can be shown that RCA once again gives the optimal Mahalanobis distance matrix up to a scale factor. This property suggests a natural comparison with Xing ''et al.''’s method, which similarly learns a distance metric based on similarity side information. Xing ''et al.''’s method assumes side information in the form of pairwise similarities and dissimilarities, and seeks to optimize<br />
<br />
::<math>\min_B \sum_{(x_1,x_2) \in S} ||x_1 - x_2||^2_B \quad s.t. \sum_{(x_1,x_2) \in D} ||x_1 - x_2||_B \ge 1 , \quad B \ge 0 </math><br />
<br />
where S contains similar pairs and D contains dissimilar pairs. Comparing to the preceding optimization problem, if all chunklets have size 2 (i.e. the chunklets are just pairwise similarities), the objective function is the same up to a scale factor.<br />
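The size-2 correspondence follows from the identity ||x<sub>1</sub> − m||² + ||x<sub>2</sub> − m||² = ½||x<sub>1</sub> − x<sub>2</sub>||² for m = (x<sub>1</sub> + x<sub>2</sub>)/2, which a short numeric check confirms (hypothetical points, Euclidean B = I):<br />

```python
import numpy as np

rng = np.random.default_rng(3)
pairs = [rng.normal(size=(2, 4)) for _ in range(5)]   # five size-2 chunklets in R^4

# RCA-style objective: sum of squared distances to each chunklet's mean.
rca_obj = sum(((H - H.mean(axis=0)) ** 2).sum() for H in pairs)

# Xing et al.-style objective: sum of squared distances over the similar pairs.
xing_obj = sum(((H[0] - H[1]) ** 2).sum() for H in pairs)
```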
<br />
The authors compared the clustering performance of RCA with Xing ''et al.''’s method <ref> E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information", Advances in Neural Information Processing Systems, 2002. </ref> using six of the UC Irvine datasets. Clustering performance was measured using a normalized accuracy score defined as<br />
<br />
::<math>\sum_{i > j}\frac{1 \lbrace 1 \lbrace c_i = c_j \rbrace = 1 \lbrace \hat{c}_i = \hat{c}_j \rbrace \rbrace}{0.5m(m-1)}</math><br />
<br />
where 1{ } is the indicator function, <math>\hat{c}</math> is the assigned cluster, and c is the true cluster. The score may be interpreted as the probability of correctly assigning two randomly drawn points.<br />
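The score can be computed directly from the two label vectors; a small sketch (the label vectors are made up, and the helper name is ours):<br />

```python
from itertools import combinations

def pairwise_accuracy(c_true, c_pred):
    """Fraction of point pairs on which the two clusterings agree about
    same-cluster vs different-cluster membership (the score above)."""
    m = len(c_true)
    agree = sum(
        (c_true[i] == c_true[j]) == (c_pred[i] == c_pred[j])
        for i, j in combinations(range(m), 2)
    )
    return agree / (0.5 * m * (m - 1))

c_true = [0, 0, 1, 1, 2, 2]
c_pred = [1, 1, 0, 0, 2, 2]      # same partition, different cluster names
score = pairwise_accuracy(c_true, c_pred)
```

Note that the score is invariant to relabelling of the clusters, which is why it is a sensible clustering measure.<br />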
<br />
Overall, RCA yielded an improvement over regular K-means and showed comparable performance to Xing ''et al.''’s method; however, RCA is more computationally efficient, as it works with closed-form expressions while Xing ''et al.''’s method requires iterative gradient descent.<br />
<br />
== Suggestions/Critique ==<br />
<br />
* RCA makes effective use of limited side information in the form of chunklets; however, in most applications the data does not naturally come in chunklets. Indeed, in the face recognition experiments, the authors had to make use of prior information to artificially create chunklets. It would be useful if the authors provided additional examples of applications where data is naturally partitioned into chunklets, to further motivate the applicability of RCA.<br />
<br />
* RCA also assumes equal class covariances, which might limit its performance on many real-world datasets.<br />
<br />
* In the UC Irvine experiments, RCA shows similar performance to Xing ''et al.''’s method, but the authors noted that RCA is more computationally efficient. While they make a sensible logical argument (iterative gradient descent tends to be computationally expensive), providing experimental running times may help support and quantify this claim.<br />
<br />
<br />
==== Why Equal Variances for Chunklets ====<br />
<br />
In [2], the authors let <math> C_{m} </math> be the random variable describing the distribution of data in class <math> m </math>, and then, assuming equal class covariances, they calculate <math> S_{ch} </math> as described above.<br><br />
<br />
Further, suppose that the data in class <math> m </math> depend on another source of variation <math> G </math> besides the class characteristics (<math> G </math> can be a global variation or a sensor characteristic). The random variable for the <math> m </math>th class is then <math> X=C_{m}+G </math>, where the global effect <math> G </math> is the same for all classes, <math> G </math> is independent of <math> C_{m} </math>, and the global variation is larger than the class variation (<math> \Sigma_{m}<\Sigma_{G} </math>). <br><br />
<br />
In this situation the covariance of class <math> m </math> is <math> \Sigma_{m}+\Sigma_{G} </math>, which by assumption is dominated by <math> \Sigma_{G} </math>. Every class covariance is therefore approximately equal to <math> \Sigma_{G} </math>, so the equal-covariance assumption approximately holds for all classes.<br><br />
<br />
== Kernel RCA==<br />
<br />
Although RCA has significant computational and technical advantages, there are situations in real problems that RCA fails to deal with, i.e. RCA comes with some restrictions: <br><br />
<br />
(i) RCA considers only linear transformations and fails for nonlinear transformations (even simple ones);<br><br />
(ii) since RCA acts in the input space, its number of parameters depends on the dimensionality of the feature vectors;<br><br />
(iii) RCA requires a vectorial representation of the data, which some kinds of data, such as protein sequences, do not naturally have.<br><br />
<br />
To overcome these restrictions, Tsang ''et al.'' (2005)<ref> Tsang, I. W. and Colleagues; Kernel Relevant Component Analysis For Distance Metric Learning. International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 2005 </ref> suggested using kernels in RCA and showed how one can kernelize RCA.<br />
<br />
===Kernelizing RCA===<br />
For <math>k</math> given chunklets, the <math>i</math>th containing <math>n_{i}</math> patterns <math>\left\{x_{i,1},...,x_{i,n_{i}} \right\}</math>, the covariance matrix of the centered patterns is as follows:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\bar{x}_{i}\right)\left(x_{i,j}-\bar{x}_{i}\right)^{'} </math><br />
<br />
and the associated whitening transform is<br />
<br />
<math>x\stackrel{}{\rightarrow}C^{-\frac{1}{2}}x </math><br />
<br />
Now let <math>X=\left[x_{1,1},x_{1,2},...,x_{1,n_{1}},...,x_{k,1},...,x_{k,n_{k}} \right]</math> be the matrix whose columns are the <math>n</math> patterns; then <math>C</math> can be written as:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\frac{1}{n_{i}}X1_{i}\right)\left(x_{i,j}-\frac{1}{n_{i}}X1_{i}\right)^{'} </math><br />
<br />
where <math>1_{i}</math> is the <math>n \times 1</math> indicator vector such that:<br />
<br />
<math> [1_{i}]_{j}= \left\{\begin{matrix} <br />
1 & \text{pattern } j \in \text{ chunklet } i \\ <br />
0 & \text{otherwise} \end{matrix}\right.</math><br />
<br />
and <math>I_{i}=diag\left(1_{i}\right)</math>.<br />
<br />
Using the above notation, <math>C</math> can be simplified to the form <math>C=\frac{1}{n}XHX^{'}</math><br />
<br />
where <math> H=\sum_{i=1}^{k}\left(I_{i}-\frac{1}{n_{i}}1_{i}1_{i}^{'}\right)</math><br />
<br />
To deal with possible singularity, for a small <math> \epsilon>0 </math> let <math>\hat{C}=C+\epsilon I</math>; then, by the matrix inversion lemma, the inverse of <math>\hat{C}</math> is <br />
<br />
<math>\hat{C}^{-1}=\frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}X^{'}</math><br />
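This matrix-inversion-lemma expansion, which involves the data only through <math>X^{'}X</math>, can be verified numerically (a small hypothetical <math>X</math> and chunklet partition):<br />

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, eps = 3, 6, 0.1
X = rng.normal(size=(d, n))                  # columns are the n patterns

# H built from two chunklets of sizes 2 and 4 (a hypothetical partition).
H = np.zeros((n, n))
for idx in [np.arange(0, 2), np.arange(2, 6)]:
    one = np.zeros(n)
    one[idx] = 1.0
    H += np.diag(one) - np.outer(one, one) / len(idx)

C_hat = X @ H @ X.T / n + eps * np.eye(d)    # C + eps*I

# Matrix-inversion-lemma form of C_hat^{-1}, using only X'X:
n_eps = n * eps
inner = np.linalg.inv(np.eye(n) + (X.T @ X) @ H / n_eps)
inv_formula = np.eye(d) / eps - (X @ H @ inner @ X.T) / (n_eps * eps)
```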
<br />
Therefore the inner product of the transformed <math>x</math> and <math>y</math> is <br />
<br />
<math> \left(\hat{C}^{-\frac{1}{2}}x\right)^{'} \left(\hat{C}^{-\frac{1}{2}}y\right)= x^{'} \hat{C}^{-1} y= x^{'} \left( \frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}X^{'} \right) y </math><br />
<br />
Now if RCA operates in a feature space <math> \mathcal{F}</math> with corresponding kernel <math> l </math>, then the inner product between the nonlinear transformations <math> \varphi (x)</math> and <math> \varphi (y)</math> after running RCA in <math> \mathcal{F}</math> is:<br />
<br />
<math> \tilde{l}(x,y)=\frac{1}{\epsilon}l(x,y)-l_{x}^{'} \left( \frac{1}{n \epsilon^{2}}H \left( I+\frac{1}{n \epsilon}LH \right)^{-1} \right) l_{y} </math><br />
<br />
where <math>L=\left[ l(x_{i},x_{j}) \right]_{ij}</math>, <math> l_{x}=\left[ l(x_{1,1},x),...,l(x_{k,n_{k}},x) \right]^{'}</math><br />
and <math> l_{y}=\left[ l(x_{1,1},y),...,l(x_{k,n_{k}},y) \right]^{'}</math><br />
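As a sanity check, with the linear kernel l(x, y) = x'y the kernelized expression should reproduce the linear-RCA inner product x'Ĉ<sup>-1</sup>y above; a short numeric sketch (hypothetical data and chunklets):<br />

```python
import numpy as np

rng = np.random.default_rng(5)
d, n, eps = 3, 6, 0.1
X = rng.normal(size=(d, n))                  # columns are the patterns

# H from two hypothetical chunklets of sizes 2 and 4.
H = np.zeros((n, n))
for idx in [np.arange(0, 2), np.arange(2, 6)]:
    one = np.zeros(n)
    one[idx] = 1.0
    H += np.diag(one) - np.outer(one, one) / len(idx)

x, y = rng.normal(size=d), rng.normal(size=d)

# Linear-RCA side: x' C_hat^{-1} y with C_hat = (1/n) X H X' + eps I.
C_hat = X @ H @ X.T / n + eps * np.eye(d)
lhs = x @ np.linalg.inv(C_hat) @ y

# Kernel-RCA side, using only kernel evaluations (linear kernel here).
L = X.T @ X                                  # L[i, j] = l(x_i, x_j)
l_x, l_y = X.T @ x, X.T @ y                  # kernel vectors against x and y
inner = np.linalg.inv(np.eye(n) + L @ H / (n * eps))
rhs = (x @ y) / eps - l_x @ (H @ inner @ l_y) / (n * eps ** 2)
```

For a nonlinear kernel the same right-hand side applies, with the dot products replaced by kernel evaluations.<br />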
<br />
== References ==<br />
<references/></div>
<hr />
<div>== First paper: Shental ''et al.'', 2002 <ref>N. Shental, T. Hertz, D. Weinshall, and M. Pavel, "Adjustment Learning and Relevant Component Analysis," Proc. European Conference on Computer Vision (ECCV), 2002, pp. 776-790.</ref> ==<br />
<br />
Irrelevant data variability often causes difficulties in classification and clustering tasks. For example, when data variability is dominated by environment conditions, such as global illumination, nearest-neighbour classification in the original feature space may be very unreliable. The goal of Relevant Component Analysis (RCA) is to find a transformation that amplifies relevant variability and suppresses irrelevant variability.<br />
<br />
:: ''Definition of irrelevant variability:'' We say that data variability is correlated with a specific task "if the removal of this variability from the data deteriorates (on average) the results of clustering or retrieval" [1]. Variability is irrelevant if it is "maintained in the data" but "not correlated with the specific task" [1].<br />
<br />
To achieve this goal, Shental ''et al.'' introduced the idea of ''chunklets'' – "small sets of data points, in which the class label is constant, but unknown" [1]. As we will see, chunklets allow irrelevant variability to be suppressed without needing fully labelled training data. Since the data come unlabelled, the chunklets "must be defined naturally by the data": for example, in speaker identification, "short utterances of speech are likely to come from a single speaker" [1]. The authors coin the term ''adjustment learning'' to describe learning using chunklets; adjustment learning can be viewed as falling somewhere between unsupervised learning and supervised learning.<br />
<br />
Relevant Component Analysis tries to find a linear transformation W of the feature space such that the effect of irrelevant variability is reduced in the transformed space. That is, we wish to rescale the feature space and reduce the weights of irrelevant directions. The main premise of RCA is that we can reduce irrelevant variability by reducing the within-class variability. Intuitively, a direction which exhibits high variability among samples of the same class is unlikely to be useful for classification or clustering. <br />
<br />
RCA assumes that the class covariances are all equal. If we allow this assumption, it makes sense to rescale the feature space using a whitening transformation based on the common class covariance Σ. This gives the familiar transformation W = VΛ<sup>-1/2</sup>, where V and Λ can be found by the singular value decomposition of Σ.<br />
<br />
With labelled data estimating Σ is straightforward, but in RCA labelled data is not available and an approximation is calculated using chunklets. The ''chunklet scatter matrix'' is calculated by<br />
<br />
:: <math>S_{ch} = \frac{1}{|\Omega|}\sum_{n=1}^N|H_n|Cov(H_n)</math><br />
<br />
where |Ω| is the size of the data set, H<sub>n</sub> is the nth chunklet, |H<sub>n</sub>| is the size of the nth chunklet, and N is the number of chunklets.<br />
<br />
Intuitively, this is a weighted average of the chunklet covariances, with weight proportional to the size of the chunklet.<br />
<br />
The steps of the RCA algorithm are as follows:<br />
<br />
:: "1. Calculate S<sub>ch</sub>... Let r denote its effective rank (the number of singular values of S<sub>ch</sub> which are significantly larger than 0).<br />
:: 2. Compute the total covariance (scatter) matrix of the original data S<sub>T</sub>, and project the data using PCA to its r largest dimensions.<br />
:: 3. Project S<sub>ch</sub> onto the reduced dimensional space, and compute the corresponding whitening transformation W.<br />
:: 4. Apply W to the original data (in the reduced space)." [1]<br /><br />
Those directions in which the data variability is due to class variability are irrelevant for classification and the computed W assigns lower weight to these directions.<br />
<br />
'''Experimental Results: Face Recognition'''<br />
<br />
The authors demonstrated the performance of RCA for the task of face recognition using the yaleA database. The database contains 155 face images of 15 people; lighting conditions and facial expression are varied across images. RCA is compared with the Eigenface method (based on PCA) and the Fisherface method (based on Fisher’s Linear Discriminant) for both nearest neighbour classification and clustering-based classification. In this dataset, the data is not naturally divided into chunklets, so the authors randomly sample chunklets given the ground-truth class (for example, if an individual is represented in 10 images, two chunklets may be formed by randomly partitioning the images into two groups of 5 images.) <br />
<br />
For nearest neighbour classification, RCA outperforms Eigenface but does slightly worse than Fisherface. For clustering, RCA performs better than Eigenface and comparably to Fisherface. The authors pointed out that these experimental results are encouraging as Fisherface is a supervised method.<br />
<br />
In <ref> M. Sorci,G. Antonini, and Jean-Philippe Thiran, "Fisher's discriminant and relevant component analysis for static facial expression classification."</ref>, it's shown that RCA in combination with FLD results in better classifier in the context of facial expression recognition framework as compared to RCA alone. This combination has results comparable to SVM.<br />
<br />
'''Experimental Results: Surveillance'''<br />
<br />
In a second experiment, the authors used surveillance video footage divided into discrete clips in which a single person is featured. The same person can appear in multiple clips, and the task was to retrieve all clips in which a query person appears. A colour histogram is used to represent a person. Sources of irrelevant variation include reflections, occlusions, and illumination. In this experiment, the data does come naturally in chunklets: each clip features a single person, so frames in the same clip from a chunklet. Figure 7 in the paper shows the results of k-nearest neighbour classification (not reproduced here for copyright reasons).<br />
<br />
== Second Paper: Bar-Hillel ''et al.'', 2003 <ref> A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall, "Learning Distance Functions using Equivalence Relations," Proc. International Conference on Machine Learning (ICML), 2003, pp. 11-18. </ref> ==<br />
<br />
In a subsequent work [2], Bar-Hillel ''et al.'' described how RCA can be shown to optimize an information theoretic criterion, and compared the performance of RCA with the approach proposed by Xing ''et al.'' [3].<br />
<br />
'''Information Maximization'''<br />
<br />
According to information theory, "when an input X is transformed into a new representation Y, we should seek to maximize the mutual information I(X, Y) between X and Y under suitable constraints" [2]. In adjustment learning, we can think of the objective to be to keep chunklet points close to each other in the transformed space. More formally:<br />
<br />
::<math>\max_{f \in F}I(X,Y) \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||y_{ji} - m_j^y||^2 \le K</math><br />
<br />
where f is a transformation function, m<sub>j</sub><sup>y</sup> is the mean of chunklet j in the transformed space, p is the total number of chunklet points, and K is a constant.<br />
<br />
To maximize I(X,Y), we can simply maximize the entropy of Y, H(Y). This is because I(X,Y) = H(Y) – H(Y|X), and H(Y|X) is constant since the transformation is deterministic. Intuitively, since the transformation is deterministic there is no uncertainty in Y if X is known. <br />
<br />
Now we would like to express H(Y) in terms of H(X). If the transformation is invertible, we have p<sub>y</sub>(y) = p<sub>x</sub>(x) / |J(x)|, where J(x) is the Jacobian of the transformation. Therefore,<br />
<br />
::<math><br />
\begin{align}<br />
H(Y) & = -\int_y p(y)\log p(y)\, dy \\<br />
& = -\int_x p(x) \log \frac{p(x)}{|J(x)|} \, dx \\<br />
& = H(X) + \langle \log |J(x)| \rangle_x<br />
\end{align}<br />
</math><br />
<br />
Assuming a linear transformation Y = AX, the Jacobian is simply equal to the constant |A|. So to maximize I(X,Y), we can maximize H(Y), and maximizing H(Y) amounts to maximizing |A|. Hence, the optimization objective can be updated as<br />
<br />
::<math>\max_A |A| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_{A^tA} \le K</math><br />
<br />
This can also be expressed in terms of the Mahalanobis distance matrix B = A<sup>t</sup>A as follows, noting that log |A| = (1/2) log |B|.<br />
<br />
::<math>\max_B |B| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \le K , \quad B > 0</math><br />
<br />
The solution to this problem is <math>B = \tfrac{K}{N} \hat{C}^{-1}</math>, where <math>\hat{C}</math> is the chunklet scatter matrix calculated in Step 1 of RCA. Thus, RCA gives the optimal Mahalanobis distance matrix up to a scale factor.<br />
<br />
<br />
'''Within-Chunklet Distance Minimization'''<br />
<br />
In addition, RCA minimizes the sum of within-chunklet squared distances. If we consider the optimization problem<br />
<br />
::<math>\min_B \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \quad s.t. \quad |B| \ge 1</math> <br />
<br />
then it can be shown that RCA once again gives the optimal Mahalanobis distance matrix up to a scale factor. This property suggests a natural comparison with Xing ''et al.''’s method, which similarly learns a distance metric based on similarity side information. Xing ''et al.''’s method assumes side information in the form of pairwise similarities and dissimilarities, and seeks to optimize<br />
<br />
::<math>\min_B \sum_{(x_1,x_2) \in S} ||x_1 - x_2||^2_B \quad s.t. \sum_{(x_1,x_2) \in D} ||x_1 - x_2||_B \ge 1 , \quad B \ge 0 </math><br />
<br />
where S contains similar pairs and D contains dissimilar pairs. Comparing to the preceding optimization problem, if all chunklets have size 2 (i.e. the chunklets are just pairwise similarities), the objective function is the same up to a scale factor.<br />
<br />
The authors compared the clustering performance of RCA with Xing ''et al.''’s method <ref> E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information", Advances in Neural Information Processing Systems, 2002. </ref> using six of the UC Irvine datasets. Clustering performance was measured using a normalized accuracy score defined as<br />
<br />
::<math>\sum_{i > j}\frac{1 \lbrace 1 \lbrace c_i = c_j \rbrace = 1 \lbrace \hat{c}_i = \hat{c}_j \rbrace \rbrace}{0.5m(m-1)}</math><br />
<br />
where 1{ } is the indicator function, <math>\hat{c}</math> is the assigned cluster, and c is the true cluster. The score may be interpreted as the probability of correctly assigning two randomly drawn points.<br />
<br />
Overall, RCA yielded an improvement over regular K-means and showed comparable performance to Xing ''et al.''’s method, however RCA is more computationally efficient as it works with closed-form expressions while Xing ''et al.''’s method requires iterative gradient descent.<br />
<br />
== Suggestions/Critique ==<br />
<br />
* RCA makes effective use of limited side information in the form of chunklets, however in most applications the data does not naturally come in chunklets. Indeed, in the face recognition experiments, the authors had to make use of prior information to artificially create chunklets. It may be useful if the authors provided additional examples of applications where data is naturally partitioned into chunklets, to further motivate the applicability of RCA.<br />
<br />
* RCA also assumes equal class covariances, which might limit its performance on many real-world datasets.<br />
<br />
* In the UC Irvine experiments, RCA shows similar performance to Xing ''et al.''’s method, but the authors noted that RCA is more computationally efficient. While they make a sensible logical argument (iterative gradient descent tends to be computationally expensive), providing experimental running times may help support and quantify this claim.<br />
<br />
<br />
====Why Equal Variances for Chanklets ====<br />
<br />
In [2] authors suppose that <math> C_{m} </math> is the random variable which shows distribution of data in class <math> m </math> and then, assuming equality for class variances they calculate <math> S_{ch} </math> as it was mentioned above.<br><br />
<br />
Further, suppose that data in class <math> m </math> are dependent on another source of variation <math> G </math> besides the class characteristics (<math> G </math> can be global variation or sensor characteristics). Now the random variable for <math> m </math>th class is <math> X=C_{m}+G </math>, where global impact (<math> G </math>) is the same for all classes, <math> G </math> is independent of <math> C_{m} </math> and global variation is larger than class variation (<math> \Sigma_{m}<\Sigma_{G} </math>). <br><br />
<br />
In this situation the variance for class <math> m </math> is <math> \Sigma_{m}+\Sigma_{G} </math>, which by assumption is dominated by <math> \Sigma_{G} </math>. The observed class covariances are therefore all approximately <math> \Sigma_{G} </math>, which brings us back to the case of equal covariance for all classes.<br><br />
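This argument can be illustrated with a small numerical example; the covariance values below are hypothetical, chosen so that the shared variation dominates the per-class variation.<br />

```python
import numpy as np

# Hypothetical 2-d example: small, distinct class covariances Sigma_m, plus
# one large shared source of variation Sigma_G (X = C_m + G, G independent).
Sigma_1 = np.diag([0.1, 0.2])
Sigma_2 = np.diag([0.3, 0.1])
Sigma_G = np.diag([50.0, 40.0])

obs_1 = Sigma_1 + Sigma_G   # observed covariance of class 1
obs_2 = Sigma_2 + Sigma_G   # observed covariance of class 2

# The observed class covariances are nearly equal: both are dominated by
# Sigma_G, which recovers the equal-covariance assumption RCA relies on.
rel_diff = np.abs(obs_1 - obs_2).max() / np.abs(Sigma_G).max()
assert rel_diff < 0.01
```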
<br />
== Kernel RCA==<br />
<br />
Although RCA has significant computational and technical advantages, there are situations in real problems that it fails to deal with; that is, RCA comes with some restrictions. <br><br />
<br />
(i) RCA only considers linear transformations, and fails for nonlinear transformations (even simple ones);<br><br />
(ii) since RCA acts in the input space, its number of parameters depends on the dimensionality of the feature vectors;<br><br />
(iii) RCA requires a vectorial representation of the data, which may not be natural for some kinds of data, such as protein sequences.<br><br />
<br />
To overcome these restrictions, Tsang and colleagues (2005)<ref> Tsang, I. W. and Colleagues; Kernel Relevant Component Analysis For Distance Metric Learning. International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 2005 </ref> suggested using a kernel in RCA and showed how one can kernelize RCA.<br />
<br />
===Kernelizing RCA===<br />
For <math>k</math> given chunklets, each containing <math>n_{i}</math> patterns <math>\left\{x_{i,1},...,x_{i,n_{i}} \right\}</math>, the covariance matrix of the centered patterns is as follows:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\bar{x}_{i}\right)\left(x_{i,j}-\bar{x}_{i}\right)^{'} </math><br />
<br />
and the associated whitening transform is as<br />
<br />
<math>x\stackrel{}{\rightarrow}C^{-\frac{1}{2}}x </math><br />
<br />
Now let <math>X=\left[x_{1,1},x_{1,2},...,x_{1,n_{1}},...,x_{k,1},...,x_{k,n_{k}} \right]</math> be the matrix whose columns are the patterns; then C can be written as:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\frac{1}{n_{i}}X1_{i}\right)\left(x_{i,j}-\frac{1}{n_{i}}X1_{i}\right)^{'} </math><br />
<br />
where <math>1_{i}</math> is an <math>n \times 1</math> vector such that:<br />
<br />
<math> [1_{i}]_{j}= \left\{\begin{matrix} <br />
1 & \text{pattern } j \in \text{chunklet } i \\ <br />
0 & \text{otherwise} \end{matrix}\right.</math><br />
<br />
and <math>I_{i}=diag\left(1_{i}\right)</math>.<br />
<br />
Using the above notation, C can be simplified to the form <math>C=\frac{1}{n}XHX^{'}</math><br />
<br />
where <math> H=\sum_{i=1}^{k}\left(I_{i}-\frac{1}{n_{i}}1_{i}1_{i}^{'}\right)</math><br />
<br />
To address the issue of non-singularity, for small <math> \epsilon </math> let <math>\hat{C}=C+\epsilon I</math>; the inverse of <math>\hat{C}</math> is then <br />
<br />
<math>\hat{C}^{-1}=\frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}X^{'}</math><br />
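This matrix identity can be checked numerically; the sketch below uses arbitrary toy chunklet sizes and builds <math>H</math> exactly as defined above.<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical sizes): 3 chunklets of sizes 3, 2 and 4 in R^5.
sizes = [3, 2, 4]
n, p = sum(sizes), 5
X = rng.standard_normal((p, n))      # columns are the patterns x_{i,j}

# H = sum_i ( I_i - (1/n_i) 1_i 1_i' ), built chunklet by chunklet.
H = np.zeros((n, n))
start = 0
for n_i in sizes:
    blk = slice(start, start + n_i)
    H[blk, blk] = np.eye(n_i) - np.ones((n_i, n_i)) / n_i
    start += n_i

eps = 0.1
C_hat = X @ H @ X.T / n + eps * np.eye(p)

# Closed-form inverse (a Woodbury-type push-through identity):
C_hat_inv = (np.eye(p) / eps
             - X @ H @ np.linalg.inv(np.eye(n) + X.T @ X @ H / (n * eps))
               @ X.T / (n * eps ** 2))

assert np.allclose(C_hat_inv @ C_hat, np.eye(p))
```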
<br />
Therefore, the inner product of the transformed <math>x</math> and <math>y</math> is <br />
<br />
<math> \left(\hat{C}^{-\frac{1}{2}}x\right)^{'} \left(\hat{C}^{-\frac{1}{2}}y\right)= x^{'} \hat{C}^{-1} y= x^{'} \left( \frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}X^{'} \right) y </math><br />
<br />
Now if RCA operates in a feature space <math> \mathcal{F}</math> with corresponding kernel <math> l </math>, then the inner product between the nonlinear transformations <math> \varphi (x)</math> and <math> \varphi (y)</math> after running RCA in <math> \mathcal{F}</math> is:<br />
<br />
<math> \tilde{l}(x,y)=\frac{1}{\epsilon}l(x,y)-l_{x}^{'} \left( \frac{1}{n \epsilon^{2}}H \left( I+\frac{1}{n \epsilon}LH \right)^{-1} \right) l_{y} </math><br />
<br />
where <math>L=\left[ l(x_{i},x_{j}) \right]_{ij}</math>, <math> l_{x}=\left[ l(x_{1,1},x),...,l(x_{k,n_{k}},x) \right]^{'}</math><br />
and <math> l_{y}=\left[ l(x_{1,1},y),...,l(x_{k,n_{k}},y) \right]^{'}</math><br />
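As a sanity check, with the linear kernel <math>l(x,y)=x^{'}y</math> (so <math>L=X^{'}X</math>) the kernelized inner product should reduce to <math>x^{'}\hat{C}^{-1}y</math>; the toy sketch below verifies this (all variable names are mine).<br />

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [3, 2, 4]
n, p = sum(sizes), 5
X = rng.standard_normal((p, n))      # columns are the patterns

# H as in the derivation above.
H = np.zeros((n, n))
start = 0
for n_i in sizes:
    blk = slice(start, start + n_i)
    H[blk, blk] = np.eye(n_i) - np.ones((n_i, n_i)) / n_i
    start += n_i

eps = 0.1
L = X.T @ X                          # linear kernel: L_ij = x_i' x_j

def l_tilde(x, y):
    """Kernelized RCA inner product, specialized to the linear kernel."""
    lx, ly = X.T @ x, X.T @ y
    M = np.linalg.inv(np.eye(n) + L @ H / (n * eps))
    return x @ y / eps - lx @ (H @ M @ ly) / (n * eps ** 2)

x, y = rng.standard_normal(p), rng.standard_normal(p)
C_hat = X @ H @ X.T / n + eps * np.eye(p)
assert np.isclose(l_tilde(x, y), x @ np.linalg.inv(C_hat) @ y)
```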
<br />
== References ==<br />
<references/></div>Myakhavehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=relevant_Component_Analysis&diff=3631relevant Component Analysis2009-07-29T00:49:20Z<p>Myakhave: /* First paper: Shental et al., 2002 N. Shental, T. Hertz, D. Weinshall, and M. Pavel, "Adjustment Learning and Relevant Component Analysis," Proc. European Conference on Computer Vision (ECCV), 2002, pp. 776-790. */</p>
<hr />
<div>== First paper: Shental ''et al.'', 2002 <ref>N. Shental, T. Hertz, D. Weinshall, and M. Pavel, "Adjustment Learning and Relevant Component Analysis," Proc. European Conference on Computer Vision (ECCV), 2002, pp. 776-790.</ref> ==<br />
<br />
Irrelevant data variability often causes difficulties in classification and clustering tasks. For example, when data variability is dominated by environment conditions, such as global illumination, nearest-neighbour classification in the original feature space may be very unreliable. The goal of Relevant Component Analysis (RCA) is to find a transformation that amplifies relevant variability and suppresses irrelevant variability.<br />
<br />
:: ''Definition of irrelevant variability:'' We say that data variability is correlated with a specific task "if the removal of this variability from the data deteriorates (on average) the results of clustering or retrieval" [1]. Variability is irrelevant if it is "maintained in the data" but "not correlated with the specific task" [1].<br />
<br />
To achieve this goal, Shental ''et al.'' introduced the idea of ''chunklets'' – "small sets of data points, in which the class label is constant, but unknown" [1]. As we will see, chunklets allow irrelevant variability to be suppressed without needing fully labelled training data. Since the data come unlabelled, the chunklets "must be defined naturally by the data": for example, in speaker identification, "short utterances of speech are likely to come from a single speaker" [1]. The authors coin the term ''adjustment learning'' to describe learning using chunklets; adjustment learning can be viewed as falling somewhere between unsupervised learning and supervised learning.<br />
<br />
Relevant Component Analysis tries to find a linear transformation W of the feature space such that the effect of irrelevant variability is reduced in the transformed space. That is, we wish to rescale the feature space and reduce the weights of irrelevant directions. The main premise of RCA is that we can reduce irrelevant variability by reducing the within-class variability. Intuitively, a direction which exhibits high variability among samples of the same class is unlikely to be useful for classification or clustering. <br />
<br />
RCA assumes that the class covariances are all equal. If we allow this assumption, it makes sense to rescale the feature space using a whitening transformation based on the common class covariance Σ. This gives the familiar transformation W = VΛ<sup>-1/2</sup>, where V and Λ can be found by the singular value decomposition of Σ.<br />
<br />
With labelled data, estimating Σ is straightforward; in RCA, however, labelled data is not available, so an approximation is calculated using chunklets. The ''chunklet scatter matrix'' is calculated by<br />
<br />
:: <math>S_{ch} = \frac{1}{|\Omega|}\sum_{n=1}^N|H_n|Cov(H_n)</math><br />
<br />
where |Ω| is the size of the data set, H<sub>n</sub> is the nth chunklet, |H<sub>n</sub>| is the size of the nth chunklet, and N is the number of chunklets.<br />
<br />
Intuitively, this is a weighted average of the chunklet covariances, with weight proportional to the size of the chunklet.<br />
<br />
The steps of the RCA algorithm are as follows:<br />
<br />
:: "1. Calculate S<sub>ch</sub>... Let r denote its effective rank (the number of singular values of S<sub>ch</sub> which are significantly larger than 0).<br />
:: 2. Compute the total covariance (scatter) matrix of the original data S<sub>T</sub>, and project the data using PCA to its r largest dimensions.<br />
:: 3. Project S<sub>ch</sub> onto the reduced dimensional space, and compute the corresponding whitening transformation W.<br />
:: 4. Apply W to the original data (in the reduced space)." [1]<br /><br />
Those directions in which the data variability is due to within-class variability are irrelevant for the task of classification; the computed W assigns lower weight to these directions.<br />
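The four steps can be sketched in a few lines of NumPy (a minimal illustration, not the authors' implementation; the rank threshold is an arbitrary choice):<br />

```python
import numpy as np

def rca(X, chunklets):
    """Sketch of the four RCA steps. X is (n_samples, n_features);
    chunklets is a list of index arrays, each indexing one chunklet."""
    n = len(X)
    # Step 1: chunklet scatter matrix S_ch and its effective rank r.
    S_ch = sum(len(ix) * np.cov(X[ix].T, bias=True) for ix in chunklets) / n
    s = np.linalg.svd(S_ch, compute_uv=False)
    r = int(np.sum(s > 1e-8 * s[0]))
    # Step 2: PCA of the total scatter: project onto the r largest directions.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    A = Vt[:r]
    # Step 3: project S_ch and compute the whitening transform W = V Lam^-1/2.
    V, lam, _ = np.linalg.svd(A @ S_ch @ A.T)
    W = V @ np.diag(lam ** -0.5)
    # Step 4: apply W to the data in the reduced space.
    return (X @ A.T) @ W

rng = np.random.default_rng(3)
X = rng.standard_normal((30, 4))
chunklets = [np.arange(0, 5), np.arange(5, 9), np.arange(9, 15)]
Y = rca(X, chunklets)

# After RCA, the chunklet scatter of the transformed data is the identity.
S_out = sum(len(ix) * np.cov(Y[ix].T, bias=True) for ix in chunklets) / len(Y)
assert np.allclose(S_out, np.eye(Y.shape[1]))
```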
<br />
'''Experimental Results: Face Recognition'''<br />
<br />
The authors demonstrated the performance of RCA for the task of face recognition using the Yale A database. The database contains 155 face images of 15 people; lighting conditions and facial expression are varied across images. RCA is compared with the Eigenface method (based on PCA) and the Fisherface method (based on Fisher’s Linear Discriminant) for both nearest neighbour classification and clustering-based classification. In this dataset, the data is not naturally divided into chunklets, so the authors randomly sample chunklets given the ground-truth class (for example, if an individual is represented in 10 images, two chunklets may be formed by randomly partitioning the images into two groups of 5 images).<br />
<br />
For nearest neighbour classification, RCA outperforms Eigenface but does slightly worse than Fisherface. For clustering, RCA performs better than Eigenface and comparably to Fisherface. The authors pointed out that these experimental results are encouraging as Fisherface is a supervised method.<br />
<br />
In <ref> M. Sorci, G. Antonini, and Jean-Philippe Thiran, "Fisher's discriminant and relevant component analysis for static facial expression classification."</ref>, it is shown that, in the context of a facial expression recognition framework, RCA combined with FLD yields a better classifier than RCA alone, with results comparable to an SVM.<br />
<br />
'''Experimental Results: Surveillance'''<br />
<br />
In a second experiment, the authors used surveillance video footage divided into discrete clips in which a single person is featured. The same person can appear in multiple clips, and the task was to retrieve all clips in which a query person appears. A colour histogram is used to represent a person. Sources of irrelevant variation include reflections, occlusions, and illumination. In this experiment, the data does come naturally in chunklets: each clip features a single person, so frames in the same clip form a chunklet. Figure 7 in the paper shows the results of k-nearest neighbour classification (not reproduced here for copyright reasons).<br />
<br />
== Second Paper: Bar-Hillel ''et al.'', 2003 <ref> A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall, "Learning Distance Functions using Equivalence Relations," Proc. International Conference on Machine Learning (ICML), 2003, pp. 11-18. </ref> ==<br />
<br />
In a subsequent work [2], Bar-Hillel ''et al.'' described how RCA can be shown to optimize an information theoretic criterion, and compared the performance of RCA with the approach proposed by Xing ''et al.'' [3].<br />
<br />
'''Information Maximization'''<br />
<br />
According to information theory, "when an input X is transformed into a new representation Y, we should seek to maximize the mutual information I(X, Y) between X and Y under suitable constraints" [2]. In adjustment learning, we can take the objective to be keeping chunklet points close to each other in the transformed space. More formally:<br />
<br />
::<math>\max_{f \in F}I(X,Y) \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||y_{ji} - m_j^y||^2 \le K</math><br />
<br />
where f is a transformation function, m<sub>j</sub><sup>y</sup> is the mean of chunklet j in the transformed space, p is the total number of chunklet points, and K is a constant.<br />
<br />
To maximize I(X,Y), we can simply maximize the entropy of Y, H(Y). This is because I(X,Y) = H(Y) – H(Y|X), and H(Y|X) is constant since the transformation is deterministic. Intuitively, since the transformation is deterministic there is no uncertainty in Y if X is known. <br />
<br />
Now we would like to express H(Y) in terms of H(X). If the transformation is invertible, we have p<sub>y</sub>(y) = p<sub>x</sub>(x) / |J(x)|, where J(x) is the Jacobian of the transformation. Therefore,<br />
<br />
::<math><br />
\begin{align}<br />
H(Y) & = -\int_y p(y)\log p(y)\, dy \\<br />
& = -\int_x p(x) \log \frac{p(x)}{|J(x)|} \, dx \\<br />
& = H(X) + \langle \log |J(x)| \rangle_x<br />
\end{align}<br />
</math><br />
<br />
Assuming a linear transformation Y = AX, the Jacobian is simply equal to the constant |A|. So to maximize I(X,Y), we can maximize H(Y), and maximizing H(Y) amounts to maximizing |A|. Hence, the optimization objective can be updated as<br />
<br />
::<math>\max_A |A| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_{A^tA} \le K</math><br />
<br />
This can also be expressed in terms of the Mahalanobis distance matrix B = A<sup>t</sup>A as follows, noting that log |A| = (1/2) log |B|.<br />
<br />
::<math>\max_B |B| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \le K , \quad B > 0</math><br />
<br />
The solution to this problem is <math>B = \tfrac{K}{N} \hat{C}^{-1}</math>, where <math>\hat{C}</math> is the chunklet scatter matrix calculated in Step 1 of RCA. Thus, RCA gives the optimal Mahalanobis distance matrix up to a scale factor.<br />
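This optimality claim can be probed numerically: scaling <math>\hat{C}^{-1}</math> so that the constraint holds with equality yields a larger determinant than other feasible choices. The toy two-chunklet setup below is hypothetical, and all names are mine:<br />

```python
import numpy as np

rng = np.random.default_rng(4)
p, d, K = 40, 3, 1.0                  # toy: 40 chunklet points in R^3

# Two chunklets of 20 points each, with correlated features.
X = rng.standard_normal((p, d)) @ rng.standard_normal((d, d))
means = np.repeat([X[:20].mean(0), X[20:].mean(0)], 20, axis=0)
D = X - means
C_hat = D.T @ D / p                   # chunklet scatter matrix

def avg_dist(B):                      # (1/p) sum_{j,i} ||x_ji - m_j||_B^2
    return np.einsum('ij,jk,ik->', D, B, D) / p

# Candidate from the text: B proportional to C_hat^{-1}, scaled so the
# constraint holds with equality.
B_star = np.linalg.inv(C_hat)
B_star *= K / avg_dist(B_star)

# Another feasible B (here a scaled identity) has a smaller determinant.
B_other = np.eye(d) * (K / avg_dist(np.eye(d)))
assert np.isclose(avg_dist(B_star), K)
assert np.linalg.det(B_star) > np.linalg.det(B_other)
```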
<br />
<br />
'''Within-Chunklet Distance Minimization'''<br />
<br />
In addition, RCA minimizes the sum of within-chunklet squared distances. If we consider the optimization problem<br />
<br />
::<math>\min_B \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \quad s.t. \quad |B| \ge 1</math> <br />
<br />
then it can be shown that RCA once again gives the optimal Mahalanobis distance matrix up to a scale factor. This property suggests a natural comparison with Xing ''et al.''’s method, which similarly learns a distance metric based on similarity side information. Xing ''et al.''’s method assumes side information in the form of pairwise similarities and dissimilarities, and seeks to optimize<br />
<br />
::<math>\min_B \sum_{(x_1,x_2) \in S} ||x_1 - x_2||^2_B \quad s.t. \sum_{(x_1,x_2) \in D} ||x_1 - x_2||_B \ge 1 , \quad B \ge 0 </math><br />
<br />
where S contains similar pairs and D contains dissimilar pairs. Compared with the preceding optimization problem, if all chunklets have size 2 (i.e. the chunklets are just pairwise similarities), the objective function is the same up to a scale factor.<br />
<br />
The authors compared the clustering performance of RCA with Xing ''et al.''’s method <ref> E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information", Advances in Neural Information Processing Systems, 2002. </ref> using six of the UC Irvine datasets. Clustering performance was measured using a normalized accuracy score defined as<br />
<br />
::<math>\sum_{i > j}\frac{1 \lbrace 1 \lbrace c_i = c_j \rbrace = 1 \lbrace \hat{c}_i = \hat{c}_j \rbrace \rbrace}{0.5m(m-1)}</math><br />
<br />
where 1{ } is the indicator function, <math>\hat{c}</math> is the assigned cluster, and c is the true cluster. The score may be interpreted as the probability of correctly assigning two randomly drawn points.<br />
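The score is straightforward to compute; a small sketch (the function name is mine):<br />

```python
import numpy as np

def pairwise_accuracy(c_true, c_pred):
    """Probability that two randomly drawn points are either correctly
    placed in the same cluster or correctly placed in different clusters."""
    c_true, c_pred = np.asarray(c_true), np.asarray(c_pred)
    m = len(c_true)
    same_true = c_true[:, None] == c_true[None, :]
    same_pred = c_pred[:, None] == c_pred[None, :]
    agree = same_true == same_pred
    iu = np.triu_indices(m, k=1)      # each pair counted once
    return agree[iu].mean()           # denominator is 0.5 * m * (m - 1)

# A perfect clustering scores 1 even if the cluster labels are permuted.
assert pairwise_accuracy([0, 0, 1, 1], [1, 1, 0, 0]) == 1.0
assert pairwise_accuracy([0, 0, 1, 1], [0, 1, 0, 1]) < 1.0
```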
<br />
Overall, RCA yielded an improvement over regular K-means and showed comparable performance to Xing ''et al.''’s method; however, RCA is more computationally efficient, as it works with closed-form expressions while Xing ''et al.''’s method requires iterative gradient descent.<br />
<br />
== References ==<br />
<references/></div>Myakhavehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=relevant_Component_Analysis&diff=3630relevant Component Analysis2009-07-29T00:43:45Z<p>Myakhave: /* First paper: Shental et al., 2002 N. Shental, T. Hertz, D. Weinshall, and M. Pavel, "Adjustment Learning and Relevant Component Analysis," Proc. European Conference on Computer Vision (ECCV), 2002, pp. 776-790. */</p>
<hr />
<div>== First paper: Shental ''et al.'', 2002 <ref>N. Shental, T. Hertz, D. Weinshall, and M. Pavel, "Adjustment Learning and Relevant Component Analysis," Proc. European Conference on Computer Vision (ECCV), 2002, pp. 776-790.</ref> ==<br />
<br />
Irrelevant data variability often causes difficulties in classification and clustering tasks. For example, when data variability is dominated by environment conditions, such as global illumination, nearest-neighbour classification in the original feature space may be very unreliable. The goal of Relevant Component Analysis (RCA) is to find a transformation that amplifies relevant variability and suppresses irrelevant variability.<br />
<br />
:: ''Definition of irrelevant variability:'' We say that data variability is correlated with a specific task "if the removal of this variability from the data deteriorates (on average) the results of clustering or retrieval" [1]. Variability is irrelevant if it is "maintained in the data" but "not correlated with the specific task" [1].<br />
<br />
To achieve this goal, Shental ''et al.'' introduced the idea of ''chunklets'' – "small sets of data points, in which the class label is constant, but unknown" [1]. As we will see, chunklets allow irrelevant variability to be suppressed without needing fully labelled training data. Since the data come unlabelled, the chunklets "must be defined naturally by the data": for example, in speaker identification, "short utterances of speech are likely to come from a single speaker" [1]. The authors coin the term ''adjustment learning'' to describe learning using chunklets; adjustment learning can be viewed as falling somewhere between unsupervised learning and supervised learning.<br />
<br />
Relevant Component Analysis tries to find a linear transformation W of the feature space such that the effect of irrelevant variability is reduced in the transformed space. That is, we wish to rescale the feature space and reduce the weights of irrelevant directions. The main premise of RCA is that we can reduce irrelevant variability by reducing the within-class variability. Intuitively, a direction which exhibits high variability among samples of the same class is unlikely to be useful for classification or clustering. <br />
<br />
RCA assumes that the class covariances are all equal. If we allow this assumption, it makes sense to rescale the feature space using a whitening transformation based on the common class covariance Σ. This gives the familiar transformation W = VΛ<sup>-1/2</sup>, where V and Λ can be found by the singular value decomposition of Σ.<br />
<br />
With labelled data, estimating Σ is straightforward; in RCA, however, labelled data is not available, so an approximation is calculated using chunklets. The ''chunklet scatter matrix'' is calculated by<br />
<br />
:: <math>S_{ch} = \frac{1}{|\Omega|}\sum_{n=1}^N|H_n|Cov(H_n)</math><br />
<br />
where |Ω| is the size of the data set, H<sub>n</sub> is the nth chunklet, |H<sub>n</sub>| is the size of the nth chunklet, and N is the number of chunklets.<br />
<br />
Intuitively, this is a weighted average of the chunklet covariances, with weight proportional to the size of the chunklet.<br />
<br />
The steps of the RCA algorithm are as follows:<br />
<br />
:: "1. Calculate S<sub>ch</sub>... Let r denote its effective rank (the number of singular values of S<sub>ch</sub> which are significantly larger than 0).<br />
:: 2. Compute the total covariance (scatter) matrix of the original data S<sub>T</sub>, and project the data using PCA to its r largest dimensions.<br />
:: 3. Project S<sub>ch</sub> onto the reduced dimensional space, and compute the corresponding whitening transformation W.<br />
:: 4. Apply W to the original data (in the reduced space)." [1]<br /><br />
"In effect, the whitening transformation W assigns lower weight to some directions in the<br /><br />
original feature space; those are the directions in which the data ~_ri-ability is mainly due<br /><br />
to within class variability, and is therefore "irrelevant" for the task of classification." [3]<br />
<br />
In effect, the whitening transformation W assignslower weight to some directions in the original featurespace; those are the directions in which the data ~_ri-ability is mainly due to within class variability, and is therefore "irrelevant" for the task of classification.<br />
<br />
'''Experimental Results: Face Recognition'''<br />
<br />
The authors demonstrated the performance of RCA for the task of face recognition using the Yale A database. The database contains 155 face images of 15 people; lighting conditions and facial expression are varied across images. RCA is compared with the Eigenface method (based on PCA) and the Fisherface method (based on Fisher’s Linear Discriminant) for both nearest neighbour classification and clustering-based classification. In this dataset, the data is not naturally divided into chunklets, so the authors randomly sample chunklets given the ground-truth class (for example, if an individual is represented in 10 images, two chunklets may be formed by randomly partitioning the images into two groups of 5 images).<br />
<br />
For nearest neighbour classification, RCA outperforms Eigenface but does slightly worse than Fisherface. For clustering, RCA performs better than Eigenface and comparably to Fisherface. The authors pointed out that these experimental results are encouraging as Fisherface is a supervised method.<br />
<br />
In <ref> M. Sorci, G. Antonini, and Jean-Philippe Thiran, "Fisher's discriminant and relevant component analysis for static facial expression classification."</ref>, it is shown that, in the context of a facial expression recognition framework, RCA combined with FLD yields a better classifier than RCA alone, with results comparable to an SVM.<br />
<br />
'''Experimental Results: Surveillance'''<br />
<br />
In a second experiment, the authors used surveillance video footage divided into discrete clips in which a single person is featured. The same person can appear in multiple clips, and the task was to retrieve all clips in which a query person appears. A colour histogram is used to represent a person. Sources of irrelevant variation include reflections, occlusions, and illumination. In this experiment, the data does come naturally in chunklets: each clip features a single person, so frames in the same clip form a chunklet. Figure 7 in the paper shows the results of k-nearest neighbour classification (not reproduced here for copyright reasons).<br />
<br />
== Second Paper: Bar-Hillel ''et al.'', 2003 <ref> A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall, "Learning Distance Functions using Equivalence Relations," Proc. International Conference on Machine Learning (ICML), 2003, pp. 11-18. </ref> ==<br />
<br />
In a subsequent work [2], Bar-Hillel ''et al.'' described how RCA can be shown to optimize an information theoretic criterion, and compared the performance of RCA with the approach proposed by Xing ''et al.'' [3].<br />
<br />
'''Information Maximization'''<br />
<br />
According to information theory, "when an input X is transformed into a new representation Y, we should seek to maximize the mutual information I(X, Y) between X and Y under suitable constraints" [2]. In adjustment learning, we can take the objective to be keeping chunklet points close to each other in the transformed space. More formally:<br />
<br />
::<math>\max_{f \in F}I(X,Y) \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||y_{ji} - m_j^y||^2 \le K</math><br />
<br />
where f is a transformation function, m<sub>j</sub><sup>y</sup> is the mean of chunklet j in the transformed space, p is the total number of chunklet points, and K is a constant.<br />
<br />
To maximize I(X,Y), we can simply maximize the entropy of Y, H(Y). This is because I(X,Y) = H(Y) – H(Y|X), and H(Y|X) is constant since the transformation is deterministic. Intuitively, since the transformation is deterministic there is no uncertainty in Y if X is known. <br />
<br />
== References ==<br />
<references/></div>Myakhavehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=relevant_Component_Analysis&diff=3629relevant Component Analysis2009-07-29T00:43:14Z<p>Myakhave: /* First paper: Shental et al., 2002 N. Shental, T. Hertz, D. Weinshall, and M. Pavel, "Adjustment Learning and Relevant Component Analysis," Proc. European Conference on Computer Vision (ECCV), 2002, pp. 776-790. */</p>
<hr />
<div>== First paper: Shental ''et al.'', 2002 <ref>N. Shental, T. Hertz, D. Weinshall, and M. Pavel, "Adjustment Learning and Relevant Component Analysis," Proc. European Conference on Computer Vision (ECCV), 2002, pp. 776-790.</ref> ==<br />
<br />
Irrelevant data variability often causes difficulties in classification and clustering tasks. For example, when data variability is dominated by environment conditions, such as global illumination, nearest-neighbour classification in the original feature space may be very unreliable. The goal of Relevant Component Analysis (RCA) is to find a transformation that amplifies relevant variability and suppresses irrelevant variability.<br />
<br />
:: ''Definition of irrelevant variability:'' We say that data variability is correlated with a specific task "if the removal of this variability from the data deteriorates (on average) the results of clustering or retrieval" [1]. Variability is irrelevant if it is "maintained in the data" but "not correlated with the specific task" [1].<br />
<br />
To achieve this goal, Shental ''et al.'' introduced the idea of ''chunklets'' – "small sets of data points, in which the class label is constant, but unknown" [1]. As we will see, chunklets allow irrelevant variability to be suppressed without needing fully labelled training data. Since the data come unlabelled, the chunklets "must be defined naturally by the data": for example, in speaker identification, "short utterances of speech are likely to come from a single speaker" [1]. The authors coin the term ''adjustment learning'' to describe learning using chunklets; adjustment learning can be viewed as falling somewhere between unsupervised learning and supervised learning.<br />
<br />
Relevant Component Analysis tries to find a linear transformation W of the feature space such that the effect of irrelevant variability is reduced in the transformed space. That is, we wish to rescale the feature space and reduce the weights of irrelevant directions. The main premise of RCA is that we can reduce irrelevant variability by reducing the within-class variability. Intuitively, a direction which exhibits high variability among samples of the same class is unlikely to be useful for classification or clustering. <br />
<br />
RCA assumes that the class covariances are all equal. If we allow this assumption, it makes sense to rescale the feature space using a whitening transformation based on the common class covariance Σ. This gives the familiar transformation W = VΛ<sup>-1/2</sup>, where V and Λ can be found by the singular value decomposition of Σ.<br />
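As a small numerical sketch of this step (with an invented covariance matrix), the whitening transformation can be computed and verified in NumPy; the symmetric form VΛ<sup>-1/2</sup>V<sup>T</sup> is used here, which whitens the same space up to a rotation:<br />

```python
import numpy as np

# an invented common class covariance (symmetric positive definite)
Sigma = np.array([[4.0, 1.0],
                  [1.0, 2.0]])

# for an SPD matrix the SVD coincides with the eigendecomposition
V, lam, _ = np.linalg.svd(Sigma)
W = V @ np.diag(lam ** -0.5) @ V.T   # symmetric whitening transform

# after the transformation, the class covariance becomes the identity
print(np.allclose(W @ Sigma @ W.T, np.eye(2)))  # → True
```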
<br />
With labelled data, estimating Σ is straightforward, but in RCA labelled data is not available and an approximation is calculated using chunklets. The ''chunklet scatter matrix'' is calculated by<br />
<br />
:: <math>S_{ch} = \frac{1}{|\Omega|}\sum_{n=1}^N|H_n|Cov(H_n)</math><br />
<br />
where |Ω| is the size of the data set, H<sub>n</sub> is the nth chunklet, |H<sub>n</sub>| is the size of the nth chunklet, and N is the number of chunklets.<br />
<br />
Intuitively, this is a weighted average of the chunklet covariances, with weight proportional to the size of the chunklet.<br />
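Read as a weighted average, the chunklet scatter matrix can be computed directly from its definition; a minimal NumPy sketch with invented chunklets, taking |Ω| to be the total number of chunklet points:<br />

```python
import numpy as np

def chunklet_scatter(chunklets):
    """S_ch = (1/|Omega|) * sum_n |H_n| * Cov(H_n), a weighted average of
    the maximum-likelihood chunklet covariances."""
    total = sum(len(h) for h in chunklets)     # |Omega|: total chunklet points
    d = chunklets[0].shape[1]
    s_ch = np.zeros((d, d))
    for h in chunklets:
        centered = h - h.mean(axis=0)          # subtract the chunklet mean
        s_ch += centered.T @ centered          # = |H_n| * Cov(H_n)
    return s_ch / total

# two invented chunklets in R^2
h1 = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
h2 = np.array([[0.0, 1.0], [0.0, 3.0]])
print(chunklet_scatter([h1, h2]))
```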
<br />
The steps of the RCA algorithm are as follows:<br />
<br />
:: "1. Calculate S<sub>ch</sub>... Let r denote its effective rank (the number of singular values of S<sub>ch</sub> which are significantly larger than 0).<br />
:: 2. Compute the total covariance (scatter) matrix of the original data S<sub>T</sub>, and project the data using PCA to its r largest dimensions.<br />
:: 3. Project S<sub>ch</sub> onto the reduced dimensional space, and compute the corresponding whitening transformation W.<br />
:: 4. Apply W to the original data (in the reduced space)." [1]<br /><br />
In effect, the whitening transformation W assigns lower weight to some directions in the original feature space; those are the directions in which the data variability is mainly due to within-class variability, and is therefore "irrelevant" for the task of classification [1].<br />
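The four steps can be sketched end-to-end as follows (an illustrative NumPy sketch, not the authors' code; the effective-rank tolerance is an arbitrary choice, and the projected S<sub>ch</sub> is assumed nonsingular):<br />

```python
import numpy as np

def rca(X, chunklet_labels, rank_tol=1e-10):
    """X: n x d data; chunklet_labels[i] = chunklet id of row i (-1 = unassigned)."""
    d = X.shape[1]
    # Step 1: chunklet scatter matrix S_ch and its effective rank r
    ids = [c for c in np.unique(chunklet_labels) if c >= 0]
    s_ch = np.zeros((d, d))
    n_pts = 0
    for c in ids:
        h = X[chunklet_labels == c]
        centered = h - h.mean(axis=0)
        s_ch += centered.T @ centered          # |H_n| * Cov(H_n)
        n_pts += len(h)
    s_ch /= n_pts
    r = int(np.sum(np.linalg.svd(s_ch, compute_uv=False) > rank_tol))
    # Step 2: PCA of the total scatter; keep the r largest dimensions
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:r]                                  # r x d projection matrix
    # Step 3: project S_ch and compute the whitening transformation W
    s_proj = P @ s_ch @ P.T                     # assumed nonsingular after projection
    lam, V = np.linalg.eigh(s_proj)
    W = V @ np.diag(lam ** -0.5) @ V.T
    # Step 4: apply W to the data in the reduced space
    return (W @ P @ X.T).T
```

After the transformation, the within-chunklet scatter of the returned data is approximately the identity.<br />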
<br />
'''Experimental Results: Face Recognition'''<br />
<br />
The authors demonstrated the performance of RCA for the task of face recognition using the Yale A database. The database contains 155 face images of 15 people; lighting conditions and facial expression are varied across images. RCA is compared with the Eigenface method (based on PCA) and the Fisherface method (based on Fisher’s Linear Discriminant) for both nearest neighbour classification and clustering-based classification. In this dataset, the data is not naturally divided into chunklets, so the authors randomly sample chunklets given the ground-truth class (for example, if an individual is represented in 10 images, two chunklets may be formed by randomly partitioning the images into two groups of 5 images).<br />
<br />
For nearest neighbour classification, RCA outperforms Eigenface but does slightly worse than Fisherface. For clustering, RCA performs better than Eigenface and comparably to Fisherface. The authors pointed out that these experimental results are encouraging as Fisherface is a supervised method.<br />
<br />
In <ref> M. Sorci, G. Antonini, and Jean-Philippe Thiran, "Fisher's discriminant and relevant component analysis for static facial expression classification."</ref>, it is shown that, in the context of a facial expression recognition framework, RCA in combination with FLD yields a better classifier than RCA alone. This combination gives results comparable to SVM.<br />
<br />
'''Experimental Results: Surveillance'''<br />
<br />
In a second experiment, the authors used surveillance video footage divided into discrete clips in which a single person is featured. The same person can appear in multiple clips, and the task was to retrieve all clips in which a query person appears. A colour histogram is used to represent a person. Sources of irrelevant variation include reflections, occlusions, and illumination. In this experiment, the data does come naturally in chunklets: each clip features a single person, so frames in the same clip form a chunklet. Figure 7 in the paper shows the results of k-nearest neighbour classification (not reproduced here for copyright reasons).<br />
<br />
== Second Paper: Bar-Hillel ''et al.'', 2003 <ref> A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall, "Learning Distance Functions using Equivalence Relations," Proc. International Conference on Machine Learning (ICML), 2003, pp. 11-18. </ref> ==<br />
<br />
In a subsequent work [2], Bar-Hillel ''et al.'' described how RCA can be shown to optimize an information theoretic criterion, and compared the performance of RCA with the approach proposed by Xing ''et al.'' [3].<br />
<br />
'''Information Maximization'''<br />
<br />
According to information theory, "when an input X is transformed into a new representation Y, we should seek to maximize the mutual information I(X, Y) between X and Y under suitable constraints" [2]. In adjustment learning, we can think of the objective to be to keep chunklet points close to each other in the transformed space. More formally:<br />
<br />
::<math>\max_{f \in F}I(X,Y) \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||y_{ji} - m_j^y||^2 \le K</math><br />
<br />
where f is a transformation function, m<sub>j</sub><sup>y</sup> is the mean of chunklet j in the transformed space, p is the total number of chunklet points, and K is a constant.<br />
<br />
To maximize I(X,Y), we can simply maximize the entropy of Y, H(Y). This is because I(X,Y) = H(Y) – H(Y|X), and H(Y|X) is constant since the transformation is deterministic. Intuitively, since the transformation is deterministic there is no uncertainty in Y if X is known. <br />
<br />
Now we would like to express H(Y) in terms of H(X). If the transformation is invertible, we have p<sub>y</sub>(y) = p<sub>x</sub>(x) / |J(x)|, where J(x) is the Jacobian of the transformation. Therefore,<br />
<br />
::<math><br />
\begin{align}<br />
H(Y) & = -\int_y p(y)\log p(y)\, dy \\<br />
& = -\int_x p(x) \log \frac{p(x)}{|J(x)|} \, dx \\<br />
& = H(X) + \langle \log |J(x)| \rangle_x<br />
\end{align}<br />
</math><br />
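For a Gaussian X and an invertible linear map, this identity can be checked in closed form, since the differential entropy of N(0, C) is ½ log((2πe)<sup>d</sup>|C|); a small sketch with an invented covariance and map:<br />

```python
import numpy as np

def gaussian_entropy(C):
    """Differential entropy of N(0, C): 0.5 * log((2*pi*e)^d * |C|)."""
    d = C.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(C))

C = np.array([[2.0, 0.5],
              [0.5, 1.0]])                      # invented covariance of X
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])                      # invented invertible map, Y = AX

lhs = gaussian_entropy(A @ C @ A.T)             # H(Y)
rhs = gaussian_entropy(C) + np.log(abs(np.linalg.det(A)))  # H(X) + log|A|
print(np.isclose(lhs, rhs))  # → True
```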
<br />
Assuming a linear transformation Y = AX, the Jacobian is simply equal to the constant |A|. So to maximize I(X,Y), we can maximize H(Y), and maximizing H(Y) amounts to maximizing |A|. Hence, the optimization objective can be updated as<br />
<br />
::<math>\max_A |A| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_{A^tA} \le K</math><br />
<br />
This can also be expressed in terms of the Mahalanobis distance matrix B = A<sup>t</sup>A as follows, noting that log |A| = (1/2) log |B|.<br />
<br />
::<math>\max_B |B| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \le K , \quad B > 0</math><br />
<br />
The solution to this problem is <math>B = \tfrac{K}{N} \hat{C}^{-1}</math>, where <math>\hat{C}</math> is the chunklet scatter matrix calculated in Step 1 of RCA. Thus, RCA gives the optimal Mahalanobis distance matrix up to a scale factor.<br />
<br />
<br />
'''Within-Chunklet Distance Minimization'''<br />
<br />
In addition, RCA minimizes the sum of within-chunklet squared distances. If we consider the optimization problem<br />
<br />
::<math>\min_B \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \quad s.t. \quad |B| \ge 1</math> <br />
<br />
then it can be shown that RCA once again gives the optimal Mahalanobis distance matrix up to a scale factor. This property suggests a natural comparison with Xing ''et al.''’s method, which similarly learns a distance metric based on similarity side information. Xing ''et al.''’s method assumes side information in the form of pairwise similarities and dissimilarities, and seeks to optimize<br />
<br />
::<math>\min_B \sum_{(x_1,x_2) \in S} ||x_1 - x_2||^2_B \quad s.t. \sum_{(x_1,x_2) \in D} ||x_1 - x_2||_B \ge 1 , \quad B \ge 0 </math><br />
<br />
where S contains similar pairs and D contains dissimilar pairs. Comparing to the preceding optimization problem, if all chunklets have size 2 (i.e. the chunklets are just pairwise similarities), the objective function is the same up to a scale factor.<br />
<br />
The authors compared the clustering performance of RCA with Xing ''et al.''’s method <ref> E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information", Advances in Neural Information Processing Systems, 2002. </ref> using six of the UC Irvine datasets. Clustering performance was measured using a normalized accuracy score defined as<br />
<br />
::<math>\sum_{i > j}\frac{1 \lbrace 1 \lbrace c_i = c_j \rbrace = 1 \lbrace \hat{c}_i = \hat{c}_j \rbrace \rbrace}{0.5m(m-1)}</math><br />
<br />
where 1{ } is the indicator function, <math>\hat{c}</math> is the assigned cluster, and c is the true cluster. The score may be interpreted as the probability of correctly assigning two randomly drawn points.<br />
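The score can be computed directly from the two labelings; a short illustrative sketch (the labelings are invented):<br />

```python
import numpy as np

def pairwise_accuracy(c_true, c_pred):
    """Fraction of point pairs on which the two labelings agree about 'same cluster'."""
    c_true, c_pred = np.asarray(c_true), np.asarray(c_pred)
    m = len(c_true)
    same_true = c_true[:, None] == c_true[None, :]
    same_pred = c_pred[:, None] == c_pred[None, :]
    agree = same_true == same_pred
    iu = np.triu_indices(m, k=1)             # pairs i > j, each counted once
    return agree[iu].mean()                  # equals the sum / (0.5 * m * (m-1))

print(pairwise_accuracy([0, 0, 1, 1], [1, 1, 0, 0]))  # → 1.0 (relabeled but identical partition)
```

Because only pair agreements are counted, the score is invariant to permutations of the cluster labels.<br />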
<br />
Overall, RCA yielded an improvement over regular K-means and showed performance comparable to Xing ''et al.''’s method; however, RCA is more computationally efficient, as it works with closed-form expressions while Xing ''et al.''’s method requires iterative gradient descent.<br />
<br />
== Suggestions/Critique ==<br />
<br />
* RCA makes effective use of limited side information in the form of chunklets; however, in most applications the data does not naturally come in chunklets. Indeed, in the face recognition experiments, the authors had to make use of prior information to artificially create chunklets. It may be useful if the authors provided additional examples of applications where data is naturally partitioned into chunklets, to further motivate the applicability of RCA.<br />
<br />
* RCA also assumes equal class covariances, which might limit its performance on many real-world datasets.<br />
<br />
* In the UC Irvine experiments, RCA shows similar performance to Xing ''et al.''’s method, but the authors noted that RCA is more computationally efficient. While they make a sensible logical argument (iterative gradient descent tends to be computationally expensive), providing experimental running times may help support and quantify this claim.<br />
<br />
<br />
====Why Equal Variances for Chunklets====<br />
<br />
In [2] the authors suppose that <math> C_{m} </math> is the random variable describing the distribution of data in class <math> m </math>, and then, assuming equal class variances, they calculate <math> S_{ch} </math> as mentioned above.<br><br />
<br />
Further, suppose that the data in class <math> m </math> depend on another source of variation <math> G </math> besides the class characteristics (<math> G </math> can be global variation or sensor characteristics). The random variable for the <math> m </math>th class is then <math> X=C_{m}+G </math>, where the global impact (<math> G </math>) is the same for all classes, <math> G </math> is independent of <math> C_{m} </math>, and the global variation is larger than the class variation (<math> \Sigma_{m}<\Sigma_{G} </math>). <br><br />
<br />
In this situation the variance of class <math> m </math> is <math> \Sigma_{m}+\Sigma_{G} </math>, which by assumption is dominated by <math> \Sigma_{G} </math>. Since <math> \Sigma_{G} </math> is common to all classes, the class variances are again approximately equal, which justifies the equal-covariance assumption.<br><br />
<br />
== Kernel RCA==<br />
<br />
Although RCA has significant computational and technical advantages, there are situations arising in real problems that RCA fails to deal with, i.e. RCA comes with some restrictions. <br><br />
<br />
(i)- RCA only considers linear transformations and fails for nonlinear transformations (even simple ones)<br><br />
(ii)- since RCA acts in the input space, its number of parameters depends on the dimensionality of the feature vectors<br><br />
(iii)- RCA requires a vectorial representation of the data, which may not be natural for some kinds of data, such as protein sequences.<br><br />
<br />
To overcome these restrictions, Tsang ''et al.'' (2005)<ref> Tsang, I. W. and Colleagues; Kernel Relevant Component Analysis For Distance Metric Learning. International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 2005 </ref> suggested using a kernel in RCA and showed how one can kernelize RCA.<br />
<br />
===Kernelizing RCA===<br />
For <math>k</math> given chunklets, each containing <math>n_{i}</math> patterns <math>\left\{x_{i,1},...,x_{i,n_{i}} \right\}</math>, the covariance matrix of the centered patterns is as follows:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\bar{x}_{i}\right)\left(x_{i,j}-\bar{x}_{i}\right)^{'} </math><br />
<br />
and the associated whitening transform is<br />
<br />
<math>x\stackrel{}{\rightarrow}C^{-\frac{1}{2}}x </math><br />
<br />
Now let <math>X=\left[x_{1,1},x_{1,2},...,x_{1,n_{1}},...,x_{k,1},...,x_{k,n_{k}} \right]</math> be the matrix whose columns are all the patterns; then C can be written as:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\frac{1}{n_{i}}X1_{i}\right)\left(x_{i,j}-\frac{1}{n_{i}}X1_{i}\right)^{'} </math><br />
<br />
where <math>1_{i}</math> is an <math>n \times 1</math> vector such that:<br />
<br />
<math> [1_{i}]_{j}= \left\{\begin{matrix} <br />
1 & \text{pattern } j \in \text{chunklet } i \\ <br />
0 & \text{otherwise} \end{matrix}\right.</math><br />
<br />
and <math>I_{i}=diag\left(1_{i}\right)</math>.<br />
<br />
Using the above notation, C can be simplified to the form <math>C=\frac{1}{n}XHX^{'}</math><br />
<br />
where <math> H=\sum_{i=1}^{k}\left(I_{i}-\frac{1}{n_{i}}1_{i}1_{i}^{'}\right)</math><br />
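The factorization C = (1/n)XHX' can be checked numerically; a quick sketch with invented chunklet sizes and data:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [2, 3]                                   # chunklet sizes n_1, n_2
n = sum(sizes)
X = rng.standard_normal((4, n))                  # columns are the patterns x_{i,j}

# direct computation of C from the centred patterns
C = np.zeros((4, 4))
start = 0
for ni in sizes:
    block = X[:, start:start + ni]
    centred = block - block.mean(axis=1, keepdims=True)
    C += centred @ centred.T
    start += ni
C /= n

# H = sum_i (I_i - (1/n_i) 1_i 1_i')
H = np.zeros((n, n))
start = 0
for ni in sizes:
    one = np.zeros(n)
    one[start:start + ni] = 1.0                  # indicator vector 1_i of chunklet i
    H += np.diag(one) - np.outer(one, one) / ni
    start += ni

print(np.allclose(C, X @ H @ X.T / n))  # → True: C = (1/n) X H X'
```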
<br />
To deal with the issue of non-singularity, for small <math> \epsilon </math> let <math>\hat{C}=C+\epsilon I</math>; then the inverse of <math>\hat{C}</math> is<br />
<br />
<math>\hat{C}^{-1}=\frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}X^{'}</math><br />
<br />
Therefore the inner product of the transformed <math>x</math> and <math>y</math> is<br />
<br />
<math> \left(\hat{C}^{-\frac{1}{2}}x\right)^{'} \left(\hat{C}^{-\frac{1}{2}}y\right)= x^{'} \hat{C}^{-1} y= x^{'} \left( \frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}X^{'} \right) y </math><br />
<br />
Now if RCA operates in a feature space <math> \mathcal{F}</math> with corresponding kernel <math> l </math>, then the inner product between the nonlinear transformations <math> \varphi (x)</math> and <math> \varphi (y)</math> after running RCA in <math> \mathcal{F}</math> is:<br />
<br />
<math> \tilde{l}(x,y)=\frac{1}{\epsilon}l(x,y)-l_{x}^{'} \left( \frac{1}{n \epsilon^{2}}H \left( I+\frac{1}{n \epsilon}LH \right)^{-1} \right) l_{y} </math><br />
<br />
where <math>L=\left[ l(x_{i},x_{j}) \right]_{ij}</math>, <math> l_{x}=\left[ l(x_{1,1},x),...,l(x_{k,n_{k}},x) \right]^{'}</math><br />
and <math> l_{y}=\left[ l(x_{1,1},y),...,l(x_{k,n_{k}},y) \right]^{'}</math><br />
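For a linear kernel, l(x, y) = x'y and L = X'X, so the kernelized expression must agree with computing x'Ĉ<sup>-1</sup>y directly. A quick numerical consistency check with invented data (the sketch takes l<sub>y</sub> as the right-hand factor, consistent with the definitions of l<sub>x</sub> and l<sub>y</sub>):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [3, 2]
n, d, eps = sum(sizes), 4, 0.1
X = rng.standard_normal((d, n))                  # patterns as columns

# chunklet centring matrix H, as defined above
H = np.zeros((n, n))
start = 0
for ni in sizes:
    one = np.zeros(n)
    one[start:start + ni] = 1.0
    H += np.diag(one) - np.outer(one, one) / ni
    start += ni

L = X.T @ X                                      # linear-kernel matrix l(x_i, x_j)
x, y = rng.standard_normal(d), rng.standard_normal(d)
lx, ly = X.T @ x, X.T @ y                        # l_x and l_y

# kernelized inner product, using kernel evaluations only
M = H @ np.linalg.inv(np.eye(n) + L @ H / (n * eps)) / (n * eps ** 2)
l_tilde = x @ y / eps - lx @ M @ ly

# direct computation with C_hat = (1/n) X H X' + eps I
C_hat = X @ H @ X.T / n + eps * np.eye(d)
print(np.isclose(l_tilde, x @ np.linalg.inv(C_hat) @ y))  # → True
```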
<br />
== References ==<br />
<references/></div>Myakhavehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=relevant_Component_Analysis&diff=3628relevant Component Analysis2009-07-29T00:42:38Z<p>Myakhave: /* First paper: Shental et al., 2002 N. Shental, T. Hertz, D. Weinshall, and M. Pavel, "Adjustment Learning and Relevant Component Analysis," Proc. European Conference on Computer Vision (ECCV), 2002, pp. 776-790. */</p>
<hr />
<div>== First paper: Shental ''et al.'', 2002 <ref>N. Shental, T. Hertz, D. Weinshall, and M. Pavel, "Adjustment Learning and Relevant Component Analysis," Proc. European Conference on Computer Vision (ECCV), 2002, pp. 776-790.</ref> ==<br />
<br />
Irrelevant data variability often causes difficulties in classification and clustering tasks. For example, when data variability is dominated by environment conditions, such as global illumination, nearest-neighbour classification in the original feature space may be very unreliable. The goal of Relevant Component Analysis (RCA) is to find a transformation that amplifies relevant variability and suppresses irrelevant variability.<br />
<br />
:: ''Definition of irrelevant variability:'' We say that data variability is correlated with a specific task "if the removal of this variability from the data deteriorates (on average) the results of clustering or retrieval" [1]. Variability is irrelevant if it is "maintained in the data" but "not correlated with the specific task" [1].<br />
<br />
To achieve this goal, Shental ''et al.'' introduced the idea of ''chunklets'' – "small sets of data points, in which the class label is constant, but unknown" [1]. As we will see, chunklets allow irrelevant variability to be suppressed without needing fully labelled training data. Since the data come unlabelled, the chunklets "must be defined naturally by the data": for example, in speaker identification, "short utterances of speech are likely to come from a single speaker" [1]. The authors coin the term ''adjustment learning'' to describe learning using chunklets; adjustment learning can be viewed as falling somewhere between unsupervised learning and supervised learning.<br />
<br />
Relevant Component Analysis tries to find a linear transformation W of the feature space such that the effect of irrelevant variability is reduced in the transformed space. That is, we wish to rescale the feature space and reduce the weights of irrelevant directions. The main premise of RCA is that we can reduce irrelevant variability by reducing the within-class variability. Intuitively, a direction which exhibits high variability among samples of the same class is unlikely to be useful for classification or clustering. <br />
<br />
RCA assumes that the class covariances are all equal. If we allow this assumption, it makes sense to rescale the feature space using a whitening transformation based on the common class covariance Σ. This gives the familiar transformation W = VΛ<sup>-1/2</sup>, where V and Λ can be found by the singular value decomposition of Σ.<br />
<br />
With labelled data, estimating Σ is straightforward, but in RCA labelled data is not available and an approximation is calculated using chunklets. The ''chunklet scatter matrix'' is calculated by<br />
<br />
:: <math>S_{ch} = \frac{1}{|\Omega|}\sum_{n=1}^N|H_n|Cov(H_n)</math><br />
<br />
where |Ω| is the size of the data set, H<sub>n</sub> is the nth chunklet, |H<sub>n</sub>| is the size of the nth chunklet, and N is the number of chunklets.<br />
<br />
Intuitively, this is a weighted average of the chunklet covariances, with weight proportional to the size of the chunklet.<br />
<br />
The steps of the RCA algorithm are as follows:<br />
<br />
:: "1. Calculate S<sub>ch</sub>... Let r denote its effective rank (the number of singular values of S<sub>ch</sub> which are significantly larger than 0).<br />
:: 2. Compute the total covariance (scatter) matrix of the original data S<sub>T</sub>, and project the data using PCA to its r largest dimensions.<br />
:: 3. Project S<sub>ch</sub> onto the reduced dimensional space, and compute the corresponding whitening transformation W.<br />
:: 4. Apply W to the original data (in the reduced space)." [1]<br /><br />
In effect, the whitening transformation W assigns lower weight to some directions in the original feature space; those are the directions in which the data variability is mainly due to within-class variability, and is therefore "irrelevant" for the task of classification [1].<br />
<br />
'''Experimental Results: Face Recognition'''<br />
<br />
The authors demonstrated the performance of RCA for the task of face recognition using the Yale A database. The database contains 155 face images of 15 people; lighting conditions and facial expression are varied across images. RCA is compared with the Eigenface method (based on PCA) and the Fisherface method (based on Fisher’s Linear Discriminant) for both nearest neighbour classification and clustering-based classification. In this dataset, the data is not naturally divided into chunklets, so the authors randomly sample chunklets given the ground-truth class (for example, if an individual is represented in 10 images, two chunklets may be formed by randomly partitioning the images into two groups of 5 images).<br />
<br />
For nearest neighbour classification, RCA outperforms Eigenface but does slightly worse than Fisherface. For clustering, RCA performs better than Eigenface and comparably to Fisherface. The authors pointed out that these experimental results are encouraging as Fisherface is a supervised method.<br />
<br />
In <ref> M. Sorci, G. Antonini, and Jean-Philippe Thiran, "Fisher's discriminant and relevant component analysis for static facial expression classification."</ref>, it is shown that, in the context of a facial expression recognition framework, RCA in combination with FLD yields a better classifier than RCA alone. This combination gives results comparable to SVM.<br />
<br />
'''Experimental Results: Surveillance'''<br />
<br />
In a second experiment, the authors used surveillance video footage divided into discrete clips in which a single person is featured. The same person can appear in multiple clips, and the task was to retrieve all clips in which a query person appears. A colour histogram is used to represent a person. Sources of irrelevant variation include reflections, occlusions, and illumination. In this experiment, the data does come naturally in chunklets: each clip features a single person, so frames in the same clip form a chunklet. Figure 7 in the paper shows the results of k-nearest neighbour classification (not reproduced here for copyright reasons).<br />
<br />
== Second Paper: Bar-Hillel ''et al.'', 2003 <ref> A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall, "Learning Distance Functions using Equivalence Relations," Proc. International Conference on Machine Learning (ICML), 2003, pp. 11-18. </ref> ==<br />
<br />
In a subsequent work [2], Bar-Hillel ''et al.'' described how RCA can be shown to optimize an information theoretic criterion, and compared the performance of RCA with the approach proposed by Xing ''et al.'' [3].<br />
<br />
'''Information Maximization'''<br />
<br />
According to information theory, "when an input X is transformed into a new representation Y, we should seek to maximize the mutual information I(X, Y) between X and Y under suitable constraints" [2]. In adjustment learning, we can take the objective to be keeping chunklet points close to each other in the transformed space. More formally:<br />
<br />
::<math>\max_{f \in F}I(X,Y) \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||y_{ji} - m_j^y||^2 \le K</math><br />
<br />
where f is a transformation function, m<sub>j</sub><sup>y</sup> is the mean of chunklet j in the transformed space, p is the total number of chunklet points, and K is a constant.<br />
<br />
To maximize I(X,Y), we can simply maximize the entropy of Y, H(Y). This is because I(X,Y) = H(Y) – H(Y|X), and H(Y|X) is constant since the transformation is deterministic. Intuitively, since the transformation is deterministic there is no uncertainty in Y if X is known. <br />
<br />
Now we would like to express H(Y) in terms of H(X). If the transformation is invertible, we have p<sub>y</sub>(y) = p<sub>x</sub>(x) / |J(x)|, where J(x) is the Jacobian of the transformation. Therefore,<br />
<br />
::<math><br />
\begin{align}<br />
H(Y) & = -\int_y p(y)\log p(y)\, dy \\<br />
& = -\int_x p(x) \log \frac{p(x)}{|J(x)|} \, dx \\<br />
& = H(X) + \langle \log |J(x)| \rangle_x<br />
\end{align}<br />
</math><br />
<br />
Assuming a linear transformation Y = AX, the Jacobian is constant, with |J(x)| = |A| (the determinant of A) for all x. So to maximize I(X,Y), we can maximize H(Y), and maximizing H(Y) amounts to maximizing |A|. Hence, the optimization objective can be updated as<br />
<br />
::<math>\max_A |A| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_{A^tA} \le K</math><br />
<br />
This can also be expressed in terms of the Mahalanobis distance matrix B = A<sup>t</sup>A as follows, noting that log |A| = (1/2) log |B|.<br />
<br />
::<math>\max_B |B| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \le K , \quad B > 0</math><br />
<br />
The solution to this problem is <math>B = \tfrac{K}{N} \hat{C}^{-1}</math>, where <math>\hat{C}</math> is the chunklet scatter matrix calculated in Step 1 of RCA. Thus, RCA gives the optimal Mahalanobis distance matrix up to a scale factor.<br />
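The core computation can be sketched in a few lines of NumPy: estimate the chunklet scatter matrix <math>S_{ch}</math> as the size-weighted average of within-chunklet covariances, then whiten with its inverse square root (an illustrative sketch under our own naming; the PCA dimensionality-reduction step of the full RCA algorithm is omitted):<br />

```python
import numpy as np

def chunklet_scatter(X, chunklets):
    # S_ch = (1/p) * sum_j |H_j| Cov(H_j), where p is the total number
    # of chunklet points; each term equals the sum of centered outer
    # products within chunklet j (rows of X are data points)
    p = sum(len(ch) for ch in chunklets)
    C = np.zeros((X.shape[1], X.shape[1]))
    for ch in chunklets:
        Xc = X[ch] - X[ch].mean(axis=0)
        C += Xc.T @ Xc
    return C / p

def rca_whitening(X, chunklets):
    # W = S_ch^{-1/2} via eigendecomposition of the symmetric scatter
    vals, vecs = np.linalg.eigh(chunklet_scatter(X, chunklets))
    return vecs @ np.diag(vals ** -0.5) @ vecs.T
```

The learned Mahalanobis matrix is then proportional to <math>S_{ch}^{-1}</math>, matching the closed-form solution above.<br />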
<br />
<br />
'''Within-Chunklet Distance Minimization'''<br />
<br />
In addition, RCA minimizes the sum of within-chunklet squared distances. If we consider the optimization problem<br />
<br />
::<math>\min_B \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \quad s.t. \quad |B| \ge 1</math> <br />
<br />
then it can be shown that RCA once again gives the optimal Mahalanobis distance matrix up to a scale factor. This property suggests a natural comparison with Xing ''et al.''’s method, which similarly learns a distance metric based on similarity side information. Xing ''et al.''’s method assumes side information in the form of pairwise similarities and dissimilarities, and seeks to optimize<br />
<br />
::<math>\min_B \sum_{(x_1,x_2) \in S} ||x_1 - x_2||^2_B \quad s.t. \sum_{(x_1,x_2) \in D} ||x_1 - x_2||_B \ge 1 , \quad B \ge 0 </math><br />
<br />
where S contains similar pairs and D contains dissimilar pairs. Comparing to the preceding optimization problem, if all chunklets have size 2 (i.e. the chunklets are just pairwise similarities), the objective function is the same up to a scale factor.<br />
<br />
The authors compared the clustering performance of RCA with Xing ''et al.''’s method <ref> E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information", Advances in Neural Information Processing Systems, 2002. </ref> using six of the UC Irvine datasets. Clustering performance was measured using a normalized accuracy score defined as<br />
<br />
::<math>\sum_{i > j}\frac{1 \lbrace 1 \lbrace c_i = c_j \rbrace = 1 \lbrace \hat{c}_i = \hat{c}_j \rbrace \rbrace}{0.5m(m-1)}</math><br />
<br />
where 1{ } is the indicator function, <math>\hat{c}</math> is the assigned cluster, and c is the true cluster. The score may be interpreted as the probability of correctly assigning two randomly drawn points.<br />
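A direct NumPy implementation of this score (illustrative; names are ours) makes the pairwise interpretation explicit:<br />

```python
import numpy as np

def pairwise_accuracy(c_true, c_pred):
    # fraction of point pairs (i, j), i > j, on which the predicted
    # clustering agrees with the true one about "same cluster or not"
    c_true, c_pred = np.asarray(c_true), np.asarray(c_pred)
    same_true = c_true[:, None] == c_true[None, :]
    same_pred = c_pred[:, None] == c_pred[None, :]
    i, j = np.triu_indices(len(c_true), k=1)   # the 0.5*m*(m-1) pairs
    return np.mean(same_true[i, j] == same_pred[i, j])
```

Note that the score is invariant to permutations of the cluster labels, as appropriate for comparing clusterings.<br />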
<br />
Overall, RCA yielded an improvement over regular K-means and showed performance comparable to Xing ''et al.''’s method; however, RCA is more computationally efficient, as it works with closed-form expressions while Xing ''et al.''’s method requires iterative gradient descent.<br />
<br />
== Suggestions/Critique ==<br />
<br />
* RCA makes effective use of limited side information in the form of chunklets; however, in most applications the data does not naturally come in chunklets. Indeed, in the face recognition experiments, the authors had to make use of prior information to artificially create chunklets. It would be useful if the authors provided additional examples of applications where data is naturally partitioned into chunklets, to further motivate the applicability of RCA.<br />
<br />
* RCA also assumes equal class covariances, which might limit its performance on many real-world datasets.<br />
<br />
* In the UC Irvine experiments, RCA shows similar performance to Xing ''et al.''’s method, but the authors noted that RCA is more computationally efficient. While they make a sensible logical argument (iterative gradient descent tends to be computationally expensive), providing experimental running times may help support and quantify this claim.<br />
<br />
<br />
====Why Equal Variances for Chunklets====<br />
<br />
In [2] the authors suppose that <math> C_{m} </math> is the random variable describing the distribution of the data in class <math> m </math>; then, assuming equal class covariances, they calculate <math> S_{ch} </math> as described above.<br><br />
<br />
Further, suppose that the data in class <math> m </math> depend on another source of variation <math> G </math> besides the class characteristics (<math> G </math> may be global variation or sensor characteristics). The random variable for the <math> m </math>th class is then <math> X=C_{m}+G </math>, where the global component <math> G </math> is the same for all classes, <math> G </math> is independent of <math> C_{m} </math>, and the global variation is larger than the class variation (<math> \Sigma_{m}<\Sigma_{G} </math>). <br><br />
<br />
In this situation the covariance of class <math> m </math> is <math> \Sigma_{m}+\Sigma_{G} </math>, which by assumption is dominated by <math> \Sigma_{G} </math>. Hence all class covariances are approximately equal (to <math> \Sigma_{G} </math>), which recovers the equal-covariance assumption.<br><br />
<br />
== Kernel RCA==<br />
<br />
Although RCA has significant computational and technical advantages, there are practical situations it cannot handle, i.e. RCA comes with some restrictions: <br><br />
<br />
(i)- RCA considers only linear transformations and fails for nonlinear ones (even simple ones)<br><br />
(ii)- since RCA acts in the input space, its number of parameters depends on the dimensionality of the feature vectors<br><br />
(iii)- RCA requires a vectorial representation of the data, which some kinds of data, such as protein sequences, do not naturally have<br><br />
<br />
To overcome these restrictions, Tsang and colleagues (2005)<ref> Tsang, I. W. and Colleagues; Kernel Relevant Component Analysis For Distance Metric Learning. International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 2005 </ref> suggested using kernels in RCA and showed how RCA can be kernelized.<br />
<br />
===Kernelizing RCA===<br />
For <math>k</math> given chunklets, each containing <math>n_{i}</math> patterns <math>\left\{x_{i,1},...,x_{i,n_{i}} \right\}</math>, the covariance matrix of the centered patterns is as follows:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\bar{x}_{i}\right)\left(x_{i,j}-\bar{x}_{i}\right)^{'} </math><br />
<br />
and the associated whitening transform is<br />
<br />
<math>x\stackrel{}{\rightarrow}C^{-\frac{1}{2}}x </math><br />
<br />
Now let <math>X=\left[x_{1,1},x_{1,2},...,x_{1,n_{1}},...,x_{k,1},...,x_{k,n_{k}} \right]</math> be the matrix whose columns are all <math>n</math> patterns; then C can be written as:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\frac{1}{n_{i}}X1_{i}\right)\left(x_{i,j}-\frac{1}{n_{i}}X1_{i}\right)^{'} </math><br />
<br />
where <math>1_{i}</math> is the <math>n \times 1</math> indicator vector such that:<br />
<br />
<math> [1_{i}]_{j}= \left\{\begin{matrix} <br />
1 & \text{pattern } j \in \text{chunklet } i \\ <br />
0 & \text{otherwise} \end{matrix}\right.</math><br />
<br />
and <math>I_{i}=diag\left(1_{i}\right)</math>.<br />
<br />
Using the above notation, C can be simplified to the form <math>C=\frac{1}{n}XHX^{'}</math><br />
<br />
where <math> H=\sum_{i=1}^{k}\left(I_{i}-\frac{1}{n_{i}}1_{i}1_{i}^{'}\right)</math><br />
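The factored form <math>C=\frac{1}{n}XHX^{'}</math> can be checked numerically; the following sketch (our own construction, with two chunklets of three patterns each) builds <math>H</math> from the indicator vectors and compares against the direct sum of centered outer products:<br />

```python
import numpy as np

def chunklet_H(chunklets, n):
    # H = sum_i (I_i - (1/n_i) 1_i 1_i'), built from indicator vectors
    H = np.zeros((n, n))
    for ch in chunklets:
        one = np.zeros(n)
        one[ch] = 1.0                      # indicator vector 1_i
        H += np.diag(one) - np.outer(one, one) / len(ch)
    return H

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))            # columns x_{i,j} are the patterns
chunklets = [[0, 1, 2], [3, 4, 5]]
n = X.shape[1]

# direct form: (1/n) * sum over chunklets of centered outer products
C_direct = sum(
    (X[:, ch] - X[:, ch].mean(axis=1, keepdims=True))
    @ (X[:, ch] - X[:, ch].mean(axis=1, keepdims=True)).T
    for ch in chunklets) / n

assert np.allclose(C_direct, X @ chunklet_H(chunklets, n) @ X.T / n)
```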
<br />
To deal with possible singularity, for small <math> \epsilon </math> let <math>\hat{C}=C+\epsilon I</math>; the inverse of <math>\hat{C}</math> is then<br />
<br />
<math>\hat{C}^{-1}=\frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}X^{'}</math><br />
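Because each diagonal block of <math>H</math> is a centering matrix, <math>H</math> itself is singular, so the inverse takes a push-through (Woodbury-style) form with the factor <math>I+\frac{1}{n\epsilon}X^{'}XH</math> inside the inverse. The identity can be verified numerically (a small sketch with our own chunklet layout):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eps = 4, 6, 0.1
X = rng.standard_normal((d, n))             # columns are the patterns
blk = np.eye(3) - np.ones((3, 3)) / 3       # centering block for a 3-chunklet
H = np.kron(np.eye(2), blk)                 # two chunklets of size 3

C_hat = X @ H @ X.T / n + eps * np.eye(d)
direct = np.linalg.inv(C_hat)
# (1/eps) I - (1/(n eps^2)) X H (I + (1/(n eps)) X'X H)^{-1} X'
formula = (np.eye(d) / eps
           - X @ H @ np.linalg.inv(np.eye(n) + X.T @ X @ H / (n * eps)) @ X.T
           / (n * eps**2))
assert np.allclose(direct, formula)
```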
<br />
Therefore the inner product of the transformed <math>x</math> and <math>y</math> is <br />
<br />
<math> \left(\hat{C}^{-\frac{1}{2}}x\right)^{'} \left(\hat{C}^{-\frac{1}{2}}y\right)= x^{'} \hat{C}^{-1} y= x^{'} \left( \frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}X^{'} \right) y </math><br />
<br />
Now if RCA operates in a feature space <math> \mathcal{F}</math> with corresponding kernel <math> l </math>, then the inner product between the nonlinear transformations <math> \varphi (x)</math> and <math> \varphi (y)</math> after running RCA in <math> \mathcal{F}</math> is:<br />
<br />
<math> \tilde{l}(x,y)=\frac{1}{\epsilon}l(x,y)-l_{x}^{'} \left( \frac{1}{n \epsilon^{2}}H \left( I+\frac{1}{n \epsilon}LH \right)^{-1} \right) l_{y} </math><br />
<br />
where <math>L=\left[ l(x_{i},x_{j}) \right]_{ij}</math>, <math> l_{x}=\left[ l(x_{1,1},x),...,l(x_{k,n_{k}},x) \right]^{'}</math><br />
and <math> l_{y}=\left[ l(x_{1,1},y),...,l(x_{k,n_{k}},y) \right]^{'}</math><br />
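As a sanity check, with the linear kernel <math>l(x,y)=x^{'}y</math> the transformed kernel <math>\tilde{l}</math> must reduce to the whitened inner product <math>x^{'}\hat{C}^{-1}y</math> of linear RCA. A small numerical sketch (our own construction, two chunklets of size three):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, eps = 4, 6, 0.1
X = rng.standard_normal((d, n))             # columns are the chunklet patterns
blk = np.eye(3) - np.ones((3, 3)) / 3
H = np.kron(np.eye(2), blk)                 # two chunklets of size 3
L = X.T @ X                                 # kernel matrix for the linear kernel

def k_tilde(x, y):
    # l~(x, y) = (1/eps) l(x, y) - l_x' [ (1/(n eps^2)) H (I + (1/(n eps)) L H)^{-1} ] l_y
    lx, ly = X.T @ x, X.T @ y
    M = np.linalg.inv(np.eye(n) + L @ H / (n * eps))
    return x @ y / eps - lx @ (H @ M / (n * eps**2)) @ ly

C_hat = X @ H @ X.T / n + eps * np.eye(d)
x, y = rng.standard_normal(d), rng.standard_normal(d)
assert np.isclose(k_tilde(x, y), x @ np.linalg.inv(C_hat) @ y)
```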
<br />
== References ==<br />
<references/></div>
<hr />
<div>== First paper: Shental ''et al.'', 2002 <ref>N. Shental, T. Hertz, D. Weinshall, and M. Pavel, "Adjustment Learning and Relevant Component Analysis," Proc. European Conference on Computer Vision (ECCV), 2002, pp. 776-790.</ref> ==<br />
<br />
Irrelevant data variability often causes difficulties in classification and clustering tasks. For example, when data variability is dominated by environment conditions, such as global illumination, nearest-neighbour classification in the original feature space may be very unreliable. The goal of Relevant Component Analysis (RCA) is to find a transformation that amplifies relevant variability and suppresses irrelevant variability.<br />
<br />
:: ''Definition of irrelevant variability:'' We say that data variability is correlated with a specific task "if the removal of this variability from the data deteriorates (on average) the results of clustering or retrieval" [1]. Variability is irrelevant if it is "maintained in the data" but "not correlated with the specific task" [1].<br />
<br />
To achieve this goal, Shental ''et al.'' introduced the idea of ''chunklets'' – "small sets of data points, in which the class label is constant, but unknown" [1]. As we will see, chunklets allow irrelevant variability to be suppressed without needing fully labelled training data. Since the data come unlabelled, the chunklets "must be defined naturally by the data": for example, in speaker identification, "short utterances of speech are likely to come from a single speaker" [1]. The authors coin the term ''adjustment learning'' to describe learning using chunklets; adjustment learning can be viewed as falling somewhere between unsupervised learning and supervised learning.<br />
<br />
Relevant Component Analysis tries to find a linear transformation W of the feature space such that the effect of irrelevant variability is reduced in the transformed space. That is, we wish to rescale the feature space and reduce the weights of irrelevant directions. The main premise of RCA is that we can reduce irrelevant variability by reducing the within-class variability. Intuitively, a direction which exhibits high variability among samples of the same class is unlikely to be useful for classification or clustering. <br />
<br />
RCA assumes that the class covariances are all equal. If we allow this assumption, it makes sense to rescale the feature space using a whitening transformation based on the common class covariance Σ. This gives the familiar transformation W = VΛ<sup>-1/2</sup>, where V and Λ can be found by the singular value decomposition of Σ.<br />
<br />
With labelled data estimating Σ is straightforward, but in RCA labelled data is not available and an approximation is calculated using chunklets. The ''chunklet scatter matrix'' is calculated by<br />
<br />
:: <math>S_{ch} = \frac{1}{|\Omega|}\sum_{n=1}^N|H_n|Cov(H_n)</math><br />
<br />
where |Ω| is the size of the data set, H<sub>n</sub> is the nth chunklet, |H<sub>n</sub>| is the size of the nth chunklet, and N is the number of chunklets.<br />
<br />
Intuitively, this is a weighted average of the chunklet covariances, with weight proportional to the size of the chunklet.<br />
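This weighted average can be sketched in numpy as follows (a hypothetical helper; <code>chunklets</code> is assumed to be a list of arrays, one array of points per chunklet):<br />

```python
import numpy as np

def chunklet_scatter(chunklets):
    """Chunklet scatter matrix S_ch: a weighted average of the
    chunklet covariances, weighted by chunklet size.

    chunklets: list of (n_i x d) arrays, each holding the points
    of one chunklet.
    """
    n_total = sum(len(H) for H in chunklets)      # |Omega|
    d = chunklets[0].shape[1]
    S = np.zeros((d, d))
    for H in chunklets:
        m = H.mean(axis=0)                        # chunklet mean
        C = (H - m).T @ (H - m) / len(H)          # ML covariance Cov(H_n)
        S += len(H) * C                           # weight |H_n| * Cov(H_n)
    return S / n_total
```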
<br />
The steps of the RCA algorithm are as follows:<br />
<br />
:: "1. Calculate S<sub>ch</sub>... Let r denote its effective rank (the number of singular values of S<sub>ch</sub> which are significantly larger than 0).<br />
:: 2. Compute the total covariance (scatter) matrix of the original data S<sub>T</sub>, and project the data using PCA to its r largest dimensions.<br />
:: 3. Project S<sub>ch</sub> onto the reduced dimensional space, and compute the corresponding whitening transformation W.<br />
:: 4. Apply W to the original data (in the reduced space)." [1]<br /><br />
"In effect, the whitening transformation W assigns lower weight to some directions in the<br />
original feature space; those are the directions in which the data ~_ri-ability is mainly due <br />
to within class variability, and is therefore "irrelevant" for the task of classification."[3]<br />
<br />
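The four steps above can be sketched in numpy as follows (a minimal, illustrative implementation; the rank tolerance <code>tol</code> is an assumption, not from the paper):<br />

```python
import numpy as np

def rca(X, chunklets, tol=1e-8):
    """Sketch of RCA: PCA to the effective rank of S_ch, then whitening.

    X: (n x d) data matrix; chunklets: list of (n_i x d) arrays.
    Returns the transformed data and the whitening matrix W.
    """
    n = sum(len(H) for H in chunklets)
    d = X.shape[1]

    # Step 1: chunklet scatter matrix S_ch and its effective rank r.
    S_ch = np.zeros((d, d))
    for H in chunklets:
        m = H.mean(axis=0)
        S_ch += (H - m).T @ (H - m)               # |H_n| * Cov(H_n)
    S_ch /= n
    r = int(np.sum(np.linalg.eigvalsh(S_ch) > tol))

    # Step 2: project the data onto its r largest principal directions.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:r]                                    # (r x d) PCA projection
    X_r = Xc @ P.T

    # Step 3: project S_ch into the reduced space and whiten:
    # W = V Lambda^{-1/2} V' for the eigendecomposition of P S_ch P'.
    vals, vecs = np.linalg.eigh(P @ S_ch @ P.T)
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T

    # Step 4: apply W to the data in the reduced space.
    return X_r @ W.T, W
```

After the transformation, the chunklet scatter in the reduced space is the identity, i.e. the within-class directions have been rescaled to unit variance.<br />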
<br />
'''Experimental Results: Face Recognition'''<br />
<br />
The authors demonstrated the performance of RCA for the task of face recognition using the Yale A database. The database contains 155 face images of 15 people; lighting conditions and facial expression are varied across images. RCA is compared with the Eigenface method (based on PCA) and the Fisherface method (based on Fisher’s Linear Discriminant) for both nearest neighbour classification and clustering-based classification. In this dataset, the data is not naturally divided into chunklets, so the authors randomly sample chunklets given the ground-truth class (for example, if an individual is represented in 10 images, two chunklets may be formed by randomly partitioning the images into two groups of 5 images). <br />
<br />
For nearest neighbour classification, RCA outperforms Eigenface but does slightly worse than Fisherface. For clustering, RCA performs better than Eigenface and comparably to Fisherface. The authors pointed out that these experimental results are encouraging as Fisherface is a supervised method.<br />
<br />
In <ref> M. Sorci,G. Antonini, and Jean-Philippe Thiran, "Fisher's discriminant and relevant component analysis for static facial expression classification."</ref>, it is shown that, in the context of a facial expression recognition framework, RCA in combination with FLD yields a better classifier than RCA alone, with results comparable to SVM.<br />
<br />
'''Experimental Results: Surveillance'''<br />
<br />
In a second experiment, the authors used surveillance video footage divided into discrete clips in which a single person is featured. The same person can appear in multiple clips, and the task was to retrieve all clips in which a query person appears. A colour histogram is used to represent a person. Sources of irrelevant variation include reflections, occlusions, and illumination. In this experiment, the data does come naturally in chunklets: each clip features a single person, so frames in the same clip form a chunklet. Figure 7 in the paper shows the results of k-nearest neighbour classification (not reproduced here for copyright reasons).<br />
<br />
== Second Paper: Bar-Hillel ''et al.'', 2003 <ref> A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall, "Learning Distance Functions using Equivalence Relations," Proc. International Conference on Machine Learning (ICML), 2003, pp. 11-18. </ref> ==<br />
<br />
In a subsequent work [2], Bar-Hillel ''et al.'' described how RCA can be shown to optimize an information theoretic criterion, and compared the performance of RCA with the approach proposed by Xing ''et al.'' [3].<br />
<br />
'''Information Maximization'''<br />
<br />
According to information theory, "when an input X is transformed into a new representation Y, we should seek to maximize the mutual information I(X, Y) between X and Y under suitable constraints" [2]. In adjustment learning, we can think of the objective to be to keep chunklet points close to each other in the transformed space. More formally:<br />
<br />
::<math>\max_{f \in F}I(X,Y) \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||y_{ji} - m_j^y||^2 \le K</math><br />
<br />
where f is a transformation function, m<sub>j</sub><sup>y</sup> is the mean of chunklet j in the transformed space, p is the total number of chunklet points, and K is a constant.<br />
<br />
To maximize I(X,Y), we can simply maximize the entropy of Y, H(Y). This is because I(X,Y) = H(Y) – H(Y|X), and H(Y|X) is constant since the transformation is deterministic. Intuitively, since the transformation is deterministic there is no uncertainty in Y if X is known. <br />
<br />
Now we would like to express H(Y) in terms of H(X). If the transformation is invertible, we have p<sub>y</sub>(y) = p<sub>x</sub>(x) / |J(x)|, where J(x) is the Jacobian of the transformation. Therefore,<br />
<br />
::<math><br />
\begin{align}<br />
H(Y) & = -\int_y p(y)\log p(y)\, dy \\<br />
& = -\int_x p(x) \log \frac{p(x)}{|J(x)|} \, dx \\<br />
& = H(X) + \langle \log |J(x)| \rangle_x<br />
\end{align}<br />
</math><br />
<br />
Assuming a linear transformation Y = AX, the Jacobian is simply equal to the constant |A|. So to maximize I(X,Y), we can maximize H(Y), and maximizing H(Y) amounts to maximizing |A|. Hence, the optimization objective can be updated as<br />
<br />
::<math>\max_A |A| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_{A^tA} \le K</math><br />
<br />
This can also be expressed in terms of the Mahalanobis distance matrix B = A<sup>t</sup>A as follows, noting that log |A| = (1/2) log |B|.<br />
<br />
::<math>\max_B |B| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \le K , \quad B > 0</math><br />
<br />
The solution to this problem is <math>B = \tfrac{K}{N} \hat{C}^{-1}</math>, where <math>\hat{C}</math> is the chunklet scatter matrix calculated in Step 1 of RCA and N is the dimensionality of the data. Thus, RCA gives the optimal Mahalanobis distance matrix up to a scale factor.<br />
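A quick numerical sanity check of this solution (reading N as the data dimensionality, so that the constraint is met with equality; the matrix below is synthetic):<br />

```python
import numpy as np

# Hypothetical chunklet scatter matrix C_hat and distance bound K.
rng = np.random.default_rng(1)
M = rng.normal(size=(3, 3))
C_hat = M @ M.T + np.eye(3)       # a positive-definite chunklet covariance
K = 2.0
d = C_hat.shape[0]                # reading N as the data dimensionality

B = (K / d) * np.linalg.inv(C_hat)

# The constraint (1/p) sum_j sum_i ||x_ji - m_j||_B^2 equals tr(B C_hat),
# and at the optimum it is met with equality: tr(B C_hat) = (K/d) tr(I) = K.
constraint_value = np.trace(B @ C_hat)
```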
<br />
<br />
'''Within-Chunklet Distance Minimization'''<br />
<br />
In addition, RCA minimizes the sum of within-chunklet squared distances. If we consider the optimization problem<br />
<br />
::<math>\min_B \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \quad s.t. \quad |B| \ge 1</math> <br />
<br />
then it can be shown that RCA once again gives the optimal Mahalanobis distance matrix up to a scale factor. This property suggests a natural comparison with Xing ''et al.''’s method, which similarly learns a distance metric based on similarity side information. Xing ''et al.''’s method assumes side information in the form of pairwise similarities and dissimilarities, and seeks to optimize<br />
<br />
::<math>\min_B \sum_{(x_1,x_2) \in S} ||x_1 - x_2||^2_B \quad s.t. \sum_{(x_1,x_2) \in D} ||x_1 - x_2||_B \ge 1 , \quad B \ge 0 </math><br />
<br />
where S contains similar pairs and D contains dissimilar pairs. Compared with the preceding optimization problem, if all chunklets have size 2 (i.e., the chunklets are just pairwise similarities), the objective function is the same up to a scale factor.<br />
<br />
The authors compared the clustering performance of RCA with Xing ''et al.''’s method <ref> E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information", Advances in Neural Information Processing Systems, 2002. </ref> using six of the UC Irvine datasets. Clustering performance was measured using a normalized accuracy score defined as<br />
<br />
::<math>\sum_{i > j}\frac{1 \lbrace 1 \lbrace c_i = c_j \rbrace = 1 \lbrace \hat{c}_i = \hat{c}_j \rbrace \rbrace}{0.5m(m-1)}</math><br />
<br />
where 1{ } is the indicator function, <math>\hat{c}</math> is the assigned cluster, and c is the true cluster. The score may be interpreted as the probability of correctly assigning two randomly drawn points.<br />
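This pair-counting score can be sketched as follows (a hypothetical helper name):<br />

```python
import numpy as np

def pair_accuracy(c_true, c_pred):
    """Fraction of point pairs (i > j) on which the two clusterings agree:
    both place the pair in the same cluster, or both separate it.
    Equals the probability of correctly assigning two random points."""
    c_true = np.asarray(c_true)
    c_pred = np.asarray(c_pred)
    m = len(c_true)
    same_true = c_true[:, None] == c_true[None, :]
    same_pred = c_pred[:, None] == c_pred[None, :]
    i, j = np.triu_indices(m, k=1)          # all 0.5*m*(m-1) pairs
    return np.mean(same_true[i, j] == same_pred[i, j])
```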
<br />
Overall, RCA yielded an improvement over regular K-means and showed comparable performance to Xing ''et al.''’s method; however, RCA is more computationally efficient, as it works with closed-form expressions while Xing ''et al.''’s method requires iterative gradient descent.<br />
<br />
== Suggestions/Critique ==<br />
<br />
* RCA makes effective use of limited side information in the form of chunklets; however, in most applications the data does not naturally come in chunklets. Indeed, in the face recognition experiments, the authors had to make use of prior information to artificially create chunklets. It may be useful if the authors provided additional examples of applications where data is naturally partitioned into chunklets, to further motivate the applicability of RCA.<br />
<br />
* RCA also assumes equal class covariances, which might limit its performance on many real-world datasets.<br />
<br />
* In the UC Irvine experiments, RCA shows similar performance to Xing ''et al.''’s method, but the authors noted that RCA is more computationally efficient. While they make a sensible logical argument (iterative gradient descent tends to be computationally expensive), providing experimental running times may help support and quantify this claim.<br />
<br />
<br />
====Why Equal Variances for Chunklets====<br />
<br />
In [2], the authors suppose that <math> C_{m} </math> is the random variable describing the distribution of data in class <math> m </math>; then, assuming equal class covariances, they calculate <math> S_{ch} </math> as mentioned above.<br><br />
<br />
Further, suppose that the data in class <math> m </math> depend on another source of variation <math> G </math> besides the class characteristics (<math> G </math> may be global variation or sensor characteristics). The random variable for the <math> m </math>th class is then <math> X=C_{m}+G </math>, where the global effect <math> G </math> is the same for all classes, <math> G </math> is independent of <math> C_{m} </math>, and the global variation is larger than the class variation (<math> \Sigma_{m}<\Sigma_{G} </math>). <br><br />
<br />
In this situation the covariance of class <math> m </math> is <math> \Sigma_{m}+\Sigma_{G} </math>, which by assumption is dominated by <math> \Sigma_{G} </math>. Hence all class covariances are approximately equal to <math> \Sigma_{G} </math>, which recovers the equal-covariance assumption.<br><br />
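This argument is easy to check numerically; in the following illustrative simulation the class covariances are dominated by a large shared <math>\Sigma_{G}</math> (all scales below are made up):<br />

```python
import numpy as np

rng = np.random.default_rng(2)
Sigma_G = 25.0 * np.eye(2)                  # large shared (global) covariance
covs = []
for mu in (np.zeros(2), np.array([1.0, -1.0])):
    C_m = 0.3 * np.eye(2)                   # small class-specific covariance
    X = rng.multivariate_normal(mu, C_m + Sigma_G, size=5000)
    covs.append(np.cov(X.T))

# Both empirical class covariances are dominated by Sigma_G, hence
# approximately equal -- recovering the equal-covariance assumption.
```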
<br />
== Kernel RCA==<br />
<br />
Although RCA has significant computational and technical advantages, there are situations arising in real problems that RCA cannot deal with; that is, RCA comes with some restrictions. <br><br />
<br />
(i)- RCA only considers linear transformations and fails for nonlinear ones (even simple ones);<br><br />
(ii)- since RCA acts in the input space, its number of parameters depends on the dimensionality of the feature vectors;<br><br />
(iii)- RCA requires a vectorial representation of the data, which may not be natural for some kinds of data, such as protein sequences.<br><br />
<br />
To overcome these restrictions, Tsang and colleagues (2005)<ref> Tsang, I. W. and Colleagues; Kernel Relevant Component Analysis For Distance Metric Learning. International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 2005 </ref> suggested using kernels in RCA and showed how RCA can be kernelized.<br />
<br />
===Kernelizing RCA===<br />
For <math>k</math> given chunklets, each containing <math>n_{i}</math> patterns <math>\left\{x_{i,1},...,x_{i,n_{i}} \right\}</math>, the covariance matrix of the centered patterns is as follows:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\bar{x}_{i}\right)\left(x_{i,j}-\bar{x}_{i}\right)^{'} </math><br />
<br />
and the associated whitening transform is as<br />
<br />
<math>x\stackrel{}{\rightarrow}C^{-\frac{1}{2}}x </math><br />
<br />
Now let <math>X=\left[x_{1,1},x_{1,2},...,x_{1,n_{1}},...,x_{k,1},...,x_{k,n_{k}} \right]</math> be the matrix whose columns are all the patterns; then C can be written as:<br />
<br />
<math>C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\frac{1}{n_{i}}X1_{i}\right)\left(x_{i,j}-\frac{1}{n_{i}}X1_{i}\right)^{'} </math><br />
<br />
where <math>1_{i}</math> is an <math>n \times 1</math> indicator vector such that:<br />
<br />
<math> [1_{i}]_{j}= \left\{\begin{matrix} 
1 & \text{pattern } j \in \text{chunklet } i \\ 
0 & \text{otherwise} \end{matrix}\right.</math><br />
<br />
and <math>I_{i}=diag\left(1_{i}\right)</math>.<br />
<br />
Using the above notation, C can be simplified to the form <math>C=\frac{1}{n}XHX^{'}</math><br />
<br />
where <math> H=\sum_{i=1}^{k}\left(I_{i}-\frac{1}{n_{i}}1_{i}1_{i}^{'}\right)</math><br />
<br />
To avoid singularity, for a small <math> \epsilon > 0 </math> let <math>\hat{C}=C+\epsilon I</math>; the inverse of <math>\hat{C}</math> is then<br />
<br />
<math>\hat{C}^{-1}=\frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}X^{'}</math><br />
<br />
Therefore, the inner product of the transformed <math>x</math> and <math>y</math> is<br />
<br />
<math> \left(\hat{C}^{-\frac{1}{2}}x\right)^{'} \left(\hat{C}^{-\frac{1}{2}}y\right)= x^{'} \hat{C}^{-1} y= x^{'} \left( \frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(I+\frac{1}{n \epsilon}X^{'}XH \right)^{-1}X^{'} \right) y </math><br />
<br />
Now if RCA operates in a feature space <math> \mathcal{F}</math> with corresponding kernel <math> l </math>, then the inner product between the nonlinear transformations <math> \varphi (x)</math> and <math> \varphi (y)</math> after running RCA in <math> \mathcal{F}</math> is:<br />
<br />
<math> \tilde{l}(x,y)=\frac{1}{\epsilon}l(x,y)-l_{x}^{'} \left( \frac{1}{n \epsilon^{2}}H \left( I+\frac{1}{n \epsilon}LH \right)^{-1} \right) l_{y} </math><br />
<br />
where <math>L=\left[ l(x_{i},x_{j}) \right]_{ij}</math>, <math> l_{x}=\left[ l(x_{1,1},x),...,l(x_{k,n_{k}},x) \right]^{'}</math><br />
and <math> l_{y}=\left[ l(x_{1,1},y),...,l(x_{k,n_{k}},y) \right]^{'}</math><br />
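A sketch of the kernelized transform in numpy (function and variable names are illustrative; <code>L</code> is the kernel matrix over the chunklet patterns):<br />

```python
import numpy as np

def kernel_rca(L, chunklet_ids, eps=1e-3):
    """Return a function computing the RCA-transformed kernel l~.

    L: (n x n) kernel matrix over the n chunklet patterns.
    chunklet_ids: length-n array; chunklet_ids[j] is the chunklet of pattern j.
    """
    n = len(chunklet_ids)
    # H = sum_i (I_i - (1/n_i) 1_i 1_i'), a block-wise centering matrix.
    H = np.eye(n)
    for c in np.unique(chunklet_ids):
        idx = np.where(chunklet_ids == c)[0]
        H[np.ix_(idx, idx)] -= 1.0 / len(idx)
    # M = (1/(n eps^2)) H (I + (1/(n eps)) L H)^{-1}
    M = H @ np.linalg.inv(np.eye(n) + L @ H / (n * eps)) / (n * eps ** 2)

    def l_tilde(l_x, l_y, l_xy):
        """l_x = [l(x_j, x)]_j, l_y = [l(x_j, y)]_j, l_xy = l(x, y)."""
        return l_xy / eps - l_x @ M @ l_y

    return l_tilde
```

With a linear kernel this reproduces the explicit inner product <math>x^{'}\hat{C}^{-1}y</math> in input space, which is a useful correctness check.<br />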
<br />
== References ==<br />
<references/></div>Myakhavehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=visualizing_Data_using_t-SNE&diff=3606visualizing Data using t-SNE2009-07-28T21:57:43Z<p>Myakhave: /* Compensation for Mismatched Dimensionality by Mismatched Tails */</p>
<hr />
<div>==Introduction==<br />
The paper <ref>Laurens van der Maaten, and Geoffrey Hinton. Visualizing Data using t-SNE. ''Journal of Machine Learning Research'', 9: 2579-2605, 2008</ref> introduced a new nonlinear dimensionality reduction technique that "embeds" high-dimensional data into a low-dimensional space. This technique is a variation of Stochastic Neighbor Embedding (SNE), proposed by Hinton and Roweis in 2002 <ref>G.E. Hinton and S.T. Roweis. Stochastic Neighbor embedding. In ''Advances in Neural Information Processing Systems'', vol. 15, pp, 883-840, Cambridge, MA, USA, 2002. The MIT Press.</ref>, in which the high-dimensional Euclidean distances between datapoints are converted into conditional probabilities that describe their similarities. t-SNE, based on the same idea, aims to be easier to optimize and to solve the "crowding problem". In addition, the authors showed that t-SNE can be applied to large data sets as well, by using random walks on neighborhood graphs. The performance of t-SNE is demonstrated on a wide variety of data sets and compared with many other visualization techniques.<br />
<br />
==Stochastic Neighbor Embedding==<br />
In SNE, the high-dimensional Euclidean distances between datapoints are first converted into probabilities. The similarity of datapoint <math> \mathbf x_j </math> to datapoint <math> \mathbf x_i </math> is then represented by the conditional probability, <math> \mathbf p_{j|i} </math>, that <math> \mathbf x_i </math> would pick <math> \mathbf x_j </math> as its neighbor when neighbors are picked in proportion to their probability density under a Gaussian centered on <math> \mathbf x_i </math>. <math> \mathbf p_{j|i} </math> is given as<br />
<br />
<br> <center> <math> \mathbf p_{j|i} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma_i ^2 )}{\sum_{k \neq i} \exp(-||x_i-x_k ||^2/ 2\sigma_i ^2 ) }</math> </center> <br />
<br />
where the sum in the denominator runs over all other datapoints <math> \mathbf x_k </math>, <math> \mathbf \sigma_i </math> is the variance of the Gaussian that is centered on <math> \mathbf x_i </math>, and for every <math> \mathbf x_i </math> we set <math> \mathbf p_{i|i} = 0 </math>. It can be seen from this definition that the closer the datapoints are, the higher <math> \mathbf p_{j|i} </math> is; for widely separated datapoints, <math> \mathbf p_{j|i} </math> is almost infinitesimal. <br />
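The conditional probabilities <math> \mathbf p_{j|i} </math> can be computed as follows (a minimal numpy sketch; <code>sigmas</code> holds one bandwidth per point):<br />

```python
import numpy as np

def conditional_p(X, sigmas):
    """p_{j|i}: Gaussian-kernel similarities, normalized per row i."""
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # squared distances
    P = np.exp(-D / (2.0 * sigmas[:, None] ** 2))
    np.fill_diagonal(P, 0.0)                                   # p_{i|i} = 0
    return P / P.sum(axis=1, keepdims=True)
```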
<br />
With the same idea, in the low-dimensional space, we model the similarity of map point <math> \mathbf y_j </math> to <math> \mathbf y_i </math> by the conditional probability <math> \mathbf q_{j|i} </math>, which is given by<br />
<br />
<br> <center> <math> q_{j|i} = \frac{\exp(-||y_i-y_j ||^2)}{\sum_{k \neq i} \exp(-||y_i-y_k ||^2) }</math> </center><br />
<br />
where we set the variance of the Gaussian <math> \mathbf \sigma_i </math> to be <math> \frac{1}{\sqrt{2} } </math> (a different value will only result in rescaling of the final map). And again, we set <math> \mathbf q_{i|i} = 0 </math>.<br />
<br />
If the low-dimensional map points correctly represent the high-dimensional datapoints, their conditional probabilities <math> \mathbf q_{j|i} </math> and <math> \mathbf p_{j|i} </math> should be equal. Therefore, the aim of SNE is to minimize the mismatch between <math> \mathbf q_{j|i} </math> and <math> \mathbf p_{j|i} </math>. This is achieved by minimizing the sum of Kullback-Leibler divergences (a non-symmetric measure of the difference between two probability distributions) over all datapoints. The cost function of SNE is then expressed as <br />
<br />
<br> <center> <math> C = \sum_{i} KL(P_i||Q_i) =\sum_{i}\sum_{j \neq i} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}</math> </center><br />
<br />
where <math> \mathbf P_i </math> and <math> \mathbf Q_i </math> are the conditional probability distributions over all other points for given <math> \mathbf x_i </math> and <math> \mathbf y_i </math>. Since the Kullback-Leibler divergence is asymmetric, there is a large cost for using a small <math> \mathbf q_{j|i} </math> to model a large <math> \mathbf p_{j|i} </math>, but only a small cost for using a large <math> \mathbf q_{j|i} </math> to model a small <math> \mathbf p_{j|i} </math>. Therefore, the SNE cost function focuses more on local structure: it enforces both keeping the images of nearby objects nearby and keeping the images of widely separated objects relatively far apart.<br />
<br />
The remaining parameter <math> \mathbf \sigma_i </math> is selected by performing a binary search for the value of <math> \mathbf \sigma_i </math> that produces a <math> \mathbf P_i </math> with a fixed perplexity (a smooth measure of the effective number of neighbors, defined as two to the power of the Shannon entropy of <math>P_i</math>) that is selected by the user.<br />
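The binary search for <math> \mathbf \sigma_i </math> can be sketched as follows (the search bounds and iteration count are illustrative choices, not from the paper):<br />

```python
import numpy as np

def sigma_for_perplexity(d2_i, target, iters=50):
    """Binary-search sigma_i so that Perp(P_i) = 2^{H(P_i)} hits `target`.

    d2_i: squared distances from x_i to all other points.
    """
    lo, hi = 1e-10, 1e10
    for _ in range(iters):
        sigma = 0.5 * (lo + hi)
        p = np.exp(-d2_i / (2.0 * sigma ** 2))
        p /= p.sum()
        H = -np.sum(p * np.log2(p + 1e-12))   # Shannon entropy (bits)
        if 2.0 ** H > target:                 # too many effective neighbours
            hi = sigma                        # -> shrink sigma
        else:
            lo = sigma
    return sigma
```

Larger <math> \mathbf \sigma_i </math> makes <math> \mathbf P_i </math> more uniform and thus raises the perplexity, so the search is monotone.<br />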
<br />
To minimize the cost function, a gradient descent method is used. The gradient is given as<br />
<br />
<br> <center> <math> \frac{\partial C}{\partial y_i} = 2\sum_{j} (y_i-y_j)([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) </math> </center><br />
<br />
which is simple and has a nice physical interpretation. The gradient can be seen as the resultant force induced by a set of springs between the map point <math> \mathbf y_i </math> and all other neighbor points <math> \mathbf y_j </math>, where the force is exerted in the direction <math> \mathbf (y_i-y_j) </math> and the stiffness of the spring is <math> \mathbf ([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) </math>.<br />
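This gradient can be sketched in numpy (rows of <code>P</code> and <code>Q</code> are assumed to hold <math>p_{j|i}</math> and <math>q_{j|i}</math> for fixed i):<br />

```python
import numpy as np

def sne_grad(Y, P, Q):
    """dC/dy_i = 2 * sum_j (y_i - y_j) * ([p_{j|i}-q_{j|i}] + [p_{i|j}-q_{i|j}]).

    Y: (n x 2) map points; P, Q: (n x n) conditional probabilities,
    with P[i, j] = p_{j|i} and Q[i, j] = q_{j|i}.
    """
    S = (P - Q) + (P - Q).T                       # spring stiffness terms
    diff = Y[:, None, :] - Y[None, :, :]          # y_i - y_j
    return 2.0 * np.sum(S[:, :, None] * diff, axis=1)
```

Because the stiffness matrix is symmetric and the displacements are antisymmetric, the forces cancel in total: the gradient field is translation invariant.<br />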
<br />
==t-Distributed Stochastic Neighbor Embedding==<br />
Although SNE produces relatively good visualizations, it has two main problems: difficulty in optimization and the "crowding problem". t-Distributed Stochastic Neighbor Embedding (t-SNE), a variation of SNE, aims to alleviate these problems. The cost function of t-SNE differs from that of SNE in two ways: (1) it uses a symmetric version of the SNE cost function, and (2) it uses a Student-t distribution instead of a Gaussian to compute the similarities in the low-dimensional space. <br />
<br />
=== Symmetric SNE ===<br />
In symmetric SNE, instead of the sum of the Kullback-Leibler divergences between the conditional probabilities, the cost function is a single Kullback-Leibler divergence between two joint probability distributions, <math> \mathbf P </math> in the high-dimensional space and <math> \mathbf Q </math> in the low-dimensional space.<br />
<br />
In this case, the pairwise similarities of the data points in the high-dimensional space are given by,<br />
<br />
<center> <math> \mathbf p_{ij} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma^2 )}{\sum_{k \neq l} \exp(-||x_k-x_l ||^2/ 2\sigma^2 ) }</math> </center><br />
<br />
and <math> \mathbf q_{ij} </math> in low-dimensional space is<br />
<br />
<center> <math> \mathbf q_{ij} = \frac{\exp(-||y_i-y_j ||^2 )}{\sum_{k \neq l} \exp(-||y_k-y_l ||^2) }</math> </center><br />
<br />
where <math> \mathbf p_{ii} </math> and <math> \mathbf q_{ii} </math> are still zero. A high-dimensional outlier <math> \mathbf x_i </math> (one far from all the other points) would make all of its pairwise similarities extremely small; therefore we set <math> \mathbf{p_{ij}=\frac {(p_{j|i}+p_{i|j})}{2n}} </math>, which ensures that <math>\sum_{j} p_{ij} > \frac {1}{2n} </math> for all <math> \mathbf x_i </math>. This makes sure that every <math> \mathbf x_i </math> makes a significant contribution to the cost function, which is given as<br />
<br />
<center> <math> C = KL(P||Q) =\sum_{i}\sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}</math> </center><br />
<br />
As we can see, by definition, we have <math> \mathbf p_{ij} = p_{ji} </math> and <math> \mathbf q_{ij} = q_{ji} </math>. This is why it is called symmetric SNE.<br />
<br />
From the cost function, we have the gradient as simple as<br />
<br />
<center> <math> \frac{\partial C}{\partial y_i} = 4\sum_{j} (y_i-y_j)(p_{ij}-q_{ij}) </math> </center><br />
<br />
which is the main advantage of symmetric SNE.<br />
<br />
=== The Crowding Problem ===<br />
The "crowding problem" that are addressed in the paper is defined as: "the area of the two-dimensional map that is available to accommodate moderately distant datapoints will not be nearly large enough compared with the area available to accommodate nearby datepoints". This happens when the datapoints are distributed in a region on a high-dimensional manifold around <math> i </math>, and we try to model the pairwise distances from <math> i </math> to the datapoints in a two-dimensional map. For example, it is possible to have 11 datapoints that are mutually equidistant in a ten-dimensional manifold but it is not possible to model this faithfully in a two-dimensional map. Therefore, if the small distances can be modeled accurately in a map, most of the moderately distant datapoints will be too far away in the two-dimensional map. In SNE, this will result in very small attractive force from datapoint <math> i </math> to these too-distant map points. The very large number of such forces collapses together the points in the center of the map and prevents gaps from forming between the natural clusters. This phenomena, crowding problem, is not specific to SNE and can be observed in other local techniques such as Sammon mapping as well.<br /><br />
According to Cook et al. (2007), adding a slight repulsion can address this problem. Using a uniform background model with a small mixing proportion, <math>\,\rho</math>, ensures that <math>\,q_{ij}</math> never falls below <math>\frac{2\rho}{n(n-1)}</math>. In this technique, called UNI-SNE, <math>\,q_{ij}</math> will be larger than <math>\,p_{ij}</math> even for far-apart datapoints.<br />
<br />
=== Compensation for Mismatched Dimensionality by Mismatched Tails ===<br />
Since the crowding problem is caused by the unwanted attractive forces between map points that represent moderately dissimilar datapoints, one solution is to model these datapoints by a much larger distance in the map, which eliminates the attractive forces. This can be achieved by using a probability distribution that has much heavier tails than a Gaussian to convert distances into probabilities in the low-dimensional space. The Student t-distribution is selected because it is closely related to the Gaussian distribution but is much faster to evaluate computationally, since it does not involve an exponential. In addition, the t-distribution, as a heavier-tailed distribution, allows a moderate distance to be modeled by a larger distance in the map, which eliminates the unwanted attractive forces between dissimilar data points.<br />
<br />
In t-SNE, Student t-distribution with one degree of freedom is employed in the low-dimensional map. Based on the symmetric SNE, the joint probabilities in high-dimensional <math> \mathbf p_{ij} </math> are still<br />
<br />
<center> <math> \mathbf{p_{ij}=\frac{(p_{j|i}+p_{i|j})}{2n}} </math> </center><br />
<br />
while the joint probabilities <math> \mathbf q_{ij} </math> are defined as <br />
<br />
<center> <math> \mathbf q_{ij} = \frac{(1 + ||y_i-y_j ||^2 )^{-1}}{\sum_{k \neq l} (1 + ||y_k-y_l ||^2 )^{-1}}</math> </center><br />
<br />
The gradient of the Kullback-Leibler divergence between <math> P </math> and the Student-t based joint probability distribution <math> Q </math> is then given by<br />
<br />
<center> <math> \frac{\partial C}{\partial y_i} = 4\sum_{j} (y_i-y_j)(p_{ij}-q_{ij})(1 + ||y_i-y_j ||^2 )^{-1} </math> </center><br />
<br />
Compared with the gradients of SNE and UNI-SNE <ref> J.A. Cook, and I. Sutskever et al.. Visualizing similarity data with a mixture of maps. ''In Proceeding of the 11<sup>th</sup> International Conference on Artificial Intelligence and Statistics'', volume 2, page, 67-74, 2007.</ref>, the t-SNE gradient introduces strong repulsions between dissimilar datapoints that are modeled by small pairwise distances in the low-dimensional map. This effectively prevents the crowding problem mentioned above. At the same time, these repulsions do not go to infinity, which prevents dissimilar datapoints from being placed too far apart. Therefore, t-SNE models dissimilar datapoints by means of large pairwise distances and similar datapoints by means of small pairwise distances. This results in a good representation of both the local and global structure of the high-dimensional data.<br />
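The Student-t joint probabilities and the t-SNE gradient can be sketched together (illustrative helper; <code>P</code> is assumed symmetric with zero diagonal):<br />

```python
import numpy as np

def tsne_q_and_grad(Y, P):
    """Student-t joint probabilities q_{ij} and the t-SNE gradient."""
    diff = Y[:, None, :] - Y[None, :, :]          # y_i - y_j
    D = np.sum(diff ** 2, axis=-1)
    W = 1.0 / (1.0 + D)                           # (1 + ||y_i - y_j||^2)^{-1}
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()                               # joint distribution, q_{ii} = 0
    # dC/dy_i = 4 * sum_j (p_ij - q_ij) (1 + ||y_i - y_j||^2)^{-1} (y_i - y_j)
    G = 4.0 * np.sum(((P - Q) * W)[:, :, None] * diff, axis=1)
    return Q, G
```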
<br />
=== Optimization Methods for t-SNE ===<br />
One way to optimize the t-SNE cost function is to use a momentum term to reduce the number of required iterations. To further improve the results, two tricks called "early compression" and "early exaggeration" can be used. "Early compression" forces the map points to stay close together at the early stages of the optimization, so that it is easy to explore the space of possible global organizations of the data. "Early exaggeration" multiplies all the <math> \mathbf p_{ij} </math>'s by a factor greater than 1 in the initial stages of the optimization. This makes all the <math> \mathbf q_{ij} </math>'s too small to model their corresponding <math> \mathbf p_{ij} </math>'s, so that the optimization is forced to focus on large <math> \mathbf p_{ij} </math>'s. This leads to the formation of tight, widely separated clusters in the map, which makes it very easy to move the clusters around for a good global organization.<br />
<br />
==Experiments with Different Data Sets==<br />
The authors performed t-SNE on five data sets and compared the results with seven other non-parametric dimensionality reduction techniques to evaluate t-SNE. The five data sets that were employed are: (1) the MNIST data set, (2) the Olivetti faces data set, (3) the COIL-20 data set, (4) the word-feature data set, and (5) the Netflix data set. <br />
<br />
When t-SNE was performed on the MNIST data set, it constructed a map with clear and clean separations between the different digit classes. At the same time, most of the local structure of the data is captured as well. On the other hand, Isomap and LLE provide very little insight into the class structure of the data, while the Sammon map models the classes fairly well but does not separate them clearly. More experimental results and comparisons are presented in the paper and supplemental materials.<br />
<br />
==t-SNE for Large Data Sets==<br />
Due to its computational and memory complexity, it is infeasible to apply the standard version of t-SNE to large data sets (those containing more than 10,000 data points). To solve this problem, t-SNE is modified to display a random subset of landmark points in a way that uses information from the whole data set. First, a neighborhood graph over all the data points is created using a selected number of neighbors. Then, for each landmark point, a random walk is defined, which starts from that landmark point and terminates as soon as it lands on another landmark point. <math> \mathbf p_{j|i} </math> denotes the fraction of random walks starting at landmark point <math> x_i </math> that terminate at landmark point <math> x_j </math>. To avoid the "short-circuits" caused by noisy datapoints, the random walk-based affinity measure integrates over all paths through the neighborhood graph. The random walk-based similarities <math> \mathbf p_{j|i} </math> can be computed by explicitly performing the random walks on the neighborhood graph, or by using an analytical solution <ref> L. Grady, 2006, Random walks for image segmentation. ''IEEE Transactions on Pattern Analysis and Machine Intelligence'', 28(11): 1768-1783, 2006. </ref>, which is more appropriate for very large data sets.<br />
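The analytic alternative to simulating walks can be sketched with the standard absorbing-Markov-chain solution (a simplification of the paper's scheme: here every landmark absorbs, including the walk's own start):<br />

```python
import numpy as np

def random_walk_affinities(T, landmarks):
    """Absorption probabilities on a neighbourhood graph (analytic,
    Grady-style solution; simplified so that every landmark is absorbing).

    T: (n x n) row-stochastic transition matrix of the graph.
    landmarks: indices of the landmark points.
    Returns P with P[a, b] = probability that a walk started at landmark a
    is first absorbed at landmark b (after at least one step).
    """
    n = T.shape[0]
    L = np.asarray(landmarks)
    t = np.setdiff1d(np.arange(n), L)             # transient (non-landmark) nodes
    # Fundamental-matrix solution: B[t, b] = prob. of absorption at landmark b
    B = np.linalg.solve(np.eye(len(t)) - T[np.ix_(t, t)], T[np.ix_(t, L)])
    # One explicit first step from each landmark, then absorb.
    return T[np.ix_(L, L)] + T[np.ix_(L, t)] @ B
```

On a 5-node chain graph with landmarks at the ends, this reproduces the classic gambler's-ruin absorption probabilities.<br />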
<br />
==Weaknesses of t-SNE==<br />
Although t-SNE has been demonstrated to be a favorable technique for data visualization, it has three potential weaknesses. (1) The paper only focuses on data visualization using t-SNE, that is, embedding high-dimensional data into a two- or three-dimensional space. The behavior of t-SNE presented in the paper cannot readily be extrapolated to d > 3 dimensions, due to the heavy tails of the Student t-distribution. (2) t-SNE might be less successful when applied to data sets with a high intrinsic dimensionality. This is a result of the local linearity assumption on the manifold that t-SNE makes by employing Euclidean distance to represent the similarity between datapoints. (3) Another major weakness of t-SNE is that its cost function is not convex. As a result, several optimization parameters need to be chosen, and the constructed solutions, which depend on these parameters, may differ each time t-SNE is run from an initial random configuration of the map points.<br />
<br />
==References==<br />
<references/></div>Myakhavehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=visualizing_Data_using_t-SNE&diff=3605visualizing Data using t-SNE2009-07-28T21:40:11Z<p>Myakhave: /* The Crowding Problem */</p>
<hr />
<div>==Introduction==<br />
The paper <ref>Laurens van der Maaten, and Geoffrey Hinton. Visualizing Data using t-SNE. ''Journal of Machine Learning Research'', 9: 2579-2605, 2008</ref> introduced a new nonlinear dimensionally reduction technique that "embeds" high-dimensional data into low-dimensional space. This technique is a variation of the Stochastic Neighbor embedding (SNE) that was proposed by Hinton and Roweis in 2002 <ref>G.E. Hinton and S.T. Roweis. Stochastic Neighbor embedding. In ''Advances in Neural Information Processing Systems'', vol. 15, pp, 883-840, Cambridge, MA, USA, 2002. The MIT Press.</ref>, where the high-dimensional Euclidean distances between datapoints are converted into the conditional probability to describe their similarities. t-SNE, based on the same idea, is aimed to be easier for optimization and to solve the "crowding problem". In addition, the author showed that t-SNE can be applied to large data sets as well, by using random walks on neighborhood graphs. The performance of t-SNE is demonstrated on a wide variety of data sets and compared with many other visualization techniques.<br />
<br />
==Stochastic Neighbor Embedding==<br />
In SNE, the high-dimensional Euclidean distances between datapoints are first converted into probabilities. The similarity of datapoint <math> \mathbf x_j </math> to datapoint <math> \mathbf x_i </math> is represented by the conditional probability, <math> \mathbf p_{j|i} </math>, that <math> \mathbf x_i </math> would pick <math> \mathbf x_j </math> as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered on <math> \mathbf x_i </math>. The <math> \mathbf p_{j|i} </math> is given as<br />
<br />
<br> <center> <math> \mathbf p_{j|i} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma_i ^2 )}{\sum_{k \neq i} \exp(-||x_i-x_k ||^2/ 2\sigma_i ^2 ) }</math> </center> <br />
<br />
where the sum in the denominator runs over all datapoints <math> \mathbf x_k </math> other than <math> \mathbf x_i </math>, <math> \mathbf \sigma_i </math> is the variance of the Gaussian that is centered on <math> \mathbf x_i </math>, and for every <math> \mathbf x_i </math>, we set <math> \mathbf p_{i|i} = 0 </math>. It can be seen from this definition that the closer two datapoints are, the higher <math> \mathbf p_{j|i} </math> is; for widely separated datapoints, <math> \mathbf p_{j|i} </math> is almost infinitesimal. <br />
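As a concrete illustration, the conditional probabilities can be computed in a few lines of NumPy. This is only a sketch of the formula above, not the authors' code; the function name and the vector <code>sigma</code> of per-point bandwidths are hypothetical.<br />

```python
import numpy as np

def conditional_probabilities(X, sigma):
    """Compute p_{j|i} for all pairs: the probability that datapoint x_i
    would pick x_j as its neighbor under a Gaussian centered on x_i with
    variance sigma[i]**2.  X is (n, d); sigma is (n,)."""
    sq_norms = np.sum(X ** 2, axis=1)
    # squared Euclidean distances ||x_i - x_j||^2
    D = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    P = np.exp(-D / (2.0 * sigma[:, None] ** 2))
    np.fill_diagonal(P, 0.0)              # p_{i|i} = 0
    P /= P.sum(axis=1, keepdims=True)     # each row is a distribution
    return P
```

Each row of the returned matrix is the distribution <math> \mathbf P_i </math> over all other datapoints.<br />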
<br />
Analogously, in the low-dimensional space, we model the similarity of map point <math> \mathbf y_j </math> to <math> \mathbf y_i </math> by the conditional probability <math> \mathbf q_{j|i} </math>, which is given by<br />
<br />
<br> <center> <math> q_{j|i} = \frac{\exp(-||y_i-y_j ||^2)}{\sum_{k \neq i} \exp(-||y_i-y_k ||^2) }</math> </center><br />
<br />
where the variance of the Gaussian is set to <math> \frac{1}{\sqrt{2} } </math> (a different value would only rescale the final map). Again, we set <math> \mathbf q_{i|i} = 0 </math>.<br />
<br />
If the low-dimensional map points correctly represent the high-dimensional datapoints, the conditional probabilities <math> \mathbf q_{j|i} </math> and <math> \mathbf p_{j|i} </math> should be equal. Therefore, the aim of SNE is to minimize the mismatch between <math> \mathbf q_{j|i} </math> and <math> \mathbf p_{j|i} </math>. This is achieved by minimizing the sum of the Kullback-Leibler divergences (a non-symmetric measure of the difference between two probability distributions) over all datapoints. The cost function of SNE is then expressed as <br />
<br />
<br> <center> <math> C = \sum_{i} KL(P_i||Q_i) =\sum_{i}\sum_{j \neq i} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}</math> </center><br />
<br />
where <math> \mathbf P_i </math> and <math> \mathbf Q_i </math> are the conditional probability distributions over all other points given <math> \mathbf x_i </math> and <math> \mathbf y_i </math>, respectively. Since the Kullback-Leibler divergence is asymmetric, there is a large cost for using a small <math> \mathbf q_{j|i} </math> to model a large <math> \mathbf p_{j|i} </math>, but only a small cost for using a large <math> \mathbf q_{j|i} </math> to model a small <math> \mathbf p_{j|i} </math>. Therefore, the SNE cost function focuses mainly on local structure: it keeps the images of nearby objects nearby while keeping the images of widely separated objects relatively far apart.<br />
<br />
The remaining parameter <math> \mathbf \sigma_i </math> is selected by performing a binary search for the value of <math> \mathbf \sigma_i </math> that produces a <math> \mathbf P_i </math> with a fixed perplexity (a smooth measure of the effective number of neighbors, defined as two to the power of the Shannon entropy of <math>P_i</math>) that is specified by the user.<br />
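The binary search can be sketched as follows. This is a minimal illustration under the assumption that bisecting on <math> \mathbf \sigma_i </math> directly (rather than on the precision, as some implementations do) is acceptable; the function name and tolerances are hypothetical.<br />

```python
import numpy as np

def sigma_for_perplexity(sq_dists, target_perplexity, tol=1e-4, max_iter=100):
    """Binary search for the bandwidth sigma_i such that the perplexity
    2**H(P_i) of the conditional distribution P_i matches the target.
    sq_dists holds the squared distances ||x_i - x_j||^2 for all j != i."""
    lo, hi = 0.0, np.inf
    sigma = 1.0
    for _ in range(max_iter):
        p = np.exp(-sq_dists / (2.0 * sigma ** 2))
        p /= p.sum()
        entropy = -np.sum(p * np.log2(p + 1e-12))  # Shannon entropy H(P_i)
        perplexity = 2.0 ** entropy
        if abs(perplexity - target_perplexity) < tol:
            break
        if perplexity > target_perplexity:
            hi = sigma  # too many effective neighbors: shrink sigma
        else:
            lo = sigma  # too few effective neighbors: grow sigma
        sigma = 2.0 * lo if np.isinf(hi) else (lo + hi) / 2.0
    return sigma
```

Perplexity increases monotonically with <math> \mathbf \sigma_i </math> (from 1 for a vanishing bandwidth up to the number of neighbors for a flat one), which is what makes the binary search work.<br />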
<br />
To minimize the cost function, a gradient descent method is used. The gradient is given as<br />
<br />
<br> <center> <math> \frac{\partial C}{\partial y_i} = 2\sum_{j} (y_i-y_j)([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) </math> </center><br />
<br />
which is simple and has a nice physical interpretation: the gradient can be seen as the resultant force induced by a set of springs between the map point <math> \mathbf y_i </math> and all other map points <math> \mathbf y_j </math>, where each force is exerted along the direction <math> \mathbf (y_i-y_j) </math> and the stiffness of the spring is <math> \mathbf ([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) </math>.<br />
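The spring-force form of this gradient vectorizes naturally. The sketch below assumes <math> P </math> and <math> Q </math> are passed as matrices with <code>P[i, j]</code> holding <math> p_{j|i} </math>; the function name is hypothetical.<br />

```python
import numpy as np

def sne_gradient(Y, P, Q):
    """Spring-force form of the SNE gradient.  P[i, j] = p_{j|i} and
    Q[i, j] = q_{j|i}; the stiffness of the spring between y_i and y_j
    is (p_{j|i} - q_{j|i}) + (p_{i|j} - q_{i|j})."""
    S = (P - Q) + (P - Q).T                  # symmetric stiffness matrix
    # grad_i = 2 * sum_j S[i, j] * (y_i - y_j)
    return 2.0 * (S.sum(axis=1)[:, None] * Y - S @ Y)
```

When <math> \mathbf q_{j|i} = p_{j|i} </math> for all pairs, every spring has zero stiffness and the gradient vanishes, as expected at a perfect embedding.<br />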
<br />
==t-Distributed Stochastic Neighbor Embedding==<br />
Although SNE produces relatively good visualizations, it has two main problems: it is difficult to optimize, and it suffers from the "crowding problem". t-Distributed Stochastic Neighbor Embedding (t-SNE), a variation of SNE, aims to alleviate these problems. The cost function of t-SNE differs from that of SNE in two ways: (1) it uses a symmetric version of the SNE cost function, and (2) it uses a Student t-distribution instead of a Gaussian to compute the similarities between points in the low-dimensional space. <br />
<br />
=== Symmetric SNE ===<br />
In symmetric SNE, instead of the sum of the Kullback-Leibler divergences between the conditional probabilities, the cost function is a single Kullback-Leibler divergence between two joint probability distributions, <math> \mathbf P </math> in the high-dimensional space and <math> \mathbf Q </math> in the low-dimensional space.<br />
<br />
In this case, the pairwise similarities of the datapoints in the high-dimensional space are given by<br />
<br />
<center> <math> \mathbf p_{ij} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma^2 )}{\sum_{k \neq l} \exp(-||x_k-x_l ||^2/ 2\sigma^2 ) }</math> </center><br />
<br />
and <math> \mathbf q_{ij} </math> in low-dimensional space is<br />
<br />
<center> <math> \mathbf q_{ij} = \frac{\exp(-||y_i-y_j ||^2 )}{\sum_{k \neq l} \exp(-||y_k-y_l ||^2) }</math> </center><br />
<br />
where <math> \mathbf p_{ii} </math> and <math> \mathbf q_{ii} </math> are still zero. The definition of <math> \mathbf p_{ij} </math> above is problematic when a high-dimensional datapoint <math> \mathbf x_i </math> is an outlier (far from all the other points), because all of its pairwise probabilities are then very small. Instead, we set <math> \mathbf{p_{ij}=\frac {(p_{j|i}+p_{i|j})}{2n}} </math>, which ensures that <math>\sum_{j} p_{ij} > \frac {1}{2n} </math> for all <math> \mathbf x_i </math>. This makes sure that every <math> \mathbf x_i </math> makes a significant contribution to the cost function, which is given as<br />
<br />
<center> <math> C = KL(P||Q) =\sum_{i}\sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}}</math> </center><br />
<br />
As we can see, by definition, we have <math> \mathbf p_{ij} = p_{ji} </math> and <math> \mathbf q_{ij} = q_{ji} </math>. This is why it is called symmetric SNE.<br />
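The symmetrization of the conditional probabilities is a one-liner. The sketch below is illustrative (the function name is hypothetical); it assumes each row of the input already sums to 1, which is what guarantees the <math>\frac {1}{2n}</math> lower bound mentioned above.<br />

```python
import numpy as np

def joint_probabilities(P_cond):
    """Symmetrize conditional probabilities into joints
    p_ij = (p_{j|i} + p_{i|j}) / (2n).  Because each row of P_cond sums
    to 1, sum_j p_ij > 1/(2n) for every datapoint, so even outliers
    contribute significantly to the cost function."""
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)
```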
<br />
From the cost function, we have the gradient as simple as<br />
<br />
<center> <math> \frac{\partial C}{\partial y_i} = 4\sum_{j} (y_i-y_j)(p_{ij}-q_{ij}) </math> </center><br />
<br />
which is the main advantage of symmetric SNE.<br />
<br />
=== The Crowding Problem ===<br />
The "crowding problem" that is addressed in the paper is defined as: "the area of the two-dimensional map that is available to accommodate moderately distant datapoints will not be nearly large enough compared with the area available to accommodate nearby datapoints". This happens when the datapoints are distributed in a region on a high-dimensional manifold around <math> i </math>, and we try to model the pairwise distances from <math> i </math> to the datapoints in a two-dimensional map. For example, it is possible to have 11 datapoints that are mutually equidistant in a ten-dimensional manifold, but it is not possible to model this faithfully in a two-dimensional map. Therefore, if the small distances are modeled accurately in a map, most of the moderately distant datapoints will be too far away in the two-dimensional map. In SNE, this results in very small attractive forces from datapoint <math> i </math> to these too-distant map points. The very large number of such forces collapses the points in the center of the map and prevents gaps from forming between the natural clusters. This phenomenon, the crowding problem, is not specific to SNE and can be observed in other local techniques such as Sammon mapping as well.<br /><br />
According to Cook et al. (2007), adding a slight repulsion can address this problem. Using a uniform background model with a small mixing proportion, <math>\,\rho</math>, ensures that <math>\,q_{ij}</math> never falls below <math>\frac{2\rho}{n(n-1)}</math>. In this technique, called UNI-SNE, <math>\,q_{ij}</math> will be larger than <math>\,p_{ij}</math> even for far-apart datapoints.<br />
<br />
=== Compensation for Mismatched Dimensionality by Mismatched Tails ===<br />
Since the crowding problem is caused by the unwanted attractive forces between map points that represent moderately dissimilar datapoints, one solution is to model these datapoints by a much larger distance in the map, which eliminates the attractive forces. This can be achieved by using a probability distribution that has much heavier tails than a Gaussian to convert the distances into probabilities in the low-dimensional space. The Student t-distribution is selected because it is closely related to the Gaussian distribution, but it is much faster to evaluate computationally since it does not involve an exponential. <br />
<br />
In t-SNE, a Student t-distribution with one degree of freedom is employed in the low-dimensional map. As in symmetric SNE, the joint probabilities <math> \mathbf p_{ij} </math> in the high-dimensional space are still<br />
<br />
<center> <math> \mathbf{p_{ij}=\frac{(p_{j|i}+p_{i|j})}{2n}} </math> </center><br />
<br />
while the joint probabilities <math> \mathbf q_{ij} </math> are defined as <br />
<br />
<center> <math> \mathbf q_{ij} = \frac{(1 + ||y_i-y_j ||^2 )^{-1}}{\sum_{k \neq l} (1 + ||y_k-y_l ||^2 )^{-1}}</math> </center><br />
<br />
The gradient of the Kullback-Leibler divergence between <math> P </math> and the Student-t based joint probability distribution <math> Q </math> is then given by<br />
<br />
<center> <math> \frac{\partial C}{\partial y_i} = 4\sum_{j} (y_i-y_j)(p_{ij}-q_{ij})(1 + ||y_i-y_j ||^2 )^{-1} </math> </center><br />
<br />
Compared with the gradients of SNE and UNI-SNE <ref> J.A. Cook, I. Sutskever, A. Mnih, and G.E. Hinton. Visualizing similarity data with a mixture of maps. In ''Proceedings of the 11<sup>th</sup> International Conference on Artificial Intelligence and Statistics'', volume 2, pages 67-74, 2007.</ref>, the t-SNE gradient introduces strong repulsions between dissimilar datapoints that are modeled by small pairwise distances in the low-dimensional map. This effectively prevents the crowding problem mentioned above. At the same time, these repulsions do not go to infinity, which prevents dissimilar datapoints from being pushed too far apart. Therefore, t-SNE models dissimilar datapoints by means of large pairwise distances and similar datapoints by means of small pairwise distances. This results in a good representation of both the local and the global structure of the high-dimensional data.<br />
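The Student-t similarities and the gradient above can be computed together in vectorized form. This is a sketch, not the authors' implementation; the function name is hypothetical.<br />

```python
import numpy as np

def tsne_gradient(Y, P):
    """t-SNE gradient for map points Y (n x 2) given joint probabilities P.
    Low-dimensional similarities q_ij use a Student t-distribution with
    one degree of freedom.  Returns (gradient, Q)."""
    sq_norms = np.sum(Y ** 2, axis=1)
    # (1 + ||y_i - y_j||^2)^{-1}
    num = 1.0 / (1.0 + sq_norms[:, None] + sq_norms[None, :] - 2.0 * Y @ Y.T)
    np.fill_diagonal(num, 0.0)
    Q = num / num.sum()                    # q_ij
    PQ = (P - Q) * num                     # (p_ij - q_ij)(1 + ||y_i - y_j||^2)^{-1}
    grad = 4.0 * (PQ.sum(axis=1)[:, None] * Y - PQ @ Y)
    return grad, Q
```

Note that the extra factor <math> (1 + ||y_i-y_j ||^2 )^{-1} </math> is exactly what bounds the repulsive forces between distant map points.<br />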
<br />
=== Optimization Methods for t-SNE ===<br />
One way to optimize the t-SNE cost function is to use a momentum term to reduce the number of required iterations. To further improve the results, two tricks called "early compression" and "early exaggeration" can be used. "Early compression" forces the map points to stay close together during the early stages of the optimization, so that it is easy to explore the space of possible global organizations of the data. "Early exaggeration" multiplies all the <math> \mathbf p_{ij} </math>'s by a factor greater than 1 in the initial stages of the optimization. This makes all the <math> \mathbf q_{ij} </math>'s too small to model their corresponding <math> \mathbf p_{ij} </math>'s, so that the optimization is forced to focus on the large <math> \mathbf p_{ij} </math>'s. This leads to the formation of tight, widely separated clusters in the map, which makes it very easy to move the clusters around to find a good global organization.<br />
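An optimization loop with momentum and early exaggeration might look as follows. The learning rate, exaggeration factor, momentum schedule and iteration counts below are illustrative choices, not the paper's exact settings.<br />

```python
import numpy as np

def tsne_optimize(P, n_iter=400, learning_rate=10.0, exaggeration=4.0, seed=0):
    """Sketch of the t-SNE optimization loop: gradient descent with a
    momentum term plus "early exaggeration" (all p_ij multiplied by a
    factor > 1 during the first iterations).  P is the (n x n) joint
    probability matrix; returns the 2-D map Y."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    Y = 1e-4 * rng.normal(size=(n, 2))           # small random initial map
    velocity = np.zeros_like(Y)
    for t in range(n_iter):
        P_eff = P * exaggeration if t < 100 else P   # early exaggeration phase
        sq = np.sum(Y ** 2, axis=1)
        num = 1.0 / (1.0 + sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T)
        np.fill_diagonal(num, 0.0)
        Q = num / num.sum()
        PQ = (P_eff - Q) * num
        grad = 4.0 * (PQ.sum(axis=1)[:, None] * Y - PQ @ Y)
        momentum = 0.5 if t < 250 else 0.8           # small momentum first
        velocity = momentum * velocity - learning_rate * grad
        Y += velocity
    return Y
```

On a toy joint-probability matrix with block (cluster) structure, this loop should place points of the same block closer together than points of different blocks.<br />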
<br />
==Experiments with Different Data Sets==<br />
The authors performed t-SNE on five data sets and compared the results with seven other non-parametric dimensionality reduction techniques to evaluate t-SNE. The five data sets that were employed are: (1) the MNIST data set, (2) the Olivetti faces data set, (3) the COIL-20 data set, (4) the word-feature data set, and (5) the Netflix data set. <br />
<br />
When t-SNE was performed on the MNIST data set, it constructed a map with clear and clean separations between the different digit classes, while capturing most of the local structure of the data as well. On the other hand, Isomap and LLE provide very little insight into the class structure of the data, while the Sammon map models the classes fairly well but does not separate them clearly. More experimental results and comparisons are presented in the paper and its supplemental material.<br />
<br />
==t-SNE for Large Data Sets==<br />
Due to its computational and memory complexity, it is infeasible to apply the standard version of t-SNE to large data sets (those containing more than 10,000 datapoints). To solve this problem, t-SNE is modified to display a random subset of landmark points in a way that uses the information of the whole data set. First, a neighborhood graph over all the datapoints is created using a selected number of neighbors. Then, for each landmark point, random walks are defined that start at that landmark point and terminate as soon as they land on another landmark point; <math> \mathbf p_{j|i} </math> denotes the fraction of random walks that start at landmark point <math> x_i </math> and terminate at landmark point <math> x_j </math>. Because the random walk-based affinity measure integrates over all paths through the neighborhood graph, it is not sensitive to "short-circuits" caused by noisy datapoints. The random walk-based similarities <math> \mathbf p_{j|i} </math> can be computed by explicitly performing the random walks on the neighborhood graph, or by using an analytical solution <ref> L. Grady. Random walks for image segmentation. ''IEEE Transactions on Pattern Analysis and Machine Intelligence'', 28(11): 1768-1783, 2006. </ref>, which is more appropriate for very large data sets.<br />
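The explicit-simulation variant can be sketched directly from this description. The function below is a minimal illustration, assuming a dense 0/1 adjacency matrix for the neighborhood graph and uniform transitions over a node's neighbors; the function name and walk counts are hypothetical, and a real implementation for very large data sets would use the analytical solution instead.<br />

```python
import numpy as np

def random_walk_similarities(adjacency, landmarks, n_walks=200, max_steps=10000, seed=0):
    """Estimate p_{j|i} as the fraction of random walks on the neighborhood
    graph that start at landmark i and first land on another landmark, j.
    adjacency is a dense 0/1 k-nearest-neighbor graph; walks that never
    reach another landmark within max_steps are simply discarded."""
    rng = np.random.default_rng(seed)
    landmark_index = {v: c for c, v in enumerate(landmarks)}
    P = np.zeros((len(landmarks), len(landmarks)))
    for a, start in enumerate(landmarks):
        for _ in range(n_walks):
            node = start
            for _ in range(max_steps):
                node = rng.choice(np.flatnonzero(adjacency[node]))
                if node != start and int(node) in landmark_index:
                    P[a, landmark_index[int(node)]] += 1.0
                    break
    return P / P.sum(axis=1, keepdims=True)
```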
<br />
==Weaknesses of t-SNE==<br />
Although t-SNE has been demonstrated to be a favorable technique for data visualization, it has three potential weaknesses. (1) The paper focuses only on data visualization, that is, on embedding high-dimensional data into a two- or three-dimensional space. The behavior of t-SNE presented in the paper cannot readily be extrapolated to d > 3 dimensions because of the heavy tails of the Student t-distribution. (2) t-SNE might be less successful when applied to data sets with a high intrinsic dimensionality. This is a result of the local-linearity assumption on the manifold that t-SNE makes by employing Euclidean distances to represent the similarities between datapoints. (3) Another major weakness of t-SNE is that its cost function is not convex. As a result, several optimization parameters need to be chosen, and the constructed solutions, which depend on these parameters, may differ each time t-SNE is run from a random initial configuration of the map points.<br />
<br />
==References==<br />
<references/></div>
<hr />
<div>==Introduction==<br />
The paper <ref>Laurens van der Maaten, and Geoffrey Hinton. Visualizing Data using t-SNE. ''Journal of Machine Learning Research'', 9: 2579-2605, 2008</ref> introduced a new nonlinear dimensionally reduction technique that "embeds" high-dimensional data into low-dimensional space. This technique is a variation of the Stochastic Neighbor embedding (SNE) that was proposed by Hinton and Roweis in 2002 <ref>G.E. Hinton and S.T. Roweis. Stochastic Neighbor embedding. In ''Advances in Neural Information Processing Systems'', vol. 15, pp, 883-840, Cambridge, MA, USA, 2002. The MIT Press.</ref>, where the high-dimensional Euclidean distances between datapoints are converted into the conditional probability to describe their similarities. t-SNE, based on the same idea, is aimed to be easier for optimization and to solve the "crowding problem". In addition, the author showed that t-SNE can be applied to large data sets as well, by using random walks on neighborhood graphs. The performance of t-SNE is demonstrated on a wide variety of data sets and compared with many other visualization techniques.<br />
<br />
==Stochastic Neighbor Embedding==<br />
In SNE, the high-dimensional Euclidean distances between datapoints is first converted into probabilities. The similarity of datapoint <math> \mathbf x_j </math> to datapoint <math> \mathbf x_i </math> is then presented by the conditional probability, <math> \mathbf p_{j|i} </math>, that <math> \mathbf x_i </math> would pick <math> \mathbf x_j </math> as its neighbor when neighbors are picked in proportion to their probability density under a Gaussian centered on <math> \mathbf x_i </math>. The <math> \mathbf p_{j|i} </math> is given as<br />
<br />
<br> <center> <math> \mathbf p_{j|i} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma_i ^2 )}{\sum_{k \neq i} \exp(-||x_i-x_k ||^2/ 2\sigma_i ^2 ) }</math> </center> <br />
<br />
where <math> \mathbf k </math> is the effective number of the local neighbors, <math> \mathbf \sigma_i </math> is the variance of the Gaussian that is centered on <math> \mathbf x_i </math>, and for every <math> \mathbf x_i </math>, we set <math> \mathbf p_{i|i} = 0 </math>. It can be seen from this definition that, the closer the datapoints are, the higher the <math> \mathbf p_{j|i} </math> is. For the widely separated datapoints, <math> \mathbf p_{j|i} </math> is almost infinitesimal. <br />
<br />
With the same idea, in the low-dimensional space, we model the similarity of map point <math> \mathbf y_j </math> to <math> \mathbf y_i </math> by the conditional probability <math> \mathbf q_{j|i} </math>, which is given by<br />
<br />
<br> <center> <math> q_{j|i} = \frac{\exp(-||y_i-y_j ||^2)}{\sum_{k \neq i} \exp(-||y_i-y_k ||^2) }</math> </center><br />
<br />
where we set the variance of the Gaussian <math> \mathbf \sigma_i </math> to be <math> \frac{1}{\sqrt{2} } </math> (a different value will only result in rescaling of the final map). And again, we set <math> \mathbf q_{i|i} = 0 </math>.<br />
<br />
If the low-dimensional map points correctly present the high-dimensional datapoints, their conditional probabilities <math> \mathbf q_{j|i} </math> and <math> \mathbf p_{j|i} </math> should be equal. Therefore, the aim of SNE is to minimize the mismatch between <math> \mathbf q_{j|i} </math> and <math> \mathbf p_{j|i} </math>. This is achieved by minimizing the sum of Kullback-leibler divergence (a non-symmetric measure of the difference between two probability distributions) over all datapoints. The cost function of SNE is then expressed as <br />
<br />
<br> <center> <math> C = \sum_{i} KL(P_i||Q_i) =\sum_{i}\sum_{j \neq i} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}</math> </center><br />
<br />
where <math> \mathbf P_i </math> and <math> \mathbf Q_i </math> are the conditional probability distribution over all other points for given <math> \mathbf x_i </math> and <math> \mathbf y_i </math>. Since the Kullback-leibler divergence is asymmetric, there is a large cost for using a small <math> \mathbf q_{j|i} </math> to model a big <math> \mathbf p_{j|i} </math>, while a small cost for using a large <math> \mathbf q_{j|i} </math> to model a small <math> \mathbf p_{j|i} </math>. Therefore, the SNE cost function focuses more on local structure. It enforces both keeping the images of nearby objects nearby and keeping the images of widely separated objects relatively far apart.<br />
<br />
The remaining parameter <math> \mathbf \sigma_i </math> here is selected by performing a binary search for the value of <math> \mathbf \sigma_i </math> that produces a <math> \mathbf P_i </math> with a fixed perplexity (a measure of the effective number of neighbors, which is related to <math> \mathbf k </math>, defined as the two to the power of Shannon entropy of <math>P_i</math>) that is selected by the user.<br />
<br />
To minimize the cost function, gradient descent method is used. The gradient then is given as<br />
<br />
<br> <center> <math> \frac{\partial C}{\partial y_i} = 2\sum_{j} (y_i-y_j)([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) </math> </center><br />
<br />
which is simple and has a nice physical interpretation. The gradient can be seen as the resultant force induced by a set of springs between the map point <math> \mathbf y_i </math> and all other neighbor points <math> \mathbf y_j </math>, where the force is exerted in the direction <math> \mathbf (y_i-y_j) </math> and the stiffness of the spring is <math> \mathbf ([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) </math>.<br />
<br />
==t-Distributed Stochastic Neighbor Embedding==<br />
Although SNE showed relatively good visualizations, it has two main problems: difficulty in optimization and the "crowding problem". t-Distributed Stochastic Neighbor Embedding (t-SNE), which is a variation of SNE, is aimed to alleviate these problems. The cost function of t-SNE differs from the one of SNE in two ways: (1) it uses a symmetric version of the SNE cost function, and (2) it uses a Student-t distribution instead of Gaussian to compute the conditional probability in the low-dimensional space. <br />
<br />
=== Symmetric SNE ===<br />
In symmetric SNE, instead of the sum of the Kullback-Leibler divergences between the conditional probabilities, the cost function is a single Kullback-Leibler divergence between two joint probability distributions, <math> \mathbf P </math> in the high-dimensional space and <math> \mathbf Q </math> in the low-dimensional space.<br />
<br />
In this case, the pairwise similarities of the data points in high-dimensional space is given by,<br />
<br />
<center> <math> \mathbf p_{ij} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma^2 )}{\sum_{k \neq l} \exp(-||x_k-x_l ||^2/ 2\sigma^2 ) }</math> </center><br />
<br />
and <math> \mathbf q_{ij} </math> in low-dimensional space is<br />
<br />
<center> <math> \mathbf q_{ij} = \frac{\exp(-||y_i-y_j ||^2 )}{\sum_{k \neq l} \exp(-||y_k-y_l ||^2) }</math> </center><br />
<br />
where <math> \mathbf p_{ii} </math> and <math> \mathbf q_{ii} </math> are still zero. When a high-dimensional datapoint <math> \mathbf x_i </math> is a outlier (far from all the other points), we set <math> \mathbf{p_{ij}=\frac {(p_{j|i}+p_{i|j})}{2n}} </math> to ensure that <math>\sum_{j} p_{ij} > \frac {1}{2n} </math> for all <math> \mathbf x_i </math>. This will make sure that all <math> \mathbf x_i </math> make significant contribution to the cost function, which is given as<br />
<br />
<center> <math> C = KL(P||Q) =\sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}</math> </center><br />
<br />
As we can see, by definition, we have <math> \mathbf p_{ij} = p_{ji} </math> and <math> \mathbf q_{ij} = q_{ji} </math>. This is why it is called symmetric SNE.<br />
<br />
From the cost function, we have the gradient as simple as<br />
<br />
<center> <math> \frac{\partial C}{\partial y_i} = 4\sum_{j} (y_i-y_j)(p_{ij}-q_{ij}) </math> </center><br />
<br />
which is the main advantage of symmetric SNE.<br />
<br />
=== The Crowding Problem ===<br />
The "crowding problem" that are addressed in the paper is defined as: "the area of the two-dimensional map that is available to accommodate moderately distant datapoints will not be nearly large enough compared with the area available to accommodate nearby datepoints". This happens when the datapoints are distributed in a region on a high-dimensional manifold around <math> i </math>, and we try to model the pairwise distances from <math> i </math> to the datapoints in a two-dimensional map. For example, it is possible to have 11 datapoints that are mutually equidistant in a ten-dimensional manifold but it is not possible to model this faithfully in a two-dimensional map. Therefore, if the small distances can be modeled accurately in a map, most of the moderately distant datapoints will be too far away in the two-dimensional map. In SNE, this will result in very small attractive force from datapoint <math> i </math> to these too-distant map points. The very large number of such forces collapses together the points in the center of the map and prevents gaps from forming between the natural clusters. This phenomena, crowding problem, is not specific to SNE and can be observed in other local techniques such as Sammon mapping as well.<br /><br />
According to Cook et al.(2007), adding a slight repulsion can address this problem. Using a uniform backgorund model with a small mixing proportion, <math>\,\rho</math>, helps <math>\,q_{ij}</math> never fall below <math>\frac{2\rho}{n(n-1)}</math>.<br />
<br />
=== Compensation for Mismatched Dimensionality by Mismatched Tails ===<br />
Since the crowding problem is caused by the unwanted attractive forces between map points that present moderately dissimilar datapoints nearby, one solution is to model these datapoints by a much larger distance in the map to eliminates the attractive forces. This can be achieved by using a probability distribution that has much heavier tails than a Gaussian to convert the distances into probabilities in the low-dimensional space. Student t-distribution is selected because it is closely related to the Gaussian distribution, but it is much faster computationally since it does not involve any exponential. <br />
<br />
In t-SNE, Student t-distribution with one degree of freedom is employed in the low-dimensional map. Based on the symmetric SNE, the joint probabilities in high-dimensional <math> \mathbf p_{ij} </math> are still<br />
<br />
<center> <math> \mathbf{p_{ij}=\frac{(p_{j|i}+p_{i|j})}{2n}} </math> </center><br />
<br />
while the joint probabilities <math> \mathbf q_{ij} </math> are defined as <br />
<br />
<center> <math> \mathbf q_{ij} = \frac{(1 + ||y_i-y_j ||^2 )^{-1}}{\sum_{k \neq l} (1 + ||y_k-y_l ||^2 )^{-1}}</math> </center><br />
<br />
The gradient of the Kullback-Leibler divergence between <math> P </math> and the Student-t based joint probability distribution <math> Q </math> is then given by<br />
<br />
<center> <math> \frac{\partial C}{\partial y_i} = 4\sum_{j} (y_i-y_j)(p_{ij}-q_{ij})(1 + ||y_i-y_j ||^2 )^{-1} </math> </center><br />
<br />
Compared with the gradients of SNE and UNI-SNE <ref> J.A. Cook, and I. Sutskever et al.. Visualizing similarity data with a mixture of maps. ''In Proceeding of the 11<sup>th</sup> International Conference on Artificial Intelligence and Statistics'', volume 2, page, 67-74, 2007.</ref>, the t-SNE gradients introduces strong repulsions between the dissimilar datapoints that are modeled by small pairwise distance in the low-dimensional map. This well prevents the crowding problem that was mentioned above. At the same time, these repulsions do not go to infinity, which prevents the dissimilar datapoints from being too far apart. Therefore, the t-SNE models dissimilar datepoints by means of large pairwise distance, while models similar datapoints by means of small pairwise distance. This results in the good representation of both local and global structure of the high-dimensional data.<br />
<br />
=== Optimization Methods for t-SNE ===<br />
One ways to optimize the t-SNE cost function is to use a momentum term to reduce the number of required iteration. To further improve the modeling results, two tricks called "early compression" and "early exaggeration" can be used. The "early compression" is to force the map points to stay close together at the early stage of the optimization so that it is easy for explore the space of possible global organizations of the data. "Early exaggeration" is to multiply all the <math> \mathbf p_{ij} </math>'s by a <math> n>1 </math> in the initial stages of the optimization. This will make all the <math> \mathbf q_{ij} </math>'s too small to model their corresponding <math> \mathbf p_{ij} </math>'s, so that the modeling are forced to focus on large <math> \mathbf p_{ij} </math>'s. This leads to the formation of tight widely separated clusters in the map, which makes it very easy to move around the clusters for a good global organization.<br />
<br />
==Experiments with Different Data Sets==<br />
The author performed t-SNE on five data sets and compared the results with seven other non-parametric dimensional reduction techniques to evaluate t-SNE. The five data sets that were employed are: (1) the MNIST data set, (2) the Olivetti faces data set, (3) the COIL-20 data set, (4) the word-feature data set, and (5) the Netflix data set. <br />
<br />
When performed t-SNE on the MNIST data set, t-SNE constructed a map with clear and clean separations between different digit classes. At the same time, most of the local structures of the data is captured as well. On the another hand, Isomap and LLE provide very little insight into the class structure of the data, while Sammon map models the classes fairly well but does not separate them clearly. More experiment results and comparison is presented in the paper and supplemental materials.<br />
<br />
==t-SNE for Large Data Sets==<br />
Due to its computational and memory complexity, it is infeasible to apply the standard version of t-SNE to large data sets (which contain more than 10,000 data points). To solve this problem, t-SNE is modified to display a random set of landmark points in the way that uses the information of the whole data set. First, a neighborhood graph for all the data points is created under a selected number of neighbors. Then, for each of the selected landmark point, a random walk is defined, which starts from that landmark point and terminates as soon as it lands on another landmark point. <math> \mathbf p_{j|i} </math> denotes the fraction of random walk starting at landmark point <math> x_i </math> and terminate at landmark point <math> x_j </math>. To avoid the "short-circuits" caused by a noisy datapoint, the random walk-based affinity measure integrates over all paths through the neighborhood graph. The random walk-based similarities <math> \mathbf p_{j|i} </math> can be computed by explicitly performing the random walks on the neighborhood graph, or using an analytical solution <ref> L. Grady, 2006, Random walks for image segmentation. ''IEEE Transactions on Pattern Analysis and Machine Intelligence'', 28(11): 1768-1783, 2006. </ref>, which is more appropriate for very large data sets.<br />
<br />
==Weaknesses of t-SNE==<br />
Although t-SNE has demonstrated to be a favorable technique for data visualization, there are three potential weaknesses with this technique. (1) The paper only focuses on the date visualization using t-SNE, that is, embedding high-dimensional date into a two- or three-dimensional space. However, this behavior of t-SNE presented in the paper cannot readily be extrapolated to d>3 dimensions due to the heavy tails of the Student t-distribution. (2) t-SNE might be less successful when applied to data sets with a high intrinsic dimensionality. This is a result of the local linearity assumption on the manifold that t-SNE makes by employing Euclidean distance to present the similarity between the datapoints. (3) Another major weakness of t-SNE is that the cost function is not convex. This leads to the problem that several optimization parameters need to be chosen and the constructed solutions depending on these parameters may be different each time t-SNE is run from an initial random configuration of the map points.<br />
<br />
==References==<br />
<references/></div>Myakhavehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=visualizing_Data_using_t-SNE&diff=3603visualizing Data using t-SNE2009-07-28T21:31:31Z<p>Myakhave: /* The Crowding Problem */</p>
<hr />
<div>==Introduction==<br />
The paper <ref>Laurens van der Maaten, and Geoffrey Hinton. Visualizing Data using t-SNE. ''Journal of Machine Learning Research'', 9: 2579-2605, 2008.</ref> introduced a new nonlinear dimensionality reduction technique that "embeds" high-dimensional data into a low-dimensional space. This technique is a variation of Stochastic Neighbor Embedding (SNE), proposed by Hinton and Roweis in 2002 <ref>G.E. Hinton and S.T. Roweis. Stochastic Neighbor Embedding. In ''Advances in Neural Information Processing Systems'', vol. 15, pp. 833-840, Cambridge, MA, USA, 2002. The MIT Press.</ref>, in which the high-dimensional Euclidean distances between datapoints are converted into conditional probabilities that describe their similarities. t-SNE, based on the same idea, aims to be easier to optimize and to solve the "crowding problem". In addition, the authors showed that t-SNE can be applied to large data sets as well, by using random walks on neighborhood graphs. The performance of t-SNE is demonstrated on a wide variety of data sets and compared with many other visualization techniques.<br />
<br />
==Stochastic Neighbor Embedding==<br />
In SNE, the high-dimensional Euclidean distances between datapoints are first converted into probabilities. The similarity of datapoint <math> \mathbf x_j </math> to datapoint <math> \mathbf x_i </math> is represented by the conditional probability, <math> \mathbf p_{j|i} </math>, that <math> \mathbf x_i </math> would pick <math> \mathbf x_j </math> as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered on <math> \mathbf x_i </math>. The <math> \mathbf p_{j|i} </math> is given as<br />
<br />
<br> <center> <math> \mathbf p_{j|i} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma_i ^2 )}{\sum_{k \neq i} \exp(-||x_i-x_k ||^2/ 2\sigma_i ^2 ) }</math> </center> <br />
<br />
where <math> \mathbf k </math> is the effective number of local neighbors, <math> \mathbf \sigma_i </math> is the variance of the Gaussian that is centered on <math> \mathbf x_i </math>, and for every <math> \mathbf x_i </math>, we set <math> \mathbf p_{i|i} = 0 </math>. It can be seen from this definition that the closer two datapoints are, the higher <math> \mathbf p_{j|i} </math> is. For widely separated datapoints, <math> \mathbf p_{j|i} </math> is almost infinitesimal. <br />
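As an illustrative sketch (not the authors' reference implementation), the conditional probabilities defined above can be computed with NumPy for given bandwidths <math> \mathbf \sigma_i </math>:<br />

```python
import numpy as np

def conditional_probabilities(X, sigmas):
    """Compute p_{j|i} under a Gaussian centered on each x_i, as defined above.
    X has one datapoint per row; sigmas[i] is the bandwidth sigma_i."""
    # Pairwise squared Euclidean distances ||x_i - x_j||^2.
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    P = np.exp(-D / (2.0 * sigmas[:, None] ** 2))
    np.fill_diagonal(P, 0.0)            # p_{i|i} = 0 by definition
    P /= P.sum(axis=1, keepdims=True)   # normalize each row over k != i
    return P
```

Each row of the returned matrix sums to one, and nearby points receive much larger <math> \mathbf p_{j|i} </math> than distant ones.<br />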
<br />
With the same idea, in the low-dimensional space, we model the similarity of map point <math> \mathbf y_j </math> to <math> \mathbf y_i </math> by the conditional probability <math> \mathbf q_{j|i} </math>, which is given by<br />
<br />
<br> <center> <math> q_{j|i} = \frac{\exp(-||y_i-y_j ||^2)}{\sum_{k \neq i} \exp(-||y_i-y_k ||^2) }</math> </center><br />
<br />
where we set the variance of the Gaussian <math> \mathbf \sigma_i </math> to <math> \frac{1}{\sqrt{2} } </math> (a different value would only result in a rescaling of the final map). Again, we set <math> \mathbf q_{i|i} = 0 </math>.<br />
<br />
If the low-dimensional map points correctly represent the high-dimensional datapoints, the conditional probabilities <math> \mathbf q_{j|i} </math> and <math> \mathbf p_{j|i} </math> should be equal. Therefore, the aim of SNE is to minimize the mismatch between <math> \mathbf q_{j|i} </math> and <math> \mathbf p_{j|i} </math>. This is achieved by minimizing the sum of the Kullback-Leibler divergences (a non-symmetric measure of the difference between two probability distributions) over all datapoints. The cost function of SNE is then expressed as <br />
<br />
<br> <center> <math> C = \sum_{i} KL(P_i||Q_i) =\sum_{i}\sum_{j \neq i} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}</math> </center><br />
<br />
where <math> \mathbf P_i </math> and <math> \mathbf Q_i </math> are the conditional probability distributions over all other points given <math> \mathbf x_i </math> and <math> \mathbf y_i </math>, respectively. Since the Kullback-Leibler divergence is asymmetric, there is a large cost for using a small <math> \mathbf q_{j|i} </math> to model a large <math> \mathbf p_{j|i} </math>, but only a small cost for using a large <math> \mathbf q_{j|i} </math> to model a small <math> \mathbf p_{j|i} </math>. Therefore, the SNE cost function focuses more on local structure. It enforces both keeping the images of nearby objects nearby and keeping the images of widely separated objects relatively far apart.<br />
<br />
The remaining parameter <math> \mathbf \sigma_i </math> is selected by performing a binary search for the value of <math> \mathbf \sigma_i </math> that produces a <math> \mathbf P_i </math> with a fixed perplexity (a measure of the effective number of neighbors, related to <math> \mathbf k </math>, defined as two to the power of the Shannon entropy of <math>P_i</math>) that is selected by the user.<br />
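A minimal sketch of this binary search, assuming the squared distances from <math> \mathbf x_i </math> to the other points are already available, could look as follows (the bracket values, tolerance, and iteration cap are choices made for this example):<br />

```python
import numpy as np

def sigma_for_perplexity(dist_sq_row, target_perplexity, tol=1e-5, max_iter=50):
    """Binary-search sigma_i so that Perp(P_i) = 2^{H(P_i)} matches the
    user-chosen perplexity; dist_sq_row holds ||x_i - x_j||^2 for j != i."""
    lo, hi = 1e-10, 1e10
    sigma = 1.0
    for _ in range(max_iter):
        sigma = (lo + hi) / 2.0
        p = np.exp(-dist_sq_row / (2.0 * sigma ** 2))
        p /= p.sum()
        entropy = -np.sum(p * np.log2(p + 1e-12))  # Shannon entropy H(P_i)
        perplexity = 2.0 ** entropy
        if abs(perplexity - target_perplexity) < tol:
            break
        if perplexity > target_perplexity:
            hi = sigma   # too many effective neighbors: shrink sigma
        else:
            lo = sigma   # too few effective neighbors: grow sigma
    return sigma
```

The search works because perplexity increases monotonically with <math> \mathbf \sigma_i </math>: a larger bandwidth flattens <math> \mathbf P_i </math> and raises its entropy.<br />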
<br />
To minimize the cost function, a gradient descent method is used. The gradient is given as<br />
<br />
<br> <center> <math> \frac{\partial C}{\partial y_i} = 2\sum_{j} (y_i-y_j)([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) </math> </center><br />
<br />
which is simple and has a nice physical interpretation. The gradient can be seen as the resultant force induced by a set of springs between the map point <math> \mathbf y_i </math> and all other neighbor points <math> \mathbf y_j </math>, where the force is exerted in the direction <math> \mathbf (y_i-y_j) </math> and the stiffness of the spring is <math> \mathbf ([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) </math>.<br />
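The spring-force interpretation above translates directly into code. The following is an illustrative sketch (not the paper's implementation), where `P[i, j]` and `Q[i, j]` hold <math> p_{j|i} </math> and <math> q_{j|i} </math>:<br />

```python
import numpy as np

def sne_gradient(Y, P, Q):
    """dC/dy_i = 2 * sum_j (y_i - y_j) * ([p_{j|i}-q_{j|i}] + [p_{i|j}-q_{i|j}]),
    computed as a sum of spring forces acting on each map point y_i."""
    n = Y.shape[0]
    grad = np.zeros_like(Y)
    # stiffness[i, j] = [p_{j|i}-q_{j|i}] + [p_{i|j}-q_{i|j}]
    stiffness = (P - Q) + (P - Q).T
    for i in range(n):
        # force exerted along the direction (y_i - y_j), weighted by stiffness
        grad[i] = 2.0 * np.sum(stiffness[i][:, None] * (Y[i] - Y), axis=0)
    return grad
```

When the two distributions match exactly, every spring is at rest and the gradient vanishes.<br />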
<br />
==t-Distributed Stochastic Neighbor Embedding==<br />
Although SNE produces relatively good visualizations, it has two main problems: difficulty of optimization and the "crowding problem". t-Distributed Stochastic Neighbor Embedding (t-SNE), a variation of SNE, aims to alleviate these problems. The cost function of t-SNE differs from that of SNE in two ways: (1) it uses a symmetric version of the SNE cost function, and (2) it uses a Student t-distribution instead of a Gaussian to compute the probabilities in the low-dimensional space. <br />
<br />
=== Symmetric SNE ===<br />
In symmetric SNE, instead of the sum of the Kullback-Leibler divergences between the conditional probabilities, the cost function is a single Kullback-Leibler divergence between two joint probability distributions, <math> \mathbf P </math> in the high-dimensional space and <math> \mathbf Q </math> in the low-dimensional space.<br />
<br />
In this case, the pairwise similarities of the data points in the high-dimensional space are given by<br />
<br />
<center> <math> \mathbf p_{ij} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma^2 )}{\sum_{k \neq l} \exp(-||x_k-x_l ||^2/ 2\sigma^2 ) }</math> </center><br />
<br />
and <math> \mathbf q_{ij} </math> in low-dimensional space is<br />
<br />
<center> <math> \mathbf q_{ij} = \frac{\exp(-||y_i-y_j ||^2 )}{\sum_{k \neq l} \exp(-||y_k-y_l ||^2) }</math> </center><br />
<br />
where <math> \mathbf p_{ii} </math> and <math> \mathbf q_{ii} </math> are still zero. When a high-dimensional datapoint <math> \mathbf x_i </math> is an outlier (far from all the other points), we set <math> \mathbf{p_{ij}=\frac {(p_{j|i}+p_{i|j})}{2n}} </math> to ensure that <math>\sum_{j} p_{ij} > \frac {1}{2n} </math> for all <math> \mathbf x_i </math>. This makes sure that every <math> \mathbf x_i </math> makes a significant contribution to the cost function, which is given as<br />
<br />
<center> <math> C = KL(P||Q) =\sum_{i}\sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}}</math> </center><br />
<br />
As we can see, by definition, we have <math> \mathbf p_{ij} = p_{ji} </math> and <math> \mathbf q_{ij} = q_{ji} </math>. This is why it is called symmetric SNE.<br />
<br />
From the cost function, we obtain a gradient as simple as<br />
<br />
<center> <math> \frac{\partial C}{\partial y_i} = 4\sum_{j} (y_i-y_j)(p_{ij}-q_{ij}) </math> </center><br />
<br />
which is the main advantage of symmetric SNE.<br />
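As a small self-contained sketch (not the authors' code), the symmetrized joint probabilities and the gradient above can be computed as follows, where `cond_P[i, j]` holds <math> p_{j|i} </math>:<br />

```python
import numpy as np

def symmetric_sne(cond_P, Y):
    """Joint probabilities p_ij = (p_{j|i} + p_{i|j}) / (2n) and the
    symmetric-SNE gradient dC/dy_i = 4 * sum_j (y_i - y_j)(p_ij - q_ij)."""
    n = cond_P.shape[0]
    P = (cond_P + cond_P.T) / (2.0 * n)   # symmetrized joint probabilities
    sq = np.sum(Y ** 2, axis=1)
    E = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T))  # Gaussian kernel
    np.fill_diagonal(E, 0.0)              # q_{ii} = 0
    Q = E / E.sum()                       # normalize over all pairs k != l
    W = P - Q
    # dC/dy_i = 4 * [ (sum_j W_ij) y_i - sum_j W_ij y_j ]
    grad = 4.0 * (np.diag(W.sum(axis=1)) - W) @ Y
    return P, Q, grad
```

Because <math> \mathbf W = P - Q </math> is symmetric, the forces come in equal and opposite pairs, so the gradient sums to zero over all map points (the map's center of mass does not drift).<br />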
<br />
=== The Crowding Problem ===<br />
The "crowding problem" that is addressed in the paper is defined as: "the area of the two-dimensional map that is available to accommodate moderately distant datapoints will not be nearly large enough compared with the area available to accommodate nearby datapoints". This happens when the datapoints are distributed in a region on a high-dimensional manifold around <math> i </math>, and we try to model the pairwise distances from <math> i </math> to the datapoints in a two-dimensional map. For example, it is possible to have 11 datapoints that are mutually equidistant in a ten-dimensional manifold, but it is not possible to model this faithfully in a two-dimensional map. Therefore, if the small distances are modeled accurately in the map, most of the moderately distant datapoints have to be placed too far away in the two-dimensional map. In SNE, this results in a very small attractive force from datapoint <math> i </math> to each of these too-distant map points. The very large number of such forces collapses the points in the center of the map and prevents gaps from forming between the natural clusters. This phenomenon, the crowding problem, is not specific to SNE and can be observed in other local techniques such as Sammon mapping as well.<br /><br />
According to Cook et al. (2007), adding a slight repulsion can address this problem. Using a uniform background model with a small mixing proportion, <math>\,\rho</math>, ensures that <math>q_{ij}</math> can never fall below <math>\frac{2\rho}{n(n-1)}</math>.<br />
<br />
=== Compensation for Mismatched Dimensionality by Mismatched Tails ===<br />
Since the crowding problem is caused by the unwanted attractive forces between map points that represent moderately dissimilar datapoints, one solution is to model these datapoints by a much larger distance in the map, which eliminates the attractive forces. This can be achieved by using a probability distribution that has much heavier tails than a Gaussian to convert the distances into probabilities in the low-dimensional space. The Student t-distribution is selected because it is closely related to the Gaussian distribution and is much faster to evaluate, since it does not involve an exponential. <br />
<br />
In t-SNE, a Student t-distribution with one degree of freedom is employed in the low-dimensional map. As in symmetric SNE, the joint probabilities <math> \mathbf p_{ij} </math> in the high-dimensional space are still<br />
<br />
<center> <math> \mathbf{p_{ij}=\frac{(p_{j|i}+p_{i|j})}{2n}} </math> </center><br />
<br />
while the joint probabilities <math> \mathbf q_{ij} </math> are defined as <br />
<br />
<center> <math> \mathbf q_{ij} = \frac{(1 + ||y_i-y_j ||^2 )^{-1}}{\sum_{k \neq l} (1 + ||y_k-y_l ||^2 )^{-1}}</math> </center><br />
<br />
The gradient of the Kullback-Leibler divergence between <math> P </math> and the Student-t based joint probability distribution <math> Q </math> is then given by<br />
<br />
<center> <math> \frac{\partial C}{\partial y_i} = 4\sum_{j} (y_i-y_j)(p_{ij}-q_{ij})(1 + ||y_i-y_j ||^2 )^{-1} </math> </center><br />
<br />
Compared with the gradients of SNE and UNI-SNE <ref> J.A. Cook, I. Sutskever, A. Mnih, and G.E. Hinton. Visualizing similarity data with a mixture of maps. In ''Proceedings of the 11<sup>th</sup> International Conference on Artificial Intelligence and Statistics'', volume 2, pages 67-74, 2007.</ref>, the t-SNE gradient introduces strong repulsions between dissimilar datapoints that are modeled by a small pairwise distance in the low-dimensional map. This effectively prevents the crowding problem mentioned above. At the same time, these repulsions do not go to infinity, which prevents the dissimilar datapoints from being placed much too far apart. Therefore, t-SNE models dissimilar datapoints by means of large pairwise distances and similar datapoints by means of small pairwise distances. This results in a good representation of both the local and the global structure of the high-dimensional data.<br />
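The Student-t based <math> \mathbf q_{ij} </math> and the gradient above can be sketched directly in NumPy (an illustrative implementation, not the paper's code):<br />

```python
import numpy as np

def tsne_q_and_gradient(Y, P):
    """Student-t (one degree of freedom) joint probabilities q_ij and the
    t-SNE gradient dC/dy_i = 4*sum_j (y_i-y_j)(p_ij-q_ij)(1+||y_i-y_j||^2)^{-1}."""
    sq = np.sum(Y ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T
    inv = 1.0 / (1.0 + D)          # heavy-tailed kernel (1 + ||y_i - y_j||^2)^{-1}
    np.fill_diagonal(inv, 0.0)     # q_{ii} = 0
    Q = inv / inv.sum()
    W = (P - Q) * inv              # per-pair weight in the gradient
    # dC/dy_i = 4 * [ (sum_j W_ij) y_i - sum_j W_ij y_j ]
    grad = 4.0 * (np.diag(W.sum(axis=1)) - W) @ Y
    return Q, grad
```

If <math> \mathbf P </math> happens to equal <math> \mathbf Q </math>, every pairwise weight vanishes and so does the gradient, which is the fixed point the optimization seeks.<br />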
<br />
=== Optimization Methods for t-SNE ===<br />
One way to optimize the t-SNE cost function is to use a momentum term to reduce the number of required iterations. To further improve the results, two tricks called "early compression" and "early exaggeration" can be used. "Early compression" forces the map points to stay close together in the early stages of the optimization, so that it is easy to explore the space of possible global organizations of the data. "Early exaggeration" multiplies all the <math> \mathbf p_{ij} </math>'s by a factor greater than 1 (e.g., 4) in the initial stages of the optimization. This makes all the <math> \mathbf q_{ij} </math>'s too small to model their corresponding <math> \mathbf p_{ij} </math>'s, so that the optimization is forced to focus on the large <math> \mathbf p_{ij} </math>'s. This leads to the formation of tight, widely separated clusters in the map, which makes it very easy to move the clusters around to reach a good global organization.<br />
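The optimization loop described above can be sketched as follows. This is a toy illustration: the learning rate, momentum, iteration counts, and exaggeration factor are choices made for this example, not the paper's exact settings:<br />

```python
import numpy as np

def tsne_optimize(P, n_iter=400, lr=0.1, momentum=0.5,
                  exaggeration=4.0, exag_iters=100, seed=0):
    """Gradient descent on the t-SNE cost with a momentum term and
    'early exaggeration' (p_ij scaled up during the first iterations)."""
    rng = np.random.RandomState(seed)
    n = P.shape[0]
    Y = 1e-4 * rng.randn(n, 2)           # small random initial 2-D map
    velocity = np.zeros_like(Y)
    for it in range(n_iter):
        # early exaggeration: multiply every p_ij by a factor > 1 at first
        Pe = P * exaggeration if it < exag_iters else P
        sq = np.sum(Y ** 2, axis=1)
        inv = 1.0 / (1.0 + sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T)
        np.fill_diagonal(inv, 0.0)
        Q = inv / inv.sum()              # Student-t joint probabilities
        W = (Pe - Q) * inv
        grad = 4.0 * (np.diag(W.sum(axis=1)) - W) @ Y
        velocity = momentum * velocity - lr * grad   # momentum update
        Y += velocity
    return Y
```

On a toy <math> \mathbf P </math> with two blocks of mutually similar points, this loop pulls each block together while the exaggerated phase pushes the blocks apart, yielding two tight, separated clusters.<br />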
<br />
==Experiments with Different Data Sets==<br />
The authors performed t-SNE on five data sets and compared the results with seven other non-parametric dimensionality reduction techniques to evaluate t-SNE. The five data sets employed are: (1) the MNIST data set, (2) the Olivetti faces data set, (3) the COIL-20 data set, (4) the word-feature data set, and (5) the Netflix data set. <br />
<br />
When t-SNE was applied to the MNIST data set, it constructed a map with clear and clean separations between the different digit classes. At the same time, most of the local structure of the data is captured as well. In contrast, Isomap and LLE provide very little insight into the class structure of the data, while the Sammon map models the classes fairly well but does not separate them clearly. More experimental results and comparisons are presented in the paper and the supplemental materials.<br />
<br />
==t-SNE for Large Data Sets==<br />
Due to its computational and memory complexity, it is infeasible to apply the standard version of t-SNE to large data sets (which contain more than 10,000 data points). To solve this problem, t-SNE is modified to display a random subset of landmark points in a way that uses the information of the whole data set. First, a neighborhood graph over all the data points is created using a selected number of neighbors. Then, for each of the selected landmark points, a random walk is defined, which starts from that landmark point and terminates as soon as it lands on another landmark point. <math> \mathbf p_{j|i} </math> denotes the fraction of random walks starting at landmark point <math> x_i </math> that terminate at landmark point <math> x_j </math>. To avoid the "short-circuits" caused by a noisy datapoint, the random walk-based affinity measure integrates over all paths through the neighborhood graph. The random walk-based similarities <math> \mathbf p_{j|i} </math> can be computed either by explicitly performing the random walks on the neighborhood graph, or by using an analytical solution <ref> L. Grady. Random walks for image segmentation. ''IEEE Transactions on Pattern Analysis and Machine Intelligence'', 28(11): 1768-1783, 2006. </ref>, which is more appropriate for very large data sets.<br />
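The analytic route can be sketched with the standard absorbing-Markov-chain formula on a toy weighted graph (a stand-in for the real kNN graph; this is an illustrative sketch, and walks that immediately return to their starting landmark are simply discarded by zeroing the diagonal):<br />

```python
import numpy as np

def random_walk_similarities(W, landmarks):
    """Probability that a random walk started at landmark i first hits
    landmark j, computed analytically. W is a symmetric edge-weight matrix;
    transition probabilities are proportional to edge weights."""
    T = W / W.sum(axis=1, keepdims=True)          # row-stochastic transitions
    non = [v for v in range(W.shape[0]) if v not in landmarks]
    Tnn = T[np.ix_(non, non)]                     # moves among non-landmarks
    Tnl = T[np.ix_(non, landmarks)]               # steps into a landmark
    # B[v, j]: walk currently at non-landmark v is eventually absorbed at j
    B = np.linalg.solve(np.eye(len(non)) - Tnn, Tnl)
    # one step away from each landmark, then absorption at the first landmark hit
    P = T[np.ix_(landmarks, landmarks)] + T[np.ix_(landmarks, non)] @ B
    np.fill_diagonal(P, 0.0)                      # drop walks returning to x_i
    P /= P.sum(axis=1, keepdims=True)             # renormalize p_{j|i}
    return P
```

On a path graph 0-1-2-3 with landmarks {0, 3}, every walk from one landmark that does not return must end at the other, so each row of the result puts all its mass on the opposite landmark.<br />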
<br />
==Weaknesses of t-SNE==<br />
Although t-SNE has been demonstrated to be a favorable technique for data visualization, it has three potential weaknesses. (1) The paper focuses only on data visualization, that is, embedding high-dimensional data into a two- or three-dimensional space. The behavior of t-SNE presented in the paper cannot readily be extrapolated to d &gt; 3 dimensions because of the heavy tails of the Student t-distribution. (2) t-SNE might be less successful when applied to data sets with a high intrinsic dimensionality. This is a result of the local linearity assumption on the manifold that t-SNE makes by employing Euclidean distance to represent the similarity between datapoints. (3) Another major weakness of t-SNE is that its cost function is not convex. As a result, several optimization parameters need to be chosen, and the constructed solutions, which depend on these parameters, may differ each time t-SNE is run from an initial random configuration of the map points.<br />
<br />
==References==<br />
<references/></div>Myakhavehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=visualizing_Data_using_t-SNE&diff=3601visualizing Data using t-SNE2009-07-28T21:30:27Z<p>Myakhave: /* The Crowding Problem */</p>
<hr />
<div>==Introduction==<br />
The paper <ref>Laurens van der Maaten, and Geoffrey Hinton. Visualizing Data using t-SNE. ''Journal of Machine Learning Research'', 9: 2579-2605, 2008</ref> introduced a new nonlinear dimensionality reduction technique that "embeds" high-dimensional data into a low-dimensional space. The technique is a variation of Stochastic Neighbor Embedding (SNE), which was proposed by Hinton and Roweis in 2002 <ref>G.E. Hinton and S.T. Roweis. Stochastic Neighbor embedding. In ''Advances in Neural Information Processing Systems'', vol. 15, pp. 833-840, Cambridge, MA, USA, 2002. The MIT Press.</ref>, and in which the high-dimensional Euclidean distances between datapoints are converted into conditional probabilities that describe their similarities. t-SNE, based on the same idea, aims to be easier to optimize and to solve the "crowding problem". In addition, the authors showed that t-SNE can be applied to large data sets as well, by using random walks on neighborhood graphs. The performance of t-SNE is demonstrated on a wide variety of data sets and compared with many other visualization techniques.<br />
<br />
==Stochastic Neighbor Embedding==<br />
In SNE, the high-dimensional Euclidean distances between datapoints are first converted into probabilities. The similarity of datapoint <math> \mathbf x_j </math> to datapoint <math> \mathbf x_i </math> is represented by the conditional probability, <math> \mathbf p_{j|i} </math>, that <math> \mathbf x_i </math> would pick <math> \mathbf x_j </math> as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered on <math> \mathbf x_i </math>. The probability <math> \mathbf p_{j|i} </math> is given as<br />
<br />
<br> <center> <math> \mathbf p_{j|i} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma_i ^2 )}{\sum_{k \neq i} \exp(-||x_i-x_k ||^2/ 2\sigma_i ^2 ) }</math> </center> <br />
<br />
where <math> \mathbf \sigma_i </math> is the variance of the Gaussian that is centered on <math> \mathbf x_i </math>, the sum in the denominator runs over all datapoints <math> \mathbf x_k </math> with <math> k \neq i </math>, and for every <math> \mathbf x_i </math> we set <math> \mathbf p_{i|i} = 0 </math>. It can be seen from this definition that the closer two datapoints are, the higher <math> \mathbf p_{j|i} </math> is. For widely separated datapoints, <math> \mathbf p_{j|i} </math> is almost infinitesimal. <br />
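This definition translates directly into code. The following is a minimal NumPy sketch, not from the paper: the function name <code>conditional_p</code> and the assumption that a vector of per-point bandwidths <math>\sigma_i</math> is already given are illustrative.

```python
import numpy as np

def conditional_p(X, sigmas):
    """Conditional probabilities p_{j|i} from pairwise squared Euclidean
    distances under a Gaussian of variance sigmas[i]^2 centered on each x_i."""
    # squared Euclidean distance matrix D[i, j] = ||x_i - x_j||^2
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    P = np.exp(-D / (2.0 * sigmas[:, None] ** 2))
    np.fill_diagonal(P, 0.0)              # p_{i|i} = 0 by convention
    return P / P.sum(axis=1, keepdims=True)

# tiny example: 4 points in 3-D with unit bandwidths
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
P = conditional_p(X, np.ones(4))
print(np.allclose(P.sum(axis=1), 1.0))    # each row is a distribution: True
```

Each row of the returned matrix is the distribution <math>P_i</math> over the other points, so nearby points receive most of the probability mass.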
<br />
With the same idea, in the low-dimensional space, we model the similarity of map point <math> \mathbf y_j </math> to <math> \mathbf y_i </math> by the conditional probability <math> \mathbf q_{j|i} </math>, which is given by<br />
<br />
<br> <center> <math> q_{j|i} = \frac{\exp(-||y_i-y_j ||^2)}{\sum_{k \neq i} \exp(-||y_i-y_k ||^2) }</math> </center><br />
<br />
where we set the variance of the Gaussian <math> \mathbf \sigma_i </math> to be <math> \frac{1}{\sqrt{2} } </math> (a different value will only result in rescaling of the final map). And again, we set <math> \mathbf q_{i|i} = 0 </math>.<br />
<br />
If the low-dimensional map points correctly represent the high-dimensional datapoints, their conditional probabilities <math> \mathbf q_{j|i} </math> and <math> \mathbf p_{j|i} </math> should be equal. Therefore, the aim of SNE is to minimize the mismatch between <math> \mathbf q_{j|i} </math> and <math> \mathbf p_{j|i} </math>. This is achieved by minimizing the sum of Kullback-Leibler divergences (a non-symmetric measure of the difference between two probability distributions) over all datapoints. The cost function of SNE is then expressed as <br />
<br />
<br> <center> <math> C = \sum_{i} KL(P_i||Q_i) =\sum_{i}\sum_{j \neq i} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}</math> </center><br />
<br />
where <math> \mathbf P_i </math> and <math> \mathbf Q_i </math> are the conditional probability distributions over all other points for given <math> \mathbf x_i </math> and <math> \mathbf y_i </math>. Since the Kullback-Leibler divergence is asymmetric, there is a large cost for using a small <math> \mathbf q_{j|i} </math> to model a large <math> \mathbf p_{j|i} </math>, but only a small cost for using a large <math> \mathbf q_{j|i} </math> to model a small <math> \mathbf p_{j|i} </math>. Therefore, the SNE cost function focuses more on local structure. It enforces both keeping the images of nearby objects nearby and keeping the images of widely separated objects relatively far apart.<br />
<br />
The remaining parameter <math> \mathbf \sigma_i </math> is selected by performing a binary search for the value of <math> \mathbf \sigma_i </math> that produces a <math> \mathbf P_i </math> with a fixed perplexity (a measure of the effective number of neighbors, defined as two raised to the power of the Shannon entropy of <math>P_i</math>) that is specified by the user.<br />
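Because the perplexity of <math>P_i</math> increases monotonically with <math>\sigma_i</math>, simple bisection suffices for this search. The sketch below assumes the squared distances from <math>x_i</math> to all other points are given; the function name and bracketing bounds are illustrative choices, not from the paper.

```python
import numpy as np

def sigma_for_perplexity(sq_dists, target_perp, lo=1e-4, hi=1e4, iters=60):
    """Bisect for the bandwidth sigma_i whose Gaussian row P_i has perplexity
    2^{H(P_i)} equal to target_perp. sq_dists holds the squared distances
    from x_i to every other point."""
    for _ in range(iters):
        sigma = 0.5 * (lo + hi)
        p = np.exp(-sq_dists / (2.0 * sigma ** 2))
        p /= p.sum()
        perplexity = 2.0 ** (-np.sum(p * np.log2(p + 1e-12)))
        if perplexity > target_perp:
            hi = sigma        # distribution too flat: shrink the bandwidth
        else:
            lo = sigma        # distribution too peaked: grow the bandwidth
    return 0.5 * (lo + hi)

# example: squared distances 1..10, target perplexity of 5 effective neighbors
sigma = sigma_for_perplexity(np.arange(1.0, 11.0), target_perp=5.0)
```

In a full implementation this search is run once per datapoint, producing one <math>\sigma_i</math> for each row of the similarity matrix.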
<br />
To minimize the cost function, a gradient descent method is used. The gradient is given as<br />
<br />
<br> <center> <math> \frac{\partial C}{\partial y_i} = 2\sum_{j} (y_i-y_j)([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) </math> </center><br />
<br />
which is simple and has a nice physical interpretation. The gradient can be seen as the resultant force induced by a set of springs between the map point <math> \mathbf y_i </math> and all other neighbor points <math> \mathbf y_j </math>, where the force is exerted in the direction <math> \mathbf (y_i-y_j) </math> and the stiffness of the spring is <math> \mathbf ([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) </math>.<br />
<br />
==t-Distributed Stochastic Neighbor Embedding==<br />
Although SNE produces relatively good visualizations, it has two main problems: difficulty of optimization and the "crowding problem". t-Distributed Stochastic Neighbor Embedding (t-SNE), a variation of SNE, aims to alleviate these problems. The cost function of t-SNE differs from that of SNE in two ways: (1) it uses a symmetric version of the SNE cost function, and (2) it uses a Student t-distribution instead of a Gaussian to compute the probabilities in the low-dimensional space. <br />
<br />
=== Symmetric SNE ===<br />
In symmetric SNE, instead of the sum of the Kullback-Leibler divergences between the conditional probabilities, the cost function is a single Kullback-Leibler divergence between two joint probability distributions, <math> \mathbf P </math> in the high-dimensional space and <math> \mathbf Q </math> in the low-dimensional space.<br />
<br />
In this case, the pairwise similarities of the data points in the high-dimensional space are given by<br />
<br />
<center> <math> \mathbf p_{ij} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma^2 )}{\sum_{k \neq l} \exp(-||x_k-x_l ||^2/ 2\sigma^2 ) }</math> </center><br />
<br />
and <math> \mathbf q_{ij} </math> in low-dimensional space is<br />
<br />
<center> <math> \mathbf q_{ij} = \frac{\exp(-||y_i-y_j ||^2 )}{\sum_{k \neq l} \exp(-||y_k-y_l ||^2) }</math> </center><br />
<br />
where <math> \mathbf p_{ii} </math> and <math> \mathbf q_{ii} </math> are still zero. When a high-dimensional datapoint <math> \mathbf x_i </math> is an outlier (far from all the other points), we set <math> \mathbf{p_{ij}=\frac {(p_{j|i}+p_{i|j})}{2n}} </math> to ensure that <math>\sum_{j} p_{ij} > \frac {1}{2n} </math> for all <math> \mathbf x_i </math>. This makes sure that every <math> \mathbf x_i </math> makes a significant contribution to the cost function, which is given as<br />
<br />
<center> <math> C = KL(P||Q) =\sum_{i}\sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}</math> </center><br />
<br />
As we can see, by definition, we have <math> \mathbf p_{ij} = p_{ji} </math> and <math> \mathbf q_{ij} = q_{ji} </math>. This is why it is called symmetric SNE.<br />
<br />
From the cost function, we obtain a gradient as simple as<br />
<br />
<center> <math> \frac{\partial C}{\partial y_i} = 4\sum_{j} (y_i-y_j)(p_{ij}-q_{ij}) </math> </center><br />
<br />
which is the main advantage of symmetric SNE.<br />
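In code, the symmetrized joint distribution is obtained in one step from the matrix of conditional probabilities, using the outlier-robust definition above. A small sketch (matrix and function names are illustrative), assuming each row of the conditional matrix sums to one:

```python
import numpy as np

def symmetrize(P_cond):
    """Joint probabilities p_ij = (p_{j|i} + p_{i|j}) / (2n) from an n x n
    matrix of conditional probabilities whose rows each sum to 1."""
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)

# rows of a random row-stochastic matrix -> a symmetric joint distribution
rng = np.random.default_rng(1)
C = rng.random((5, 5))
np.fill_diagonal(C, 0.0)              # p_{i|i} = 0
C /= C.sum(axis=1, keepdims=True)     # each row sums to 1
P = symmetrize(C)
print(np.allclose(P, P.T), np.isclose(P.sum(), 1.0))   # True True
```

Since every row of the conditional matrix contributes total mass 1, each point contributes at least <math>\frac{1}{2n}</math> to the joint distribution, which is exactly the outlier guarantee stated above.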
<br />
=== The Crowding Problem ===<br />
The "crowding problem" addressed in the paper is defined as: "the area of the two-dimensional map that is available to accommodate moderately distant datapoints will not be nearly large enough compared with the area available to accommodate nearby datapoints". This happens when the datapoints are distributed in a region on a high-dimensional manifold around <math> i </math>, and we try to model the pairwise distances from <math> i </math> to these datapoints in a two-dimensional map. For example, it is possible to have 11 datapoints that are mutually equidistant in a ten-dimensional manifold, but it is not possible to model this faithfully in a two-dimensional map. Therefore, if the small distances are modeled accurately in the map, most of the moderately distant datapoints will be placed much too far away in the two-dimensional map. In SNE, this results in very small attractive forces from datapoint <math> i </math> to these too-distant map points. The very large number of such forces collapses the points in the center of the map and prevents gaps from forming between the natural clusters. This phenomenon, the crowding problem, is not specific to SNE and can be observed in other local techniques such as Sammon mapping as well.<br /><br />
According to Cook et al. (2007), adding a slight repulsion can address this problem. Using a uniform background model with a small mixing proportion, <math>\rho</math>, ensures that <math>q_{ij}</math> never falls below <math>\frac{2\rho}{n(n-1)}</math>.<br />
<br />
=== Compensation for Mismatched Dimensionality by Mismatched Tails ===<br />
Since the crowding problem is caused by unwanted attractive forces between map points that represent moderately dissimilar datapoints, one solution is to model these datapoints by a much larger distance in the map, which eliminates the attractive forces. This can be achieved by using a probability distribution that has much heavier tails than a Gaussian to convert the distances into probabilities in the low-dimensional space. The Student t-distribution is selected because it is closely related to the Gaussian distribution, but it is much faster to evaluate computationally since it does not involve an exponential. <br />
<br />
In t-SNE, a Student t-distribution with one degree of freedom is employed in the low-dimensional map. As in symmetric SNE, the joint probabilities <math> \mathbf p_{ij} </math> in the high-dimensional space are still<br />
<br />
<center> <math> \mathbf{p_{ij}=\frac{(p_{j|i}+p_{i|j})}{2n}} </math> </center><br />
<br />
while the joint probabilities <math> \mathbf q_{ij} </math> are defined as <br />
<br />
<center> <math> \mathbf q_{ij} = \frac{(1 + ||y_i-y_j ||^2 )^{-1}}{\sum_{k \neq l} (1 + ||y_k-y_l ||^2 )^{-1}}</math> </center><br />
<br />
The gradient of the Kullback-Leibler divergence between <math> P </math> and the Student-t based joint probability distribution <math> Q </math> is then given by<br />
<br />
<center> <math> \frac{\partial C}{\partial y_i} = 4\sum_{j} (y_i-y_j)(p_{ij}-q_{ij})(1 + ||y_i-y_j ||^2 )^{-1} </math> </center><br />
<br />
Compared with the gradients of SNE and UNI-SNE <ref> J.A. Cook, I. Sutskever, et al. Visualizing similarity data with a mixture of maps. ''In Proceedings of the 11<sup>th</sup> International Conference on Artificial Intelligence and Statistics'', volume 2, pages 67-74, 2007.</ref>, the t-SNE gradient introduces strong repulsions between dissimilar datapoints that are modeled by small pairwise distances in the low-dimensional map. This effectively prevents the crowding problem mentioned above. At the same time, these repulsions do not go to infinity, which prevents the dissimilar datapoints from being placed too far apart. Therefore, t-SNE models dissimilar datapoints by means of large pairwise distances, and similar datapoints by means of small pairwise distances. This results in a good representation of both the local and the global structure of the high-dimensional data.<br />
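The t-SNE gradient translates almost line-for-line into NumPy. The sketch below (function name illustrative) assumes a symmetric joint distribution <math>P</math> with zero diagonal and a current map <math>Y</math>:

```python
import numpy as np

def tsne_grad(P, Y):
    """Gradient dC/dY of KL(P||Q) for t-SNE, where Q is based on a
    Student t-distribution with one degree of freedom."""
    D = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2)
    W = 1.0 / (1.0 + D)                   # (1 + ||y_i - y_j||^2)^{-1}
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()                       # joint probabilities q_ij
    # row i of the result is 4 * sum_j (p_ij - q_ij) w_ij (y_i - y_j)
    M = (P - Q) * W
    return 4.0 * (np.diag(M.sum(axis=1)) - M) @ Y

# tiny sanity demo: gradient of a 5-point map
rng = np.random.default_rng(2)
A = rng.random((5, 5)); A = A + A.T
np.fill_diagonal(A, 0.0)
P = A / A.sum()                           # symmetric joint distribution
Y = rng.normal(size=(5, 2))
G = tsne_grad(P, Y)                       # shape (5, 2)
```

The matrix identity in the last line of the function is just the per-point sum written as a single matrix product, which keeps the computation vectorized.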
<br />
=== Optimization Methods for t-SNE ===<br />
One way to optimize the t-SNE cost function is to use a momentum term to reduce the number of required iterations. To further improve the results, two tricks called "early compression" and "early exaggeration" can be used. "Early compression" forces the map points to stay close together during the early stages of the optimization, so that it is easy to explore the space of possible global organizations of the data. "Early exaggeration" multiplies all the <math> \mathbf p_{ij} </math>'s by a factor greater than 1 in the initial stages of the optimization. This makes all the <math> \mathbf q_{ij} </math>'s too small to model their corresponding <math> \mathbf p_{ij} </math>'s, so that the optimization is forced to focus on the large <math> \mathbf p_{ij} </math>'s. This leads to the formation of tight, widely separated clusters in the map, which makes it easy to move the clusters around to find a good global organization.<br />
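A self-contained sketch of such an optimization loop with momentum and early exaggeration is shown below. The learning rate, momentum schedule, exaggeration factor, and iteration counts are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def tsne_optimize(P, n_iter=300, lr=10.0, dims=2, exagg=4.0, seed=0):
    """Gradient descent on the t-SNE cost with a momentum term, multiplying
    all p_ij by `exagg` for the first 100 iterations (early exaggeration)."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    Y = 1e-4 * rng.normal(size=(n, dims))   # small random initial map
    V = np.zeros_like(Y)                    # momentum "velocity"
    for t in range(n_iter):
        Pt = P * exagg if t < 100 else P    # early exaggeration phase
        D = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2)
        W = 1.0 / (1.0 + D)                 # Student-t kernel
        np.fill_diagonal(W, 0.0)
        Q = W / W.sum()
        M = (Pt - Q) * W
        grad = 4.0 * (np.diag(M.sum(axis=1)) - M) @ Y
        momentum = 0.5 if t < 250 else 0.8  # larger momentum later on
        V = momentum * V - lr * grad
        Y = Y + V
    return Y
```

During the exaggeration phase the attractive terms dominate, so the map forms tight clusters that are easy to rearrange; once the exaggeration is switched off, the repulsive terms spread each cluster out again.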
<br />
==Experiments with Different Data Sets==<br />
The author performed t-SNE on five data sets and compared the results with seven other non-parametric dimensional reduction techniques to evaluate t-SNE. The five data sets that were employed are: (1) the MNIST data set, (2) the Olivetti faces data set, (3) the COIL-20 data set, (4) the word-feature data set, and (5) the Netflix data set. <br />
<br />
When t-SNE was performed on the MNIST data set, it constructed a map with clear and clean separations between the different digit classes, while capturing most of the local structure of the data as well. On the other hand, Isomap and LLE provide very little insight into the class structure of the data, while the Sammon map models the classes fairly well but does not separate them clearly. More experimental results and comparisons are presented in the paper and supplemental materials.<br />
<br />
==t-SNE for Large Data Sets==<br />
Due to its computational and memory complexity, it is infeasible to apply the standard version of t-SNE to large data sets (those containing more than 10,000 data points). To address this, t-SNE is modified to display a random subset of landmark points in a way that uses information from the whole data set. First, a neighborhood graph over all the data points is created using a selected number of neighbors. Then, for each selected landmark point, a random walk is defined that starts from that landmark point and terminates as soon as it lands on another landmark point. <math> \mathbf p_{j|i} </math> denotes the fraction of random walks starting at landmark point <math> x_i </math> that terminate at landmark point <math> x_j </math>. To avoid the "short-circuits" caused by a noisy datapoint, the random walk-based affinity measure integrates over all paths through the neighborhood graph. The random walk-based similarities <math> \mathbf p_{j|i} </math> can be computed either by explicitly performing the random walks on the neighborhood graph, or by using an analytical solution <ref> L. Grady, 2006, Random walks for image segmentation. ''IEEE Transactions on Pattern Analysis and Machine Intelligence'', 28(11): 1768-1783, 2006. </ref>, which is more appropriate for very large data sets.<br />
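The explicit random-walk estimate of <math>p_{j|i}</math> can be sketched as follows; the adjacency-list graph representation and the function name are illustrative, and the analytical alternative from Grady (2006) is not shown.

```python
import numpy as np

def random_walk_p(neighbors, landmarks, n_walks=100, seed=0):
    """Estimate p_{j|i} as the fraction of random walks on the neighborhood
    graph that start at landmark i and terminate at another landmark j.
    neighbors[v] lists the graph neighbors of node v."""
    rng = np.random.default_rng(seed)
    landmark_set = set(landmarks)
    P = {i: {} for i in landmarks}
    for i in landmarks:
        for _ in range(n_walks):
            v = i
            while True:
                v = int(rng.choice(neighbors[v]))   # one uniform random step
                if v in landmark_set and v != i:    # hit another landmark
                    break
            P[i][v] = P[i].get(v, 0.0) + 1.0 / n_walks
    return P

# path graph 0-1-2-3-4 with landmarks 0 and 4: every walk from 0 ends at 4
chain = [[1], [0, 2], [1, 3], [2, 4], [3]]
P = random_walk_p(chain, [0, 4])
```

The sketch assumes every node can reach at least one other landmark; in practice the number of walks per landmark trades accuracy against running time.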
<br />
==Weaknesses of t-SNE==<br />
Although t-SNE has been shown to be a favorable technique for data visualization, it has three potential weaknesses. (1) The paper only focuses on data visualization using t-SNE, that is, embedding high-dimensional data into a two- or three-dimensional space. The behavior of t-SNE presented in the paper cannot readily be extrapolated to d > 3 dimensions because of the heavy tails of the Student t-distribution. (2) t-SNE might be less successful when applied to data sets with a high intrinsic dimensionality. This is a result of the local linearity assumption on the manifold that t-SNE makes by employing Euclidean distance to represent the similarity between datapoints. (3) Another major weakness of t-SNE is that its cost function is not convex. As a result, several optimization parameters need to be chosen, and the constructed solutions, which depend on these parameters, may differ each time t-SNE is run from an initial random configuration of the map points.<br />
<br />
==References==<br />
<references/></div>Myakhavehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=visualizing_Similarity_Data_with_a_Mixture_of_Maps&diff=2856visualizing Similarity Data with a Mixture of Maps2009-07-11T00:26:36Z<p>Myakhave: /* Modeling Human Word Association Data */</p>
<hr />
<div>== Introduction ==<br />
<br />
The main idea of this paper is to show how we can utilize several different two-dimensional maps in order to visualize a set of pairwise similarities. Aspect maps resemble both clustering (in modeling pairwise similarities as a mixture of different types of similarity) and multi-dimensional scaling (in modeling each type of similarity by a two-dimensional map). While methods such as PCA and metric multi-dimensional scaling (MDS) are simple and fast, their main drawback is that they minimize a cost function mainly focused on modeling large dissimilarities rather than small ones. As a result, they do not provide good visualizations of data that lies on a curved low-dimensional manifold in a high-dimensional space. Methods such as local MDS, LLE, Maximum Variance Unfolding, and Stochastic Neighbour Embedding (SNE) model local distances accurately in the two-dimensional visualization, but they model larger distances inaccurately.<br />
<br />
SNE outperforms methods such as LLE in two ways: despite the difficulty of optimizing the SNE objective function, it leads to much better solutions, and since SNE is based on a probabilistic model, it is much more effective at producing good visualizations. In the next section, we explain how SNE works.<br />
<br />
== Stochastic Neighbour Embedding ==<br />
<br />
The core of the SNE method <ref> G. Hinton and S. Roweis. Stochastic neighbor embedding. Advances in Neural Information Processing Systems, 15:833–840, 2003 </ref><br />
lies in converting high-dimensional distance or similarity data into a set of conditional probabilities <math> \mathbf{ p_{j|i} }</math>, each of which represents the probability that object <math> i </math> would pick object <math> j </math> as its neighbour if it were only allowed to pick one neighbour. For objects in a high-dimensional Euclidean space, where our data points consist of the coordinates of the objects, we can find <math> \mathbf{ p_{j|i} } </math> for each object <math> i </math> by using a spherical Gaussian distribution centered at the high-dimensional position of <math> i </math>, <math> \mathbf{ X_{i}} </math>. We set <math> \mathbf{ p_{i|i} = 0 }</math> and, for <math> \mathbf{ j \neq i } </math>,<br />
<br />
<center> <math> \mathbf p_{j|i} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma_i ^2 )}{\sum_{k \neq i} \exp(-||x_i-x_k ||^2/ 2\sigma_i ^2 ) }</math> </center><br />
<br />
Intuitively, if an object <math>\, i </math> is only allowed to pick one neighbour <math>\, j </math>, then <math>\, j </math> should be the object at the minimum relative distance. In other words, the greater the relative distance from <math>\, i </math>, the lower the probability of being chosen as the one allowed neighbour. With this intuition, it makes sense to define <math> \mathbf p_{j|i}</math> so that the numerator decays exponentially with <math>\, j </math>'s distance from <math>\, i </math>, and the denominator normalizes over all candidate neighbours of <math>\, i </math>.<br /><br />
<br />
Note that given only a set of pairwise distances between objects, <math> \mathbf{|| x_i - x_j ||} </math>, we can use the above equation to derive the same probabilities. In practice, given a set of <math> N </math> points, we set the variance of the Gaussian <math> \mathbf{ \sigma_i ^2} </math> either by hand, or by a binary search for the value of <math> \mathbf{ \sigma_i } </math> that makes the entropy of the distribution over neighbours equal to <math> \mathbf{ \log_2 M} </math>. (Recall that the entropy of the discrete distribution <math> \mathbf{ P_i} </math> is defined as <math> \sum_{j} p_{j|i}\log_2 (1/p_{j|i}) </math>, where <math> \mathbf{ p\log_2(1/p)} </math> is understood to be zero when <math> \mathbf{p=0} </math>.) This is done by choosing a number <math> \mathbf{ M \ll N} </math> and performing the binary search until the entropy of <math> \mathbf{ P_i} </math> is within some predetermined small tolerance of <math> \mathbf{\log_2 M } </math>. <br />
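The binary search for <math> \mathbf{\sigma_i} </math> described above can be sketched as follows. This is a hypothetical illustration, not the authors' code; it bisects in log space and treats the chosen <math>M</math> as the desired perplexity <math>2^{H(P_i)}</math>.

```python
import numpy as np

def probs_and_entropy(d_sq, sigma):
    """p_{j|i} over the other points, and the entropy of that distribution in bits."""
    p = np.exp(-(d_sq - d_sq.min()) / (2.0 * sigma ** 2))  # shift for numerical stability
    p = p / p.sum()
    h = -np.sum(p * np.log2(p + 1e-300))                   # p*log2(p) -> 0 as p -> 0
    return p, h

def fit_sigma(d_sq, M, tol=1e-5, n_iter=200):
    """Binary search (in log space) for sigma_i such that entropy(P_i) = log2(M)."""
    target = np.log2(M)
    lo, hi = 1e-10, 1e10
    for _ in range(n_iter):
        sigma = np.sqrt(lo * hi)          # geometric midpoint
        _, h = probs_and_entropy(d_sq, sigma)
        if abs(h - target) < tol:
            break
        if h > target:                    # distribution too flat: shrink sigma
            hi = sigma
        else:                             # too peaked: grow sigma
            lo = sigma
    return sigma

# Example: squared distances from object i to 8 other objects, target M = 4.
d_sq = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
sigma = fit_sigma(d_sq, M=4)
_, h = probs_and_entropy(d_sq, sigma)
```

The search relies on the entropy being monotonically increasing in <math>\sigma_i</math>: a tiny variance concentrates all mass on the nearest neighbour (entropy near 0), while a huge variance gives a nearly uniform distribution.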
<br />
Our main goal in SNE is to model <math>\mathbf{p_{j|i}}</math> by using the conditional probabilities <math>\mathbf{q_{j|i}}</math>, which are determined by the locations <math>\mathbf{ y_i} </math> of points in low-dimensional space: <br />
<center> <math> q_{j|i} = \frac{\exp(-||y_i-y_j ||^2)}{\sum_{k \neq i} \exp(-||y_i-y_k ||^2) }</math> </center><br />
The aim of embedding is to match these two distributions as well as possible. To do so, we minimize a cost function which is a sum of Kullback-Leibler divergences between the original <br />
<math> \mathbf{p_{j|i}} </math> and induced <math> \mathbf{ q_{j|i}} </math> distributions over neighbours for each object:<br />
<br />
<center> <math> C = \sum_{i} KL(P_i||Q_i) =\sum_{i}\sum_{j \neq i} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}</math> </center><br />
<br />
The dimensionality of the <math> \mathbf{Y} </math> space is chosen to be much less than the number of objects. Notice that making <math> \mathbf{ q_{j|i}} </math> large when <math> \mathbf{ p_{j|i}} </math> is small wastes some of the probability mass in the <math> \mathbf{Q} </math> distribution, so there is a cost for modeling a big distance in the high-dimensional space with a small one in the map, though it is less than the cost of modeling a small distance with a big one. In this respect SNE is an improvement over methods like LLE: while SNE emphasizes local distances, its cost function cleanly enforces ''both'' keeping the images of nearby objects nearby ''and'' keeping the images of widely separated objects relatively far apart. Although differentiating <math> \mathbf{C} </math> is tedious, because <math> \mathbf{y_k} </math> affects <math> \mathbf{ q_{j|i}} </math> via the normalization term in its definition, the final result is simple and has a nice physical interpretation:<br />
<br />
<center> <math> \frac{\partial C}{\partial y_i} = 2\sum_{j} (y_i-y_j)([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) </math> </center><br />
<br />
Using steepest descent to minimize <math> \mathbf{C} </math>, in which all of the points are adjusted in parallel, is inefficient and can get stuck in poor local minima. To address this problem, we add Gaussian noise to the <math> \mathbf{y} </math> values after each update. We start with a high level of noise and reduce it rapidly to find the approximate noise level at which structure starts to form in the low-dimensional map. Once we observe that a small increase in the noise level leads to a large decrease in the cost function, we can be fairly sure that structure is emerging. By repeating this process, starting from a noise level just above the level at which structure emerged and reducing it gently, we can find low-dimensional maps that correspond to significantly better minima of <math> \mathbf{C} </math>.<br />
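The gradient above leads to a very small steepest-descent loop. The following sketch is our own toy illustration (function names, learning rate, and the tiny data set are assumptions, not the authors' settings); it applies the stated gradient, with optional Gaussian jitter for the annealing scheme described above.

```python
import numpy as np

def q_conditional(Y):
    """q_{j|i} induced by low-dimensional positions Y (one row per object)."""
    d = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d, np.inf)                 # enforces q_{i|i} = 0
    e = np.exp(-d)
    return e / e.sum(axis=1, keepdims=True)

def sne_step(Y, P, lr=0.02, noise_std=0.0, rng=None):
    """One steepest-descent update of C, with optional annealing noise.

    Uses dC/dy_i = 2 * sum_j (y_i - y_j) ([p_{j|i}-q_{j|i}] + [p_{i|j}-q_{i|j}]),
    where P[i, j] = p_{j|i}.
    """
    Q = q_conditional(Y)
    M = (P - Q) + (P - Q).T                     # M[i,j] = [p_{j|i}-q_{j|i}] + [p_{i|j}-q_{i|j}]
    grad = 2.0 * (np.diag(M.sum(axis=1)) - M) @ Y   # row i: 2 * sum_j M[i,j] (y_i - y_j)
    Y = Y - lr * grad
    if noise_std > 0.0:                         # jitter to escape poor local minima
        Y = Y + (rng or np.random.default_rng()).normal(0.0, noise_std, Y.shape)
    return Y

def kl_cost(P, Q):
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

# Toy run: P from Gaussian conditionals in 5-D, map initialised near the origin.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))
d = np.sum((X[:, None] - X[None]) ** 2, axis=-1)
np.fill_diagonal(d, np.inf)
e = np.exp(-d / 2.0)
P = e / e.sum(axis=1, keepdims=True)
Y = rng.normal(scale=1e-2, size=(10, 2))
cost_before = kl_cost(P, q_conditional(Y))
for _ in range(200):
    Y = sne_step(Y, P, lr=0.02)
cost_after = kl_cost(P, q_conditional(Y))
```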
<br />
== Symmetric SNE ==<br />
<br />
An alternative to this version of SNE, which minimizes divergences between conditional distributions, is to define a single joint distribution over all non-identical ordered pairs:<br />
<br />
In this case we define <math> \mathbf{p_{ij}} </math> by<br />
<br />
<center> <math> \mathbf p_{ij} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma^2 )}{\sum_{k < l} \exp(-||x_k-x_l ||^2/ 2\sigma^2 ) }</math> </center><br />
<br />
<math> \mathbf{q_{ij}} </math>'s are defined by<br />
<br />
<center> <math> \mathbf q_{ij} = \frac{\exp(-||y_i-y_j ||^2 )}{\sum_{k < l} \exp(-||y_k-y_l ||^2) }</math> </center><br />
<br />
and finally the symmetric version of our cost function, <math> \mathbf{C_{sym}} </math>, becomes the KL divergence between the two distributions<br />
<br />
<center> <math> C_{sym} = KL(P||Q) =\sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}</math> </center><br />
<br />
The benefit of defining <math> \mathbf{p_{ij}} </math> in this way is that it yields much simpler derivatives. However, if one of the high-dimensional points, <math> \mathbf{j} </math>, is far from all of the others, all of the joint probabilities involving <math> \mathbf{j} </math> will be very small, so its location in the map is poorly determined. To address this, we instead replace <math> \mathbf{p_{ij}} </math> by <math> \mathbf{p_{ij}=0.5(p_{j|i}+p_{i|j})} </math>. When <math> \mathbf{j} </math> is far from all the other points, all of the <math> \mathbf{p_{j|i}} </math> will still be very small, but the conditional probabilities <math> \mathbf{p_{\cdot|j}} </math> sum to 1, so every point retains a significant influence on the cost function.<br />
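The symmetrised definition can be sketched as follows. As an illustrative choice, we divide by <math>2n</math> so that the <math>p_{ij}</math> form a proper joint distribution; with this normalisation each point retains total mass at least <math>1/2n</math>, which is exactly the outlier-robustness property described above.

```python
import numpy as np

def symmetrize(P_cond):
    """Joint probabilities p_ij = (p_{j|i} + p_{i|j}) / (2n) from conditionals.

    P_cond[i, j] = p_{j|i}, with zero diagonal and rows summing to 1.
    Dividing by 2n (our normalisation of the paper's 0.5(p_{j|i}+p_{i|j}))
    makes the p_ij sum to 1; each point i then contributes total mass
    sum_j p_ij = (1 + sum_j p_{i|j}) / (2n) >= 1/(2n), so outliers still count.
    """
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)

# Example: conditionals from a random Gaussian cloud.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
d = np.sum((X[:, None] - X[None]) ** 2, axis=-1)
np.fill_diagonal(d, np.inf)
e = np.exp(-d / 2.0)
P_cond = e / e.sum(axis=1, keepdims=True)
P = symmetrize(P_cond)
```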
<br />
== Aspect Maps ==<br />
<br />
Another approach to defining <math> \mathbf{q_{j|i}} </math> is to allow <math> \mathbf{i} </math> and <math> \mathbf{j} </math> to occur in several different two-dimensional maps and to assign object <math> \mathbf{i} </math> a mixing proportion <math> \mathbf{\pi_{i}^{m}} </math> in the m-th map. Note that we require <math> \mathbf{\sum_{m} \pi_{i}^{m}=1} </math>. Using these different maps, we define <math> \mathbf{q_{j|i}} </math> as follows:<br />
<br />
<center> <math> q_{j|i} = \frac{\sum_{m} \pi_{i}^{m}\pi_{j}^{m} e^{-d_{i,j}^{m}} }{z_i} </math> </center><br />
<br />
where<br />
<br />
<center> <math> d_{i,j}^{m}=|| y_i^m-y_j^m ||^2, \quad z_i=\sum_{h}\sum_{m} \pi_{i}^{m} \pi_{h}^{m} e^{-d_{i,h}^{m}} </math> </center><br />
<br />
Using a mixture model is very different from simply using a single space with extra dimensions, because points that are far apart in one dimension cannot have a high <math> \mathbf{q_{j|i}} </math> no matter how close together they are in the other dimensions. In contrast, when we use a mixture model, provided that ''there is'' at least one map in which <math> \mathbf{i} </math> is close to <math> \mathbf{j} </math> ''and'' the versions of <math> \mathbf{i} </math> and <math> \mathbf{j} </math> in that map have high mixing proportions, it is possible for <math> \mathbf{q_{j|i}} </math> to be quite large even if <math> \mathbf{i} </math> and <math> \mathbf{j} </math> are far apart in all the other maps. <br />
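The mixture-of-maps definition of <math> \mathbf{q_{j|i}} </math> can be computed directly from the formula above. A small sketch follows; the array shapes and the toy example are our own choices, and we take the sums defining <math>z_i</math> to exclude <math>h=i</math> so that <math>q_{i|i}=0</math>.

```python
import numpy as np

def aspect_q(Ys, Pi):
    """q_{j|i} from M two-dimensional maps.

    Ys : array (M, n, 2), the position y_i^m of each object's copy in each map.
    Pi : array (n, M), mixing proportions with rows summing to 1.
    """
    M, n, _ = Ys.shape
    num = np.zeros((n, n))
    for m in range(M):
        d = np.sum((Ys[m][:, None] - Ys[m][None]) ** 2, axis=-1)  # d^m_{ij}
        num += np.outer(Pi[:, m], Pi[:, m]) * np.exp(-d)          # pi_i^m pi_j^m e^{-d^m_{ij}}
    np.fill_diagonal(num, 0.0)              # exclude h = i from z_i, so q_{i|i} = 0
    z = num.sum(axis=1, keepdims=True)      # z_i
    return num / z

# Two maps over four objects: objects 0 and 1 nearly coincide in map 0
# but sit at generic far-apart positions in map 1.
rng = np.random.default_rng(0)
Ys = rng.normal(scale=5.0, size=(2, 4, 2))
Ys[0, 1] = Ys[0, 0] + 0.1
Pi = np.full((4, 2), 0.5)
Q = aspect_q(Ys, Pi)
```

Because objects 0 and 1 are close in one map with non-negligible mixing proportions there, <math>q_{1|0}</math> dominates row 0 even though the pair is far apart in the other map.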
<br />
To optimize the aspect map models, we used Carl Rasmussen's "minimize" function <ref> www.kyb.tuebingen.mpg.de/bs/people/carl/code/minimize/ </ref>. The gradients are given by:<br />
<br />
<center> <math> \frac{\partial C}{\partial \pi_i^m}=-\sum_{k}\sum_{l \neq k} p_{l|k} \frac{\partial}{\partial \pi_i^m} [\log q_{l|k}z_k -\log z_k] </math> </center><br />
<br />
Now by substituting the definition of <math> \mathbf{z_k} </math> and reshuffling the terms we will have:<br />
<br />
<center> <math> \frac{\partial C}{\partial \pi_i^m}=\sum_{j}[\frac{1}{q_{j|i} z_i}(q_{j|i}-p_{j|i})+\frac{1}{q_{i|j} z_j}(q_{i|j}-p_{i|j}) ] \pi_{j}^{m}e^{-d^m_{i,j}} </math> </center><br />
<br />
In practice, we do not use the mixing proportions <math> \mathbf{\pi_i^m} </math> themselves as parameters of the model; instead, we define them in terms of unconstrained parameters <math> \mathbf{w_i^m} </math> by: <br />
<br />
<center> <math> \pi_i^m = \frac{e^{-w_i^m}}{\sum_{m'}e^{-w_i^{m'}}} </math> </center><br />
<br />
as a result of that, the gradient becomes:<br />
<br />
<center> <math> \frac{\partial C}{\partial w_i^m} = \pi_i^m \left[ \left(\sum_{m'}\frac{\partial C}{\partial \pi_i^{m'}} \pi_i^{m'}\right)-\frac{\partial C}{\partial \pi_i^m}\right] </math> </center><br />
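The reparametrisation and its chain rule can be checked numerically. The sketch below (using a toy linear cost, not the aspect-map cost itself) implements the softmax with the negative sign exactly as written above and verifies the gradient formula by finite differences.

```python
import numpy as np

def pi_from_w(w):
    """pi_i^m = exp(-w_i^m) / sum_m' exp(-w_i^m')  (note the negative sign)."""
    e = np.exp(-(w - w.min(axis=1, keepdims=True)))   # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

def grad_w(w, grad_pi):
    """Chain rule: dC/dw_i^m = pi_i^m [ (sum_m' dC/dpi_i^m' pi_i^m') - dC/dpi_i^m ]."""
    pi = pi_from_w(w)
    return pi * ((grad_pi * pi).sum(axis=1, keepdims=True) - grad_pi)

# Finite-difference check on a toy cost C(pi) = sum(a * pi), so dC/dpi = a.
rng = np.random.default_rng(1)
w = rng.normal(size=(3, 4))
a = rng.normal(size=(3, 4))
g = grad_w(w, a)
eps = 1e-6
w2 = w.copy()
w2[0, 0] += eps
numeric = (np.sum(a * pi_from_w(w2)) - np.sum(a * pi_from_w(w))) / eps
```

Working in <math>w</math>-space keeps the constraint <math>\sum_m \pi_i^m = 1</math> satisfied automatically during unconstrained optimization.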
<br />
== Modeling Human Word Association Data ==<br />
<br />
In order to see how SNE works in practice, the authors used the University of South Florida database on human word associations, which is available on the web. Participants in the study <br />
were presented with a list of English words as cues and asked to respond to each word with a word that was “meaningfully related or strongly associated” <ref> D. L. Nelson, C. L. McEvoy, and T. A. Schreiber. The University of South Florida word association, rhyme, and word fragment norms. In http://www.usf.edu/FreeAssociation/, 1998. </ref>. The database contains 5018 cue words, with an average of 122 responses to each.<br /><br />
SNE has problems with ambiguous words such as 'fire', 'wood', and 'job'. Putting 'fire' close to both 'wood' and 'job', when the latter two words are not related to each other, is a misrepresentation in SNE. AMSNE solves this kind of problem by treating the word 'fire' as a mixture of two different meanings: in one map, 'fire' is close to 'wood', and in another it is close to 'job'. Besides ambiguity, a word may belong in different places for different reasons. This can be seen with the word 'death', which is close to 'sad' and 'cancer' as well as to 'military' and 'destruction'.<br />
<br />
== Applications ==<br />
<br />
At the NIPS 2008 Conference, L. van der Maaten and G. Hinton gave a demonstration on Visualizing NIPS Cooperations using Multiple Maps t-SNE <ref>http://nips.cc/Conferences/2008/Program/event.php?ID=1472 </ref>. Their demonstration showed visualizations of NIPS co-authorships constructed by the multiple maps version of t-SNE. They showed that it is impossible for multidimensional scaling techniques to construct an appropriate visualization of this similarity data because of the triangle inequality problem, and therefore they created the multiple maps version of t-SNE based on the idea proposed in this paper. In these maps, each author has a copy, and each copy is weighted by a mixing proportion (the mixing proportions for a single author over all maps sum to 1). The multiple maps version of t-SNE can deal with the triangle inequality problem and, as a result, is very good at visualizing NIPS co-authorship data.<br />
<br />
==References==<br />
<references/></div>
<hr />
<div>== Introduction ==<br />
<br />
The main idea of this paper is to show how we can utilize several different two-dimensional maps in order to visualize a set of pairwise similarities. Aspect maps resemble both clustering (in modeling pair-wise similarities as a mixture of different types of similarity) and multi-dimensional scaling (in modeling each type of similarity by a two-dimensional map) . While methods such as PCA and MDS (Metric Multi-dimensional Scaling) are simple and fast, their main drawback can be seen in minimizing a cost function that is mainly focused on modeling large dissimilarities rather than small ones. As a result of that, they do not provide good visualizations of data that lies on a curved low-dimensional manifold in a high dimensional space. Also methods such as Local MDS, LLE, Maximum Variance Unfolding or Stochastic Neighbour Embedding (SNE) model local distances accurately in the two-dimensional visualization, but modeling of larger distances is done inaccurately.<br />
<br />
SNE outweighs methods such as LLE in two ways: Despite difficulty of optimizing the SNE objective function, it leads to much better solutions and since SNE is based on probabilistic model, it is much more efficient in producing better visualization. In the next section, we will explain how SNE works.<br />
<br />
== Stochastic Neighbour Embedding ==<br />
<br />
The core of SNE method <ref> G. Hinton and S. Roweis. Stochastic neighbor embedding. Advances in Neural Information Processing Systems, 15:833–840, 2003 </ref><br />
lies in converting high-dimensional distance or similarity data into a set of <math> \mathbf{ p_{j|i} }</math>, each of which represent the probability that one object <math> i </math> pick another object <math> j </math> as its neighbour if it was only allowed to pick one neighbour. For objects in high dimensional Euclidian space, where our data points consists of the coordinates of the objects, we can find <math> \mathbf{ p_{j|i} } </math> for each object <math> i </math> by using a spherical Gaussian distribution centered at the high-dimensional position of <math> i </math>, <math> \mathbf{ X_{i}} </math>. We will set <math> \mathbf{ p_{i|i} = 0 }</math> and for <math> \mathbf{ j \neq i } </math>,<br />
<br />
<center> <math> \mathbf p_{j|i} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma_i ^2 )}{\sum_{k \neq i} \exp(-||x_i-x_k ||^2/ 2\sigma_i ^2 ) }</math> </center><br />
<br />
Intuitively, if one object, say <math>\, i </math>, is only allowed to pick one neighbor, say <math>\, j </math>, <math>\, j </math> should be the best one with the minimum relative distance. In other words, the more the relative distance from <math>\, i </math>, the less the probability of being chosen as one-allowed neighbor. With this intuition, it makes sense to define <math> \mathbf p_{j|i}</math> so that the numerator is proportional to <math>\, j </math>'s distance from <math>\, i </math> and the denominator is proportional to sum of all probable neighbors' distance from <math>\, i </math>.<br /><br />
<br />
Note that given a set of pairwise distances between objects, <math> \mathbf{|| x_i - x_j ||} </math>, we can use the above equation to derive the same probabilities. In practice, given a set of <math> N </math> points, we set the variance of the Gaussian <math> \mathbf{ \sigma_i ^2} </math>, either by hand or we find it by a binary search for the values of <math> \mathbf{ \sigma_i } </math> that make the entropy of the distribution over neighbours equal to <math> \mathbf{ \log_2 M} </math> (Remember that the entropy of the distribution <math> \mathbf{ P_i} </math> is defined as <math> \int_{-\infty}^{+\infty}p(x)\log(1/p(x))dx </math> and <math> \mathbf{ p(x)\log(1/p(x))} </math> is understood to be zero when <math> \mathbf{p(x)=0)} </math>.) This is done by starting from a number <math> \mathbf{ M \ll N} </math> and performing the binary search until the entropy of <math> \mathbf{ P_i} </math> is within some predetermined small tolerance of <math> \mathbf{\log_2 M } </math>. <br />
<br />
Our main goal in SNE is to model <math>\mathbf{p_{j|i}}</math> by using the conditional probabilities <math>\mathbf{q_{j|i}}</math>, which are determined by the locations <math>\mathbf{ y_i} </math> of points in low-dimensional space: <br />
<center> <math> q_{j|i} = \frac{\exp(-||y_i-y_j ||^2)}{\sum_{k \neq i} \exp(-||y_i-y_k ||^2) }</math> </center><br />
The aim of embedding is to match these two distributions as well as possible. To do so, we minimize a cost function which is a sum of Kullback-Leibler divergences between the original <br />
<math> \mathbf{p_{j|i}} </math> and induced <math> \mathbf{ q_{j|i}} </math> distributions over neighbours for each object:<br />
<br />
<center> <math> C = \sum_{i} KL(P_i||Q_i) =\sum_{i}\sum_{j \neq i} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}</math> </center><br />
<br />
The dimensionality of the <math> \mathbf{Y} </math> space is chosen to be much lower than the number of objects. Notice that making <math> \mathbf{ q_{j|i}} </math> large when <math> \mathbf{ p_{j|i}} </math> is small wastes some of the probability mass in the <math> \mathbf{Q} </math> distribution, so there is a cost for modeling a big distance in the high-dimensional space with a small one in the map, though it is less than the cost of modeling a small distance with a big one. In this respect SNE is an improvement over methods like LLE: while SNE emphasizes local distances, its cost function cleanly enforces ''both'' keeping the images of nearby objects nearby ''and'' keeping the images of widely separated objects relatively far apart. Although differentiating <math> \mathbf{C} </math> is tedious, because <math> \mathbf{y_k} </math> affects <math> \mathbf{ q_{j|i}} </math> through the normalization term in its definition, the final result is simple and has a nice physical interpretation:<br />
<br />
<center> <math> \frac{\partial C}{\partial y_i} = 2\sum_{j} (y_i-y_j)([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) </math> </center><br />
<br />
Using steepest descent to minimize <math> \mathbf{C} </math>, in which all of the points are adjusted in parallel, is inefficient and can get stuck in poor local minima. To address this problem, we add Gaussian noise to the <math> \mathbf{y} </math> values after each update. We start with a high noise level and reduce it rapidly to find the approximate level at which structure starts to form in the low-dimensional map. Once we observe that a small increase in the noise level leads to a large decrease in the cost function, we can be sure that structure is emerging. By repeating this process, starting from a noise level just above the one at which structure emerged and reducing it gently, we can find low-dimensional maps that correspond to significantly better minima of <math> \mathbf{C} </math>.<br />
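A bare-bones version of this optimization, using the gradient above plus annealed Gaussian jitter, might look like the sketch below (all names, step sizes, and the annealing schedule are illustrative, not the paper's):<br />

```python
import numpy as np

def sne_q(Y):
    """q_{j|i} induced by low-dimensional positions Y (one row per object)."""
    d_sq = np.sum((Y[:, None, :] - Y[None, :, :])**2, axis=-1)
    np.fill_diagonal(d_sq, np.inf)                # exclude j = i
    e = np.exp(-d_sq)
    return e / e.sum(axis=1, keepdims=True)

def sne_grad(P, Y):
    """dC/dY, with P[i, j] = p_{j|i}; matches the gradient in the text."""
    Q = sne_q(Y)
    M = (P - Q) + (P - Q).T                       # [p_{j|i}-q_{j|i}] + [p_{i|j}-q_{i|j}]
    return 2.0 * (M.sum(axis=1, keepdims=True) * Y - M @ Y)

def sne(P, dim=2, steps=300, lr=0.1, noise=0.05, seed=0):
    """Gradient descent on C with noise annealed away over the first half."""
    rng = np.random.default_rng(seed)
    Y = 1e-4 * rng.normal(size=(P.shape[0], dim))
    for t in range(steps):
        Y = Y - lr * sne_grad(P, Y)
        Y += noise * max(0.0, 1.0 - 2.0 * t / steps) * rng.normal(size=Y.shape)
    return Y
```

Note `sne_grad` uses the identity <math> \sum_j M_{ij}(y_i-y_j) = (\sum_j M_{ij})y_i - \sum_j M_{ij}y_j </math> to avoid an explicit double loop.<br />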
<br />
== Symmetric SNE ==<br />
<br />
An alternative to SNE, which minimizes a sum of divergences between ''conditional'' distributions, is to define a single ''joint'' distribution over all non-identical ordered pairs in each space and minimize the divergence between these two joint distributions.<br />
<br />
In this case we define <math> \mathbf{p_{ij}} </math> by<br />
<br />
<center> <math> \mathbf p_{ij} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma^2 )}{\sum_{k < l} \exp(-||x_k-x_l ||^2/ 2\sigma^2 ) }</math> </center><br />
<br />
The <math> \mathbf{q_{ij}} </math> are defined by<br />
<br />
<center> <math> \mathbf q_{ij} = \frac{\exp(-||y_i-y_j ||^2 )}{\sum_{k < l} \exp(-||y_k-y_l ||^2) }</math> </center><br />
<br />
and finally the symmetric version of our cost function, <math> \mathbf{C_{sym}} </math>, becomes the KL divergence between the two distributions<br />
<br />
<center> <math> C_{sym} = KL(P||Q) =\sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}</math> </center><br />
<br />
The benefit of defining <math> \mathbf{p_{ij}} </math> in this way is that the derivatives become much simpler. A drawback, however, is that if one of the high-dimensional points, <math> \mathbf{j} </math>, is far from all of the others, every <math> \mathbf{p_{ij}} </math> involving <math> \mathbf{j} </math> will be very small, so the location of <math> \mathbf{j} </math> in the map is poorly determined. To avoid this, we can instead define <math> \mathbf{p_{ij}=0.5(p_{j|i}+p_{i|j})} </math>: even when <math> \mathbf{j} </math> is far from all the other points, the conditional probabilities <math> \mathbf{p_{i|j}} </math> still sum to 1, so <math> \mathbf{j} </math> makes a significant contribution to the cost function.<br />
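The symmetrization is a one-liner; in the sketch below I additionally divide by the total mass (my addition, not stated in the text) so that the <math> \mathbf{p_{ij}} </math> form a proper probability distribution:<br />

```python
import numpy as np

def joint_from_conditionals(P_cond):
    """Symmetrized p_ij from a matrix of conditionals with P_cond[i, j] = p_{j|i}.

    p_ij is proportional to 0.5*(p_{j|i} + p_{i|j}); dividing by the total mass
    (my addition) makes the p_ij sum to 1 over all ordered pairs."""
    P = 0.5 * (P_cond + P_cond.T)
    return P / P.sum()
```

Because each row of conditionals sums to 1, every point <math> \mathbf{j} </math> retains total mass at least <math> 1/(2n) </math>, which is exactly the outlier-protection argument above.<br />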
<br />
== Aspect Maps ==<br />
<br />
Another approach for defining <math> \mathbf{q_{j|i}} </math> is to allow <math> \mathbf{i} </math> and <math> \mathbf{j} </math> to occur in several different two-dimensional maps, assigning object <math> \mathbf{i} </math> a mixing proportion <math> \mathbf{\pi_{i}^{m}} </math> in the m-th map. Note that we should have <math> \mathbf{\sum_{m} \pi_{i}^{m}=1} </math>. Using these different maps, we define <math> \mathbf{q_{j|i}} </math> as follows:<br />
<br />
<center> <math> q_{j|i} = \frac{\sum_{m} \pi_{i}^{m}\pi_{j}^{m} e^{-d_{i,j}^{m}} }{z_i} </math> </center><br />
<br />
where<br />
<br />
<center> <math> d_{i,j}^{m}=|| y_i^m-y_j^m ||^2, \quad z_i=\sum_{h}\sum_{m} \pi_{i}^{m} \pi_{h}^{m} e^{-d_{i,h}^{m}} </math> </center><br />
<br />
Using a mixture model is very different from simply using a single space with extra dimensions, because in a single space points that are far apart on one dimension cannot have a high <math> \mathbf{q_{j|i}} </math> no matter how close together they are on the other dimensions. By contrast, when we use a mixture model, provided that ''there is'' at least one map in which <math> \mathbf{i} </math> is close to <math> \mathbf{j} </math>, ''and'' provided that the versions of <math> \mathbf{i} </math> and <math> \mathbf{j} </math> in that map have high mixing proportions, it is possible for <math> \mathbf{q_{j|i}} </math> to be quite large even if <math> \mathbf{i} </math> and <math> \mathbf{j} </math> are far apart in all the other maps. <br />
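The definitions of <math> \mathbf{q_{j|i}} </math>, <math> \mathbf{d_{i,j}^{m}} </math>, and <math> \mathbf{z_i} </math> above can be transcribed directly; the array layout and names in this sketch are mine:<br />

```python
import numpy as np

def aspect_q(Ys, Pi):
    """q_{j|i} for aspect maps.

    Ys: array of shape (M, n, 2), the copy of each object in each of M maps.
    Pi: array of shape (n, M), mixing proportions; each row sums to 1."""
    M, n, _ = Ys.shape
    num = np.zeros((n, n))
    for m in range(M):
        d_sq = np.sum((Ys[m][:, None] - Ys[m][None, :])**2, axis=-1)  # d_{i,j}^m
        num += np.outer(Pi[:, m], Pi[:, m]) * np.exp(-d_sq)
    np.fill_diagonal(num, 0.0)               # z_i sums over h != i
    return num / num.sum(axis=1, keepdims=True)
```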
<br />
To optimize the aspect map models, we used Carl Rasmussen's "minimize" function given in <ref> www.kyb.tuebingen.mpg.de/bs/people/carl/code/minimize/ </ref>. The gradients are given by:<br />
<br />
<center> <math> \frac{\partial C}{\partial \pi_i^m}=-\sum_{k}\sum_{l \neq k} p_{l|k} \frac{\partial}{\partial \pi_i^m} [\log q_{l|k}z_k -\log z_k] </math> </center><br />
<br />
Now by substituting the definition of <math> \mathbf{z_k} </math> and reshuffling the terms we will have:<br />
<br />
<center> <math> \frac{\partial C}{\partial \pi_i^m}=\sum_{j}[\frac{1}{q_{j|i} z_i}(q_{j|i}-p_{j|i})+\frac{1}{q_{i|j} z_j}(q_{i|j}-p_{i|j}) ] \pi_{j}^{m}e^{-d^m_{i,j}} </math> </center><br />
<br />
In practice, we do not use the mixing proportions <math> \mathbf{\pi_i^m} </math> themselves as parameters of the model; instead, we parametrize them with <math> \mathbf{w_i^m} </math> through: <br />
<br />
<center> <math> \pi_i^m = \frac{e^{-w_i^m}}{\sum_{m'}e^{-w_i^{m'}}} </math> </center><br />
<br />
As a result, the gradient becomes:<br />
<br />
<center> <math> \frac{\partial C}{\partial w_i^m} = \pi_i^m \left[ \left(\sum_{m'}\frac{\partial C}{\partial \pi_i^{m'}} \pi_i^{m'}\right)-\frac{\partial C}{\partial \pi_i^m}\right] </math> </center><br />
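This chain-rule formula is easy to verify numerically. The sketch below (names mine) implements the softmax parametrization and the <math> \mathbf{w} </math>-gradient for a single object, and the check compares it against finite differences for an arbitrary linear cost in <math> \mathbf{\pi} </math>:<br />

```python
import numpy as np

def pi_from_w(w):
    """Mixing proportions for one object: softmax of -w over the M maps."""
    e = np.exp(-(w - w.min()))            # shift exponents for numerical stability
    return e / e.sum()

def grad_w(w, grad_pi):
    """dC/dw_i^m from dC/dpi_i^m via the chain-rule formula in the text."""
    pi = pi_from_w(w)
    return pi * (np.dot(grad_pi, pi) - grad_pi)
```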
<br />
== Modeling Human Word Association Data ==<br />
<br />
In order to see how SNE works in practice, the authors used the University of South Florida database of human word associations, which is available on the web. Participants in the study <br />
were presented with a list of English words as cues, and asked to respond to each word with a word which was “meaningfully related or strongly associated” <ref> D. L. Nelson, C. L. McEvoy, and T. A. Schreiber. The university of south florida word association, rhyme, and word fragment norms. In http://www.usf.edu/FreeAssociation/, 1998. </ref> The database contains 5018 cue words, with an average of 122 responses to each.<br /><br />
SNE has some problems with ambiguous words such as 'fire', 'wood', and 'job'. Putting 'fire' close to both 'wood' and 'job', when the two latter words are not related to each other, is a misrepresentation in SNE. Aspect maps (AMSNE) offer a solution to this kind of problem by treating the word 'fire' as a mixture of two different meanings: in one map 'fire' is close to 'wood', and in the other it is close to 'job'.<br />
<br />
== Applications==<br />
<br />
At the NIPS 2008 Conference, there was a demonstration by L. van der Maaten and G. Hinton on Visualizing NIPS Cooperations using Multiple Maps t-SNE <ref>http://nips.cc/Conferences/2008/Program/event.php?ID=1472 </ref>. Their demonstration showed visualizations of NIPS co-authorships constructed by the multiple maps version of t-SNE. They showed that it is impossible for multidimensional scaling techniques to construct an appropriate visualization of the similarity data because of the triangle inequality problem, and they therefore created the multiple maps version of t-SNE based on the idea proposed in this paper. In these maps, each author has a copy in every map, and each copy is weighted by a mixing proportion (the mixing proportions for a single author over all maps sum to 1). The multiple maps version of t-SNE can deal with the triangle inequality problem, and as a result it is very good at visualizing NIPS co-authorship data.<br />
<br />
=References=<br />
<references/></div>Myakhavehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=independent_Component_Analysis:_algorithms_and_applications&diff=2853independent Component Analysis: algorithms and applications2009-07-10T23:48:33Z<p>Myakhave: /* Finding hidden factors in financial data */</p>
<hr />
<div>==Motivation==<br />
Imagine a room where two people are speaking at the same time and two microphones are used to record the speech signals. Denoting the speech signals by <math>s_1(t) \,</math> and <math>s_2(t)\,</math> and the recorded signals by <math> x_1(t) \,</math> and <math>x_2(t) \,</math>, we can assume the linear relation <math>x = As \,</math>, where <math>A \,</math> is a parameter matrix that depends on the distances of the microphones from the speakers. The interesting problem of estimating both <math>A\,</math> and <math>s\,</math> using only the recorded signals <math>x\,</math> is called the ''cocktail-party problem'', which is the signature problem for '''ICA'''.<br />
<br />
==Introduction==<br />
'''ICA''' shows, perhaps surprisingly, that the ''cocktail-party problem'' can be solved by imposing two rather weak (and often realistic) assumptions, namely that the source signals are statistically independent and have non-Gaussian distributions. Note that PCA and classical factor analysis cannot solve the ''cocktail-party problem'' because such methods seek components that are merely uncorrelated, a condition much weaker than independence. The independence assumption gives us the advantage that signals obtained from nonlinear transformations of the source signals are uncorrelated, which is not true when the source signals are merely uncorrelated. These two assumptions also give us an objective in finding the matrix <math>\ A</math>: we want to find components which are as statistically independent and non-Gaussian as possible.<br />
<br />
'''ICA''' has many applications in science and engineering. For example, it can be used to find the original components of brain activity by analyzing electrical recordings of brain activity given by an electroencephalogram (EEG). Another important application is finding efficient representations of multimedia data for compression or denoising.<br />
<br />
'''Relationship with Dimension Reduction'''<ref>A. Hyvärinen, J. Karhunen, E. Oja (2001): Independent Component Analysis, New York: Wiley, ISBN 978-0-471-40540-5 Introductory chapter</ref><br />
<br>Suppose we have <math>n</math> observed signals <math>\ x_i</math>, where <math>\ i=1,...,n</math>, obtained by mixing <math>\ m</math> source signals <math>\ y_i</math>, where <math>\ i=1,...,m</math>,<br />
<br>we want to find such a transformation matrix <math>\ W</math>, that for a given number of dimensions <math>\ d</math><br />
<br><math>\ y'=Wx</math>, where <math>\ y'</math> is a <math>\ d \times 1</math> vector.<br />
<br>The transformed variables <math>\ y'_i</math> are considered the components explaining the essential structure of the observed data. These components should contain as much information about the observed data as possible.<br />
<br />
'''Concerns'''<br />
<br>The ''cocktail-party problem'', or ''blind source separation problem'', means that we don't have information about the source signals. In the ICA setting, it seems that the number of observed signals and the number of source signals are equal. However, in general, the number of sensors could be less than the number of sources. In an extreme case, we can have only one sensor but several sources; for example, we can have one microphone recording two speeches. Given such a mixed signal, could we separate it? This is one of the applications of the paper by Francis R. Bach and Michael I. Jordan [[Learning Spectral Clustering, With Application To Speech Separation ]]. One concern about ICA is whether, in this case, where the matrix <math>\ A</math> is not square, it can demix the signals. Another is whether observed signals that are quite different from each other will cause difficulty in applying ICA.<br />
<br />
<br />
===Definition of ICA===<br />
The '''ICA''' model assumes a linear mixing model <math> x = As \,</math>, where <math>x \,</math> is a random vector of observed signals, <math>A \,</math> is a square matrix of constant parameters, and <math>s \,</math> is a random vector of statistically independent source signals. Each component of <math>s</math> is a source signal. Note that the restriction of <math> A \,</math> being a square matrix is not theoretically necessary and is imposed only to simplify the presentation. Also keep in mind that in the mixing model we do not assume any distributions for the independent components.<br />
<br />
===Ambiguities of ICA===<br />
Because both <math>A \,</math> and <math>s \,</math> are unknown, it is easy to see that the variances, the sign or the order of the independent components cannot be determined. Fortunately such ambiguities are often insignificant in practice and '''ICA''' can as well just fix the sign and assume unit variance of the components.<br />
<br />
===Why Gaussian variables are forbidden===<br />
In this section we show that '''ICA''' cannot resolve independent components which have Gaussian distributions.<br />
<br />
To see this, assume that the two source signals <math>s_1 \,</math> and <math>s_2 \,</math> are Gaussian and the mixing matrix <math>A\,</math> is orthogonal. Then the observed signals <math>x_1 \,</math> and <math>x_2 \,</math> will have joint density given by <math>p(x_1,x_2)=\frac{1}{2 \pi}\exp(-\frac{x_1^2+x_2^2}{2})</math>, which is rotationally symmetric. In other words, the joint density is the same for '''any''' orthogonal mixing matrix. This means that in the case of Gaussian variables, '''ICA''' can only determine the mixing matrix up to an orthogonal transformation.<br />
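This rotational ambiguity is easy to see numerically; in the sketch below the rotation angle is arbitrary:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.normal(size=(200_000, 2))               # two independent Gaussian sources

theta = 0.7                                      # any angle gives the same statistics
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # orthogonal mixing matrix
x = s @ A.T                                      # observed (mixed) signals

# The mixture has the same identity covariance as the sources, and a rotated
# spherical Gaussian is still a spherical Gaussian, so no statistic of x can
# recover A: any orthogonal demixing matrix fits the data equally well.
cov_x = np.cov(x.T)
```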
<br />
The fact that '''ICA''' cannot be used on Gaussian variables is a primary reason for ICA's late emergence in the research literature, since classical factor analysis assumes Gaussian random variables.<br /><br />
In the real world, we may face distributions close to the Gaussian, such as the Student t distribution. The question is what happens to ICA in these situations; if it cannot resolve them, isn't the method too restrictive?<br />
<br />
===Independence versus uncorrelatedness===<br />
Two random variables <math>\,y_1, y_2</math> are independent if information on <math>\,y_1</math> doesn't give any information on <math>\,y_2</math>, and vice versa. In mathematical terms, <math>\,y_1, y_2</math> are independent if the joint probability density function can be written as the product of the individual probability density functions:<br /><br />
<math>\,p(y_1, y_2) = p_1(y_1)*p_2(y_2)</math><br /><br />
<br />
Two random variables <math>\,y_1, y_2</math> are uncorrelated if the covariance is zero:<br /><br />
<math>\,E(y_1y_2) - E(y_1)E(y_2)=0</math><br /><br />
<br />
Independence is a much stronger requirement than uncorrelatedness. Of particular interest to ICA theory are the following two results, which show that with additional assumptions, uncorrelatedness is equivalent to independence.<br />
<br />
'''Result 1:''' Two random variables <math>X \,</math> and <math>Y \,</math> are independent if and only if any bounded continuous functions of <math>X \,</math> and <math>Y \,</math> are uncorrelated.<br />
<br />
'''Result 2:''' Two Gaussian random variables <math>X \,</math> and <math>Y \,</math> are independent if and only if they are uncorrelated.<br />
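A standard counterexample makes the gap between the two notions concrete: with <math>\,y_2 = y_1^2</math> the variables are uncorrelated yet completely dependent, and a nonlinear correlation exposes the dependence (in the spirit of Result 1):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
y1 = rng.normal(size=200_000)
y2 = y1**2                                # fully determined by y1, hence dependent

# Uncorrelated: cov(y1, y2) = E[y1^3] = 0 for a symmetric distribution
cov = np.mean(y1 * y2) - np.mean(y1) * np.mean(y2)

# But correlating nonlinear functions of each reveals the dependence:
# population value is E[y1^4] - E[y1^2]^2 = 2 for a standard Gaussian
cov_sq = np.mean(y1**2 * y2) - np.mean(y1**2) * np.mean(y2)
```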
<br />
===Data Whitening===<br />
Data whitening is a transformation that changes the covariance matrix of a set of samples into the identity matrix. In other words, it decorrelates the random variables of the samples; after whitening, each variable has unit variance.<br />
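A common way to whiten is through the eigendecomposition of the sample covariance; the following is a minimal sketch (the function name is mine):<br />

```python
import numpy as np

def whiten(X):
    """Whiten X (one sample per row): output has zero mean and identity covariance."""
    Xc = X - X.mean(axis=0)
    d, E = np.linalg.eigh(np.cov(Xc.T))      # covariance = E diag(d) E^T
    W = E @ np.diag(d ** -0.5) @ E.T         # symmetric whitening matrix C^{-1/2}
    return Xc @ W.T
```

Whitening is often used as a preprocessing step for ICA, since it leaves only an orthogonal transformation to be estimated.<br />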
<br />
===ICA Estimation Principles===<br />
<br />
====Principle 1: Nonlinear decorrelation====<br />
<br />
From the above discussion, we see that we can estimate the mixing matrix <math>A \,</math> by finding a matrix <math>W \,</math> such that the components of <math> y = Wx \,</math> are nonlinearly decorrelated: for any <math> i \neq j \,</math> and suitable nonlinear functions <math>g \,</math> and <math>h \,</math>, <math>g(y_i) \,</math> and <math>h(y_j) \,</math> are uncorrelated.<br />
<br />
====Principle 2: Maximizing Non-Gaussianity====<br />
Loosely speaking, the Central Limit Theorem says that a sum of independent, identically distributed non-Gaussian random variables is closer to Gaussian than the original variables. Because of this, any mixture of the non-Gaussian independent components will be more Gaussian than the original signals <math> s \,</math>. Using this observation, we can find the original signals from the observed signals <math>x \,</math> as follows: find the weighting vectors <math>w \,</math> such that the projections <math>w^T x \,</math> are as non-Gaussian as possible.<br />
<br />
==Measures of non-Gaussianity==<br />
<br />
===kurtosis===<br />
Kurtosis is the classical measure of non-Gaussianity, defined by<br />
<math>kurt(y) = E\{y^4\} - 3(E\{y^2\})^2 \,</math>.<br />
Positive kurtosis typically implies a spiky pdf near zero with heavy tails at the two ends (e.g. the Laplace distribution);<br />
negative kurtosis typically implies a flat pdf which is rather constant near zero and very small at the two ends (e.g. the uniform distribution with finite support).<br />
<br />
As a computational measure of non-Gaussianity, kurtosis, on one hand, has the merit that it is easy to compute and has nice linearity properties. On the other hand, it is non-robust, because the sample kurtosis can be significantly affected by a few outliers, even for a large sample size.<br />
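A quick numerical check of the sign conventions above, and of the Central Limit Theorem argument behind Principle 2 (a sketch; the sample sizes are arbitrary):<br />

```python
import numpy as np

def kurt(y):
    """Excess kurtosis of a sample; zero for Gaussian data."""
    z = (y - y.mean()) / y.std()
    return np.mean(z**4) - 3.0

rng = np.random.default_rng(0)
n = 200_000
laplace = rng.laplace(size=n)             # spiky, heavy-tailed: kurt > 0
uniform = rng.uniform(-1, 1, size=n)      # flat, finite support: kurt < 0
mixture = (uniform + rng.uniform(-1, 1, size=n)) / np.sqrt(2.0)
# |kurt(mixture)| < |kurt(uniform)|: a mixture of independent sources is
# "more Gaussian" than the sources, which is why maximizing |kurt| can demix.
```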
<br />
===negentropy===<br />
====Intuitive explanation====<br />
Before understanding negentropy, we have to first understand entropy, which is a key concept in information theory. Loosely speaking, entropy is a measure of how "distributed" a random variable is, and a rule of thumb is that a "more distributed" pdf has a higher entropy. An important theorem in information theory states that the Gaussian distribution has the largest entropy among all distributions with the same variance. In informal language, this means the Gaussian distribution is the most "distributed" pdf. Negentropy measures non-Gaussianity by the difference in entropy between a pdf and the corresponding Gaussian distribution - this is made precise in the following technical explanation.<br />
<br />
====Technical explanation====<br />
The entropy of a discrete random variable <math>X \,</math> with possible values <math>\{x_1, x_2, ..., x_n\} \,</math> is defined as <math>H(X) = -\sum_{i=1}^n {p(x_i) \log p(x_i)}</math><br />
<br />
The (differential) entropy of a continuous random variable <math>X \,</math> with probability density function <math>f \,</math> is similarly defined as <math>H[X] = -\int\limits_{-\infty}^{\infty} f(x) \log f(x)\, dx</math><br />
<br />
It is obvious how the definition of differential entropy can be extended to higher dimensions.<br />
<br />
For a random vector <math>y\,</math> with covariance matrix <math>C \,</math>, its negentropy is defined as <math> J(y) = H(Gaussian_C) - H(y) \,</math>, where <math>Gaussian_C \,</math> denotes the Gaussian distribution with covariance matrix <math>C \,</math>. Note that negentropy is always non-negative and equals zero exactly for a Gaussian distribution.<br />
<br />
====Empirical estimation of negentropy====<br />
In practice, negentropy has to be estimated from a finite sample. There are two main ways to do this. The first approach is to Taylor expand negentropy and keep the lower-order terms. This results in an estimate of negentropy expressed in higher moments (third order and above) of the pdf. As the estimation involves higher moments, it suffers from the same non-robustness problem as kurtosis. The second, and more robust, approach finds the distribution with the maximum entropy that is compatible with the observed sample, and estimates the negentropy of the real (and unknown) distribution by the negentropy of this entropy-maximizing distribution. While the second approach is more robust, it is also more computationally involved.<br />
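One widely used practical estimator in the second, robust family (Hyvärinen's one-unit contrast, which underlies FastICA; it is not described in the text above, so treat this as a hedged aside) approximates negentropy, up to a positive constant, by <math> J(y) \approx [E\{G(y)\} - E\{G(\nu)\}]^2 </math> with <math> G(u)=\log\cosh(u) </math> and <math> \nu </math> standard Gaussian:<br />

```python
import numpy as np

def negentropy_logcosh(y, n_gauss=500_000, seed=0):
    """Approximate negentropy up to scale: (E[G(y)] - E[G(nu)])^2, G = log cosh.

    The Gaussian reference expectation E[G(nu)] is estimated by Monte Carlo."""
    z = (y - y.mean()) / y.std()           # compare at zero mean, unit variance
    nu = np.random.default_rng(seed).normal(size=n_gauss)
    return (np.mean(np.log(np.cosh(z))) - np.mean(np.log(np.cosh(nu)))) ** 2
```

Unlike kurtosis, this contrast grows only linearly in the tails, so a few outliers cannot dominate the estimate.<br />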
<br />
==A brief history of ICA==<br />
The technique of ICA was first introduced in 1982 in a simplified model of motion coding in muscle contraction, where the original signals were the angular position and velocity of a moving joint and the observed signals were the measurements from two types of sensors measuring muscle contraction. Throughout the 1980s, ICA was mostly known among French researchers but not among the international research community. Many ICA algorithms were developed in the early 1990s, though ICA remained a small and narrow research area until the mid-1990s. The breakthrough happened between the mid-1990s and the late 1990s, during which a number of very fast ICA algorithms, of which FastICA was one, were developed so that ICA could be applied to large-scale problems. Since 2000, many international workshops and papers have been devoted to ICA research, and ICA has now become an established and mature field of research.<br />
<br />
== Kernel ICA <ref> Bach and Jordan,(2002); Kernel Independent Component Analysis. Journal of Machine Learning Research, 3; 1-48</ref>==<br />
<br />
Bach and Jordan (2002) extended ICA to functions in a Reproducing Kernel Hilbert Space (RKHS), rather than the single nonlinear function considered in the earliest works. To do so, they used canonical correlation - the correlation of feature maps of the multivariate random variables under the kernel associated with the RKHS - rather than the direct correlation of the considered random variables.<br />
<br />
==Applications==<br />
ICA was originally applied to the blind source separation problem in signal processing, but it is also an important research topic in many areas such as biomedical engineering, medical imaging, speech enhancement, remote sensing, communication systems, exploration seismology, geophysics, econometrics, and data mining.<br />
<br />
===Finding hidden factors in financial data===<br />
Suppose we have the cashflow of several stores belonging to the same retail chain. The goal is to find the fundamental factors that are common to all stores and affect the cashflow. In this case, factors like seasonal variation and price changes of various commodities affect all stores '''independently'''. This is a work from Kiviluoto and Oja (1998)<ref>Kiviluoto, K. & Oja, E. (1998). Independent component analysis for parallel financial time series. Proceedings of the international conference on neural information processing (ICONIP'98), Vol. 2 (pp. 895-898). Tokyo, Japan.</ref> applying ICA to the cashflow problem. However, I (as a general contributor) still think that factor independence is a strong and rather unrealistic assumption in this particular case. Imagine a case where price changes of various commodities are mixed with seasonal variations.<br />
<br />
==References==<br />
<references/></div>Myakhavehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=independent_Component_Analysis:_algorithms_and_applications&diff=2852independent Component Analysis: algorithms and applications2009-07-10T23:47:22Z<p>Myakhave: /* Finding hidden factors in financial data */</p>
<hr />
<div>==Motivation==<br />
Imagine a room where two people are speaking at the same time and two microphones are used to record the speech signals. Denoting the speech signals by <math>s_1(t) \,</math> and <math>s_2(t)\,</math> and the recorded signals by <math> x_1(t) \,</math> and <math>x_2(t) \,</math>, we can assume the linear relation <math>x = As \,</math>, where <math>A \,</math> is a parameter matrix that depends on the distances of the microphones from the speakers. The interesting problem of estimating both <math>A\,</math> and <math>s\,</math> using only the recorded signals <math>x\,</math> is called the ''cocktail-party problem'', which is the signature problem for '''ICA'''.<br />
<br />
==Introduction==<br />
'''ICA''' shows, perhaps surprisingly, that the ''cocktail-party problem'' can be solved by imposing two rather weak (and often realistic) assumptions, namely that the source signals are statistically independent and have non-Gaussian distributions. Note that PCA and classical factor analysis cannot solve the ''cocktail-party problem'' because such methods seek components that are merely uncorrelated, a condition much weaker than independence. The independence assumption gives us the advantage that signals obtained from nonlinear transformations of the source signals are uncorrelated, which is not true when the source signals are merely uncorrelated. These two assumptions also give us an objective in finding the matrix <math>\ A</math>: we want to find components which are as statistically independent and non-Gaussian as possible.<br />
<br />
'''ICA''' has many applications in science and engineering. For example, it can be used to find the original components of brain activity by analyzing electrical recordings of brain activity given by an electroencephalogram (EEG). Another important application is finding efficient representations of multimedia data for compression or denoising.<br />
<br />
'''Relationship with Dimension Reduction'''<ref>A. Hyvärinen, J. Karhunen, E. Oja (2001): Independent Component Analysis, New York: Wiley, ISBN 978-0-471-40540-5 Introductory chapter</ref><br />
<br>Suppose we have <math>n</math> observed signals <math>\ x_i</math>, where <math>\ i=1,...,n</math>, obtained by mixing <math>\ m</math> source signals <math>\ y_i</math>, where <math>\ i=1,...,m</math>,<br />
<br>we want to find such a transformation matrix <math>\ W</math>, that for a given number of dimensions <math>\ d</math><br />
<br><math>\ y'=Wx</math>, where <math>\ y'</math> is a <math>\ d \times 1</math> vector.<br />
<br>The transformed variables <math>\ y'_i</math> are considered the components explaining the essential structure of the observed data. These components should contain as much information about the observed data as possible.<br />
<br />
'''Concerns'''<br />
<br>The ''cocktail-party problem'', or ''blind source separation problem'', means that we don't have information about the source signals. In the ICA setting, it seems that the number of observed signals and the number of source signals are equal. However, in general, the number of sensors could be less than the number of sources. In an extreme case, we can have only one sensor but several sources; for example, we can have one microphone recording two speeches. Given such a mixed signal, could we separate it? This is one of the applications of the paper by Francis R. Bach and Michael I. Jordan [[Learning Spectral Clustering, With Application To Speech Separation ]]. One concern about ICA is whether, in this case, where the matrix <math>\ A</math> is not square, it can demix the signals. Another is whether observed signals that are quite different from each other will cause difficulty in applying ICA.<br />
<br />
<br />
===Definition of ICA===<br />
The '''ICA''' model assumes a linear mixing model <math> x = As \,</math>, where <math>x \,</math> is a random vector of observed signals, <math>A \,</math> is a square matrix of constant parameters, and <math>s \,</math> is a random vector of statistically independent source signals. Each component of <math>s</math> is a source signal. Note that the restriction of <math> A \,</math> being a square matrix is not theoretically necessary and is imposed only to simplify the presentation. Also keep in mind that in the mixing model we do not assume any distributions for the independent components.<br />
<br />
===Ambiguities of ICA===<br />
Because both <math>A \,</math> and <math>s \,</math> are unknown, it is easy to see that the variances, the sign or the order of the independent components cannot be determined. Fortunately such ambiguities are often insignificant in practice and '''ICA''' can as well just fix the sign and assume unit variance of the components.<br />
<br />
===Why Gaussian variables are forbidden===<br />
In this section we show that '''ICA''' cannot resolve independent components which have Gaussian distributions.<br />
<br />
To see this, assume that the two source signals <math>s_1 \,</math> and <math>s_2 \,</math> are Gaussian and the mixing matrix <math>A\,</math> is orthogonal. Then the observed signals <math>x_1 \,</math> and <math>x_2 \,</math> will have joint density given by <math>p(x_1,x_2)=\frac{1}{2 \pi}\exp(-\frac{x_1^2+x_2^2}{2})</math>, which is rotationally symmetric. In other words, the joint density is the same for '''any''' orthogonal mixing matrix. This means that in the case of Gaussian variables, '''ICA''' can only determine the mixing matrix up to an orthogonal transformation.<br />
<br />
The fact that '''ICA''' cannot be used on Gaussian variables is a primary reason for ICA's late emergence in the research literature, since classical factor analysis assumes Gaussian random variables.<br /><br />
In the real world, we may face distributions close to the Gaussian, such as the Student t distribution. The question is what happens to ICA in these situations; if it cannot resolve them, isn't the method too restrictive?<br />
<br />
===Independence versus uncorrelatedness===<br />
Two random variables <math>\,y_1, y_2</math> are independent if information on <math>\,y_1</math> doesn't give any information on <math>\,y_2</math>, and vice versa. In mathematical terms, <math>\,y_1, y_2</math> are independent if the joint probability density function can be written as the product of the individual probability density functions:<br /><br />
<math>\,p(y_1, y_2) = p_1(y_1)*p_2(y_2)</math><br /><br />
<br />
Two random variables <math>\,y_1, y_2</math> are uncorrelated if the covariance is zero:<br /><br />
<math>\,E(y_1y_2) - E(y_1)E(y_2)=0</math><br /><br />
<br />
Independence is a much stronger requirement than uncorrelatedness. Of particular interest to ICA theory are the following two results, which show that with additional assumptions, uncorrelatedness is equivalent to independence.<br />
<br />
'''Result 1:''' Two random variables <math>X \,</math> and <math>Y \,</math> are independent if and only if any bounded continuous functions of <math>X \,</math> and <math>Y \,</math> are uncorrelated.<br />
<br />
'''Result 2:''' Two Gaussian random variables <math>X \,</math> and <math>Y \,</math> are independent if and only if they are uncorrelated.<br />
<br />
===Data Whitening===<br />
Data whitening is a transformation that changes the covariance matrix of a set of samples into the identity matrix. In other words, it decorrelates the random variables of the samples; after whitening, each variable has unit variance.<br />
<br />
===ICA Estimation Principles===<br />
<br />
====Principle 1: Nonlinear decorrelation====<br />
<br />
From the above discussion, we see that we can estimate the mixing matrix <math>A \,</math> by finding a matrix <math>W \,</math> such that the components of <math> y = Wx \,</math> are nonlinearly decorrelated: for any <math> i \neq j \,</math> and suitable nonlinear functions <math>g \,</math> and <math>h \,</math>, <math>g(y_i) \,</math> and <math>h(y_j) \,</math> are uncorrelated.<br />
<br />
====Principle 2: Maximizing Non-gaussianity====<br />
Loosely speaking, the Central Limit Theorem says that a sum of independent, identically distributed non-gaussian random variables is closer to gaussian than the original variables. Because of this, any mixture of the non-gaussian independent components would be more gaussian than the original signals <math> s \,</math>. Using this observation, we can recover the original signals from the observed signals <math>x \,</math> as follows: find the weighting vectors <math>w \,</math> such that the projections <math>w^T x \,</math> are as non-gaussian as possible.<br />
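This "mixtures are more gaussian" effect can be checked numerically. The sketch below (NumPy assumed; the sources, sample size and seed are arbitrary choices) uses the kurtosis measure discussed in the next section, which is zero for a Gaussian variable:<br />

```python
import numpy as np

def excess_kurtosis(y):
    # kurt(y) = E{y^4} - 3 (E{y^2})^2 after standardizing y; zero for a Gaussian.
    y = (y - y.mean()) / y.std()
    return np.mean(y**4) - 3.0

rng = np.random.default_rng(2)
n = 100_000
s1 = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), n)  # unit-variance uniform source
s2 = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), n)  # an independent copy
x = (s1 + s2) / np.sqrt(2.0)                      # a unit-variance mixture

k_source = excess_kurtosis(s1)  # theory: -1.2 for a uniform source
k_mix = excess_kurtosis(x)      # theory: -0.6, closer to the Gaussian value 0
print(k_source, k_mix)
```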
<br />
==Measures of non-Gaussianity==<br />
<br />
===kurtosis===<br />
Kurtosis is the classical measure of non-Gaussianity which is defined by<br />
<math>kurt(y) = E\{y^4\} - 3(E\{y^2\})^2 \,</math>.
Positive kurtosis typically implies a spiky pdf with a peak near zero and heavy tails at the two ends (e.g. the Laplace distribution);<br />
negative kurtosis typically implies a flat pdf which is rather constant near zero and very small at the two ends (e.g. a uniform distribution with finite support).<br />
<br />
As a computational measure of non-gaussianity, kurtosis has the merit that it is easy to compute and has nice linearity properties. On the other hand, it is not robust: the sample kurtosis can be significantly affected by a few outliers, even for a large sample size.<br />
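A quick numerical check of the sign conventions above, assuming NumPy (the sample size and seed are arbitrary):<br />

```python
import numpy as np

def kurt(y):
    # Sample version of kurt(y) = E{y^4} - 3 (E{y^2})^2, after standardizing y.
    y = (y - y.mean()) / y.std()
    return np.mean(y**4) - 3.0

rng = np.random.default_rng(4)
n = 500_000
k_laplace = kurt(rng.laplace(0.0, 1.0, n))   # spiky, heavy-tailed: positive (theory: 3)
k_uniform = kurt(rng.uniform(-1.0, 1.0, n))  # flat, bounded: negative (theory: -1.2)
k_gauss   = kurt(rng.standard_normal(n))     # Gaussian reference: near zero
print(k_laplace, k_uniform, k_gauss)
```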
<br />
===negentropy===<br />
====Intuitive explanation====<br />
Before understanding negentropy, we have to first understand entropy, which is a key concept in information theory. Loosely speaking, entropy is a measure of how "spread out" a random variable is, and a rule of thumb is that a "more spread out" pdf has a higher entropy. An important theorem in information theory states that the Gaussian distribution has the largest entropy among all distributions with the same variance. In informal language, this means the Gaussian distribution is the most "spread out" pdf. Negentropy measures non-gaussianity by the difference in entropy between a pdf and the corresponding Gaussian distribution - this is made precise in the following technical explanation.<br />
<br />
====Technical explanation====<br />
The entropy of a discrete random variable <math>X \,</math> with possible values <math>\{x_1, x_2, ..., x_n\} \,</math> is defined as <math>H(X) = -\sum_{i=1}^n {p(x_i) \log p(x_i)}</math><br />
<br />
The (differential) entropy of a continuous random variable <math>X \,</math> with probability density function <math>f \,</math> is similarly defined as <math>H[X] = -\int\limits_{-\infty}^{\infty} f(x) \log f(x)\, dx</math><br />
<br />
The definition of differential entropy extends naturally to higher dimensions.<br />
<br />
For a random vector <math>y\,</math> with covariance matrix <math>C \,</math>, its negentropy is defined as <math> J(y) = H(Gaussian_C) - H(y) \,</math>, where <math>Gaussian_C \,</math> denotes the Gaussian distribution with covariance matrix <math>C \,</math>. Note that negentropy is always non-negative and equals zero exactly for a Gaussian distribution.<br />
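For unit-variance distributions the differential entropies have closed forms, which makes the definition easy to verify numerically (a sketch assuming NumPy; the three distributions are chosen only for illustration):<br />

```python
import numpy as np

# Closed-form differential entropies of unit-variance distributions (in nats).
h_gauss   = 0.5 * np.log(2.0 * np.pi * np.e)  # N(0, 1): 0.5 ln(2 pi e)
h_uniform = np.log(np.sqrt(12.0))             # uniform of width sqrt(12) -> variance 1
h_laplace = 1.0 + np.log(np.sqrt(2.0))        # Laplace with scale 1/sqrt(2) -> variance 1

# Negentropy J(y) = H(Gaussian) - H(y): non-negative, zero only for the Gaussian.
j_gauss   = h_gauss - h_gauss    # 0.0
j_laplace = h_gauss - h_laplace  # ~ 0.0724
j_uniform = h_gauss - h_uniform  # ~ 0.1765
print(j_gauss, j_laplace, j_uniform)
```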
<br />
====Empirical estimation of negentropy====<br />
In practice, negentropy has to be estimated from a finite sample. There are two main ways to do this. The first approach is to Taylor expand negentropy and keep the lower-order terms, which yields an estimate expressed in higher moments (3rd degree and higher) of the pdf. As the estimate involves higher moments, it suffers from the same non-robustness problem as kurtosis. The second, and more robust, approach finds the distribution with the maximum entropy that is compatible with the observed sample, and estimates the negentropy of the real (and unknown) distribution by the negentropy of this "entropy-maximizing" distribution. While the second approach is more robust, it is also more computationally involved.<br />
<br />
==A brief history of ICA==<br />
The technique of ICA was first introduced in 1982 in a simplified model of motion coding in muscle contraction, where the original signals were the angular position and velocity of a moving joint and the observed signals were the measurements from two types of sensors measuring muscle contraction. Throughout the 1980s, ICA was mostly known among French researchers but not among the international research community. Many ICA algorithms were developed from the early 1990s onward, though ICA remained a small and narrow research area until the mid-1990s. The breakthrough happened in the mid-to-late 1990s, when a number of very fast ICA algorithms, of which FastICA was one, were developed so that ICA could be applied to large-scale problems. Since 2000, many international workshops and papers have been devoted to ICA research, and ICA has now become an established and mature field of research.<br />
<br />
== Kernel ICA <ref> Bach and Jordan,(2002); Kernel Independent Component Analysis. Journal of Machine Learning Research, 3; 1-48</ref>==<br />
<br />
Bach and Jordan (2002) extended ICA by measuring dependence over a whole reproducing kernel Hilbert space (RKHS) of candidate functions, rather than through a single nonlinear function as in the earliest works. To do so, they used canonical correlation - the correlation between the feature maps of the multivariate random variables induced by the kernel associated with the RKHS - rather than the direct correlation of the considered random variables.<br />
<br />
==Applications==<br />
ICA has been applied to the linear source separation problem in signal processing, but it is also an important research topic in many areas such as biomedical engineering, medical imaging, speech enhancement, remote sensing, communication systems, exploration seismology, geophysics, econometrics, data mining, etc.<br />
<br />
===Finding hidden factors in financial data===<br />
Suppose we have the cashflow of several stores belonging to the same retail chain. The goal is to find the fundamental factors that are common to all stores and affect the cashflow. In this case, factors like seasonal variation and price changes of various commodities affect all stores '''independently'''. This is a work from Kiviluoto and Oja (1998)<ref>Kiviluoto, K. & Oja, E. (1998). Independent component analysis for parallel financial time series. Proceedings of the international conference on neural information processing (ICONIP'98), Vol. 2 (pp. 895-898). Tokyo, Japan.</ref> applying ICA to the cashflow problem. However, I (as a general contributor) still think that factor independence is a strong, and potentially harmful, assumption in this particular case. Imagine a case where price changes of various commodities are mixed with seasonal variations.<br />
<br />
==References==<br />
<references/></div>Myakhavehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=independent_Component_Analysis:_algorithms_and_applications&diff=2851independent Component Analysis: algorithms and applications2009-07-10T23:44:07Z<p>Myakhave: /* Applications */</p>
<hr />
<div>==Motivation==<br />
Imagine a room where two people are speaking at the same time and two microphones are used to record the speech signals. Denoting the speech signals by <math>s_1(t) \,</math> and <math>s_2(t)\,</math> and the recorded signals by <math> x_1(t) \,</math> and <math>x_2(t) \,</math>, we can assume the linear relation <math>x = As \,</math>, where <math>A \,</math> is a parameter matrix that depends on the distances of the microphones from the speakers. The interesting problem of estimating both <math>A\,</math> and <math>s\,</math> using only the recorded signals <math>x\,</math> is called the ''cocktail-party problem'', which is the signature problem for '''ICA'''.<br />
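The mixing model above can be sketched in a few lines (assuming NumPy; the toy signals and mixing weights below are hypothetical, chosen only for illustration):<br />

```python
import numpy as np

t = np.linspace(0.0, 1.0, 2_000)
s1 = np.sin(2 * np.pi * 5 * t)           # toy signal for speaker 1
s2 = np.sign(np.sin(2 * np.pi * 3 * t))  # toy signal for speaker 2
S = np.vstack([s1, s2])

# Hypothetical mixing weights: each row is one microphone's distance-dependent mix.
A = np.array([[0.8, 0.3],
              [0.4, 0.7]])
X = A @ S  # the two recorded signals, x = A s

# Each microphone hears a weighted mix of both speakers, so even though the
# sources are (nearly) uncorrelated, the recordings are strongly correlated.
print(np.corrcoef(X)[0, 1])
```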
<br />
==Introduction==<br />
'''ICA''' shows, perhaps surprisingly, that the ''cocktail-party problem'' can be solved by imposing two rather weak (and often realistic) assumptions, namely that the source signals are statistically independent and have non-Gaussian distributions. Note that PCA and classical factor analysis cannot solve the ''cocktail-party problem'' because such methods seek components that are merely uncorrelated, a condition much weaker than independence. The independence assumption gives us the advantage that signals obtained from non-linear transformations of the source signals are uncorrelated, which does not hold when the source signals are merely uncorrelated. These two assumptions also give us an objective in finding the matrix <math>\ A</math>: we want to find components which are as statistically independent and non-Gaussian as possible.<br />
<br />
'''ICA''' has a lot of applications in science and engineering. For example, it can be used to find the original components of brain activity by analyzing electrical recordings of brain activity given by electroencephalogram (EEG). Another important application is to efficient representations of multimedia data for compression or denoising.<br />
<br />
'''Relationship with Dimension Reduction'''<ref>A. Hyvärinen, J. Karhunen, E. Oja (2001): Independent Component Analysis, New York: Wiley, ISBN 978-0-471-40540-5 Introductory chapter</ref><br />
<br>Suppose we have <math>n</math> observed signals <math>\ x_i</math>, where <math>\ i=1,...,n</math>, obtained by mixing <math>\ m</math> source signals <math>\ y_i</math>, where <math>\ i=1,...,m</math>.<br />
<br>We want to find a transformation matrix <math>\ W</math> such that, for a given number of dimensions <math>\ d</math>,<br />
<br><math>\ y'=Wx</math>, where <math>\ y'</math> is a <math>\ d \times 1</math> vector.<br />
<br>The transformed variables <math>\ y'_i</math> are considered the components explaining the essential structure of the observed data. These components should contain as much information as possible about the observed data.<br />
<br />
'''Concerns'''<br />
<br>The ''cocktail-party problem'' or ''blind source separation problem'' means that we don't have information about the source signals. In the ICA setting, it seems that the number of observed signals and the number of source signals are equal. However, in general, the number of sensors could be less than the number of sources. In an extreme case, we can have only one sensor but several sources. For example, we can have one microphone recording two speeches. Given a mixed signal, could we separate it? This is one of the applications of the paper by Francis R. Bach and Michael I. Jordan [[Learning Spectral Clustering, With Application To Speech Separation ]]. One concern about ICA is whether, in such a case where the matrix <math>\ A</math> is not square, it can still demix the signals. Another is whether observed signals that are quite different from each other will cause difficulty in applying ICA.<br />
<br />
<br />
===Definition of ICA===<br />
The '''ICA''' model assumes a linear mixing model <math> x = As \,</math>, where <math>x \,</math> is a random vector of observed signals, <math>A \,</math> is a square matrix of constant parameters, and <math>s \,</math> is a random vector of statistically independent source signals. Each component of <math>s</math> is a source signal. Note that the restriction that <math> A \,</math> be a square matrix is not theoretically necessary and is imposed only to simplify the presentation. Also keep in mind that the mixing model does not assume any particular distributions for the independent components.<br />
<br />
===Ambiguities of ICA===<br />
Because both <math>A \,</math> and <math>s \,</math> are unknown, it is easy to see that the variances, the sign or the order of the independent components cannot be determined. Fortunately such ambiguities are often insignificant in practice and '''ICA''' can as well just fix the sign and assume unit variance of the components.<br />
<br />
===Why Gaussian variables are forbidden===<br />
In this section we show that '''ICA''' cannot resolve independent components which have Gaussian distributions.<br />
<br />
To see this, assume that the two source signals <math>s_1 \,</math> and <math>s_2 \,</math> are Gaussian and the mixing matrix <math>A\,</math> is orthogonal. Then the observed signals <math>x_1 \,</math> and <math>x_2 \,</math> will have the joint density <math>p(x_1,x_2)=\frac{1}{2 \pi}\exp(-\frac{x_1^2+x_2^2}{2})</math>, which is rotationally symmetric. In other words, the joint density is the same for '''any''' orthogonal mixing matrix. This means that in the case of Gaussian variables, '''ICA''' can only determine the mixing matrix up to an orthogonal transformation.<br />
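This ambiguity is easy to observe numerically: mixing two independent Gaussian sources with an arbitrary rotation leaves the joint distribution, and hence the sample covariance, essentially unchanged (a sketch assuming NumPy; the rotation angle is arbitrary):<br />

```python
import numpy as np

rng = np.random.default_rng(3)
s = rng.standard_normal((2, 50_000))  # two independent unit-variance Gaussian sources

# An arbitrary orthogonal mixing matrix (here, a rotation by phi).
phi = 0.7
A = np.array([[np.cos(phi), -np.sin(phi)],
              [np.sin(phi),  np.cos(phi)]])
x = A @ s

# The mixed data still has (near-)identity covariance, A I A^T = I, so no
# second-order statistic of x can distinguish A from any other rotation.
print(np.cov(x))
```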
<br />
The fact that '''ICA''' cannot be used on Gaussian variables is a primary reason for ICA's late emergence in the research literature, because classical factor analysis assumes Gaussian random variables.<br /><br />
In the real world, we may face a distribution close to the Gaussian distribution such as Student t distribution. The question is what will happen to the ICA in these situations? If it cannot resolve these problems, isn't it too restrictive?<br />
<br />
==References==<br />
<references/></div>Myakhavehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=independent_Component_Analysis:_algorithms_and_applications&diff=2850independent Component Analysis: algorithms and applications2009-07-10T23:27:01Z<p>Myakhave: /* Why Gaussian variables are forbidden */</p>
<hr />
<div>==Motivation==<br />
Imagine a room where two people are speaking at the same time and two microphones are used to record the speech signals. Denoting the speech signals by <math>s_1(t) \,</math> and <math>s_2(t)\,</math> and the recorded signals by <math> x_1(t) \,</math> and <math>x_2(t) \,</math>, we can assume the linear relation <math>x = As \,</math>, where <math>A \,</math> is a parameter matrix that depends on the distances of the microphones from the speakers. The interesting problem of estimating both <math>A\,</math> and <math>s\,</math> using only the recorded signals <math>x\,</math> is called the ''cocktail-party problem'', which is the signature problem for '''ICA'''.<br />
<br />
==Introduction==<br />
'''ICA''' shows, perhaps surprisingly, that the ''cocktail-party problem'' can be solved by imposing two rather weak (and often realistic) assumptions, namely that the source signals are statistically independent and have non-Gaussian distributions. Note that PCA and classical factor analysis cannot solve the ''cocktail-party problem'' because such methods seek components that are merely uncorrelated, a condition much weaker than independence. The independent assumption gives us an advantage that singals obtained form non-linear transformation of the source signals are uncorrelated While it is not true when source signals are merely uncorrelated. These two assumptions also give us an objective in finding matrix <math>\ A</math>, that is, we want to find components which are as statistically independent and non-Gaussian as possible.<br />
<br />
'''ICA''' has a lot of applications in science and engineering. For example, it can be used to find the original components of brain activity by analyzing electrical recordings of brain activity given by electroencephalogram (EEG). Another important application is to efficient representations of multimedia data for compression or denoising.<br />
<br />
'''Relationship with Dimension Reduction'''<ref>A. Hyvärinen, J. Karhunen, E. Oja (2001): Independent Component Analysis, New York: Wiley, ISBN 978-0-471-40540-5 Introductory chapter</ref><br />
<br>Suppose we have <math>n</math> oberserved signals <math>\ x_i</math> where <math>\ i=1,...,n</math> from mixing <math>\ m</math> source signals <math>\ y_i</math>, where<math>\ i=1,...,m</math>,<br />
<br>we want to find such a transformation matrix <math>\ W</math>, that for a given number of dimensions <math>\ d</math><br />
<br><math>\ y'=Wx</math>, where <math>\ y'</math> is a <math>\ d \times 1</math> vector.<br />
<br>the transformed variable <math>\ y'_i</math> is considered the component explaning the essential structure of the observed data. These components should contain as much as possible information of the observed data.<br />
<br />
'''Concerns'''<br />
<br>The ''cocktail-party problem'' or ''blind source separation problem'' means that we don't have information about the source signal. In the ICA setting, it seems that the number of observed signals and the number of source signals are equal. However, in general, the number of sensors could be less than the number of sources. In an extreme case, we can have only one sensor but several sources. For example, we can have one microphone recording two speeches. Given a mixed signal, could we separate it? This is one of the applicaitons of the paper by Francis R. Bach and Michael I. Jordan [[Learning Spectral Clustering, With Application To Speech Separation ]]. One of the concern of ICA is that if this is the case, where the matrix <math>\ A</math> is not square, can it demixs the siganls? The other is that if the observed signals are quite different from each other, will it cause difficulty in applying ICA?<br />
<br />
<br />
===Definition of ICA===<br />
The '''ICA''' model assumes a linear mixing model <math> x = As \,</math>, where <math>x \,</math> is a random vector of observed signals, <math>A \,</math> is a square matrix of constant parameters, and <math>s \,</math> is a random vector of statistically independent source signals. Each component of <math>s</math> is a source signal. Note that the restriction of <math> A \,</math> being square matrix is not theoretically necessary and is imposed only to simplify the presentation. Also keep in mind that in the mixing model we do not assume any distributions for the independent components.<br />
<br />
===Ambiguities of ICA===<br />
Because both <math>A \,</math> and <math>s \,</math> are unknown, it is easy to see that the variances, the sign or the order of the independent components cannot be determined. Fortunately such ambiguities are often insignificant in practice and '''ICA''' can as well just fix the sign and assume unit variance of the components.<br />
<br />
===Why Gaussian variables are forbidden===<br />
In this section we show that '''ICA''' cannot resolve independent components which have Gaussian distributions.<br />
<br />
To see this, assume that the two source signals <math>s_1 \,</math> and <math>s_2 \,</math> are Gaussian and the mixing matrix <math>A\,</math> is orthogonal. Then the observed signals <math>x_1 \,</math> and <math>x_2 \,</math> will have joint density given by <math>p(x_1,x_2)=\frac{1}{2 \pi}\exp(-\frac{x_1^2+x_2^2}{2})</math>, which is rotationally symmetric. In other words, the joint density is be the same for '''any''' orthogonal mixing matrix. This means that in the case of Gaussian variables, '''ICA''' can only determine the mixing matrix up to an orthogonal transformation.<br />
<br />
The fact that '''ICA''' cannot be used on Gaussian variables is a primary reason of ICA's late emergence in the research literature because classical factor analysis assumes Gaussian random variables.<br /><br />
In the real world, we may face a distribution close to the Gaussian distribution such as Student t distribution. The question is what will happen to the ICA in these situations? If it cannot resolve these problems, isn't it too restrictive?<br />
<br />
===Independence versus uncorrelatedness===<br />
Two random variables <math>\,y_1, y_2</math> are independent if information on <math>\,y_1</math> doesn't give any information on <math>\,y_2</math>, and vice versa. In math words, <math>\,y_1, y_2</math> are independent if the joint probability density function can be written as the multiplication of each probability denity function:<br /><br />
<math>\,p(y_1, y_2) = p_1(y_1)*p_2(y_2)</math><br /><br />
<br />
Two random variables <math>\,y_1, y_2</math> are uncorrelated if the covariance is zero:<br /><br />
<math>\,E(y_1y_2) - E(y_1)E(y_2)=0</math><br /><br />
<br />
Independence is a much stronger requirement than uncorrelatedness. Of particular interest to ICA theory is the following two results which show that with additional assumptions, uncorrelatedness is equivalent to independence.<br />
<br />
'''Result 1:''' Two random variables <math>X \,</math> and <math>Y \,</math> are independent if and only if any bounded continuous functions of <math>X \,</math> and <math>Y \,</math> are uncorrelated.<br />
<br />
'''Result 2:''' Two Gaussian random variables <math>X \,</math> and <math>Y \,</math> are independent if and only if they are uncorrelated.<br />
<br />
===Data Whitening===<br />
Data whitening is a transformation to change the covariance matrix of a set of samples into the identity matrix. In other words, it decorrelates the random variables of the samples. These random variables have the same variance as the originals.<br />
<br />
===ICA Estimation Principles===<br />
<br />
====Principle 1: Nonlinear decorrelation====<br />
<br />
From the above discussion, we see that we can estimate the mixing matrix <math>A \,</math> by finding a matrix <math>W \,</math> such that for any <math> i \neq j \,</math>, and suitable nonlinear functions <math>g \,</math> and <math>h \,</math>, <math>g(y_i) \,</math> and <math>h(y_j) \,</math> are uncorrelated.<br />
<br />
====Principle 2: Maximizing Non-gaussanity====<br />
Loosely speaking, the Central Limit Theorem says that the sum of identically distributed non-gaussian random variables are closer to gaussian than the original ones. Because of this, any mixing of the identically distributed non-gaussian independent components would be more gaussian than the original signals <math> s \,</math>. Using this observation, we can find the original signals from the observed signals <math>x \,</math> as follows: find the weighting vectors <math>w \,</math> such that the <math>w^T x \,</math> are the most non-gaussian.<br />
<br />
==Measures of non-Gaussianity==<br />
<br />
===kurtosis===<br />
Kurtosis is the classical measure of non-Gaussianity which is defined by<br />
<math>kurt(y) = E\{y^4\} - 3(E\{y^2\})^2. \,</math>.<br />
Positive kurtosis typically implies a spiky pdf near zero and heavy tails at the two ends. (e.g. Laplace distribution);<br />
Negative kurtosis typically implies a flat pdf which is rather constant near zero, and very small at the two ends. (e.g. uniform distribution with finite support)<br />
<br />
As a computational measure for non-gaussanity, kurtosis, on one hand, has the merit that it is easy to compute and has nice linearity properties. On the other hand, it is non-robust because kurtosis for a large sample size can be significantly affected by a few outliers in the sample.<br />
<br />
===negentropy===<br />
====Intuitive explanation====<br />
Before understanding negentropy, we have to first understand entropy, which is a key concept in information theory. Loosely speaking, entropy is a measure of how "distributed" a random variable is, and a rule of thumb is that a "more distributed" pdf has a higher entropy. An important theorem in information theory states that the Gaussian distribution has the largest entropy among all distributions with the same variance. In informal language, this means the Gaussian distribution is the most "distributed" pdf. Negentropy measures non-gaussianity by the differences in entropy of a pdf with the corresponding Gaussian distribution - this would be make precise in the following technical explanation.<br />
<br />
====Technical explanation====<br />
The entropy of a discrete random variable <math>X \,</math> with possible values <math>\{x_1, x_2, ..., x_n\} \,</math> is defined as <math>H(X) = -\sum_{i=1}^n {p(x_i) \log p(x_i)}</math><br />
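As a concrete instance of this formula (a tiny stdlib sketch; the probability vectors are illustrative), the entropy of a coin flip is maximized by the fair coin, and a deterministic outcome has zero entropy:

```python
import math

def entropy(p):
    """Shannon entropy H(X) = -sum_i p_i log p_i (natural log, in nats).
    Terms with p_i = 0 contribute nothing, by the convention 0 log 0 = 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

fair = entropy([0.5, 0.5])       # log 2 nats: the maximum for two outcomes
biased = entropy([0.9, 0.1])     # lower: the outcome is more predictable
deterministic = entropy([1.0])   # zero: no uncertainty at all
```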
<br />
The (differential) entropy of a continuous random variable <math>X \,</math> with probability density function <math>f \,</math> is similarly defined as <math>H[X] = -\int\limits_{-\infty}^{\infty} f(x) \log f(x)\, dx</math><br />
<br />
The definition of differential entropy extends naturally to random vectors in higher dimensions.<br />
<br />
For a random vector <math>y\,</math> with covariance matrix <math>C \,</math>, its negentropy is defined as <math> J(y) = H(Gaussian_C) - H(y) \,</math>, where <math>Gaussian_C \,</math> denotes the Gaussian distribution with covariance matrix <math>C \,</math>. Note that negentropy is always non-negative and equals zero if and only if <math>y\,</math> is Gaussian.<br />
<br />
====Empirical estimation of negentropy====<br />
In practice, negentropy has to be estimated from a finite sample, and there are two main ways to do this. The first approach is to Taylor-expand negentropy and keep the lower-order terms. This yields an estimate expressed in terms of higher moments (third order and above) of the pdf; since it involves higher moments, it suffers from the same non-robustness problem as kurtosis. The second, more robust, approach finds the maximum-entropy distribution compatible with the observed sample and estimates the negentropy of the true (unknown) distribution by the negentropy of this entropy-maximizing distribution. While the second approach is more robust, it is also more computationally involved.<br />
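A sketch of an estimator in the second, robust family is the one-nonlinearity approximation popularized by FastICA, <math>J(y) \approx (E\{G(y)\} - E\{G(\nu)\})^2 \,</math> with <math>G(u) = \log \cosh u \,</math> and <math>\nu \,</math> a standard normal reference. The function name and the Monte Carlo estimate of the reference term are choices of this sketch, not part of any particular library.

```python
import numpy as np

rng = np.random.default_rng(2)

def negentropy_logcosh(y, n_ref=1_000_000):
    """Approximate J(y) ~ (E[G(y)] - E[G(nu)])^2 with G(u) = log cosh u,
    nu standard normal. The sample is standardized first, matching the
    zero-mean, unit-variance convention; E[G(nu)] is estimated by Monte Carlo."""
    y = (np.asarray(y, dtype=float) - np.mean(y)) / np.std(y)
    ref = np.mean(np.log(np.cosh(rng.standard_normal(n_ref))))  # E[G(nu)]
    return (np.mean(np.log(np.cosh(y))) - ref) ** 2

j_gauss = negentropy_logcosh(rng.standard_normal(100_000))   # near zero
j_laplace = negentropy_logcosh(rng.laplace(size=100_000))    # clearly positive
```

As expected from the definition, the estimate is essentially zero for a Gaussian sample and strictly larger for a non-Gaussian (Laplace) sample, while being far less sensitive to outliers than the kurtosis-based route.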
<br />
==A brief history of ICA==<br />
The technique of ICA was first introduced in 1982 in a simplified model of motion coding in muscle contraction, where the original signals were the angular position and velocity of a moving joint and the observed signals were the measurements from two types of sensors measuring muscle contraction. Throughout the 1980s, ICA was known mostly among French researchers rather than the international research community. Many ICA algorithms were developed from the early 1990s onward, though ICA remained a small and narrow research area until the mid-1990s. The breakthrough came in the mid-to-late 1990s, when a number of very fast ICA algorithms, FastICA among them, were developed, making it possible to apply ICA to large-scale problems. Since 2000, many international workshops and papers have been devoted to ICA research, and ICA has become an established and mature field.<br />
<br />
== Kernel ICA <ref> Bach and Jordan,(2002); Kernel Independent Component Analysis. Journal of Machine Learning Research, 3; 1-48</ref>==<br />
<br />
Bach and Jordan (2002) extended ICA from the single nonlinear function considered in the earliest works to entire function spaces, namely reproducing kernel Hilbert spaces (RKHS). To do so, they used canonical correlation, i.e. the correlation between the feature maps of the multivariate random variables under the kernel associated with the RKHS, rather than the direct correlation of the random variables themselves.<br />
<br />
==Applications==<br />
ICA has been applied to the linear source separation problem in signal processing, but it is also an important research topic in many other areas, such as biomedical engineering, medical imaging, speech enhancement, remote sensing, communication systems, exploration seismology, geophysics, econometrics, and data mining.<br />
<br />
==References==<br />
<references/></div>Myakhavehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=independent_Component_Analysis:_algorithms_and_applications&diff=2847independent Component Analysis: algorithms and applications2009-07-10T23:21:45Z<p>Myakhave: /* Independence versus uncorrelatedness */</p>
<hr />
<div>==Motivation==<br />
Imagine a room where two people are speaking at the same time and two microphones are used to record the speech signals. Denoting the speech signals by <math>s_1(t) \,</math> and <math>s_2(t)\,</math> and the recorded signals by <math> x_1(t) \,</math> and <math>x_2(t) \,</math>, we can assume the linear relation <math>x = As \,</math>, where <math>A \,</math> is a parameter matrix that depends on the distances of the microphones from the speakers. The interesting problem of estimating both <math>A\,</math> and <math>s\,</math> using only the recorded signals <math>x\,</math> is called the ''cocktail-party problem'', which is the signature problem for '''ICA'''.<br />
<br />
==Introduction==<br />
'''ICA''' shows, perhaps surprisingly, that the ''cocktail-party problem'' can be solved by imposing two rather weak (and often realistic) assumptions, namely that the source signals are statistically independent and have non-Gaussian distributions. Note that PCA and classical factor analysis cannot solve the ''cocktail-party problem'' because such methods seek components that are merely uncorrelated, a condition much weaker than independence. The independence assumption gives us an advantage: signals obtained from non-linear transformations of independent source signals remain uncorrelated, which is not true when the source signals are merely uncorrelated. These two assumptions also give us an objective for finding the matrix <math>\ A</math>: we want to find components that are as statistically independent and non-Gaussian as possible.<br />
<br />
'''ICA''' has a lot of applications in science and engineering. For example, it can be used to find the original components of brain activity by analyzing electrical recordings of brain activity given by electroencephalogram (EEG). Another important application is to efficient representations of multimedia data for compression or denoising.<br />
<br />
'''Relationship with Dimension Reduction'''<ref>A. Hyvärinen, J. Karhunen, E. Oja (2001): Independent Component Analysis, New York: Wiley, ISBN 978-0-471-40540-5 Introductory chapter</ref><br />
<br>Suppose we have <math>n</math> observed signals <math>\ x_i</math>, <math>\ i=1,...,n</math>, obtained by mixing <math>\ m</math> source signals <math>\ y_i</math>, <math>\ i=1,...,m</math>.<br />
<br>We want to find a transformation matrix <math>\ W</math> such that, for a given number of dimensions <math>\ d</math>,<br />
<br><math>\ y'=Wx</math>, where <math>\ y'</math> is a <math>\ d \times 1</math> vector.<br />
<br>The transformed variables <math>\ y'_i</math> are the components explaining the essential structure of the observed data; these components should retain as much information about the observed data as possible.<br />
<br />
'''Concerns'''<br />
<br>The ''cocktail-party problem'' or ''blind source separation problem'' means that we have no information about the source signals. In the ICA setting above, the number of observed signals and the number of source signals are equal. In general, however, the number of sensors could be less than the number of sources. In an extreme case, we may have only one sensor but several sources; for example, one microphone recording two speeches. Given such a mixed signal, could we separate it? This is one of the applications of the paper by Francis R. Bach and Michael I. Jordan [[Learning Spectral Clustering, With Application To Speech Separation ]]. One concern about ICA is whether, in this case where the matrix <math>\ A</math> is not square, it can demix the signals. Another is whether observed signals that are quite different from each other cause difficulty in applying ICA.<br />
<br />
<br />
===Definition of ICA===<br />
The '''ICA''' model assumes a linear mixing model <math> x = As \,</math>, where <math>x \,</math> is a random vector of observed signals, <math>A \,</math> is a square matrix of constant parameters, and <math>s \,</math> is a random vector of statistically independent source signals. Each component of <math>s</math> is a source signal. Note that the restriction of <math> A \,</math> being a square matrix is not theoretically necessary and is imposed only to simplify the presentation. Also keep in mind that the mixing model does not assume any distributions for the independent components.<br />
<br />
===Ambiguities of ICA===<br />
Because both <math>A \,</math> and <math>s \,</math> are unknown, it is easy to see that the variances, the signs, and the order of the independent components cannot be determined. Fortunately, such ambiguities are often insignificant in practice, and '''ICA''' may simply fix the sign and assume unit variance for each component.<br />
<br />
===Why Gaussian variables are forbidden===<br />
In this section we show that '''ICA''' cannot resolve independent components which have Gaussian distributions.<br />
<br />
To see this, assume that the two source signals <math>s_1 \,</math> and <math>s_2 \,</math> are Gaussian and the mixing matrix <math>A\,</math> is orthogonal. Then the observed signals <math>x_1 \,</math> and <math>x_2 \,</math> will have joint density given by <math>p(x_1,x_2)=\frac{1}{2 \pi}\exp(-\frac{x_1^2+x_2^2}{2})</math>, which is rotationally symmetric. In other words, the joint density is the same for '''any''' orthogonal mixing matrix. This means that in the case of Gaussian variables, '''ICA''' can only determine the mixing matrix up to an orthogonal transformation.<br />
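This ambiguity is easy to verify numerically. Below is a minimal sketch (assuming NumPy; the rotation angle is an arbitrary choice) showing that an orthogonal mixture of standard Gaussian sources has the same identity covariance as the sources themselves, so no statistic of the observations can identify the mixing matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent standard Gaussian sources, 200k samples each.
s = rng.standard_normal((2, 200_000))

# An arbitrary orthogonal mixing matrix (rotation by 40 degrees).
theta = np.deg2rad(40)
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

x = A @ s  # observed signals

# The model covariance of x is A A^T = I, identical for every rotation,
# so Gaussian sources leave A unidentifiable up to an orthogonal factor.
print(np.allclose(A @ A.T, np.eye(2)))                # True
print(np.allclose(np.cov(x), np.eye(2), atol=0.02))   # True (empirically)
```

Any other rotation angle gives statistically indistinguishable observations, which is exactly the ambiguity described above.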
<br />
The fact that '''ICA''' cannot be used on Gaussian variables is a primary reason of ICA's late emergence in the research literature because classical factor analysis assumes Gaussian random variables.<br /><br />
In the real world, we may encounter distributions close to the Gaussian, such as a Student's t distribution with many degrees of freedom. What happens to ICA in these situations? If it cannot handle such cases, is it too restrictive?<br />
<br />
===Independence versus uncorrelatedness===<br />
Two random variables <math>\,y_1, y_2</math> are independent if information on <math>\,y_1</math> does not give any information on <math>\,y_2</math>, and vice versa. In mathematical terms, <math>\,y_1, y_2</math> are independent if their joint probability density function factorizes into the product of the marginal density functions:<br /><br />
<math>\,p(y_1, y_2) = p_1(y_1)\,p_2(y_2)</math><br /><br />
<br />
Independence is a much stronger requirement than uncorrelatedness. Of particular interest to ICA theory are the following two results, which show that with additional assumptions, uncorrelatedness is equivalent to independence.<br />
<br />
'''Result 1:''' Two random variables <math>X \,</math> and <math>Y \,</math> are independent if and only if <math>g(X) \,</math> and <math>h(Y) \,</math> are uncorrelated for all bounded continuous functions <math>g \,</math> and <math>h \,</math>.<br />
<br />
'''Result 2:''' Two Gaussian random variables <math>X \,</math> and <math>Y \,</math> are independent if and only if they are uncorrelated.<br />
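A standard counterexample (not from the paper) makes the gap between the two notions concrete: take X uniform on {-1, 0, 1} and Y = X^2. They are uncorrelated, since cov(X, Y) = E[X^3] = 0, yet fully dependent, since Y is a deterministic function of X. A quick exact check:

```python
from fractions import Fraction

# X uniform on {-1, 0, 1}; Y = X^2. Exact arithmetic via Fraction.
support = [-1, 0, 1]
p = Fraction(1, 3)

ex  = sum(p * x for x in support)           # E[X]  = 0
exy = sum(p * x**3 for x in support)        # E[XY] = E[X^3] = 0
ey  = sum(p * x**2 for x in support)        # E[Y]  = 2/3
cov = exy - ex * ey                         # 0 -> uncorrelated

# But the joint does not factorize: P(X=0, Y=0) = 1/3,
# while P(X=0) * P(Y=0) = 1/3 * 1/3 = 1/9 -> dependent.
joint = Fraction(1, 3)
prod = Fraction(1, 3) * Fraction(1, 3)
print(cov == 0, joint == prod)   # True False
```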
<br />
===Data Whitening===<br />
Data whitening is a transformation that changes the covariance matrix of a set of samples into the identity matrix. In other words, it decorrelates the random variables of the samples and rescales each of them to unit variance.<br />
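As a sketch of one common implementation (assuming NumPy; the eigendecomposition route is a standard choice, not prescribed by the text), whitening maps x to z = D^{-1/2} E^T (x - mean), where C = E D E^T is the eigendecomposition of the sample covariance:

```python
import numpy as np

def whiten(X):
    """Whiten data X (variables in rows, d x n): return Z with cov(Z) = I."""
    Xc = X - X.mean(axis=1, keepdims=True)        # centre each variable
    C = np.cov(Xc)                                # d x d sample covariance
    eigvals, E = np.linalg.eigh(C)                # C = E diag(eigvals) E^T
    D_inv_sqrt = np.diag(1.0 / np.sqrt(eigvals))
    return D_inv_sqrt @ E.T @ Xc                  # decorrelated, unit variance

rng = np.random.default_rng(1)
A = np.array([[2.0, 0.5], [0.5, 1.0]])
X = A @ rng.standard_normal((2, 100_000))         # correlated data
Z = whiten(X)
print(np.allclose(np.cov(Z), np.eye(2), atol=1e-6))   # True
```

In ICA pipelines whitening is a typical preprocessing step, because after whitening the remaining unknown part of the mixing matrix is orthogonal.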
<br />
===ICA Estimation Principles===<br />
<br />
====Principle 1: Nonlinear decorrelation====<br />
<br />
From the above discussion, we see that we can estimate the mixing matrix <math>A \,</math> by finding a matrix <math>W \,</math> such that the components <math>y_i \,</math> of <math>y = Wx \,</math> are nonlinearly decorrelated: for any <math> i \neq j \,</math> and suitable nonlinear functions <math>g \,</math> and <math>h \,</math>, <math>g(y_i) \,</math> and <math>h(y_j) \,</math> are uncorrelated.<br />
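As a numerical illustration of why this works (the nonlinearities tanh and cube are arbitrary choices of mine, not prescribed by the text): by Result 1 above, nonlinear functions of truly independent sources are uncorrelated, while nonlinear functions of mixed, dependent signals generally are not:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Two independent non-Gaussian sources.
s1 = rng.uniform(-1, 1, n)
s2 = rng.laplace(0, 1, n)

# Nonlinear functions of independent variables stay uncorrelated...
c_indep = np.corrcoef(np.tanh(s1), s2**3)[0, 1]

# ...but nonlinear functions of mixed (hence dependent) signals do not.
x1, x2 = s1 + 0.8 * s2, s2
c_mixed = np.corrcoef(np.tanh(x1), x2**3)[0, 1]

print(abs(c_indep) < 0.02, abs(c_mixed) > 0.03)   # True True
```

Nonlinear decorrelation algorithms search for a W that drives statistics like `c_mixed` back to zero.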
<br />
====Principle 2: Maximizing Non-gaussanity====<br />
Loosely speaking, the Central Limit Theorem says that a sum of independent, identically distributed non-Gaussian random variables is closer to Gaussian than the original variables. Because of this, any mixture of the non-Gaussian independent components is more Gaussian than the original signals <math> s \,</math>. Using this observation, we can recover the original signals from the observed signals <math>x \,</math> as follows: find the weighting vectors <math>w \,</math> such that the projections <math>w^T x \,</math> are as non-Gaussian as possible.<br />
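Using kurtosis (defined in the next section) as a stand-in for non-Gaussianity, the effect can be demonstrated: for an equal orthogonal mixture of two unit-variance uniform sources, the fourth cumulant is halved, i.e. the mixture is measurably closer to Gaussian. A sketch under these assumptions:

```python
import numpy as np

def kurt(y):
    """Sample kurtosis E[y^4] - 3 (E[y^2])^2 (zero for a Gaussian)."""
    return np.mean(y**4) - 3 * np.mean(y**2) ** 2

rng = np.random.default_rng(3)
n = 400_000
# Unit-variance uniform sources: theoretical kurtosis -6/5.
s1 = rng.uniform(-np.sqrt(3), np.sqrt(3), n)
s2 = rng.uniform(-np.sqrt(3), np.sqrt(3), n)

mix = (s1 + s2) / np.sqrt(2)   # unit-variance equal mixture

# For a = b = 1/sqrt(2), kurt(a*s1 + b*s2) = (a^4 + b^4) * kurt(s1),
# so the mixture's kurtosis magnitude is half the source's.
print(kurt(s1), kurt(mix))
```

Maximizing non-Gaussianity reverses this averaging: the projection w^T x with extremal kurtosis corresponds to a single source.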
<br />
==Measures of non-Gaussianity==<br />
<br />
===kurtosis===<br />
Kurtosis is the classical measure of non-Gaussianity, defined by<br />
<math>kurt(y) = E\{y^4\} - 3(E\{y^2\})^2. \,</math><br />
Positive kurtosis typically implies a spiky pdf near zero with heavy tails (e.g., the Laplace distribution);<br />
negative kurtosis typically implies a flat pdf that is roughly constant near zero and very small at the tails (e.g., the uniform distribution with finite support).<br />
<br />
As a computational measure of non-Gaussianity, kurtosis has the merit of being easy to compute, with nice linearity properties. On the other hand, it is non-robust: even for a large sample, the estimated kurtosis can be significantly affected by a few outliers.<br />
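The non-robustness is easy to exhibit: a single corrupted observation in an otherwise Gaussian sample can inflate the sample kurtosis by orders of magnitude. A minimal sketch (assuming NumPy; the outlier value is arbitrary):

```python
import numpy as np

def kurt(y):
    """Sample kurtosis E[y^4] - 3 (E[y^2])^2."""
    return np.mean(y**4) - 3 * np.mean(y**2) ** 2

rng = np.random.default_rng(4)
y = rng.standard_normal(10_000)   # Gaussian sample: kurtosis near 0

y_bad = y.copy()
y_bad[0] = 50.0                   # one corrupted reading out of 10,000

# The single outlier contributes 50^4 / 10000 = 625 to E[y^4],
# swamping the true value.
print(kurt(y), kurt(y_bad))
```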
<br />
===negentropy===<br />
====Intuitive explanation====<br />
Before understanding negentropy, we have to first understand entropy, which is a key concept in information theory. Loosely speaking, entropy is a measure of how "distributed" a random variable is, and a rule of thumb is that a "more distributed" pdf has a higher entropy. An important theorem in information theory states that the Gaussian distribution has the largest entropy among all distributions with the same variance. In informal language, this means the Gaussian distribution is the most "distributed" pdf. Negentropy measures non-Gaussianity by the difference in entropy between a pdf and the corresponding Gaussian distribution - this will be made precise in the following technical explanation.<br />
<br />
====Technical explanation====<br />
The entropy of a discrete random variable <math>X \,</math> with possible values <math>\{x_1, x_2, ..., x_n\} \,</math> is defined as <math>H(X) = -\sum_{i=1}^n {p(x_i) \log p(x_i)}</math><br />
<br />
The (differential) entropy of a continuous random variable <math>X \,</math> with probability density function <math>f \,</math> is similarly defined as <math>H[X] = -\int\limits_{-\infty}^{\infty} f(x) \log f(x)\, dx</math><br />
<br />
It is obvious how the definition of differential entropy can be extended to higher dimensions.<br />
<br />
For a random vector <math>y\,</math> with covariance matrix <math>C \,</math>, its negentropy is defined as <math> J(y) = H(Gaussian_C) - H(y) \,</math>, where <math>Gaussian_C \,</math> denotes the Gaussian distribution with covariance matrix <math>C \,</math>. Note that Negentropy is always non-negative and equals zero for a Gaussian distribution.<br />
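The definitions above can be checked directly. The plain-Python sketch below verifies that the uniform distribution attains the maximal discrete entropy log n, and evaluates the closed-form Gaussian differential entropy 0.5 ln(2*pi*e*sigma^2), the Gaussian reference value from which H(y) is subtracted in the definition of negentropy:

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# The uniform distribution maximizes discrete entropy: H = log n.
print(entropy([0.25] * 4), math.log(4))   # both 1.3862...

def gaussian_entropy(var):
    """Differential entropy of a Gaussian, H = 0.5 * ln(2*pi*e*var).
    This is the maximal differential entropy for the given variance,
    so negentropy J(y) = gaussian_entropy(var_y) - H(y) is >= 0."""
    return 0.5 * math.log(2 * math.pi * math.e * var)

print(gaussian_entropy(1.0))   # about 1.4189
```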
<br />
====Empirical estimation of negentropy====<br />
In practice, negentropy has to be estimated from a finite sample. There are two main ways to do this. The first approach is to Taylor expand negentropy and take the lower-order terms, which yields an estimate of negentropy expressed in higher moments (third order and above) of the pdf. Because the estimate involves higher moments, it suffers from the same non-robustness problem as kurtosis. The second, and more robust, approach finds the distribution with the maximum entropy that is compatible with the observed sample, and estimates the negentropy of the real (and unknown) distribution by the negentropy of this entropy-maximizing distribution. While the second approach is more robust, it is also more computationally involved.<br />
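One widely used instance of the maximum-entropy approach, due to Hyvärinen and underlying FastICA (not detailed in this text), approximates the negentropy of a zero-mean, unit-variance y, up to a positive constant, by [E{G(y)} - E{G(v)}]^2, where v is standard Gaussian and G is a slowly growing nonquadratic function such as G(u) = log cosh u. A sketch, with the Gaussian reference term estimated from a large sample:

```python
import numpy as np

def negentropy_proxy(y, rng):
    """Hyvarinen-style negentropy approximation (up to a constant k > 0)
    for a zero-mean, unit-variance sample y, with G(u) = log cosh(u)."""
    G = lambda u: np.log(np.cosh(u))
    ref = G(rng.standard_normal(1_000_000)).mean()   # E[G(v)], v ~ N(0,1)
    return (G(y).mean() - ref) ** 2

rng = np.random.default_rng(5)
gauss = rng.standard_normal(200_000)                 # negentropy ~ 0
laplace = rng.laplace(0, 1 / np.sqrt(2), 200_000)    # unit variance, spiky

jp_lap = negentropy_proxy(laplace, rng)
jp_gau = negentropy_proxy(gauss, rng)
print(jp_lap > jp_gau)   # True
```

Because G grows much more slowly than the fourth power, this estimate is far less sensitive to outliers than kurtosis.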
<br />
==A brief history of ICA==<br />
The technique of ICA was first introduced in 1982 in a simplified model of motion coding in muscle contraction, where the original signals were the angular position and velocity of a moving joint and the observed signals were the measurements from two types of sensors measuring muscle contraction. Throughout the 1980s, ICA was mostly known among French researchers but not among the international research community. Many ICA algorithms were developed from the early 1990s onward, though ICA remained a small and narrow research area until the mid-1990s. The breakthrough came in the mid-to-late 1990s, when a number of very fast ICA algorithms, of which FastICA was one, were developed, allowing ICA to be applied to large-scale problems. Since 2000, many international workshops and papers have been devoted to ICA research, and ICA has become an established and mature field.<br />
<br />
== Kernel ICA <ref> Bach and Jordan,(2002); Kernel Independent Component Analysis. Journal of Machine Learning Research, 3; 1-48</ref>==<br />
<br />
Bach and Jordan (2002) extended ICA to contrast functions defined over a reproducing kernel Hilbert space (RKHS), rather than the single nonlinear function considered in earlier work. To do so, they used canonical correlation - the correlation of the feature maps of the multivariate random variables under the kernel associated with the RKHS - rather than the direct correlation of the random variables themselves.<br />
<br />
==Applications==<br />
ICA has been applied to blind source separation problems in signal processing, but it is also an important research topic in many other areas, such as biomedical engineering, medical imaging, speech enhancement, remote sensing, communication systems, exploration seismology, geophysics, econometrics, and data mining.<br />
<br />
==References==<br />
<references/></div>Myakhavehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=visualizing_Similarity_Data_with_a_Mixture_of_Maps&diff=2812visualizing Similarity Data with a Mixture of Maps2009-07-10T15:44:49Z<p>Myakhave: /* Stochastic Neighbour Embedding */</p>
<hr />
<div>== Introduction ==<br />
<br />
The main idea of this paper is to show how several different two-dimensional maps can be used to visualize a set of pairwise similarities. Aspect maps resemble both clustering (in modeling pairwise similarities as a mixture of different types of similarity) and multi-dimensional scaling (in modeling each type of similarity by a two-dimensional map). While methods such as PCA and metric multi-dimensional scaling (MDS) are simple and fast, their main drawback is that they minimize a cost function focused mainly on modeling large dissimilarities rather than small ones. As a result, they do not provide good visualizations of data that lies on a curved low-dimensional manifold in a high-dimensional space. Conversely, methods such as local MDS, LLE, Maximum Variance Unfolding, and Stochastic Neighbour Embedding (SNE) model local distances accurately in the two-dimensional visualization, but model larger distances inaccurately.<br />
<br />
SNE improves on methods such as LLE in two ways: despite the difficulty of optimizing the SNE objective function, it leads to much better solutions, and since SNE is based on a probabilistic model, it is much more efficient at producing good visualizations. In the next section, we explain how SNE works.<br />
<br />
== Stochastic Neighbour Embedding ==<br />
<br />
The core of the SNE method <ref> G. Hinton and S. Roweis. Stochastic neighbor embedding. Advances in Neural Information Processing Systems, 15:833–840, 2003 </ref><br />
lies in converting high-dimensional distance or similarity data into a set of conditional probabilities <math> \mathbf{ p_{j|i} }</math>, each of which represents the probability that object <math> i </math> would pick object <math> j </math> as its neighbour if it were allowed to pick only one neighbour. For objects in a high-dimensional Euclidean space, where the data points consist of the coordinates of the objects, we can find <math> \mathbf{ p_{j|i} } </math> for each object <math> i </math> by using a spherical Gaussian distribution centered at the high-dimensional position of <math> i </math>, <math> \mathbf{ X_{i}} </math>. We set <math> \mathbf{ p_{i|i} = 0 }</math> and, for <math> \mathbf{ j \neq i } </math>,<br />
<br />
<center> <math> \mathbf p_{j|i} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma_i ^2 )}{\sum_{k \neq i} \exp(-||x_i-x_k ||^2/ 2\sigma_i ^2 ) }</math> </center><br />
<br />
Intuitively, if object <math>\, i </math> is allowed to pick only one neighbour, it should pick the object <math>\, j </math> at the smallest relative distance; the greater <math>\, j </math>'s distance from <math>\, i </math>, the smaller its probability of being chosen. With this intuition, it makes sense to define <math> \mathbf p_{j|i}</math> so that the numerator decays with <math>\, j </math>'s distance from <math>\, i </math>, while the denominator normalizes over all candidate neighbours of <math>\, i </math>.<br /><br />
<br />
Note that given a set of pairwise distances between objects, <math> \mathbf{|| x_i - x_j ||} </math>, we can use the above equation to derive the same probabilities. In practice, given a set of <math> N </math> points, we set the variance of the Gaussian, <math> \mathbf{ \sigma_i ^2} </math>, either by hand or by a binary search for the value of <math> \mathbf{ \sigma_i } </math> that makes the entropy of the distribution over neighbours equal to <math> \mathbf{ \log_2 M} </math>. (Recall that the entropy of the distribution <math> \mathbf{ P_i} </math> is <math> \sum_{j} p_{j|i}\log_2 (1/p_{j|i}) </math>, where <math> \mathbf{ p\log(1/p)} </math> is understood to be zero when <math> \mathbf{p=0} </math>.) This is done by choosing a number <math> \mathbf{ M \ll N} </math> and performing the binary search until the entropy of <math> \mathbf{ P_i} </math> is within some predetermined small tolerance of <math> \mathbf{\log_2 M } </math>. <br />
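A sketch of this calibration (assuming NumPy; the function names and tolerances are illustrative, not from the paper): for each object, binary-search sigma_i^2 until the entropy of P_i matches log2 M:

```python
import numpy as np

def cond_probs(D2, sigma2, i):
    """p_{j|i} from row i of squared distances D2, Gaussian variance sigma2."""
    p = np.exp(-D2[i] / (2 * sigma2))
    p[i] = 0.0                        # an object never picks itself
    return p / p.sum()

def entropy_bits(p):
    """Entropy of a discrete distribution, in bits; 0 log(1/0) taken as 0."""
    nz = p[p > 0]
    return -(nz * np.log2(nz)).sum()

def calibrate_sigma2(D2, i, target_bits, tol=1e-4):
    """Binary-search sigma_i^2 so that H(P_i) = target_bits = log2(M).
    Works because H(P_i) increases monotonically with sigma_i^2."""
    lo, hi = 1e-10, 1e10
    for _ in range(200):
        mid = (lo + hi) / 2
        h = entropy_bits(cond_probs(D2, mid, i))
        if abs(h - target_bits) < tol:
            break
        lo, hi = (mid, hi) if h < target_bits else (lo, mid)
    return mid

rng = np.random.default_rng(6)
X = rng.standard_normal((20, 5))                       # 20 points in 5-D
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)    # pairwise ||xi-xj||^2

s2 = calibrate_sigma2(D2, i=0, target_bits=np.log2(5))  # M = 5 "neighbours"
P0 = cond_probs(D2, s2, 0)
print(abs(entropy_bits(P0) - np.log2(5)) < 1e-3)        # True
```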
<br />
Our main goal in SNE is to model <math>\mathbf{p_{j|i}}</math> by using the conditional probabilities <math>\mathbf{q_{j|i}}</math>, which are determined by the locations <math>\mathbf{ y_i} </math> of points in low-dimensional space: <br />
<center> <math> q_{j|i} = \frac{\exp(-||y_i-y_j ||^2)}{\sum_{k \neq i} \exp(-||y_i-y_k ||^2) }</math> </center><br />
The aim of embedding is to match these two distributions as well as possible. To do so, we minimize a cost function which is a sum of Kullback-Leibler divergences between the original <br />
<math> \mathbf{p_{j|i}} </math> and induced <math> \mathbf{ q_{j|i}} </math> distributions over neighbours for each object:<br />
<br />
<center> <math> C = \sum_{i} KL(P_i||Q_i) =\sum_{i}\sum_{j \neq i} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}</math> </center><br />
<br />
The dimensionality of the <math> \mathbf{Y} </math> space is chosen to be much less than the number of objects. Notice that making <math> \mathbf{ q_{j|i}} </math> large when <math> \mathbf{ p_{j|i}} </math> is small wastes some of the probability mass in the <math> \mathbf{Q} </math> distribution, so there is a cost for modeling a big distance in the high-dimensional space with a small one in the map, though it is less than the cost of modeling a small distance with a big one. In this respect SNE improves on methods like LLE: while SNE emphasizes local distances, its cost function cleanly enforces ''both'' keeping the images of nearby objects nearby ''and'' keeping the images of widely separated objects relatively far apart. Although differentiating <math> \mathbf{C} </math> is tedious, because <math> \mathbf{y_k} </math> affects <math> \mathbf{ q_{j|i}} </math> through the normalization term in its definition, the final result is simple and has a nice physical interpretation:<br />
<br />
<center> <math> \frac{\partial C}{\partial y_i} = 2\sum_{j} (y_i-y_j)([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) </math> </center><br />
<br />
Using steepest descent to minimize <math> \mathbf{C} </math>, in which all of the points are adjusted in parallel, is inefficient and can get stuck in poor local minima. To address this problem, we add Gaussian noise to the <math> \mathbf{y} </math> values after each update. We start with a high noise level and reduce it rapidly to find the approximate level at which structure starts to form in the low-dimensional map. Once we observe that a small increase in the noise level leads to a large decrease in the cost function, we can be confident that structure is emerging. By repeating this process, restarting from a noise level just above the one at which structure emerged and reducing it gently, we can find low-dimensional maps that correspond to significantly better minima of <math> \mathbf{C} </math>.<br />
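The gradient above and the noisy descent procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation; the learning rate, step count, and linear noise-decay schedule are our assumptions.

```python
import numpy as np

def cond_q(Y):
    """q_{j|i} from the low-dimensional positions Y (one row per object)."""
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)           # enforce q_{i|i} = 0
    q = np.exp(-d2)
    return q / q.sum(axis=1, keepdims=True)

def sne_grad(P, Y):
    """dC/dy_i = 2 sum_j (y_i - y_j)([p_{j|i} - q_{j|i}] + [p_{i|j} - q_{i|j}])."""
    Q = cond_q(Y)
    M = (P - Q) + (P - Q).T                # both conditional directions
    diff = Y[:, None, :] - Y[None, :, :]   # diff[i, j] = y_i - y_j
    return 2.0 * (M[:, :, None] * diff).sum(axis=1)

def sne(P, n_dim=2, steps=500, lr=0.1, noise0=0.3, seed=0):
    """Steepest descent with Gaussian jitter that decays over the run."""
    rng = np.random.default_rng(seed)
    Y = 1e-3 * rng.normal(size=(P.shape[0], n_dim))
    for t in range(steps):
        Y = Y - lr * sne_grad(P, Y)
        Y = Y + noise0 * (1.0 - t / steps) * rng.normal(size=Y.shape)
    return Y
```

A quick way to trust `sne_grad` is to compare it against finite differences of the KL cost; the two agree to numerical precision.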
<br />
== Symmetric SNE ==<br />
<br />
An alternative to SNE's minimization of divergences between conditional distributions is to define a single joint distribution over all non-identical ordered pairs:<br />
<br />
In this case we define <math> \mathbf{p_{ij}} </math> by<br />
<br />
<center> <math> \mathbf p_{ij} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma^2 )}{\sum_{k < l} \exp(-||x_k-x_l ||^2/ 2\sigma^2 ) }</math> </center><br />
<br />
<math> \mathbf{q_{ij}} </math>'s are defined by<br />
<br />
<center> <math> \mathbf q_{ij} = \frac{\exp(-||y_i-y_j ||^2 )}{\sum_{k < l} \exp(-||y_k-y_l ||^2) }</math> </center><br />
<br />
and finally the symmetric version of our cost function, <math> \mathbf{C_{sym}} </math>, becomes the KL divergence between the two distributions<br />
<br />
<center> <math> C_{sym} = KL(P||Q) =\sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}</math> </center><br />
<br />
The benefit of defining the <math> \mathbf{p_{ij}} </math>'s this way is much simpler derivatives. There is, however, a subtlety: if one of the high-dimensional points, <math> \mathbf{j} </math>, is far from all of the others, every pairwise <math> \mathbf{p_{ij}} </math> involving <math> \mathbf{j} </math> will be very small, so <math> \mathbf{j} </math>'s location in the map is poorly determined. To avoid this, we instead set <math> \mathbf{p_{ij}=0.5(p_{j|i}+p_{i|j})} </math>: even when <math> \mathbf{j} </math> is far from all the other points, each <math> \mathbf{p_{j|i}} </math> will be very small, but the conditional probabilities <math> \mathbf{p_{\cdot|j}} </math> still sum to 1, so <math> \mathbf{j} </math> retains a significant share of the total probability mass.<br />
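In code, the symmetric distributions might look like the sketch below. One detail is our assumption: after symmetrizing, we rescale <math> p_{ij} </math> by the total sum (equivalently, by the number of points), so that the joint distribution sums to 1 over all pairs.

```python
import numpy as np

def joint_p(P_cond):
    """p_{ij} = 0.5 * (p_{j|i} + p_{i|j}), rescaled to sum to 1 over all pairs.
    P_cond[i, j] = p_{j|i}, with zero diagonal and rows summing to 1."""
    P = 0.5 * (P_cond + P_cond.T)
    return P / P.sum()

def joint_q(Y):
    """q_{ij} with one shared normalizer over all pairs of map points."""
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)           # exclude i = j
    q = np.exp(-d2)
    return q / q.sum()

def c_sym(P, Q):
    """C_sym = KL(P || Q), summed over all ordered pairs with p_{ij} > 0."""
    mask = P > 0
    return (P[mask] * np.log(P[mask] / Q[mask])).sum()
```

Because each row of `P_cond` sums to 1, every object <math> j </math> keeps total mass at least <math> 1/(2N) </math> in the symmetrized distribution, which is exactly the outlier safeguard described above.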
<br />
== Aspect Maps ==<br />
<br />
Another approach to defining <math> \mathbf{q_{j|i}} </math> is to allow each object to occur in several different two-dimensional maps, assigning object <math> \mathbf{i} </math> a mixing proportion <math> \mathbf{\pi_{i}^{m}} </math> in the m-th map, with <math> \mathbf{\sum_{m} \pi_{i}^{m}=1} </math>. Using these maps, we define <math> \mathbf{q_{j|i}} </math> as follows:<br />
<br />
<center> <math> q_{j|i} = \frac{\sum_{m} \pi_{i}^{m}\pi_{j}^{m} e^{-d_{i,j}^{m}} }{z_i} </math> </center><br />
<br />
where<br />
<br />
<center> <math> d_{i,j}^{m}=|| y_i^m-y_j^m ||^2, \quad z_i=\sum_{h}\sum_{m} \pi_{i}^{m} \pi_{h}^{m} e^{-d_{i,h}^{m}} </math> </center><br />
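A vectorized sketch of this computation follows. Two details are our assumptions: the self term <math> h = i </math> is excluded from <math> z_i </math>, and each map is two-dimensional.

```python
import numpy as np

def aspect_q(Ys, Pi):
    """q_{j|i} from M two-dimensional maps.
    Ys: array (M, N, 2), positions of the N objects in each map.
    Pi: array (N, M), mixing proportions with rows summing to 1."""
    # d2[m, i, j] = ||y_i^m - y_j^m||^2
    d2 = ((Ys[:, :, None, :] - Ys[:, None, :, :]) ** 2).sum(-1)
    # w[m, i, j] = pi_i^m * pi_j^m * exp(-d2[m, i, j])
    w = np.einsum('im,jm->mij', Pi, Pi) * np.exp(-d2)
    s = w.sum(axis=0)                         # numerator, summed over maps
    np.fill_diagonal(s, 0.0)                  # exclude the self term
    return s / s.sum(axis=1, keepdims=True)   # divide by z_i
```

With a single map and all mixing proportions equal to 1, this reduces to the SNE <math> q_{j|i} </math> defined earlier, which is a handy sanity check.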
<br />
Using a mixture model is very different from simply using a single space with extra dimensions, because points that are far apart in one dimension cannot have a high <math> \mathbf{q_{j|i}} </math> no matter how close together they are in the other dimensions. By contrast, with a mixture model, provided that ''there is'' at least one map in which <math> \mathbf{i} </math> is close to <math> \mathbf{j} </math> ''and'' the versions of <math> \mathbf{i} </math> and <math> \mathbf{j} </math> in that map have high mixing proportions, <math> \mathbf{q_{j|i}} </math> can be quite large even if <math> \mathbf{i} </math> and <math> \mathbf{j} </math> are far apart in all the other maps. <br />
<br />
To optimize the aspect map models, we used Carl Rasmussen's "minimize" function <ref> www.kyb.tuebingen.mpg.de/bs/people/carl/code/minimize/ </ref>. The gradients are given by:<br />
<br />
<center> <math> \frac{\partial C}{\partial \pi_i^m}=-\sum_{k}\sum_{l \neq k} p_{l|k} \frac{\partial}{\partial \pi_i^m} [\log q_{l|k}z_k -\log z_k] </math> </center><br />
<br />
Now by substituting the definition of <math> \mathbf{z_k} </math> and reshuffling the terms we will have:<br />
<br />
<center> <math> \frac{\partial C}{\partial \pi_i^m}=\sum_{j}[\frac{1}{q_{j|i} z_i}(q_{j|i}-p_{j|i})+\frac{1}{q_{i|j} z_j}(q_{i|j}-p_{i|j}) ] \pi_{j}^{m}e^{-d^m_{i,j}} </math> </center><br />
<br />
In practice, we will not use the mixing proportions <math> \mathbf{\pi_i^m} </math> themselves as parameters of the model; instead, we define <math> \mathbf{w_i^m} </math> by: <br />
<br />
<center> <math> \pi_i^m = \frac{e^{-w_i^m}}{\sum_{m'}e^{-w_i^{m'}}} </math> </center><br />
<br />
as a result of that, the gradient becomes:<br />
<br />
<center> <math> \frac{\partial C}{\partial w_i^m} = \pi_i^m \left[ \left(\sum_{m'}\frac{\partial C}{\partial \pi_i^{m'}} \pi_i^{m'}\right)-\frac{\partial C}{\partial \pi_i^m}\right] </math> </center><br />
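This chain rule through the softmax-style reparameterization can be written, and verified numerically, in a few lines. The sketch below is ours (including the stability shift inside the exponential, which leaves the proportions unchanged).

```python
import numpy as np

def pi_from_w(w):
    """pi_m = exp(-w_m) / sum_m' exp(-w_m'), shifted for numerical stability."""
    e = np.exp(-(w - w.min()))
    return e / e.sum()

def grad_w(dC_dpi, pi):
    """dC/dw_m = pi_m * (sum_m' (dC/dpi_m') pi_m' - dC/dpi_m), as derived above."""
    return pi * (np.dot(dC_dpi, pi) - dC_dpi)
```

The bracketed term compares each <math> \partial C/\partial \pi_i^m </math> to its <math> \pi </math>-weighted average, so the update automatically respects the constraint <math> \sum_m \pi_i^m = 1 </math>.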
<br />
== Modeling Human Word Association Data ==<br />
<br />
In order to see how SNE works in practice, the authors used the University of South Florida database of human word associations, which is available on the web. Participants in the study <br />
were presented with a list of English words as cues, and asked to respond to each word with a word that was “meaningfully related or strongly associated” <ref> D. L. Nelson, C. L. McEvoy, and T. A. Schreiber. The university of south florida word association, rhyme, and word fragment norms. In http://www.usf.edu/FreeAssociation/, 1998. </ref>. The database contains 5018 cue words, with an average of 122 responses to each.<br />
<br />
=References=<br />
<references/></div>
<hr />
<div>== Introduction ==<br />
<br />
The main idea of this paper is to show how we can use several different two-dimensional maps to visualize a set of pairwise similarities. Aspect maps resemble both clustering (in modeling pairwise similarities as a mixture of different types of similarity) and multi-dimensional scaling (in modeling each type of similarity by a two-dimensional map). While methods such as PCA and MDS (Metric Multi-dimensional Scaling) are simple and fast, their main drawback is that they minimize a cost function focused mainly on modeling large dissimilarities rather than small ones. As a result, they do not provide good visualizations of data that lies on a curved low-dimensional manifold in a high-dimensional space. Conversely, methods such as Local MDS, LLE, Maximum Variance Unfolding, and Stochastic Neighbour Embedding (SNE) model local distances accurately in the two-dimensional visualization, but model larger distances inaccurately.<br />
<br />
SNE outperforms methods such as LLE in two ways: although the SNE objective function is difficult to optimize, it leads to much better solutions, and since SNE is based on a probabilistic model, it is much more effective at producing good visualizations. In the next section, we explain how SNE works.<br />
<br />
== Stochastic Neighbour Embedding ==<br />
<br />
The core of SNE method <ref> G. Hinton and S. Roweis. Stochastic neighbor embedding. Advances in Neural Information Processing Systems, 15:833–840, 2003 </ref><br />
lies in converting high-dimensional distance or similarity data into a set of <math> \mathbf{ p_{j|i} }</math>, each of which represents the probability that object <math> i </math> would pick object <math> j </math> as its neighbour if it were only allowed to pick one. For objects in a high-dimensional Euclidean space, where our data points are the coordinates of the objects, we can find <math> \mathbf{ p_{j|i} } </math> for each object <math> i </math> by using a spherical Gaussian distribution centered at the high-dimensional position of <math> i </math>, <math> \mathbf{ X_{i}} </math>. We set <math> \mathbf{ p_{i|i} = 0 }</math> and, for <math> \mathbf{ j \neq i } </math>,<br />
<br />
<center> <math> \mathbf p_{j|i} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma_i ^2 )}{\sum_{k \neq i} \exp(-||x_i-x_k ||^2/ 2\sigma_i ^2 ) }</math> </center><br />
<br />
Note that given a set of pairwise distances between objects, <math> \mathbf{|| x_i - x_j ||} </math>, we can use the above equation to derive the same probabilities. In practice, given a set of <math> N </math> points, we set the variance of the Gaussian, <math> \mathbf{ \sigma_i ^2} </math>, either by hand or by a binary search for the value of <math> \mathbf{ \sigma_i } </math> that makes the entropy of the distribution over neighbours equal to <math> \mathbf{ \log_2 M} </math>. (Recall that the entropy of the distribution <math> \mathbf{ P_i} </math> is defined as <math> \int_{-\infty}^{+\infty}p(x)\log(1/p(x))dx </math>, where <math> \mathbf{ p(x)\log(1/p(x))} </math> is understood to be zero when <math> \mathbf{p(x)=0} </math>.) This is done by choosing a number <math> \mathbf{ M \ll N} </math> and performing the binary search until the entropy of <math> \mathbf{ P_i} </math> is within some predetermined small tolerance of <math> \mathbf{\log_2 M } </math>. <br />
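As an illustrative sketch (not the authors' code), the entropy-matching binary search for <math> \mathbf{ \sigma_i } </math> can be written in a few lines of NumPy; the function names and search bounds below are our own choices:

```python
import numpy as np

def p_conditional(dists_sq, sigma_sq):
    """P(j|i) for one row of squared distances to the other n-1 objects."""
    p = np.exp(-dists_sq / (2.0 * sigma_sq))
    return p / p.sum()

def find_sigma(dists_sq, target_entropy, tol=1e-5, max_iter=100):
    """Binary search over sigma^2 so the entropy of P_i (in bits) hits the target."""
    lo, hi = 1e-10, 1e10
    for _ in range(max_iter):
        sigma_sq = 0.5 * (lo + hi)
        p = p_conditional(dists_sq, sigma_sq)
        # Shannon entropy in bits; terms with p = 0 contribute 0 by convention.
        h = -np.sum(p[p > 0] * np.log2(p[p > 0]))
        if abs(h - target_entropy) < tol:
            break
        if h > target_entropy:   # distribution too flat -> shrink sigma
            hi = sigma_sq
        else:
            lo = sigma_sq
    return sigma_sq, p
```

With a chosen <math> M </math>, the target entropy is simply `np.log2(M)`; entropy increases monotonically with <math> \sigma_i </math>, which is what makes the bisection valid.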
<br />
Our main goal in SNE is to model <math>\mathbf{p_{j|i}}</math> by using the conditional probabilities <math>\mathbf{q_{j|i}}</math>, which are determined by the locations <math>\mathbf{ y_i} </math> of points in low-dimensional space: <br />
<center> <math> q_{j|i} = \frac{\exp(-||y_i-y_j ||^2)}{\sum_{k \neq i} \exp(-||y_i-y_k ||^2) }</math> </center><br />
The aim of embedding is to match these two distributions as well as possible. To do so, we minimize a cost function which is a sum of Kullback-Leibler divergences between the original <br />
<math> \mathbf{p_{j|i}} </math> and induced <math> \mathbf{ q_{j|i}} </math> distributions over neighbours for each object:<br />
<br />
<center> <math> C = \sum_{i} KL(P_i||Q_i) =\sum_{i}\sum_{j \neq i} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}</math> </center><br />
<br />
The dimensionality of the <math> \mathbf{Y} </math> space is chosen to be much less than the number of objects. Notice that making <math> \mathbf{ q_{j|i}} </math> large when <math> \mathbf{ p_{j|i}} </math> is small wastes some of the probability mass in the <math> \mathbf{Q} </math> distribution, so there is a cost for modeling a big distance in the high-dimensional space, though it is less than the cost of modeling a small distance with a big one. Therefore SNE is an improvement over methods like LLE; while SNE emphasizes local distances, its cost function cleanly enforces ''both'' keeping the images of nearby objects nearby ''and'' keeping the images of widely separated objects relatively far apart. Although differentiating <math> \mathbf{C} </math> is tedious because <math> \mathbf{y_k} </math> affects <math> \mathbf{ q_{j|i}} </math> via the normalization term in its definition, the final result is simple and has a nice physical interpretation:<br />
<br />
<center> <math> \frac{\partial C}{\partial y_i} = 2\sum_{j} (y_i-y_j)([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) </math> </center><br />
<br />
Using steepest descent to minimize <math> \mathbf{C} </math>, in which all of the points are adjusted in parallel, is inefficient and can get stuck in poor local minima. To address this problem, we add Gaussian noise to the <math> \mathbf{y} </math> values after each update. We start with a high level of noise and reduce it rapidly to find the approximate noise level at which structure starts to form in the low-dimensional map. Once we observe that a small increase in the noise level leads to a large decrease in the cost function, we can be sure that structure is emerging. Now, by repeating this process, starting from the noise level just above the level at which structure emerged and reducing it gently, we can find low-dimensional maps that are significantly better minima of <math> \mathbf{C} </math>.<br />
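The cost, its gradient, and the noisy steepest-descent procedure described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the learning rate, annealing schedule, and initialization scale are our own choices:

```python
import numpy as np

def _q_cond(Y):
    """q_{j|i} induced by the low-dimensional points Y (one row per object)."""
    D = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    E = np.exp(-D)
    np.fill_diagonal(E, 0.0)
    return E / E.sum(axis=1, keepdims=True)

def sne_cost(P, Y):
    """C = sum_i KL(P_i || Q_i); terms with p = 0 contribute nothing."""
    Q = _q_cond(Y)
    m = P > 0
    return np.sum(P[m] * np.log(P[m] / Q[m]))

def sne_grad(P, Y):
    """dC/dy_i = 2 sum_j (y_i - y_j)([p_{j|i}-q_{j|i}] + [p_{i|j}-q_{i|j}])."""
    Q = _q_cond(Y)
    M = (P - Q) + (P - Q).T
    return 2.0 * ((np.diag(M.sum(axis=1)) - M) @ Y)

def sne_fit(P, n_iter=300, lr=0.01, noise=0.0, anneal=0.99, seed=0):
    """Steepest descent with optional annealed Gaussian jitter on the map."""
    rng = np.random.default_rng(seed)
    Y = 1e-2 * rng.standard_normal((P.shape[0], 2))
    for t in range(n_iter):
        Y = Y - lr * sne_grad(P, Y)
        if noise > 0:
            Y += noise * (anneal ** t) * rng.standard_normal(Y.shape)
    return Y
```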
<br />
== Symmetric SNE ==<br />
<br />
An alternative to SNE, which minimizes a divergence between conditional distributions, is to define a single joint distribution over all non-identical ordered pairs and minimize the divergence between the joint distributions:<br />
<br />
In this case we define <math> \mathbf{p_{ij}} </math> by<br />
<br />
<center> <math> \mathbf p_{ij} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma^2 )}{\sum_{k < l} \exp(-||x_k-x_l ||^2/ 2\sigma^2 ) }</math> </center><br />
<br />
<math> \mathbf{q_{ij}} </math>'s are defined by<br />
<br />
<center> <math> \mathbf q_{ij} = \frac{\exp(-||y_i-y_j ||^2 )}{\sum_{k < l} \exp(-||y_k-y_l ||^2) }</math> </center><br />
<br />
and finally the symmetric version of our cost function, <math> \mathbf{C_{sym}} </math>, becomes the KL divergence between the two distributions<br />
<br />
<center> <math> C_{sym} = KL(P||Q) =\sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}</math> </center><br />
<br />
The benefit of defining <math> \mathbf{p_{ij}} </math> in this way is that the derivatives become much simpler. However, if one of the high-dimensional points, <math> \mathbf{j} </math>, is far from all of the others, all of the <math> \mathbf{p_{ij}} </math> involving <math> \mathbf{j} </math> will be very small. In this case, we replace <math> \mathbf{p_{ij}} </math> by <math> \mathbf{p_{ij}=0.5(p_{j|i}+p_{i|j})} </math>; then, even when <math> \mathbf{j} </math> is far from all the other points, the <math> \mathbf{p_{.|j}} </math> will still sum to 1.<br />
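A minimal sketch of this symmetrization (our own helper, with an explicit renormalization so the joint distribution sums to 1, which the text leaves implicit):

```python
import numpy as np

def joint_p(P_cond):
    """Symmetrize conditional probabilities: p_ij proportional to
    0.5 * (p_{j|i} + p_{i|j}), renormalized so the joint sums to 1."""
    P = 0.5 * (P_cond + P_cond.T)
    return P / P.sum()
```

Because each row of the conditional matrix sums to 1, every object keeps a non-negligible share of the total mass even when it is far from all others.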
<br />
== Aspect Maps ==<br />
<br />
Another approach for defining <math> \mathbf{q_{j|i}} </math> is allowing <math> \mathbf{i} </math> and <math> \mathbf{j} </math> to occur in several different two-dimensional maps and assigning a mixing proportion <math> \mathbf{\pi_{i}^{m}} </math> in m-th map to the object <math> \mathbf{i} </math>. Note that we should have <math> \mathbf{\sum_{m} \pi_{i}^{m}=1} </math>. Now by using these different maps, we define <math> \mathbf{q_{j|i}} </math> as follows:<br />
<br />
<center> <math> q_{j|i} = \frac{\sum_{m} \pi_{i}^{m}\pi_{j}^{m} e^{-d_{i,j}^{m}} }{z_i} </math> </center><br />
<br />
where<br />
<br />
<center> <math> d_{i,j}^{m}=|| y_i^m-y_j^m ||^2, \quad z_i=\sum_{h \neq i}\sum_{m} \pi_{i}^{m} \pi_{h}^{m} e^{-d_{i,h}^{m}} </math> </center><br />
<br />
Using a mixture model is very different from simply using a single space with extra dimensions, because points that are far apart on one dimension cannot have a high <math> \mathbf{q_{j|i}} </math> no matter how close together they are on the other dimensions. With a mixture model, by contrast, provided that ''there is'' at least one map in which <math> \mathbf{i} </math> is close to <math> \mathbf{j} </math> ''and'' the versions of <math> \mathbf{i} </math> and <math> \mathbf{j} </math> in that map have high mixing proportions, it is possible for <math> \mathbf{q_{j|i}} </math> to be quite large even if <math> \mathbf{i} </math> and <math> \mathbf{j} </math> are far apart in all the other maps. <br />
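The mixture-of-maps definition of <math> \mathbf{q_{j|i}} </math> translates directly into code. The sketch below is our own illustration (maps stored as an (M, n, 2) array, mixing proportions as an (n, M) matrix), returning the full matrix of conditional probabilities:

```python
import numpy as np

def aspect_q(maps, pi):
    """q_{j|i} under a mixture of 2-D maps.
    maps: (M, n, 2) array, maps[m][i] = location of object i in map m.
    pi:   (n, M) mixing proportions, each row summing to 1."""
    M, n, _ = maps.shape
    num = np.zeros((n, n))
    for m in range(M):
        D = np.sum((maps[m][:, None, :] - maps[m][None, :, :]) ** 2, axis=-1)
        num += np.outer(pi[:, m], pi[:, m]) * np.exp(-D)
    np.fill_diagonal(num, 0.0)                    # z_i sums over h != i
    return num / num.sum(axis=1, keepdims=True)   # row i holds q_{j|i}
```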
<br />
To optimize the aspect-map models, we used Carl Rasmussen's "minimize" function, available at <ref> www.kyb.tuebingen.mpg.de/bs/people/carl/code/minimize/ </ref>. The gradients are given by:<br />
<br />
<center> <math> \frac{\partial C}{\partial \pi_i^m}=-\sum_{k}\sum_{l \neq k} p_{l|k} \frac{\partial}{\partial \pi_i^m} [\log q_{l|k}z_k -\log z_k] </math> </center><br />
<br />
Now by substituting the definition of <math> \mathbf{z_k} </math> and reshuffling the terms we will have:<br />
<br />
<center> <math> \frac{\partial C}{\partial \pi_i^m}=\sum_{j}[\frac{1}{q_{j|i} z_i}(q_{j|i}-p_{j|i})+\frac{1}{q_{i|j} z_j}(q_{i|j}-p_{i|j}) ] \pi_{j}^{m}e^{-d^m_{i,j}} </math> </center><br />
<br />
In practice, we will not use the mixing proportions <math> \mathbf{\pi_i^m} </math> themselves as parameters of the model; instead, we define <math> \mathbf{w_i^m} </math> by: <br />
<br />
<center> <math> \pi_i^m = \frac{e^{-w_i^m}}{\sum_{m'}e^{-w_i^{m'}}} </math> </center><br />
<br />
As a result, the gradient becomes:<br />
<br />
<center> <math> \frac{\partial C}{\partial w_i^m} = \pi_i^m \left[ \left(\sum_{m'}\frac{\partial C}{\partial \pi_i^{m'}} \pi_i^{m'}\right)-\frac{\partial C}{\partial \pi_i^m}\right] </math> </center><br />
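This softmax-style reparameterization and its chain rule can be checked numerically. The sketch below is our own (helper names and the row-shift for numerical stability are our choices); the second function is exactly the bracketed expression above:

```python
import numpy as np

def pi_from_w(w):
    """pi_i^m = exp(-w_i^m) / sum_{m'} exp(-w_i^{m'}); rows of w index objects."""
    e = np.exp(-(w - w.min(axis=1, keepdims=True)))  # row shift: pi is shift-invariant
    return e / e.sum(axis=1, keepdims=True)

def grad_w_from_grad_pi(dC_dpi, pi):
    """Chain rule through the reparameterization:
    dC/dw_i^m = pi_i^m [ (sum_{m'} dC/dpi_i^{m'} pi_i^{m'}) - dC/dpi_i^m ]."""
    inner = np.sum(dC_dpi * pi, axis=1, keepdims=True)
    return pi * (inner - dC_dpi)
```

A quick finite-difference test on a toy linear cost confirms the formula, which is a convenient sanity check before plugging the gradient into an optimizer.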
<br />
== Modeling Human Word Association Data ==<br />
<br />
In order to see how SNE works in practice, the authors used the University of South Florida database on human word associations, which is available on the web. Participants in the study <br />
were presented with a list of English words as cues, and asked to respond to each word with a word that was “meaningfully related or strongly associated” <ref> D. L. Nelson, C. L. McEvoy, and T. A. Schreiber. The university of south florida word association, rhyme, and word fragment norms. In http://www.usf.edu/FreeAssociation/, 1998. </ref>. The database contains 5018 cue words, with an average of 122 responses to each.<br />
<br />
=References=<br />
<references/></div>Myakhavehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=nonlinear_Dimensionality_Reduction_by_Semidefinite_Programming_and_Kernel_Matrix_Factorization&diff=2805nonlinear Dimensionality Reduction by Semidefinite Programming and Kernel Matrix Factorization2009-07-10T14:59:08Z<p>Myakhave: /* Experimental Results */</p>
<hr />
<div>==Introduction ==<br />
In recent work, semidefinite embedding (SDE), which learns a kernel matrix by maximizing variance while preserving the distances and angles between nearest neighbors, has been introduced. Although it has many advantages, such as convexity of the optimization, it suffers from high computational cost on large problems. In this paper <ref>K. Q. Weinberger et al. Nonlinear Dimensionality Reduction by Semidefinite Programming and Kernel Matrix Factorization, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS-05), pages 381-388, Barbados, West Indies, 2005.</ref><br />
, a new framework is proposed, based on factorizing the entire kernel matrix in terms of a much smaller submatrix of inner products between randomly chosen landmarks.<br />
<br />
==Semidefinite Embedding==<br />
Given high-dimensional vectors <math>\,\{x_1, x_2, ..., x_n\} \in R^D </math> lying on or near a manifold that can be embedded in <math>\,d\ll D</math> dimensions to produce low-dimensional vectors <math>\,\{y_1, y_2, ..., y_n\} \in R^d</math>, the goal is to find d and produce an appropriate embedding. The algorithm starts by computing the k-nearest neighbors of each input and adding constraints to preserve the distances and angles between k-nearest neighbors:<br /><br />
<math>\,\|y_i-y_j\|^2 = \|x_i-x_j\|^2 </math> for all (i,j) in k-nearest neighbors (2)<br /><br />
and also a constraint on outputs to be centerd on the origin:<br /><br />
<math>\,\sum_i{y_i} = 0 </math> (3) <br /><br />
Maximizing the variance of the outputs is the final step:<br />
<math>\,var(y)=\sum_i{\|y_i\|^2} </math> (4)<br /><br />
Assuming <math>\,K_{ij}=y_i.y_j</math>, the above optimization problem can be reformulated as the following SDP:<br /><br />
Maximize trace(K) subject to :<br /><br />
1) <math>\,K\succeq 0 </math> <br /><br />
2)<math>\,\sum_{ij}K_{ij}=0 </math> <br /><br />
3) For all (i,j) such that <math>\,\eta_{ij}=1 </math>, <br /><br />
<math>\,K_{ii}-2K_{ij}+K_{jj}=\|x_i-x_j\|^2</math><br /><br />
The top d eigenvalues and eigenvectors of the kernel matrix are then used to derive the embedding. Here, learning the kernel matrix dominates the total computation time.<br /><br />
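Recovering the embedding from the learned kernel matrix is a standard eigendecomposition step; a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def embed_from_kernel(K, d=2):
    """Recover a d-dimensional embedding from a PSD kernel K = Y Y^T,
    using its top-d eigenvalues and eigenvectors."""
    vals, vecs = np.linalg.eigh(K)            # eigh returns ascending order
    idx = np.argsort(vals)[::-1][:d]
    vals, vecs = vals[idx], vecs[:, idx]
    return vecs * np.sqrt(np.maximum(vals, 0.0))   # clip tiny negative eigenvalues
```

If K has rank at most d, this recovers Y exactly up to rotation, since any such Y' with Y'Y'^T = K differs from Y by an orthogonal transformation.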
<br />
==Kernel Matrix Factorization==<br />
As we saw in the last section, learning the kernel matrix takes a long time. Therefore, if we can approximate it by a much smaller matrix, the computation time will be drastically improved. This can be done through the following steps:<br /><br />
First, reconstruct the high-dimensional dataset <math>\,\{x_i\}_{i=1}^n</math> from m randomly chosen landmarks <math>\,\{\mu_{\alpha}\}_{\alpha=1}^m</math>:<br /><br />
<math>\,\hat{x_i} = \sum_{\alpha}{Q_{i\alpha}\mu_{\alpha}} </math><br /><br />
Based on intuition similar to that of locally linear embedding (LLE), the same linear transformation (Q) can be used to reconstruct the outputs:<br /><br />
<math>\,\hat{y_i} = \sum_{\alpha}{Q_{i\alpha}l_{\alpha}} </math> (6)<br /> <br />
Now, if we make the approximation:<br /><br />
<math>\,K_{ij}=y_i.y_j=\hat{y_i}.\hat{y_j}</math> (7)<br /><br />
substituting (6) into (7) gives <math>\,K\approx QLQ^T</math> where <math>\,L_{\alpha\beta} = l_{\alpha}.l_{\beta}</math><br />
<br />
==Reconstructing from landmarks==<br />
To derive Q, we assume that the manifold can be locally approximated by a linear subspace. Therefore, each input in the high dimensional space can be reconstructed by a weighted sum of its r-nearest neighbors. These weights are found by :<br /><br />
Minimize :<math>\,\varepsilon(W)=\sum_i{\|x_i-\Sigma_j{W_{ij}x_j}\|^2}</math><br /><br />
subject to:<math>\,\Sigma_j{W_{ij}}=1</math> for all i<br /><br />
and <math>\,W_{ij}=0</math> if <math>\,x_j</math> is not among the r nearest neighbors of <math>\,x_i</math><br /><br />
Rewriting the reconstruction error as a function of the inputs gives: <br />
<math>\,\varepsilon(X)=\sum_{ij}{\phi_{ij}x_i.x_j}</math><br /><br />
where <math>\,\phi = (I_n-W)^T(I_n-W)</math> or <br /><br />
<math>\,\phi=\begin{bmatrix} \overbrace{\phi^{ll}}^{m} & \overbrace{\phi^{lu}}^{n-m} \\ \phi^{ul} & \phi^{uu}\end{bmatrix} </math><br /><br />
<br />
<br />
and the solution will be:<br /><br />
<math>\,Q=\begin{bmatrix} I_m \\ -(\phi^{uu})^{-1}\phi^{ul}\end{bmatrix}</math><br /><br />
As the <math>\,W_{ij}</math> are invariant to translations and rotations of each input and its r nearest neighbors, the same weights can be used to reconstruct <math>\,y_i</math><br />
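The weight and Q computations above can be sketched as follows. This is an illustrative NumPy version, not the authors' code; the regularization constant and the convention that the first m points are the landmarks are our own choices:

```python
import numpy as np

def lle_weights(X, r):
    """Reconstruction weights: each x_i as an affine combination of its
    r nearest neighbours (rows of W sum to 1, zero elsewhere)."""
    n = X.shape[0]
    W = np.zeros((n, n))
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(D, np.inf)
    for i in range(n):
        nbrs = np.argsort(D[i])[:r]
        Z = X[nbrs] - X[i]                   # centred neighbours
        G = Z @ Z.T + 1e-8 * np.eye(r)       # regularized local Gram matrix
        w = np.linalg.solve(G, np.ones(r))
        W[i, nbrs] = w / w.sum()             # enforce the sum-to-one constraint
    return W

def landmark_Q(W, m):
    """Q = [I_m; -(phi^uu)^{-1} phi^ul], taking the first m points as landmarks."""
    n = W.shape[0]
    phi = (np.eye(n) - W).T @ (np.eye(n) - W)
    phi_uu, phi_ul = phi[m:, m:], phi[m:, :m]
    return np.vstack([np.eye(m), -np.linalg.solve(phi_uu, phi_ul)])
```

For data lying exactly on a low-dimensional affine subspace, `Q @ X[:m]` reproduces all of X almost exactly, which illustrates why the factorization loses little when the manifold is locally linear.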
<br />
==Embedding the landmarks==<br />
Considering the factorization <math>\,K\approx QLQ^T</math>, we will get the following SDP:<br /><br />
Maximize trace(<math>\,QLQ^T</math>) subject to :<br /><br />
<br /><br />
1) <math>\,L\succeq 0 </math> <br /><br />
2)<math>\,\sum_{ij}(QLQ^T)_{ij}=0 </math> <br /><br />
3) For all (i,j) such that <math>\,\eta_{ij}=1 </math>, <br /><br />
<math>\,(QLQ^T)_{ii}-2(QLQ^T)_{ij}+(QLQ^T)_{jj}\leq\|x_i-x_j\|^2</math><br /><br />
<br /><br />
Because the matrix factorization is only an approximation, the distance constraints have changed from equalities to inequalities.<br /><br />
<br />
The computation time in semidefinite programming depends on the matrix size and the number of constraints. Here, the matrix size in lSDE is much smaller than in SDE, while the number of constraints is the same. However, the constraints in lSDE are not sparse, and this may make lSDE even slower than SDE. To overcome this problem, one can feed the solver an initial subset of constraints, such as just centering and semidefiniteness, instead of the whole set. If the solution violates any of the unused constraints, those are added to the problem and the SDP solver is run again.<br />
<br />
==Experimental Results==<br />
In the first experiment, Figure 1<ref><br />
The same paper. Figure 1<br />
</ref>, only 1205 out of 43182 constraints had to be enforced based on Table 2.<br /><br />
[[File:Fig1.jpg|left|thumb|400px|Figure 1]] <br />
[[File:T1.jpg|none|thumb|400px|Table 1]]<br />
<br />
In the second experiment, Table 1<ref><br />
The same paper. Table 1<br />
</ref>, despite the huge dimensionality reduction from D=60000 to d=5, many expected neighbors were preserved.<br /><br />
[[File:Fig3_.jpg|left|thumb|400px|Figure 3]]<br />
[[File:Fig4_.jpg|none|thumb|400px|Figure 4]]<br />
<br />
<br />
In the third experiment, Figure 2<ref><br />
The same paper. Figure 2<br />
</ref>, the results from <math>lSDE</math> are different from the results from SDE, but they approach each other as the number of landmarks increases. Here, as shown in Figure 4<ref><br />
The same paper. Figure 4<br />
</ref>, <math>lSDE</math> is slower than SDE, because the dataset is too small and has a particular cyclic structure, so the incremental scheme for adding constraints does not work well.<br />
<br />
[[File:Fig2.jpg|left|thumb|800px|Figure 2]]<br />
[[File:T2.jpg|none|thumb|800px|Table 2]]<br />
<br />
==References==<br />
<references/></div>Myakhavehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=graph_Laplacian_Regularization_for_Larg-Scale_Semidefinite_Programming&diff=2803graph Laplacian Regularization for Larg-Scale Semidefinite Programming2009-07-10T14:56:51Z<p>Myakhave: /* Results */</p>
<hr />
<div>==Introduction==<br />
This paper<ref>K. Q. Weinberger et al. Graph Laplacian Regularization for Larg-Scale Semidefinite Programming, Advances in neural information processing systems, 2007 - cs.utah.edu<br />
</ref> introduces a new approach for discovering low-dimensional representations of high-dimensional data when, as is often the case, local proximity measurements are also available. Sensor localization is a well-known example in this field. Existing approaches use semidefinite programs (SDPs) with low-rank solutions from convex optimization methods. However, the SDP approach does not scale well to large inputs. The main contribution of this paper is to use matrix factorization to solve very large problems of the above type, leading to much smaller and faster SDPs than previous approaches. This factorization comes from expanding the solution of the original problem in terms of the bottom eigenvectors of a graph Laplacian. As the smaller SDPs arising from this factorization only approximate the original problem, their solutions can be refined using gradient descent. The approach is illustrated on the localization of large-scale sensor networks.<br /><br />
<br />
==Sensor localization==<br />
Assuming sensors can only estimate their local pairwise distances to nearby sensors via radio transmitters, the problem is to identify the whole network topology. In other words, given n sensors with <math>d_{ij}</math> an estimate of the local distance between adjacent sensors i and j, the desired output is <math>x_1, x_2, ..., x_n \in R^2</math>, the planar coordinates of the sensors. <br />
<br />
===Work on this issue so far===<br />
Work on this issue so far starts with minimizing a sum-of-squares loss function,<br />
<math>\,\min_{x_1,...,x_n}\Sigma_{i\sim j}{(\|x_i-x_j\|^2-d_{ij}^2)^2}</math> (1)<br /> <br />
and adding a centering Constraint (assuming no sensor location is known in advance) as<br /> <math>\,\|\Sigma_i{x_i}\|^2 = 0</math> (2) <br /><br />
The problem here is that the optimization is not convex and is likely to be trapped in local minima. To solve this problem, an <math>n \times n</math> inner product matrix X is defined as <math>X_{ij} = x_i \cdot x_j</math>, and by relaxing the constraint that the sensor locations <math>x_i</math> lie in the <math>R^2</math> plane, the following convex formulation is obtained:<br /><br />
Minimize: <math>\,\Sigma_{i\sim j}{(X_{ii}-2X_{ij}+X_{jj}-d_{ij}^2)^2}</math> (3)<br /><br />
subject to: (i) <math>\,\Sigma_{ij}{X_{ij}=0}</math> and (ii) <math>X \succeq 0</math><br /><br />
The vectors <math>\,x_i</math> will lie in a subspace with dimensionality equal to the rank of the solution X. Projecting the <math>x_i</math>s into their 2D subspace of maximum variance, obtained from the top 2 eigenvectors of X, gives planar coordinates. However, the higher the rank of X, the greater the information loss after projection. The fact that the projection error increases with the rank leads us to favor low-rank, or equivalently, high-trace, solutions. Therefore, an extra term is added to favor solutions with high variance (high trace):<br /><br />
Maximize: <math>\,tr(X)-v\Sigma_{i\sim j}{(X_{ii}-2X_{ij}+X_{jj}-d_{ij}^2)^2}</math> (4)<br /><br />
subject to: (i) <math>\,\Sigma_{ij}{X_{ij}=0}</math> and (ii) <math>\,X \succeq 0</math><br /><br />
where the parameter <math>v>0</math> balances the trade-off between maximizing variance and preserving local distances (MVU).<br /><br />
<br />
==Matrix factorization==<br />
Assume G is the neighborhood graph defined by the sensor network, and that the sensor locations form a function defined over the nodes of this graph. Functions on a graph can be approximated using the eigenvectors of the graph's Laplacian matrix as basis functions (spectral graph theory).<br /><br />
The graph Laplacian is defined by:<br /><br />
<br />
<math> L_{i,j}= \left\{\begin{matrix} <br />
deg(v_i) & \text{if } i=j \\ <br />
-1 & \text{if } i\neq j \text{ and } v_i \text{ adjacent } v_j \\ <br />
0 & \text{ otherwise}\end{matrix}\right.</math><br />
<br />
<br /><br />
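The piecewise definition above is simply L = D − A, where A is the adjacency matrix and D the diagonal degree matrix; a one-line sketch:

```python
import numpy as np

def graph_laplacian(A):
    """Unnormalized graph Laplacian L = D - A for a symmetric 0/1 adjacency matrix."""
    return np.diag(A.sum(axis=1)) - A
```

By construction each row of L sums to zero, so the constant vector is always an eigenvector with eigenvalue 0.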
<br />
Sensor locations can be approximated using the m bottom eigenvectors of the Laplacian matrix of G. Expanding these locations yields a matrix factorization for X:<br />
<math>x_i\approx\Sigma_{\alpha = 1}^m Q_{i\alpha}y_{\alpha}</math> <br /><br />
where Q is the <math>n \times m</math> matrix whose columns are the m bottom eigenvectors of the Laplacian matrix, and the <math>y_{\alpha}</math> are unknown and depend on <math>d_{ij}</math>. Now, if we define the inner products of these vectors as <math>Y_{\alpha\beta} = y_{\alpha}\cdot y_{\beta}</math>, we get the factorized matrix <math>X\approx QYQ^T</math> (6)<br /><br />
Using this approximation, we can solve an optimization for Y that is much smaller than the one for X. Since Q stores mutually orthogonal eigenvectors, it follows that <math>tr(Y)=tr(X)</math>. In addition, <math>QYQ^T</math> satisfies the centering constraint because the uniform eigenvector is not included. Therefore, the optimization changes to the following:<br /><br />
Maximize: <math>tr(Y)-v\Sigma_{i\sim j}{((QYQ^T)_{ii}-2(QYQ^T)_{ij}+(QYQ^T)_{jj}-d_{ij}^2)^2}</math> (7)<br /><br />
subject to: <math>Y \succeq 0</math><br /><br />
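Forming Q from the m bottom eigenvectors of the Laplacian (skipping the uniform one) makes the trace and centering properties used above automatic; a hedged sketch (function name ours, valid for a connected graph where the zero eigenvalue is simple):

```python
import numpy as np

def bottom_eigvecs(L, m):
    """m bottom eigenvectors of the graph Laplacian, excluding the uniform
    (constant) eigenvector so that QYQ^T automatically satisfies centering."""
    vals, vecs = np.linalg.eigh(L)   # ascending eigenvalues for symmetric L
    return vecs[:, 1:m + 1]          # skip the constant eigenvector (eigenvalue 0)
```

Because the columns of Q are orthonormal, tr(QYQ^T) = tr(Y); because they are orthogonal to the constant eigenvector, the entries of QYQ^T sum to zero.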
<br />
==Formulation as SDP==<br />
Our goal is to cast the required optimization as an SDP over small matrices with few constraints. Let <math>y\in R^{m^2}</math> be the vector obtained by concatenating all the columns of Y, <math>A \in R^{m^2 \times m^2}</math> be a positive semidefinite matrix collecting all the quadratic coefficients in the objective function, <math>b\in R^{m^2}</math> be a vector collecting all the linear coefficients in the objective function, and <math>l</math> be a lower bound on the quadratic piece of the objective function. Using Schur’s lemma to express this bound as a linear matrix inequality, we obtain the SDP:<br /><br />
<br /><br />
Maximize: <math>\,b^Ty - l</math> (9) <br /><br />
subject to: (i) <math>Y \succeq 0 </math> and (ii) <math>\begin{bmatrix} I & A^{1/2}y \\ (A^{1/2}y)^T & l \end{bmatrix} \succeq 0 </math><br /><br />
<br />
By putting the problem in this form, the only variables of the SDP are the <math>m(m+1)/2</math> elements of Y and the unknown scalar l. The constraints reduce to positive semidefiniteness of Y and a linear matrix inequality of size <math>m^2 \times m^2</math>. It is worth noting that the complexity of the SDP does not depend on the number of nodes (n) or edges in the network.<br /><br />
<br />
==Gradient based improvement==<br />
As the matrix factorization only provides an approximation to the global minimum, the result is refined by using it as the initial starting point for gradient descent on the first equation. This is done in 2 steps:<br /><br />
First, starting from the m-dimensional solution of eq. (6), use conjugate gradient methods to maximize the objective function in eq. (4).<br /><br />
Second, project the results from the previous step into the <math>R^2</math> plane and use conjugate gradient methods to minimize the loss function in eq. (1).<br />
The conjugate gradient method is an iterative method for minimizing a quadratic function whose Hessian matrix (the matrix of second partial derivatives) is positive definite.<br />
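As an illustration of the second refinement step, here is a sketch of the loss in eq. (1) and its gradient, minimized by plain gradient descent as a simple stand-in for the conjugate gradient method (the step size and iteration count are our own choices):

```python
import numpy as np

def loss_and_grad(x, edges, d):
    """Loss (1): sum over measured edges of (||x_i - x_j||^2 - d_ij^2)^2,
    together with its gradient with respect to the planar coordinates x."""
    g = np.zeros_like(x)
    loss = 0.0
    for (i, j), dij in zip(edges, d):
        diff = x[i] - x[j]
        e = diff @ diff - dij ** 2
        loss += e * e
        g[i] += 4.0 * e * diff    # d/dx_i of e^2 = 2e * 2(x_i - x_j)
        g[j] -= 4.0 * e * diff
    return loss, g

def refine(x0, edges, d, lr=0.005, n_iter=400):
    """Gradient-descent refinement from an initial layout x0."""
    x = x0.copy()
    for _ in range(n_iter):
        _, g = loss_and_grad(x, edges, d)
        x -= lr * g
    return x
```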
<br />
==Results==<br />
The figure <ref><br />
The same paper. Figure 2<br />
</ref> below shows the sensor locations inferred for the n = 1055 largest cities in the continental US. Local distances were estimated to up to 18 neighbors within radius r = 0.09. Local measurements were corrupted by 10% Gaussian noise on the true local distance. Using the m = 10 bottom eigenvectors of the graph Laplacian, the solution provides a good initial starting point (left picture) for gradient-based improvement. The right picture shows the sensor locations after the improvement.<br /><br />
<br />
[[File:fig2L.jpg|left|thumb|400px|Initial Sensor locations]]<br />
[[File:fig2R.jpg|none|thumb|400px|Sensor locations after gradient improvement]]<br />
<br /><br />
<br />
The second simulated network, the figure<ref><br />
The same paper. Figure 3<br />
</ref> below, placed nodes at n=20,000 uniformly sampled points inside the unit square. Local distances were estimated to up to 20 other nodes within radius r = 0.06. Using the m = 10 bottom eigenvectors of the graph Laplacian, 19 s were needed to construct and solve the SDP and 52 s for 100 iterations of conjugate gradient descent.<br /><br />
[[File:fig3.jpg|none|thumb|400px|Results on a simulated network with n=20000 uniformly distributed nodes inside a centered unit square]]<br />
<br /><br />
<br />
For the simulated networks with nodes at US cities, the figure<ref><br />
The same paper. Figure 4<br />
</ref> below plots the loss function in eq. (1) against the number of eigenvectors, and also plots computation time against the number of eigenvectors. The figure shows a trade-off between obtaining a better solution and increasing computation time, and that <math>m\approx 10</math> best manages this trade-off.<br /><br />
[[File:fig4.jpg|none|thumb|400px|Left: The value of loss function. Right: The computation time]]<br />
<br />
==Conclusion==<br />
An approach for inferring low-dimensional representations from local distance constraints using MVU was proposed.<br /><br />
Its main idea is a matrix factorization computed from the bottom eigenvectors of the graph Laplacian.<br /><br />
The initial solution can be refined by local search methods.<br /><br />
This approach is suitable for large inputs, as the complexity of its SDP does not depend on the number of nodes or edges in the network.<br />
<br />
=References=<br />
<references/></div>Myakhavehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=graph_Laplacian_Regularization_for_Larg-Scale_Semidefinite_Programming&diff=2802graph Laplacian Regularization for Larg-Scale Semidefinite Programming2009-07-10T14:55:16Z<p>Myakhave: /* Results */</p>
<hr />
<div>==Introduction==<br />
This paper<ref>K. Q. Weinberger et al. Graph Laplacian Regularization for Larg-Scale Semidefinite Programming, Advances in neural information processing systems, 2007 - cs.utah.edu<br />
</ref> introduces a new approach for discovering low-dimensional representations of high-dimensional data when, as is often the case, local proximity measurements are also available. Sensor localization is a well-known example in this field. Existing approaches use semidefinite programs (SDPs) with low-rank solutions from convex optimization methods. However, the SDP approach does not scale well to large inputs. The main contribution of this paper is to use matrix factorization to solve very large problems of the above type, leading to much smaller and faster SDPs than previous approaches. This factorization comes from expanding the solution of the original problem in terms of the bottom eigenvectors of a graph Laplacian. As the smaller SDPs arising from this factorization only approximate the original problem, their solutions can be refined using gradient descent. The approach is illustrated on the localization of large-scale sensor networks.<br /><br />
<br />
==Sensor localization==<br />
Assuming sensors can only estimate their local pairwise distances to nearby sensors via radio transmitters, the problem is to identify the whole network topology. In other words, given n sensors with <math>d_{ij}</math> an estimate of the local distance between adjacent sensors i and j, the desired output is <math>x_1, x_2, ..., x_n \in R^2</math>, the planar coordinates of the sensors. <br />
<br />
===Work on this issue so far===<br />
Work on this issue so far starts with minimizing a sum-of-squares loss function,<br />
<math>\,\min_{x_1,...,x_n}\Sigma_{i\sim j}{(\|x_i-x_j\|^2-d_{ij}^2)^2}</math> (1)<br /> <br />
and adding a centering Constraint (assuming no sensor location is known in advance) as<br /> <math>\,\|\Sigma_i{x_i}\|^2 = 0</math> (2) <br /><br />
The problem here is that the optimization is not convex and is likely to be trapped in local minima. To solve this problem, an <math>n \times n</math> inner product matrix X is defined as <math>X_{ij} = x_i \cdot x_j</math>, and by relaxing the constraint that the sensor locations <math>x_i</math> lie in the <math>R^2</math> plane, the following convex formulation is obtained:<br /><br />
Minimize: <math>\,\Sigma_{i\sim j}{(X_{ii}-2X_{ij}+X_{jj}-d_{ij}^2)^2}</math> (3)<br /><br />
subject to: (i) <math>\,\Sigma_{ij}{X_{ij}=0}</math> and (ii) <math>X \succeq 0</math><br /><br />
The vectors <math>\,x_i</math> will lie in a subspace with dimensionality equal to the rank of the solution X. Projecting the <math>x_i</math>s into their 2D subspace of maximum variance, obtained from the top 2 eigenvectors of X, gives planar coordinates. However, the higher the rank of X, the greater the information loss after projection. The fact that the projection error increases with the rank leads us to favor low-rank, or equivalently, high-trace, solutions. Therefore, an extra term is added to favor solutions with high variance (high trace):<br /><br />
Maximize: <math>\,tr(X)-v\Sigma_{i\sim j}{(X_{ii}-2X_{ij}+X_{jj}-d_{ij}^2)^2}</math> (4)<br /><br />
subject to: (i) <math>\,\Sigma_{ij}{X_{ij}=0}</math> and (ii) <math>\,X \succeq 0</math><br /><br />
where the parameter <math>v>0</math> balances the trade-off between maximizing variance and preserving local distances; this is the maximum variance unfolding (MVU) objective.<br /><br />
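For concreteness, the objective in eq. (4) can be evaluated directly from a candidate Gram matrix. The sketch below is in Python with NumPy and is not the authors' code; the function name and the representation of edges and distances are our own choices for illustration.

```python
import numpy as np

def mvu_objective(X, edges, d, v):
    """Objective of eq. (4): tr(X) minus v times the squared violation
    of the local distance constraints on the neighborhood edges.

    X     -- n x n inner-product (Gram) matrix
    edges -- list of neighbor pairs (i, j)
    d     -- dict mapping (i, j) to the measured local distance
    v     -- trade-off parameter (> 0)
    """
    penalty = sum(
        (X[i, i] - 2 * X[i, j] + X[j, j] - d[(i, j)] ** 2) ** 2
        for i, j in edges
    )
    return np.trace(X) - v * penalty
```

When the distance constraints hold exactly, the penalty vanishes and the objective reduces to the trace (the variance term).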
<br />
==Matrix factorization==<br />
Assume G is the neighborhood graph defined by the sensor network, and that the sensor locations form a function defined over the nodes of this graph. Functions on a graph can be approximated using the eigenvectors of the graph's Laplacian matrix as basis functions (spectral graph theory).<br /><br />
The graph Laplacian is defined by:<br /><br />
<br />
<math> L_{i,j}= \left\{\begin{matrix} <br />
deg(v_i) & \text{if } i=j \\ <br />
-1 & \text{if } i\neq j \text{ and } v_i \text{ adjacent } v_j \\ <br />
0 & \text{ otherwise}\end{matrix}\right.</math><br />
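As a concrete illustration, the unnormalized Laplacian above can be assembled from an edge list in a few lines. This is a Python/NumPy sketch, not the paper's code; the node count and edges are arbitrary.

```python
import numpy as np

def graph_laplacian(n, edges):
    """Unnormalized graph Laplacian L = D - A of an undirected graph.

    n     -- number of nodes
    edges -- iterable of (i, j) pairs with 0 <= i, j < n
    """
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0   # adjacency matrix
    D = np.diag(A.sum(axis=1))    # degree matrix
    return D - A

# Example: a path graph 0-1-2.  Rows of L sum to zero, so the
# all-ones vector lies in the null space of L.
L = graph_laplacian(3, [(0, 1), (1, 2)])
```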
<br />
<br /><br />
<br />
The sensor locations can be approximated using the m bottom eigenvectors of the Laplacian matrix of G. Expanding the locations in this basis yields a matrix factorization for X:<br />
<math>x_i\approx\Sigma_{\alpha = 1}^m Q_{i\alpha}y_{\alpha}</math> <br /><br />
where Q is the <math>n \times m</math> matrix whose columns are the m bottom eigenvectors of the Laplacian matrix, and the vectors <math>y_{\alpha}</math> are unknown and depend on <math>d_{ij}</math>. Now, if we define the inner products of these vectors as <math>Y_{\alpha\beta} = y_{\alpha}\cdot y_{\beta}</math>, we get the factorization <math>X\approx QYQ^T</math> (6)<br /><br />
Using this approximation, we can solve an optimization over Y, which is much smaller than X. Since Q stores mutually orthogonal eigenvectors, it follows that <math>tr(Y)=tr(X)</math>. In addition, <math>QYQ^T</math> satisfies the centering constraint because the uniform eigenvector is excluded. Therefore, the optimization becomes:<br /><br />
Maximize: <math>tr(Y)-v\Sigma_{i\sim j}{((QYQ^T)_{ii}-2(QYQ^T)_{ij}+(QYQ^T)_{jj}-d_{ij}^2)^2}</math> (7)<br /><br />
subject to: <math>Y \succeq 0</math><br /><br />
<br />
==Formulation as SDP==<br />
Our goal is to cast the required optimization as an SDP over small matrices with few constraints. Let <math>y\in R^{m^2}</math> be the vector obtained by concatenating all the columns of Y, <math>A \in R^{m^2 \times m^2}</math> be a positive semidefinite matrix collecting all the quadratic coefficients in the objective function, <math>b\in R^{m^2}</math> be a vector collecting all the linear coefficients in the objective function, and <math>l</math> be an upper bound on the quadratic piece of the objective function. Using the Schur complement lemma to express this bound as a linear matrix inequality, we obtain the SDP:<br /><br />
<br /><br />
Maximize: <math>\,b^Ty - l</math> (9) <br /><br />
subject to: (i) <math>Y \succeq 0 </math> and (ii) <math>\begin{bmatrix} I & A^{1/2}y \\ (A^{1/2}y)^T & l \end{bmatrix} \succeq 0 </math><br /><br />
<br />
In this form, the only variables of the SDP are the <math>m(m+1)/2</math> distinct elements of Y and the unknown scalar l. The constraints reduce to the positive semidefiniteness of Y and a linear matrix inequality of size <math>m^2 \times m^2</math>. It is worth noting that the complexity of the SDP does not depend on the number of nodes (n) or edges in the network.<br /><br />
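The Schur complement argument behind constraint (ii) of eq. (9) says the block matrix is positive semidefinite exactly when <math>l \geq \|A^{1/2}y\|^2</math>. A small numeric sanity check follows (Python/NumPy sketch; the vector z stands in for <math>A^{1/2}y</math> and is made up for illustration):

```python
import numpy as np

def lmi_is_psd(z, l):
    """True iff the block matrix [[I, z], [z^T, l]] is positive
    semidefinite; by the Schur complement this holds iff l >= z^T z."""
    k = len(z)
    M = np.zeros((k + 1, k + 1))
    M[:k, :k] = np.eye(k)
    M[:k, k] = z
    M[k, :k] = z
    M[k, k] = l
    return bool(np.all(np.linalg.eigvalsh(M) >= -1e-9))

z = np.array([0.3, -0.4])   # hypothetical A^{1/2} y, so z^T z = 0.25
```

Maximizing <math>b^Ty - l</math> therefore drives l down to the value of the quadratic piece, recovering the original objective.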
<br />
==Gradient based improvement==<br />
As the matrix factorization only provides an approximation to the global minimum, the result is refined by using it as the initial starting point for gradient descent on the original loss function. This is done in two steps:<br /><br />
First, starting from the m-dimensional solution of eq. (6), use conjugate gradient methods to maximize the objective function in eq. (4).<br /><br />
Second, project the results from the previous step onto the <math>R^2</math> plane and use conjugate gradient methods to minimize the loss function in eq. (1).<br />
The conjugate gradient method is an iterative method for minimizing a quadratic function whose Hessian matrix (the matrix of second partial derivatives) is positive definite.<br />
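For reference, conjugate gradients on a quadratic <math>f(x) = \tfrac{1}{2}x^TAx - b^Tx</math> with positive definite Hessian A amount to iteratively solving <math>Ax=b</math>. A minimal textbook sketch follows (Python/NumPy; this is not the authors' implementation, which applies nonlinear conjugate gradients to eqs. (1) and (4)):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Minimize f(x) = 0.5 x^T A x - b^T x (equivalently, solve A x = b)
    for a symmetric positive definite matrix A."""
    n = len(b)
    x = np.zeros(n)
    r = b - A @ x              # residual = negative gradient of f
    p = r.copy()               # first search direction
    for _ in range(max_iter or n):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)       # exact line search step
        x += alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p  # conjugate direction
        r = r_new
    return x
```

In exact arithmetic the method terminates in at most n iterations, one per conjugate direction.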
<br />
==Results==<br />
The figure below<ref>
The same paper. Figure 2
</ref> shows the sensor locations inferred for the n = 1055 largest cities in the continental US. Local distances were estimated to up to 18 neighbors within radius r = 0.09, and the local measurements were corrupted by 10% Gaussian noise over the true local distance. Using the m = 10 bottom eigenvectors of the graph Laplacian, the solution provides a good initial starting point (left picture) for gradient-based improvement. The right picture shows the sensor locations after the improvement.<br /><br />
<br />
[[File:fig2L.jpg|left|thumb|400px|Initial Sensor locations]]<br />
[[File:fig2R.jpg|none|thumb|400px|Sensor locations after gradient improvement]]<br />
<br /><br />
<br />
The second simulated network, shown in the figure below, placed nodes at n=20,000 uniformly sampled points inside the unit square. Local distances were estimated to up to 20 other nodes within radius r = 0.06. Using the m = 10 bottom eigenvectors of the graph Laplacian, constructing and solving the SDP took 19s, and 100 iterations of conjugate gradient descent took 52s.<br /><br />
[[File:fig3.jpg|none|thumb|400px|Results on a simulated network with n=20000 uniformly distributed nodes inside a centered unit square]]<br />
<br /><br />
<br />
For the simulated network with nodes at US cities, the figure below plots the loss function in eq. (1) versus the number of eigenvectors, as well as the computation time versus the number of eigenvectors. The figure shows a trade-off between obtaining a better solution and increasing the computation time, and that <math>m\approx 10</math> best manages this trade-off.<br /><br />
[[File:fig4.jpg|none|thumb|400px|Left: The value of loss function. Right: The computation time]]<br />
<br />
==Conclusion==<br />
An approach for inferring low dimensional representations from local distance constraints using MVU was proposed.<br /><br />
Its main idea is a matrix factorization computed from the bottom eigenvectors of the graph Laplacian.<br /><br />
The initial solution can be refined by local search methods.<br /><br />
The approach is suitable for large inputs because the complexity of its SDP does not depend on the input size.<br />
<br />
=References=<br />
<references/></div>Myakhavehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=graph_Laplacian_Regularization_for_Larg-Scale_Semidefinite_Programming&diff=2801graph Laplacian Regularization for Larg-Scale Semidefinite Programming2009-07-10T14:53:37Z<p>Myakhave: /* Results */</p>
<hr />
<div>==Introduction==<br />
This paper<ref>K. Q. Weinberger et al. Graph Laplacian Regularization for Larg-Scale Semidefinite Programming, Advances in neural information processing systems, 2007 - cs.utah.edu<br />
</ref> introduces a new approach for the discovery of low dimensional representations of high-dimensional data where, in many cases, local proximity measurements are also available. Sensor localization is a quite well-known example in this field. Existing approaches use semidefinite programs (SDPs) with low rank solutions from convex optimization methods. However, the SDP approach does not scale well for large inputs. The main contribution of this paper is to use matrix factorization for solving very large problems of the above type, which leads to much smaller and faster SDPs than previous approaches. This factorization comes from expanding the solution of the original problem in terms of the bottom eigenvectors of a graph Laplacian. As the smaller SDPs coming from this factorization are only an approximation of the original problem, the solution can be refined using gradient descent. The approach is illustrated on the localization of large-scale sensor networks.<br /><br />
<br />
==Sensor localization==<br />
Assuming sensors can only estimate their local pairwise distances to nearby sensors via radio transmitters, the problem is to identify the whole network topology. In other words, knowing that we have n sensors with <math>d_{ij}</math> as an estimate of the local distance between adjacent sensors i and j, the desired output would be <math>x_1, x_2, ..., x_n \in R^2</math>, the planar coordinates of the sensors. <br />
<br />
===Work on this issue so far===<br />
Work on this problem starts by minimizing a sum-of-squares loss function,<br /><br />
<math>\,\min_{x_1,...,x_n}\Sigma_{i\sim j}{(\|x_i-x_j\|^2-d_{ij}^2)^2}</math> (1)<br /> <br />
and adding a centering constraint (assuming no sensor location is known in advance),<br /> <math>\,\|\Sigma_i{x_i}\|^2 = 0</math> (2) <br /><br />
The problem is that this optimization is not convex and is prone to being trapped in local minima. To address this, an <math>n \times n</math> inner product matrix X is defined as <math>X_{ij} = x_i \cdot x_j</math>, and by relaxing the constraint that the sensor locations <math>x_i</math> lie in the <math>R^2</math> plane, the following convex relaxation is obtained:<br /><br />
Minimize: <math>\,\Sigma_{i\sim j}{(X_{ii}-2X_{ij}+X_{jj}-d_{ij}^2)^2}</math> (3)<br /><br />
subject to: (i) <math>\,\Sigma_{ij}{X_{ij}=0}</math> and (ii) <math>X \succeq 0</math><br /><br />
The vectors <math>\,x_i</math> will lie in a subspace with dimensionality equal to the rank of the solution X. Projecting the <math>x_i</math> onto their 2D subspace of maximum variance, obtained from the top 2 eigenvectors of X, yields planar coordinates. However, the higher the rank of X, the greater the information loss after projection. Since the projection error grows with the rank, it is natural to add a low-rank, or equivalently, high-trace bias. Therefore, an extra term is added to favor solutions with high variance (high trace):<br /><br />
Maximize: <math>\,tr(X)-v\Sigma_{i\sim j}{(X_{ii}-2X_{ij}+X_{jj}-d_{ij}^2)^2}</math> (4)<br /><br />
subject to: (i) <math>\,\Sigma_{ij}{X_{ij}=0}</math> and (ii) <math>\,X \succeq 0</math><br /><br />
where the parameter <math>v>0</math> balances the trade-off between maximizing variance and preserving local distances; this is the maximum variance unfolding (MVU) objective.<br /><br />
<br />
==Matrix factorization==<br />
Assume G is the neighborhood graph defined by the sensor network, and that the sensor locations form a function defined over the nodes of this graph. Functions on a graph can be approximated using the eigenvectors of the graph's Laplacian matrix as basis functions (spectral graph theory).<br /><br />
The graph Laplacian is defined by:<br /><br />
<br />
<math> L_{i,j}= \left\{\begin{matrix} <br />
deg(v_i) & \text{if } i=j \\ <br />
-1 & \text{if } i\neq j \text{ and } v_i \text{ adjacent } v_j \\ <br />
0 & \text{ otherwise}\end{matrix}\right.</math><br />
<br />
<br /><br />
<br />
The sensor locations can be approximated using the m bottom eigenvectors of the Laplacian matrix of G. Expanding the locations in this basis yields a matrix factorization for X:<br />
<math>x_i\approx\Sigma_{\alpha = 1}^m Q_{i\alpha}y_{\alpha}</math> <br /><br />
where Q is the <math>n \times m</math> matrix whose columns are the m bottom eigenvectors of the Laplacian matrix, and the vectors <math>y_{\alpha}</math> are unknown and depend on <math>d_{ij}</math>. Now, if we define the inner products of these vectors as <math>Y_{\alpha\beta} = y_{\alpha}\cdot y_{\beta}</math>, we get the factorization <math>X\approx QYQ^T</math> (6)<br /><br />
Using this approximation, we can solve an optimization over Y, which is much smaller than X. Since Q stores mutually orthogonal eigenvectors, it follows that <math>tr(Y)=tr(X)</math>. In addition, <math>QYQ^T</math> satisfies the centering constraint because the uniform eigenvector is excluded. Therefore, the optimization becomes:<br /><br />
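The identity <math>tr(Y)=tr(X)</math> follows from the orthonormality of Q's columns, since <math>tr(QYQ^T)=tr(YQ^TQ)=tr(Y)</math>. A quick numeric check (Python/NumPy sketch, using a random orthonormal Q in place of actual Laplacian eigenvectors):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 5

# Random matrix with orthonormal columns, standing in for the bottom
# m eigenvectors of a graph Laplacian (which are also orthonormal).
Q, _ = np.linalg.qr(rng.standard_normal((n, m)))

B = rng.standard_normal((m, m))
Y = B @ B.T                 # an arbitrary positive semidefinite Y
X = Q @ Y @ Q.T             # the factorized Gram matrix of eq. (6)
```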
Maximize: <math>tr(Y)-v\Sigma_{i\sim j}{((QYQ^T)_{ii}-2(QYQ^T)_{ij}+(QYQ^T)_{jj}-d_{ij}^2)^2}</math> (7)<br /><br />
subject to: <math>Y \succeq 0</math><br /><br />
<br />
==Formulation as SDP==<br />
Our goal is to cast the required optimization as an SDP over small matrices with few constraints. Let <math>y\in R^{m^2}</math> be the vector obtained by concatenating all the columns of Y, <math>A \in R^{m^2 \times m^2}</math> be a positive semidefinite matrix collecting all the quadratic coefficients in the objective function, <math>b\in R^{m^2}</math> be a vector collecting all the linear coefficients in the objective function, and <math>l</math> be an upper bound on the quadratic piece of the objective function. Using the Schur complement lemma to express this bound as a linear matrix inequality, we obtain the SDP:<br /><br />
<br /><br />
Maximize: <math>\,b^Ty - l</math> (9) <br /><br />
subject to: (i) <math>Y \succeq 0 </math> and (ii) <math>\begin{bmatrix} I & A^{1/2}y \\ (A^{1/2}y)^T & l \end{bmatrix} \succeq 0 </math><br /><br />
<br />
In this form, the only variables of the SDP are the <math>m(m+1)/2</math> distinct elements of Y and the unknown scalar l. The constraints reduce to the positive semidefiniteness of Y and a linear matrix inequality of size <math>m^2 \times m^2</math>. It is worth noting that the complexity of the SDP does not depend on the number of nodes (n) or edges in the network.<br /><br />
<br />
==Gradient based improvement==<br />
As the matrix factorization only provides an approximation to the global minimum, the result is refined by using it as the initial starting point for gradient descent on the original loss function. This is done in two steps:<br /><br />
First, starting from the m-dimensional solution of eq. (6), use conjugate gradient methods to maximize the objective function in eq. (4).<br /><br />
Second, project the results from the previous step onto the <math>R^2</math> plane and use conjugate gradient methods to minimize the loss function in eq. (1).<br />
The conjugate gradient method is an iterative method for minimizing a quadratic function whose Hessian matrix (the matrix of second partial derivatives) is positive definite.<br />
<br />
==Results==<br />
The figure below<ref><br />
The same paper. Figure 1<br />
</ref> shows the sensor locations inferred for the n = 1055 largest cities in the continental US. Local distances were estimated to up to 18 neighbors within radius r = 0.09, and the local measurements were corrupted by 10% Gaussian noise over the true local distance. Using the m = 10 bottom eigenvectors of the graph Laplacian, the solution provides a good initial starting point (left picture) for gradient-based improvement. The right picture shows the sensor locations after the improvement.<br /><br />
<br />
[[File:fig2L.jpg|left|thumb|400px|Initial Sensor locations]]<br />
[[File:fig2R.jpg|none|thumb|400px|Sensor locations after gradient improvement]]<br />
<br /><br />
<br />
The second simulated network, shown in the figure below, placed nodes at n=20,000 uniformly sampled points inside the unit square. Local distances were estimated to up to 20 other nodes within radius r = 0.06. Using the m = 10 bottom eigenvectors of the graph Laplacian, constructing and solving the SDP took 19s, and 100 iterations of conjugate gradient descent took 52s.<br /><br />
[[File:fig3.jpg|none|thumb|400px|Results on a simulated network with n=20000 uniformly distributed nodes inside a centered unit square]]<br />
<br /><br />
<br />
For the simulated network with nodes at US cities, the figure below plots the loss function in eq. (1) versus the number of eigenvectors, as well as the computation time versus the number of eigenvectors. The figure shows a trade-off between obtaining a better solution and increasing the computation time, and that <math>m\approx 10</math> best manages this trade-off.<br /><br />
[[File:fig4.jpg|none|thumb|400px|Left: The value of loss function. Right: The computation time]]<br />
<br />
==Conclusion==<br />
An approach for inferring low dimensional representations from local distance constraints using MVU was proposed.<br /><br />
Its main idea is a matrix factorization computed from the bottom eigenvectors of the graph Laplacian.<br /><br />
The initial solution can be refined by local search methods.<br /><br />
The approach is suitable for large inputs because the complexity of its SDP does not depend on the input size.<br />
<br />
=References=<br />
<references/></div>Myakhavehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=graph_Laplacian_Regularization_for_Larg-Scale_Semidefinite_Programming&diff=2725graph Laplacian Regularization for Larg-Scale Semidefinite Programming2009-07-09T03:25:47Z<p>Myakhave: /* Introduction */</p>
<hr />
<div>==Introduction==<br />
This paper<ref>K. Q. Weinberger et al. Graph Laplacian Regularization for Larg-Scale Semidefinite Programming, Advances in neural information processing systems, 2007 - cs.utah.edu<br />
</ref> introduces a new approach for the discovery of low dimensional representations of high-dimensional data where, in many cases, local proximity measurements are also available. Sensor localization is an example. Existing approaches use semidefinite programs (SDPs) with low rank solutions from convex optimization methods. However, the SDP approach does not scale well for large inputs. The main contribution of this paper is to use matrix factorization for solving very large problems of the above type, which leads to much smaller and faster SDPs than previous approaches. This factorization comes from expanding the solution of the original problem in terms of the bottom eigenvectors of a graph Laplacian. As the smaller SDPs coming from this factorization are only an approximation of the original problem, the solution can be refined using gradient descent. The approach is illustrated on the localization of large-scale sensor networks.<br /><br />
<br />
==Sensor localization==<br />
Assuming only nearby sensors can estimate their local pairwise distances via radio transmitters, the problem is to identify the whole network topology. In other words, knowing that we have n sensors with <math>d_{ij}</math> as an estimate of the local distance between adjacent sensors i and j, the desired output would be <math>x_1, x_2, ..., x_n \in R^2</math>, the planar coordinates of the sensors. <br />
<br />
===Work on this issue so far===<br />
Work on this problem starts by minimizing a sum-of-squares loss function,<br /><br />
<math>\,\min_{x_1,...,x_n}\Sigma_{i\sim j}{(\|x_i-x_j\|^2-d_{ij}^2)^2}</math> (1)<br /> <br />
and adding a centering constraint (assuming no sensor location is known in advance),<br /> <math>\,\|\Sigma_i{x_i}\|^2 = 0</math> (2) <br /><br />
The problem is that this optimization is not convex and is prone to being trapped in local minima. To address this, an <math>n \times n</math> inner product matrix X is defined as <math>X_{ij} = x_i \cdot x_j</math>, and by relaxing the constraint that the sensor locations <math>x_i</math> lie in the <math>R^2</math> plane, the following convex relaxation is obtained:<br /><br />
Minimize: <math>\,\Sigma_{i\sim j}{(X_{ii}-2X_{ij}+X_{jj}-d_{ij}^2)^2}</math> (3)<br /><br />
subject to: (i) <math>\,\Sigma_{ij}{X_{ij}=0}</math> and (ii) <math>X \succeq 0</math><br /><br />
The vectors <math>\,x_i</math> will lie in a subspace with dimensionality equal to the rank of the solution X. Projecting the <math>x_i</math> onto their 2D subspace of maximum variance, obtained from the top 2 eigenvectors of X, yields planar coordinates. However, the higher the rank of X, the greater the information loss after projection. Since the projection error grows with the rank, it is natural to add a low-rank, or equivalently, high-trace bias. Therefore, an extra term is added to favor solutions with high variance (high trace):<br /><br />
Maximize: <math>\,tr(X)-v\Sigma_{i\sim j}{(X_{ii}-2X_{ij}+X_{jj}-d_{ij}^2)^2}</math> (4)<br /><br />
subject to: (i) <math>\,\Sigma_{ij}{X_{ij}=0}</math> and (ii) <math>\,X \succeq 0</math><br /><br />
where the parameter <math>v>0</math> balances the trade-off between maximizing variance and preserving local distances; this is the maximum variance unfolding (MVU) objective.<br /><br />
<br />
==Matrix factorization==<br />
Assume G is the neighborhood graph defined by the sensor network, and that the sensor locations form a function defined over the nodes of this graph. Functions on a graph can be approximated using the eigenvectors of the graph's Laplacian matrix as basis functions (spectral graph theory).<br /><br />
The graph Laplacian is defined by:<br /><br />
<br />
<math> L_{i,j}= \left\{\begin{matrix} <br />
deg(v_i) & \text{if } i=j \\ <br />
-1 & \text{if } i\neq j \text{ and } v_i \text{ adjacent } v_j \\ <br />
0 & \text{ otherwise}\end{matrix}\right.</math><br />
<br />
<br /><br />
<br />
The sensor locations can be approximated using the m bottom eigenvectors of the Laplacian matrix of G. Expanding the locations in this basis yields a matrix factorization for X:<br />
<math>x_i\approx\Sigma_{\alpha = 1}^m Q_{i\alpha}y_{\alpha}</math> <br /><br />
where Q is the <math>n \times m</math> matrix whose columns are the m bottom eigenvectors of the Laplacian matrix, and the vectors <math>y_{\alpha}</math> are unknown and depend on <math>d_{ij}</math>. Now, if we define the inner products of these vectors as <math>Y_{\alpha\beta} = y_{\alpha}\cdot y_{\beta}</math>, we get the factorization <math>X\approx QYQ^T</math> (6)<br /><br />
Using this approximation, we can solve an optimization over Y, which is much smaller than X. Since Q stores mutually orthogonal eigenvectors, it follows that <math>tr(Y)=tr(X)</math>. In addition, <math>QYQ^T</math> satisfies the centering constraint because the uniform eigenvector is excluded. Therefore, the optimization becomes:<br /><br />
Maximize: <math>tr(Y)-v\Sigma_{i\sim j}{((QYQ^T)_{ii}-2(QYQ^T)_{ij}+(QYQ^T)_{jj}-d_{ij}^2)^2}</math> (7)<br /><br />
subject to: <math>Y \succeq 0</math><br /><br />
<br />
==Formulation as SDP==<br />
Our goal is to cast the required optimization as an SDP over small matrices with few constraints. Let <math>y\in R^{m^2}</math> be the vector obtained by concatenating all the columns of Y, <math>A \in R^{m^2 \times m^2}</math> be a positive semidefinite matrix collecting all the quadratic coefficients in the objective function, <math>b\in R^{m^2}</math> be a vector collecting all the linear coefficients in the objective function, and <math>l</math> be an upper bound on the quadratic piece of the objective function. Using the Schur complement lemma to express this bound as a linear matrix inequality, we obtain the SDP:<br /><br />
<br /><br />
Maximize: <math>\,b^Ty - l</math> (9) <br /><br />
subject to: (i) <math>Y \succeq 0 </math> and (ii) <math>\begin{bmatrix} I & A^{1/2}y \\ (A^{1/2}y)^T & l \end{bmatrix} \succeq 0 </math><br /><br />
<br />
In this form, the only variables of the SDP are the <math>m(m+1)/2</math> distinct elements of Y and the unknown scalar l. The constraints reduce to the positive semidefiniteness of Y and a linear matrix inequality of size <math>m^2 \times m^2</math>. It is worth noting that the complexity of the SDP does not depend on the number of nodes (n) or edges in the network.<br /><br />
<br />
==Gradient based improvement==<br />
As the matrix factorization only provides an approximation to the global minimum, the result is refined by using it as the initial starting point for gradient descent on the original loss function. This is done in two steps:<br /><br />
First, starting from the m-dimensional solution of eq. (6), use conjugate gradient methods to maximize the objective function in eq. (4).<br /><br />
Second, project the results from the previous step onto the <math>R^2</math> plane and use conjugate gradient methods to minimize the loss function in eq. (1).<br />
The conjugate gradient method is an iterative method for minimizing a quadratic function whose Hessian matrix (the matrix of second partial derivatives) is positive definite.<br />
<br />
==Results==<br />
The figure below shows the sensor locations inferred for the n = 1055 largest cities in the continental US. Local distances were estimated to up to 18 neighbors within radius r = 0.09, and the local measurements were corrupted by 10% Gaussian noise over the true local distance. Using the m = 10 bottom eigenvectors of the graph Laplacian, the solution provides a good initial starting point (left picture) for gradient-based improvement. The right picture shows the sensor locations after the improvement.<br /><br />
<br />
[[File:fig2L.jpg|left|thumb|400px|Initial Sensor locations]]<br />
[[File:fig2R.jpg|none|thumb|400px|Sensor locations after gradient improvement]]<br />
<br /><br />
<br />
The second simulated network, shown in the figure below, placed nodes at n=20,000 uniformly sampled points inside the unit square. Local distances were estimated to up to 20 other nodes within radius r = 0.06. Using the m = 10 bottom eigenvectors of the graph Laplacian, constructing and solving the SDP took 19s, and 100 iterations of conjugate gradient descent took 52s.<br /><br />
[[File:fig3.jpg|none|thumb|400px|Results on a simulated network with n=20000 uniformly distributed nodes inside a centered unit square]]<br />
<br /><br />
<br />
For the simulated network with nodes at US cities, the figure below plots the loss function in eq. (1) versus the number of eigenvectors, as well as the computation time versus the number of eigenvectors. The figure shows a trade-off between obtaining a better solution and increasing the computation time, and that <math>m\approx 10</math> best manages this trade-off.<br /><br />
[[File:fig4.jpg|none|thumb|400px|Left: The value of loss function. Right: The computation time]]<br />
<br />
==Conclusion==<br />
An approach for inferring low dimensional representations from local distance constraints using MVU was proposed.<br /><br />
Its main idea is a matrix factorization computed from the bottom eigenvectors of the graph Laplacian.<br /><br />
The initial solution can be refined by local search methods.<br /><br />
The approach is suitable for large inputs because the complexity of its SDP does not depend on the input size.<br />
<br />
=References=<br />
<references/></div>Myakhavehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=nonlinear_Dimensionality_Reduction_by_Semidefinite_Programming_and_Kernel_Matrix_Factorization&diff=2724nonlinear Dimensionality Reduction by Semidefinite Programming and Kernel Matrix Factorization2009-07-09T03:22:40Z<p>Myakhave: /* References */</p>
<hr />
<div>==Introduction ==<br />
In recent work, semidefinite embedding (SDE), which learns a kernel matrix by maximizing variance while preserving the distances and angles between nearest neighbors, has been introduced. Although it has many advantages, such as the convexity of its optimization, it suffers from high computational cost on large problems. In this paper <ref>K. Q. Weinberger et al. Nonlinear Dimensionality Reduction by Semidefinite Programming and Kernel Matrix Factorization, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS-05), pages 381-388, Barbados, West Indies, 2005.</ref>, a new framework is proposed, based on a factorization of the entire kernel matrix in terms of a much smaller submatrix of inner products between randomly chosen landmarks.<br />
<br />
==Semidefinite Embedding==<br />
Given high dimensional vectors <math>\,\{x_1, x_2, ..., x_n\} \in R^D </math> lying on or near a manifold that can be embedded in <math>\,d\ll D</math> dimensions to produce low dimensional vectors <math>\,\{y_1, y_2, ..., y_n\} \in R^d</math>, the goal is to find d and produce an appropriate embedding. The algorithm starts by computing the k-nearest neighbors of each input and adding constraints to preserve the distances and angles between k-nearest neighbors:<br /><br />
<math>\,\|y_i-y_j\|^2 = \|x_i-x_j\|^2 </math> for all (i,j) in k-nearest neighbors (2)<br /><br />
and also a constraint on outputs to be centerd on the origin:<br /><br />
<math>\,\sum_i{y_i} = 0 </math> (3) <br /><br />
The final step maximizes the variance of the outputs:<br />
<math>\,var(y)=\sum_i{\|y_i\|^2} </math> (4)<br /><br />
Assuming <math>\,K_{ij}=y_i \cdot y_j</math>, the above optimization problem can be reformulated as an SDP:<br /><br />
Maximize trace(K) subject to :<br /><br />
1) <math>\,K\succeq 0 </math> <br /><br />
2)<math>\,\sum_{ij}K_{ij}=0 </math> <br /><br />
3) For all (i,j) such that <math>\,\eta_{ij}=1 </math>, <br /><br />
<math>\,K_{ii}-2K_{ij}+K_{jj}=\|x_i-x_j\|^2</math><br /><br />
The top d eigenvalues and eigenvectors of the kernel matrix are then used to derive the embedding. Here, learning the kernel matrix dominates the total computation time.<br /><br />
<br />
==Kernel Matrix Factorization==<br />
As we saw in the last section, learning the kernel matrix is the dominant cost. Therefore, if we can approximate it by a much smaller matrix, the computation time will be drastically reduced. This can be done through the following steps:<br /><br />
First, reconstruct each point of the high dimensional dataset <math>\,\{x_i\}_{i=1}^n</math> from m randomly chosen landmarks <math>\,\{\mu_{\alpha}\}_{\alpha=1}^m</math>:<br /><br />
<math>\,\hat{x_i} = \sum_{\alpha}{Q_{i\alpha}\mu_{\alpha}} </math><br /><br />
Based on intuition similar to that of locally linear embedding (LLE), the same linear transformation Q can be used to reconstruct the outputs:<br /><br />
<math>\,\hat{y_i} = \sum_{\alpha}{Q_{i\alpha}l_{\alpha}} </math> (6)<br /> <br />
Now, if we make the approximation:<br /><br />
<math>\,K_{ij}=y_i.y_j=\hat{y_i}.\hat{y_j}</math> (7)<br /><br />
substituting (6) into (7) gives <math>\,K\approx QLQ^T</math>, where <math>\,L_{\alpha\beta} = l_{\alpha}\cdot l_{\beta}</math><br />
<br />
==Reconstructing from landmarks==<br />
To derive Q, we assume that the manifold can be locally approximated by a linear subspace. Therefore, each input in the high dimensional space can be reconstructed by a weighted sum of its r nearest neighbors. These weights are found by:<br /><br />
Minimize: <math>\,\varepsilon(W)=\sum_i{\|x_i-\Sigma_j{W_{ij}x_j}\|^2}</math><br /><br />
subject to: <math>\,\Sigma_j{W_{ij}}=1</math> for all i<br /><br />
and <math>\,W_{ij}=0</math> if <math>\,x_j</math> is not one of the r nearest neighbors of <math>\,x_i</math><br /><br />
Rewriting the reconstruction error as function of inputs will give us: <br />
<math>\,\varepsilon(X)=\sum_{ij}{\phi_{ij}x_i.x_j}</math><br /><br />
where <math>\,\phi = (I_n-W)^T(I_n-W)</math> or <br /><br />
<math>\,\phi=\begin{bmatrix} \overbrace{\phi^{ll}}^{m} & \overbrace{\phi^{lu}}^{n-m} \\ \phi^{ul} & \phi^{uu}\end{bmatrix} </math><br /><br />
<br />
<br />
and the solution will be:<br /><br />
<math>\,Q=\begin{bmatrix} I_m \\ -(\phi^{uu})^{-1}\phi^{ul}\end{bmatrix}</math><br /><br />
As the <math>\,W_{ij}</math> are invariant to translations and rotations of each input and its r nearest neighbors, the same weights can be used to reconstruct <math>\,y_i</math>.<br />
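The closed form for Q above can be computed directly from the blocks of <math>\,\phi</math>. The sketch below is Python/NumPy under our own naming, with a linear solve in place of the explicit inverse; it assumes the m landmark rows and columns of <math>\,\phi</math> are ordered first.

```python
import numpy as np

def landmark_reconstruction(Phi, m):
    """Given Phi = (I - W)^T (I - W) with the m landmark rows/columns
    ordered first, return Q = [I_m; -(Phi^uu)^{-1} Phi^ul]."""
    Phi_uu = Phi[m:, m:]                      # non-landmark block
    Phi_ul = Phi[m:, :m]                      # coupling block
    lower = -np.linalg.solve(Phi_uu, Phi_ul)  # avoids forming an inverse
    return np.vstack([np.eye(m), lower])
```

By construction, the non-landmark rows of Q satisfy <math>\,\phi^{uu}Q^u + \phi^{ul} = 0</math>, which is the stationarity condition of the reconstruction error.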
<br />
==Embedding the landmarks==<br />
Considering the factorization <math>\,K\approx QLQ^T</math>, we will get the following SDP:<br /><br />
Maximize trace(<math>\,QLQ^T</math>) subject to :<br /><br />
<br /><br />
1) <math>\,L\succeq 0 </math> <br /><br />
2)<math>\,\sum_{ij}(QLQ^T)_{ij}=0 </math> <br /><br />
3) For all (i,j) such that <math>\,\eta_{ij}=1 </math>, <br /><br />
<math>\,(QLQ^T)_{ii}-2(QLQ^T)_{ij}+(QLQ^T)_{jj}\leq\|x_i-x_j\|^2</math><br /><br />
<br /><br />
Because the matrix factorization is only an approximation, the distance constraint changes from an equality to an inequality.<br /><br />
<br />
The computation time in semidefinite programming depends on the matrix size and the number of constraints. Here, the matrix size in lSDE is much smaller than in SDE, while the number of constraints is the same. However, the constraints in lSDE are not sparse, and this may make lSDE even slower than SDE. To overcome this problem, one can feed the solver an initial subset of constraints, such as just centering and semidefiniteness, instead of the whole set. If the solution violates any of the unused constraints, these are added to the problem and the SDP solver is run again.<br />
<br />
==Experimental Results==<br />
In the first experiment (Figure 1), only 1205 of the 43182 distance constraints had to be enforced, as reported in Table 2.<br /><br />
[[File:Fig1.jpg|left|thumb|400px|Figure 1]] <br />
[[File:T1.jpg|none|thumb|400px|Table 1]]<br />
<br />
In the second experiment, summarized in Table 1, many expected neighbors were preserved despite the huge dimensionality reduction from D=60000 to d=5.<br /><br />
[[File:Fig3_.jpg|left|thumb|400px|Figure 3]]<br />
[[File:Fig4_.jpg|none|thumb|400px|Figure 4]]<br />
<br />
<br />
In the third experiment, shown in Figure 2, the results from lSDE differ from those of SDE, but they approach each other as the number of landmarks increases. Here, as shown in Figure 4, lSDE is slower than SDE: the dataset is small and has a particular cyclic structure, so the incremental scheme for adding constraints does not work well.<br />
<br />
[[File:Fig2.jpg|left|thumb|800px|Figure 2]]<br />
[[File:T2.jpg|none|thumb|800px|Table 2]]<br />
<br />
==References==<br />
<references/></div>Myakhavehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=graph_Laplacian_Regularization_for_Larg-Scale_Semidefinite_Programming&diff=2720graph Laplacian Regularization for Larg-Scale Semidefinite Programming2009-07-09T03:19:31Z<p>Myakhave: /* Introduction */</p>
<hr />
<div>==Introduction==<br />
This paper <ref>K. Q. Weinberger, F. Sha, Q. Zhu, and L. K. Saul. Graph Laplacian Regularization for Large-Scale Semidefinite Programming. In Advances in Neural Information Processing Systems 19 (NIPS), 2006.</ref> proposes a new approach for discovering low-dimensional representations of high-dimensional data when local proximity measurements are also available; sensor localization is an example. Existing approaches use semidefinite programs (SDPs) whose low-rank solutions are found by convex optimization, but the SDP approach does not scale well to large inputs. The main contribution of this paper is a matrix factorization that makes very large problems of this type tractable, leading to much smaller and faster SDPs than previous formulations. The factorization comes from expanding the solution of the original problem in terms of the bottom eigenvectors of a graph Laplacian. Since the smaller SDPs obtained from this factorization only approximate the original problem, the solution can be refined by gradient descent. The approach is illustrated on the localization of large-scale sensor networks.<br /><br />
<br />
==Sensor localization==<br />
Assuming that only nearby sensors can estimate their local pairwise distances via radio transmitters, the problem is to identify the topology of the whole network. In other words, given n sensors with <math>d_{ij}</math> as an estimate of the local distance between adjacent sensors i and j, the desired output is the planar coordinates <math>x_1, x_2, ..., x_n \in R^2</math> of the sensors. <br />
<br />
===Previous work===<br />
Previous work starts with minimizing the sum-of-squares loss function<br />
<math>\,\min_{x_1,...,x_n}\Sigma_{i\sim j}{(\|x_i-x_j\|^2-d_{ij}^2)^2}</math> (1)<br /> <br />
and adding a centering constraint (assuming no sensor location is known in advance):<br /> <math>\,\|\Sigma_i{x_i}\|^2 = 0</math> (2) <br /><br />
The problem here is that this optimization is not convex and is likely to be trapped in local minima. To avoid this, an <math>n\times n</math> inner product matrix X is defined by <math>X_{ij} = x_i \cdot x_j</math>, and by relaxing the constraint that the sensor locations <math>x_i</math> lie in the plane <math>R^2</math>, the following convex formulation is obtained:<br /><br />
Minimize: <math>\,\Sigma_{i\sim j}{(X_{ii}-2X_{ij}+X_{jj}-d_{ij}^2)^2}</math> (3)<br /><br />
subject to: (i) <math>\,\Sigma_{ij}{X_{ij}=0}</math> and (ii) <math>X \succeq 0</math><br /><br />
The vectors <math>\,x_i</math> will lie in a subspace with dimension equal to the rank of the solution X. Projecting the <math>x_i</math> into their 2D subspace of maximum variance, obtained from the top 2 eigenvectors of X, yields planar coordinates. However, the higher the rank of X, the greater the information loss after projection. Since the projection error grows with the rank, solutions of low rank, or equivalently high trace, are encouraged by adding an extra term that favors solutions with high variance (high trace):<br /><br />
Maximize: <math>\,tr(X)-v\Sigma_{i\sim j}{(X_{ii}-2X_{ij}+X_{jj}-d_{ij}^2)^2}</math> (4)<br /><br />
subject to: (i) <math>\,\Sigma_{ij}{X_{ij}=0}</math> and (ii) <math>\,X \succeq 0</math><br /><br />
where the parameter <math>v>0</math> balances the trade-off between maximizing variance and preserving local distances (maximum variance unfolding, MVU).<br /><br />
<br />
==Matrix factorization==<br />
Assume G is the neighborhood graph defined by the sensor network and that the location of the sensors is a function defined over the nodes of this graph. By spectral graph theory, functions on a graph can be approximated using the eigenvectors of the graph's Laplacian matrix as basis functions.<br /><br />
The graph Laplacian is defined by:<br /><br />
<br />
<math> L_{i,j}= \left\{\begin{matrix} <br />
deg(v_i) & \text{if } i=j \\ <br />
-1 & \text{if } i\neq j \text{ and } v_i \text{ adjacent } v_j \\ <br />
0 & \text{ otherwise}\end{matrix}\right.</math><br />
<br />
<br /><br />
<br />
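As a minimal numpy sketch (assuming an unweighted, connected graph given as an edge list), the Laplacian and its bottom eigenvectors can be computed as follows; the constant eigenvector belonging to eigenvalue 0 is dropped, since uniform eigenvectors are excluded from the expansion:

```python
import numpy as np

def bottom_laplacian_eigenvectors(edges, n, m):
    """Build the unnormalized graph Laplacian L = D - A from an edge
    list on n nodes and return its m bottom eigenvectors, skipping the
    constant eigenvector (eigenvalue 0 for a connected graph)."""
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    L = np.diag(A.sum(axis=1)) - A      # degree matrix minus adjacency
    vals, vecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    return vecs[:, 1:m + 1]             # columns = m bottom eigenvectors
```

The returned columns are mutually orthonormal and orthogonal to the constant vector, which is exactly what the factorization below relies on.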
The sensor locations can be approximated using the m bottom eigenvectors of the Laplacian matrix of G. Expanding these locations yields a matrix factorization for X:<br />
<math>x_i\approx\Sigma_{\alpha = 1}^m Q_{i\alpha}y_{\alpha}</math> <br /><br />
where Q is the <math>n\times m</math> matrix whose columns are the m bottom eigenvectors of the Laplacian matrix and the <math>y_{\alpha}</math> are unknown vectors that depend on the <math>d_{ij}</math>. Now, defining the inner products of these vectors as <math>Y_{\alpha\beta} = y_{\alpha} \cdot y_{\beta}</math>, we get the factorized matrix <math>X\approx QYQ^T</math> (6)<br /><br />
Using this approximation, we can solve an optimization over Y that is much smaller than the one over X. Since Q stores mutually orthogonal eigenvectors, <math>tr(QYQ^T)=tr(Y)</math>, so the trace term is unchanged. In addition, <math>QYQ^T</math> satisfies the centering constraint because uniform eigenvectors are not included. Therefore, the optimization becomes:<br /><br />
Maximize: <math>tr(Y)-v\Sigma_{i\sim j}{((QYQ^T)_{ii}-2(QYQ^T)_{ij}+(QYQ^T)_{jj}-d_{ij}^2)^2}</math> (7)<br /><br />
subject to: <math>Y \succeq 0</math><br /><br />
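A small numpy sketch of evaluating objective (7) for a candidate Y (the function name is ours; in practice this objective is maximized by the SDP solver rather than evaluated directly):

```python
import numpy as np

def factored_objective(Y, Q, edges, d2, v):
    """Objective (7): tr(Y) minus v times the sum of squared local
    distance errors, with X approximated by Q Y Q^T."""
    X = Q @ Y @ Q.T
    err = sum((X[i, i] - 2 * X[i, j] + X[j, j] - d2[(i, j)]) ** 2
              for i, j in edges)
    return np.trace(Y) - v * err
```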
<br />
==Formulation as SDP==<br />
Our goal is to cast the required optimization as an SDP over small matrices with few constraints. Let <math>y\in R^{m^2}</math> be the vector obtained by concatenating the columns of Y, let <math>A \in R^{m^2 \times m^2}</math> be a positive semidefinite matrix collecting the quadratic coefficients of the objective function, let <math>b\in R^{m^2}</math> be a vector collecting its linear coefficients, and let <math>l</math> be an upper bound on the quadratic piece of the objective function. Using Schur's lemma to express this bound as a linear matrix inequality, we obtain the SDP:<br /><br />
<br /><br />
Maximize: <math>\,b^Ty - l</math> (9) <br /><br />
subject to: (i) <math>Y \succeq 0 </math> and (ii) <math>\begin{bmatrix} I & A^{1/2}y \\ (A^{1/2}y)^T & l \end{bmatrix} \succeq 0 </math><br /><br />
<br />
In this form, the only variables of the SDP are the <math>m(m+1)/2</math> elements of Y and the unknown scalar l, and the constraints reduce to the positive semidefiniteness of Y and a linear matrix inequality of size <math>(m^2+1)\times(m^2+1)</math>. It is worth noting that the complexity of the SDP does not depend on the number of nodes (n) or edges in the network.<br /><br />
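The Schur-complement step can be checked numerically: the bordered matrix in eq. (9) is positive semidefinite exactly when <math>l \ge \|A^{1/2}y\|^2</math>. A small numpy sketch (function name ours):

```python
import numpy as np

def schur_bound_holds(A_half, y, l, tol=1e-9):
    """Check the linear matrix inequality in eq. (9):
    [[I, A^{1/2} y], [(A^{1/2} y)^T, l]] >= 0,
    which by Schur's lemma is equivalent to l >= ||A^{1/2} y||^2."""
    v = A_half @ y
    M = np.block([[np.eye(len(v)), v[:, None]],
                  [v[None, :], np.array([[l]])]])
    return np.linalg.eigvalsh(M).min() >= -tol
```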
<br />
==Gradient based improvement==<br />
As the matrix factorization only provides an approximation to the global minimum, the solution is refined by using it as the initial starting point for gradient descent on the first equation. This is done in two steps:<br /><br />
First, starting from the m-dimensional solution of eq. (6), use conjugate gradient methods to maximize the objective function in eq. (4).<br /><br />
Second, project the results from the previous step into the <math>R^2</math> plane and use conjugate gradient methods to minimize the loss function in eq. (1).<br />
The conjugate gradient method is an iterative method for minimizing a quadratic function whose Hessian matrix (the matrix of second partial derivatives) is positive definite.<br />
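The second refinement step can be sketched as follows, with plain gradient descent standing in for the conjugate gradient method used in the paper (step size and iteration count are illustrative):

```python
import numpy as np

def refine_positions(x0, edges, d2, lr=0.01, iters=500):
    """Locally refine planar coordinates x0 (n x 2) by gradient descent
    on the loss in eq. (1): sum over edges of (||x_i-x_j||^2 - d_ij^2)^2."""
    x = x0.copy()
    for _ in range(iters):
        g = np.zeros_like(x)
        for i, j in edges:
            diff = x[i] - x[j]
            r = diff @ diff - d2[(i, j)]   # residual of one distance term
            g[i] += 4 * r * diff           # analytic gradient of the loss
            g[j] -= 4 * r * diff
        x -= lr * g
    return x
```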
<br />
==Results==<br />
The figure below shows the sensor locations inferred for the n = 1055 largest cities in the continental US. Local distances were estimated to up to 18 neighbors within radius r = 0.09, and the local measurements were corrupted by 10% Gaussian noise over the true local distance. Using the m = 10 bottom eigenvectors of the graph Laplacian, the solution provides a good initial starting point (left picture) for gradient-based improvement. The right picture shows the sensor locations after the improvement.<br /><br />
<br />
[[File:fig2L.jpg|left|thumb|400px|Initial Sensor locations]]<br />
[[File:fig2R.jpg|none|thumb|400px|Sensor locations after gradient improvement]]<br />
<br /><br />
<br />
The second simulated network, shown in the figure below, placed nodes at n=20,000 uniformly sampled points inside the unit square. Local distances were estimated to up to 20 other nodes within radius r = 0.06. Using the m = 10 bottom eigenvectors of the graph Laplacian, it took 19s to construct and solve the SDP and 52s for 100 iterations of conjugate gradient descent.<br /><br />
[[File:fig3.jpg|none|thumb|400px|Results on a simulated network with n=20000 uniformly distributed nodes inside a centered unit square]]<br />
<br /><br />
<br />
For the simulated network with nodes at US cities, the figure below plots the loss function in eq. (1) versus the number of eigenvectors, as well as the computation time versus the number of eigenvectors. The figure shows a trade-off between obtaining a better solution and increasing computation time, and that <math>m\approx 10</math> best manages this trade-off.<br /><br />
[[File:fig4.jpg|none|thumb|400px|Left: The value of loss function. Right: The computation time]]<br />
<br />
==Conclusion==<br />
An approach for inferring low-dimensional representations from local distance constraints using MVU was proposed.<br /><br />
The main idea of the approach is a matrix factorization computed from the bottom eigenvectors of the graph Laplacian.<br /><br />
The initial solution can be refined by local search methods.<br /><br />
The approach is suitable for large inputs, and the complexity of its SDP does not depend on the size of the input.<br />
<br />
==References==<br />
<references/></div>Myakhave