visualizing Data using t-SNE: Difference between revisions
Line 7: | Line 7: | ||
<br> <center> <math> \mathbf p_{j|i} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma_i ^2 )}{\sum_{k \neq i} \exp(-||x_i-x_k ||^2/ 2\sigma_i ^2 ) }</math> </center> | <br> <center> <math> \mathbf p_{j|i} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma_i ^2 )}{\sum_{k \neq i} \exp(-||x_i-x_k ||^2/ 2\sigma_i ^2 ) }</math> </center> | ||
where <math> \mathbf k </math> is the effective number of the local neighbors | where <math> \mathbf k </math> is the effective number of the local neighbors, <math> \mathbf \sigma_i </math> is the variance of the Gaussian that is centered on <math> \mathbf x_i </math>, and for every <math> \mathbf x_i </math>, we set <math> \mathbf p_{i|i} = 0 </math>. It can be seen from this definition that, the closer the datapoints are, the higher the <math> \mathbf p_{j|i} </math> is. For the widely separated datapoints, <math> \mathbf p_{j|i} </math> is almost infinitesimal. | ||
With the same idea, we model the similarity of map point <math> \mathbf y_j </math> to <math> \mathbf y_i </math> by the conditional probability <math> \mathbf q_{j|i} </math>, which is given by | With the same idea, in the low-dimensional space, we model the similarity of map point <math> \mathbf y_j </math> to <math> \mathbf y_i </math> by the conditional probability <math> \mathbf q_{j|i} </math>, which is given by | ||
<br> <center> <math> q_{j|i} = \frac{\exp(-||y_i-y_j ||^2)}{\sum_{k \neq i} \exp(-||y_i-y_k ||^2) }</math> </center> | <br> <center> <math> q_{j|i} = \frac{\exp(-||y_i-y_j ||^2)}{\sum_{k \neq i} \exp(-||y_i-y_k ||^2) }</math> </center> | ||
where we set the variance of the Gaussian <math> \mathbf \sigma_i </math> to be <math> \frac{1}{\sqr {2} } </math>. | |||
<br> <center> <math> C = \sum_{i} KL(P_i||Q_i) =\sum_{i}\sum_{j \neq i} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}</math> </center> | <br> <center> <math> C = \sum_{i} KL(P_i||Q_i) =\sum_{i}\sum_{j \neq i} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}</math> </center> |
Revision as of 20:35, 12 July 2009
Introduction
The paper <ref>Laurens van der Maaten, and Geoffrey Hinton, 2008. Visualizing Data using t-SNE.</ref> introduced a new nonlinear dimensionally reduction technique that "embeds" high-dimensional data into low-dimensional space. This technique is a variation of the Stochastic Neighbor embedding (SNE) that was proposed by Hinton and Roweis in 2002 <ref>G.E. Hinton and S.T. Roweis, 2002. Stochastic Neighbor embedding.</ref>, where the high-dimensional Euclidean distances between datapoints are converted into the conditional probability to describe their similarities. t-SNE, based on the same idea, is aimed to be easier for optimization and to solve the "crowding problem". In addition, the author showed that t-SNE can be applied to large data sets as well, by using random walks on neighborhood graphs. The performance of t-SNE is demonstrated on a wide variety of data sets and compared with many other visualization techniques.
Stochastic Neighbor Embedding
In SNE, the high-dimensional Euclidean distances between datapoints is first converted into probabilities. The similarity of datapoint [math]\displaystyle{ \mathbf x_i }[/math] to datapoint [math]\displaystyle{ \mathbf j_i }[/math] is then presented by the conditional probability, [math]\displaystyle{ \mathbf p_{j|i} }[/math], that [math]\displaystyle{ \mathbf x_i }[/math] would pick [math]\displaystyle{ \mathbf j_i }[/math] as its neighbor when neighbors are picked in proportion to their probability density under a Gaussian centered on [math]\displaystyle{ \mathbf x_i }[/math]. The [math]\displaystyle{ \mathbf p_{j|i} }[/math] is given as
where [math]\displaystyle{ \mathbf k }[/math] is the effective number of the local neighbors, [math]\displaystyle{ \mathbf \sigma_i }[/math] is the variance of the Gaussian that is centered on [math]\displaystyle{ \mathbf x_i }[/math], and for every [math]\displaystyle{ \mathbf x_i }[/math], we set [math]\displaystyle{ \mathbf p_{i|i} = 0 }[/math]. It can be seen from this definition that, the closer the datapoints are, the higher the [math]\displaystyle{ \mathbf p_{j|i} }[/math] is. For the widely separated datapoints, [math]\displaystyle{ \mathbf p_{j|i} }[/math] is almost infinitesimal.
With the same idea, in the low-dimensional space, we model the similarity of map point [math]\displaystyle{ \mathbf y_j }[/math] to [math]\displaystyle{ \mathbf y_i }[/math] by the conditional probability [math]\displaystyle{ \mathbf q_{j|i} }[/math], which is given by
where we set the variance of the Gaussian [math]\displaystyle{ \mathbf \sigma_i }[/math] to be [math]\displaystyle{ \frac{1}{\sqr {2} } }[/math].