Visualizing Data using t-SNE

Introduction

The paper <ref>Laurens van der Maaten and Geoffrey Hinton. Visualizing Data using t-SNE. Journal of Machine Learning Research, 9: 2579-2605, 2008.</ref> introduced a new nonlinear dimensionality reduction technique that "embeds" high-dimensional data into a low-dimensional space. The technique is a variation of Stochastic Neighbor Embedding (SNE), proposed by Hinton and Roweis in 2002 <ref>G.E. Hinton and S.T. Roweis. Stochastic Neighbor Embedding. In Advances in Neural Information Processing Systems, vol. 15, pp. 833-840, Cambridge, MA, USA, 2002. The MIT Press.</ref>, in which the high-dimensional Euclidean distances between datapoints are converted into conditional probabilities that describe their similarities. t-SNE is based on the same idea, but is designed to be easier to optimize and to solve the "crowding problem". In addition, the authors showed that t-SNE can be applied to large data sets by using random walks on neighborhood graphs. The performance of t-SNE is demonstrated on a wide variety of data sets and compared with many other visualization techniques.

Stochastic Neighbor Embedding

In SNE, the high-dimensional Euclidean distances between datapoints are first converted into probabilities. The similarity of datapoint [math]\displaystyle{ \mathbf x_j }[/math] to datapoint [math]\displaystyle{ \mathbf x_i }[/math] is represented by the conditional probability [math]\displaystyle{ \mathbf p_{j|i} }[/math] that [math]\displaystyle{ \mathbf x_i }[/math] would pick [math]\displaystyle{ \mathbf x_j }[/math] as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered on [math]\displaystyle{ \mathbf x_i }[/math]. The conditional probability [math]\displaystyle{ \mathbf p_{j|i} }[/math] is given by


[math]\displaystyle{ \mathbf p_{j|i} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma_i ^2 )}{\sum_{k \neq i} \exp(-||x_i-x_k ||^2/ 2\sigma_i ^2 ) } }[/math]

where the sum in the denominator runs over all datapoints [math]\displaystyle{ \mathbf x_k }[/math] other than [math]\displaystyle{ \mathbf x_i }[/math], [math]\displaystyle{ \sigma_i^2 }[/math] is the variance of the Gaussian that is centered on [math]\displaystyle{ \mathbf x_i }[/math], and we set [math]\displaystyle{ \mathbf p_{i|i} = 0 }[/math] for every [math]\displaystyle{ \mathbf x_i }[/math]. It can be seen from this definition that the closer two datapoints are, the higher [math]\displaystyle{ \mathbf p_{j|i} }[/math] is. For widely separated datapoints, [math]\displaystyle{ \mathbf p_{j|i} }[/math] is almost infinitesimal.
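The definition above can be sketched in a few lines of NumPy. This is only an illustration, not code from the paper: the function name `conditional_p`, the row convention `P[i, j]` = [math]\displaystyle{ p_{j|i} }[/math], and the toy data are assumptions made for the example, and the bandwidths [math]\displaystyle{ \sigma_i }[/math] are simply passed in rather than tuned.

```python
import numpy as np

def conditional_p(X, sigma):
    """Return P with P[i, j] = p_{j|i}; sigma holds one bandwidth per datapoint."""
    X = np.asarray(X, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    # Squared Euclidean distances ||x_i - x_j||^2, shape (n, n).
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    logits = -sq / (2.0 * sigma[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)             # enforces p_{i|i} = 0
    logits -= logits.max(axis=1, keepdims=True)   # shift for numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)       # each row sums to 1

# Toy data (hypothetical): point 1 is close to point 0, point 2 is far away.
X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
P = conditional_p(X, sigma=np.ones(3))
```

In this toy example, `P[0, 1]` is much larger than `P[0, 2]`, matching the remark that close datapoints receive high conditional probability.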

Similarly, in the low-dimensional space, we model the similarity of map point [math]\displaystyle{ \mathbf y_j }[/math] to map point [math]\displaystyle{ \mathbf y_i }[/math] by the conditional probability [math]\displaystyle{ \mathbf q_{j|i} }[/math], which is given by


[math]\displaystyle{ q_{j|i} = \frac{\exp(-||y_i-y_j ||^2)}{\sum_{k \neq i} \exp(-||y_i-y_k ||^2) } }[/math]

where the Gaussian has a fixed bandwidth: comparing with the expression for [math]\displaystyle{ \mathbf p_{j|i} }[/math], the standard deviation is set to [math]\displaystyle{ \frac{1}{\sqrt{2} } }[/math], that is, [math]\displaystyle{ 2\sigma^2 = 1 }[/math] (a different value would only result in a rescaling of the final map). Again, we set [math]\displaystyle{ \mathbf q_{i|i} = 0 }[/math].

If the low-dimensional map points correctly represent the high-dimensional datapoints, the conditional probabilities [math]\displaystyle{ \mathbf q_{j|i} }[/math] and [math]\displaystyle{ \mathbf p_{j|i} }[/math] should be equal. Therefore, the aim of SNE is to minimize the mismatch between [math]\displaystyle{ \mathbf q_{j|i} }[/math] and [math]\displaystyle{ \mathbf p_{j|i} }[/math]. This is achieved by minimizing the sum of Kullback-Leibler divergences over all datapoints. The cost function of SNE is then expressed as


[math]\displaystyle{ C = \sum_{i} KL(P_i||Q_i) =\sum_{i}\sum_{j \neq i} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}} }[/math]

where [math]\displaystyle{ \mathbf P_i }[/math] and [math]\displaystyle{ \mathbf Q_i }[/math] are the conditional probability distributions over all other points given [math]\displaystyle{ \mathbf x_i }[/math] and [math]\displaystyle{ \mathbf y_i }[/math], respectively. Since the Kullback-Leibler divergence is asymmetric, there is a large cost for using a small [math]\displaystyle{ \mathbf q_{j|i} }[/math] to model a large [math]\displaystyle{ \mathbf p_{j|i} }[/math] (widely separated map points representing nearby datapoints), but only a small cost for using a large [math]\displaystyle{ \mathbf q_{j|i} }[/math] to model a small [math]\displaystyle{ \mathbf p_{j|i} }[/math] (nearby map points representing widely separated datapoints). Therefore, the SNE cost function focuses on retaining the local structure of the data: it strongly favors keeping the images of nearby objects nearby, while only weakly penalizing widely separated objects whose images end up close together.
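The asymmetry can be made concrete by looking at the contribution [math]\displaystyle{ p_{j|i} \log (p_{j|i}/q_{j|i}) }[/math] of a single pair. The following sketch is illustrative only: the helper names `pair_cost` and `sne_cost` and the probability values 0.8 and 0.01 are assumptions chosen for the example, with `P[i, j]` and `Q[i, j]` standing for [math]\displaystyle{ p_{j|i} }[/math] and [math]\displaystyle{ q_{j|i} }[/math].

```python
import numpy as np

def pair_cost(p, q):
    """Contribution p * log(p / q) of one pair to KL(P_i || Q_i)."""
    return p * np.log(p / q)

def sne_cost(P, Q):
    """Sum over i of KL(P_i || Q_i), with P[i, j] = p_{j|i} and Q[i, j] = q_{j|i}."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    mask = P > 0                       # terms with p = 0 contribute nothing
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

# Hypothetical values: the same mismatch, in the two possible directions.
far_for_near = pair_cost(0.8, 0.01)    # nearby datapoints mapped far apart
near_for_far = pair_cost(0.01, 0.8)    # distant datapoints mapped close together
```

Here `far_for_near` is roughly 3.5, while `near_for_far` has a magnitude of only about 0.04, so errors of the first kind dominate the cost.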

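The remaining parameter [math]\displaystyle{ \mathbf \sigma_i }[/math] is selected by performing a binary search for the value of [math]\displaystyle{ \mathbf \sigma_i }[/math] that produces a [math]\displaystyle{ \mathbf P_i }[/math] with a fixed perplexity specified by the user. The perplexity is defined as [math]\displaystyle{ Perp(P_i) = 2^{H(P_i)} }[/math], where [math]\displaystyle{ H(P_i) = -\sum_{j} p_{j|i} \log_2 p_{j|i} }[/math] is the Shannon entropy of [math]\displaystyle{ \mathbf P_i }[/math] measured in bits, and it can be interpreted as a smooth measure of the effective number of neighbors.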
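A minimal sketch of this search for a single datapoint is given below, assuming the perplexity is computed as 2 raised to the Shannon entropy (in bits) of [math]\displaystyle{ \mathbf P_i }[/math]. The function names, the search bracket, the tolerance, and the bisection on a logarithmic scale are all choices made for this example, not details taken from the paper.

```python
import numpy as np

def perplexity(p):
    """2 ** H(p), with the Shannon entropy H measured in bits."""
    p = p[p > 0]
    return 2.0 ** (-np.sum(p * np.log2(p)))

def sigma_for_perplexity(sq_dists_i, target, tol=1e-5, max_iter=100):
    """Search for sigma_i such that P_i has the target perplexity.

    sq_dists_i holds the squared distances from x_i to every OTHER datapoint.
    """
    lo, hi = 1e-10, 1e10                  # assumed search bracket for sigma_i
    for _ in range(max_iter):
        sigma = np.sqrt(lo * hi)          # midpoint on a logarithmic scale
        logits = -sq_dists_i / (2.0 * sigma ** 2)
        p = np.exp(logits - logits.max())
        p /= p.sum()
        perp = perplexity(p)
        if abs(perp - target) < tol:
            break
        if perp > target:
            hi = sigma                    # too many effective neighbors: shrink sigma
        else:
            lo = sigma                    # too few effective neighbors: grow sigma
    return sigma, p

# Hypothetical squared distances from one datapoint to five others.
sq_dists = np.array([1.0, 4.0, 9.0, 16.0, 25.0])
sigma_i, p_i = sigma_for_perplexity(sq_dists, target=3.0)
```

The search relies on the perplexity increasing with [math]\displaystyle{ \sigma_i }[/math]: a very narrow Gaussian puts almost all mass on the nearest neighbor (perplexity near 1), while a very wide one approaches a uniform distribution over all other points.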

To minimize the cost function, gradient descent is used. The gradient is given by


[math]\displaystyle{ \frac{\partial C}{\partial y_i} = 2\sum_{j} (y_i-y_j)([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) }[/math]

which is simple and has a nice physical interpretation. The gradient can be seen as the resultant force created by a set of springs between the map point [math]\displaystyle{ \mathbf y_i }[/math] and all other map points [math]\displaystyle{ \mathbf y_j }[/math]. Each spring exerts a force along the direction [math]\displaystyle{ (y_i-y_j) }[/math], and its stiffness is [math]\displaystyle{ [p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}] }[/math], the mismatch between the pairwise similarities of the datapoints and those of the map points. The spring therefore attracts or repels the map points depending on whether they are too far apart or too close together.
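The gradient formula can be evaluated for all map points at once. The sketch below assumes the matrices are stored so that `P[i, j]` = [math]\displaystyle{ p_{j|i} }[/math] and `Q[i, j]` = [math]\displaystyle{ q_{j|i} }[/math]; the function name `sne_gradient` and the small matrices are hypothetical.

```python
import numpy as np

def sne_gradient(P, Q, Y):
    """Gradient of the SNE cost with respect to every map point (one row per point)."""
    D = P - Q
    S = D + D.T        # S[i, j] = (p_{j|i} - q_{j|i}) + (p_{i|j} - q_{i|j}), the stiffness
    # sum_j S[i, j] * (y_i - y_j) = rowsum(S)[i] * y_i - (S @ Y)[i]
    return 2.0 * (S.sum(axis=1, keepdims=True) * Y - S @ Y)

# Hypothetical three-point example with zero diagonals and rows summing to 1.
P = np.array([[0.0, 0.9, 0.1], [0.8, 0.0, 0.2], [0.3, 0.7, 0.0]])
Q = np.array([[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]])
Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
G = sne_gradient(P, Q, Y)
```

When `Q` equals `P`, every spring has zero stiffness and the gradient vanishes, which is the fixed point the optimization is looking for.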

t-Distributed Stochastic Neighbor Embedding

Although SNE produces relatively good visualizations, it has two main problems: its cost function is difficult to optimize, and it suffers from the "crowding problem". t-Distributed Stochastic Neighbor Embedding (t-SNE), a variation of SNE, aims to alleviate these problems. The cost function of t-SNE differs from that of SNE in two ways: (1) it uses a symmetric version of the SNE cost function, and (2) it uses a Student t-distribution instead of a Gaussian to compute the similarity between two points in the low-dimensional space.

Symmetric SNE

In symmetric SNE, instead of the sum of the Kullback-Leibler divergences between the conditional probabilities, the cost function is a single Kullback-Leibler divergence between two joint probability distributions, [math]\displaystyle{ \mathbf P }[/math] in the high-dimensional space and [math]\displaystyle{ \mathbf Q }[/math] in the low-dimensional space.

In this case, the pairwise similarities of the datapoints in the high-dimensional space are given by

[math]\displaystyle{ \mathbf p_{ij} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma^2 )}{\sum_{k \neq l} \exp(-||x_k-x_l ||^2/ 2\sigma^2 ) } }[/math]

and [math]\displaystyle{ \mathbf q_{ij} }[/math] in the low-dimensional space is

[math]\displaystyle{ \mathbf q_{ij} = \frac{\exp(-||y_i-y_j ||^2 )}{\sum_{k \neq l} \exp(-||y_k-y_l ||^2) } }[/math]

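where [math]\displaystyle{ \mathbf p_{ii} }[/math] and [math]\displaystyle{ \mathbf q_{ii} }[/math] are still zero. However, when a high-dimensional datapoint [math]\displaystyle{ \mathbf x_i }[/math] is an outlier (far from all the other points), this definition makes [math]\displaystyle{ \mathbf p_{ij} }[/math] extremely small for all [math]\displaystyle{ j }[/math], so the position of its map point [math]\displaystyle{ \mathbf y_i }[/math] has very little effect on the cost function. To avoid this, we instead set [math]\displaystyle{ \mathbf{p_{ij}=\frac{(p_{j|i}+p_{i|j})}{2n}} }[/math], where [math]\displaystyle{ n }[/math] is the number of datapoints, which ensures that [math]\displaystyle{ \sum_{j} p_{ij} \gt  \frac {1}{2n} }[/math] for all [math]\displaystyle{ \mathbf x_i }[/math]. This makes sure that every [math]\displaystyle{ \mathbf x_i }[/math] makes a significant contribution to the cost function, which is given as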

[math]\displaystyle{ C = KL(P||Q) =\sum_{i}\sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}} }[/math]

As we can see, by definition, we have [math]\displaystyle{ \mathbf p_{ij} = p_{ji} }[/math] and [math]\displaystyle{ \mathbf q_{ij} = q_{ji} }[/math]. This is why it is called symmetric SNE.
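The symmetrized joint probabilities are cheap to form from the conditional ones. In the sketch below, the function name `joint_p`, the convention `P_cond[i, j]` = [math]\displaystyle{ p_{j|i} }[/math], and the random test matrix are assumptions for illustration. Because each row of the conditional matrix sums to 1, the row sums of the result are [math]\displaystyle{ (1 + \sum_j p_{i|j})/2n }[/math], which is where the lower bound of [math]\displaystyle{ 1/2n }[/math] comes from.

```python
import numpy as np

def joint_p(P_cond):
    """Symmetrize conditional probabilities: p_ij = (p_{j|i} + p_{i|j}) / (2n)."""
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)

# Hypothetical conditional matrix: zero diagonal, every row sums to 1.
rng = np.random.default_rng(0)
P_cond = rng.random((5, 5))
np.fill_diagonal(P_cond, 0.0)
P_cond /= P_cond.sum(axis=1, keepdims=True)

P_joint = joint_p(P_cond)
row_sums = P_joint.sum(axis=1)   # each entry exceeds 1 / (2 * 5)
```

The resulting matrix is symmetric and sums to 1 over all pairs, so it is a valid joint distribution.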

From this cost function, the gradient takes the simple form

[math]\displaystyle{ \frac{\partial C}{\partial y_i} = 4\sum_{j} (y_i-y_j)(p_{ij}-q_{ij}) }[/math]

which is faster to compute than the gradient of SNE. This simpler gradient is the main advantage of symmetric SNE.

The Crowding Problem

The "crowding problem" addressed in the paper is defined as follows: "the area of the two-dimensional map that is available to accommodate moderately distant datapoints will not be nearly large enough compared with the area available to accommodate nearby datapoints". This happens when the datapoints are distributed in a region on a high-dimensional manifold around [math]\displaystyle{ i }[/math], and we try to model the pairwise distances from [math]\displaystyle{ i }[/math] to the datapoints in a two-dimensional map. For example, it is possible to have 11 datapoints that are mutually equidistant on a ten-dimensional manifold, but it is not possible to model this faithfully in a two-dimensional map. Therefore, if the small distances are modeled accurately in the map, most of the moderately distant datapoints have to be placed much too far away in the two-dimensional map. In SNE, this results in a very small attractive force between datapoint [math]\displaystyle{ i }[/math] and each of these too-distant map points. Although each force is small, the very large number of such forces crushes together the points in the center of the map and prevents gaps from forming between the natural clusters.
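The example of 11 mutually equidistant datapoints can be checked numerically. One construction, chosen here purely for illustration, is to take the 11 standard basis vectors of an 11-dimensional space: they all lie in the ten-dimensional hyperplane where the coordinates sum to 1, and every pair is the same distance apart. In a two-dimensional map, by contrast, at most three points (an equilateral triangle) can be mutually equidistant.

```python
import numpy as np

points = np.eye(11)                                  # 11 standard basis vectors
diffs = points[:, None, :] - points[None, :, :]
dists = np.sqrt((diffs ** 2).sum(axis=-1))           # all pairwise distances
off_diag = dists[~np.eye(11, dtype=bool)]            # distances between distinct points

# The centered points span a ten-dimensional subspace.
centered = points - points.mean(axis=0)
intrinsic_dim = np.linalg.matrix_rank(centered)
```

All entries of `off_diag` equal the square root of 2, and `intrinsic_dim` is 10, so this configuration has no faithful two-dimensional representation.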

Compensating for Mismatched Dimensionality by Mismatched Tails

Since the crowding problem is caused by the unwanted attractive forces between map points that represent moderately dissimilar datapoints, one solution is to model these datapoints by a much larger distance in the map, which eliminates the attractive forces. This can be achieved by using a probability distribution with much heavier tails than a Gaussian to convert the distances into probabilities in the low-dimensional space. The Student t-distribution is selected because it is closely related to the Gaussian distribution, and it is much faster to evaluate since it does not involve an exponential.

In t-SNE, a Student t-distribution with one degree of freedom is employed in the low-dimensional map. As in symmetric SNE, the joint probabilities [math]\displaystyle{ \mathbf p_{ij} }[/math] in the high-dimensional space are still

[math]\displaystyle{ \mathbf{p_{ij}=\frac{(p_{j|i}+p_{i|j})}{2n}} }[/math]

while the joint probabilities [math]\displaystyle{ \mathbf q_{ij} }[/math] are defined as

[math]\displaystyle{ \mathbf q_{ij} = \frac{(1 + ||y_i-y_j ||^2 )^{-1}}{\sum_{k \neq l} (1 + ||y_k-y_l ||^2 )^{-1}} }[/math]

The gradient of the Kullback-Leibler divergence between [math]\displaystyle{ P }[/math] and the Student-t based joint probability distribution [math]\displaystyle{ Q }[/math] is then given by

[math]\displaystyle{ \frac{\partial C}{\partial y_i} = 4\sum_{j} (y_i-y_j)(p_{ij}-q_{ij})(1 + ||y_i-y_j ||^2 )^{-1} }[/math]

Compared with the gradients of SNE and UNI-SNE <ref>J.A. Cook, I. Sutskever, et al. Visualizing similarity data with a mixture of maps. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics, volume 2, pages 67-74, 2007.</ref>, the t-SNE gradient introduces strong repulsions between dissimilar datapoints that are modeled by a small pairwise distance in the low-dimensional map. This prevents the crowding problem described above. At the same time, these repulsions do not go to infinity, which prevents dissimilar datapoints from being pushed too far apart. Therefore, t-SNE models dissimilar datapoints by means of large pairwise distances and similar datapoints by means of small pairwise distances. This results in a good representation of both the local and the global structure of the high-dimensional data.
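The Student-t similarities and the t-SNE gradient can be sketched as follows. This is an illustrative NumPy version rather than the authors' implementation; the function names `tsne_q`, `tsne_cost`, and `tsne_gradient` and the random example inputs are assumptions, and `P` is taken to be a symmetric joint matrix with zero diagonal that sums to 1.

```python
import numpy as np

def tsne_q(Y):
    """Return (Q, W): joint probabilities q_ij and the kernel (1 + ||y_i - y_j||^2)^-1."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + sq)          # Student-t kernel with one degree of freedom
    np.fill_diagonal(W, 0.0)      # q_ii = 0
    return W / W.sum(), W

def tsne_cost(P, Y):
    """KL(P || Q), summed over all ordered pairs (i, j)."""
    Q, _ = tsne_q(Y)
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

def tsne_gradient(P, Y):
    """4 * sum_j (y_i - y_j) (p_ij - q_ij) (1 + ||y_i - y_j||^2)^-1 for every i."""
    Q, W = tsne_q(Y)
    M = (P - Q) * W
    return 4.0 * (M.sum(axis=1, keepdims=True) * Y - M @ Y)

# Hypothetical inputs: a random symmetric joint distribution and a random map.
rng = np.random.default_rng(0)
A = rng.random((6, 6))
A = A + A.T
np.fill_diagonal(A, 0.0)
P = A / A.sum()
Y = rng.normal(size=(6, 2))
G = tsne_gradient(P, Y)
```

A useful sanity check on such a sketch is to compare `tsne_gradient` with central finite differences of `tsne_cost`; the two should agree up to discretization error.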

Optimization Methods for t-SNE

One way to optimize the t-SNE cost function is gradient descent with a momentum term, which reduces the number of iterations required. To further improve the results, two tricks called "early compression" and "early exaggeration" can be used. "Early compression" forces the map points to stay close together at the start of the optimization, so that it is easy for clusters to move through one another and explore the space of possible global organizations of the data. "Early exaggeration" multiplies all the [math]\displaystyle{ \mathbf p_{ij} }[/math]'s by a constant factor greater than 1 (for example, 4) in the initial stages of the optimization. Because the [math]\displaystyle{ \mathbf q_{ij} }[/math]'s still sum to 1, they are then far too small to model their corresponding [math]\displaystyle{ \mathbf p_{ij} }[/math]'s, so the optimization is encouraged to focus on modeling the large [math]\displaystyle{ \mathbf p_{ij} }[/math]'s by fairly large [math]\displaystyle{ \mathbf q_{ij} }[/math]'s. This leads to the formation of tight, widely separated clusters in the map, which makes it easy for the clusters to move around relative to one another and find a good global organization.
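A minimal sketch of momentum gradient descent with early exaggeration is shown below. It is not the authors' optimizer: early compression and any adaptive learning-rate scheme are omitted, and the function names, the learning rate, the iteration counts, the momentum schedule, and the fixed bandwidth used to build the toy `P` are all assumptions picked so that this small example behaves sensibly.

```python
import numpy as np

def _q_and_kernel(Y):
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + sq)
    np.fill_diagonal(W, 0.0)
    return W / W.sum(), W

def kl_cost(P, Y):
    Q, _ = _q_and_kernel(Y)
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

def optimize_tsne(P, n_iter=500, lr=0.5, exaggeration=4.0,
                  exaggeration_iters=50, momentum_switch=250, seed=0):
    """Gradient descent with momentum; all default values are illustrative."""
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=1e-2, size=(P.shape[0], 2))   # small random initial map
    velocity = np.zeros_like(Y)
    for t in range(n_iter):
        # Early exaggeration: inflate p_ij during the first iterations only.
        P_t = P * exaggeration if t < exaggeration_iters else P
        momentum = 0.5 if t < momentum_switch else 0.8
        Q, W = _q_and_kernel(Y)
        M = (P_t - Q) * W
        grad = 4.0 * (M.sum(axis=1, keepdims=True) * Y - M @ Y)
        velocity = momentum * velocity - lr * grad
        Y = Y + velocity
    return Y

# Toy data (hypothetical): two well-separated clusters of ten points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (10, 2)), rng.normal(10.0, 1.0, (10, 2))])
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
logits = -sq / 2.0                       # fixed sigma_i = 1 for brevity, no perplexity search
np.fill_diagonal(logits, -np.inf)
P_cond = np.exp(logits - logits.max(axis=1, keepdims=True))
P_cond /= P_cond.sum(axis=1, keepdims=True)
P = (P_cond + P_cond.T) / (2 * len(X))

Y = optimize_tsne(P)
```

A suitable learning rate depends strongly on the number of datapoints, since the individual [math]\displaystyle{ p_{ij} }[/math] shrink as the data set grows; the small value used here is tuned to this 20-point example only.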

Experiments with Different Data Sets

The authors applied t-SNE to five data sets and compared the results with seven other non-parametric dimensionality reduction techniques in order to evaluate t-SNE. The five data sets employed are: (1) the MNIST data set, (2) the Olivetti faces data set, (3) the COIL-20 data set, (4) the word-feature data set, and (5) the Netflix data set.

When applied to the MNIST data set, t-SNE constructed a map with clear and clean separations between the different digit classes. At the same time, most of the local structure of the data is captured as well. On the other hand, Isomap and LLE provide very little insight into the class structure of the data, while the Sammon map models the classes fairly well but does not separate them clearly. More experimental results and comparisons are presented in the paper and its supplemental materials.

t-SNE for Large Data Sets

Due to its computational and memory complexity, it is infeasible to apply the standard version of t-SNE to large data sets (those containing more than about 10,000 datapoints). To solve this problem, t-SNE is modified to display a random subset of landmark points in a way that uses the information of the whole data set. First, a neighborhood graph over all the datapoints is created using a selected number of neighbors. Then, for each of the selected landmark points, a random walk is defined that starts from that landmark point and terminates as soon as it lands on another landmark point. [math]\displaystyle{ \mathbf p_{j|i} }[/math] denotes the fraction of random walks starting at landmark point [math]\displaystyle{ x_i }[/math] that terminate at landmark point [math]\displaystyle{ x_j }[/math]. To avoid the "short-circuits" caused by a noisy datapoint, the random walk-based affinity measure integrates over all paths through the neighborhood graph. The random walk-based similarities [math]\displaystyle{ \mathbf p_{j|i} }[/math] can be computed either by explicitly performing the random walks on the neighborhood graph, or by using an analytical solution <ref>L. Grady. Random walks for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11): 1768-1783, 2006.</ref>, which is more appropriate for very large data sets.
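The explicit random-walk variant can be sketched as a small simulation. This is a simplified illustration under several assumptions that are not taken from the paper: the graph is a directed k-nearest-neighbor graph, every outgoing edge is chosen with equal probability, walks are cut off after `max_steps` steps, and the names `knn_graph` and `random_walk_affinities` as well as the toy data are hypothetical.

```python
import numpy as np

def knn_graph(X, k):
    """neighbors[i] lists the indices of the k nearest neighbors of x_i."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq, np.inf)              # a point is not its own neighbor
    return np.argsort(sq, axis=1)[:, :k]

def random_walk_affinities(neighbors, landmarks, n_walks=100, max_steps=1000, seed=0):
    """A[a, b] estimates p_{j|i} for landmarks i = landmarks[a], j = landmarks[b]."""
    rng = np.random.default_rng(seed)
    landmarks = [int(v) for v in landmarks]
    index = {node: a for a, node in enumerate(landmarks)}
    k = neighbors.shape[1]
    counts = np.zeros((len(landmarks), len(landmarks)))
    for a, start in enumerate(landmarks):
        for _ in range(n_walks):
            node = start
            for _ in range(max_steps):        # safeguard: abandon walks that never terminate
                node = int(neighbors[node, rng.integers(k)])
                if node in index and node != start:
                    counts[a, index[node]] += 1   # landed on another landmark
                    break
    return counts / n_walks

# Toy data (hypothetical): 60 points, every tenth point is a landmark.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
neighbors = knn_graph(X, k=5)
landmarks = list(range(0, 60, 10))
A = random_walk_affinities(neighbors, landmarks)
```

Each row of `A` sums to at most 1; it falls short of 1 only when some walks are abandoned at the step limit. For very large data sets the simulation becomes expensive, which is where the analytical solution mentioned above is preferable.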

Weaknesses of t-SNE

Although t-SNE has been demonstrated to be a favorable technique for data visualization, it has three potential weaknesses. (1) The paper focuses only on data visualization using t-SNE, that is, embedding high-dimensional data into a two- or three-dimensional space. The behavior of t-SNE presented in the paper cannot readily be extrapolated to d>3 dimensions because of the heavy tails of the Student t-distribution. (2) t-SNE might be less successful when applied to data sets with a high intrinsic dimensionality. This is a result of the local linearity assumption on the manifold that t-SNE makes by employing Euclidean distances to represent the similarity between datapoints. (3) The cost function of t-SNE is not convex. As a result, several optimization parameters need to be chosen, and the constructed solutions depend on these parameters and may differ each time t-SNE is run from a random initial configuration of the map points.

References

<references/>