Visualizing Data using t-SNE
Introduction
The paper <ref>Laurens van der Maaten and Geoffrey Hinton, 2008. Visualizing Data using t-SNE.</ref> introduced a new nonlinear dimensionality reduction technique that "embeds" high-dimensional data into a low-dimensional space. This technique is a variation of Stochastic Neighbor Embedding (SNE), proposed by Hinton and Roweis in 2002 <ref>G.E. Hinton and S.T. Roweis, 2002. Stochastic Neighbor Embedding.</ref>, in which the high-dimensional Euclidean distances between datapoints are converted into conditional probabilities that describe their similarities. t-SNE is based on the same idea, but aims to be easier to optimize and to solve the "crowding problem". In addition, the authors showed that t-SNE can be applied to large data sets as well, by using random walks on neighborhood graphs. The performance of t-SNE is demonstrated on a wide variety of data sets and compared with many other visualization techniques.
Stochastic Neighbor Embedding
In SNE, the high-dimensional Euclidean distances between datapoints are first converted into probabilities. The similarity of datapoint [math]\displaystyle{ \mathbf x_j }[/math] to datapoint [math]\displaystyle{ \mathbf x_i }[/math] is then represented by the conditional probability, [math]\displaystyle{ \mathbf p_{j|i} }[/math], that [math]\displaystyle{ \mathbf x_i }[/math] would pick [math]\displaystyle{ \mathbf x_j }[/math] as its neighbor when neighbors are picked in proportion to their probability density under a Gaussian centered on [math]\displaystyle{ \mathbf x_i }[/math]. The [math]\displaystyle{ \mathbf p_{j|i} }[/math] is given as
[math]\displaystyle{ \mathbf p_{j|i} = \frac{\exp(-\|\mathbf x_i - \mathbf x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|\mathbf x_i - \mathbf x_k\|^2 / 2\sigma_i^2)} }[/math]
where the sum in the denominator runs over all datapoints [math]\displaystyle{ \mathbf x_k }[/math] other than [math]\displaystyle{ \mathbf x_i }[/math], [math]\displaystyle{ \mathbf \sigma_i^2 }[/math] is the variance of the Gaussian that is centered on [math]\displaystyle{ \mathbf x_i }[/math], and for every [math]\displaystyle{ \mathbf x_i }[/math], we set [math]\displaystyle{ \mathbf p_{i|i} = 0 }[/math]. It can be seen from this definition that the closer the datapoints are, the higher [math]\displaystyle{ \mathbf p_{j|i} }[/math] is. For widely separated datapoints, [math]\displaystyle{ \mathbf p_{j|i} }[/math] is almost infinitesimal.
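The conversion from distances to conditional probabilities can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `conditional_probabilities`, the toy points and the choice of 1.0 for every bandwidth are hypothetical and not taken from the paper.

```python
import math

def conditional_probabilities(points, sigmas):
    """Return P with P[i][j] = p_{j|i} under a Gaussian centered on points[i]."""
    n = len(points)
    P = []
    for i in range(n):
        weights = []
        for j in range(n):
            if i == j:
                weights.append(0.0)  # p_{i|i} is defined to be zero
            else:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                weights.append(math.exp(-d2 / (2.0 * sigmas[i] ** 2)))
        total = sum(weights)  # normalizer: sum over all k != i
        P.append([w / total for w in weights])
    return P

# Hypothetical toy data: two nearby points and one distant point.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
P = conditional_probabilities(points, sigmas=[1.0, 1.0, 1.0])
```

Each row of `P` sums to one, and the nearby pair receives almost all of the probability mass, while the entry for the distant point is close to zero.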
With the same idea, in the low-dimensional space, we model the similarity of map point [math]\displaystyle{ \mathbf y_j }[/math] to [math]\displaystyle{ \mathbf y_i }[/math] by the conditional probability [math]\displaystyle{ \mathbf q_{j|i} }[/math], which is given by
[math]\displaystyle{ \mathbf q_{j|i} = \frac{\exp(-\|\mathbf y_i - \mathbf y_j\|^2)}{\sum_{k \neq i} \exp(-\|\mathbf y_i - \mathbf y_k\|^2)} }[/math]
where the Gaussian has the same width for every map point: its standard deviation is set to [math]\displaystyle{ \frac{1}{\sqrt{2} } }[/math], so that [math]\displaystyle{ 2\sigma^2 = 1 }[/math] (a different value will only result in a rescaling of the final map). And again, we set [math]\displaystyle{ \mathbf q_{i|i} = 0 }[/math].
If the low-dimensional map points correctly represent the high-dimensional datapoints, their conditional probabilities [math]\displaystyle{ \mathbf q_{j|i} }[/math] and [math]\displaystyle{ \mathbf p_{j|i} }[/math] should be equal. Therefore, the aim of SNE is to minimize the mismatch between [math]\displaystyle{ \mathbf q_{j|i} }[/math] and [math]\displaystyle{ \mathbf p_{j|i} }[/math]. This is achieved by minimizing the sum of Kullback-Leibler divergences over all datapoints. The cost function of SNE is then expressed as
[math]\displaystyle{ C = \sum_i KL(P_i \| Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}} }[/math]
where [math]\displaystyle{ \mathbf P_i }[/math] and [math]\displaystyle{ \mathbf Q_i }[/math] are the conditional probability distributions over all other points for given [math]\displaystyle{ \mathbf x_i }[/math] and [math]\displaystyle{ \mathbf y_i }[/math]. Since the Kullback-Leibler divergence is asymmetric, there is a large cost for using a small [math]\displaystyle{ \mathbf q_{j|i} }[/math] to model a large [math]\displaystyle{ \mathbf p_{j|i} }[/math], but only a small cost for using a large [math]\displaystyle{ \mathbf q_{j|i} }[/math] to model a small [math]\displaystyle{ \mathbf p_{j|i} }[/math]. Therefore, the SNE cost function focuses more on local structure. It keeps the images of nearby objects nearby, while also keeping the images of widely separated objects relatively far apart.
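The asymmetry described above can be checked numerically on a single term of the Kullback-Leibler sum. The helper `kl_term` and the probability values 0.8 and 0.01 are hypothetical and chosen only for illustration.

```python
import math

def kl_term(p, q):
    """Contribution p * log(p / q) of one pair (i, j) to the KL divergence."""
    return p * math.log(p / q)

# Nearby datapoints (large p) modeled as far apart in the map (small q).
cost_near_as_far = kl_term(0.8, 0.01)   # about 3.5

# Distant datapoints (small p) modeled as close together in the map (large q).
cost_far_as_near = kl_term(0.01, 0.8)   # about -0.044
```

The first kind of error contributes roughly eighty times more in magnitude than the second. An individual term can be negative, although the full divergence summed over all pairs never is.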
The remaining parameter [math]\displaystyle{ \mathbf \sigma_i }[/math] is selected by performing a binary search for the value of [math]\displaystyle{ \mathbf \sigma_i }[/math] that produces a [math]\displaystyle{ \mathbf P_i }[/math] with a fixed perplexity that is specified by the user. The perplexity is a measure of the effective number of local neighbors of [math]\displaystyle{ \mathbf x_i }[/math].
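A minimal sketch of this search is given below. It assumes the usual definition of perplexity as 2 raised to the Shannon entropy (in bits) of the conditional distribution, and it bisects in log space between hypothetical bounds `lo` and `hi`. The function names and the toy distances are illustrative, not the authors' implementation.

```python
import math

def perplexity_of_row(sq_dists, sigma):
    """Perplexity 2**H of p_{j|i}; sq_dists holds squared distances to all other points."""
    dmin = min(sq_dists)  # shifting by the minimum avoids underflow, probabilities are unchanged
    weights = [math.exp(-(d - dmin) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy

def find_sigma(sq_dists, target_perplexity, lo=1e-3, hi=1e3, iters=100):
    """Binary search for sigma; perplexity grows monotonically with sigma."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # midpoint in log space
        if perplexity_of_row(sq_dists, mid) > target_perplexity:
            hi = mid
        else:
            lo = mid
    return math.sqrt(lo * hi)

# Hypothetical squared distances from one datapoint to five others.
sq_dists = [1.0, 4.0, 9.0, 16.0, 25.0]
sigma = find_sigma(sq_dists, target_perplexity=3.0)
```

The target perplexity must lie between 1 and the number of other points (here 5). A small value yields a narrow Gaussian that only sees the closest neighbors, and a large value yields a wide one.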
To minimize the cost function, gradient descent is used. The gradient is given as
[math]\displaystyle{ \frac{\partial C}{\partial \mathbf y_i} = 2 \sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(\mathbf y_i - \mathbf y_j) }[/math]
which is simple and has a nice physical interpretation. The gradient can be seen as the resultant force induced by a set of springs between the map point [math]\displaystyle{ \mathbf y_i }[/math] and all other map points [math]\displaystyle{ \mathbf y_j }[/math], where each spring acts along the direction [math]\displaystyle{ \mathbf (y_i-y_j) }[/math] and the stiffness of the spring is [math]\displaystyle{ \mathbf ([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) }[/math].
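The spring interpretation can be sketched directly: for every pair, the stiffness is the mismatch between the data and map probabilities, and the force acts along the difference of the map points, with an overall factor of 2. The sketch assumes `P[i][j]` stores the conditional probability of j given i and that the map Gaussian has 2σ² = 1. The function names and toy values are hypothetical.

```python
import math

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def map_conditionals(Y):
    """Q[i][j] = q_{j|i} for map points Y, with a Gaussian of unit scale (2*sigma^2 = 1)."""
    n = len(Y)
    Q = []
    for i in range(n):
        w = [0.0 if i == j else math.exp(-sq_dist(Y[i], Y[j])) for j in range(n)]
        total = sum(w)
        Q.append([x / total for x in w])
    return Q

def sne_cost(P, Y):
    """Sum over i of KL(P_i || Q_i)."""
    Q = map_conditionals(Y)
    n = len(Y)
    return sum(P[i][j] * math.log(P[i][j] / Q[i][j])
               for i in range(n) for j in range(n) if i != j and P[i][j] > 0.0)

def sne_gradient(P, Y):
    """Gradient of the SNE cost with respect to every map point."""
    Q = map_conditionals(Y)
    n, dim = len(Y), len(Y[0])
    grad = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # Spring stiffness: mismatch in both conditional directions.
                stiffness = (P[i][j] - Q[i][j]) + (P[j][i] - Q[j][i])
                for d in range(dim):
                    grad[i][d] += 2.0 * stiffness * (Y[i][d] - Y[j][d])
    return grad

# Hypothetical conditional probabilities (rows sum to one) and map points.
P = [[0.0, 0.7, 0.3], [0.6, 0.0, 0.4], [0.5, 0.5, 0.0]]
Y = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
grad = sne_gradient(P, Y)
```

A gradient descent step would then move each map point by a small multiple of `-grad[i]`.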
t-Distributed Stochastic Neighbor Embedding
Although SNE produces relatively good visualizations, it has two main problems: difficulty in optimization and the "crowding problem". t-Distributed Stochastic Neighbor Embedding (t-SNE), which is a variation of SNE, aims to alleviate these problems. The cost function of t-SNE differs from that of SNE in two ways: (1) it uses a symmetric version of the SNE cost function, and (2) it uses a Student-t distribution instead of a Gaussian to compute the similarity between two points in the low-dimensional space.
Symmetric SNE
In symmetric SNE, instead of the sum of the Kullback-Leibler divergences between the conditional probabilities, the cost function is a single Kullback-Leibler divergence between two joint probability distributions, [math]\displaystyle{ \mathbf P }[/math] in the high-dimensional space and [math]\displaystyle{ \mathbf Q }[/math] in the low-dimensional space.
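In this case, the pairwise similarities of the datapoints in the high-dimensional space are given by
[math]\displaystyle{ \mathbf p_{ij} = \frac{\exp(-\|\mathbf x_i - \mathbf x_j\|^2 / 2\sigma^2)}{\sum_{k \neq l} \exp(-\|\mathbf x_k - \mathbf x_l\|^2 / 2\sigma^2)} }[/math]
and [math]\displaystyle{ \mathbf q_{ij} }[/math] in the low-dimensional space is
[math]\displaystyle{ \mathbf q_{ij} = \frac{\exp(-\|\mathbf y_i - \mathbf y_j\|^2)}{\sum_{k \neq l} \exp(-\|\mathbf y_k - \mathbf y_l\|^2)} }[/math]
where [math]\displaystyle{ \mathbf p_{ii} }[/math] and [math]\displaystyle{ \mathbf q_{ii} }[/math] are still zero. When a high-dimensional datapoint [math]\displaystyle{ \mathbf x_i }[/math] is an outlier (far from all the other points), all of its [math]\displaystyle{ \mathbf p_{ij} }[/math] under this definition are extremely small, so the position of its map point has little effect on the cost. To avoid this, we instead set [math]\displaystyle{ \mathbf{p_{ij}=\frac{(p_{j|i}+p_{i|j})}{2n}} }[/math], where [math]\displaystyle{ n }[/math] is the number of datapoints, which ensures that [math]\displaystyle{ \sum_{j} p_{ij} \gt \frac {1}{2n} }[/math] for all [math]\displaystyle{ \mathbf x_i }[/math]. This makes sure that every [math]\displaystyle{ \mathbf x_i }[/math] makes a significant contribution to the cost function, which is given as
[math]\displaystyle{ C = KL(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} }[/math]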
As we can see, by definition, we have [math]\displaystyle{ \mathbf p_{ij} = p_{ji} }[/math] and [math]\displaystyle{ \mathbf q_{ij} = q_{ji} }[/math]. This is why it is called symmetric SNE.
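The symmetrization of the conditional probabilities can be sketched as follows. The matrix `P_cond` is a hypothetical example in which the third point is an outlier: the other two points give it almost no probability mass, yet its row of joint probabilities still sums to more than 1/(2n).

```python
def joint_probabilities(P_cond):
    """p_ij = (p_{j|i} + p_{i|j}) / (2n), where P_cond[i][j] = p_{j|i}."""
    n = len(P_cond)
    return [[(P_cond[i][j] + P_cond[j][i]) / (2.0 * n) for j in range(n)]
            for i in range(n)]

# Hypothetical conditional probabilities; point 2 is far from points 0 and 1.
P_cond = [
    [0.0, 0.999, 0.001],
    [0.999, 0.0, 0.001],
    [0.5, 0.5, 0.0],
]
P_joint = joint_probabilities(P_cond)
row_sums = [sum(row) for row in P_joint]
```

The result is symmetric by construction and sums to one over all pairs, since every row of `P_cond` sums to one.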
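From the cost function, the gradient takes the simple form
[math]\displaystyle{ \frac{\partial C}{\partial \mathbf y_i} = 4 \sum_j (p_{ij} - q_{ij})(\mathbf y_i - \mathbf y_j) }[/math]
which is the main advantage of symmetric SNE: the gradient is simpler than that of SNE and faster to compute.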
The Crowding Problem
The "crowding problem" that is addressed in the paper is defined as: "the area of the two-dimensional map that is available to accommodate moderately distant datapoints will not be nearly large enough compared with the area available to accommodate nearby datapoints". This happens when the datapoints are distributed in a region on a high-dimensional manifold around [math]\displaystyle{ i }[/math], and we try to model the pairwise distances from [math]\displaystyle{ i }[/math] to the datapoints in a two-dimensional map. For example, it is possible to have 11 datapoints that are mutually equidistant in a ten-dimensional manifold, but it is not possible to model this faithfully in a two-dimensional map. Therefore, if the small distances are modeled accurately in the map, most of the moderately distant datapoints will be placed too far away in the two-dimensional map. In SNE, this results in a very small attractive force between datapoint [math]\displaystyle{ i }[/math] and each of these too-distant map points. The very large number of such forces crushes together the points in the center of the map and prevents gaps from forming between the natural clusters.
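The example of 11 mutually equidistant points can be made concrete with one standard construction, which is an assumption here and not taken from the paper: the 11 unit vectors of an 11-dimensional space. Their coordinates all sum to one, so they lie in a ten-dimensional hyperplane, and every pair is at the same distance.

```python
import math
from itertools import combinations

n = 11
# The i-th point has a 1 in coordinate i and 0 elsewhere.
points = [[1.0 if k == i else 0.0 for k in range(n)] for i in range(n)]

# All 55 pairwise Euclidean distances.
distances = [math.dist(a, b) for a, b in combinations(points, 2)]
```

Every entry of `distances` equals the square root of 2. In a two-dimensional map, by contrast, at most three points (the corners of an equilateral triangle) can be mutually equidistant, so most of these 55 distances have to be distorted.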
Compensating for Mismatched Dimensionality by Mismatched Tails
Since the crowding problem is caused by the unwanted attractive forces between map points that represent moderately dissimilar datapoints, one solution is to model these datapoints by a much larger distance in the map, which eliminates the attractive forces. This can be achieved by using a probability distribution that has much heavier tails than a Gaussian to convert the distances into probabilities in the low-dimensional space. The Student t-distribution is selected because it is closely related to the Gaussian distribution, but its density is much faster to evaluate since it does not involve an exponential.
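The effect of the heavier tails can be illustrated by asking at which map distance each kernel produces a given similarity value. The sketch below compares the unnormalized Gaussian kernel exp(-d²) with the unnormalized Student-t kernel 1/(1 + d²) for one degree of freedom. Ignoring the normalization is a simplifying assumption, and the function names are hypothetical.

```python
import math

def gaussian_map_distance(s):
    """Map distance d at which the kernel exp(-d^2) equals s, for 0 < s < 1."""
    return math.sqrt(-math.log(s))

def student_t_map_distance(s):
    """Map distance d at which the kernel 1 / (1 + d^2) equals s, for 0 < s < 1."""
    return math.sqrt(1.0 / s - 1.0)

# Print the distance each kernel needs for a few similarity values.
for s in (0.5, 0.1, 0.01):
    print(s, round(gaussian_map_distance(s), 2), round(student_t_map_distance(s), 2))
```

For a similarity of 0.01 the Gaussian kernel needs a distance of about 2.1, while the Student-t kernel needs about 9.9. The same moderate dissimilarity is therefore represented by a much larger map distance, which leaves more room for nearby points.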
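In t-SNE, a Student t-distribution with one degree of freedom is employed in the low-dimensional map. As in symmetric SNE, the joint probabilities [math]\displaystyle{ \mathbf p_{ij} }[/math] in the high-dimensional space are still
[math]\displaystyle{ \mathbf p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n} }[/math]
while the joint probabilities [math]\displaystyle{ \mathbf q_{ij} }[/math] are defined as
[math]\displaystyle{ \mathbf q_{ij} = \frac{(1 + \|\mathbf y_i - \mathbf y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|\mathbf y_k - \mathbf y_l\|^2)^{-1}} }[/math]
The gradient of the Kullback-Leibler divergence between [math]\displaystyle{ P }[/math] and the Student-t based joint probability distribution [math]\displaystyle{ Q }[/math] is then given by
[math]\displaystyle{ \frac{\partial C}{\partial \mathbf y_i} = 4 \sum_j (p_{ij} - q_{ij})(\mathbf y_i - \mathbf y_j)(1 + \|\mathbf y_i - \mathbf y_j\|^2)^{-1} }[/math]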
Compared with the gradients of SNE and UNI-SNE <ref> J.A. Cook, et al., 2007. Visualizing similarity data with a mixture of maps. </ref>, the t-SNE gradient introduces strong repulsions between dissimilar datapoints that are modeled by small pairwise distances in the low-dimensional map. This effectively prevents the crowding problem mentioned above. At the same time, these repulsions do not go to infinity, which prevents dissimilar datapoints from being pushed too far apart. Therefore, t-SNE models dissimilar datapoints by means of large pairwise distances, and similar datapoints by means of small pairwise distances. This results in a good representation of both the local and the global structure of the high-dimensional data.
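The Student-t similarities and the resulting gradient can be sketched in plain Python. The sketch assumes a symmetric matrix `P` of joint probabilities with zero diagonal that sums to one, and uses the gradient 4 Σ_j (p_ij − q_ij)(y_i − y_j)(1 + ‖y_i − y_j‖²)⁻¹. The function names and toy values are hypothetical, and this is not the authors' implementation.

```python
import math

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def student_t_affinities(Y):
    """Return (Q, W): W[i][j] = 1 / (1 + |y_i - y_j|^2), and Q is W normalized over all pairs."""
    n = len(Y)
    W = [[0.0 if i == j else 1.0 / (1.0 + sq_dist(Y[i], Y[j])) for j in range(n)]
         for i in range(n)]
    Z = sum(sum(row) for row in W)
    Q = [[w / Z for w in row] for row in W]
    return Q, W

def tsne_cost(P, Y):
    """KL(P || Q) between the joint distributions."""
    Q, _ = student_t_affinities(Y)
    n = len(Y)
    return sum(P[i][j] * math.log(P[i][j] / Q[i][j])
               for i in range(n) for j in range(n) if i != j and P[i][j] > 0.0)

def tsne_gradient(P, Y):
    """Gradient of the t-SNE cost with respect to every map point."""
    Q, W = student_t_affinities(Y)
    n, dim = len(Y), len(Y[0])
    grad = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # Positive coeff attracts y_i towards y_j, negative coeff repels.
                coeff = 4.0 * (P[i][j] - Q[i][j]) * W[i][j]
                for d in range(dim):
                    grad[i][d] += coeff * (Y[i][d] - Y[j][d])
    return grad

# Hypothetical symmetric joint probabilities (sum to one) and map points.
P = [[0.0, 0.3, 0.15], [0.3, 0.0, 0.05], [0.15, 0.05, 0.0]]
Y = [[0.0, 0.0], [1.0, 0.5], [-0.5, 2.0]]
grad = tsne_gradient(P, Y)
```

The extra factor `W[i][j]` is what keeps the repulsion between distant map points bounded: it shrinks as the map distance grows.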
Optimization Methods for t-SNE
Experiments with Different Data Sets
The authors applied t-SNE to five data sets and compared the results with seven other non-parametric dimensionality reduction techniques to evaluate t-SNE. The five data sets that were employed are: (1) the MNIST data set, (2) the Olivetti faces data set, (3) the COIL-20 data set, (4) the word-feature data set, and (5) the Netflix data set.
When applied to the MNIST data set, t-SNE constructed a map with clear and clean separations between the different digit classes. At the same time, most of the local structure of the data is captured as well. On the other hand, Isomap and LLE provide very little insight into the class structure of the data, while the Sammon map models the classes fairly well but does not separate them clearly. More experimental results and comparisons are presented in the paper and its supplemental materials.
t-SNE for Large Data Sets
Due to its computational and memory complexity, it is infeasible to apply the standard version of t-SNE to large data sets (those that contain more than 10,000 datapoints). To solve this problem, t-SNE is modified to display a random subset of landmark points in a way that uses the information of the whole data set. First, a neighborhood graph over all the datapoints is created using a selected number of neighbors. Then, for each of the selected landmark points, a random walk is defined, which starts from that landmark point and terminates as soon as it lands on another landmark point.
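The random walks described above can be sketched with a small simulation. Several details here are assumptions rather than statements from the text: the walk picks the next node uniformly among the current node's neighbors, a walk that returns to its own starting landmark simply continues, and the fraction of walks from one landmark that end at another is used as their similarity. The function name and the toy path graph are hypothetical.

```python
import random

def landmark_walk_fractions(neighbors, landmarks, walks_per_landmark=2000, seed=0):
    """Estimate, for each landmark, the fraction of random walks ending at each other landmark."""
    rng = random.Random(seed)
    landmark_set = set(landmarks)
    fractions = {}
    for start in landmarks:
        counts = {l: 0 for l in landmarks if l != start}
        for _ in range(walks_per_landmark):
            node = rng.choice(neighbors[start])
            # Keep walking until a landmark other than the start is reached.
            while node not in landmark_set or node == start:
                node = rng.choice(neighbors[node])
            counts[node] += 1
        fractions[start] = {l: c / walks_per_landmark for l, c in counts.items()}
    return fractions

# Hypothetical neighborhood graph: a path 0-1-2-3-4-5-6 with landmarks 0, 2 and 6.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}
fractions = landmark_walk_fractions(neighbors, landmarks=[0, 2, 6])
```

In this toy graph, walks from landmark 2 end at the nearby landmark 0 about twice as often as at the distant landmark 6, so the walk fractions reflect how close the landmarks are along the graph. The sketch requires a connected graph, since otherwise a walk might never terminate.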