Visualizing Data using t-SNE
Introduction
The paper <ref>Laurens van der Maaten and Geoffrey Hinton, 2008. Visualizing Data using t-SNE.</ref> introduced a new nonlinear dimensionality reduction technique that "embeds" high-dimensional data into a low-dimensional space. This technique is a variation of Stochastic Neighbor Embedding (SNE), which was proposed by Hinton and Roweis in 2002 <ref>G.E. Hinton and S.T. Roweis, 2002. Stochastic Neighbor Embedding.</ref>, in which the high-dimensional Euclidean distances between datapoints are converted into conditional probabilities that describe their similarities. t-SNE, based on the same idea, aims to be easier to optimize and to solve the "crowding problem". In addition, the authors showed that t-SNE can be applied to large data sets as well, by using random walks on neighborhood graphs. The performance of t-SNE is demonstrated on a wide variety of data sets and compared with many other visualization techniques.
Stochastic Neighbor Embedding
In SNE, the high-dimensional Euclidean distances between datapoints are first converted into probabilities. The similarity of datapoint [math]\displaystyle{ \mathbf x_j }[/math] to datapoint [math]\displaystyle{ \mathbf x_i }[/math] is then represented by the conditional probability, [math]\displaystyle{ \mathbf p_{j|i} }[/math], that [math]\displaystyle{ \mathbf x_i }[/math] would pick [math]\displaystyle{ \mathbf x_j }[/math] as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered on [math]\displaystyle{ \mathbf x_i }[/math]. The [math]\displaystyle{ \mathbf p_{j|i} }[/math] is given as

[math]\displaystyle{ p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)} }[/math]
where the sum in the denominator runs over all datapoints [math]\displaystyle{ \mathbf x_k }[/math] with [math]\displaystyle{ \mathbf k \neq i }[/math], [math]\displaystyle{ \mathbf \sigma_i }[/math] is the variance of the Gaussian that is centered on [math]\displaystyle{ \mathbf x_i }[/math], and for every [math]\displaystyle{ \mathbf x_i }[/math], we set [math]\displaystyle{ \mathbf p_{i|i} = 0 }[/math]. It can be seen from this definition that the closer two datapoints are, the higher [math]\displaystyle{ \mathbf p_{j|i} }[/math] is; for widely separated datapoints, [math]\displaystyle{ \mathbf p_{j|i} }[/math] is almost infinitesimal.
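As a concrete illustration, the conditional probabilities above can be computed with a few lines of NumPy. This is only a sketch: the function name is hypothetical, and the bandwidths [math]\displaystyle{ \mathbf \sigma_i }[/math] are assumed to be already known (they are normally chosen per point to match a target perplexity, as described later in this section).

```python
import numpy as np

def conditional_probabilities(X, sigmas):
    """Compute the matrix P with P[i, j] = p_{j|i} from data X (n x d).

    sigmas: array of per-point Gaussian bandwidths sigma_i (assumed given).
    """
    # Squared Euclidean distance matrix: D[i, j] = ||x_i - x_j||^2
    sq_norms = np.sum(X ** 2, axis=1)
    D = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    # Gaussian kernel, row i scaled by its own bandwidth sigma_i
    P = np.exp(-D / (2 * sigmas[:, None] ** 2))
    np.fill_diagonal(P, 0.0)           # p_{i|i} = 0 by definition
    P /= P.sum(axis=1, keepdims=True)  # normalize each row to sum to 1
    return P
```

Note that each row of the result is a probability distribution over the other points, and closer neighbors receive higher probability.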
With the same idea, in the low-dimensional space, we model the similarity of map point [math]\displaystyle{ \mathbf y_j }[/math] to [math]\displaystyle{ \mathbf y_i }[/math] by the conditional probability [math]\displaystyle{ \mathbf q_{j|i} }[/math], which is given by

[math]\displaystyle{ q_{j|i} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq i} \exp(-\|y_i - y_k\|^2)} }[/math]
where we set the variance of the Gaussian to [math]\displaystyle{ \frac{1}{\sqrt{2} } }[/math] (a different value would only rescale the final map). Again, we set [math]\displaystyle{ \mathbf q_{i|i} = 0 }[/math].
If the low-dimensional map points correctly represent the high-dimensional datapoints, their conditional probabilities [math]\displaystyle{ \mathbf q_{j|i} }[/math] and [math]\displaystyle{ \mathbf p_{j|i} }[/math] should be equal. Therefore, the aim of SNE is to minimize the mismatch between [math]\displaystyle{ \mathbf q_{j|i} }[/math] and [math]\displaystyle{ \mathbf p_{j|i} }[/math]. This is achieved by minimizing the sum of Kullback-Leibler divergences over all datapoints. The cost function of SNE is then expressed as

[math]\displaystyle{ C = \sum_i KL(P_i \| Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}} }[/math]
where [math]\displaystyle{ \mathbf P_i }[/math] and [math]\displaystyle{ \mathbf Q_i }[/math] are the conditional probability distributions over all other points given [math]\displaystyle{ \mathbf x_i }[/math] and [math]\displaystyle{ \mathbf y_i }[/math], respectively. Since the Kullback-Leibler divergence is asymmetric, there is a large cost for using a small [math]\displaystyle{ \mathbf q_{j|i} }[/math] to model a large [math]\displaystyle{ \mathbf p_{j|i} }[/math], but only a small cost for using a large [math]\displaystyle{ \mathbf q_{j|i} }[/math] to model a small [math]\displaystyle{ \mathbf p_{j|i} }[/math]. Therefore, the SNE cost function focuses on preserving the local structure of the data: it strongly enforces keeping the images of nearby datapoints nearby, while only weakly penalizing placing the images of widely separated datapoints close together.
The remaining parameter [math]\displaystyle{ \mathbf \sigma_i }[/math] is selected by performing a binary search for the value of [math]\displaystyle{ \mathbf \sigma_i }[/math] that produces a [math]\displaystyle{ \mathbf P_i }[/math] with a fixed perplexity specified by the user. The perplexity, defined as [math]\displaystyle{ Perp(P_i) = 2^{H(P_i)} }[/math] with [math]\displaystyle{ H(P_i) }[/math] the Shannon entropy of [math]\displaystyle{ \mathbf P_i }[/math], can be interpreted as a smooth measure of the effective number of neighbors.
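A minimal sketch of this binary search, under the assumptions that the squared distances from [math]\displaystyle{ \mathbf x_i }[/math] to all other points are precomputed and that the target perplexity lies in the achievable range (the function name and search bounds are illustrative, not from the paper):

```python
import numpy as np

def sigma_for_perplexity(dists_i, target_perplexity, tol=1e-5, max_iter=50):
    """Binary-search sigma_i so that Perp(P_i) = 2^{H(P_i)} hits the target.

    dists_i: squared Euclidean distances from x_i to every other point.
    """
    lo, hi = 1e-10, 1e10
    sigma = 1.0
    for _ in range(max_iter):
        sigma = (lo + hi) / 2.0
        p = np.exp(-dists_i / (2.0 * sigma ** 2))
        p /= p.sum()
        H = -np.sum(p * np.log2(p + 1e-12))  # Shannon entropy in bits
        perp = 2.0 ** H
        if abs(perp - target_perplexity) < tol:
            break
        if perp > target_perplexity:
            hi = sigma  # distribution too flat: shrink the bandwidth
        else:
            lo = sigma  # distribution too peaked: grow the bandwidth
    return sigma
```

Larger bandwidths flatten [math]\displaystyle{ \mathbf P_i }[/math] and raise the perplexity, so the perplexity is monotone in [math]\displaystyle{ \mathbf \sigma_i }[/math] and bisection converges.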
To minimize the cost function, gradient descent is used. The gradient is given as

[math]\displaystyle{ \frac{\delta C}{\delta y_i} = 2 \sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j) }[/math]

which is simple and has a nice physical interpretation. The gradient can be seen as the resultant force induced by a set of springs between the map point [math]\displaystyle{ \mathbf y_i }[/math] and all other map points [math]\displaystyle{ \mathbf y_j }[/math], where each spring acts along the direction [math]\displaystyle{ \mathbf (y_i-y_j) }[/math] and its stiffness is [math]\displaystyle{ \mathbf ([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) }[/math].
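This gradient can be sketched directly in NumPy. The sketch assumes a (hypothetical) layout in which row [math]\displaystyle{ i }[/math] of the matrix [math]\displaystyle{ \mathbf P }[/math] holds the conditionals [math]\displaystyle{ \mathbf p_{j|i} }[/math]; the low-dimensional [math]\displaystyle{ \mathbf q_{j|i} }[/math] are recomputed from the current map points.

```python
import numpy as np

def sne_gradient(P, Y):
    """Gradient of the SNE cost with respect to the map points Y.

    P: matrix of high-dimensional conditionals, P[i, j] = p_{j|i}.
    Y: current low-dimensional map points, shape (n, d).
    """
    # Low-dimensional conditionals q_{j|i} (Gaussian kernel, row-normalized)
    sq = np.sum(Y ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2 * Y @ Y.T
    Q = np.exp(-D)
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum(axis=1, keepdims=True)
    # Spring stiffness: (p_{j|i} - q_{j|i}) + (p_{i|j} - q_{i|j})
    S = (P - Q) + (P - Q).T
    # Force on y_i: 2 * sum_j S[i, j] * (y_i - y_j)
    grad = 2 * (S[:, :, None] * (Y[:, None, :] - Y[None, :, :])).sum(axis=1)
    return grad
```

A quick sanity check on the spring picture: when the map already matches the data ([math]\displaystyle{ \mathbf p_{j|i} = q_{j|i} }[/math] everywhere), all stiffnesses vanish and the gradient is zero.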
t-Distributed Stochastic Neighbor Embedding
Although SNE produces reasonably good visualizations, it has two main problems: difficulty of optimization and the "crowding problem". t-Distributed Stochastic Neighbor Embedding (t-SNE), a variation of SNE, aims to alleviate these problems. The cost function of t-SNE differs from that of SNE in two ways: (1) it uses a symmetric version of the SNE cost function, and (2) it uses a Student t-distribution instead of a Gaussian to compute the similarities between points in the low-dimensional space.
Symmetric SNE
In symmetric SNE, instead of the sum of the Kullback-Leibler divergences between the conditional probabilities, the cost function is a single Kullback-Leibler divergence between two joint probability distributions: [math]\displaystyle{ \mathbf P }[/math] in the high-dimensional space and [math]\displaystyle{ \mathbf Q }[/math] in the low-dimensional space.
In this case, the pairwise similarities of the datapoints in the high-dimensional space are given by

[math]\displaystyle{ p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma^2)}{\sum_{k \neq l} \exp(-\|x_k - x_l\|^2 / 2\sigma^2)} }[/math]

and [math]\displaystyle{ \mathbf q_{ij} }[/math] in the low-dimensional space is

[math]\displaystyle{ q_{ij} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq l} \exp(-\|y_k - y_l\|^2)} }[/math]
where [math]\displaystyle{ \mathbf p_{ii} }[/math] and [math]\displaystyle{ \mathbf q_{ii} }[/math] are still zero. When a high-dimensional datapoint [math]\displaystyle{ \mathbf x_i }[/math] is an outlier (far from all the other points), all of its pairwise similarities would be extremely small. To circumvent this, we instead set [math]\displaystyle{ \mathbf{p_{ij}=\frac{(p_{j|i}+p_{i|j})}{2n}} }[/math], which ensures that [math]\displaystyle{ \sum_{j} p_{ij} > \frac {1}{2n} }[/math] for all [math]\displaystyle{ \mathbf x_i }[/math]. This makes sure that every [math]\displaystyle{ \mathbf x_i }[/math] makes a significant contribution to the cost function, which is given as

[math]\displaystyle{ C = KL(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} }[/math]
As we can see, by definition, we have [math]\displaystyle{ \mathbf p_{ij} = p_{ji} }[/math] and [math]\displaystyle{ \mathbf q_{ij} = q_{ji} }[/math]. This is why it is called symmetric SNE.
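The symmetrization step is a one-liner; the sketch below (hypothetical function name) assumes the row-stochastic matrix of conditionals [math]\displaystyle{ \mathbf p_{j|i} }[/math] is already available.

```python
import numpy as np

def joint_from_conditionals(P_cond):
    """Symmetrize conditionals into joint probabilities:
    p_ij = (p_{j|i} + p_{i|j}) / (2n).

    P_cond: row-stochastic matrix with P_cond[i, j] = p_{j|i}.
    """
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)
```

Because each row of the conditional matrix sums to 1, the resulting joint distribution sums to 1 over all pairs, is symmetric, and gives every point total mass at least [math]\displaystyle{ \frac{1}{2n} }[/math].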
From the cost function, we obtain a gradient as simple as

[math]\displaystyle{ \frac{\delta C}{\delta y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j) }[/math]
which is the main advantage of symmetric SNE.
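The simpler gradient can also be sketched in a few lines. Here [math]\displaystyle{ \mathbf Q }[/math] is the Gaussian-based joint distribution of symmetric SNE, normalized over all pairs rather than per row (full t-SNE would replace this Gaussian kernel with a Student-t kernel).

```python
import numpy as np

def symmetric_sne_gradient(P, Y):
    """Gradient of symmetric SNE: dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j).

    P: symmetric joint probabilities in the high-dimensional space.
    Y: current low-dimensional map points, shape (n, d).
    """
    sq = np.sum(Y ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2 * Y @ Y.T
    Q = np.exp(-D)
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum()  # joint distribution: one normalization over all pairs
    diff = Y[:, None, :] - Y[None, :, :]
    return 4 * ((P - Q)[:, :, None] * diff).sum(axis=1)
```

Compared with the asymmetric SNE gradient, there is only a single [math]\displaystyle{ (p_{ij} - q_{ij}) }[/math] term per pair, which is what makes this version cheaper and easier to work with.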
The Crowding Problem
The "crowding problem" that is addressed in the paper is defined as: "the area of the two-dimensional map that is available to accommodate moderately distant datapoints will not be nearly large enough compared with the area available to accommodate nearby datapoints". This happens when the datapoints are approximately uniformly distributed in a region on a high-dimensional manifold around i,