Visualizing Data using t-SNE
Introduction
The paper <ref>Laurens van der Maaten and Geoffrey Hinton, 2008. Visualizing Data using t-SNE.</ref> introduces a new nonlinear dimensionality reduction technique that "embeds" high-dimensional data into a low-dimensional space. The technique is a variation of Stochastic Neighbor Embedding (SNE), proposed by Hinton and Roweis in 2002 <ref>G.E. Hinton and S.T. Roweis, 2002. Stochastic Neighbor Embedding.</ref>, in which the high-dimensional Euclidean distances between datapoints are converted into conditional probabilities that describe their similarities. t-SNE builds on the same idea but is designed to be easier to optimize and to alleviate the "crowding problem". In addition, the authors show that t-SNE can be applied to large data sets by using random walks on neighborhood graphs. The performance of t-SNE is demonstrated on a wide variety of data sets and compared with many other visualization techniques.
Stochastic Neighbor Embedding
In SNE, the high-dimensional Euclidean distances between datapoints are first converted into probabilities. The similarity of datapoint [math]\displaystyle{ \mathbf x_j }[/math] to datapoint [math]\displaystyle{ \mathbf x_i }[/math] is represented by the conditional probability [math]\displaystyle{ \mathbf p_{j|i} }[/math] that [math]\displaystyle{ \mathbf x_i }[/math] would pick [math]\displaystyle{ \mathbf x_j }[/math] as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered on [math]\displaystyle{ \mathbf x_i }[/math]. The [math]\displaystyle{ \mathbf p_{j|i} }[/math] is given as
[math]\displaystyle{ p_{j|i} = \frac{\exp(-\|\mathbf x_i - \mathbf x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|\mathbf x_i - \mathbf x_k\|^2 / 2\sigma_i^2)} }[/math]
where the sum in the denominator runs over all datapoints [math]\displaystyle{ \mathbf x_k }[/math] other than [math]\displaystyle{ \mathbf x_i }[/math], [math]\displaystyle{ \mathbf \sigma_i^2 }[/math] is the variance of the Gaussian that is centered on [math]\displaystyle{ \mathbf x_i }[/math] (it controls the effective number of local neighbors), and for every [math]\displaystyle{ \mathbf x_i }[/math] we set [math]\displaystyle{ \mathbf p_{i|i} = 0 }[/math]. It can be seen from this definition that the closer two datapoints are, the higher [math]\displaystyle{ \mathbf p_{j|i} }[/math] is. For widely separated datapoints, [math]\displaystyle{ \mathbf p_{j|i} }[/math] is almost infinitesimal.
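The definition of [math]\displaystyle{ \mathbf p_{j|i} }[/math] can be illustrated with a minimal sketch in plain Python. The function name `conditional_p` and the toy points are hypothetical, chosen only for illustration, and the sketch assumes `sigma_i` is large enough that the Gaussian weights do not all underflow to zero.

```python
import math

def conditional_p(points, i, sigma_i):
    """Return [p_{j|i} for every j] for datapoint i (hypothetical helper)."""
    xi = points[i]
    weights = []
    for j, xj in enumerate(points):
        if j == i:
            weights.append(0.0)  # p_{i|i} is defined to be 0
        else:
            sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
            # unnormalized Gaussian density centered on x_i
            weights.append(math.exp(-sq_dist / (2.0 * sigma_i ** 2)))
    total = sum(weights)  # normalizer: sum over all k != i
    return [w / total for w in weights]

# Toy data: point 1 is close to point 0, point 2 is far away.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
p = conditional_p(points, i=0, sigma_i=1.0)
print(p)
```

In this toy example the nearby point receives almost all of the probability mass, while the distant point receives a value close to zero, which matches the behavior described above.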
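In the same way, in the low-dimensional space we model the similarity of map point [math]\displaystyle{ \mathbf y_j }[/math] to [math]\displaystyle{ \mathbf y_i }[/math] by the conditional probability [math]\displaystyle{ \mathbf q_{j|i} }[/math], which is given by
[math]\displaystyle{ q_{j|i} = \frac{\exp(-\|\mathbf y_i - \mathbf y_j\|^2)}{\sum_{k \neq i} \exp(-\|\mathbf y_i - \mathbf y_k\|^2)} }[/math]
where the width [math]\displaystyle{ \mathbf \sigma }[/math] of the Gaussian is fixed to [math]\displaystyle{ \frac{1}{\sqrt{2} } }[/math] for every map point, so that [math]\displaystyle{ 2\sigma^2 = 1 }[/math] and no [math]\displaystyle{ \mathbf \sigma }[/math] appears in the expression (a different value would only result in a rescaling of the final map). Again, we set [math]\displaystyle{ \mathbf q_{i|i} = 0 }[/math].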
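The remark that a different Gaussian width only rescales the map can be checked numerically. The sketch below is an illustration under stated assumptions: `conditional_q` is a hypothetical helper, and the width is exposed as a parameter `sigma` purely so that the rescaling can be demonstrated.

```python
import math

def conditional_q(ys, i, sigma=1.0 / math.sqrt(2.0)):
    """Return [q_{j|i} for every j]; the default sigma gives 2*sigma^2 = 1."""
    yi = ys[i]
    weights = []
    for j, yj in enumerate(ys):
        if j == i:
            weights.append(0.0)  # q_{i|i} is defined to be 0
        else:
            sq_dist = sum((a - b) ** 2 for a, b in zip(yi, yj))
            weights.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    total = sum(weights)
    return [w / total for w in weights]

ys = [(0.0, 0.0), (0.5, 0.0), (0.0, 2.0)]
q_default = conditional_q(ys, 0)

# Stretch the map by a factor of 3 and widen the Gaussian by the same factor.
ys_scaled = [(3.0 * a, 3.0 * b) for a, b in ys]
q_scaled = conditional_q(ys_scaled, 0, sigma=3.0 / math.sqrt(2.0))

print(q_default)
print(q_scaled)
```

Both calls yield the same probabilities up to floating-point error, because the squared distances and [math]\displaystyle{ 2\sigma^2 }[/math] grow by the same factor.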
If the low-dimensional map points correctly represent the high-dimensional datapoints, the conditional probabilities [math]\displaystyle{ \mathbf q_{j|i} }[/math] and [math]\displaystyle{ \mathbf p_{j|i} }[/math] should be equal. Therefore, the aim of SNE is to minimize the mismatch between [math]\displaystyle{ \mathbf q_{j|i} }[/math] and [math]\displaystyle{ \mathbf p_{j|i} }[/math]. This is achieved by minimizing the sum of Kullback-Leibler divergences over all datapoints. The cost function of SNE is then expressed as
[math]\displaystyle{ C = \sum_i KL(P_i \| Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}} }[/math]
where [math]\displaystyle{ \mathbf P_i }[/math] and [math]\displaystyle{ \mathbf Q_i }[/math] are the conditional probability distributions over all other points given [math]\displaystyle{ \mathbf x_i }[/math] and [math]\displaystyle{ \mathbf y_i }[/math], respectively. Since the Kullback-Leibler divergence is asymmetric, there is a large cost for using a small [math]\displaystyle{ \mathbf q_{j|i} }[/math] to model a large [math]\displaystyle{ \mathbf p_{j|i} }[/math], but only a small cost for using a large [math]\displaystyle{ \mathbf q_{j|i} }[/math] to model a small [math]\displaystyle{ \mathbf p_{j|i} }[/math]. In other words, mapping nearby datapoints far apart is penalized heavily, whereas mapping widely separated datapoints close together is penalized only weakly. Therefore, the SNE cost function focuses on preserving local structure. It favors keeping the images of nearby objects nearby and, to a lesser degree, keeping the images of widely separated objects relatively far apart.
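The asymmetry of the cost can be made concrete by looking at the individual terms [math]\displaystyle{ p_{j|i} \log (p_{j|i} / q_{j|i}) }[/math]. The following is a minimal sketch; the helper names `kl_terms` and `sne_cost` and the toy distributions are assumptions made for illustration only.

```python
import math

def kl_terms(p_row, q_row):
    """Per-neighbor contributions p * log(p / q); terms with p == 0 contribute 0."""
    return [p * math.log(p / q) if p > 0.0 else 0.0
            for p, q in zip(p_row, q_row)]

def sne_cost(P, Q):
    """Sum of KL(P_i || Q_i) over all datapoints i (rows of P and Q)."""
    return sum(sum(kl_terms(p_row, q_row)) for p_row, q_row in zip(P, Q))

# Row for one datapoint i = 0: neighbor 1 is close in the data (large p),
# neighbor 2 is far away (small p). The map reverses the two.
p_row = [0.0, 0.9, 0.1]
q_row = [0.0, 0.1, 0.9]

terms = kl_terms(p_row, q_row)
print(terms)                       # one term per neighbor
print(sne_cost([p_row], [q_row]))  # total cost for this single row
print(sne_cost([p_row], [p_row]))  # a perfect match costs nothing
```

The term for the close neighbor that was mapped far away (large p, small q) dominates the cost, while the term for the distant neighbor that was mapped nearby (small p, large q) is much smaller in magnitude.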
The remaining parameter [math]\displaystyle{ \mathbf \sigma_i }[/math] is selected by performing a binary search for the value of [math]\displaystyle{ \mathbf \sigma_i }[/math] that produces a [math]\displaystyle{ \mathbf P_i }[/math] with a fixed perplexity specified by the user. The perplexity can be interpreted as a measure of the effective number of neighbors.
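The binary search can be sketched as follows. This is an illustration rather than the authors' implementation: the perplexity is computed as [math]\displaystyle{ 2^{H(P_i)} }[/math] with [math]\displaystyle{ H }[/math] the Shannon entropy in bits (the definition used in the t-SNE paper, not restated in this summary), and the helper names, search bounds, and iteration count are assumptions. The search relies on the perplexity increasing monotonically with [math]\displaystyle{ \sigma_i }[/math].

```python
import math

def conditional_p(points, i, sigma_i):
    """p_{j|i} for all j; distances are shifted by the nearest one for stability."""
    sq = [sum((a - b) ** 2 for a, b in zip(points[i], xj)) for xj in points]
    nearest = min(d for j, d in enumerate(sq) if j != i)
    # Subtracting a constant in the exponent cancels out after normalization.
    w = [0.0 if j == i else math.exp(-(d - nearest) / (2.0 * sigma_i ** 2))
         for j, d in enumerate(sq)]
    total = sum(w)
    return [v / total for v in w]

def perplexity(p_row):
    """2 ** (Shannon entropy in bits) of one conditional distribution."""
    entropy = -sum(p * math.log2(p) for p in p_row if p > 0.0)
    return 2.0 ** entropy

def find_sigma(points, i, target, lo=1e-3, hi=1e3, iters=100):
    """Bisection on sigma_i; assumes the target lies between the two bounds."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if perplexity(conditional_p(points, i, mid)) > target:
            hi = mid  # distribution too flat: shrink the Gaussian
        else:
            lo = mid  # distribution too peaked: widen the Gaussian
    return 0.5 * (lo + hi)

# Toy one-dimensional data set; a target perplexity of 3 means
# "about three effective neighbors" for datapoint 0.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (10.0,)]
sigma = find_sigma(points, i=0, target=3.0)
print(sigma, perplexity(conditional_p(points, 0, sigma)))
```

Because each datapoint gets its own [math]\displaystyle{ \sigma_i }[/math], points in dense regions end up with narrower Gaussians than points in sparse regions for the same target perplexity.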
To minimize the cost function, gradient descent is used. The gradient is given by
[math]\displaystyle{ \frac{\partial C}{\partial \mathbf y_i} = 2 \sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(\mathbf y_i - \mathbf y_j) }[/math]
which is simple and has a nice physical interpretation. The gradient can be seen as the resultant force induced by a set of springs between the map point [math]\displaystyle{ \mathbf y_i }[/math] and all other map points [math]\displaystyle{ \mathbf y_j }[/math], where each spring exerts a force along the direction [math]\displaystyle{ \mathbf (y_i-y_j) }[/math] and the stiffness of the spring is [math]\displaystyle{ \mathbf ([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) }[/math].
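The spring interpretation translates directly into code. The sketch below is a minimal illustration, not the authors' implementation: the function names, the fixed toy matrix `P` (with `P[i][j]` holding [math]\displaystyle{ p_{j|i} }[/math]), the map points, and the learning rate `lr` are all assumptions. It uses the standard SNE gradient, [math]\displaystyle{ 2 \sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(\mathbf y_i - \mathbf y_j) }[/math], and takes a single plain gradient descent step.

```python
import math

def q_matrix(ys):
    """Q[i][j] = q_{j|i} for the current map points, with Q[i][i] = 0."""
    n = len(ys)
    Q = []
    for i in range(n):
        w = [0.0 if j == i else
             math.exp(-sum((a - b) ** 2 for a, b in zip(ys[i], ys[j])))
             for j in range(n)]
        total = sum(w)
        Q.append([v / total for v in w])
    return Q

def sne_cost(P, ys):
    """Sum over i of KL(P_i || Q_i)."""
    Q = q_matrix(ys)
    n = len(ys)
    return sum(P[i][j] * math.log(P[i][j] / Q[i][j])
               for i in range(n) for j in range(n)
               if j != i and P[i][j] > 0.0)

def sne_gradient(P, ys):
    """Gradient of the SNE cost with respect to every map point."""
    Q = q_matrix(ys)
    n, dim = len(ys), len(ys[0])
    grads = []
    for i in range(n):
        g = [0.0] * dim
        for j in range(n):
            if j == i:
                continue
            # spring stiffness: mismatch between data and map similarities
            stiffness = (P[i][j] - Q[i][j]) + (P[j][i] - Q[j][i])
            for d in range(dim):
                g[d] += 2.0 * stiffness * (ys[i][d] - ys[j][d])
        grads.append(g)
    return grads

# Hypothetical input similarities (rows sum to 1, zero diagonal) and map points.
P = [[0.0, 0.7, 0.3],
     [0.6, 0.0, 0.4],
     [0.5, 0.5, 0.0]]
ys = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.5)]

grad = sne_gradient(P, ys)
lr = 0.01  # small illustrative step size
new_ys = [tuple(y_d - lr * g_d for y_d, g_d in zip(y, g))
          for y, g in zip(ys, grad)]
print(sne_cost(P, ys), sne_cost(P, new_ys))
```

A positive stiffness pulls two map points together (the map similarity is too low), while a negative stiffness pushes them apart.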