visualizing Similarity Data with a Mixture of Maps
Introduction
The main idea of this paper is to show how we can utilize several different two-dimensional maps in order to visualize a set of pairwise similarities. Aspect maps resemble both clustering (in modeling pair-wise similarities as a mixture of different types of similarity) and multi-dimensional scaling (in modeling each type of similarity by a two-dimensional map) . While methods such as PCA and MDS (Metric Multi-dimensional Scaling) are simple and fast, their main drawback can be seen in minimizing a cost function that is mainly focused on modeling large dissimilarities rather than small ones. As a result of that, they do not provide good visualizations of data that lies on a curved low-dimensional manifold in a high dimensional space. Also methods such as Local MDS, LLE, Maximum Variance Unfolding or Stochastic Neighbour Embedding (SNE) model local distances accurately in the two-dimensional visualization, but modeling of larger distances is done inaccurately.
SNE outweighs methods such as LLE in two ways: Despite difficulty of optimizing the SNE objective function, it leads to much better solutions and since SNE is based on probabilistic model, it is much more efficient in producing better visualization. In the next section, we will explain how SNE works.
Stochastic Neighbour Embedding
The core of SNE method <ref> G. Hinton and S. Roweis. Stochastic neighbor embedding. Advances in Neural Information Processing Systems, 15:833–840, 2003 </ref> lies in converting high-dimensional distance or similarity data into a set of [math]\displaystyle{ \mathbf{ p_{j|i} } }[/math], each of which represent the probability that one object [math]\displaystyle{ i }[/math] pick another object [math]\displaystyle{ j }[/math] as its neighbour if it was only allowed to pick one neighbour. For objects in high dimensional Euclidian space, where our data points consists of the coordinates of the objects, we can find [math]\displaystyle{ \mathbf{ p_{j|i} } }[/math] for each object [math]\displaystyle{ i }[/math] by using a spherical Gaussian distribution centered at the high-dimensional position of [math]\displaystyle{ i }[/math], [math]\displaystyle{ \mathbf{ X_{i}} }[/math]. We will set [math]\displaystyle{ \mathbf{ p_{i|i} = 0 } }[/math] and for [math]\displaystyle{ \mathbf{ j \neq i } }[/math],
Intuitively, if one object, say [math]\displaystyle{ \, i }[/math], is only allowed to pick one neighbor, say [math]\displaystyle{ \, j }[/math], [math]\displaystyle{ \, j }[/math] should be the best one with the minimum relative distance. In other words, the more the relative distance from [math]\displaystyle{ \, i }[/math], the less the probability of being chosen as one-allowed neighbor. With this intuition, it makes sense to define [math]\displaystyle{ \mathbf p_{j|i} }[/math] so that the numerator is proportional to [math]\displaystyle{ \, j }[/math]'s distance from [math]\displaystyle{ \, i }[/math] and the denominator is proportional to sum of all probable neighbors' distance from [math]\displaystyle{ \, i }[/math].
Note that given a set of pairwise distances between objects, [math]\displaystyle{ \mathbf{|| x_i - x_j ||} }[/math], we can use the above equation to derive the same probabilities. In practice, given a set of [math]\displaystyle{ N }[/math] points, we set the variance of the Gaussian [math]\displaystyle{ \mathbf{ \sigma_i ^2} }[/math], either by hand or we find it by a binary search for the values of [math]\displaystyle{ \mathbf{ \sigma_i } }[/math] that make the entropy of the distribution over neighbours equal to [math]\displaystyle{ \mathbf{ \log_2 M} }[/math] (Remember that the entropy of the distribution [math]\displaystyle{ \mathbf{ P_i} }[/math] is defined as [math]\displaystyle{ \int_{-\infty}^{+\infty}p(x)\log(1/p(x))dx }[/math] and [math]\displaystyle{ \mathbf{ p(x)\log(1/p(x))} }[/math] is understood to be zero when [math]\displaystyle{ \mathbf{p(x)=0)} }[/math].) This is done by starting from a number [math]\displaystyle{ \mathbf{ M \ll N} }[/math] and performing the binary search until the entropy of [math]\displaystyle{ \mathbf{ P_i} }[/math] is within some predetermined small tolerance of [math]\displaystyle{ \mathbf{\log_2 M } }[/math].
Our main goal in SNE is to model [math]\displaystyle{ \mathbf{p_{j|i}} }[/math] by using the conditional probabilities [math]\displaystyle{ \mathbf{q_{j|i}} }[/math], which are determined by the locations [math]\displaystyle{ \mathbf{ y_i} }[/math] of points in low-dimensional space:
The aim of embedding is to match these two distributions as well as possible. To do so, we minimize a cost function which is a sum of Kullback-Leibler divergences between the original [math]\displaystyle{ \mathbf{p_{j|i}} }[/math] and induced [math]\displaystyle{ \mathbf{ q_{j|i}} }[/math] distributions over neighbours for each object:
The dimensionality of the [math]\displaystyle{ \mathbf{Y} }[/math] space is chosen to be much less than the number of objects.
The SNE cost function focuses more on local structure than global structure
In the introduction, we mentioned that SNE focuses more on modeling small similarities (local structure) rather than large similarities (global structure). The precise meaning of this statement can be understood by looking at the cost function.
Notice that making [math]\displaystyle{ \mathbf{ q_{j|i}} }[/math] large when [math]\displaystyle{ \mathbf{ p_{j|i}} }[/math] is small wastes some of the probability mass in the [math]\displaystyle{ \mathbf{Q} }[/math] distribution so there is a cost for modeling a big distance in the high-dimensional space, though it is less than the cost of modeling a small distance with a big one. Therefore SNE is an improvement over methods like LLE; While SNE emphasizes local distances, its cost function cleanly enforces both keeping the images of nearby objects nearby and keeping the images of widely separated objects relatively far apart.
Computational aspect of SNE
Despite the fact that differentiating [math]\displaystyle{ \mathbf{C} }[/math] is tedious because [math]\displaystyle{ \mathbf{y_k} }[/math] affects [math]\displaystyle{ \mathbf{ q_{j|i}} }[/math] via the normalization term in its definition, the final result is simple and has nice physical interpretation:
Using the steepest descent for minimizing [math]\displaystyle{ \mathbf{C} }[/math] in which all of the points are adjusted in parallel is inefficient and has the drawback of getting stuck in poor local minima. In order to address this problem, we add gaussian noise to the [math]\displaystyle{ \mathbf{y} }[/math] values after each update. We start by a high level of noise and reduce the noise level rapidly to find the approximate noise level at which the structure starts starts to form in the low-dimensional map. Once we observed that a small increase in the noise level leads to a large decrease in the cost function, we can be sure that a structure is emerging; Now by repeating this process and starting from the noise level just above the level at which structure emerged and refining it gently, we can find low-dimensional maps that are significantly better minima of [math]\displaystyle{ \mathbf{C} }[/math].
Symmetric SNE
An alternative approach to SNE which is based on minimizing the divergence between conditional distributions, is to define a single joint distribution over all non-identical ordered pairs:
In this case we define [math]\displaystyle{ \mathbf{p_{ij}} }[/math] by
[math]\displaystyle{ \mathbf{q_{ij}} }[/math]'s are defined by
and finally the symmetric version of our cost function, [math]\displaystyle{ \mathbf{C_{sym}} }[/math], becomes the KL divergence between the two distributions
The benefit of defining [math]\displaystyle{ \mathbf{p_{ij}} }[/math]'s like this is getting much simpler derivatives. If one of the high-dimensional points, [math]\displaystyle{ \mathbf{j} }[/math], is far from all of the others, all of the [math]\displaystyle{ \mathbf{p_{.j}} }[/math] will be very small. In this case, we replace [math]\displaystyle{ \mathbf{p_{ij}} }[/math] by [math]\displaystyle{ \mathbf{p_{ij}=0.5(p_{j|i}+p_{i|j})} }[/math]. When [math]\displaystyle{ \mathbf{j} }[/math] is far from all the other points, all of the [math]\displaystyle{ \mathbf{p_{j|i}} }[/math] will be very small, but [math]\displaystyle{ \mathbf{p_{.|j}} }[/math] will sum to 1.
Aspect Maps
Another approach for defining [math]\displaystyle{ \mathbf{q_{j|i}} }[/math] is allowing [math]\displaystyle{ \mathbf{i} }[/math] and [math]\displaystyle{ \mathbf{j} }[/math] to occur in several different two-dimensional maps and assigning a mixing proportion [math]\displaystyle{ \mathbf{\pi_{i}^{m}} }[/math] in m-th map to the object [math]\displaystyle{ \mathbf{i} }[/math]. Note that we should have [math]\displaystyle{ \mathbf{\sum_{m} \pi_{i}^{m}=1} }[/math]. Now by using these different maps, we define [math]\displaystyle{ \mathbf{q_{j|i}} }[/math] as follows:
where
Using a mixture model is very different from simply using a single space that has extra dimensions, because points that are far apart on one dimension cannot have a high [math]\displaystyle{ \mathbf{q_{j|i}} }[/math] no matter how close together they are on the other dimensions; On the contrary, when we use a mixture model, provided that there is at least one map in which [math]\displaystyle{ \mathbf{i} }[/math] is close to [math]\displaystyle{ \mathbf{j} }[/math] and provided that the versions of [math]\displaystyle{ \mathbf{i} }[/math] and [math]\displaystyle{ \mathbf{j} }[/math] in that map have high mixing proportions, it is possible for for [math]\displaystyle{ \mathbf{q_{j|i}} }[/math] to be quite large even if [math]\displaystyle{ \mathbf{i} }[/math] and [math]\displaystyle{ \mathbf{j} }[/math] are far apart in all the other maps.
To optimize the aspect map models, we used Carl-Rasmussen's "minimize function" given in <ref> www.kyb.tuebingen.mpg.de/bs/people/carl/code/minimize/ </ref>. The gradiants are given by:
Now by substituting the definition of [math]\displaystyle{ \mathbf{z_k} }[/math] and reshuffling the terms we will have:
In practice, we will not be using mixing proportions [math]\displaystyle{ \mathbf{\pi_i^m} }[/math] themselves as parameters of the model; Instead, we define [math]\displaystyle{ \mathbf{w_i^m} }[/math] by:
as a result of that, the gradient becomes:
Modeling Human Word Association Data
In order to see how SNE works in practice, authors used The University of South Florida database on human word associations which is available on the web. Participants in the study
were presented with a list of English words as cues, and asked to respond to each word with a word which was “meaningfully related or strongly associated” <ref> D. L. Nelson, C. L. McEvoy, and T. A. Schreiber. The university of south florida word association, rhyme, and word fragment norms. In http://www.usf.edu/FreeAssociation/, 1998. </ref> The database contains 5018 cue words, with an average of 122 responses to each.
SNE has some problems with ambiguous words such as 'fire', 'wood', and 'job'. Puting 'fire' close to 'wood' , and 'job' where the two latter words are not related is a misconception in SNE. AMSNE has a solution for this kind of problem by considering the word 'fire' as a mixture of two different meanings. 'fire', in one map, is close to 'wood' and in the other is close to 'job'. Besides ambiguity, a word belong to different places for a different reason. This may be seen in the word 'death' which is close to 'sad' & 'cancer' and 'military' & 'destruction' as well.
Applications
In NIPS 2008 Conference, there was a demonstration by L. van der Maaten and G.Hinton on Visualizing NIPS Cooperations using Multiple Maps t-SNE <ref>http://nips.cc/Conferences/2008/Program/event.php?ID=1472 </ref>. Their demonstration showed visualizations of NIPS co-authorships that were constructed by the multiple maps version of t-SNE. They showed that it is impossible for multidimensional scaling techniques to construct an appropriate visualization of the similarity data, because of the triangle inequality problem and therefore they created multiple maps version of t-SNE based on the idea proposed in this paper. In these maps, each author has a copy , and each copy is weighted by a mixing proportion (the mixing proportions for a single author over all maps sum up to 1). The multiple maps version of t-SNE can deal with the triangle inequality problem, and as a result, it is very good at visualizing NIPS co-authorship data.
References
<references/>