Visualizing Similarity Data with a Mixture of Maps


Introduction

The main idea of this paper is to show how several different two-dimensional maps can be used to visualize a set of pairwise similarities. Aspect maps resemble both clustering (in modeling pairwise similarities as a mixture of different types of similarity) and multi-dimensional scaling (in modeling each type of similarity by a two-dimensional map). While methods such as PCA and MDS (metric multi-dimensional scaling) are simple and fast, their main drawback is that they minimize a cost function that is dominated by the large dissimilarities rather than the small ones. As a result, they do not provide good visualizations of data that lies on a curved low-dimensional manifold in a high-dimensional space. In contrast, methods such as local MDS, LLE, maximum variance unfolding, and stochastic neighbour embedding (SNE) model local distances accurately in the two-dimensional visualization, but model larger distances inaccurately.

SNE has two advantages over methods such as LLE: despite the difficulty of optimizing the SNE objective function, it leads to much better solutions, and since SNE is based on a probabilistic model, it is much more effective at producing good visualizations. In the next section, we explain how SNE works.

Stochastic Neighbour Embedding

The core of the SNE method <ref> G. Hinton and S. Roweis. Stochastic neighbor embedding. Advances in Neural Information Processing Systems, 15:833–840, 2003. </ref> lies in converting high-dimensional distance or similarity data into a set of probabilities [math]\displaystyle{ \mathbf{ p_{j|i} } }[/math], each of which represents the probability that object [math]\displaystyle{ i }[/math] picks object [math]\displaystyle{ j }[/math] as its neighbour if it were only allowed to pick one neighbour. For objects in a high-dimensional Euclidean space, where our data points consist of the coordinates of the objects, we can find [math]\displaystyle{ \mathbf{ p_{j|i} } }[/math] for each object [math]\displaystyle{ i }[/math] by using a spherical Gaussian distribution centered at the high-dimensional position of [math]\displaystyle{ i }[/math], [math]\displaystyle{ \mathbf{ X_{i}} }[/math]. We set [math]\displaystyle{ \mathbf{ p_{i|i} = 0 } }[/math] and, for [math]\displaystyle{ \mathbf{ j \neq i } }[/math],

[math]\displaystyle{ \mathbf p_{j|i} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma_i ^2 )}{\sum_{k \neq i} \exp(-||x_i-x_k ||^2/ 2\sigma_i ^2 ) } }[/math]

Intuitively, if one object, say [math]\displaystyle{ \, i }[/math], is only allowed to pick one neighbour, say [math]\displaystyle{ \, j }[/math], then [math]\displaystyle{ \, j }[/math] should be the object with the smallest relative distance. In other words, the greater the relative distance from [math]\displaystyle{ \, i }[/math], the smaller the probability of being chosen as the single allowed neighbour. With this intuition, it makes sense to define [math]\displaystyle{ \mathbf p_{j|i} }[/math] so that the numerator decays with [math]\displaystyle{ \, j }[/math]'s distance from [math]\displaystyle{ \, i }[/math] and the denominator normalizes over all candidate neighbours of [math]\displaystyle{ \, i }[/math].
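
For concreteness, here is a minimal NumPy sketch (our own illustration, not code from the paper) of how the [math]\displaystyle{ \mathbf{ p_{j|i} } }[/math] can be computed from a data matrix, assuming the bandwidths [math]\displaystyle{ \mathbf{ \sigma_i } }[/math] are already given:

<pre>
import numpy as np

def conditional_probs(X, sigmas):
    """Compute p_{j|i} for all pairs, given an (N, D) data matrix X and a
    length-N vector of per-point Gaussian bandwidths sigma_i."""
    sq_norms = np.sum(X ** 2, axis=1)
    D2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T   # ||x_i - x_j||^2
    P = np.exp(-D2 / (2.0 * sigmas[:, None] ** 2))
    np.fill_diagonal(P, 0.0)            # p_{i|i} = 0
    P /= P.sum(axis=1, keepdims=True)   # normalize each row over j != i
    return P
</pre>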

Note that given only a set of pairwise distances between objects, [math]\displaystyle{ \mathbf{|| x_i - x_j ||} }[/math], we can use the same equation to derive the probabilities. In practice, given a set of [math]\displaystyle{ N }[/math] points, we set the variance of the Gaussian, [math]\displaystyle{ \mathbf{ \sigma_i ^2} }[/math], either by hand or by a binary search for the value of [math]\displaystyle{ \mathbf{ \sigma_i } }[/math] that makes the entropy of the distribution over neighbours equal to [math]\displaystyle{ \mathbf{ \log_2 M} }[/math]. (Recall that the entropy of the discrete distribution [math]\displaystyle{ \mathbf{ P_i} }[/math] is [math]\displaystyle{ \sum_{j} p_{j|i}\log_2 (1/p_{j|i}) }[/math], where a term [math]\displaystyle{ \mathbf{ p\log_2(1/p)} }[/math] is understood to be zero when [math]\displaystyle{ \mathbf{p=0} }[/math].) Here [math]\displaystyle{ \mathbf{ M} }[/math] is a chosen effective number of local neighbours with [math]\displaystyle{ \mathbf{ M \ll N} }[/math], and the binary search is run until the entropy of [math]\displaystyle{ \mathbf{ P_i} }[/math] is within some predetermined small tolerance of [math]\displaystyle{ \mathbf{\log_2 M } }[/math].
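
The binary search itself can be sketched as follows (again our own illustration; the initial bracket, tolerance, and iteration limit are assumptions, not values from the paper):

<pre>
import numpy as np

def find_sigma(sq_dists_i, M, tol=1e-4, max_iter=50):
    """Binary search for sigma_i so that the entropy of P_i equals log2(M).
    sq_dists_i: squared distances from point i to all the other points."""
    target = np.log2(M)
    lo, hi = 0.0, np.inf
    sigma = 1.0
    for _ in range(max_iter):
        p = np.exp(-sq_dists_i / (2.0 * sigma ** 2))
        p /= p.sum()
        entropy = -np.sum(p * np.log2(p + 1e-12))   # entropy in bits
        if abs(entropy - target) < tol:
            break
        if entropy > target:        # distribution too flat: shrink sigma
            hi = sigma
        else:                       # distribution too peaked: grow sigma
            lo = sigma
        sigma = 0.5 * (lo + hi) if np.isfinite(hi) else 2.0 * sigma
    return sigma
</pre>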

Our main goal in SNE is to model [math]\displaystyle{ \mathbf{p_{j|i}} }[/math] by using the conditional probabilities [math]\displaystyle{ \mathbf{q_{j|i}} }[/math], which are determined by the locations [math]\displaystyle{ \mathbf{ y_i} }[/math] of points in low-dimensional space:

[math]\displaystyle{ q_{j|i} = \frac{\exp(-||y_i-y_j ||^2)}{\sum_{k \neq i} \exp(-||y_i-y_k ||^2) } }[/math]

The aim of embedding is to match these two distributions as well as possible. To do so, we minimize a cost function which is a sum of Kullback-Leibler divergences between the original [math]\displaystyle{ \mathbf{p_{j|i}} }[/math] and induced [math]\displaystyle{ \mathbf{ q_{j|i}} }[/math] distributions over neighbours for each object:

[math]\displaystyle{ C = \sum_{i} KL(P_i||Q_i) =\sum_{i}\sum_{j \neq i} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}} }[/math]

The dimensionality of the [math]\displaystyle{ \mathbf{Y} }[/math] space is chosen to be much less than the number of objects.
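
As an illustrative sketch (ours, using the definitions above), the induced [math]\displaystyle{ \mathbf{q_{j|i}} }[/math] and the cost [math]\displaystyle{ \mathbf{C} }[/math] can be computed from the map positions as follows:

<pre>
import numpy as np

def sne_cost(P, Y):
    """Sum of KL(P_i || Q_i), given the matrix P of p_{j|i} (zero diagonal,
    rows summing to 1) and the (N, 2) matrix Y of low-dimensional positions."""
    sq_norms = np.sum(Y ** 2, axis=1)
    D2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * Y @ Y.T
    Q = np.exp(-D2)
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum(axis=1, keepdims=True)        # q_{j|i}
    mask = P > 0                             # terms with p_{j|i} = 0 contribute nothing
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))
</pre>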

The SNE cost function focuses more on local structure than global structure

In the introduction, we mentioned that SNE focuses more on modeling small distances (local structure) than large distances (global structure). The precise meaning of this statement can now be understood by looking at the cost function, or more specifically, at the asymmetry of the Kullback-Leibler divergence, in the sense that [math]\displaystyle{ KL(X_1||X_2) \neq KL(X_2||X_1) \, }[/math].

Recall that in the context of coding theory, [math]\displaystyle{ KL(P||Q) \, }[/math] is the number of extra bits required to code samples from [math]\displaystyle{ P \, }[/math] (the true distribution) using a code based on [math]\displaystyle{ Q\, }[/math] (the model distribution). Also recall that in the optimal code, a sample with higher probability is coded using fewer bits. When using a code based on [math]\displaystyle{ Q\, }[/math] to code samples from [math]\displaystyle{ P\, }[/math], some inefficiency is unavoidable. Note that it is much more wasteful to (1) use longer-than-optimal codes for samples which have high probability than to (2) use shorter-than-optimal codes for samples which have low probability. Case (1) corresponds to using a small [math]\displaystyle{ q_i \, }[/math] to model a big [math]\displaystyle{ p_i \, }[/math]; Case (2) corresponds to using a big [math]\displaystyle{ q_i \, }[/math] to model a small [math]\displaystyle{ p_i \, }[/math]. This asymmetry is also clear from the mathematical definition of the KL divergence.

Similarly, in SNE, it is much more costly to (1) use a small [math]\displaystyle{ \mathbf{ q_{j|i}} }[/math] to model a big [math]\displaystyle{ \mathbf{ p_{j|i}} }[/math] than to (2) use a big [math]\displaystyle{ \mathbf{ q_{j|i}} }[/math] to model a small [math]\displaystyle{ \mathbf{ p_{j|i}} }[/math]. Case (1) corresponds to modeling small distances incorrectly; Case (2) corresponds to modeling big distances incorrectly.
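
A small numerical example (our own, purely illustrative) of this asymmetry: assigning a small model probability to a high-probability outcome is penalized far more heavily than assigning a large model probability to a low-probability outcome.

<pre>
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.8, 0.1, 0.1])        # "true" distribution

q1 = np.array([0.10, 0.45, 0.45])    # small q for the big p (case 1)
q2 = np.array([0.55, 0.35, 0.10])    # big q for a small p (case 2)

print(kl(p, q1))   # roughly 1.36 nats: very costly
print(kl(p, q2))   # roughly 0.17 nats: much cheaper
</pre>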

Therefore SNE is an improvement over methods like LLE: while it emphasizes local distances, its cost function cleanly enforces both keeping the images of nearby objects nearby and keeping the images of widely separated objects relatively far apart.

Computational aspects of SNE

Although differentiating [math]\displaystyle{ \mathbf{C} }[/math] is tedious (because [math]\displaystyle{ \mathbf{y_k} }[/math] affects [math]\displaystyle{ \mathbf{ q_{j|i}} }[/math] through the normalization term in its definition), the final result is simple and has a nice physical interpretation:

[math]\displaystyle{ \frac{\partial C}{\partial y_i} = 2\sum_{j} (y_i-y_j)([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) }[/math]

Minimizing [math]\displaystyle{ \mathbf{C} }[/math] by plain steepest descent, in which all of the points are adjusted in parallel, is inefficient and tends to get stuck in poor local minima. To address this, Gaussian noise is added to the [math]\displaystyle{ \mathbf{y} }[/math] values after each update. We start with a high noise level and reduce it rapidly to find the approximate noise level at which structure starts to form in the low-dimensional map. Once we observe that a small reduction in the noise level leads to a large decrease in the cost function, we can be fairly sure that structure is emerging. By then repeating the optimization, starting from a noise level just above the level at which structure emerged and reducing it gently, we can find low-dimensional maps that are significantly better minima of [math]\displaystyle{ \mathbf{C} }[/math].
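
A sketch of the gradient and of one noisy update step (our own illustration; the learning rate and noise level are assumptions, not values from the paper):

<pre>
import numpy as np

def sne_grad(P, Q, Y):
    """dC/dy_i = 2 * sum_j (y_i - y_j) * ([p_{j|i} - q_{j|i}] + [p_{i|j} - q_{i|j}]),
    where P[i, j] = p_{j|i} and Q[i, j] = q_{j|i}."""
    PQ = (P - Q) + (P - Q).T
    grad = np.zeros_like(Y)
    for i in range(Y.shape[0]):
        grad[i] = 2.0 * np.sum(PQ[i][:, None] * (Y[i] - Y), axis=0)
    return grad

def noisy_update(Y, P, Q, lr=0.1, noise_level=0.5):
    """One steepest-descent step followed by Gaussian jitter; the noise level
    is annealed over the course of the optimization."""
    Y = Y - lr * sne_grad(P, Q, Y)
    return Y + noise_level * np.random.randn(*Y.shape)
</pre>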

Symmetric SNE

An alternative to basic SNE, which minimizes a sum of divergences between conditional distributions, is to define a single joint distribution over all non-identical pairs of objects and to minimize the divergence between it and its low-dimensional counterpart.

In this case we define [math]\displaystyle{ \mathbf{p_{ij}} }[/math] by

[math]\displaystyle{ \mathbf p_{ij} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma^2 )}{\sum_{k \lt l} \exp(-||x_k-x_l ||^2/ 2\sigma^2 ) } }[/math]

The [math]\displaystyle{ \mathbf{q_{ij}} }[/math] are defined by

[math]\displaystyle{ \mathbf q_{ij} = \frac{\exp(-||y_i-y_j ||^2 )}{\sum_{k \lt l} \exp(-||y_k-y_l ||^2) } }[/math]

and finally the symmetric version of our cost function, [math]\displaystyle{ \mathbf{C_{sym}} }[/math], becomes the KL divergence between the two distributions

[math]\displaystyle{ C_{sym} = KL(P||Q) =\sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} }[/math]

The benefit of defining the [math]\displaystyle{ \mathbf{p_{ij}} }[/math] in this way is that the derivatives become much simpler. The drawback is that if one of the high-dimensional points, [math]\displaystyle{ \mathbf{j} }[/math], is far from all of the others, all of the [math]\displaystyle{ \mathbf{p_{ij}} }[/math] involving [math]\displaystyle{ \mathbf{j} }[/math] will be very small, so its position in the map is poorly determined. In this case, we instead define [math]\displaystyle{ \mathbf{p_{ij}=0.5(p_{j|i}+p_{i|j})} }[/math]: even when [math]\displaystyle{ \mathbf{j} }[/math] is far from all the other points and all of the [math]\displaystyle{ \mathbf{p_{j|i}} }[/math] are very small, the conditional probabilities [math]\displaystyle{ \mathbf{p_{i|j}} }[/math] still sum to 1 over [math]\displaystyle{ \mathbf{i} }[/math], which guarantees that [math]\displaystyle{ \mathbf{j} }[/math] makes a significant contribution to the cost.
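
A one-line sketch of this symmetrization (ours, following the definition above):

<pre>
import numpy as np

def symmetric_probs(P_cond):
    """Given the matrix of conditionals p_{j|i} (zero diagonal, rows sum to 1),
    return p_ij = 0.5 * (p_{j|i} + p_{i|j}); each column then contributes a
    total of at least 0.5, even for an outlying point j."""
    return 0.5 * (P_cond + P_cond.T)
</pre>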

Remark on "symmetry"

"Symmetric" SNE refers to the fact that [math]\displaystyle{ \mathbf{p_{ij}} = \mathbf{p_{ji}} }[/math] as opposed to [math]\displaystyle{ \mathbf{p_{i|j}} \neq \mathbf{p_{j|i}} }[/math] in basic SNE. Symmetric SNE still uses the asymmetric KL divergence in its cost function.

Aspect Maps

Another approach to defining [math]\displaystyle{ \mathbf{q_{j|i}} }[/math] is to allow [math]\displaystyle{ \mathbf{i} }[/math] and [math]\displaystyle{ \mathbf{j} }[/math] to occur in several different two-dimensional maps and to assign object [math]\displaystyle{ \mathbf{i} }[/math] a mixing proportion [math]\displaystyle{ \mathbf{\pi_{i}^{m}} }[/math] in the m-th map. Note that we must have [math]\displaystyle{ \mathbf{\sum_{m} \pi_{i}^{m}=1} }[/math]. Using these different maps, we define [math]\displaystyle{ \mathbf{q_{j|i}} }[/math] as follows:

[math]\displaystyle{ q_{j|i} = \frac{\sum_{m} \pi_{i}^{m}\pi_{j}^{m} e^{-d_{i,j}^{m}} }{z_i} }[/math]

where

[math]\displaystyle{ d_{i,j}^{m}=|| y_i^m-y_j^m ||^2, \quad z_i=\sum_{h}\sum_{m} \pi_{i}^{m} \pi_{h}^{m} e^{-d_{i,h}^{m}} }[/math]

Using a mixture model is very different from simply using a single space with extra dimensions, because points that are far apart along one dimension cannot have a high [math]\displaystyle{ \mathbf{q_{j|i}} }[/math] no matter how close together they are on the other dimensions. In contrast, when we use a mixture model, provided that there is at least one map in which [math]\displaystyle{ \mathbf{i} }[/math] is close to [math]\displaystyle{ \mathbf{j} }[/math], and provided that the versions of [math]\displaystyle{ \mathbf{i} }[/math] and [math]\displaystyle{ \mathbf{j} }[/math] in that map have high mixing proportions, it is possible for [math]\displaystyle{ \mathbf{q_{j|i}} }[/math] to be quite large even if [math]\displaystyle{ \mathbf{i} }[/math] and [math]\displaystyle{ \mathbf{j} }[/math] are far apart in all the other maps.
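
A sketch of this mixture-of-maps computation (our own illustration, following the definitions above):

<pre>
import numpy as np

def aspect_q(Ys, Pi):
    """q_{j|i} for a mixture of maps.
    Ys: list of M arrays, each of shape (N, 2), the point locations in each map.
    Pi: (N, M) array of mixing proportions; each row sums to 1."""
    N = Pi.shape[0]
    num = np.zeros((N, N))
    for m, Y in enumerate(Ys):
        sq = np.sum(Y ** 2, axis=1)
        D2 = sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T              # d_{i,j}^m
        num += (Pi[:, m][:, None] * Pi[:, m][None, :]) * np.exp(-D2)
    np.fill_diagonal(num, 0.0)            # exclude h = i from the normalizer
    z = num.sum(axis=1, keepdims=True)    # z_i
    return num / z
</pre>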

To optimize the aspect map models, we used Carl Rasmussen's "minimize" function <ref> www.kyb.tuebingen.mpg.de/bs/people/carl/code/minimize/ </ref>. The gradients are given by:

[math]\displaystyle{ \frac{\partial C}{\partial \pi_i^m}=-\sum_{k}\sum_{l \neq k} p_{l|k} \frac{\partial}{\partial \pi_i^m} [\log q_{l|k}z_k -\log z_k] }[/math]

Substituting the definition of [math]\displaystyle{ \mathbf{z_k} }[/math] and rearranging the terms, we obtain:

[math]\displaystyle{ \frac{\partial C}{\partial \pi_i^m}=\sum_{j}[\frac{1}{q_{j|i} z_i}(q_{j|i}-p_{j|i})+\frac{1}{q_{i|j} z_j}(q_{i|j}-p_{i|j}) ] \pi_{j}^{m}e^{-d^m_{i,j}} }[/math]

In practice, we do not use the mixing proportions [math]\displaystyle{ \mathbf{\pi_i^m} }[/math] themselves as parameters of the model; instead, we parameterize them through weights [math]\displaystyle{ \mathbf{w_i^m} }[/math] via

[math]\displaystyle{ \pi_i^m = \frac{e^{-w_i^m}}{\sum_{m'}e^{-w_i^{m'}}} }[/math]

As a result, the gradient with respect to the weights becomes:

[math]\displaystyle{ \frac{\partial C}{\partial w_i^m} = \pi_i^m \left[ \left(\sum_{m'}\frac{\partial C}{\partial \pi_i^{m'}} \pi_i^{m'}\right)-\frac{\partial C}{\partial \pi_i^m}\right] }[/math]
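
This chain rule through the softmax-style parameterization can be sketched as follows (ours, directly implementing the expression above):

<pre>
import numpy as np

def weight_grad(dC_dpi, Pi):
    """dC/dw_i^m = pi_i^m * ( sum_{m'} dC/dpi_i^{m'} * pi_i^{m'} - dC/dpi_i^m ).
    dC_dpi and Pi are (N, M) arrays; rows of Pi sum to 1."""
    inner = np.sum(dC_dpi * Pi, axis=1, keepdims=True)   # sum over m'
    return Pi * (inner - dC_dpi)
</pre>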

Modeling Human Word Association Data

To see how SNE works in practice, the authors used the University of South Florida database of human word associations, which is available on the web. Participants in the study were presented with a list of English words as cues and asked to respond to each word with a word that was “meaningfully related or strongly associated” <ref> D. L. Nelson, C. L. McEvoy, and T. A. Schreiber. The University of South Florida word association, rhyme, and word fragment norms. http://www.usf.edu/FreeAssociation/, 1998. </ref>. The database contains 5018 cue words, with an average of 122 responses to each.
SNE has problems with ambiguous words such as 'fire', 'wood', and 'job': because 'fire' is associated with both 'wood' and 'job', SNE is forced to place 'fire' close to both of them, even though 'wood' and 'job' are not related to each other. The aspect map version of SNE (AMSNE) solves this kind of problem by treating the word 'fire' as a mixture of two different meanings: in one map 'fire' is close to 'wood', and in another it is close to 'job'. Besides ambiguity, a word may belong in different places for other reasons. This can be seen with the word 'death', which is close to 'sad' and 'cancer' in one map and to 'military' and 'destruction' in another.

Applications

At the NIPS 2008 conference, there was a demonstration by L. van der Maaten and G. Hinton on Visualizing NIPS Cooperations using Multiple Maps t-SNE <ref> http://nips.cc/Conferences/2008/Program/event.php?ID=1472 </ref>. Their demonstration showed visualizations of NIPS co-authorships constructed by the multiple maps version of t-SNE. They argued that multidimensional scaling techniques cannot construct an appropriate visualization of this similarity data because of the triangle inequality problem (such similarity data need not obey the triangle inequality), and they therefore created a multiple maps version of t-SNE based on the idea proposed in this paper. In these maps, each author has a copy, and each copy is weighted by a mixing proportion (the mixing proportions for a single author sum to 1 over all maps). The multiple maps version of t-SNE can deal with the triangle inequality problem and, as a result, is very good at visualizing the NIPS co-authorship data.

References

<references/>