visualizing Similarity Data with a Mixture of Maps

From statwiki
Jump to navigation Jump to search

Introduction

The main idea of this paper is to show how we can utilize several different two-dimensional maps in order to visualize a set of pairwise similarities. Aspect maps resemble both clustering (in modeling pair-wise similarities as a mixture of different types of similarity) and multi-dimensional scaling (in modeling each type of similarity by a two-dimensional map) . While methods such as PCA and MDS (Metric Multi-dimensional Scaling) are simple and fast, their main drawback can be seen in minimizing a cost function that is mainly focused on modeling large dissimilarities rather than small ones. As a result of that, they do not provide good visualizations of data that lies on a curved low-dimensional manifold in a high dimensional space. Also methods such as Local MDS, LLE, Maximum Variance Unfolding or Stochastic Neighbour Embedding (SNE) model local distances accurately in the two-dimensional visualization, but modeling of larger distances is done inaccurately.

SNE outweighs methods such as LLE in two ways: Despite difficulty of optimizing the SNE objective function, it leads to much better solutions and since SNE is based on probabilistic model, it is much more efficient in producing better visualization. In the next section, we will explain how SNE works.

Stochastic Neighbour Embedding

The core of SNE method <ref> G. Hinton and S. Roweis. Stochastic neighbor embedding. Advances in Neural Information Processing Systems, 15:833–840, 2003 </ref> lies in converting high-dimensional distance or similarity data into a set of [math]\displaystyle{ \mathbf{ p_{j|i} } }[/math], each of which represent the probability that one object [math]\displaystyle{ i }[/math] pick another object [math]\displaystyle{ j }[/math] as its neighbour if it was only allowed to pick one neighbour. For objects in high dimensional Euclidian space, where our data points consists of the coordinates of the objects, we can find [math]\displaystyle{ \mathbf{ p_{j|i} } }[/math] for each object [math]\displaystyle{ i }[/math] by using a spherical Gaussian distribution centered at the high-dimensional position of [math]\displaystyle{ i }[/math], [math]\displaystyle{ \mathbf{ X_{i}} }[/math]. We will set [math]\displaystyle{ \mathbf{ p_{i|i} = 0 } }[/math] and for [math]\displaystyle{ \mathbf{ j \neq i } }[/math],

[math]\displaystyle{ \mathbf p_{j|i} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma_i ^2 )}{\sum_{k \neq i} \exp(-||x_i-x_k ||^2/ 2\sigma_i ^2 ) } }[/math]

Intuitively, if one object,say [math]\displaystyle{ \, i }[/math], is only allowed to pick one neighbor, say [math]\displaystyle{ \, j }[/math], [math]\displaystyle{ \, j }[/math] should be the best one with the minimum relative distance. In other words, the more the relative distance from [math]\displaystyle{ \, i }[/math], the less the probability of being chosen as one-allowed neighbor. With this intuition, it makes sense to define [math]\displaystyle{ \mathbf p_{j|i} }[/math] so that the nominator is proportional to [math]\displaystyle{ \, j }[/math] distance from [math]\displaystyle{ \, i }[/math] and the denominator is proportional to sum of all probable neighbors distance from [math]\displaystyle{ \, i }[/math].

Note that given a set of pairwise distances between objects, [math]\displaystyle{ \mathbf{|| x_i - x_j ||} }[/math], we can use the above equation to derive the same probabilities. In practice, given a set of [math]\displaystyle{ N }[/math] points, we set the variance of the Gaussian [math]\displaystyle{ \mathbf{ \sigma_i ^2} }[/math], either by hand or we find it by a binary search for the values of [math]\displaystyle{ \mathbf{ \sigma_i } }[/math] that make the entropy of the distribution over neighbours equal to [math]\displaystyle{ \mathbf{ \log_2 M} }[/math] (Remember that the entropy of the distribution [math]\displaystyle{ \mathbf{ P_i} }[/math] is defined as [math]\displaystyle{ \int_{-\infty}^{+\infty}p(x)\log(1/p(x))dx }[/math] and [math]\displaystyle{ \mathbf{ p(x)\log(1/p(x))} }[/math] is understood to be zero when [math]\displaystyle{ \mathbf{p(x)=0)} }[/math].) This is done by starting from a number [math]\displaystyle{ \mathbf{ M \ll N} }[/math] and performing the binary search until the entropy of [math]\displaystyle{ \mathbf{ P_i} }[/math] is within some predetermined small tolerance of [math]\displaystyle{ \mathbf{\log_2 M } }[/math].

Our main goal in SNE is to model [math]\displaystyle{ \mathbf{p_{j|i}} }[/math] by using the conditional probabilities [math]\displaystyle{ \mathbf{q_{j|i}} }[/math], which are determined by the locations [math]\displaystyle{ \mathbf{ y_i} }[/math] of points in low-dimensional space:

[math]\displaystyle{ q_{j|i} = \frac{\exp(-||y_i-y_j ||^2)}{\sum_{k \neq i} \exp(-||y_i-y_k ||^2) } }[/math]

The aim of embedding is to match these two distributions as well as possible. To do so, we minimize a cost function which is a sum of Kullback-Leibler divergences between the original [math]\displaystyle{ \mathbf{p_{j|i}} }[/math] and induced [math]\displaystyle{ \mathbf{ q_{j|i}} }[/math] distributions over neighbours for each object:

[math]\displaystyle{ C = \sum_{i} KL(P_i||Q_i) =\sum_{i}\sum_{j \neq i} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}} }[/math]

The dimensionality of the [math]\displaystyle{ \mathbf{Y} }[/math] space is chosen to be much less than the number of objects. Notice that making [math]\displaystyle{ \mathbf{ q_{j|i}} }[/math] large when [math]\displaystyle{ \mathbf{ p_{j|i}} }[/math] is small wastes some of the probability mass in the [math]\displaystyle{ \mathbf{Q} }[/math] distribution so there is a cost for modeling a big distance in the high-dimensional space, though it is less than the cost of modeling a small distance with a big one. Therefore SNE is an improvement over methods like LLE; While SNE emphasizes local distances, its cost function cleanly enforces both keeping the images of nearby objects nearby and keeping the images of widely separated objects relatively far apart. Despite the fact that differentiating [math]\displaystyle{ \mathbf{C} }[/math] is tedious because [math]\displaystyle{ \mathbf{y_k} }[/math] affects [math]\displaystyle{ \mathbf{ q_{j|i}} }[/math] via the normalization term in its definition, the final result is simple and has nice physical interpretation:

[math]\displaystyle{ \frac{\partial C}{\partial y_i} = 2\sum_{j} (y_i-y_j)([p_{j|i}-q_{j|i}]+[p_{i|j}-q_{i|j}]) }[/math]

Using the steepest descent for minimizing [math]\displaystyle{ \mathbf{C} }[/math] in which all of the points are adjusted in parallel is inefficient and has the drawback of getting stuck in poor local minima. In order to address this problem, we add gaussian noise to the [math]\displaystyle{ \mathbf{y} }[/math] values after each update. We start by a high level of noise and reduce the noise level rapidly to find the approximate noise level at which the structure starts starts to form in the low-dimensional map. Once we observed that a small increase in the noise level leads to a large decrease in the cost function, we can be sure that a structure is emerging; Now by repeating this process and starting from the noise level just above the level at which structure emerged and refining it gently, we can find low-dimensional maps that are significantly better minima of [math]\displaystyle{ \mathbf{C} }[/math].

Symmetric SNE

An alternative approach to SNE which is based on minimizing the divergence between conditional distributions, is to define a single joint distribution over all non-identical ordered pairs:

In this case we define [math]\displaystyle{ \mathbf{p_{ij}} }[/math] by

[math]\displaystyle{ \mathbf p_{ij} = \frac{\exp(-||x_i-x_j ||^2/ 2\sigma^2 )}{\sum_{k \lt l} \exp(-||x_k-x_l ||^2/ 2\sigma^2 ) } }[/math]

[math]\displaystyle{ \mathbf{q_{ij}} }[/math]'s are defined by

[math]\displaystyle{ \mathbf q_{ij} = \frac{\exp(-||y_i-y_j ||^2 )}{\sum_{k \lt l} \exp(-||x_k-x_l ||^2) } }[/math]

and finally the symmetric version of our cost function, [math]\displaystyle{ \mathbf{C_{sym}} }[/math], becomes the KL divergence between the two distributions

[math]\displaystyle{ C_{sym} = KL(P||Q) =\sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} }[/math]

The benefit of defining [math]\displaystyle{ \mathbf{p_{ij}} }[/math]'s like this is getting much simpler derivatives. If one of the high-dimensional points, [math]\displaystyle{ \mathbf{j} }[/math], is far from all of the others, all of the [math]\displaystyle{ \mathbf{p_{.j}} }[/math] will be very small. In this case, we replace [math]\displaystyle{ \mathbf{p_{ij}} }[/math] by [math]\displaystyle{ \mathbf{p_{ij}=0.5(p_{j|i}+p_{i|j})} }[/math]. When [math]\displaystyle{ \mathbf{j} }[/math] is far from all the other points, all of the [math]\displaystyle{ \mathbf{p_{j|i}} }[/math] will be very small, but [math]\displaystyle{ \mathbf{p_{.|j}} }[/math] will sum to 1.

Aspect Maps

Another approach for defining [math]\displaystyle{ \mathbf{q_{j|i}} }[/math] is allowing [math]\displaystyle{ \mathbf{i} }[/math] and [math]\displaystyle{ \mathbf{j} }[/math] to occur in several different two-dimensional maps and assigning a mixing proportion [math]\displaystyle{ \mathbf{\pi_{i}^{m}} }[/math] in m-th map to the object [math]\displaystyle{ \mathbf{i} }[/math]. Note that we should have [math]\displaystyle{ \mathbf{\sum_{m} \pi_{i}^{m}=1} }[/math]. Now by using these different maps, we define [math]\displaystyle{ \mathbf{q_{j|i}} }[/math] as follows:

[math]\displaystyle{ q_{i|j} = \frac{\sum_{m} \pi_{i}^{m}\pi_{j}^{m} e^{-d_{i,j}^{m}} }{z_i} }[/math]

where

[math]\displaystyle{ d_{i,j}^{m}=|| y_i^m-y_j^m ||^2, \quad z_i=\sum_{h}\sum_{m} \pi_{i}^{m} \pi_{h}^{m} e^{-d_{i,h}^{m}} }[/math]

Using a mixture model is very different from simply using a single space that has extra dimensions, because points that are far apart on one dimension cannot have a high [math]\displaystyle{ \mathbf{q_{j|i}} }[/math] no matter how close together they are on the other dimensions; On the contrary, when we use a mixture model, provided that there is at least one map in which [math]\displaystyle{ \mathbf{i} }[/math] is close to [math]\displaystyle{ \mathbf{j} }[/math] and provided that the versions of [math]\displaystyle{ \mathbf{i} }[/math] and [math]\displaystyle{ \mathbf{j} }[/math] in that map have high mixing proportions, it is possible for for [math]\displaystyle{ \mathbf{q_{j|i}} }[/math] to be quite large even if [math]\displaystyle{ \mathbf{i} }[/math] and [math]\displaystyle{ \mathbf{j} }[/math] are far apart in all the other maps.

To optimize the aspect map models, we used Carl-Rasmussen's "minimize function" given in <ref> www.kyb.tuebingen.mpg.de/bs/people/carl/code/minimize/ </ref>. The gradiants are given by:

[math]\displaystyle{ \frac{\partial C}{\partial \pi_i^m}=-\sum_{k}\sum_{l \neq k} p_{l|k} \frac{\partial}{\partial \pi_i^m} [\log q_{l|k}z_k -\log z_k] }[/math]

Now by substituting the definition of [math]\displaystyle{ \mathbf{z_k} }[/math] and reshuffling the terms we will have:

[math]\displaystyle{ \frac{\partial C}{\partial \pi_i^m}=\sum_{j}[\frac{1}{q_{j|i} z_i}(q_{j|i}-p_{j|i})+\frac{1}{q_{i|j} z_j}(q_{i|j}-p_{i|j}) ] \pi_{j}^{m}e^{-d^m_{i,j}} }[/math]

In practice, we will not be using mixing proportions [math]\displaystyle{ \mathbf{\pi_i^m} }[/math] themselves as parameters of the model; Instead, we define [math]\displaystyle{ \mathbf{w_i^m} }[/math] by:

[math]\displaystyle{ \pi_i^m = \frac{e^{-w_i^m}}{\sum_{m'}e^{-w_i^{m'}}} }[/math]

as a result of that, the gradient becomes:

[math]\displaystyle{ \frac{\partial C}{\partial w_i^m} = \pi_i^m \left[ \left(\sum_{m'}\frac{\partial C}{\partial \pi_i^{m'}} \pi_i^{m'}\right)-\frac{\partial C}{\partial \pi_i^m}\right] }[/math]

Modeling Human Word Association Data

In order to see how SNE works in practice, authors used The University of South Florida database on human word associations which is available on the web. Participants in the study were presented with a list of English words as cues, and asked to respond to each word with a word which was “meaningfully related or strongly associated” <ref> D. L. Nelson, C. L. McEvoy, and T. A. Schreiber. The university of south florida word association, rhyme, and word fragment norms. In http://www.usf.edu/FreeAssociation/, 1998. </ref> The database contains 5018 cue words, with an average of 122 responses to each.

References

<references/>