kernelized Locality-Sensitive Hashing

Locality Sensitive Hashing (LSH) is a form of dimension reduction that finds embedding of high dimensional data into a low dimensional hamming space while providing probabilistic collision guarantees. That is, similar data points will have the same low dimensional mapping with high probability. One immediate application of LSH is large scale nearest neighbour search/classification. A large database of objects (e.g. images) can be partitioned into disjoint buckets so that objects of a single bucket share the same low dimensional representation which is used as a key to that bucket. At query time, the low dimensional representation of the query object determines a single bucket of “the most probably similar objects” which can then be searched in the traditional way.

For each similarity measure, a locality sensitive hashing method has to be designed carefully to ensure practically acceptable probabilistic collision guarantees. It was also noticed that most of the previous work done on LSH assumes that data points come from multidimensional vector space and the underlying embedding is explicitly known. However, that is not always the case. For example the RBF kernel maps the data to an infinite dimensional space which is intractable to explicitly work with. This paper generalizes locality sensitive hashing by proposing a fully kernelized method that provides the desirable collision guarantees.

Preliminaries

Central Limit Theorem

Suppose [math]\displaystyle{ \mathcal{D} }[/math] is a multivariate distribution with mean [math]\displaystyle{ \mu }[/math] and covariance [math]\displaystyle{ \Sigma }[/math]. Let [math]\displaystyle{ x_1, x_2, ..., x_t }[/math] be [math]\displaystyle{ t }[/math] random vectors sampled i.i.d from [math]\displaystyle{ \mathcal{D} }[/math]. The central limit theorem tells us that for sufficiently large [math]\displaystyle{ t }[/math], the random vector

[math]\displaystyle{ z_t = \sqrt{t}(\bar{x}_t - \mu) }[/math]

approximately follows a multivariate Gaussian distribution [math]\displaystyle{ \mathcal{N}(0,\Sigma) }[/math], where [math]\displaystyle{ \bar{x}_t = \frac{1}{t} \sum_i x_i }[/math]

Whitening Transform

If [math]\displaystyle{ x }[/math] is a random vector sampled from some multivariate distribution [math]\displaystyle{ \mathcal{D} }[/math] of a zero mean and covariance [math]\displaystyle{ \Sigma }[/math] then, the random vector [math]\displaystyle{ \Sigma^{-1/2}x }[/math] is of an identity covariance matrix.

[math]\displaystyle{ \Sigma }[/math] is defined as [math]\displaystyle{ \Sigma = E[xx^T] }[/math].

If [math]\displaystyle{ \Sigma }[/math] is positive definite, It will have an eigen decomposition in the form [math]\displaystyle{ \Sigma = V \Lambda V^T }[/math] with all eigenvalues [math]\displaystyle{ \gt 0 }[/math] (note that [math]\displaystyle{ \Sigma }[/math] is always symmetric positive semidefinite). Consequently, [math]\displaystyle{ \Sigma^{-1/2} = V \Lambda ^{-1/2} V^T }[/math]. The covariance of the random vector [math]\displaystyle{ \Sigma^{-1/2}x }[/math] is then,

[math]\displaystyle{ \tilde{\Sigma} = E[V \Lambda ^{-1/2} V^T x x^T V \Lambda ^{-1/2} V^T] = V \Lambda ^{-1/2} V^T E[xx^T] V \Lambda ^{-1/2} V^T = V \Lambda ^{-1/2} V^T \Sigma V \Lambda ^{-1/2} V^T = I }[/math]

Note that the positive definiteness assumption of [math]\displaystyle{ \Sigma }[/math] does not always hold making it impossible to do such a transformation in that case.

Kernel Centering

Given [math]\displaystyle{ n }[/math] data points that are accessible only through a kernel function and we are interested in the kernel matrix of a centered version of data points (with the mean subtracted from all points). Let the explicit data representation in the kernel space be [math]\displaystyle{ \Phi \in \mathbb{R}^{d \times n} }[/math]. An explicit centering of [math]\displaystyle{ \Phi }[/math] is given by [math]\displaystyle{ \Phi-\frac{1}{n}\Phi\textbf{e}\textbf{e}^T }[/math], where [math]\displaystyle{ \textbf{e} }[/math] is a vector of all ones of size [math]\displaystyle{ n }[/math].

Computing the kernel of the centered data:

[math]\displaystyle{ K_{cnt} = (\Phi-\frac{1}{n}\Phi\textbf{e}\textbf{e}^T)^T(\Phi-\frac{1}{n}\Phi\textbf{e}\textbf{e}^T) = \Phi^T\Phi - \frac{1}{n}\Phi^T\Phi \textbf{e}\textbf{e}^T-\frac{1}{n}\textbf{e}\textbf{e}^T\Phi^T\Phi + \frac{1}{n^2}\textbf{e}\textbf{e}^T\Phi^T\Phi\textbf{e}\textbf{e}^T = K - \frac{1}{n}K\textbf{e}\textbf{e}^T-\frac{1}{n}\textbf{e}\textbf{e}^TK + \frac{\textbf{e}^TK\textbf{e}}{n^2}\textbf{e}\textbf{e}^T }[/math]

where [math]\displaystyle{ K }[/math] is the kernel matrix of the original data points.

Locality Sensitive Hashing

The basic idea of Locality Sensitive Hashing [2] is to map each data point to a b-bit vector (called the hashkey) constructed by applying b independent binary-valued hash functions [math]\displaystyle{ h_1,h_2,...,h_b }[/math] to the data points. The hash functions are designed carefully so that similar data points (in terms of some similarity measure) are mapped to the same hashkey with high probability allowing sublinear retrieval of approximate nearest neighbors to any given data point (query).

A valid hash function [math]\displaystyle{ h }[/math] must satisfy the property

[math]\displaystyle{ Pr[h(x_i)= h(x_j)] = sim(x_i, x_j) }[/math]

where [math]\displaystyle{ sim(x_i, x_j) \in [0,1] }[/math] is the similarity between two the data points [math]\displaystyle{ x_i }[/math] and [math]\displaystyle{ x_j }[/math] in terms of the similarity measure for which the hashing scheme is being designed.

For example consider the cosine similarity: [math]\displaystyle{ sim(x_i, x_j) = \frac{x_i^Tx_j}{\|x_i\|_2 \|x_j\|_2} }[/math].

It was shown in [3] that for a vector [math]\displaystyle{ r }[/math] of the same dimension as the original data points and [math]\displaystyle{ r \sim \mathcal{N}(0,\textit{I}) }[/math], that

[math]\displaystyle{ Pr[sign(x_i^Tr) = sign(x_j^Tr)] = 1 - \frac{1}{\pi} cos^{-1} (\frac{x_i^Tx_j}{\|x_i\|_2 \|x_j\|_2}) }[/math]

Based on the result above, the author in [2] proposed the following hashing function for the cosine similarity:

[math]\displaystyle{ h_r(x) = \begin{cases} 1 & r^Tx \ge 0 \\ 0 & \text{otherwise} \end{cases} }[/math] (1)

We can obtain a b-bits vector for each data point [math]\displaystyle{ x }[/math] by applying (1) with b independently sampled vectors [math]\displaystyle{ r_1,r_2, ..., r_b }[/math] from [math]\displaystyle{ \mathcal{N}(0,\textit{I}) }[/math] .

Similarly, different hash functions have been developed in previous work to support other similarity measures (e.g. [math]\displaystyle{ \ell_p }[/math] norm, inner product, Mahalanobis metric, Jaccard similarity, cosine similarity, edit similarity and Hamming similarity).

Note that similarity measure is closely related to distance measure. We can convert those similarity measures to distance measures.

A distance measure should satisfy the following conditions:

(1) [math]\displaystyle{ d(x,x) \ge 0 }[/math]

(2) d(x,y) == 0 if and only if x==y

(3) d(x,y) == d(y,x)

(4) [math]\displaystyle{ d(x,y) \le d(x,z)+d(z,y) }[/math] (the triangle inequality)

We can show that the corresponding distance measures (Jaccard distance, cosine distance, edit distance and Hamming distance) satisfy the conditions.

This paper aims at developing a hashing function that can support arbitrary kernels.

Kernelized Locality Sensitive Hashing

Building on the results that led to the hashing function in (1), the authors defined a normalized similarity function in terms of an arbitrary kernel [math]\displaystyle{ k(.,.) }[/math] in the form:

[math]\displaystyle{ \begin{align} sim(x_i,x_j) & = k(x_i,x_j)/(\sqrt{k(x_i,x_i)k(x_j,x_j)}) \\ & = \phi(x_i)^T\phi(x_j)/(||\phi(x_i)||_2 ||\phi(x_j)||_2) &&& (2) \end{align} }[/math]

for some (possibly unknown) embedding function [math]\displaystyle{ \phi(.) }[/math]. The random projection based hashing method in (1) requires that both the random vector [math]\displaystyle{ r }[/math] and the explicit embedding of the data point [math]\displaystyle{ \phi(x) }[/math] to be known. That requirement will not be fulfilled when a fully kernelized approach is to be developed.

The main trick used by the authors is to construct [math]\displaystyle{ r }[/math] as a weighted sum of a subset of the data points embedded into the kernel space. That does not only make is possible to compute [math]\displaystyle{ h_r(\phi(x)) }[/math] in (1) in terms of kernels. But also, it ensures that [math]\displaystyle{ r }[/math] approximately follows a gaussian distribution as required in (1).

Let the embedding of the data points in the kernel space ([math]\displaystyle{ \phi(x_i) \forall i = 1,2,..,n }[/math]) follows some distribution [math]\displaystyle{ \mathcal{D} }[/math] with mean [math]\displaystyle{ \mu }[/math] and covariance [math]\displaystyle{ \Sigma }[/math]. From the central limit theorem, the random vector [math]\displaystyle{ z_t = \sqrt{t}(\frac{1}{t}\sum_{i \in S} \phi(x_i) - \mu) }[/math] follows a gaussian distribution [math]\displaystyle{ \mathcal{N}(0,\Sigma) }[/math] where [math]\displaystyle{ S }[/math] is a set of sufficiently large number of data points ([math]\displaystyle{ t }[/math]) sampled i.i.d from [math]\displaystyle{ \mathcal{D} }[/math]. Further, [math]\displaystyle{ \Sigma^{-1/2}z_t \sim \mathcal{N}(0,I) }[/math] by applying a whitening transform to [math]\displaystyle{ z_t }[/math]. Now, using [math]\displaystyle{ r = \Sigma^{-1/2}z_t }[/math],we can defined a hashing function for the similarity measure defined in (2) as

[math]\displaystyle{ h_r(x) = \begin{cases} 1 & \phi(x)^T\Sigma^{-1/2}z_t \ge 0 \\ 0 & \text{otherwise} \end{cases} }[/math]

The next goal is to estimate [math]\displaystyle{ \mu }[/math] and [math]\displaystyle{ \Sigma }[/math] using a sample of [math]\displaystyle{ p }[/math] data points {[math]\displaystyle{ \phi(x_1), \phi(x_2), ..,\phi(x_p) }[/math]} and computing [math]\displaystyle{ h(\phi(x)) = sign(\phi(x)^T\Sigma^{-1/2}z_t) }[/math] in a fully kernelized way.

Let K be a centered kernel over the [math]\displaystyle{ p }[/math] data points (see section 1.3 above). Also let the eigen decomposition of K be [math]\displaystyle{ K=U \Theta U^T }[/math] and the eigen decomposition of [math]\displaystyle{ \Sigma }[/math] be [math]\displaystyle{ \Sigma= V \Lambda V^T }[/math], so [math]\displaystyle{ \Sigma^{-1/2}=V \Lambda^{-1/2} V^T }[/math]. Let [math]\displaystyle{ \theta_k }[/math] and [math]\displaystyle{ \Lambda_k }[/math] denote the [math]\displaystyle{ k^{th} }[/math] eigenvalues of [math]\displaystyle{ K }[/math] and [math]\displaystyle{ \Sigma }[/math] respectively. And Let [math]\displaystyle{ u_k }[/math] and [math]\displaystyle{ v_k }[/math] denote the [math]\displaystyle{ k^{th} }[/math] eigenvectors of [math]\displaystyle{ K }[/math] and [math]\displaystyle{ \Sigma }[/math] respectively. Note that [math]\displaystyle{ \theta_k = \Lambda_k }[/math]

It is easy to verify that:

[math]\displaystyle{ \phi(x)^T\Sigma^{-1/2}z_t = \sum_{k=1}^{p} \frac{1}{\sqrt{\theta_k}}v_k^T\phi(x)v_k^Tz_t \qquad (3) }[/math]

It can be noticed that both [math]\displaystyle{ v_k^T\phi(x) }[/math] and [math]\displaystyle{ v_k^Tz_t }[/math] are projections of [math]\displaystyle{ \phi(x) }[/math] and [math]\displaystyle{ z_t }[/math] to the [math]\displaystyle{ k^{th} }[/math] eigenvector of the covariance matrix of {[math]\displaystyle{ \phi(x_1), \phi(x_2), ..,\phi(x_p) }[/math]}. That is exactly what Kernel PCA does to find a low dimensional embedding of data points in a kernelized way. Borrowing the projection formula from Kernel PCA:

[math]\displaystyle{ v_k^T\phi(x) = \sum_{i=1}^{p} \frac{1}{\sqrt{\theta_k}} u_k(i) \phi(x_i)^T \phi(x) \qquad (4) }[/math]

where [math]\displaystyle{ \phi(x_i) }[/math] is the embedding of the [math]\displaystyle{ i^{th} }[/math] (out of the [math]\displaystyle{ p }[/math]) sample and [math]\displaystyle{ u_k(i) }[/math] is the [math]\displaystyle{ i^{th} }[/math] entry of the eigenvector [math]\displaystyle{ u_k }[/math].

Similarly,

[math]\displaystyle{ v_k^Tz_t = \sum_{i=1}^{p} \frac{1}{\sqrt{\theta_k}} u_k(i) \phi(x_i)^T z_t \qquad (5) }[/math]

Substituting (4) and (5) into (3) and simplifying we get:

[math]\displaystyle{ \phi(x)^T\Sigma^{-1/2}z_t = \sum_{i=1}^{p} w(i) (\phi(x_i)^T \phi(x)) }[/math]

where [math]\displaystyle{ w(i) = \sum_{j=1}^{p}K_{ij}^{-3/2}\phi(x_j)^Tz_t }[/math].

Assume that the [math]\displaystyle{ t }[/math] samples used to compute [math]\displaystyle{ z_t }[/math] are a subset of the [math]\displaystyle{ p }[/math] samples used to estimate [math]\displaystyle{ \Sigma }[/math]. Expanding [math]\displaystyle{ z_t }[/math] which can be approximated as [math]\displaystyle{ z_t=\frac{1}{\sqrt{t}}\sum_{i \in S}\phi(x_i) }[/math] as a result of centering the [math]\displaystyle{ p }[/math] samples. [math]\displaystyle{ w(i) }[/math] is then defined as:

[math]\displaystyle{ w(i) = \frac{1}{\sqrt{t}}\sum_{j=1}^{p}\sum_{\ell \in S} K_{ij}^{-3/2} K_{j\ell} }[/math]

Now, It should be clear that [math]\displaystyle{ h(\phi(x)) }[/math] can be computed in a fully kernelized way, i.e.

[math]\displaystyle{ h(\phi(x))=sign(\sum_{i=1}^p w(i)(\phi(x_i)^T\phi(x)-\frac{1}{p}\sum_{j=1}^p\phi(x_i)^T\phi(x))) }[/math]

When computing a b-bits hash vector, a single subset of [math]\displaystyle{ p }[/math] points is needed to estimate [math]\displaystyle{ \Sigma }[/math] for all the [math]\displaystyle{ b }[/math] bits. For each individual bit, a different subset of [math]\displaystyle{ t }[/math] points needs to be sampled from the [math]\displaystyle{ p }[/math] points.

Discussion

Here is a list of discussion points:

1. Can we use Locality-Sensitive Hashing or the Kernelized version as a fast clustering method? What about the case where clustering is used as a preprocessing step? For example in [5], the authors proposed to use the centers of k-means clustering as landmarks for the Nystrom method. Can we rely on LSH-based clustering for such task?

2. Can we use LSH or KLSH to build the K-nearest neighbor affinity matrix used in LLE? what about the affinity matrix of Spectral clustering?

3. The presented method uses a sample of [math]\displaystyle{ p }[/math] data points to estimate the covariance matrix of the entire [math]\displaystyle{ n }[/math] data points. Clearly, [math]\displaystyle{ p }[/math] needs to be increased as [math]\displaystyle{ n }[/math] increases which can be a serious scalability bottleneck especially, when constructing and decomposing a kernel matrix over the [math]\displaystyle{ p }[/math] data points. The authors provided nothing but for an empirical evidence using a single data set to show that a relatively small value of [math]\displaystyle{ p }[/math] is enough. That data set was the 80 millions tiny images data set which consists of 80 millions images each of size 32 x 32. Further, the authors represented each image using Gist features which is a vector of dimension 384. Acceptable nearest neighbor retrieval results were claimed using an RBF kernel and a value of [math]\displaystyle{ p = 300 }[/math] . Do you think that evidence is enough to rely on that method for very large scale datasets? Does the value of [math]\displaystyle{ p }[/math] have something to do with the type, the nature, or the original dimension of the dataset? Does it have something to do with the kernel?

4. Quoting from the paper: “Beyond LSH, there are several other dimensionality reduction techniques based on the Johnson-Lindenstrauss lemma that involve computing random projections, and our results may additionally be of interest to those applications.” What are examples of such techniques? and How can LSH or KLSH be useful in such techniques?

Related Methods

Following is a list of interesting related topics. If time permitted, I would talk about them.

Johnson-Lindenstrauss Lemma
Random Projections
Random Fourier Features[4]

Experimental Result

The authors empirically validate the effectiveness of the proposed hashing scheme, the paper provides results on three data sets. The primary goal is to verify that the proposed KLSH method can take kernels with unknown feature embeddings, and use them to perform searches that are fast but still reliable relative to a linear scan. The paper presents results showing the percentage of database items searched with hashing as opposed to timing results, which are dependent on the particular optimizations of the code. The experiments show good results, for example in "tiny image" dataset the proposed method often retrieves neighbors very similar to those of the linear scan, but does so by searching only 0.98% of the 80 Million database images.

1. Example-Based Object Recognition

It uses Caltech 101 data set. The correspondence-based local feature kernel(CORR) is used in this experiment. The metric is trained with 15 images per class. A simple k-nearest neighbor classifier(K=1), which utilizes both a linear scan baseline and KLSH, is used to compute accuracy. There is a trade-off between accuracy and speed, under the adjustment of [math]\displaystyle{ \epsilon }[/math]. When [math]\displaystyle{ \epsilon = 0.2 }[/math], the best result of 59 percent hashing accuracy is obtained. As for baseline, its search time was quite poor. KLSH has better performance over this data set.

2.Example-Based Scene Recognition

The paper considers hashing with [math]\displaystyle{ X^2 }[/math]-kernel in this experiment. [math]\displaystyle{ X^2 }[/math]-kernel is defined by

[math]\displaystyle{ K_{X^w}(h_i,h_j)=exp(- {\frac 1 \gamma} {\sum_{k=1}^H} {\frac {(h_i(k)-h_j(k))^2} {h_i(k)+h_j(k)}} ) }[/math]

where [math]\displaystyle{ H }[/math] is the number of histogram and [math]\displaystyle{ \gamma }[/math] is a scaling parameter. A nearest neighbor classification task is taken again to compare linear scan and KLSH. And again, the results of KLSH matches better than the results of linear scan.

3.Large-Scale Image Search with Tiny Images

This paper provides results using very large-scale data of million images. Each image is of 32*32 pixels and very tiny. KLSH is applied to Gaussian RBF kernel this time. Hashing is much faster than linear scan when it comes to large scale data.

References

1. B. Kulis, K. Grauman, "Kernelized Locality-Sensitive Hashing,". In IEEE Transactions on Pattern Analysis and Machine Intelligence, June 2012.

2. M. S. Charikar, "Similarity Estimation Techniques from Rounding Algorithms". In the Annual ACM Symposium on Theory of computing, STOC 2002.

3. M. X. Goemans, D. P. Williamson, "Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming", In Journal of ACM, 1995.

4. A. Rahimi, B. Recht, "Random Features for Large-Scale Kernel Machines". In NIPS 2007.

5. K. Zhang, I. Tsang, J. Kwok, "Improved Nyström low-rank approximation and error analysis". In ICML 2008.