kernelized Locality-Sensitive Hashing

Locality Sensitive Hashing (LSH) is a form of dimension reduction that finds embedding of high dimensional data into a low dimensional hamming space while providing probabilistic collision guarantees. That is, similar data points will have the same low dimensional mapping with high probability. One immediate application of LSH is large scale nearest neighbour search/classification. A large database of objects (e.g. images) can be partitioned into disjoint buckets so that objects of a single bucket share the same low dimensional representation which is used as a key to that bucket. At query time, the low dimensional representation of the query object determines a single bucket of “the most probably similar objects” which can then be searched in the traditional way.

For each similarity measure, a locality sensitive hashing method has to be designed carefully to ensure practically acceptable probabilistic collision guarantees. It was also noticed that most of the previous work done on LSH assumes that data points come from multidimensional vector space and the underlying embedding is explicitly known. However, that is not always the case. For example the RBF kernel maps the data to an infinite dimensional space which is intractable to explicitly work with. This paper generalizes locality sensitive hashing by proposing a fully kernelized method that provides the desirable collision guarantees.

Preliminaries

Central Limit Theorem

Suppose [math]\displaystyle{ \mathcal{D} }[/math] is a multivariate distribution with mean [math]\displaystyle{ \mu }[/math] and covariance [math]\displaystyle{ \Sigma }[/math]. Let [math]\displaystyle{ x_1, x_2, ..., x_t }[/math] be [math]\displaystyle{ t }[/math] random vectors sampled i.i.d from [math]\displaystyle{ \mathcal{D} }[/math]. The central limit theorem tells us that for sufficiently large [math]\displaystyle{ t }[/math], the random vector

[math]\displaystyle{ z_t = \sqrt{t}(\bar{x}_t - \mu) }[/math]

approximately follows a multivariate Gaussian distribution [math]\displaystyle{ \mathcal{N}(0,\Sigma) }[/math], where [math]\displaystyle{ \bar{x}_t = \frac{1}{t} \sum_i x_i }[/math]

Whitening Transform

If [math]\displaystyle{ x }[/math] is a random vector sampled from some multivariate distribution [math]\displaystyle{ \mathcal{D} }[/math] of a zero mean and covariance [math]\displaystyle{ \Sigma }[/math] then, the random vector [math]\displaystyle{ \Sigma^{-1/2}x }[/math] is of an identity covariance matrix.

[math]\displaystyle{ \Sigma }[/math] is defined as [math]\displaystyle{ \Sigma = E[xx^T] }[/math].

If [math]\displaystyle{ \Sigma }[/math] is positive definite, It will have an eigen decomposition in the form [math]\displaystyle{ \Sigma = V \Lambda V^T }[/math] with all eigenvalues [math]\displaystyle{ \gt 0 }[/math] (note that [math]\displaystyle{ \Sigma }[/math] is always symmetric positive semidefinite). Consequently, [math]\displaystyle{ \Sigma^{-1/2} = V \Lambda ^{-1/2} V^T }[/math]. The covariance of the random vector [math]\displaystyle{ \Sigma^{-1/2}x }[/math] is then,

[math]\displaystyle{ \tilde{\Sigma} = E[V \Lambda ^{-1/2} V^T x x^T V \Lambda ^{-1/2} V^T] = V \Lambda ^{-1/2} V^T E[xx^T] V \Lambda ^{-1/2} V^T = V \Lambda ^{-1/2} V^T \Sigma V \Lambda ^{-1/2} V^T = I }[/math]

Note that the positive definiteness assumption of [math]\displaystyle{ \Sigma }[/math] does not always hold making it impossible to do such a transformation in that case.

Kernel Centering

Given [math]\displaystyle{ n }[/math] data points that are accessible only through a kernel function and we are interested in the kernel matrix of a centered version of data points (with the mean subtracted from all points). Let the explicit data representation in the kernel space be [math]\displaystyle{ \Phi \in \mathbb{R}^{d \times n} }[/math]. An explicit centering of [math]\displaystyle{ \Phi }[/math] is given by [math]\displaystyle{ \Phi-\frac{1}{n}\textbf{e}\textbf{e}^T\Phi }[/math], where [math]\displaystyle{ \textbf{e} }[/math] is a vector of all ones of size [math]\displaystyle{ n }[/math].

Computing the kernel of the centered data:

[math]\displaystyle{ K_{cnt} = (\Phi-\frac{1}{n}\textbf{e}\textbf{e}^T\Phi)^T(\Phi-\frac{1}{n}\textbf{e}\textbf{e}^T\Phi) = \Phi^T\Phi - \frac{1}{n}\textbf{e}\textbf{e}^T\Phi^T\Phi -\frac{1}{n}\Phi^T\Phi\textbf{e}\textbf{e}^T + \frac{1}{n^2}\textbf{e}\textbf{e}^T\Phi^T\Phi\textbf{e}\textbf{e}^T = K - \frac{1}{n}\textbf{e}\textbf{e}^TK -\frac{1}{n}K\textbf{e}\textbf{e}^T + \frac{\textbf{e}^TK\textbf{e}}{n^2}\textbf{e}\textbf{e}^T }[/math]

where [math]\displaystyle{ K }[/math] is the kernel matrix of the original data points.

Locality Sensitive Hashing

The basic idea of Locality Sensitive Hashing [2] is to map each data point to a b-bit vector (called the hashkey) constructed by applying b independent binary-valued hash functions [math]\displaystyle{ h_1,h_2,...,h_b }[/math] to the data points. The hash functions are designed carefully so that similar data points (in terms of some similarity measure) are mapped to the same hashkey with high probability allowing sublinear retrieval of approximate nearest neighbors to any given data point (query).

A valid hash function [math]\displaystyle{ h }[/math] must satisfy the property

[math]\displaystyle{ Pr[h(x_i)= h(x_j)] = sim(x_i, x_j) }[/math]

where [math]\displaystyle{ sim(x_i, x_j) \in [0,1] }[/math] is the similarity between two the data points [math]\displaystyle{ x_i }[/math] and [math]\displaystyle{ x_j }[/math] in terms of the similarity measure for which the hashing scheme is being designed.

For example consider the cosine similarity: [math]\displaystyle{ sim(x_i, x_j) = \frac{x_i^Tx_j}{||x_i||_2 ||x_j||_2} }[/math].

It was shown in [3] that for a vector [math]\displaystyle{ r }[/math] of the same dimension as the original data points and [math]\displaystyle{ r \sim \mathcal{N}(0,\textit{I}) }[/math], that

[math]\displaystyle{ Pr[sign(X_i^Tr) = sign(X_j^Tr)] = 1 - \frac{1}{\pi} cos^{-1} (\frac{x_i^Tx_j}{||x_i||_2 ||x_j||_2}) }[/math]

Based on the result above, the author in [7] proposed the following hashing function for the cosine similarity:

[math]\displaystyle{ h_r(x) = \begin{cases} 1 & r^Tx \ge 0 \\ 0 & \text{otherwise} \end{cases} }[/math] (1)

We can obtain a b-bits vector for each data point [math]\displaystyle{ x }[/math] by applying (1) with b independently sampled vectors [math]\displaystyle{ r_1,r_2, ..., r_b }[/math] from [math]\displaystyle{ \mathcal{N}(0,\textit{I}) }[/math] .

Similarly, different hash functions have been developed in previous work to support other similarity measures (e.g. [math]\displaystyle{ \ell_p }[/math] norm, inner product, and Mahalanobis metric). This paper aims at developing a hashing function that can support arbitrary kernels.