Hash Embeddings for Efficient Word Representations

From statwiki
Jump to: navigation, search

Introduction

ALmost all neural networks rely on loss function that is continuous in nature to compute the gradients for training hence because of this for whatever data we need to feed in the network has to be continuous in nature. Images can easily be represented as real-valued vectors in form of pixel values and colour intensities but this is not the same case with text input because if converted it would represent a discrete-valued distribution. There are different methods like Word2Vec or GloVe to get the specified word embeddings of a corpus. But the problem faced by these models is the number of parameters it needs to learn is quite high. There have been some solutions to it:

  • Ignore infrequent words: if we calculate the frequency of each word and filter out the least frequent words the number of parameters can be reduced but the problem with this technique is we cannot arrive at a consensus what is the best threshold for filtering.
  • Feature pruning: we can prune the feature from the embedding vector but this pruning is not possible for many models.
  • Compression: The vectors can be compressed with the help of quantization or clustering them together with the help of previously determined centroids.

For some models constructing dictionaries also is a problem which is solved by feature hashing wherein each word [math] w ∈ \tau [/math] ([math]\tau [/math] is the token space of the corpus) is assigned to a fixed bucket. If the number of buckets is low it results in collisions hence it requires us to learn the best hash function for the problem.

In the paper, authors propose to use [math]\kappa[/math] different hash functions and then train it with some parameters to find the best hash function for that particular word/token/phrase. This method has a lot of advantages and helps in reduction of parameters in the word embeddings.


Related Works

Argerich et al proposed an application of feature hashing for words called "Hash2Vec" in which they compare their results with GloVe and find that with word vectors obtained with simple feature hashing has performance equivalent to the GloVe if not better.

Hash Embeddings

Building the hash vector for the word “horse”

To generate the hash embeddings following procedure is followed.

  • They use [math]k[/math] different functions [math]H_1, . . . , H_k[/math] to choose [math]k[/math] component vectors for the token w from a predefined pool of B shared component vectors.
  • Combine the chosen component vectors from step 1 as a weighted sum given by [math] \hat{e}_w = \sum_{i=1}^k p_w^i H_i(w)[/math].
  • Importance paramaters can be concatenated with the [math] \hat{e}_w [/math] to get the final hash vector [math] e_w [/math]

The same equation can be written in vector notations as:

[math]c_w = (H_1(w), H_2(w), . . . , H_k(w))^⊤ [/math]

[math]p_w = (p_w^1, . . . , p_w^k)^⊤ [/math]

[math]\hat{e}_w = p_w^⊤c_w[/math]

[math]e_w^⊤ = \hat{e}_w^⊤⊕ p_w^⊤[/math]

( [math]⊕[/math] is the concatenation operator)

The token to component vector functions [math] H_i(w) = E_{D_2 (D_1(w))} [/math] is implemented as :

  • [math] D_1 : \tau → {1, . . . K} [/math] is a token to id function
  • [math] D_2 : {1, . . . , K} → {1, . . . B} [/math] is an id to bucket functions
  • [math]E[/math] is a [math]B × d [/math] matrix


We can replace [math]D_1 [/math] with normal enumeration if we choose to have a dictionary of the words for small values of [math]\tau[/math]

The importance parameter vectors pw are represented as rows in a [math] K × k [/math] matrix named [math]P [/math], and the token to importance vector mapping is implemented by [math]w → P_{\hat{D}_(w)}[/math] The authors decide use </math> D_1<math> as the function