Hash Embeddings for Efficient Word Representations


Introduction

Almost all neural networks rely on a continuous loss function to compute gradients for training; consequently, any data fed into the network must be represented in continuous form. Images can easily be represented as real-valued vectors of pixel values and colour intensities, but text cannot, because a direct conversion yields a discrete-valued distribution. Methods such as Word2Vec or GloVe produce word embeddings for a corpus, but these models must learn a very large number of parameters. Several solutions have been proposed:

  • Ignore infrequent words: if we compute the frequency of each word and filter out the least frequent ones, the number of parameters is reduced; the drawback is that there is no consensus on the best filtering threshold.
  • Feature pruning: features can be pruned from the embedding vector, but such pruning is not possible for many models.
  • Compression: the embedding vectors can be compressed by quantization, or by clustering them using previously determined centroids (see the sketch after this list).
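To make the compression idea concrete, here is a minimal sketch of nearest-centroid quantization, assuming the centroids have already been computed (e.g. with k-means); the array names and shapes are illustrative only, not the scheme used in any particular model.

 import numpy as np
 
 def quantize_embeddings(embeddings, centroids):
     """Replace each embedding vector by the index of its nearest centroid.
 
     embeddings: (vocab_size, dim) array of learned word vectors
     centroids:  (num_centroids, dim) array of previously determined centroids
     Storing a single centroid index per word instead of a full real-valued
     vector is what yields the compression.
     """
     # Pairwise squared Euclidean distances between vectors and centroids
     dists = ((embeddings[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
     return dists.argmin(axis=1)
 
 rng = np.random.default_rng(0)
 emb = rng.normal(size=(1000, 50))    # 1000 words, 50-dimensional embeddings
 cents = rng.normal(size=(256, 50))   # 256 centroids -> one byte per word
 codes = quantize_embeddings(emb, cents)
 print(codes.shape)                   # (1000,)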

For some models, even constructing the dictionary is a problem. This is addressed by feature hashing, wherein each word [math]\displaystyle{ w \in \tau }[/math] ([math]\displaystyle{ \tau }[/math] is the token space of the corpus) is assigned to a fixed bucket. If the number of buckets is low, collisions occur, which forces us to search for the best hash function for the problem.
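A minimal sketch of plain feature hashing as described above: a single seeded hash function maps each token to one of a fixed number of buckets (the seeded MD5-based hash is an illustrative choice, not part of the paper).

 import hashlib
 
 def hash_bucket(token, num_buckets, seed=0):
     """Map a token to a fixed bucket id in [0, num_buckets) with a seeded hash."""
     digest = hashlib.md5(f"{seed}:{token}".encode()).hexdigest()
     return int(digest, 16) % num_buckets
 
 # With few buckets, distinct tokens can collide and share an embedding row
 tokens = ["the", "cat", "sat", "on", "the", "mat"]
 print([hash_bucket(t, num_buckets=8) for t in tokens])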

In the paper, the authors propose to use [math]\displaystyle{ \kappa }[/math] different hash functions and to train importance parameters that select, for each word/token/phrase, the best hash function (or combination of hash functions). This method has several advantages and greatly reduces the number of parameters in the word embeddings.
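A rough sketch of this idea (not the authors' exact implementation): [math]\displaystyle{ \kappa }[/math] seeded hash functions each pick a component vector from a shared pool, and trainable per-token importance weights combine them into the final embedding. The table sizes, the separate hash used to look up the importance weights, and the combination by weighted sum are assumptions made for illustration.

 import numpy as np
 import hashlib
 
 def bucket(token, num_buckets, seed):
     """Seeded hash mapping a token to a bucket id in [0, num_buckets)."""
     digest = hashlib.md5(f"{seed}:{token}".encode()).hexdigest()
     return int(digest, 16) % num_buckets
 
 class HashEmbeddingSketch:
     """Sketch of a hash embedding: k hash functions select k component
     vectors from a shared pool; trainable importance weights mix them."""
 
     def __init__(self, pool_size=10_000, dim=50, k=2, weight_table=100_000, seed=0):
         rng = np.random.default_rng(seed)
         self.k = k
         self.pool_size = pool_size
         self.weight_table = weight_table
         self.pool = rng.normal(scale=0.1, size=(pool_size, dim))         # component vectors
         self.importance = rng.normal(scale=0.1, size=(weight_table, k))  # per-token weights
 
     def embed(self, token):
         # One bucket id per hash function -> k candidate component vectors
         ids = [bucket(token, self.pool_size, seed=i) for i in range(self.k)]
         components = self.pool[ids]                                      # (k, dim)
         # Importance weights are looked up via a separate hash of the token
         w = self.importance[bucket(token, self.weight_table, seed=self.k)]  # (k,)
         return w @ components                                            # weighted sum -> (dim,)
 
 emb = HashEmbeddingSketch()
 print(emb.embed("cat").shape)   # (50,)

During training, both the pool of component vectors and the importance weights would be updated by gradient descent, so the model effectively learns which hash function(s) to trust for each token.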


Related Works