distributed Representations of Words and Phrases and their Compositionality
Introduction
This paper presents several extensions of the Skip-gram model intriduced by Mikolov et al. [8]. Skip-gram model is an efficient method for learning highquality vector representations of words from large amounts of unstructured text data. The word representations computed using this model are very interesting because the learned vectors explicitly encode many linguistic regularities and patterns. Somewhat surprisingly, many of these patterns can be represented as linear translations. For example, the result of a vector calculation vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector. The authors of this paper show that subsampling of frequent words during training results in a significant speedup and improves accuracy of the representations of less frequent words. In addition, a simplified variant of Noise Contrastive Estimation (NCE) [4] for training the Skip-gram model is presented that results in faster training and better vector representations for frequent words, compared to more complex hierarchical softmax that was used in the prior work [8]. It also shown that a non-obvious degree of language understanding can be obtained by using basic mathematical operations on the word vector representations. For example, vec(“Russia”) + vec(“river”) is close to vec(“Volga River”), and vec(“Germany”) + vec(“capital”) is close to vec(“Berlin”).
The Skip-gram Model
The training objective of the Skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document. More formally, given a sequence of training words [math]\displaystyle{ w_1, w_2,..., w_T }[/math] the objective of the Skip-gram model is to maximize the average log probability:
[math]\displaystyle{
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c\leq j\leq c} log(p(w_{t+j}|w_t))
}[/math]
where [math]\displaystyle{ c }[/math] is the size of the training context (which can be a function of the center word [math]\displaystyle{ w_t }[/math]) and [math]\displaystyle{ p(w_{t+j}|w_t) }[/math] is defined using softmax function:
[math]\displaystyle{ p(w_O|w_I) = \frac{exp ({v'_{W_O}}^T v_{W_I})}{\sum{w=1}^{W} exp ({v'_{W}}^T v_{W_I})} }[/math]
Here, [math]\displaystyle{ v_w }[/math] and [math]\displaystyle{ v'_w }[/math] are the “input” and “output” vector representations of [math]\displaystyle{ w }[/math], and [math]\displaystyle{ W }[/math] is the number of words in the vocabulary.
Hierarchical Softmax
Hierarchical Softmax is a computationally efficient approximation of the full softmax [12]. Hierarchical Softmax evaluate only about [math]\displaystyle{ log_2(W) }[/math] output nodes instead of evaluating [math]\displaystyle{ W }[/math] nodes in the neural network to obtain the probability distribution.
The hierarchical softmax uses a binary tree representation of the output layer with the [math]\displaystyle{ W }[/math] words as its leaves and, for each node, explicitly represents the relative probabilities of its child nodes. These define a random walk that assigns probabilities to words.
Let [math]\displaystyle{ n(w,j) }[/math] be the [math]\displaystyle{ j^{th} }[/math] node on the path from the root to [math]\displaystyle{ w }[/math], and let [math]\displaystyle{ L(w) }[/math] be the length of this path, so [math]\displaystyle{ n(w,1) = root }[/math] and [math]\displaystyle{ n(w,L(w)) = w }[/math]. In addition, for any inner node [math]\displaystyle{ n }[/math], let [math]\displaystyle{ ch(n) }[/math] be an arbitrary fixed child of [math]\displaystyle{ n }[/math] and let [math]\displaystyle{ [[x]] }[/math] be 1 if [math]\displaystyle{ x }[/math] is true and -1 otherwise. Then the hierarchical softmax defines [math]\displaystyle{ p(w_O|w_I ) }[/math] as follows:
[math]\displaystyle{ p(w|w_I) = \prod_{j=1}^{L(w)-1} \sigma ([[n(w,j+1)=ch(n(w,j))]]{v'_{n(w,j)}}^T v_{W_I}) }[/math]
where
[math]\displaystyle{ \sigma (x)=\frac{1}{1+exp(-x)} }[/math]
In this paper, a binary Huffman tree is used as the structure for the hierarchical softmax because it assigns short codes to the frequent words which results in fast training. It has been observed before that grouping words together by their frequency works well as a very simple speedup technique for the neural network based language models [5,8].
Negative Sampling
Noise Contrastive Estimation (NCE) is an alternative to the hierarchical softmax. NCE indicates that a good model should be able to differentiate data from noise by means of logistic regression. While NCE can be shown to approximately maximize the log probability of the softmax, the Skipgram model is only concerned with learning high-quality vector representations, so we are free to simplify NCE as long as the vector representations retain their quality. Negative sampling (NEG) is defined by the objective:
[math]\displaystyle{ log \sigma ({v'_{W_O}}^T v_{W_I})+\sum_{i=1}^{k} \mathbb{E}_{w_i\sim P_n(w)}[log \sigma ({-v'_{W_i}}^T v_{W_I})] }[/math]
The main difference between the Negative sampling and NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples. And while NCE approximatelymaximizes the log probability of the softmax, this property is not important for our application.