Distributed Representations of Words and Phrases and their Compositionality
Distributed Representations of Words and Phrases and their Compositionality is a popular paper published in 2013 by a Google team led by Tomas Mikolov. It is known for its impact on the field of Natural Language Processing, and the techniques described below are still in use today.
- F. Jiang
- J. Hu
- Y. Zhang
Skip Gram Model
The Skip-gram model is a Natural Language Processing method based upon a neural network structure designed to learn vector representations of words in such a way as to produce similar encodings for words that appear in similar contexts. While the model can be used to evaluate certain probabilities, this is considered a side effect of its learning process; its primary function is that of a Word2Vec encoder.
Skip-gram is structured as a neural network with a single hidden layer that has no non-linear activation function, followed by a softmax classifier in the output layer. Words or phrases are encoded using one-hot encoding: the input and output vectors have a length equal to the size of a pre-specified vocabulary, and a word is indicated by a 1 at its index in that vocabulary (e.g. the word "ant" is indicated by a 1 at its position in the vocabulary; everything else is 0). The size of the hidden layer is a hyperparameter; a larger hidden layer generally yields encodings of better quality but takes longer to train.
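The structure described above can be sketched in a few lines of NumPy. This is only an illustration under assumed names: the five-word vocabulary, the hidden size of 3, and the random weights W_in and W_out are invented for the example and are not taken from the paper. It shows that multiplying a one-hot vector by the hidden-layer weights simply selects one row of the weight matrix.

```python
import numpy as np

vocab = ["ant", "bee", "cat", "dog", "eel"]    # toy vocabulary (assumed)
word_to_index = {w: i for i, w in enumerate(vocab)}
n, d = len(vocab), 3                           # vocabulary size, hidden-layer size

rng = np.random.default_rng(0)
W_in = rng.normal(size=(n, d))     # hidden-layer weights, one row per word
W_out = rng.normal(size=(d, n))    # output-layer weights

def one_hot(word):
    """Vector of zeros with a single 1 at the word's vocabulary index."""
    x = np.zeros(n)
    x[word_to_index[word]] = 1.0
    return x

def softmax(z):
    """Numerically stable softmax over a 1-D array of scores."""
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(word):
    """Return a probability for every vocabulary word given the input word."""
    hidden = one_hot(word) @ W_in  # linear hidden layer, no activation
    return softmax(hidden @ W_out)

probs = forward("ant")
```

Because the input is one-hot, the hidden activation for "ant" equals the first row of W_in, which is why the rows of that matrix can later be read off as word vectors.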
The central premise behind Skip-gram's learning process is that words or phrases that regularly appear close together in the training set are deemed to have similar contexts and should therefore be encoded in such a way as to maximize the probability of the model predicting their appearance together. Training data is prepared by producing a series of word pairs from the training text. A "window size" hyperparameter specifies how many words ahead of and behind the target word are taken as desired outputs, and the window is moved across every word of the passage. For example, the model may learn from the training set that "steering" and "wheel" appear in similar contexts. This means that one is a good predictor of the other, but also that "driving" is a good predictor of both. Thus, feeding any one of them into the model should produce high probabilities (of each appearing in the same context) for all the others. Once we have a neural net that predicts contextual probabilities to an acceptable degree, the hidden layer weights are saved as the desired Word2Vec encodings (as an n × d matrix, where n is the vocabulary size and d is the hidden layer size; each row is the encoding of the vocabulary word at that row index).
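The pair-generation step can be sketched as follows. The function name skipgram_pairs and the example sentence are hypothetical and chosen only for illustration; whitespace tokenization is likewise an assumption.

```python
def skipgram_pairs(tokens, window_size):
    """Return (target, context) pairs for every word within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)
        stop = min(len(tokens), i + window_size + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the driver turned the steering wheel".split()
pairs = skipgram_pairs(tokens, window_size=1)
```

With a window size of 1, each word is paired only with its immediate neighbours, so the list contains both ("steering", "wheel") and ("wheel", "steering").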
One advantage of the Skip-gram model over older N-gram models is that the encodings preserve certain linguistic patterns that manifest in clear and intuitive ways. For example, linear operations on Skip-gram encodings behave in a surprisingly logical way; the paper notes that on their trained model, vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector in the vocabulary. In a sense, subtracting "Spain" from "Madrid" extracts the notion of a capital city that, when added to "France", produces "Paris". This property is so attractive that the paper uses it as a benchmark for its Skip-gram implementations ("Do linear operations produce logical results?").
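The analogy test can be illustrated with a toy example. The 4-dimensional vectors below are hand-made so that the arithmetic works out exactly; they are not trained embeddings. Using cosine similarity and excluding the three input words from the candidates are assumptions of this sketch.

```python
import numpy as np

# Hand-made toy vectors: three "country" axes plus one "capital" axis.
vectors = {
    "Spain":   np.array([1.0, 0.0, 0.0, 0.0]),
    "France":  np.array([0.0, 1.0, 0.0, 0.0]),
    "Germany": np.array([0.0, 0.0, 1.0, 0.0]),
    "Madrid":  np.array([1.0, 0.0, 0.0, 1.0]),
    "Paris":   np.array([0.0, 1.0, 0.0, 1.0]),
    "Berlin":  np.array([0.0, 0.0, 1.0, 1.0]),
}

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    query = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

result = analogy("Madrid", "Spain", "France")  # "Paris" for these toy vectors
```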
With the full softmax, every training example updates the output-layer weights for every word in the vocabulary: for a vocabulary of 1M words, that means adjusting the weights of 1M output units per input word. This can be very slow. Noise Contrastive Estimation (NCE) is an earlier technique that reduces the number of weights updated per example. The paper introduces a simpler alternative: Negative Sampling.
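Noise Contrastive Estimation (NCE) was introduced in 2012 by Gutmann and Hyvarinen. It uses logistic regression to differentiate data from noise. NCE approximately maximizes the log probability of the softmax. This, however, is not needed for the Skip-gram model, since the goal is to learn high-quality vector representations rather than accurate probabilities. Negative Sampling simplifies NCE and replaces the log probability of each (input, output) word pair with the following objective:
$$\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]$$
Here \sigma is the logistic sigmoid, v_{w_I} is the input vector of the input word w_I, v'_{w} is the output vector of word w, w_O is the observed output word, P_n(w) is the noise distribution, and k is the number of negative samples.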
Negative Sampling retains the quality of the Skip-gram model while updating only a small subset of the output weights for each training pair: those of the one positive sample and of k negative samples. The value of k can be chosen freely, though Mikolov et al. report that values of 5-20 are useful for small training datasets, while for large datasets k can be as small as 2-5. The main difference between Negative Sampling and NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative Sampling uses only samples.
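As a rough sketch, the Negative Sampling objective for a single training pair is the log-sigmoid of the positive pair's score plus the sum of the log-sigmoids of the negated scores of the k negative samples. The code below computes its negative as a loss. The function name, the vector dimension of 4, k = 5, and the random vectors are all assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_in, v_out_pos, v_out_negs):
    """Negated objective for one positive pair and k negative samples.

    v_in:       input vector of the input word, shape (d,)
    v_out_pos:  output vector of the observed context word, shape (d,)
    v_out_negs: output vectors of the k sampled noise words, shape (k, d)
    """
    pos_term = np.log(sigmoid(v_out_pos @ v_in))
    neg_term = np.sum(np.log(sigmoid(-(v_out_negs @ v_in))))
    return -(pos_term + neg_term)

rng = np.random.default_rng(0)
d, k = 4, 5
v_in = rng.normal(size=d)
v_pos = rng.normal(size=d)
v_negs = rng.normal(size=(k, d))
loss = negative_sampling_loss(v_in, v_pos, v_negs)
```

Only the k + 1 output vectors passed in contribute to the loss, so only they (and the input vector) receive gradient updates, instead of all output vectors in the vocabulary.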
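To determine which negative samples should be chosen, a noise distribution based on the unigram distribution is used; the paper selected it on empirical grounds, finding that the unigram distribution raised to the 3/4 power outperformed both the plain unigram and the uniform distributions. It is defined as:
$$P_n(w) = \frac{U(w)^{3/4}}{Z}$$
Here U(w) is the unigram distribution, i.e. the relative frequency of the word w in the dataset, and Z is a normalizing constant.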
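A minimal sketch of drawing negative samples from a unigram distribution raised to the 3/4 power is shown below. The helper name noise_distribution, the toy token list, and the choice of five samples are assumptions for illustration only.

```python
from collections import Counter

import numpy as np

def noise_distribution(tokens, power=0.75):
    """Return (words, probabilities) for the smoothed unigram distribution."""
    counts = Counter(tokens)
    words = sorted(counts)
    weights = np.array([counts[w] for w in words], dtype=float) ** power
    return words, weights / weights.sum()  # dividing by the sum plays the role of Z

tokens = "the cat sat on the mat the end".split()
words, probs = noise_distribution(tokens)

rng = np.random.default_rng(0)
negatives = rng.choice(words, size=5, p=probs)  # five sampled noise words
```

Raising the counts to the 3/4 power lowers the share of very frequent words: here "the" makes up 3/8 of the tokens but receives a smaller sampling probability.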
Subsampling of Frequent Words
To evaluate the results of these optimizations, Mikolov et al. used an internal dataset at Google containing 1 billion words. After discarding all words that occurred fewer than 5 times, the vocabulary size was 692K words. Two types of analogies were looked at: syntactic and semantic. Syntactic analogies relate words through a grammatical transformation, such as adjective to adverb (e.g. “quick” : “quickly” :: “slow” : “slowly”). Semantic analogies relate words through their meaning, so that two pairs of words share the same relationship. For example, “Berlin” : “Germany” :: “Paris” : “France” is a semantic analogy (capital city to country).
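The vocabulary pruning step mentioned above (discarding words seen fewer than 5 times) could look roughly like this. The function name build_vocabulary and the toy token list are hypothetical; only the threshold of 5 comes from the text.

```python
from collections import Counter

def build_vocabulary(tokens, min_count=5):
    """Keep only words that occur at least min_count times."""
    counts = Counter(tokens)
    return {word: count for word, count in counts.items() if count >= min_count}

tokens = ["the"] * 6 + ["cat"] * 5 + ["zebra"] * 4
vocab = build_vocabulary(tokens)  # "zebra" occurs 4 times and is dropped
```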
 Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546.
 McCormick, C. (2017, January 11). Word2Vec Tutorial Part 2 - Negative Sampling. Retrieved from http://www.mccormickml.com