http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=Jj2hu&feedformat=atom
statwiki - User contributions [US]
2022-09-26T03:58:07Z
User contributions
MediaWiki 1.28.3
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Representations_of_Words_and_Phrases_and_their_Compositionality&diff=40694
Representations of Words and Phrases and their Compositionality
2018-11-21T16:13:53Z
<p>Jj2hu: /* References */</p>
<hr />
<div>Representations of Words and Phrases and their Compositionality is a popular paper published by the Google team led by Tomas Mikolov in 2013. It is known for its impact in the field of Natural Language Processing and the techniques described below are still used today.<br />
<br />
= Presented by = <br />
*F. Jiang<br />
*J. Hu<br />
*Y. Zhang<br />
<br />
<br />
= Introduction =<br />
This paper, "Distributed Representations of Words and Phrases and their Compositionality" proposes several methods of improving the performance metrics of the Skip-gram model introduced in a previous paper, a Natural Language Processing technique of encoding words as arbitrary dimensional vectors using a neural network framework. Notably, the Skip-gram model can be made to train faster and produce higher accuracy via a number of simple adjustments; the replacement of the hierarchical soft max function with simple negative sampling, and the subsampling of frequent words.<br />
<br />
= Skip Gram Model =<br />
The Skip-gram model is a Natural Language Processing method based upon a neural network structure designed to learn vector representations of words in such a way as to produce similar encodings for words that appear in similar contexts. While the model can be used to evaluate certain probabilities, this is considered a side effect of its learning process; its primary function is that of a Word2Vec encoder.<br />
<br />
Skip-gram is structured as a one-layer neural network with no non-linear activation function in the hidden layer but a soft-max classifier in the output layer. Words or phrases are encoded using one-hot encoding; the input and output vectors are constructed such that the index of a certain word is indicated by the number 1 within a length determined by a pre-specified vocabulary or corpus (e.g. the word "ant" indicated by 1 at its position in the corpus vector while everything else is 0). The size of the hidden layer is also specified as a hyper parameter; larger sizes of the hidden layer will result in encodings of better quality but take longer to train.<br />
<br />
[[File:skipmod2.PNG|600px|thumb|center]] <br />
<br />
The central premise behind Skip-gram's learning process is that words or phrases that appear close together regularly in the training set are deemed to have similar contexts and should therefore be encoded in such a way as to maximize the probability of the model predicting their appearance together. Training data is prepared by producing a series of word pairs from the training test via a "window size" hyper-parameter that specifies all words a certain number ahead and behind the target word as the desired output, while iterating through all the words of the passage. For example, the model will may learn from the training set that "steering" and "wheel" appear in similar contexts. This means that one is a good predictor of the other, but also that "driving" is a good predictor of both. Thus, feeding any one of them into the model should produce high probabilities (of each appearing in the same context) for the all the others. Once we have a neural net that predicts contextual probabilities to an acceptable degree, the hidden layer weights are saved as the desired Word2Vec encodings (as an nxd matrix, each row represents a single encoding for the corpus word at that row index) <br />
<br />
[[File:window_skip.PNG|left]]<br />
[[File:table_skip.PNG|center]] <br />
<br />
One advantage of the Skip-gram model over older N-gram models is that the encodings preserve certain linguistic patterns that manifest in surprisingly clear and intuitive ways. For example, linear operations work on skip-gram encodings in a surprisingly logical way; the paper notes that on their trained model, vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector in the corpus. In a sense, subtracting "Spain" from "Madrid" extracts the notion of a capital city that when added to "France", produces "Paris". This property is so attractive that the paper uses it as a benchmark for their Skip-gram implementations ("Do linear operations produce logical results?")<br />
<br />
= Hierarchical Softmax =<br />
Although the Skip-gram method is described using the soft-max function <math>(1)</math> for the purposes of computing <math>\nabla \log p(w_{O}|w_{I})</math> for the backpropagation stages, practical considerations make it difficult to calculate this gradient, particularly if the corpus contains many words since it computational complexity scales linearly with <math>W</math><br />
<br />
<math>(1)\quad p(w_{O}|w_{I}) = \frac{exp(v_{w_{O}}^{'T}v_{w_{I}})}{\sum_{w=1}^W exp(v_{w_{O}}^{'T}v_{w_{I}})} </math><br />
where <math>v_{w}</math> and <math>v_{w}'</math> are the input and output representations of <math>w</math> and <math>W</math> is the vocabulary size <br />
<br />
Instead, the Skip-gram model described in Mikolov et al. used an approximation called Hierarchical softmax which provides better asymptotic performance for large numbers of output nodes. Rather than evaluate W output nodes, hierarchical soft-max can instead evaluate <math>\log W</math> nodes. This is done by encoding the output layer using a binary or Huffman tree where the <math>W</math> words are represented as leaves and each node represents the relative probability of all its child nodes. Optimal asymptotic performance can be achieved if the tree is a balanced binary tree, in which case <math>\log W</math> complexity is possible. Soft-max probabilities are calculated using <math>()</math>.<br />
<br />
<math>(2)\quad p(w_{O}|w_{I}) = \prod_{j=1}^{L(w)-1} \sigma \Bigl(\bigl\|n(w,j+1)=ch(n(w,j))\bigr\| \cdot v_{w_{O}}^{'T}v_{w_{I}} \Bigr) </math><br />
where <math>\sigma(x) = 1/(1+exp(-x))</math>, <math>n(w,j)</math> be the j-th node on the path from the root to w, let <math>ch(n)</math> be an arbitrary fixed child of n and let <math> \|x\| </math> be 1 if x is true and -1 otherwise.<br />
<br />
= Negative Sampling =<br />
<br />
Using the Skip-gram model, for each input word inside a 1M dictionary, we are adjusting 1M weights on the output layer. This can be very slow. NCE is the previous state of art solution which can efficiently reduce the number of parameters needed. In this paper, we are showing a new technique: Negative Sampling.<br />
<br />
Noise Contrastive Estimation (NCE) was introduced in 2012 by Gutmann and Hyvarinen. It uses logistic regression to differentiate data from noise. NCE maximizes the log probability of the softmax. This however not needed for the Skip-Gram Model since our goal is learning high-quality vector representations for context encoding. Negative Sampling is defined in the following formula:<br />
<br />
[[File:wordembedding negativesampling.png|700px|thumb|center]]<br />
<br />
It retains the quality of the Skip-Gram model by only updating a subset of the dataset: k = positive samples + negative samples. The value K can be set arbitrarily, though Mikolov recommend 2-5 for a large dataset and 5-20 for a smaller dataset for it to be useful. NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples.<br />
<br />
To determine which negative samples should be chosen, an unigram distribution is chosen based on empirical results. It is defined as:<br />
<br />
[[File:Screen_Shot_2018-11-21_at_2.03.32_AM.png|700px|thumb|center]]<br />
<br />
The probability function represent the frequency of the word in the dataset.<br />
<br />
= Subsampling of Frequent Words =<br />
Frequently occurring words often do not provide as much information as a rare word. For example, the word-pair "boat, sea" is likely to occur far more likely than the word "boat, the", yet the former provides the opportunity to encode important contextual information. <br />
<br />
This is a typical case: word pairs containing commonly occurring words often do not provide as much information as rare words. Thus, in order to speed up our implementation of Speed-gram, we discard the word <math>w_{i}</math> from our sample text with probability <math> P(w_{i})=1-\sqrt{\frac{t}{f(w_{i}}} </math><br />
where <math>f(w_{i})</math> is the frequency of word <math>w_{i}</math> and <math>t</math> is a chosen threshold, typically around <math>10^{-5}</math>.<br />
<br />
As the probability of encountering a word decreases, the chance of discarding it decreases and approaches 0 as the frequency of the word approaches <math>10^{-5}</math>. The figure <math>t</math> was chosen empirically as it was shown to work well in practice; the chosen threshold aggressively sub-samples words that appear more frequently than <math>t</math> while preserving the ranking of the frequencies. One thing to note is that the function <math> P(w_{i}) </math> can have undefined behavior if a word with frequency less than <math>t</math> occurs; a simple solution is to fix <math> P(w_{i}) = 0 </math> for any such word. <br />
<br />
This procedure provides a significant speedup to our algorithm as there are a lot of frequently occurring words that can be cut, yet they often encode minimally important information. As the results show, both accuracy and training speed increase.<br />
<br />
= Empirical Results =<br />
<br />
To evaluate the results of these optimization, Mikolov and Al. used an internal dataset at Google. This dataset contains 1 billions. By removing all workings which occured less than 5 times, dataset size dropped to 692K words. Two type of data analogies where looked at: syntactic and semantic analogies. Syntactic analogies is when two words have the same meaning but describe two different things (e.g. “quick” : “quickly” :: “slow” : “slowly”). Semantic is when two pairs of words have the same vector meaning. For example, “Berlin” : “Germany” and “Paris” : “France” are semantic analogies.<br />
<br />
[[File:wordembedding empiricalresults.png|700px|thumb|center]]<br />
<br />
Finally, the model was compared to state of art models from 2013 to evaluate their accuracy. The word2vec project was trained on a dataset of 30 billions words with 1000 dimensions. A sample of its results for less used words compared the models by Collobert, Turian, and Mnih are shown below. We can see the Skip-Phrase is comparatively a lot faster to run and produce every accurate results.<br />
<br />
[[File:wordembedding_studies.png|700px|thumb|center]]<br />
<br />
=References=<br />
[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546. ''(The paper in question)''<br />
<br />
[2] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781 ''(Earlier paper)''<br />
<br />
[2] McCormick, C. (2017, January 11). Word2Vec Tutorial Part 2 - Negative Sampling. Retrieved from http://www.mccormickml.com<br />
<br />
[3] McCormick, C. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from http://www.mccormickml.com</div>
Jj2hu
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Representations_of_Words_and_Phrases_and_their_Compositionality&diff=40693
Representations of Words and Phrases and their Compositionality
2018-11-21T16:13:32Z
<p>Jj2hu: /* References */</p>
<hr />
<div>Representations of Words and Phrases and their Compositionality is a popular paper published by the Google team led by Tomas Mikolov in 2013. It is known for its impact in the field of Natural Language Processing and the techniques described below are still used today.<br />
<br />
= Presented by = <br />
*F. Jiang<br />
*J. Hu<br />
*Y. Zhang<br />
<br />
<br />
= Introduction =<br />
This paper, "Distributed Representations of Words and Phrases and their Compositionality" proposes several methods of improving the performance metrics of the Skip-gram model introduced in a previous paper, a Natural Language Processing technique of encoding words as arbitrary dimensional vectors using a neural network framework. Notably, the Skip-gram model can be made to train faster and produce higher accuracy via a number of simple adjustments; the replacement of the hierarchical soft max function with simple negative sampling, and the subsampling of frequent words.<br />
<br />
= Skip Gram Model =<br />
The Skip-gram model is a Natural Language Processing method based upon a neural network structure designed to learn vector representations of words in such a way as to produce similar encodings for words that appear in similar contexts. While the model can be used to evaluate certain probabilities, this is considered a side effect of its learning process; its primary function is that of a Word2Vec encoder.<br />
<br />
Skip-gram is structured as a one-layer neural network with no non-linear activation function in the hidden layer but a soft-max classifier in the output layer. Words or phrases are encoded using one-hot encoding; the input and output vectors are constructed such that the index of a certain word is indicated by the number 1 within a length determined by a pre-specified vocabulary or corpus (e.g. the word "ant" indicated by 1 at its position in the corpus vector while everything else is 0). The size of the hidden layer is also specified as a hyper parameter; larger sizes of the hidden layer will result in encodings of better quality but take longer to train.<br />
<br />
[[File:skipmod2.PNG|600px|thumb|center]] <br />
<br />
The central premise behind Skip-gram's learning process is that words or phrases that appear close together regularly in the training set are deemed to have similar contexts and should therefore be encoded in such a way as to maximize the probability of the model predicting their appearance together. Training data is prepared by producing a series of word pairs from the training test via a "window size" hyper-parameter that specifies all words a certain number ahead and behind the target word as the desired output, while iterating through all the words of the passage. For example, the model will may learn from the training set that "steering" and "wheel" appear in similar contexts. This means that one is a good predictor of the other, but also that "driving" is a good predictor of both. Thus, feeding any one of them into the model should produce high probabilities (of each appearing in the same context) for the all the others. Once we have a neural net that predicts contextual probabilities to an acceptable degree, the hidden layer weights are saved as the desired Word2Vec encodings (as an nxd matrix, each row represents a single encoding for the corpus word at that row index) <br />
<br />
[[File:window_skip.PNG|left]]<br />
[[File:table_skip.PNG|center]] <br />
<br />
One advantage of the Skip-gram model over older N-gram models is that the encodings preserve certain linguistic patterns that manifest in surprisingly clear and intuitive ways. For example, linear operations work on skip-gram encodings in a surprisingly logical way; the paper notes that on their trained model, vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector in the corpus. In a sense, subtracting "Spain" from "Madrid" extracts the notion of a capital city that when added to "France", produces "Paris". This property is so attractive that the paper uses it as a benchmark for their Skip-gram implementations ("Do linear operations produce logical results?")<br />
<br />
= Hierarchical Softmax =<br />
Although the Skip-gram method is described using the soft-max function <math>(1)</math> for the purposes of computing <math>\nabla \log p(w_{O}|w_{I})</math> for the backpropagation stages, practical considerations make it difficult to calculate this gradient, particularly if the corpus contains many words since it computational complexity scales linearly with <math>W</math><br />
<br />
<math>(1)\quad p(w_{O}|w_{I}) = \frac{exp(v_{w_{O}}^{'T}v_{w_{I}})}{\sum_{w=1}^W exp(v_{w_{O}}^{'T}v_{w_{I}})} </math><br />
where <math>v_{w}</math> and <math>v_{w}'</math> are the input and output representations of <math>w</math> and <math>W</math> is the vocabulary size <br />
<br />
Instead, the Skip-gram model described in Mikolov et al. used an approximation called Hierarchical softmax which provides better asymptotic performance for large numbers of output nodes. Rather than evaluate W output nodes, hierarchical soft-max can instead evaluate <math>\log W</math> nodes. This is done by encoding the output layer using a binary or Huffman tree where the <math>W</math> words are represented as leaves and each node represents the relative probability of all its child nodes. Optimal asymptotic performance can be achieved if the tree is a balanced binary tree, in which case <math>\log W</math> complexity is possible. Soft-max probabilities are calculated using <math>()</math>.<br />
<br />
<math>(2)\quad p(w_{O}|w_{I}) = \prod_{j=1}^{L(w)-1} \sigma \Bigl(\bigl\|n(w,j+1)=ch(n(w,j))\bigr\| \cdot v_{w_{O}}^{'T}v_{w_{I}} \Bigr) </math><br />
where <math>\sigma(x) = 1/(1+exp(-x))</math>, <math>n(w,j)</math> be the j-th node on the path from the root to w, let <math>ch(n)</math> be an arbitrary fixed child of n and let <math> \|x\| </math> be 1 if x is true and -1 otherwise.<br />
<br />
= Negative Sampling =<br />
<br />
Using the Skip-gram model, for each input word inside a 1M dictionary, we are adjusting 1M weights on the output layer. This can be very slow. NCE is the previous state of art solution which can efficiently reduce the number of parameters needed. In this paper, we are showing a new technique: Negative Sampling.<br />
<br />
Noise Contrastive Estimation (NCE) was introduced in 2012 by Gutmann and Hyvarinen. It uses logistic regression to differentiate data from noise. NCE maximizes the log probability of the softmax. This however not needed for the Skip-Gram Model since our goal is learning high-quality vector representations for context encoding. Negative Sampling is defined in the following formula:<br />
<br />
[[File:wordembedding negativesampling.png|700px|thumb|center]]<br />
<br />
It retains the quality of the Skip-Gram model by only updating a subset of the dataset: k = positive samples + negative samples. The value K can be set arbitrarily, though Mikolov recommend 2-5 for a large dataset and 5-20 for a smaller dataset for it to be useful. NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples.<br />
<br />
To determine which negative samples should be chosen, an unigram distribution is chosen based on empirical results. It is defined as:<br />
<br />
[[File:Screen_Shot_2018-11-21_at_2.03.32_AM.png|700px|thumb|center]]<br />
<br />
The probability function represent the frequency of the word in the dataset.<br />
<br />
= Subsampling of Frequent Words =<br />
Frequently occurring words often do not provide as much information as a rare word. For example, the word-pair "boat, sea" is likely to occur far more likely than the word "boat, the", yet the former provides the opportunity to encode important contextual information. <br />
<br />
This is a typical case: word pairs containing commonly occurring words often do not provide as much information as rare words. Thus, in order to speed up our implementation of Speed-gram, we discard the word <math>w_{i}</math> from our sample text with probability <math> P(w_{i})=1-\sqrt{\frac{t}{f(w_{i}}} </math><br />
where <math>f(w_{i})</math> is the frequency of word <math>w_{i}</math> and <math>t</math> is a chosen threshold, typically around <math>10^{-5}</math>.<br />
<br />
As the probability of encountering a word decreases, the chance of discarding it decreases and approaches 0 as the frequency of the word approaches <math>10^{-5}</math>. The figure <math>t</math> was chosen empirically as it was shown to work well in practice; the chosen threshold aggressively sub-samples words that appear more frequently than <math>t</math> while preserving the ranking of the frequencies. One thing to note is that the function <math> P(w_{i}) </math> can have undefined behavior if a word with frequency less than <math>t</math> occurs; a simple solution is to fix <math> P(w_{i}) = 0 </math> for any such word. <br />
<br />
This procedure provides a significant speedup to our algorithm as there are a lot of frequently occurring words that can be cut, yet they often encode minimally important information. As the results show, both accuracy and training speed increase.<br />
<br />
= Empirical Results =<br />
<br />
To evaluate the results of these optimization, Mikolov and Al. used an internal dataset at Google. This dataset contains 1 billions. By removing all workings which occured less than 5 times, dataset size dropped to 692K words. Two type of data analogies where looked at: syntactic and semantic analogies. Syntactic analogies is when two words have the same meaning but describe two different things (e.g. “quick” : “quickly” :: “slow” : “slowly”). Semantic is when two pairs of words have the same vector meaning. For example, “Berlin” : “Germany” and “Paris” : “France” are semantic analogies.<br />
<br />
[[File:wordembedding empiricalresults.png|700px|thumb|center]]<br />
<br />
Finally, the model was compared to state of art models from 2013 to evaluate their accuracy. The word2vec project was trained on a dataset of 30 billions words with 1000 dimensions. A sample of its results for less used words compared the models by Collobert, Turian, and Mnih are shown below. We can see the Skip-Phrase is comparatively a lot faster to run and produce every accurate results.<br />
<br />
[[File:wordembedding_studies.png|700px|thumb|center]]<br />
<br />
=References=<br />
[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546. ''(The paper in question)''<br />
<br />
[2] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781<br />
<br />
[2] McCormick, C. (2017, January 11). Word2Vec Tutorial Part 2 - Negative Sampling. Retrieved from http://www.mccormickml.com<br />
<br />
[3] McCormick, C. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from http://www.mccormickml.com</div>
Jj2hu
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Representations_of_Words_and_Phrases_and_their_Compositionality&diff=40692
Representations of Words and Phrases and their Compositionality
2018-11-21T16:09:02Z
<p>Jj2hu: /* Subsampling of Frequent Words */</p>
<hr />
<div>Representations of Words and Phrases and their Compositionality is a popular paper published by the Google team led by Tomas Mikolov in 2013. It is known for its impact in the field of Natural Language Processing and the techniques described below are still used today.<br />
<br />
= Presented by = <br />
*F. Jiang<br />
*J. Hu<br />
*Y. Zhang<br />
<br />
<br />
= Introduction =<br />
This paper, "Distributed Representations of Words and Phrases and their Compositionality" proposes several methods of improving the performance metrics of the Skip-gram model introduced in a previous paper, a Natural Language Processing technique of encoding words as arbitrary dimensional vectors using a neural network framework. Notably, the Skip-gram model can be made to train faster and produce higher accuracy via a number of simple adjustments; the replacement of the hierarchical soft max function with simple negative sampling, and the subsampling of frequent words.<br />
<br />
= Skip Gram Model =<br />
The Skip-gram model is a Natural Language Processing method based upon a neural network structure designed to learn vector representations of words in such a way as to produce similar encodings for words that appear in similar contexts. While the model can be used to evaluate certain probabilities, this is considered a side effect of its learning process; its primary function is that of a Word2Vec encoder.<br />
<br />
Skip-gram is structured as a one-layer neural network with no non-linear activation function in the hidden layer but a soft-max classifier in the output layer. Words or phrases are encoded using one-hot encoding; the input and output vectors are constructed such that the index of a certain word is indicated by the number 1 within a length determined by a pre-specified vocabulary or corpus (e.g. the word "ant" indicated by 1 at its position in the corpus vector while everything else is 0). The size of the hidden layer is also specified as a hyper parameter; larger sizes of the hidden layer will result in encodings of better quality but take longer to train.<br />
<br />
[[File:skipmod2.PNG|600px|thumb|center]] <br />
<br />
The central premise behind Skip-gram's learning process is that words or phrases that appear close together regularly in the training set are deemed to have similar contexts and should therefore be encoded in such a way as to maximize the probability of the model predicting their appearance together. Training data is prepared by producing a series of word pairs from the training test via a "window size" hyper-parameter that specifies all words a certain number ahead and behind the target word as the desired output, while iterating through all the words of the passage. For example, the model will may learn from the training set that "steering" and "wheel" appear in similar contexts. This means that one is a good predictor of the other, but also that "driving" is a good predictor of both. Thus, feeding any one of them into the model should produce high probabilities (of each appearing in the same context) for the all the others. Once we have a neural net that predicts contextual probabilities to an acceptable degree, the hidden layer weights are saved as the desired Word2Vec encodings (as an nxd matrix, each row represents a single encoding for the corpus word at that row index) <br />
<br />
[[File:window_skip.PNG|left]]<br />
[[File:table_skip.PNG|center]] <br />
<br />
One advantage of the Skip-gram model over older N-gram models is that the encodings preserve certain linguistic patterns that manifest in surprisingly clear and intuitive ways. For example, linear operations work on skip-gram encodings in a surprisingly logical way; the paper notes that on their trained model, vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector in the corpus. In a sense, subtracting "Spain" from "Madrid" extracts the notion of a capital city that when added to "France", produces "Paris". This property is so attractive that the paper uses it as a benchmark for their Skip-gram implementations ("Do linear operations produce logical results?")<br />
<br />
= Hierarchical Softmax =<br />
Although the Skip-gram method is described using the soft-max function <math>(1)</math> for the purposes of computing <math>\nabla \log p(w_{O}|w_{I})</math> for the backpropagation stages, practical considerations make it difficult to calculate this gradient, particularly if the corpus contains many words since it computational complexity scales linearly with <math>W</math><br />
<br />
<math>(1)\quad p(w_{O}|w_{I}) = \frac{exp(v_{w_{O}}^{'T}v_{w_{I}})}{\sum_{w=1}^W exp(v_{w_{O}}^{'T}v_{w_{I}})} </math><br />
where <math>v_{w}</math> and <math>v_{w}'</math> are the input and output representations of <math>w</math> and <math>W</math> is the vocabulary size <br />
<br />
Instead, the Skip-gram model described in Mikolov et al. used an approximation called Hierarchical softmax which provides better asymptotic performance for large numbers of output nodes. Rather than evaluate W output nodes, hierarchical soft-max can instead evaluate <math>\log W</math> nodes. This is done by encoding the output layer using a binary or Huffman tree where the <math>W</math> words are represented as leaves and each node represents the relative probability of all its child nodes. Optimal asymptotic performance can be achieved if the tree is a balanced binary tree, in which case <math>\log W</math> complexity is possible. Soft-max probabilities are calculated using <math>()</math>.<br />
<br />
<math>(2)\quad p(w_{O}|w_{I}) = \prod_{j=1}^{L(w)-1} \sigma \Bigl(\bigl\|n(w,j+1)=ch(n(w,j))\bigr\| \cdot v_{w_{O}}^{'T}v_{w_{I}} \Bigr) </math><br />
where <math>\sigma(x) = 1/(1+exp(-x))</math>, <math>n(w,j)</math> be the j-th node on the path from the root to w, let <math>ch(n)</math> be an arbitrary fixed child of n and let <math> \|x\| </math> be 1 if x is true and -1 otherwise.<br />
<br />
= Negative Sampling =<br />
<br />
Using the Skip-gram model, for each input word inside a 1M dictionary, we are adjusting 1M weights on the output layer. This can be very slow. NCE is the previous state of art solution which can efficiently reduce the number of parameters needed. In this paper, we are showing a new technique: Negative Sampling.<br />
<br />
Noise Contrastive Estimation (NCE) was introduced in 2012 by Gutmann and Hyvarinen. It uses logistic regression to differentiate data from noise. NCE maximizes the log probability of the softmax. This however not needed for the Skip-Gram Model since our goal is learning high-quality vector representations for context encoding. Negative Sampling is defined in the following formula:<br />
<br />
[[File:wordembedding negativesampling.png|700px|thumb|center]]<br />
<br />
It retains the quality of the Skip-Gram model by only updating a subset of the dataset: k = positive samples + negative samples. The value K can be set arbitrarily, though Mikolov recommend 2-5 for a large dataset and 5-20 for a smaller dataset for it to be useful. NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples.<br />
<br />
To determine which negative samples should be chosen, an unigram distribution is chosen based on empirical results. It is defined as:<br />
<br />
[[File:Screen_Shot_2018-11-21_at_2.03.32_AM.png|700px|thumb|center]]<br />
<br />
The probability function represent the frequency of the word in the dataset.<br />
<br />
= Subsampling of Frequent Words =<br />
Frequently occurring words often do not provide as much information as a rare word. For example, the word-pair "boat, sea" is likely to occur far more likely than the word "boat, the", yet the former provides the opportunity to encode important contextual information. <br />
<br />
This is a typical case: word pairs containing commonly occurring words often do not provide as much information as rare words. Thus, in order to speed up our implementation of Speed-gram, we discard the word <math>w_{i}</math> from our sample text with probability <math> P(w_{i})=1-\sqrt{\frac{t}{f(w_{i}}} </math><br />
where <math>f(w_{i})</math> is the frequency of word <math>w_{i}</math> and <math>t</math> is a chosen threshold, typically around <math>10^{-5}</math>.<br />
<br />
As the probability of encountering a word decreases, the chance of discarding it decreases and approaches 0 as the frequency of the word approaches <math>10^{-5}</math>. The figure <math>t</math> was chosen empirically as it was shown to work well in practice; the chosen threshold aggressively sub-samples words that appear more frequently than <math>t</math> while preserving the ranking of the frequencies. One thing to note is that the function <math> P(w_{i}) </math> can have undefined behavior if a word with frequency less than <math>t</math> occurs; a simple solution is to fix <math> P(w_{i}) = 0 </math> for any such word. <br />
<br />
This procedure provides a significant speedup to our algorithm as there are a lot of frequently occurring words that can be cut, yet they often encode minimally important information. As the results show, both accuracy and training speed increase.<br />
<br />
= Empirical Results =<br />
<br />
To evaluate the results of these optimization, Mikolov and Al. used an internal dataset at Google. This dataset contains 1 billions. By removing all workings which occured less than 5 times, dataset size dropped to 692K words. Two type of data analogies where looked at: syntactic and semantic analogies. Syntactic analogies is when two words have the same meaning but describe two different things (e.g. “quick” : “quickly” :: “slow” : “slowly”). Semantic is when two pairs of words have the same vector meaning. For example, “Berlin” : “Germany” and “Paris” : “France” are semantic analogies.<br />
<br />
[[File:wordembedding empiricalresults.png|700px|thumb|center]]<br />
<br />
Finally, the model was compared to state of art models from 2013 to evaluate their accuracy. The word2vec project was trained on a dataset of 30 billions words with 1000 dimensions. A sample of its results for less used words compared the models by Collobert, Turian, and Mnih are shown below. We can see the Skip-Phrase is comparatively a lot faster to run and produce every accurate results.<br />
<br />
[[File:wordembedding_studies.png|700px|thumb|center]]<br />
<br />
=References=<br />
[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546.<br />
<br />
[2] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781<br />
<br />
[2] McCormick, C. (2017, January 11). Word2Vec Tutorial Part 2 - Negative Sampling. Retrieved from http://www.mccormickml.com<br />
<br />
[3] McCormick, C. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from http://www.mccormickml.com</div>
Jj2hu
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Representations_of_Words_and_Phrases_and_their_Compositionality&diff=40690
Representations of Words and Phrases and their Compositionality
2018-11-21T15:57:31Z
<p>Jj2hu: /* References */</p>
<hr />
<div>Representations of Words and Phrases and their Compositionality is a popular paper published by the Google team led by Tomas Mikolov in 2013. It is known for its impact in the field of Natural Language Processing and the techniques described below are still used today.<br />
<br />
= Presented by = <br />
*F. Jiang<br />
*J. Hu<br />
*Y. Zhang<br />
<br />
<br />
= Introduction =<br />
This paper, "Distributed Representations of Words and Phrases and their Compositionality" proposes several methods of improving the performance metrics of the Skip-gram model introduced in a previous paper, a Natural Language Processing technique of encoding words as arbitrary dimensional vectors using a neural network framework. Notably, the Skip-gram model can be made to train faster and produce higher accuracy via a number of simple adjustments; the replacement of the hierarchical soft max function with simple negative sampling, and the subsampling of frequent words.<br />
<br />
= Skip Gram Model =<br />
The Skip-gram model is a Natural Language Processing method based upon a neural network structure designed to learn vector representations of words in such a way as to produce similar encodings for words that appear in similar contexts. While the model can be used to evaluate certain probabilities, this is considered a side effect of its learning process; its primary function is that of a Word2Vec encoder.<br />
<br />
Skip-gram is structured as a one-layer neural network with no non-linear activation function in the hidden layer but a soft-max classifier in the output layer. Words or phrases are encoded using one-hot encoding; the input and output vectors are constructed such that the index of a certain word is indicated by the number 1 within a length determined by a pre-specified vocabulary or corpus (e.g. the word "ant" indicated by 1 at its position in the corpus vector while everything else is 0). The size of the hidden layer is also specified as a hyper parameter; larger sizes of the hidden layer will result in encodings of better quality but take longer to train.<br />
<br />
[[File:skipmod2.PNG|600px|thumb|center]] <br />
<br />
The central premise behind Skip-gram's learning process is that words or phrases that appear close together regularly in the training set are deemed to have similar contexts and should therefore be encoded in such a way as to maximize the probability of the model predicting their appearance together. Training data is prepared by producing a series of word pairs from the training test via a "window size" hyper-parameter that specifies all words a certain number ahead and behind the target word as the desired output, while iterating through all the words of the passage. For example, the model will may learn from the training set that "steering" and "wheel" appear in similar contexts. This means that one is a good predictor of the other, but also that "driving" is a good predictor of both. Thus, feeding any one of them into the model should produce high probabilities (of each appearing in the same context) for the all the others. Once we have a neural net that predicts contextual probabilities to an acceptable degree, the hidden layer weights are saved as the desired Word2Vec encodings (as an nxd matrix, each row represents a single encoding for the corpus word at that row index) <br />
<br />
[[File:window_skip.PNG|left]]<br />
[[File:table_skip.PNG|center]] <br />
<br />
One advantage of the Skip-gram model over older N-gram models is that the encodings preserve certain linguistic patterns that manifest in surprisingly clear and intuitive ways. For example, linear operations work on skip-gram encodings in a surprisingly logical way; the paper notes that on their trained model, vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector in the corpus. In a sense, subtracting "Spain" from "Madrid" extracts the notion of a capital city that when added to "France", produces "Paris". This property is so attractive that the paper uses it as a benchmark for their Skip-gram implementations ("Do linear operations produce logical results?")<br />
<br />
= Hierarchical Softmax =<br />
Although the Skip-gram method is described using the soft-max function <math>(1)</math> for the purposes of computing <math>\nabla \log p(w_{O}|w_{I})</math> for the backpropagation stages, practical considerations make it difficult to calculate this gradient, particularly if the corpus contains many words since it computational complexity scales linearly with <math>W</math><br />
<br />
<math>(1)\quad p(w_{O}|w_{I}) = \frac{exp(v_{w_{O}}^{'T}v_{w_{I}})}{\sum_{w=1}^W exp(v_{w_{O}}^{'T}v_{w_{I}})} </math><br />
where <math>v_{w}</math> and <math>v_{w}'</math> are the input and output representations of <math>w</math> and <math>W</math> is the vocabulary size <br />
<br />
Instead, the Skip-gram model described in Mikolov et al. used an approximation called Hierarchical softmax which provides better asymptotic performance for large numbers of output nodes. Rather than evaluate W output nodes, hierarchical soft-max can instead evaluate <math>\log W</math> nodes. This is done by encoding the output layer using a binary or Huffman tree where the <math>W</math> words are represented as leaves and each node represents the relative probability of all its child nodes. Optimal asymptotic performance can be achieved if the tree is a balanced binary tree, in which case <math>\log W</math> complexity is possible. Soft-max probabilities are calculated using <math>()</math>.<br />
<br />
<math>(2)\quad p(w_{O}|w_{I}) = \prod_{j=1}^{L(w)-1} \sigma \Bigl(\bigl\|n(w,j+1)=ch(n(w,j))\bigr\| \cdot v_{w_{O}}^{'T}v_{w_{I}} \Bigr) </math><br />
where <math>\sigma(x) = 1/(1+exp(-x))</math>, <math>n(w,j)</math> be the j-th node on the path from the root to w, let <math>ch(n)</math> be an arbitrary fixed child of n and let <math> \|x\| </math> be 1 if x is true and -1 otherwise.<br />
<br />
= Negative Sampling =<br />
<br />
Using the Skip-gram model, for each input word inside a 1M dictionary, we are adjusting 1M weights on the output layer. This can be very slow. NCE is the previous state of art solution which can efficiently reduce the number of parameters needed. In this paper, we are showing a new technique: Negative Sampling.<br />
<br />
Noise Contrastive Estimation (NCE) was introduced in 2012 by Gutmann and Hyvarinen. It uses logistic regression to differentiate data from noise. NCE maximizes the log probability of the softmax. This however not needed for the Skip-Gram Model since our goal is learning high-quality vector representations for context encoding. Negative Sampling is defined in the following formula:<br />
<br />
[[File:wordembedding negativesampling.png|700px|thumb|center]]<br />
<br />
It retains the quality of the Skip-Gram model by only updating a subset of the dataset: k = positive samples + negative samples. The value K can be set arbitrarily, though Mikolov recommend 2-5 for a large dataset and 5-20 for a smaller dataset for it to be useful. NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples.<br />
<br />
To determine which negative samples should be chosen, an unigram distribution is chosen based on empirical results. It is defined as:<br />
<br />
[[File:Screen_Shot_2018-11-21_at_2.03.32_AM.png|700px|thumb|center]]<br />
<br />
The probability function represent the frequency of the word in the dataset.<br />
<br />
= Subsampling of Frequent Words =<br />
Frequently occurring words often do not provide as much information as a rare word. For example, the word-pair "boat, sea" is likely to occur far more likely than the word "boat, the", yet the latter provides the opportunity to encode important contextual information. <br />
<br />
This is a typical case:word pairs containing commonly occurring words often do not provide as much information as rare words. Thus, in order to speed up our implementation of Speed-gram, we discard the word <math>w_{i}</math> from our sample text with probability <math> P(w_{i})=1-\sqrt{\frac{t}{f(w_{i}}} </math><br />
where <math>f(w_{i})</math> is the frequency of word <math>w_{i}</math> and <math>t</math> is a chosen threshold, typically around <math>10^{-5}</math>.<br />
<br />
= Empirical Results =<br />
<br />
To evaluate the results of these optimization, Mikolov and Al. used an internal dataset at Google. This dataset contains 1 billions. By removing all workings which occured less than 5 times, dataset size dropped to 692K words. Two type of data analogies where looked at: syntactic and semantic analogies. Syntactic analogies is when two words have the same meaning but describe two different things (e.g. “quick” : “quickly” :: “slow” : “slowly”). Semantic is when two pairs of words have the same vector meaning. For example, “Berlin” : “Germany” and “Paris” : “France” are semantic analogies.<br />
<br />
[[File:wordembedding empiricalresults.png|700px|thumb|center]]<br />
<br />
Finally, the model was compared to state of art models from 2013 to evaluate their accuracy. The word2vec project was trained on a dataset of 30 billions words with 1000 dimensions. A sample of its results for less used words compared the models by Collobert, Turian, and Mnih are shown below. We can see the Skip-Phrase is comparatively a lot faster to run and produce every accurate results.<br />
<br />
[[File:wordembedding_studies.png|700px|thumb|center]]<br />
<br />
=References=<br />
[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546.<br />
<br />
[2] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781<br />
<br />
[2] McCormick, C. (2017, January 11). Word2Vec Tutorial Part 2 - Negative Sampling. Retrieved from http://www.mccormickml.com<br />
<br />
[3] McCormick, C. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from http://www.mccormickml.com</div>
Jj2hu
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Representations_of_Words_and_Phrases_and_their_Compositionality&diff=40689
Representations of Words and Phrases and their Compositionality
2018-11-21T15:55:13Z
<p>Jj2hu: /* Subsampling of Frequent Words */</p>
<hr />
<div>Representations of Words and Phrases and their Compositionality is a popular paper published by the Google team led by Tomas Mikolov in 2013. It is known for its impact in the field of Natural Language Processing and the techniques described below are still used today.<br />
<br />
= Presented by = <br />
*F. Jiang<br />
*J. Hu<br />
*Y. Zhang<br />
<br />
<br />
= Introduction =<br />
This paper, "Distributed Representations of Words and Phrases and their Compositionality" proposes several methods of improving the performance metrics of the Skip-gram model introduced in a previous paper, a Natural Language Processing technique of encoding words as arbitrary dimensional vectors using a neural network framework. Notably, the Skip-gram model can be made to train faster and produce higher accuracy via a number of simple adjustments; the replacement of the hierarchical soft max function with simple negative sampling, and the subsampling of frequent words.<br />
<br />
= Skip Gram Model =<br />
The Skip-gram model is a Natural Language Processing method based upon a neural network structure designed to learn vector representations of words in such a way as to produce similar encodings for words that appear in similar contexts. While the model can be used to evaluate certain probabilities, this is considered a side effect of its learning process; its primary function is that of a Word2Vec encoder.<br />
<br />
Skip-gram is structured as a one-layer neural network with no non-linear activation function in the hidden layer but a soft-max classifier in the output layer. Words or phrases are encoded using one-hot encoding; the input and output vectors are constructed such that the index of a certain word is indicated by the number 1 within a length determined by a pre-specified vocabulary or corpus (e.g. the word "ant" indicated by 1 at its position in the corpus vector while everything else is 0). The size of the hidden layer is also specified as a hyper parameter; larger sizes of the hidden layer will result in encodings of better quality but take longer to train.<br />
<br />
[[File:skipmod2.PNG|600px|thumb|center]] <br />
<br />
The central premise behind Skip-gram's learning process is that words or phrases that appear close together regularly in the training set are deemed to have similar contexts and should therefore be encoded in such a way as to maximize the probability of the model predicting their appearance together. Training data is prepared by producing a series of word pairs from the training test via a "window size" hyper-parameter that specifies all words a certain number ahead and behind the target word as the desired output, while iterating through all the words of the passage. For example, the model will may learn from the training set that "steering" and "wheel" appear in similar contexts. This means that one is a good predictor of the other, but also that "driving" is a good predictor of both. Thus, feeding any one of them into the model should produce high probabilities (of each appearing in the same context) for the all the others. Once we have a neural net that predicts contextual probabilities to an acceptable degree, the hidden layer weights are saved as the desired Word2Vec encodings (as an nxd matrix, each row represents a single encoding for the corpus word at that row index) <br />
<br />
[[File:window_skip.PNG|left]]<br />
[[File:table_skip.PNG|center]] <br />
<br />
One advantage of the Skip-gram model over older N-gram models is that the encodings preserve certain linguistic patterns that manifest in surprisingly clear and intuitive ways. For example, linear operations work on skip-gram encodings in a surprisingly logical way; the paper notes that on their trained model, vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector in the corpus. In a sense, subtracting "Spain" from "Madrid" extracts the notion of a capital city that when added to "France", produces "Paris". This property is so attractive that the paper uses it as a benchmark for their Skip-gram implementations ("Do linear operations produce logical results?")<br />
<br />
= Hierarchical Softmax =<br />
Although the Skip-gram method is described using the soft-max function <math>(1)</math> for the purposes of computing <math>\nabla \log p(w_{O}|w_{I})</math> for the backpropagation stages, practical considerations make it difficult to calculate this gradient, particularly if the corpus contains many words since it computational complexity scales linearly with <math>W</math><br />
<br />
<math>(1)\quad p(w_{O}|w_{I}) = \frac{exp(v_{w_{O}}^{'T}v_{w_{I}})}{\sum_{w=1}^W exp(v_{w_{O}}^{'T}v_{w_{I}})} </math><br />
where <math>v_{w}</math> and <math>v_{w}'</math> are the input and output representations of <math>w</math> and <math>W</math> is the vocabulary size <br />
<br />
Instead, the Skip-gram model described in Mikolov et al. used an approximation called Hierarchical softmax which provides better asymptotic performance for large numbers of output nodes. Rather than evaluate W output nodes, hierarchical soft-max can instead evaluate <math>\log W</math> nodes. This is done by encoding the output layer using a binary or Huffman tree where the <math>W</math> words are represented as leaves and each node represents the relative probability of all its child nodes. Optimal asymptotic performance can be achieved if the tree is a balanced binary tree, in which case <math>\log W</math> complexity is possible. Soft-max probabilities are calculated using <math>()</math>.<br />
<br />
<math>(2)\quad p(w_{O}|w_{I}) = \prod_{j=1}^{L(w)-1} \sigma \Bigl(\bigl\|n(w,j+1)=ch(n(w,j))\bigr\| \cdot v_{w_{O}}^{'T}v_{w_{I}} \Bigr) </math><br />
where <math>\sigma(x) = 1/(1+exp(-x))</math>, <math>n(w,j)</math> be the j-th node on the path from the root to w, let <math>ch(n)</math> be an arbitrary fixed child of n and let <math> \|x\| </math> be 1 if x is true and -1 otherwise.<br />
<br />
= Negative Sampling =<br />
<br />
Using the Skip-gram model, for each input word inside a 1M dictionary, we are adjusting 1M weights on the output layer. This can be very slow. NCE is the previous state of art solution which can efficiently reduce the number of parameters needed. In this paper, we are showing a new technique: Negative Sampling.<br />
<br />
Noise Contrastive Estimation (NCE) was introduced in 2012 by Gutmann and Hyvarinen. It uses logistic regression to differentiate data from noise. NCE maximizes the log probability of the softmax. This however not needed for the Skip-Gram Model since our goal is learning high-quality vector representations for context encoding. Negative Sampling is defined in the following formula:<br />
<br />
[[File:wordembedding negativesampling.png|700px|thumb|center]]<br />
<br />
It retains the quality of the Skip-Gram model by only updating a subset of the dataset: k = positive samples + negative samples. The value K can be set arbitrarily, though Mikolov recommend 2-5 for a large dataset and 5-20 for a smaller dataset for it to be useful. NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples.<br />
<br />
To determine which negative samples should be chosen, an unigram distribution is chosen based on empirical results. It is defined as:<br />
<br />
[[File:Screen_Shot_2018-11-21_at_2.03.32_AM.png|700px|thumb|center]]<br />
<br />
The probability function represent the frequency of the word in the dataset.<br />
<br />
= Subsampling of Frequent Words =<br />
Frequently occurring words often do not provide as much information as a rare word. For example, the word-pair "boat, sea" is likely to occur far more likely than the word "boat, the", yet the latter provides the opportunity to encode important contextual information. <br />
<br />
This is a typical case:word pairs containing commonly occurring words often do not provide as much information as rare words. Thus, in order to speed up our implementation of Speed-gram, we discard the word <math>w_{i}</math> from our sample text with probability <math> P(w_{i})=1-\sqrt{\frac{t}{f(w_{i}}} </math><br />
where <math>f(w_{i})</math> is the frequency of word <math>w_{i}</math> and <math>t</math> is a chosen threshold, typically around <math>10^{-5}</math>.<br />
<br />
= Empirical Results =<br />
<br />
To evaluate the results of these optimization, Mikolov and Al. used an internal dataset at Google. This dataset contains 1 billions. By removing all workings which occured less than 5 times, dataset size dropped to 692K words. Two type of data analogies where looked at: syntactic and semantic analogies. Syntactic analogies is when two words have the same meaning but describe two different things (e.g. “quick” : “quickly” :: “slow” : “slowly”). Semantic is when two pairs of words have the same vector meaning. For example, “Berlin” : “Germany” and “Paris” : “France” are semantic analogies.<br />
<br />
[[File:wordembedding empiricalresults.png|700px|thumb|center]]<br />
<br />
Finally, the model was compared to state of art models from 2013 to evaluate their accuracy. The word2vec project was trained on a dataset of 30 billions words with 1000 dimensions. A sample of its results for less used words compared the models by Collobert, Turian, and Mnih are shown below. We can see the Skip-Phrase is comparatively a lot faster to run and produce every accurate results.<br />
<br />
[[File:wordembedding_studies.png|700px|thumb|center]]<br />
<br />
=References=<br />
[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546.<br />
<br />
[2] McCormick, C. (2017, January 11). Word2Vec Tutorial Part 2 - Negative Sampling. Retrieved from http://www.mccormickml.com<br />
<br />
[3] McCormick, C. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from http://www.mccormickml.com</div>
Jj2hu
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Representations_of_Words_and_Phrases_and_their_Compositionality&diff=40688
Representations of Words and Phrases and their Compositionality
2018-11-21T15:34:32Z
<p>Jj2hu: /* Hierarchical Softmax */</p>
<hr />
<div>Representations of Words and Phrases and their Compositionality is a popular paper published by the Google team led by Tomas Mikolov in 2013. It is known for its impact in the field of Natural Language Processing and the techniques described below are still used today.<br />
<br />
= Presented by = <br />
*F. Jiang<br />
*J. Hu<br />
*Y. Zhang<br />
<br />
<br />
= Introduction =<br />
This paper, "Distributed Representations of Words and Phrases and their Compositionality" proposes several methods of improving the performance metrics of the Skip-gram model introduced in a previous paper, a Natural Language Processing technique of encoding words as arbitrary dimensional vectors using a neural network framework. Notably, the Skip-gram model can be made to train faster and produce higher accuracy via a number of simple adjustments; the replacement of the hierarchical soft max function with simple negative sampling, and the subsampling of frequent words.<br />
<br />
= Skip Gram Model =<br />
The Skip-gram model is a Natural Language Processing method based upon a neural network structure designed to learn vector representations of words in such a way as to produce similar encodings for words that appear in similar contexts. While the model can be used to evaluate certain probabilities, this is considered a side effect of its learning process; its primary function is that of a Word2Vec encoder.<br />
<br />
Skip-gram is structured as a one-layer neural network with no non-linear activation function in the hidden layer but a soft-max classifier in the output layer. Words or phrases are encoded using one-hot encoding; the input and output vectors are constructed such that the index of a certain word is indicated by the number 1 within a length determined by a pre-specified vocabulary or corpus (e.g. the word "ant" indicated by 1 at its position in the corpus vector while everything else is 0). The size of the hidden layer is also specified as a hyper parameter; larger sizes of the hidden layer will result in encodings of better quality but take longer to train.<br />
<br />
[[File:skipmod2.PNG|600px|thumb|center]] <br />
<br />
The central premise behind Skip-gram's learning process is that words or phrases that appear close together regularly in the training set are deemed to have similar contexts and should therefore be encoded in such a way as to maximize the probability of the model predicting their appearance together. Training data is prepared by producing a series of word pairs from the training test via a "window size" hyper-parameter that specifies all words a certain number ahead and behind the target word as the desired output, while iterating through all the words of the passage. For example, the model will may learn from the training set that "steering" and "wheel" appear in similar contexts. This means that one is a good predictor of the other, but also that "driving" is a good predictor of both. Thus, feeding any one of them into the model should produce high probabilities (of each appearing in the same context) for the all the others. Once we have a neural net that predicts contextual probabilities to an acceptable degree, the hidden layer weights are saved as the desired Word2Vec encodings (as an nxd matrix, each row represents a single encoding for the corpus word at that row index) <br />
<br />
[[File:window_skip.PNG|left]]<br />
[[File:table_skip.PNG|center]] <br />
<br />
One advantage of the Skip-gram model over older N-gram models is that the encodings preserve certain linguistic patterns that manifest in surprisingly clear and intuitive ways. For example, linear operations work on skip-gram encodings in a surprisingly logical way; the paper notes that on their trained model, vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector in the corpus. In a sense, subtracting "Spain" from "Madrid" extracts the notion of a capital city that when added to "France", produces "Paris". This property is so attractive that the paper uses it as a benchmark for their Skip-gram implementations ("Do linear operations produce logical results?")<br />
<br />
= Hierarchical Softmax =<br />
Although the Skip-gram method is described using the soft-max function <math>(1)</math> for the purposes of computing <math>\nabla \log p(w_{O}|w_{I})</math> for the backpropagation stages, practical considerations make it difficult to calculate this gradient, particularly if the corpus contains many words since it computational complexity scales linearly with <math>W</math><br />
<br />
<math>(1)\quad p(w_{O}|w_{I}) = \frac{exp(v_{w_{O}}^{'T}v_{w_{I}})}{\sum_{w=1}^W exp(v_{w_{O}}^{'T}v_{w_{I}})} </math><br />
where <math>v_{w}</math> and <math>v_{w}'</math> are the input and output representations of <math>w</math> and <math>W</math> is the vocabulary size <br />
<br />
Instead, the Skip-gram model described in Mikolov et al. used an approximation called Hierarchical softmax which provides better asymptotic performance for large numbers of output nodes. Rather than evaluate W output nodes, hierarchical soft-max can instead evaluate <math>\log W</math> nodes. This is done by encoding the output layer using a binary or Huffman tree where the <math>W</math> words are represented as leaves and each node represents the relative probability of all its child nodes. Optimal asymptotic performance can be achieved if the tree is a balanced binary tree, in which case <math>\log W</math> complexity is possible. Soft-max probabilities are calculated using <math>()</math>.<br />
<br />
<math>(2)\quad p(w_{O}|w_{I}) = \prod_{j=1}^{L(w)-1} \sigma \Bigl(\bigl\|n(w,j+1)=ch(n(w,j))\bigr\| \cdot v_{w_{O}}^{'T}v_{w_{I}} \Bigr) </math><br />
where <math>\sigma(x) = 1/(1+exp(-x))</math>, <math>n(w,j)</math> be the j-th node on the path from the root to w, let <math>ch(n)</math> be an arbitrary fixed child of n and let <math> \|x\| </math> be 1 if x is true and -1 otherwise.<br />
<br />
= Negative Sampling =<br />
<br />
Using the Skip-gram model, for each input word inside a 1M dictionary, we are adjusting 1M weights on the output layer. This can be very slow. NCE is the previous state of art solution which can efficiently reduce the number of parameters needed. In this paper, we are showing a new technique: Negative Sampling.<br />
<br />
Noise Contrastive Estimation (NCE) was introduced in 2012 by Gutmann and Hyvarinen. It uses logistic regression to differentiate data from noise. NCE maximizes the log probability of the softmax. This however not needed for the Skip-Gram Model since our goal is learning high-quality vector representations for context encoding. Negative Sampling is defined in the following formula:<br />
<br />
[[File:wordembedding negativesampling.png|700px|thumb|center]]<br />
<br />
It retains the quality of the Skip-Gram model by only updating a subset of the dataset: k = positive samples + negative samples. The value K can be set arbitrarily, though Mikolov recommend 2-5 for a large dataset and 5-20 for a smaller dataset for it to be useful. NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples.<br />
<br />
To determine which negative samples should be chosen, an unigram distribution is chosen based on empirical results. It is defined as:<br />
<br />
[[File:Screen_Shot_2018-11-21_at_2.03.32_AM.png|700px|thumb|center]]<br />
<br />
The probability function represent the frequency of the word in the dataset.<br />
<br />
= Subsampling of Frequent Words = <br />
<br />
<br />
= Empirical Results =<br />
<br />
To evaluate the results of these optimization, Mikolov and Al. used an internal dataset at Google. This dataset contains 1 billions. By removing all workings which occured less than 5 times, dataset size dropped to 692K words. Two type of data analogies where looked at: syntactic and semantic analogies. Syntactic analogies is when two words have the same meaning but describe two different things (e.g. “quick” : “quickly” :: “slow” : “slowly”). Semantic is when two pairs of words have the same vector meaning. For example, “Berlin” : “Germany” and “Paris” : “France” are semantic analogies.<br />
<br />
[[File:wordembedding empiricalresults.png|700px|thumb|center]]<br />
<br />
Finally, the model was compared to state of art models from 2013 to evaluate their accuracy. The word2vec project was trained on a dataset of 30 billions words with 1000 dimensions. A sample of its results for less used words compared the models by Collobert, Turian, and Mnih are shown below. We can see the Skip-Phrase is comparatively a lot faster to run and produce every accurate results.<br />
<br />
[[File:wordembedding_studies.png|700px|thumb|center]]<br />
<br />
=References=<br />
[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546.<br />
<br />
[2] McCormick, C. (2017, January 11). Word2Vec Tutorial Part 2 - Negative Sampling. Retrieved from http://www.mccormickml.com<br />
<br />
[3] McCormick, C. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from http://www.mccormickml.com</div>
Jj2hu
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Representations_of_Words_and_Phrases_and_their_Compositionality&diff=40677
Representations of Words and Phrases and their Compositionality
2018-11-21T09:16:20Z
<p>Jj2hu: /* Hierarchical Softmax */</p>
<hr />
<div>Representations of Words and Phrases and their Compositionality is a popular paper published by the Google team led by Tomas Mikolov in 2013. It is known for its impact in the field of Natural Language Processing and the techniques described below are still used today.<br />
<br />
= Presented by = <br />
*F. Jiang<br />
*J. Hu<br />
*Y. Zhang<br />
<br />
<br />
= Introduction =<br />
This paper, "Distributed Representations of Words and Phrases and their Compositionality" proposes several methods of improving the performance metrics of the Skip-gram model introduced in a previous paper, a Natural Language Processing technique of encoding words as arbitrary dimensional vectors using a neural network framework. Notably, the Skip-gram model can be made to train faster and produce higher accuracy via a number of simple adjustments; the replacement of the hierarchical soft max function with simple negative sampling, and the subsampling of frequent words.<br />
<br />
= Skip Gram Model =<br />
The Skip-gram model is a Natural Language Processing method based upon a neural network structure designed to learn vector representations of words in such a way as to produce similar encodings for words that appear in similar contexts. While the model can be used to evaluate certain probabilities, this is considered a side effect of its learning process; its primary function is that of a Word2Vec encoder.<br />
<br />
Skip-gram is structured as a one-layer neural network with no non-linear activation function in the hidden layer but a soft-max classifier in the output layer. Words or phrases are encoded using one-hot encoding; the input and output vectors are constructed such that the index of a certain word is indicated by the number 1 within a length determined by a pre-specified vocabulary or corpus (e.g. the word "ant" indicated by 1 at its position in the corpus vector while everything else is 0). The size of the hidden layer is also specified as a hyper parameter; larger sizes of the hidden layer will result in encodings of better quality but take longer to train.<br />
<br />
[[File:skipmod2.PNG|600px|thumb|center]] <br />
<br />
The central premise behind Skip-gram's learning process is that words or phrases that appear close together regularly in the training set are deemed to have similar contexts and should therefore be encoded in such a way as to maximize the probability of the model predicting their appearance together. Training data is prepared by producing a series of word pairs from the training test via a "window size" hyper-parameter that specifies all words a certain number ahead and behind the target word as the desired output, while iterating through all the words of the passage. For example, the model will may learn from the training set that "steering" and "wheel" appear in similar contexts. This means that one is a good predictor of the other, but also that "driving" is a good predictor of both. Thus, feeding any one of them into the model should produce high probabilities (of each appearing in the same context) for the all the others. Once we have a neural net that predicts contextual probabilities to an acceptable degree, the hidden layer weights are saved as the desired Word2Vec encodings (as an nxd matrix, each row represents a single encoding for the corpus word at that row index) <br />
<br />
[[File:window_skip.PNG|left]]<br />
[[File:table_skip.PNG|center]] <br />
<br />
One advantage of the Skip-gram model over older N-gram models is that the encodings preserve certain linguistic patterns that manifest in surprisingly clear and intuitive ways. For example, linear operations work on skip-gram encodings in a surprisingly logical way; the paper notes that on their trained model, vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector in the corpus. In a sense, subtracting "Spain" from "Madrid" extracts the notion of a capital city that when added to "France", produces "Paris". This property is so attractive that the paper uses it as a benchmark for their Skip-gram implementations ("Do linear operations produce logical results?")<br />
<br />
= Hierarchical Softmax =<br />
Although the Skip-gram method involves using the soft-max function for the purposes of computing gradients for the backpropagation stages, practical considerations usually make the full soft-max algorithm prohibitively expensive to run, particularly if the corpus is of a large size.<br />
<br />
Instead, the basic Skip-gram model described in Mikolov et al. used an approximation called Hierarchical Softmax which provides better asymptotic performance for large numbers of output nodes. Rather than evaluate W output nodes hierarchical soft max can instead evaluate <math>\log nodes for relatively simialr performance<br />
<br />
= Negative Sampling =<br />
<br />
Using the Skip-gram model, for each input word inside a 1M dictionary, we are adjusting 1M weights on the output layer. This can be very slow. NCE is the previous state of art solution which can efficiently reduce the number of parameters needed. In this paper, we are showing a new technique: Negative Sampling.<br />
<br />
Noise Contrastive Estimation (NCE) was introduced in 2012 by Gutmann and Hyvarinen. It uses logistic regression to differentiate data from noise. NCE maximizes the log probability of the softmax. This however not needed for the Skip-Gram Model since our goal is learning high-quality vector representations for context encoding. Negative Sampling is defined in the following formula:<br />
<br />
[[File:wordembedding negativesampling.png|700px|thumb|center]]<br />
<br />
It retains the quality of the Skip-Gram model by only updating a subset of the dataset: k = positive samples + negative samples. The value K can be set arbitrarily, though Mikolov recommend 2-5 for a large dataset and 5-20 for a smaller dataset for it to be useful. NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples.<br />
<br />
To determine which negative samples should be chosen, an unigram distribution is chosen based on empirical results. It is defined as:<br />
<br />
[[File:Screen_Shot_2018-11-21_at_2.03.32_AM.png|700px|thumb|center]]<br />
<br />
The probability function represent the frequency of the word in the dataset.<br />
<br />
= Subsampling of Frequent Words = <br />
<br />
<br />
= Empirical Results =<br />
<br />
To evaluate the results of these optimization, Mikolov and Al. used an internal dataset at Google. This dataset contains 1 billions. By removing all workings which occured less than 5 times, dataset size dropped to 692K words. Two type of data analogies where looked at: syntactic and semantic analogies. Syntactic analogies is when two words have the same meaning but describe two different things (e.g. “quick” : “quickly” :: “slow” : “slowly”). Semantic is when two pairs of words have the same vector meaning. For example, “Berlin” : “Germany” and “Paris” : “France” are semantic analogies.<br />
<br />
[[File:wordembedding empiricalresults.png|700px|thumb|center]]<br />
<br />
Finally, the model was compared to state of art models from 2013 to evaluate their accuracy. The word2vec project was trained on a dataset of 30 billions words with 1000 dimensions. A sample of its results for less used words compared the models by Collobert, Turian, and Mnih are shown below. We can see the Skip-Phrase is comparatively a lot faster to run and produce every accurate results.<br />
<br />
[[File:wordembedding_studies.png|700px|thumb|center]]<br />
<br />
=References=<br />
[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546.<br />
<br />
[2] McCormick, C. (2017, January 11). Word2Vec Tutorial Part 2 - Negative Sampling. Retrieved from http://www.mccormickml.com<br />
<br />
[3] McCormick, C. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from http://www.mccormickml.com</div>
Jj2hu
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Representations_of_Words_and_Phrases_and_their_Compositionality&diff=40676
Representations of Words and Phrases and their Compositionality
2018-11-21T09:07:19Z
<p>Jj2hu: /* Skip Gram Model */</p>
<hr />
<div>Representations of Words and Phrases and their Compositionality is a popular paper published by the Google team led by Tomas Mikolov in 2013. It is known for its impact in the field of Natural Language Processing and the techniques described below are still used today.<br />
<br />
= Presented by = <br />
*F. Jiang<br />
*J. Hu<br />
*Y. Zhang<br />
<br />
<br />
= Introduction =<br />
This paper, "Distributed Representations of Words and Phrases and their Compositionality" proposes several methods of improving the performance metrics of the Skip-gram model introduced in a previous paper, a Natural Language Processing technique of encoding words as arbitrary dimensional vectors using a neural network framework. Notably, the Skip-gram model can be made to train faster and produce higher accuracy via a number of simple adjustments; the replacement of the hierarchical soft max function with simple negative sampling, and the subsampling of frequent words.<br />
<br />
= Skip Gram Model =<br />
The Skip-gram model is a Natural Language Processing method based upon a neural network structure designed to learn vector representations of words in such a way as to produce similar encodings for words that appear in similar contexts. While the model can be used to evaluate certain probabilities, this is considered a side effect of its learning process; its primary function is that of a Word2Vec encoder.<br />
<br />
Skip-gram is structured as a one-layer neural network with no non-linear activation function in the hidden layer but a soft-max classifier in the output layer. Words or phrases are encoded using one-hot encoding; the input and output vectors are constructed such that the index of a certain word is indicated by the number 1 within a length determined by a pre-specified vocabulary or corpus (e.g. the word "ant" indicated by 1 at its position in the corpus vector while everything else is 0). The size of the hidden layer is also specified as a hyper parameter; larger sizes of the hidden layer will result in encodings of better quality but take longer to train.<br />
<br />
[[File:skipmod2.PNG|600px|thumb|center]] <br />
<br />
The central premise behind Skip-gram's learning process is that words or phrases that appear close together regularly in the training set are deemed to have similar contexts and should therefore be encoded in such a way as to maximize the probability of the model predicting their appearance together. Training data is prepared by producing a series of word pairs from the training test via a "window size" hyper-parameter that specifies all words a certain number ahead and behind the target word as the desired output, while iterating through all the words of the passage. For example, the model will may learn from the training set that "steering" and "wheel" appear in similar contexts. This means that one is a good predictor of the other, but also that "driving" is a good predictor of both. Thus, feeding any one of them into the model should produce high probabilities (of each appearing in the same context) for the all the others. Once we have a neural net that predicts contextual probabilities to an acceptable degree, the hidden layer weights are saved as the desired Word2Vec encodings (as an nxd matrix, each row represents a single encoding for the corpus word at that row index) <br />
<br />
[[File:window_skip.PNG|left]]<br />
[[File:table_skip.PNG|center]] <br />
<br />
One advantage of the Skip-gram model over older N-gram models is that the encodings preserve certain linguistic patterns that manifest in surprisingly clear and intuitive ways. For example, linear operations work on skip-gram encodings in a surprisingly logical way; the paper notes that on their trained model, vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector in the corpus. In a sense, subtracting "Spain" from "Madrid" extracts the notion of a capital city that when added to "France", produces "Paris". This property is so attractive that the paper uses it as a benchmark for their Skip-gram implementations ("Do linear operations produce logical results?")<br />
<br />
= Hierarchical Softmax =<br />
Although the Skip-gram method involves using the soft-max function for the purposes of computing gradients for the backpropagation stages, practical considerations usually make the full soft-max algorithm prohibitively expensive to run, particularly if the <br />
Instead, the base Skip-gram model used an approximation called Hierarchical Softmax<br />
<br />
= Negative Sampling =<br />
<br />
Using the Skip-gram model, for each input word inside a 1M dictionary, we are adjusting 1M weights on the output layer. This can be very slow. NCE is the previous state of art solution which can efficiently reduce the number of parameters needed. In this paper, we are showing a new technique: Negative Sampling.<br />
<br />
Noise Contrastive Estimation (NCE) was introduced in 2012 by Gutmann and Hyvarinen. It uses logistic regression to differentiate data from noise. NCE maximizes the log probability of the softmax. This however not needed for the Skip-Gram Model since our goal is learning high-quality vector representations for context encoding. Negative Sampling is defined in the following formula:<br />
<br />
[[File:wordembedding negativesampling.png|700px|thumb|center]]<br />
<br />
It retains the quality of the Skip-Gram model by only updating a subset of the dataset: k = positive samples + negative samples. The value K can be set arbitrarily, though Mikolov recommend 2-5 for a large dataset and 5-20 for a smaller dataset for it to be useful. NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples.<br />
<br />
To determine which negative samples should be chosen, an unigram distribution is chosen based on empirical results. It is defined as:<br />
<br />
[[File:Screen_Shot_2018-11-21_at_2.03.32_AM.png|700px|thumb|center]]<br />
<br />
The probability function represent the frequency of the word in the dataset.<br />
<br />
= Subsampling of Frequent Words = <br />
<br />
<br />
= Empirical Results =<br />
<br />
To evaluate the results of these optimization, Mikolov and Al. used an internal dataset at Google. This dataset contains 1 billions. By removing all workings which occured less than 5 times, dataset size dropped to 692K words. Two type of data analogies where looked at: syntactic and semantic analogies. Syntactic analogies is when two words have the same meaning but describe two different things (e.g. “quick” : “quickly” :: “slow” : “slowly”). Semantic is when two pairs of words have the same vector meaning. For example, “Berlin” : “Germany” and “Paris” : “France” are semantic analogies.<br />
<br />
[[File:wordembedding empiricalresults.png|700px|thumb|center]]<br />
<br />
Finally, the model was compared to state of art models from 2013 to evaluate their accuracy. The word2vec project was trained on a dataset of 30 billions words with 1000 dimensions. A sample of its results for less used words compared the models by Collobert, Turian, and Mnih are shown below. We can see the Skip-Phrase is comparatively a lot faster to run and produce every accurate results.<br />
<br />
[[File:wordembedding_studies.png|700px|thumb|center]]<br />
<br />
=References=<br />
[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546.<br />
<br />
[2] McCormick, C. (2017, January 11). Word2Vec Tutorial Part 2 - Negative Sampling. Retrieved from http://www.mccormickml.com<br />
<br />
[3] McCormick, C. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from http://www.mccormickml.com</div>
Jj2hu
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Representations_of_Words_and_Phrases_and_their_Compositionality&diff=40675
Representations of Words and Phrases and their Compositionality
2018-11-21T09:06:32Z
<p>Jj2hu: /* Hierarchical Softmax */</p>
<hr />
<div>Representations of Words and Phrases and their Compositionality is a popular paper published by the Google team led by Tomas Mikolov in 2013. It is known for its impact in the field of Natural Language Processing and the techniques described below are still used today.<br />
<br />
= Presented by = <br />
*F. Jiang<br />
*J. Hu<br />
*Y. Zhang<br />
<br />
<br />
= Introduction =<br />
This paper, "Distributed Representations of Words and Phrases and their Compositionality" proposes several methods of improving the performance metrics of the Skip-gram model introduced in a previous paper, a Natural Language Processing technique of encoding words as arbitrary dimensional vectors using a neural network framework. Notably, the Skip-gram model can be made to train faster and produce higher accuracy via a number of simple adjustments; the replacement of the hierarchical soft max function with simple negative sampling, and the subsampling of frequent words.<br />
<br />
= Skip Gram Model =<br />
The Skip-gram model is a Natural Language Processing method based upon a neural network structure designed to learn vector representations of words in such a way as to produce similar encodings for words that appear in similar contexts. While the model can be used to evaluate certain probabilities, this is considered a side effect of its learning process; its primary function is that of a Word2Vec encoder.<br />
<br />
Skip-gram is structured as a one-layer neural network with no non-linear activation function in the hidden layer but a soft-max classifier in the output layer. Words or phrases are encoded using one-hot encoding; the input and output vectors are constructed such that the index of a certain word is indicated by the number 1 within a length determined by a pre-specified vocabulary or corpus (e.g. the word "ant" is indicated by the 1 at its position in the corpus; everything else is 0). The size of the hidden layer is also specified as a hyper parameter; larger sizes of the hidden layer will result in encodings of better quality but take longer to train.<br />
<br />
[[File:skipmod2.PNG|600px|thumb|center]] <br />
<br />
The central premise behind Skip-gram's learning process is that words or phrases that appear close together regularly in the training set are deemed to have similar contexts and should therefore be encoded in such a way as to maximize the probability of the model predicting their appearance together. Training data is prepared by producing a series of word pairs from the training test via a "window size" hyper-parameter that specifies all words a certain number ahead and behind the target word as the desired output, while iterating through all the words of the passage. For example, the model will may learn from the training set that "steering" and "wheel" appear in similar contexts. This means that one is a good predictor of the other, but also that "driving" is a good predictor of both. Thus, feeding any one of them into the model should produce high probabilities (of each appearing in the same context) for the all the others. Once we have a neural net that predicts contextual probabilities to an acceptable degree, the hidden layer weights are saved as the desired Word2Vec encodings (as an nxd matrix, each row represents a single encoding for the corpus word at that row index) <br />
<br />
[[File:window_skip.PNG|left]]<br />
[[File:table_skip.PNG|center]] <br />
<br />
One advantage of the Skip-gram model over older N-gram models is that the encodings preserve certain linguistic patterns that manifest in surprisingly clear and intuitive ways. For example, linear operations work on skip-gram encodings in a surprisingly logical way; the paper notes that on their trained model, vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector in the corpus. In a sense, subtracting "Spain" from "Madrid" extracts the notion of a capital city that when added to "France", produces "Paris". This property is so attractive that the paper uses it as a benchmark for their Skip-gram implementations ("Do linear operations produce logical results?")<br />
<br />
= Hierarchical Softmax =<br />
Although the Skip-gram method involves using the soft-max function for the purposes of computing gradients for the backpropagation stages, practical considerations usually make the full soft-max algorithm prohibitively expensive to run, particularly if the <br />
Instead, the base Skip-gram model used an approximation called Hierarchical Softmax<br />
<br />
= Negative Sampling =<br />
<br />
Using the Skip-gram model, for each input word inside a 1M dictionary, we are adjusting 1M weights on the output layer. This can be very slow. NCE is the previous state of art solution which can efficiently reduce the number of parameters needed. In this paper, we are showing a new technique: Negative Sampling.<br />
<br />
Noise Contrastive Estimation (NCE) was introduced in 2012 by Gutmann and Hyvarinen. It uses logistic regression to differentiate data from noise. NCE maximizes the log probability of the softmax. This however not needed for the Skip-Gram Model since our goal is learning high-quality vector representations for context encoding. Negative Sampling is defined in the following formula:<br />
<br />
[[File:wordembedding negativesampling.png|700px|thumb|center]]<br />
<br />
It retains the quality of the Skip-Gram model by only updating a subset of the dataset: k = positive samples + negative samples. The value K can be set arbitrarily, though Mikolov recommend 2-5 for a large dataset and 5-20 for a smaller dataset for it to be useful. NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples.<br />
<br />
To determine which negative samples should be chosen, an unigram distribution is chosen based on empirical results. It is defined as:<br />
<br />
[[File:Screen_Shot_2018-11-21_at_2.03.32_AM.png|700px|thumb|center]]<br />
<br />
The probability function represent the frequency of the word in the dataset.<br />
<br />
= Subsampling of Frequent Words = <br />
<br />
<br />
= Empirical Results =<br />
<br />
To evaluate the results of these optimization, Mikolov and Al. used an internal dataset at Google. This dataset contains 1 billions. By removing all workings which occured less than 5 times, dataset size dropped to 692K words. Two type of data analogies where looked at: syntactic and semantic analogies. Syntactic analogies is when two words have the same meaning but describe two different things (e.g. “quick” : “quickly” :: “slow” : “slowly”). Semantic is when two pairs of words have the same vector meaning. For example, “Berlin” : “Germany” and “Paris” : “France” are semantic analogies.<br />
<br />
[[File:wordembedding empiricalresults.png|700px|thumb|center]]<br />
<br />
Finally, the model was compared to state of art models from 2013 to evaluate their accuracy. The word2vec project was trained on a dataset of 30 billions words with 1000 dimensions. A sample of its results for less used words compared the models by Collobert, Turian, and Mnih are shown below. We can see the Skip-Phrase is comparatively a lot faster to run and produce every accurate results.<br />
<br />
[[File:wordembedding_studies.png|700px|thumb|center]]<br />
<br />
=References=<br />
[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546.<br />
<br />
[2] McCormick, C. (2017, January 11). Word2Vec Tutorial Part 2 - Negative Sampling. Retrieved from http://www.mccormickml.com<br />
<br />
[3] McCormick, C. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from http://www.mccormickml.com</div>
Jj2hu
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Representations_of_Words_and_Phrases_and_their_Compositionality&diff=40673
Representations of Words and Phrases and their Compositionality
2018-11-21T07:54:00Z
<p>Jj2hu: /* Introduction */</p>
<hr />
<div>Representations of Words and Phrases and their Compositionality is a popular paper published by the Google team led by Tomas Mikolov in 2013. It is known for its impact in the field of Natural Language Processing and the techniques described below are still used today.<br />
<br />
= Presented by = <br />
*F. Jiang<br />
*J. Hu<br />
*Y. Zhang<br />
<br />
<br />
= Introduction =<br />
This paper, "Distributed Representations of Words and Phrases and their Compositionality" proposes several methods of improving the performance metrics of the Skip-gram model introduced in a previous paper, a Natural Language Processing technique of encoding words as arbitrary dimensional vectors using a neural network framework. Notably, the Skip-gram model can be made to train faster and produce higher accuracy via a number of simple adjustments; the replacement of the hierarchical soft max function with simple negative sampling, and the subsampling of frequent words.<br />
<br />
= Skip Gram Model =<br />
The Skip-gram model is a Natural Language Processing method based upon a neural network structure designed to learn vector representations of words in such a way as to produce similar encodings for words that appear in similar contexts. While the model can be used to evaluate certain probabilities, this is considered a side effect of its learning process; its primary function is that of a Word2Vec encoder.<br />
<br />
Skip-gram is structured as a one-layer neural network with no non-linear activation function in the hidden layer but a soft-max classifier in the output layer. Words or phrases are encoded using one-hot encoding; the input and output vectors are constructed such that the index of a certain word is indicated by the number 1 within a length determined by a pre-specified vocabulary or corpus (e.g. the word "ant" is indicated by the 1 at its position in the corpus; everything else is 0). The size of the hidden layer is also specified as a hyper parameter; larger sizes of the hidden layer will result in encodings of better quality but take longer to train.<br />
<br />
[[File:skipmod2.PNG|600px|thumb|center]] <br />
<br />
The central premise behind Skip-gram's learning process is that words or phrases that appear close together regularly in the training set are deemed to have similar contexts and should therefore be encoded in such a way as to maximize the probability of the model predicting their appearance together. Training data is prepared by producing a series of word pairs from the training test via a "window size" hyper-parameter that specifies all words a certain number ahead and behind the target word as the desired output, while iterating through all the words of the passage. For example, the model will may learn from the training set that "steering" and "wheel" appear in similar contexts. This means that one is a good predictor of the other, but also that "driving" is a good predictor of both. Thus, feeding any one of them into the model should produce high probabilities (of each appearing in the same context) for the all the others. Once we have a neural net that predicts contextual probabilities to an acceptable degree, the hidden layer weights are saved as the desired Word2Vec encodings (as an nxd matrix, each row represents a single encoding for the corpus word at that row index) <br />
<br />
[[File:window_skip.PNG|left]]<br />
[[File:table_skip.PNG|center]] <br />
<br />
One advantage of the Skip-gram model over older N-gram models is that the encodings preserve certain linguistic patterns that manifest in surprisingly clear and intuitive ways. For example, linear operations work on skip-gram encodings in a surprisingly logical way; the paper notes that on their trained model, vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector in the corpus. In a sense, subtracting "Spain" from "Madrid" extracts the notion of a capital city that when added to "France", produces "Paris". This property is so attractive that the paper uses it as a benchmark for their Skip-gram implementations ("Do linear operations produce logical results?")<br />
<br />
= Hierarchical Softmax = <br />
<br />
<br />
= Negative Sampling =<br />
<br />
Using the Skip-gram model, for each input word inside a 1M dictionary, we are adjusting 1M weights on the output layer. This can be very slow. NCE is the previous state of art solution which can efficiently reduce the number of parameters needed. In this paper, we are showing a new technique: Negative Sampling.<br />
<br />
Noise Contrastive Estimation (NCE) was introduced in 2012 by Gutmann and Hyvarinen. It uses logistic regression to differentiate data from noise. NCE maximizes the log probability of the softmax. This however not needed for the Skip-Gram Model since our goal is learning high-quality vector representations for context encoding. Negative Sampling is defined in the following formula:<br />
<br />
[[File:wordembedding negativesampling.png|700px|thumb|center]]<br />
<br />
It retains the quality of the Skip-Gram model by only updating a subset of the dataset: k = positive samples + negative samples. The value K can be set arbitrarily, though Mikolov recommend 2-5 for a large dataset and 5-20 for a smaller dataset for it to be useful. NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples.<br />
<br />
To determine which negative samples should be chosen, an unigram distribution is chosen based on empirical results. It is defined as:<br />
<br />
[[File:Screen_Shot_2018-11-21_at_2.03.32_AM.png|700px|thumb|center]]<br />
<br />
The probability function represent the frequency of the word in the dataset.<br />
<br />
= Subsampling of Frequent Words = <br />
<br />
<br />
= Empirical Results =<br />
<br />
To evaluate the results of these optimization, Mikolov and Al. used an internal dataset at Google. This dataset contains 1 billions. By removing all workings which occured less than 5 times, dataset size dropped to 692K words. Two type of data analogies where looked at: syntactic and semantic analogies. Syntactic analogies is when two words have the same meaning but describe two different things (e.g. “quick” : “quickly” :: “slow” : “slowly”). Semantic is when two pairs of words have the same vector meaning. For example, “Berlin” : “Germany” and “Paris” : “France” are semantic analogies.<br />
<br />
[[File:wordembedding empiricalresults.png|700px|thumb|center]]<br />
<br />
Finally, <br />
<br />
[[File:wordembedding comparisons.png|700px|thumb|center]]<br />
<br />
=References=<br />
[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546.<br />
<br />
[2] McCormick, C. (2017, January 11). Word2Vec Tutorial Part 2 - Negative Sampling. Retrieved from http://www.mccormickml.com<br />
<br />
[3] McCormick, C. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from http://www.mccormickml.com</div>
Jj2hu
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Representations_of_Words_and_Phrases_and_their_Compositionality&diff=40672
Representations of Words and Phrases and their Compositionality
2018-11-21T07:53:10Z
<p>Jj2hu: </p>
<hr />
<div>Representations of Words and Phrases and their Compositionality is a popular paper published by the Google team led by Tomas Mikolov in 2013. It is known for its impact in the field of Natural Language Processing and the techniques described below are still used today.<br />
<br />
= Presented by = <br />
*F. Jiang<br />
*J. Hu<br />
*Y. Zhang<br />
<br />
<br />
= Introduction =<br />
This paper, "Distributed Representations of Words and Phrases and their Compositionality" proposes several methods of improving the performance metrics of the Skip-gram model, a Natural Language Processing technique of encoding words as arbitrary dimensional vectors using a neural network framework. Notably, the Skip-gram model can be made to train faster and produce higher accuracy via a number of adjustments; the replacement of the hierarchical soft max function with simple negative sampling, and the subsampling of frequent words.<br />
<br />
= Skip Gram Model =<br />
The Skip-gram model is a Natural Language Processing method based upon a neural network structure designed to learn vector representations of words in such a way as to produce similar encodings for words that appear in similar contexts. While the model can be used to evaluate certain probabilities, this is considered a side effect of its learning process; its primary function is that of a Word2Vec encoder.<br />
<br />
Skip-gram is structured as a one-layer neural network with no non-linear activation function in the hidden layer but a soft-max classifier in the output layer. Words or phrases are encoded using one-hot encoding; the input and output vectors are constructed such that the index of a certain word is indicated by the number 1 within a length determined by a pre-specified vocabulary or corpus (e.g. the word "ant" is indicated by the 1 at its position in the corpus; everything else is 0). The size of the hidden layer is also specified as a hyper parameter; larger sizes of the hidden layer will result in encodings of better quality but take longer to train.<br />
<br />
[[File:skipmod2.PNG|600px|thumb|center]] <br />
<br />
The central premise behind Skip-gram's learning process is that words or phrases that appear close together regularly in the training set are deemed to have similar contexts and should therefore be encoded in such a way as to maximize the probability of the model predicting their appearance together. Training data is prepared by producing a series of word pairs from the training test via a "window size" hyper-parameter that specifies all words a certain number ahead and behind the target word as the desired output, while iterating through all the words of the passage. For example, the model will may learn from the training set that "steering" and "wheel" appear in similar contexts. This means that one is a good predictor of the other, but also that "driving" is a good predictor of both. Thus, feeding any one of them into the model should produce high probabilities (of each appearing in the same context) for the all the others. Once we have a neural net that predicts contextual probabilities to an acceptable degree, the hidden layer weights are saved as the desired Word2Vec encodings (as an nxd matrix, each row represents a single encoding for the corpus word at that row index) <br />
<br />
[[File:window_skip.PNG|left]]<br />
[[File:table_skip.PNG|center]] <br />
<br />
One advantage of the Skip-gram model over older N-gram models is that the encodings preserve certain linguistic patterns that manifest in surprisingly clear and intuitive ways. For example, linear operations work on skip-gram encodings in a surprisingly logical way; the paper notes that on their trained model, vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector in the corpus. In a sense, subtracting "Spain" from "Madrid" extracts the notion of a capital city that when added to "France", produces "Paris". This property is so attractive that the paper uses it as a benchmark for their Skip-gram implementations ("Do linear operations produce logical results?")<br />
<br />
= Hierarchical Softmax = <br />
<br />
<br />
= Negative Sampling =<br />
<br />
Using the Skip-gram model, for each input word inside a 1M dictionary, we are adjusting 1M weights on the output layer. This can be very slow. NCE is the previous state of art solution which can efficiently reduce the number of parameters needed. In this paper, we are showing a new technique: Negative Sampling.<br />
<br />
Noise Contrastive Estimation (NCE) was introduced in 2012 by Gutmann and Hyvarinen. It uses logistic regression to differentiate data from noise. NCE maximizes the log probability of the softmax. This however not needed for the Skip-Gram Model since our goal is learning high-quality vector representations for context encoding. Negative Sampling is defined in the following formula:<br />
<br />
[[File:wordembedding negativesampling.png|700px|thumb|center]]<br />
<br />
It retains the quality of the Skip-Gram model by only updating a subset of the dataset: k = positive samples + negative samples. The value K can be set arbitrarily, though Mikolov recommend 2-5 for a large dataset and 5-20 for a smaller dataset for it to be useful. NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples.<br />
<br />
To determine which negative samples should be chosen, an unigram distribution is chosen based on empirical results. It is defined as:<br />
<br />
[[File:Screen_Shot_2018-11-21_at_2.03.32_AM.png|700px|thumb|center]]<br />
<br />
The probability function represent the frequency of the word in the dataset.<br />
<br />
= Subsampling of Frequent Words = <br />
<br />
<br />
= Empirical Results =<br />
<br />
To evaluate the results of these optimization, Mikolov and Al. used an internal dataset at Google. This dataset contains 1 billions. By removing all workings which occured less than 5 times, dataset size dropped to 692K words. Two type of data analogies where looked at: syntactic and semantic analogies. Syntactic analogies is when two words have the same meaning but describe two different things (e.g. “quick” : “quickly” :: “slow” : “slowly”). Semantic is when two pairs of words have the same vector meaning. For example, “Berlin” : “Germany” and “Paris” : “France” are semantic analogies.<br />
<br />
[[File:wordembedding empiricalresults.png|700px|thumb|center]]<br />
<br />
Finally, <br />
<br />
[[File:wordembedding comparisons.png|700px|thumb|center]]<br />
<br />
=References=<br />
[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546.<br />
<br />
[2] McCormick, C. (2017, January 11). Word2Vec Tutorial Part 2 - Negative Sampling. Retrieved from http://www.mccormickml.com<br />
<br />
[3] McCormick, C. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from http://www.mccormickml.com</div>
Jj2hu
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Representations_of_Words_and_Phrases_and_their_Compositionality&diff=40671
Representations of Words and Phrases and their Compositionality
2018-11-21T07:51:32Z
<p>Jj2hu: /* Skip Gram Model */</p>
<hr />
<div>Representations of Words and Phrases and their Compositionality is a popular paper published by the Google team led by Tomas Mikolov in 2013. It is known for its impact in the field of Natural Language Processing and the techniques described below are till today.<br />
<br />
= Presented by = <br />
*F. Jiang<br />
*J. Hu<br />
*Y. Zhang<br />
<br />
<br />
= Introduction =<br />
This paper, "Distributed Representations of Words and Phrases and their Compositionality" proposes several methods of improving the performance metrics of the Skip-gram model, a Natural Language Processing technique of encoding words as arbitrary dimensional vectors using a neural network framework. Notably, the Skip-gram model can be made to train faster and produce higher accuracy via a number of adjustments; the replacement of the hierarchical soft max function with simple negative sampling, and the subsampling of frequent words.<br />
<br />
= Skip Gram Model =<br />
The Skip-gram model is a Natural Language Processing method based upon a neural network structure designed to learn vector representations of words in such a way as to produce similar encodings for words that appear in similar contexts. While the model can be used to evaluate certain probabilities, this is considered a side effect of its learning process; its primary function is that of a Word2Vec encoder.<br />
<br />
Skip-gram is structured as a one-layer neural network with no non-linear activation function in the hidden layer but a soft-max classifier in the output layer. Words or phrases are encoded using one-hot encoding; the input and output vectors are constructed such that the index of a certain word is indicated by the number 1 within a length determined by a pre-specified vocabulary or corpus (e.g. the word "ant" is indicated by the 1 at its position in the corpus; everything else is 0). The size of the hidden layer is also specified as a hyper parameter; larger sizes of the hidden layer will result in encodings of better quality but take longer to train.<br />
<br />
[[File:skipmod2.PNG|600px|thumb|center]] <br />
<br />
The central premise behind Skip-gram's learning process is that words or phrases that appear close together regularly in the training set are deemed to have similar contexts and should therefore be encoded in such a way as to maximize the probability of the model predicting their appearance together. Training data is prepared by producing a series of word pairs from the training test via a "window size" hyper-parameter that specifies all words a certain number ahead and behind the target word as the desired output, while iterating through all the words of the passage. For example, the model will may learn from the training set that "steering" and "wheel" appear in similar contexts. This means that one is a good predictor of the other, but also that "driving" is a good predictor of both. Thus, feeding any one of them into the model should produce high probabilities (of each appearing in the same context) for the all the others. Once we have a neural net that predicts contextual probabilities to an acceptable degree, the hidden layer weights are saved as the desired Word2Vec encodings (as an nxd matrix, each row represents a single encoding for the corpus word at that row index) <br />
<br />
[[File:window_skip.PNG|left]]<br />
[[File:table_skip.PNG|center]] <br />
<br />
One advantage of the Skip-gram model over older N-gram models is that the encodings preserve certain linguistic patterns that manifest in surprisingly clear and intuitive ways. For example, linear operations work on skip-gram encodings in a surprisingly logical way; the paper notes that on their trained model, vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector in the corpus. In a sense, subtracting "Spain" from "Madrid" extracts the notion of a capital city that when added to "France", produces "Paris". This property is so attractive that the paper uses it as a benchmark for their Skip-gram implementations ("Do linear operations produce logical results?")<br />
<br />
= Hierarchical Softmax = <br />
<br />
<br />
= Negative Sampling =<br />
<br />
Using the Skip-gram model, for each input word inside a 1M dictionary, we are adjusting 1M weights on the output layer. This can be very slow. NCE is the previous state of art solution which can efficiently reduce the number of parameters needed. In this paper, we are showing a new technique: Negative Sampling.<br />
<br />
Noise Contrastive Estimation (NCE) was introduced in 2012 by Gutmann and Hyvarinen. It uses logistic regression to differentiate data from noise. NCE maximizes the log probability of the softmax. This however not needed for the Skip-Gram Model since our goal is learning high-quality vector representations for context encoding. Negative Sampling is defined in the following formula:<br />
<br />
[[File:wordembedding negativesampling.png|700px|thumb|center]]<br />
<br />
It retains the quality of the Skip-Gram model by only updating a subset of the dataset: k = positive samples + negative samples. The value K can be set arbitrarily, though Mikolov recommend 2-5 for a large dataset and 5-20 for a smaller dataset for it to be useful. NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples.<br />
<br />
To determine which negative samples should be chosen, an unigram distribution is chosen based on empirical results. It is defined as:<br />
<br />
[[File:Screen_Shot_2018-11-21_at_2.03.32_AM.png|700px|thumb|center]]<br />
<br />
The probability function represent the frequency of the word in the dataset.<br />
<br />
= Subsampling of Frequent Words = <br />
<br />
<br />
= Empirical Results =<br />
<br />
To evaluate the results of these optimization, Mikolov and Al. used an internal dataset at Google. This dataset contains 1 billions. By removing all workings which occured less than 5 times, dataset size dropped to 692K words. Two type of data analogies where looked at: syntactic and semantic analogies. Syntactic analogies is when two words have the same meaning but describe two different things (e.g. “quick” : “quickly” :: “slow” : “slowly”). Semantic is when two pairs of words have the same vector meaning. For example, “Berlin” : “Germany” and “Paris” : “France” are semantic analogies.<br />
<br />
[[File:wordembedding empiricalresults.png|700px|thumb|center]]<br />
<br />
Finally, <br />
<br />
[[File:wordembedding comparisons.png|700px|thumb|center]]<br />
<br />
=References=<br />
[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546.<br />
<br />
[2] McCormick, C. (2017, January 11). Word2Vec Tutorial Part 2 - Negative Sampling. Retrieved from http://www.mccormickml.com<br />
<br />
[3] McCormick, C. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from http://www.mccormickml.com</div>
Jj2hu
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Representations_of_Words_and_Phrases_and_their_Compositionality&diff=40670
Representations of Words and Phrases and their Compositionality
2018-11-21T07:51:19Z
<p>Jj2hu: /* Skip Gram Model */</p>
<hr />
<div>Representations of Words and Phrases and their Compositionality is a popular paper published by the Google team led by Tomas Mikolov in 2013. It is known for its impact in the field of Natural Language Processing and the techniques described below are till today.<br />
<br />
= Presented by = <br />
*F. Jiang<br />
*J. Hu<br />
*Y. Zhang<br />
<br />
<br />
= Introduction =<br />
This paper, "Distributed Representations of Words and Phrases and their Compositionality" proposes several methods of improving the performance metrics of the Skip-gram model, a Natural Language Processing technique of encoding words as arbitrary dimensional vectors using a neural network framework. Notably, the Skip-gram model can be made to train faster and produce higher accuracy via a number of adjustments; the replacement of the hierarchical soft max function with simple negative sampling, and the subsampling of frequent words.<br />
<br />
= Skip Gram Model =<br />
The Skip-gram model is a Natural Language Processing method based upon a neural network structure designed to learn vector representations of words in such a way as to produce similar encodings for words that appear in similar contexts. While the model can be used to evaluate certain probabilities, this is considered a side effect of its learning process; its primary function is that of a Word2Vec encoder.<br />
<br />
Skip-gram is structured as a one-layer neural network with no non-linear activation function in the hidden layer but a soft-max classifier in the output layer. Words or phrases are encoded using one-hot encoding; the input and output vectors are constructed such that the index of a certain word is indicated by the number 1 within a length determined by a pre-specified vocabulary or corpus (e.g. the word "ant" is indicated by the 1 at its position in the corpus; everything else is 0). The size of the hidden layer is also specified as a hyper parameter; larger sizes of the hidden layer will result in encodings of better quality but take longer to train.<br />
<br />
[[File:skipmod2.PNG|600px|thumb|center]] <br />
<br />
The central premise behind Skip-gram's learning process is that words or phrases that appear close together regularly in the training set are deemed to have similar contexts and should therefore be encoded in such a way as to maximize the probability of the model predicting their appearance together. Training data is prepared by producing a series of word pairs from the training test via a "window size" hyper-parameter that specifies all words a certain number ahead and behind the target word as the desired output, while iterating through all the words of the passage. For example, the model will may learn from the training set that "steering" and "wheel" appear in similar contexts. This means that one is a good predictor of the other, but also that "driving" is a good predictor of both. Thus, feeding any one of them into the model should produce high probabilities (of each appearing in the same context) for the all the others. Once we have a neural net that predicts contextual probabilities to an acceptable degree, the hidden layer weights are saved as the desired Word2Vec encodings (as an nxd matrix, each row represents a single encoding for the corpus word at that row index) <br />
<br />
[[File:window_skip.PNG|thumb|left]]<br />
[[File:table_skip.PNG|thumb|center]] <br />
<br />
One advantage of the Skip-gram model over older N-gram models is that the encodings preserve certain linguistic patterns that manifest in surprisingly clear and intuitive ways. For example, linear operations work on skip-gram encodings in a surprisingly logical way; the paper notes that on their trained model, vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector in the corpus. In a sense, subtracting "Spain" from "Madrid" extracts the notion of a capital city that when added to "France", produces "Paris". This property is so attractive that the paper uses it as a benchmark for their Skip-gram implementations ("Do linear operations produce logical results?")<br />
<br />
= Hierarchical Softmax = <br />
<br />
<br />
= Negative Sampling =<br />
<br />
Using the Skip-gram model, for each input word inside a 1M dictionary, we are adjusting 1M weights on the output layer. This can be very slow. NCE is the previous state of art solution which can efficiently reduce the number of parameters needed. In this paper, we are showing a new technique: Negative Sampling.<br />
<br />
Noise Contrastive Estimation (NCE) was introduced in 2012 by Gutmann and Hyvarinen. It uses logistic regression to differentiate data from noise. NCE maximizes the log probability of the softmax. This however not needed for the Skip-Gram Model since our goal is learning high-quality vector representations for context encoding. Negative Sampling is defined in the following formula:<br />
<br />
[[File:wordembedding negativesampling.png|700px|thumb|center]]<br />
<br />
It retains the quality of the Skip-Gram model by only updating a subset of the dataset: k = positive samples + negative samples. The value K can be set arbitrarily, though Mikolov recommend 2-5 for a large dataset and 5-20 for a smaller dataset for it to be useful. NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples.<br />
<br />
To determine which negative samples should be chosen, an unigram distribution is chosen based on empirical results. It is defined as:<br />
<br />
[[File:Screen_Shot_2018-11-21_at_2.03.32_AM.png|700px|thumb|center]]<br />
<br />
The probability function represent the frequency of the word in the dataset.<br />
<br />
= Subsampling of Frequent Words = <br />
<br />
<br />
= Empirical Results =<br />
<br />
To evaluate the results of these optimization, Mikolov and Al. used an internal dataset at Google. This dataset contains 1 billions. By removing all workings which occured less than 5 times, dataset size dropped to 692K words. Two type of data analogies where looked at: syntactic and semantic analogies. Syntactic analogies is when two words have the same meaning but describe two different things (e.g. “quick” : “quickly” :: “slow” : “slowly”). Semantic is when two pairs of words have the same vector meaning. For example, “Berlin” : “Germany” and “Paris” : “France” are semantic analogies.<br />
<br />
[[File:wordembedding empiricalresults.png|700px|thumb|center]]<br />
<br />
Finally, <br />
<br />
[[File:wordembedding comparisons.png|700px|thumb|center]]<br />
<br />
=References=<br />
[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546.<br />
<br />
[2] McCormick, C. (2017, January 11). Word2Vec Tutorial Part 2 - Negative Sampling. Retrieved from http://www.mccormickml.com<br />
<br />
[3] McCormick, C. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from http://www.mccormickml.com</div>
Jj2hu
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Representations_of_Words_and_Phrases_and_their_Compositionality&diff=40669
Representations of Words and Phrases and their Compositionality
2018-11-21T07:50:31Z
<p>Jj2hu: /* Introduction */</p>
<hr />
<div>Representations of Words and Phrases and their Compositionality is a popular paper published by the Google team led by Tomas Mikolov in 2013. It is known for its impact in the field of Natural Language Processing and the techniques described below are till today.<br />
<br />
= Presented by = <br />
*F. Jiang<br />
*J. Hu<br />
*Y. Zhang<br />
<br />
<br />
= Introduction =<br />
This paper, "Distributed Representations of Words and Phrases and their Compositionality" proposes several methods of improving the performance metrics of the Skip-gram model, a Natural Language Processing technique of encoding words as arbitrary dimensional vectors using a neural network framework. Notably, the Skip-gram model can be made to train faster and produce higher accuracy via a number of adjustments; the replacement of the hierarchical soft max function with simple negative sampling, and the subsampling of frequent words.<br />
<br />
= Skip Gram Model =<br />
The Skip-gram model is a Natural Language Processing method based upon a neural network structure designed to learn vector representations of words in such a way as to produce similar encodings for words that appear in similar contexts. While the model can be used to evaluate certain probabilities, this is considered a side effect of its learning process; its primary function is that of a Word2Vec encoder.<br />
<br />
Skip-gram is structured as a one-layer neural network with no non-linear activation function in the hidden layer but a soft-max classifier in the output layer. Words or phrases are encoded using one-hot encoding; the input and output vectors are constructed such that the index of a certain word is indicated by the number 1 within a length determined by a pre-specified vocabulary or corpus (e.g. the word "ant" is indicated by the 1 at its position in the corpus; everything else is 0). The size of the hidden layer is also specified as a hyper parameter; larger sizes of the hidden layer will result in encodings of better quality but take longer to train.<br />
<br />
[[File:skipmod2.PNG|600px|center]] <br />
<br />
The central premise behind Skip-gram's learning process is that words or phrases that appear close together regularly in the training set are deemed to have similar contexts and should therefore be encoded in such a way as to maximize the probability of the model predicting their appearance together. Training data is prepared by producing a series of word pairs from the training test via a "window size" hyper-parameter that specifies all words a certain number ahead and behind the target word as the desired output, while iterating through all the words of the passage. For example, the model will may learn from the training set that "steering" and "wheel" appear in similar contexts. This means that one is a good predictor of the other, but also that "driving" is a good predictor of both. Thus, feeding any one of them into the model should produce high probabilities (of each appearing in the same context) for the all the others. Once we have a neural net that predicts contextual probabilities to an acceptable degree, the hidden layer weights are saved as the desired Word2Vec encodings (as an nxd matrix, each row represents a single encoding for the corpus word at that row index) <br />
<br />
[[File:window_skip.PNG|left]]<br />
[[File:table_skip.PNG|center]] <br />
<br />
One advantage of the Skip-gram model over older N-gram models is that the encodings preserve certain linguistic patterns that manifest in surprisingly clear and intuitive ways. For example, linear operations work on skip-gram encodings in a surprisingly logical way; the paper notes that on their trained model, vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector in the corpus. In a sense, subtracting "Spain" from "Madrid" extracts the notion of a capital city that when added to "France", produces "Paris". This property is so attractive that the paper uses it as a benchmark for their Skip-gram implementations ("Do linear operations produce logical results?")<br />
<br />
= Hierarchical Softmax = <br />
<br />
<br />
= Negative Sampling =<br />
<br />
Using the Skip-gram model, for each input word inside a 1M dictionary, we are adjusting 1M weights on the output layer. This can be very slow. NCE is the previous state of art solution which can efficiently reduce the number of parameters needed. In this paper, we are showing a new technique: Negative Sampling.<br />
<br />
Noise Contrastive Estimation (NCE) was introduced in 2012 by Gutmann and Hyvarinen. It uses logistic regression to differentiate data from noise. NCE maximizes the log probability of the softmax. This however not needed for the Skip-Gram Model since our goal is learning high-quality vector representations for context encoding. Negative Sampling is defined in the following formula:<br />
<br />
[[File:wordembedding negativesampling.png|700px|thumb|center]]<br />
<br />
It retains the quality of the Skip-Gram model by only updating a subset of the dataset: k = positive samples + negative samples. The value K can be set arbitrarily, though Mikolov recommend 2-5 for a large dataset and 5-20 for a smaller dataset for it to be useful. NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples.<br />
<br />
To determine which negative samples should be chosen, an unigram distribution is chosen based on empirical results. It is defined as:<br />
<br />
[[File:Screen_Shot_2018-11-21_at_2.03.32_AM.png|700px|thumb|center]]<br />
<br />
The probability function represent the frequency of the word in the dataset.<br />
<br />
= Subsampling of Frequent Words = <br />
<br />
<br />
= Empirical Results =<br />
<br />
To evaluate the results of these optimization, Mikolov and Al. used an internal dataset at Google. This dataset contains 1 billions. By removing all workings which occured less than 5 times, dataset size dropped to 692K words. Two type of data analogies where looked at: syntactic and semantic analogies. Syntactic analogies is when two words have the same meaning but describe two different things (e.g. “quick” : “quickly” :: “slow” : “slowly”). Semantic is when two pairs of words have the same vector meaning. For example, “Berlin” : “Germany” and “Paris” : “France” are semantic analogies.<br />
<br />
[[File:wordembedding empiricalresults.png|700px|thumb|center]]<br />
<br />
Finally, <br />
<br />
[[File:wordembedding comparisons.png|700px|thumb|center]]<br />
<br />
=References=<br />
[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546.<br />
<br />
[2] McCormick, C. (2017, January 11). Word2Vec Tutorial Part 2 - Negative Sampling. Retrieved from http://www.mccormickml.com<br />
<br />
[3] McCormick, C. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from http://www.mccormickml.com</div>
Jj2hu
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Representations_of_Words_and_Phrases_and_their_Compositionality&diff=40663
Representations of Words and Phrases and their Compositionality
2018-11-21T07:42:38Z
<p>Jj2hu: /* Introduction */</p>
<hr />
<div>Representations of Words and Phrases and their Compositionality is a popular paper published by the Google team led by Tomas Mikolov in 2013. It is known for its impact in the field of Natural Language Processing and the techniques described below are till in practice today.<br />
<br />
= Presented by = <br />
*F. Jiang<br />
*J. Hu<br />
*Y. Zhang<br />
<br />
<br />
= Introduction =<br />
<br />
= Skip Gram Model =<br />
The Skip-gram model is a Natural Language Processing method based upon a neural network structure designed to learn vector representations of words in such a way as to produce similar encodings for words that appear in similar contexts. While the model can be used to evaluate certain probabilities, this is considered a side effect of its learning process; its primary function is that of a Word2Vec encoder.<br />
<br />
Skip-gram is structured as a one-layer neural network with no non-linear activation function in the hidden layer but a soft-max classifier in the output layer. Words or phrases are encoded using one-hot encoding; the input and output vectors are constructed such that the index of a certain word is indicated by the number 1 within a length determined by a pre-specified vocabulary or corpus (e.g. the word "ant" is indicated by the 1 at its position in the corpus; everything else is 0). The size of the hidden layer is also specified as a hyper parameter; larger sizes of the hidden layer will result in encodings of better quality but take longer to train.<br />
<br />
[[File:skipmod2.PNG|600px|center]] <br />
<br />
The central premise behind Skip-gram's learning process is that words or phrases that appear close together regularly in the training set are deemed to have similar contexts and should therefore be encoded in such a way as to maximize the probability of the model predicting their appearance together. Training data is prepared by producing a series of word pairs from the training test via a "window size" hyper-parameter that specifies all words a certain number ahead and behind the target word as the desired output, while iterating through all the words of the passage. For example, the model will may learn from the training set that "steering" and "wheel" appear in similar contexts. This means that one is a good predictor of the other, but also that "driving" is a good predictor of both. Thus, feeding any one of them into the model should produce high probabilities (of each appearing in the same context) for the all the others. Once we have a neural net that predicts contextual probabilities to an acceptable degree, the hidden layer weights are saved as the desired Word2Vec encodings (as an nxd matrix, each row represents a single encoding for the corpus word at that row index) <br />
<br />
[[File:window_skip.PNG|left]]<br />
[[File:table_skip.PNG|center]] <br />
<br />
One advantage of the Skip-gram model over older N-gram models is that the encodings preserve certain linguistic patterns that manifest in surprisingly clear and intuitive ways. For example, linear operations work on skip-gram encodings in a surprisingly logical way; the paper notes that on their trained model, vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector in the corpus. In a sense, subtracting "Spain" from "Madrid" extracts the notion of a capital city that when added to "France", produces "Paris". This property is so attractive that the paper uses it as a benchmark for their Skip-gram implementations ("Do linear operations produce logical results?")<br />
<br />
= Hierarchical Softmax = <br />
<br />
<br />
= Negative Sampling =<br />
<br />
Using the Skip-gram model, for each input word inside a 1M dictionary, we are adjusting 1M weights on the output layer. This can be very slow. NCE is the previous state of art solution which can efficiently reduce the number of parameters needed. In this paper, we are showing a new technique: Negative Sampling.<br />
<br />
Noise Contrastive Estimation (NCE) was introduced in 2012 by Gutmann and Hyvarinen. It uses logistic regression to differentiate data from noise. NCE maximizes the log probability of the softmax. This however not needed for the Skip-Gram Model since our goal is learning high-quality vector representations for context encoding. Negative Sampling is defined in the following formula:<br />
<br />
[[File:wordembedding negativesampling.png ]]<br />
<br />
It retains the quality of the Skip-Gram model by only updating a subset of the dataset: k = positive samples + negative samples. The value K can be set arbitrarily, though Mikolov recommend 2-5 for a large dataset and 5-20 for a smaller dataset for it to be useful. NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples.<br />
<br />
To determine which negative samples should be chosen, an unigram distribution is chosen based on empirical results. It is shown below:<br />
<br />
[[File:Screen_Shot_2018-11-21_at_2.03.32_AM.png]]<br />
<br />
= Subsampling of Frequent Words = <br />
<br />
<br />
= Empirical Results =<br />
<br />
To evaluate the results of these optimization, Mikolov and Al. used an internal dataset at Google. This dataset contains 1 billions. By removing all workings which occured less than 5 times, dataset size dropped to 692K words. Two type of data analogies where looked at: syntactic and semantic analogies. Syntactic analogies is when two words have the same meaning but describe two different things (e.g. “quick” : “quickly” :: “slow” : “slowly”). Semantic is when two pairs of words have the same vector meaning. For example, “Berlin” : “Germany” and “Paris” : “France” are semantic analogies.<br />
<br />
<br />
[[File:wordembedding empiricalresults.png]]<br />
<br />
[[File:wordembedding comparisons.png]]<br />
<br />
=References=<br />
[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546.<br />
<br />
[2] McCormick, C. (2017, January 11). Word2Vec Tutorial Part 2 - Negative Sampling. Retrieved from http://www.mccormickml.com</div>
Jj2hu
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Representations_of_Words_and_Phrases_and_their_Compositionality&diff=40661
Representations of Words and Phrases and their Compositionality
2018-11-21T07:42:22Z
<p>Jj2hu: /* Skip Gram Model */</p>
<hr />
<div>Representations of Words and Phrases and their Compositionality is a popular paper published by the Google team led by Tomas Mikolov in 2013. It is known for its impact in the field of Natural Language Processing and the techniques described below are till in practice today.<br />
<br />
= Presented by = <br />
*F. Jiang<br />
*J. Hu<br />
*Y. Zhang<br />
<br />
<br />
= Introduction =<br />
The Skip-gram model is a Natural Language Processing method based upon a neural network structure designed to learn vector representations of words in such a way as to produce similar encodings for words that appear in similar contexts. While the model can be used to evaluate certain probabilities, this is considered a side effect of its learning process; its primary function is that of a Word2Vec encoder.<br />
<br />
Skip-gram is structured as a one-layer neural network with no non-linear activation function in the hidden layer but a soft-max classifier in the output layer. Words or phrases are encoded using one-hot encoding; the input and output vectors are constructed such that the index of a certain word is indicated by the number 1 within a length determined by a pre-specified vocabulary or corpus (e.g. the word "ant" is indicated by the 1 at its position in the corpus; everything else is 0). The size of the hidden layer is also specified as a hyper parameter; larger sizes of the hidden layer will result in encodings of better quality but take longer to train.<br />
<br />
[[File:skipmod2.PNG|600px|center]] <br />
<br />
The central premise behind Skip-gram's learning process is that words or phrases that appear close together regularly in the training set are deemed to have similar contexts and should therefore be encoded in such a way as to maximize the probability of the model predicting their appearance together. Training data is prepared by producing a series of word pairs from the training test via a "window size" hyper-parameter that specifies all words a certain number ahead and behind the target word as the desired output, while iterating through all the words of the passage. For example, the model will may learn from the training set that "steering" and "wheel" appear in similar contexts. This means that one is a good predictor of the other, but also that "driving" is a good predictor of both. Thus, feeding any one of them into the model should produce high probabilities (of each appearing in the same context) for the all the others. Once we have a neural net that predicts contextual probabilities to an acceptable degree, the hidden layer weights are saved as the desired Word2Vec encodings (as an nxd matrix, each row represents a single encoding for the corpus word at that row index) <br />
<br />
[[File:window_skip.PNG|left]]<br />
[[File:table_skip.PNG|center]] <br />
<br />
One advantage of the Skip-gram model over older N-gram models is that the encodings preserve certain linguistic patterns that manifest in surprisingly clear and intuitive ways. For example, linear operations work on skip-gram encodings in a surprisingly logical way; the paper notes that on their trained model, vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector in the corpus. In a sense, subtracting "Spain" from "Madrid" extracts the notion of a capital city that when added to "France", produces "Paris". This property is so attractive that the paper uses it as a benchmark for their Skip-gram implementations ("Do linear operations produce logical results?")<br />
<br />
= Skip Gram Model =<br />
The Skip-gram model is a Natural Language Processing method based upon a neural network structure designed to learn vector representations of words in such a way as to produce similar encodings for words that appear in similar contexts. While the model can be used to evaluate certain probabilities, this is considered a side effect of its learning process; its primary function is that of a Word2Vec encoder.<br />
<br />
Skip-gram is structured as a one-layer neural network with no non-linear activation function in the hidden layer but a soft-max classifier in the output layer. Words or phrases are encoded using one-hot encoding; the input and output vectors are constructed such that the index of a certain word is indicated by the number 1 within a length determined by a pre-specified vocabulary or corpus (e.g. the word "ant" is indicated by the 1 at its position in the corpus; everything else is 0). The size of the hidden layer is also specified as a hyper parameter; larger sizes of the hidden layer will result in encodings of better quality but take longer to train.<br />
<br />
[[File:skipmod2.PNG|600px|center]] <br />
<br />
The central premise behind Skip-gram's learning process is that words or phrases that appear close together regularly in the training set are deemed to have similar contexts and should therefore be encoded in such a way as to maximize the probability of the model predicting their appearance together. Training data is prepared by producing a series of word pairs from the training test via a "window size" hyper-parameter that specifies all words a certain number ahead and behind the target word as the desired output, while iterating through all the words of the passage. For example, the model will may learn from the training set that "steering" and "wheel" appear in similar contexts. This means that one is a good predictor of the other, but also that "driving" is a good predictor of both. Thus, feeding any one of them into the model should produce high probabilities (of each appearing in the same context) for the all the others. Once we have a neural net that predicts contextual probabilities to an acceptable degree, the hidden layer weights are saved as the desired Word2Vec encodings (as an nxd matrix, each row represents a single encoding for the corpus word at that row index) <br />
<br />
[[File:window_skip.PNG|left]]<br />
[[File:table_skip.PNG|center]] <br />
<br />
One advantage of the Skip-gram model over older N-gram models is that the encodings preserve certain linguistic patterns that manifest in surprisingly clear and intuitive ways. For example, linear operations work on skip-gram encodings in a surprisingly logical way; the paper notes that on their trained model, vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector in the corpus. In a sense, subtracting "Spain" from "Madrid" extracts the notion of a capital city that when added to "France", produces "Paris". This property is so attractive that the paper uses it as a benchmark for their Skip-gram implementations ("Do linear operations produce logical results?")<br />
<br />
= Hierarchical Softmax = <br />
<br />
<br />
= Negative Sampling =<br />
<br />
Using the Skip-gram model, for each input word inside a 1M dictionary, we are adjusting 1M weights on the output layer. This can be very slow. Negative Sampling solve this problem.<br />
<br />
= Subsampling of Frequent Words = <br />
<br />
<br />
= Empirical Results =<br />
<br />
To evaluate the results of these optimization, Mikolov and Al. used an internal dataset at Google. This dataset contains 1 billions. By removing all workings which occured less than 5 times, dataset size dropped to 692K words. Two type of data analogies where looked at: syntactic and semantic analogies. Syntactic analogies is when two words have the same meaning but describe two different things (e.g. “quick” : “quickly” :: “slow” : “slowly”). Semantic is when two pairs of words have the same vector meaning. For example, “Berlin” : “Germany” and “Paris” : “France” are semantic analogies.<br />
<br />
<br />
[[File:wordembedding empiricalresults.png]]<br />
<br />
[[File:wordembedding comparisons.png]]<br />
<br />
=References=<br />
[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546.<br />
<br />
[2] McCormick, C. (2017, January 11). Word2Vec Tutorial Part 2 - Negative Sampling. Retrieved from http://www.mccormickml.com</div>
Jj2hu
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Representations_of_Words_and_Phrases_and_their_Compositionality&diff=40660
Representations of Words and Phrases and their Compositionality
2018-11-21T07:41:53Z
<p>Jj2hu: /* Introduction */</p>
<hr />
<div>Representations of Words and Phrases and their Compositionality is a popular paper published by the Google team led by Tomas Mikolov in 2013. It is known for its impact in the field of Natural Language Processing and the techniques described below are till in practice today.<br />
<br />
= Presented by = <br />
*F. Jiang<br />
*J. Hu<br />
*Y. Zhang<br />
<br />
<br />
= Introduction =<br />
The Skip-gram model is a Natural Language Processing method based upon a neural network structure designed to learn vector representations of words in such a way as to produce similar encodings for words that appear in similar contexts. While the model can be used to evaluate certain probabilities, this is considered a side effect of its learning process; its primary function is that of a Word2Vec encoder.<br />
<br />
Skip-gram is structured as a one-layer neural network with no non-linear activation function in the hidden layer but a soft-max classifier in the output layer. Words or phrases are encoded using one-hot encoding; the input and output vectors are constructed such that the index of a certain word is indicated by the number 1 within a length determined by a pre-specified vocabulary or corpus (e.g. the word "ant" is indicated by the 1 at its position in the corpus; everything else is 0). The size of the hidden layer is also specified as a hyper parameter; larger sizes of the hidden layer will result in encodings of better quality but take longer to train.<br />
<br />
[[File:skipmod2.PNG|600px|center]] <br />
<br />
The central premise behind Skip-gram's learning process is that words or phrases that appear close together regularly in the training set are deemed to have similar contexts and should therefore be encoded in such a way as to maximize the probability of the model predicting their appearance together. Training data is prepared by producing a series of word pairs from the training test via a "window size" hyper-parameter that specifies all words a certain number ahead and behind the target word as the desired output, while iterating through all the words of the passage. For example, the model will may learn from the training set that "steering" and "wheel" appear in similar contexts. This means that one is a good predictor of the other, but also that "driving" is a good predictor of both. Thus, feeding any one of them into the model should produce high probabilities (of each appearing in the same context) for the all the others. Once we have a neural net that predicts contextual probabilities to an acceptable degree, the hidden layer weights are saved as the desired Word2Vec encodings (as an nxd matrix, each row represents a single encoding for the corpus word at that row index) <br />
<br />
[[File:window_skip.PNG|left]]<br />
[[File:table_skip.PNG|center]] <br />
<br />
One advantage of the Skip-gram model over older N-gram models is that the encodings preserve certain linguistic patterns that manifest in surprisingly clear and intuitive ways. For example, linear operations work on skip-gram encodings in a surprisingly logical way; the paper notes that on their trained model, vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector in the corpus. In a sense, subtracting "Spain" from "Madrid" extracts the notion of a capital city that when added to "France", produces "Paris". This property is so attractive that the paper uses it as a benchmark for their Skip-gram implementations ("Do linear operations produce logical results?")<br />
<br />
= Skip Gram Model = <br />
<br />
<br />
= Hierarchical Softmax = <br />
<br />
<br />
= Negative Sampling =<br />
<br />
Using the Skip-gram model, for each input word inside a 1M dictionary, we are adjusting 1M weights on the output layer. This can be very slow. Negative Sampling solve this problem.<br />
<br />
= Subsampling of Frequent Words = <br />
<br />
<br />
= Empirical Results =<br />
<br />
To evaluate the results of these optimization, Mikolov and Al. used an internal dataset at Google. This dataset contains 1 billions. By removing all workings which occured less than 5 times, dataset size dropped to 692K words. Two type of data analogies where looked at: syntactic and semantic analogies. Syntactic analogies is when two words have the same meaning but describe two different things (e.g. “quick” : “quickly” :: “slow” : “slowly”). Semantic is when two pairs of words have the same vector meaning. For example, “Berlin” : “Germany” and “Paris” : “France” are semantic analogies.<br />
<br />
<br />
[[File:wordembedding empiricalresults.png]]<br />
<br />
[[File:wordembedding comparisons.png]]<br />
<br />
=References=<br />
[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546.<br />
<br />
[2] McCormick, C. (2017, January 11). Word2Vec Tutorial Part 2 - Negative Sampling. Retrieved from http://www.mccormickml.com</div>
Jj2hu
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:window_skip.PNG&diff=40659
File:window skip.PNG
2018-11-21T07:28:16Z
<p>Jj2hu: </p>
<hr />
<div></div>
Jj2hu
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:table_skip.PNG&diff=40658
File:table skip.PNG
2018-11-21T07:27:51Z
<p>Jj2hu: </p>
<hr />
<div></div>
Jj2hu
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Representations_of_Words_and_Phrases_and_their_Compositionality&diff=40650
Representations of Words and Phrases and their Compositionality
2018-11-21T05:57:14Z
<p>Jj2hu: /* Introduction */</p>
<hr />
<div>Representations of Words and Phrases and their Compositionality is a popular paper published by the Google team led by Tomas Mikolov in 2013. It is known for its impact in the field of Natural Language Processing and the techniques described below are till in practice today.<br />
<br />
= Presented by = <br />
*F. Jiang<br />
*J. Hu<br />
*Y. Zhang<br />
<br />
<br />
= Introduction =<br />
The Skip-gram model is NLP method where a given word ("target") is fed into the model which outputs a vector indicating the probability of finding certain words in the immediate context ("surroundings") <br />
of the target word. Thus, words or phrases that appear together in the training sample with more regularity are deemed to have similar contexts and will result in a model generating higher output probabilities <br />
for one or the other. For example, inputting the word "light" will probably result in an output vector with high values for the words "show" and "bulb". <br />
<br />
[[File:skip_gen.PNG]]<br />
<br />
Skip-gram requires a pre-specified vocabulary of words or phrases containing all possible target and context words to work.<br />
<br />
= Skip Gram Model = <br />
<br />
<br />
= Hierarchical Softmax = <br />
<br />
<br />
= Negative Sampling = <br />
<br />
<br />
= Subsampling of Frequent Words = <br />
<br />
<br />
= Empirical Results =<br />
<br />
=References=<br />
[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546.<br />
<br />
[2] McCormick, C. (2017, January 11). Word2Vec Tutorial Part 2 - Negative Sampling. Retrieved from http://www.mccormickml.com</div>
Jj2hu
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:skip_gen.PNG&diff=40649
File:skip gen.PNG
2018-11-21T05:56:44Z
<p>Jj2hu: </p>
<hr />
<div></div>
Jj2hu
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Representations_of_Words_and_Phrases_and_their_Compositionality&diff=40648
Representations of Words and Phrases and their Compositionality
2018-11-21T05:24:32Z
<p>Jj2hu: /* Introduction */</p>
<hr />
<div>Representations of Words and Phrases and their Compositionality is a popular paper published by the Google team led by Tomas Mikolov in 2013. It is known for its impact in the field of Natural Language Processing and the techniques described below are till in practice today.<br />
<br />
= Presented by = <br />
*F. Jiang<br />
*J. Hu<br />
*Y. Zhang<br />
<br />
<br />
= Introduction =<br />
The Skip-gram model is a method of constructing a Word2Vec encoding using a neural network. Individual words are encoded <br />
<br />
[[File:skipmod.JPG|200px|thumb|left|alt text]]<br />
<br />
= Skip Gram Model = <br />
<br />
<br />
= Hierarchical Softmax = <br />
<br />
<br />
= Negative Sampling = <br />
<br />
<br />
= Subsampling of Frequent Words = <br />
<br />
<br />
= Empirical Results =<br />
<br />
=References=<br />
[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546.<br />
<br />
[2] McCormick, C. (2017, January 11). Word2Vec Tutorial Part 2 - Negative Sampling. Retrieved from http://www.mccormickml.com</div>
Jj2hu
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Representations_of_Words_and_Phrases_and_their_Compositionality&diff=40647
Representations of Words and Phrases and their Compositionality
2018-11-21T05:23:53Z
<p>Jj2hu: /* Introduction */</p>
<hr />
<div>Representations of Words and Phrases and their Compositionality is a popular paper published by the Google team led by Tomas Mikolov in 2013. It is known for its impact in the field of Natural Language Processing and the techniques described below are till in practice today.<br />
<br />
= Presented by = <br />
*F. Jiang<br />
*J. Hu<br />
*Y. Zhang<br />
<br />
<br />
= Introduction =<br />
The Skip-gram model is a method of constructing a Word2Vec encoding using a neural network. Individual words are encoded <br />
<br />
[[File:skipmod2.PNG|200px|thumb|left|alt text]]<br />
<br />
= Skip Gram Model = <br />
<br />
<br />
= Hierarchical Softmax = <br />
<br />
<br />
= Negative Sampling = <br />
<br />
<br />
= Subsampling of Frequent Words = <br />
<br />
<br />
= Empirical Results =<br />
<br />
=References=<br />
[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546.<br />
<br />
[2] McCormick, C. (2017, January 11). Word2Vec Tutorial Part 2 - Negative Sampling. Retrieved from http://www.mccormickml.com</div>
Jj2hu
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:skipmod2.PNG&diff=40646
File:skipmod2.PNG
2018-11-21T05:22:54Z
<p>Jj2hu: </p>
<hr />
<div></div>
Jj2hu
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:skipmod.JPG&diff=40645
File:skipmod.JPG
2018-11-21T05:20:16Z
<p>Jj2hu: </p>
<hr />
<div></div>
Jj2hu
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Representations_of_Words_and_Phrases_and_their_Compositionality&diff=40644
Representations of Words and Phrases and their Compositionality
2018-11-21T05:19:30Z
<p>Jj2hu: /* Introduction */</p>
<hr />
<div>Representations of Words and Phrases and their Compositionality is a popular paper published by the Google team led by Tomas Mikolov in 2013. It is known for its impact in the field of Natural Language Processing and the techniques described below are till in practice today.<br />
<br />
= Presented by = <br />
*F. Jiang<br />
*J. Hu<br />
*Y. Zhang<br />
<br />
<br />
= Introduction =<br />
The Skip-gram model is a method of constructing a Word2Vec encoding using a neural network.<br />
<br />
= Skip Gram Model = <br />
<br />
<br />
= Hierarchical Softmax = <br />
<br />
<br />
= Negative Sampling = <br />
<br />
<br />
= Subsampling of Frequent Words = <br />
<br />
<br />
= Empirical Results =<br />
<br />
=References=<br />
[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546.<br />
<br />
[2] McCormick, C. (2017, January 11). Word2Vec Tutorial Part 2 - Negative Sampling. Retrieved from http://www.mccormickml.com</div>
Jj2hu
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18&diff=38069
stat441F18
2018-11-06T18:26:18Z
<p>Jj2hu: </p>
<hr />
<div><br />
<br />
== [[F18-STAT841-Proposal| Project Proposal ]] ==<br />
<br />
[https://goo.gl/forms/apurag4dr9kSR76X2 Your feedback on presentations]<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Feb 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Nov 13 || Jason Schneider, Jordyn Walton, Zahraa Abbas, Andrew Na || 1|| Memory-Based Parameter Adaptation || [https://arxiv.org/pdf/1802.10542.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/images/0/0f/MbPA_Summary.pdf Summary] ||<br />
|-<br />
|Nov 13 ||Sai Praneeth M, Xudong Peng, Alice Li, Shahrzad Hosseini Vajargah|| 2|| Going Deeper with Convolutions ||[https://arxiv.org/pdf/1409.4842.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary]<br />
|-<br />
|NOv 15 || Yan Yu Chen, Qisi Deng, Hengxin Li, Bochao Zhang|| 3|| Topic Compositional Neural Language Model|| [https://arxiv.org/pdf/1712.09783.pdf paper] || <br />
|-<br />
|Nov 15 || Zhaoran Hou, Pei Wei Wang, Chi Zhang, Yiming Li, Daoyi Chen, Ying Chi|| 4|| Extreme Learning Machine for regression and Multi-class Classification|| [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6035797] || ||<br />
|-<br />
|NOv 20 || Kristi Brewster, Isaac McLellan, Ahmad Nayar Hassan, Marina Medhat Rassmi Melek, Brendan Ross, Jon Barenboim, Junqiao Lin, James Bootsma || 5|| A Neural Representation of Sketch Drawings || || <br />
|-<br />
|Nov 20 || Maya(Mahdiyeh) Bayati, Saber Malekmohammadi, Vincent Loung || 6|| Convolutional Neural Networks for Sentence Classiﬁcation || [https://arxiv.org/pdf/1408.5882.pdf paper] || <br />
|-<br />
|NOv 22 || Qingxi Huo, Yanmin Yang, Jiaqi Wang, Yuanjing Cai, Colin Stranc, Philomène Bobichon, Aditya Maheshwari, Zepeng An || 7|| Robust Probabilistic Modeling with Bayesian Data Reweighting || [http://proceedings.mlr.press/v70/wang17g/wang17g.pdf Paper] || <br />
|-<br />
|Nov 22 || Hanzhen Yang, Jing Pu Sun, Ganyuan Xuan, Yu Su, Jiacheng Weng, Keqi Li, Yi Qian, Bomeng Liu || 8|| Deep Residual Learning for Image Recognition || [http://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf Paper] || <br />
|-<br />
|NOv 27 || Mitchell Snaith || 9|| You Only Look Once: Unified, Real-Time Object Detection, V1 -> V3 || [https://arxiv.org/pdf/1506.02640.pdf Paper] || <br />
|-<br />
|Nov 27 || Qi Chu, Gloria Huang, Dylan Sang, Amanda Lam, Yan Jiao, Shuyue Wang, Yutong Wu, Shikun Cui || 10|| tba || || <br />
|-<br />
|NOv 29 || Jameson Ngo, Amy Xu, Aden Grant, Yu Hao Wang, Andrew McMurry, Baizhi Song || 11|| TBA || || <br />
|-<br />
|Nov 29 || Qianying Zhao, Hui Huang, Lingyun Yi, Jiayue Zhang, Siao Chen, Rongrong Su, Gezhou Zhang, Meiyu Zhou || 12|| || ||<br />
|-<br />
|Makeup || Hudson Ash, Stephen Kingston, Richard Zhang, Alexandre Xiao, Ziqiu Zhu || || || ||<br />
|-<br />
|Makeup || Frank Jiang, Yuan Zhang, Jerry Hu || || || ||<br />
|-<br />
|Makeup || || || || ||<br />
|-<br />
|Makeup || || || || ||</div>
Jj2hu
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=F18-STAT841-Proposal&diff=36680
F18-STAT841-Proposal
2018-10-08T02:18:16Z
<p>Jj2hu: </p>
<hr />
<div><br />
'''Use this format (Don’t remove Project 0)'''<br />
<br />
'''Project # 0'''<br />
Group members:<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
'''Title:''' Making a String Telephone<br />
<br />
'''Description:''' We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 1'''<br />
Group members:<br />
<br />
Weng, Jiacheng<br />
<br />
Li, Keqi<br />
<br />
Qian, Yi<br />
<br />
Liu, Bomeng<br />
<br />
'''Title:''' RSNA Pneumonia Detection Challenge<br />
<br />
'''Description:''' <br />
<br />
Our team’s project is the RSNA Pneumonia Detection Challenge from Kaggle competition. The primary goal of this project is to develop a machine learning tool to detect patients with pneumonia based on their chest radiographs (CXR). <br />
<br />
Pneumonia is an infection that inflames the air sacs in human lungs which has symptoms such as chest pain, cough, and fever [1]. Pneumonia can be very dangerous especially to infants and elders. In 2015, 920,000 children under the age of 5 died from this disease [2]. Due to its fatality to children, diagnosing pneumonia has a high order. A common method of diagnosing pneumonia is to obtain patients’ chest radiograph (CXR) which is a gray-scale scan image of patients’ chests using x-ray. The infected region due to pneumonia usually shows as an area or areas of increased opacity [3] on CXR. However, many other factors can also contribute to increase in opacity on CXR which makes the diagnose very challenging. The diagnose also requires highly-skilled clinicians and a lot of time of CXR screening. The Radiological Society of North America (RSNA®) sees the opportunity of using machine learning to potentially accelerate the initial CXR screening process. <br />
<br />
For the scope of this project, our team plans to contribute to solving this problem by applying our machine learning knowledge in image processing and classification. Team members are going to apply techniques that include, but are not limited to: logistic regression, random forest, SVM, kNN, CNN, etc., in order to successfully detect CXRs with pneumonia.<br />
<br />
<br />
[1] (Accessed 2018, Oct. 4). Pneumonia [Online]. MAYO CLINIC. Available from: https://www.mayoclinic.org/diseases-conditions/pneumonia/symptoms-causes/syc-20354204<br />
[2] (Accessed 2018, Oct. 4). RSNA Pneumonia Detection Challenge [Online]. Kaggle. Available from: https://www.kaggle.com/c/rsna-pneumonia-detection-challenge<br />
[3] Franquet T. Imaging of community-acquired pneumonia. J Thorac Imaging 2018 (epub ahead of print). PMID 30036297<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 2'''<br />
Group members:<br />
<br />
Hou, Zhaoran<br />
<br />
Zhang, Chi<br />
<br />
'''Title:''' <br />
<br />
'''Description:'''<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 3'''<br />
Group members:<br />
<br />
Hanzhen Yang<br />
<br />
Jing Pu Sun<br />
<br />
Ganyuan Xuan<br />
<br />
Yu Su<br />
<br />
'''Title:''' Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge<br />
<br />
'''Description:'''<br />
<br />
Our team chose the [https://www.kaggle.com/c/quickdraw-doodle-recognition Quick, Draw! Doodle Recognition Challenge] from the Kaggle Competition. The goal of the competition is to build an image recognition tool that can classify hand-drawn doodles into one of the 340 categories.<br />
<br />
The main challenge of the project remains in the training set being very noisy. Hand-drawn artwork may deviate substantially from the actual object, and is almost definitively different from person to person. Mislabeled images also present a problem since they will create outlier points when we train our models. <br />
<br />
We plan on learning more about some of the currently mature image recognition algorithms to inspire and develop our own model.<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 4'''<br />
Group members:<br />
<br />
Snaith, Mitchell<br />
<br />
'''Title:''' Reproducibility report: *Fixing Variational Bayes: Deterministic Variational Inference for Bayesian Neural Networks*<br />
<br />
'''Description:''' <br />
<br />
The paper *Fixing Variational Bayes: Deterministic Variational Inference for Bayesian Neural Networks* [1] has been submitted to ICLR 2019. It aims to "fix" variational Bayes and turn it into a robust inference tool through two innovations. <br />
<br />
Goals are to: <br />
<br />
- reproduce the deterministic variational inference scheme as described in the paper without referencing the original author's code, providing a 3rd party implementation<br />
<br />
- reproduce experiment results with own implementation, using the same NN framework for reference implementations of compared methods described in the paper<br />
<br />
- reproduce experiment results with the author's own implementation<br />
<br />
- explore other possible applications of variational Bayes besides heteroscedastic regression<br />
<br />
[1] OpenReview location: https://openreview.net/forum?id=B1l08oAct7<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 5'''<br />
Group members:<br />
<br />
Rebecca, Chen<br />
<br />
Susan,<br />
<br />
Mike, Li<br />
<br />
Ted, Wang<br />
<br />
'''Title:''' Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge<br />
<br />
'''Description:''' <br />
<br />
Classification has become a more and more eye-catching, especially with the rise of machine learning in these years. Our team is particularly interested in machine learning algorithms that optimize some specific type image classification. <br />
<br />
In this project, we will dig into base classifiers we learnt from the class and try to cook them together to find an optimal solution for a certain type images dataset. Currently, we are looking into a dataset from Kaggle: Quick, Draw! Doodle Recognition Challenge. The dataset in this competition contains 50M drawings among 340 categories and is the subset of the world’s largest doodling dataset and the doodling dataset is updating by real drawing game players. Anyone can contribution by joining it! (quickdraw.withgoogle.com).<br />
<br />
For us, as machine learning students, we are more eager to help getting a better classification method. By “better”, we mean find a balance between simplify and accuracy. We will start with neural network via different activation functions in each layer and we will also combine base classifiers with bagging, random forest, boosting for ensemble learning. Also, we will try to regulate our parameters to avoid overfitting in training dataset. Last, we will summary features of this type image dataset, formulate our solutions and standardize our steps to solve this kind problems <br />
<br />
Hopefully, we can not only finish our project successfully, but also make a little contribution to machine learning research field.<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 6'''<br />
Group members:<br />
<br />
Ngo, Jameson<br />
<br />
Xu, Amy<br />
<br />
'''Title:''' Kaggle Challenge: [https://www.kaggle.com/c/human-protein-atlas-image-classification Human Protein Atlas Image Classification]<br />
<br />
'''Description:''' <br />
<br />
We will participate in the Human Protein Atlas Image Classification competition featured on Kaggle. We will classify proteins based on patterns seen in microscopic images of human cells.<br />
<br />
Historically, the work done to classify proteins had only developed methods to classify proteins using single patterns of very few cell types at a time. The goal of this challenge is to develop methods to classify proteins based on multiple/mixed patterns and with a larger range of cell types.<br />
<br />
--------------------------------------------------------------------<br />
'''Project # 7'''<br />
Group members:<br />
<br />
Qianying Zhao<br />
<br />
Hui Huang<br />
<br />
Meiyu Zhou<br />
<br />
Gezhou Zhang<br />
<br />
'''Title:''' Google Analytics Customer Revenue Prediction<br />
<br />
'''Description:''' <br />
Our group will participate in the featured Kaggle competition of Google Analytics Customer Revenue Prediction. In this competition, we will analyze customer dataset from a Google Merchandise Store selling swags to predict revenue per customer using Rstudio. Our presentation report will include not only how we've concluded by classifying and analyzing provided data with appropriate models, but also how we performed in the contest.<br />
<br />
--------------------------------------------------------------------<br />
'''Project # 8'''<br />
Group members:<br />
<br />
Jiayue Zhang<br />
<br />
Lingyun Yi<br />
<br />
Rongrong Su<br />
<br />
Siao Chen<br />
<br />
<br />
'''Title:''' Kaggle--Two Sigma: Using News to Predict Stock Movements<br />
<br />
<br />
'''Description:''' <br />
Stock price is affected by the news to some extent. What is the news influence on stock price and what is the predicted power of the news? <br />
What we are going to do is to use the content of news to predict the tendency of stock price. We will mine the data, finding the useful information behind the big data. As the result we will predict the stock price performance when market faces news.<br />
<br />
<br />
--------------------------------------------------------------------<br />
'''Project # 9'''<br />
Group members:<br />
<br />
Hassan, Ahmad Nayar<br />
<br />
McLellan, Isaac<br />
<br />
Brewster, Kristi<br />
<br />
Melek, Marina Medhat Rassmi <br />
<br />
<br />
'''Title:''' Quick, Draw! Doodle Recognition<br />
<br />
'''Description:''' <br />
<br />
'''Background'''<br />
<br />
Google’s Quick, Draw! is an online game where a user is prompted to draw an image depicting a certain category in under 20 seconds. As the drawing is being completed, the game uses a model which attempts to correctly identify the image being drawn. With the aim to improve the underlying pattern recognition model this game uses, Google is hosting a Kaggle competition asking the public to build a model to correctly identify a given drawing. The model should classify the drawing into one of the 340 label categories within the Quick, Draw! Game in 3 guesses or less.<br />
<br />
'''Proposed Approach'''<br />
<br />
Each image/doodle (input) is considered as a matrix of pixel values. In order to classify images, we need to essentially reshape an images’ respective matrix of pixel values - convolution. This would reduce the dimensionality of the input significantly which in turn reduces the number of parameters of any proposed recognition model. Using filters, pooling layers and further convolution, a final layer called the fully connected layer is used to correlate images with categories, assigning probabilities (weights) and hence classifying images. <br />
<br />
This approach to image classification is called convolution neural network (CNN) and we propose using this to classify the doodles within the Quick, Draw! Dataset.<br />
<br />
To control overfitting and underfitting of our proposed model and minimizing the error, we will use different architectures consisting of different types and dimensions of pooling layers and input filters.<br />
<br />
'''Challenges'''<br />
<br />
This project presents a number of interesting challenges:<br />
* The data given for training is noisy in that it contains drawings that are incomplete or simply poorly drawn. Dealing with this noise will be a significant part of our work. <br />
* There are 340 label categories within the Quick, Draw! dataset, this means that the model created must be able to classify drawings based on a large pool of information while making effective use of powerful computational resources.<br />
<br />
'''Tools & Resources'''<br />
<br />
* We will use Python & MATLAB.<br />
* We will use the Quick, Draw! Dataset available on the Kaggle competition website. <https://www.kaggle.com/c/quickdraw-doodle-recognition/data><br />
<br />
--------------------------------------------------------------------<br />
'''Project # 10'''<br />
Group members:<br />
<br />
Lam, Amanda<br />
<br />
Huang, Xiaoran<br />
<br />
Chu, Qi<br />
<br />
Sang, Di<br />
<br />
'''Title:''' Kaggle Competition: Quick, Draw! Doodle Recognition Challenge<br />
<br />
'''Description:'''<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 11'''<br />
Group members:<br />
<br />
Bobichon, Philomene<br />
<br />
Maheshwari, Aditya<br />
<br />
An, Zepeng<br />
<br />
Stranc, Colin<br />
<br />
'''Title:''' Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge<br />
<br />
'''Description:''' <br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 12'''<br />
Group members:<br />
<br />
Huo, Qingxi<br />
<br />
Yang, Yanmin<br />
<br />
Cai, Yuanjing<br />
<br />
Wang, Jiaqi<br />
<br />
'''Title:''' <br />
<br />
'''Description:''' <br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 13'''<br />
Group members:<br />
<br />
Ross, Brendan<br />
<br />
Barenboim, Jon<br />
<br />
Lin, Junqiao<br />
<br />
Bootsma, James<br />
<br />
'''Title:''' Expanding Neural Netwrok<br />
<br />
'''Description:''' The goal of our project is to create an expanding neural network algorithm which starts off by training a small neural network then expands it to a larger one. We hypothesize that with the proper expansion method we could decrease training time and prevent overfitting. The method we wish to explore is to link together input dimensions based on covariance. Then when the neural network reaches convergence we create a larger neural network without the links between dimensions and using starting values from the smaller neural network. <br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 14'''<br />
Group members:<br />
<br />
Schneider, Jason <br />
<br />
Walton, Jordyn <br />
<br />
Abbas, Zahraa<br />
<br />
Na, Andrew<br />
<br />
'''Title:''' Application of ML Classification to Cancer Identification<br />
<br />
'''Description:''' The application of machine learning to cancer classification based on gene expression is a topic of great interest to physicians and biostatisticians alike. We would like to work on this for our final project to encourage the application of proven ML techniques to improve accuracy of cancer classification and diagnosis. In this project, we will use the dataset from Golub et al. [1] which contains data on gene expression on tumour biopsies to train a model and classify healthy individuals and individuals who have cancer.<br />
<br />
One challenge we may face pertains to the way that the data was collected. Some parts of the dataset have thousands of features (which each represent a quantitative measure of the expression of a certain gene) but as few as twenty samples. We propose some ways to mitigate the impact of this; including the use of PCA, leave-one-out cross validation, or regularization. <br />
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 15'''<br />
Group members:<br />
<br />
Praneeth, Sai<br />
<br />
Peng, Xudong <br />
<br />
Li, Alice<br />
<br />
Vajargah, Shahrzad<br />
<br />
'''Title:''' Google Analytics Customer Revenue Prediction [1] - A Kaggle Competition<br />
<br />
'''Description:''' Guess which cabin class in airlines is the most profitable? One might guess economy - but in reality, it's the premium classes that show higher returns. According to research conducted by Wendover productions [2], despite having less than 50 seats and taking up more space than the economy class, premium classes end up driving more revenue than other classes.<br />
<br />
In fact, just like airlines, many companies adopt the business model where the vast majority of revenue is derived from a minority group of customers. As a result, data-intensive promotional strategies are getting more and more attention nowadays from marketing teams to further improve company returns.<br />
<br />
In this Kaggle competition, we are challenged to analyze a Google Merchanidize Store's customer dataset to predict revenue per customer. We will implement a series of data analytics methods including pre-processing, data augmentation, and parameter tuning. Different classification algorithms will be compared and optimized in order to achieve the best results.<br />
<br />
'''Reference:'''<br />
<br />
[1] Kaggle. (2018, Sep 18). Google Analytics Customer Revenue Prediction. Retrieved from https://www.kaggle.com/c/ga-customer-revenue-prediction<br />
<br />
[2] Kottke, J (2017, Mar 17). The economics of airline classes. Retrieved from https://kottke.org/17/03/the-economics-of-airline-classes<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 16'''<br />
Group members:<br />
<br />
Wang, Yu Hao<br />
<br />
Grant, Aden <br />
<br />
McMurray, Andrew<br />
<br />
Song, Baizhi<br />
<br />
'''Title:''' Two Sigma: Using News to Predict Stock Movements - A Kaggle Competition<br />
<br />
By analyzing news data to predict stock prices, Kagglers have a unique opportunity to advance the state of research in understanding the predictive power of the news. This power, if harnessed, could help predict financial outcomes and generate significant economic impact all over the world.<br />
<br />
Data for this competition comes from the following sources:<br />
<br />
Market data provided by Intrinio.<br />
News data provided by Thomson Reuters. Copyright ©, Thomson Reuters, 2017. All Rights Reserved. Use, duplication, or sale of this service, or data contained herein, except as described in the Competition Rules, is strictly prohibited.<br />
<br />
we will test a variety of classification algorithms to determine an appropriate model.<br />
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 17'''<br />
Group Members:<br />
<br />
Jiang, Ya Fan<br />
<br />
Zhang, Yuan<br />
<br />
Hu, Jerry Jie<br />
<br />
'''Title:''' Kaggle Competition: Quick, Draw! Doodle Recognition Challenge<br />
<br />
'''Description:''' Construction of a classifier that can learn from noisy training data and generalize to a clean test set . Training data coming from the Google game "Quick, Draw"<br />
<br />
----------------------------------------------------------------------</div>
Jj2hu