Difference between revisions of "continuous space language models"

From statwiki
(Undo revision 27215 by Drishi (talk))
(Undo revision 27214 by Drishi (talk))

Revision as of 18:05, 12 December 2015

Model

The neural network language model has to perform two tasks: first, project all words of the context [math]\,h_j[/math] = [math]\,w_{j-n+1}^{j-1}[/math] onto a continuous space, and second, calculate the language model probability [math]P(w_{j}=i|h_{j})[/math]. The authors of the paper sought a better model for this probability than the back-off n-gram model. Their approach maps the sequence of n-1 context words onto a multi-dimensional continuous space using one neural network layer, followed by another layer that estimates the probabilities of all possible next words. The formulas and model are as follows:

For a sequence of n-1 words, encode each word using 1-of-K encoding, i.e. a vector with 1 at the word's index and 0 everywhere else. Denote these 1-of-K vectors by [math](w_{j-n+1},\dots,w_{j-1})[/math], the n-1 words preceding the j'th word in some larger context.

Let P be a projection matrix common to all n-1 words and let

[math]\,a_i=Pw_{j-n+i},i=1,\dots,n-1[/math]
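
Because each [math]\,w_{j-n+i}[/math] is a 1-of-K vector, the product with P simply selects one column of P, so the projection can be computed as a table lookup. The following is a minimal NumPy sketch of that equivalence; the sizes and the names vocab_size, proj_dim and one_hot are illustrative assumptions, not taken from the paper.

```python
import numpy as np

vocab_size, proj_dim = 6, 3  # illustrative sizes (assumptions, not from the paper)
rng = np.random.default_rng(0)
P = rng.standard_normal((proj_dim, vocab_size))  # projection matrix, one column per word

def one_hot(index, size):
    """Return the 1-of-K encoding of a word index."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

word_index = 4
a_matmul = P @ one_hot(word_index, vocab_size)  # a_i = P w, as in the formula
a_lookup = P[:, word_index]                     # same vector via a column lookup
```

Both a_matmul and a_lookup hold the same vector, which is why the projection layer is usually implemented without forming the 1-of-K vectors explicitly.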

Let H be the weight matrix from the projection layer to the hidden layer. The state of the hidden layer is then:

[math]\,h=\tanh(Ha + b)[/math] where [math]\,a[/math] is the concatenation of all [math]\,a_i[/math] and [math]\,b[/math] is a bias vector.

Finally, the output vector would be:

[math]\,o=Vh+k[/math] where V is the weight matrix from the hidden layer to the output layer and k is another bias vector. [math]\,o[/math] is a vector whose dimension equals the vocabulary size, and the probabilities are obtained from [math]\,o[/math] by applying the softmax function.
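
The three steps above (projection, hidden layer, output layer with softmax) can be sketched as a single forward pass. This is a minimal NumPy illustration under assumed sizes and random weights, not the authors' implementation; the function names nnlm_forward and softmax and all dimensions are hypothetical.

```python
import numpy as np

def softmax(o):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(o - np.max(o))
    return e / e.sum()

def nnlm_forward(context, P, H, b, V, k):
    """Return a probability for every vocabulary word, given n-1 context word indices."""
    # Projection: P times a 1-of-K vector is just one column of P.
    a = np.concatenate([P[:, w] for w in context])
    h = np.tanh(H @ a + b)  # hidden layer state
    o = V @ h + k           # one score per vocabulary word
    return softmax(o)

# Illustrative sizes and random weights (assumptions, not values from the paper).
vocab_size, proj_dim, hidden_dim, n = 10, 4, 8, 3
rng = np.random.default_rng(0)
P = rng.standard_normal((proj_dim, vocab_size))
H = rng.standard_normal((hidden_dim, (n - 1) * proj_dim))
b = np.zeros(hidden_dim)
V = rng.standard_normal((vocab_size, hidden_dim))
k = np.zeros(vocab_size)

probs = nnlm_forward([2, 7], P, H, b, V, k)  # context of n-1 = 2 word indices
```

The result probs has one entry per vocabulary word and sums to 1, so a single forward pass yields the probabilities of all candidate next words at once.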

The following figure shows the architecture of the neural network language model. [math]\,h_j[/math] denotes the context [math]\,w_{j-n+1}^{j-1}[/math]. In the figure, P is the size of one projection, and H and N are the sizes of the second hidden layer and the output layer, respectively; these letters denote layer sizes there, not the matrices P and H defined above. When short-lists are used, the size of the output layer is much smaller than the size of the vocabulary.

Q3.png

In contrast to standard language modeling, where we want to know the probability of a word i given its context, [math]P(w_{j} = i|h_{j})[/math], the neural network simultaneously predicts the language model probability of all words in the word list:

File:Qq.png