# Difference between revisions of "continuous space language models"

(→Model) |
|||

Line 1: | Line 1: | ||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

= Model = | = Model = | ||

Line 58: | Line 17: | ||

<math>\,o=Vh+k</math> where V is the weight matrix from hidden to output and k is another bias vector. <math>\,o</math> would be a vector with same dimensions as the total vocabulary size and the probabilities can be calculated from <math>\,o</math> by applying the softmax function. | <math>\,o=Vh+k</math> where V is the weight matrix from hidden to output and k is another bias vector. <math>\,o</math> would be a vector with same dimensions as the total vocabulary size and the probabilities can be calculated from <math>\,o</math> by applying the softmax function. | ||

− | The following figure shows the Architecture of the neural network language model. <math>\,h_j</math> denotes the context <math>\, | + | The following figure shows the Architecture of the neural network language model. <math>\,h_j</math> denotes the context <math>\,w_{j-n+1}^{j-1}</math>. P is the size of one projection and H and N is the |

size of the second hidden and output layer, respectively. When short-lists are used the size of the output layer is much smaller than the size | size of the second hidden and output layer, respectively. When short-lists are used the size of the output layer is much smaller than the size | ||

of the vocabulary. | of the vocabulary. | ||

[[File:Q3.png]] | [[File:Q3.png]] | ||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− |

## Revision as of 18:51, 12 December 2015

# Model

The researchers for this paper sought to find a better model for this probability than the back-off n-grams model. Their approach was to map the n-1 words sequence onto a multi-dimension continuous space using a layer of neural network followed by another layer to estimate the probabilities of all possible next words. The formulas and model goes as follows:

For some sequence of n-1 words, encode each word using 1 of K encoding, i.e. 1 where the word is indexed and zero everywhere else. Label each 1 of K encoding by [math](w_{j-n+1},\dots,w_j)[/math] for some n-1 word sequence at the j'th word in some larger context.

Let P be a projection matrix common to all n-1 words and let

[math]\,a_i=Pw_{j-n+i},i=1,\dots,n-1[/math]

Let H be the weight matrix from the projection layer to the hidden layer and the state of H would be:

[math]\,h=tanh(Ha + b)[/math] where A is the concatenation of all [math]\,a_i[/math] and [math]\,b[/math] is some bias vector

Finally, the output vector would be:

[math]\,o=Vh+k[/math] where V is the weight matrix from hidden to output and k is another bias vector. [math]\,o[/math] would be a vector with same dimensions as the total vocabulary size and the probabilities can be calculated from [math]\,o[/math] by applying the softmax function.

The following figure shows the Architecture of the neural network language model. [math]\,h_j[/math] denotes the context [math]\,w_{j-n+1}^{j-1}[/math]. P is the size of one projection and H and N is the size of the second hidden and output layer, respectively. When short-lists are used the size of the output layer is much smaller than the size of the vocabulary.