# continuous space language models

# Introduction

In certain fields of study such as speech recognition or machine translation, for some acoustic signal [math]\,x[/math] or the source sentence to be translated [math]\,e[/math], it is common to model these problems as finding the sequence of words [math]\,w^*[/math] that has the highest probability of occurring given [math]\,x[/math] or [math]\,e[/math]. This can be written as:

[math]w^* = arg\ \underset {w}{max} P(w|x) = arg\ \underset{w}{max} P(x|w)P(w)[/math]

An acoustic or translation model can then be used for [math]\,P(x|w)[/math], similar to the idea behind LDA and QDA, and it remains to create a language model [math]\,P(w)[/math] to estimate the probability of any sequence of words [math]\,w[/math].

This is commonly done through the back-off n-grams model and the purpose behind this research paper is to use a neural network to better estimate [math]\,P(w)[/math].

# Back-off n-grams Model

A sequence of words will be defined as [math]\,w^i_1=(w_1,w_2,\dots,w_i)[/math] and the formula for the probability [math]\,P(w)[/math] can be rewritten as:

[math]P(w^n_1)=P(w_1,w_2,\dots,w_n)=P(w_1)\prod_{i=2}^n P(w_i|w^{i-1}_1)[/math]

It is common to estimate [math]\,P(w_i|w^{i-1}_1)[/math] through:

[math]\,P(w_i|w^{i-1}_1)\approx\frac{\mbox{number of occurrence of the sequence} (w_1,\dots,w_i)}{\mbox{number of occurrence of the sequence} (w_1,\dots,w_{i-1})}[/math]

However, it is practically impossible to have a training set large enough to contain every possible sequence of words if the sequence is long enough and some sequences would have an incorrect probability of 0 simply because it is not in the training set. This is known as the data sparseness problem. This problem is commonly resolved by considering only the last n-1 words instead of the whole context. However, even for small n, certain sequences could still be missing.

To solve this issue, a technique called back-off n-grams is used and the general formula goes as follows:

[math]\,P(w_i|w^{i-1}_1) = \begin{cases} \frac{\mbox{number of occurrence of the sequence}\ (w_1,\dots,w_i)}{\mbox{number of occurrence of the sequence}\ (w_1,\dots,w_{i-1})}, & \mbox{if number of occurrence of}\ (w_1,\dots,w_i)\ \mbox{is greater than some constant K} \\ \alpha P(w_i|w^{i-1}_2), & \mbox{otherwise} \end{cases}[/math]

[math]\,\alpha[/math] is typically a discounting factor that is less than 1 to account for the lack of direct data. It usually depends on the word sequence.

The general algorithm is then, if the data set does contain the sequence then calculate probability directly. Otherwise, apply a discounting factor and calculate the conditional probability with the first word in the sequence removed. For example, if the word sequence was "The dog barked" and it did not exist in the training set then the formula would be written as:

[math]\,P(\mbox{the,dog,barked}|\mbox{the,dog}) \approx \alpha P(\mbox{dog,barked}|\mbox{dog})[/math]

# Model

The researchers for this paper sought to find a better model for this probability than the back-off n-grams model. Their approach was to map the n-1 words sequence onto a multi-dimension continuous space using a layer of neural network followed by another layer to estimate the probabilities of all possible next words. The formulas and model goes as follows:

For some sequence of n-1 words, encode each word using 1 of K encoding, i.e. 1 where the word is indexed and zero everywhere else. Label each 1 of K encoding by [math](w_{j-n+1},\dots,w_j)[/math] for some n-1 word sequence at the j'th word in some larger context.

Let P be a projection matrix common to all n-1 words and let

[math]\,a_i=Pw_{j-n+i},i=1,\dots,n-1[/math]

Let H be the weight matrix from the projection layer to the hidden layer and the state of H would be:

[math]\,h=tanh(Ha + b)[/math] where A is the concatenation of all [math]\,a_i[/math] and [math]\,b[/math] is some bias vector

Finally, the output vector would be:

[math]\,o=Vh+k[/math] where V is the weight matrix from hidden to output and k is another bias vector. [math]\,o[/math] would be a vector with same dimensions as the total vocabulary size and the probabilities can be calculated from [math]\,o[/math] by applying the softmax function.

# Optimization and Training

The training was done with standard back-propagation on minimizing the error function:

[math]\,E=\sum_{i=1}^N t_i\ log p_i + \epsilon(\sum_{i,j}h^2_{ij}+\sum_{i,j}v^2_{ij})[/math]

[math]\,t_i[/math] is the desired output vector and the summations inside the epsilon bracket are regularization terms to prevent overfitting of [math]\,H[/math] and [math]\,V[/math].

The researchers applied several optimization techniques to speed up the training process.