# Introduction

In certain fields of study such as speech recognition or machine translation, for some acoustic signal $\,x$ or the source sentence to be translated $\,e$, it is common to model these problems as finding the sequence of words $\,w^*$ that has the highest probability of occurring given $\,x$ or $\,e$. This can be written as:

$w^* = arg\ \underset {w}{max} P(w|x) = arg\ \underset{w}{max} P(x|w)P(w)$

An acoustic or translation model can then be used for $\,P(x|w)$, similar to the idea behind LDA and QDA, and it remains to create a language model $\,P(w)$ to estimate the probability of any sequence of words $\,w$.

This is commonly done through the back-off n-grams model and the purpose behind this research paper is to use a neural network to better estimate $\,P(w)$.

# Back-off n-grams Model

A sequence of words will be defined as $\,w^i_1=(w_1,w_2,\dots,w_i)$ and the formula for the probability $\,P(w)$ can be rewritten as:

$P(w^n_1)=P(w_1,w_2,\dots,w_n)=P(w_1)\prod_{i=2}^n P(w_i|w^{i-1}_1)$

It is common to estimate $\,P(w_i|w^{i-1}_1)$ through:

$\,P(w_i|w^{i-1}_1)\approx\frac{\mbox{number of occurrence of the sequence} (w_1,\dots,w_i)}{\mbox{number of occurrence of the sequence} (w_1,\dots,w_{i-1})}$

However, it is practically impossible to have a training set large enough to contain every possible sequence of words if the sequence is long enough and some sequences would have an incorrect probability of 0 simply because it is not in the training set. This is known as the data sparseness problem. This problem is commonly resolved by considering only the last n-1 words instead of the whole context. However, even for small n, certain sequences could still be missing.

To solve this issue, a technique called back-off n-grams is used and the general formula goes as follows:

$\,P(w_i|w^{i-1}_1) = \begin{cases} \frac{\mbox{number of occurrence of the sequence}\ (w_1,\dots,w_i)}{\mbox{number of occurrence of the sequence}\ (w_1,\dots,w_{i-1})}, & \mbox{if number of occurrence of}\ (w_1,\dots,w_i)\ \mbox{is greater than some constant K} \\ \alpha P(w_i|w^{i-1}_2), & \mbox{otherwise} \end{cases}$

$\,\alpha$ is typically a discounting factor that is less than 1 to account for the lack of direct data. It usually depends on the word sequence.

The general algorithm is then, if the data set does contain the sequence then calculate probability directly. Otherwise, apply a discounting factor and calculate the conditional probability with the first word in the sequence removed. For example, if the word sequence was "The dog barked" and it did not exist in the training set then the formula would be written as:

$\,P(\mbox{the,dog,barked}|\mbox{the,dog}) \approx \alpha P(\mbox{dog,barked}|\mbox{dog})$

# Model

The researchers for this paper sought to find a better model for this probability than the back-off n-grams model. Their approach was to map the n-1 words sequence onto a multi-dimension continuous space using a layer of neural network followed by another layer to estimate the probabilities of all possible next words. The formulas and model goes as follows:

For some sequence of n-1 words, encode each word using 1 of K encoding, i.e. 1 where the word is indexed and zero everywhere else. Label each 1 of K encoding by $(w_{j-n+1},\dots,w_j)$ for some n-1 word sequence at the j'th word in some larger context.

Let P be a projection matrix common to all n-1 words and let

$\,a_i=Pw_{j-n+i},i=1,\dots,n-1$