Continuous Space Language Models


Introduction

In fields such as speech recognition and machine translation, given an acoustic signal [math]\,x[/math] or a source sentence [math]\,e[/math] to be translated, the problem is commonly modeled as finding the sequence of words [math]\,w^*[/math] with the highest probability given [math]\,x[/math] or [math]\,e[/math]. This can be written as:

[math]w^* = \underset{w}{\operatorname{arg\,max}}\ P(w|x) = \underset{w}{\operatorname{arg\,max}}\ P(x|w)P(w)[/math]
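
The second equality follows from Bayes' rule, since the denominator [math]\,P(x)[/math] does not depend on [math]\,w[/math] and can be dropped from the maximization:

[math]\underset{w}{\operatorname{arg\,max}}\ P(w|x) = \underset{w}{\operatorname{arg\,max}}\ \frac{P(x|w)P(w)}{P(x)} = \underset{w}{\operatorname{arg\,max}}\ P(x|w)P(w)[/math]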

An acoustic or translation model provides [math]\,P(x|w)[/math] (similar in spirit to the class-conditional densities used in LDA and QDA), and it remains to build a language model [math]\,P(w)[/math] that estimates the probability of any sequence of words [math]\,w[/math].

This is commonly done with a back-off n-gram model, and the purpose of this paper is to use a neural network to better estimate [math]\,P(w)[/math].

Back-off n-gram Model

A sequence of words is written as [math]\,w^i_1=(w_1,w_2,\dots,w_i)[/math], and the probability [math]\,P(w)[/math] of a sequence can be factored by the chain rule as:

[math]P(w^n_1)=P(w_1,w_2,\dots,w_n)=P(w_1)\prod_{i=2}^n P(w_i|w^{i-1}_1)[/math]
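
As a concrete illustration (a minimal sketch, not taken from the paper), the factorization can be evaluated by accumulating the log-probability of each word given its history; here cond_prob is a hypothetical placeholder for any model of [math]\,P(w_i|w^{i-1}_1)[/math]:

<pre>
import math

def sequence_log_prob(words, cond_prob):
    """Log-probability of a word sequence via the chain rule.

    cond_prob(word, history) is a hypothetical stand-in for any model
    of P(w_i | w_1, ..., w_{i-1}); it must return a probability > 0.
    """
    log_p = 0.0
    for i, w in enumerate(words):
        history = tuple(words[:i])            # (w_1, ..., w_{i-1})
        log_p += math.log(cond_prob(w, history))
    return log_p                              # log P(w_1, ..., w_n)
</pre>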

It is common to estimate [math]\,P(w_i|w^{i-1}_1)[/math] through:

[math]\,P(w_i|w^{i-1}_1)\approx\frac{\text{number of occurrences of the sequence }(w_1,\dots,w_i)}{\text{number of occurrences of the sequence }(w_1,\dots,w_{i-1})}[/math]
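
A minimal sketch of this count-based estimate in Python, assuming the corpus is given as a list of tokenized sentences (the function name and toy data are illustrative, not from the paper):

<pre>
def mle_cond_prob(corpus, prefix, word):
    """Relative-frequency estimate of P(word | prefix)."""
    prefix = tuple(prefix)
    numerator = denominator = 0
    for sentence in corpus:
        for i in range(len(sentence) - len(prefix)):
            if tuple(sentence[i:i + len(prefix)]) == prefix:
                denominator += 1                       # occurrences of the prefix
                if sentence[i + len(prefix)] == word:  # ... followed by the word
                    numerator += 1
    return numerator / denominator if denominator else 0.0

# toy example: P("sat" | "the", "cat") = 1/2
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
print(mle_cond_prob(corpus, ("the", "cat"), "sat"))
</pre>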

However, no training set can be large enough to contain every possible sequence of words once the sequences are long, so many valid sequences would be assigned an incorrect probability of 0 simply because they do not appear in the training set. This is known as the data sparseness problem. It is commonly mitigated by conditioning only on the last n-1 words instead of the whole context. However, even for small n, certain sequences can still be missing.
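
In symbols, the n-gram approximation keeps only the last [math]\,n-1[/math] words of the context:

[math]\,P(w_i|w^{i-1}_1)\approx P(w_i|w^{i-1}_{i-n+1})[/math]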

To address this issue, a technique called back-off n-grams is used; the general formula is as follows:

[math]\,P(w_i|w^{i-1}_{i-n+1}) = \begin{cases} d_{w^i_{i-n+1}}\,\frac{\text{count}(w^i_{i-n+1})}{\text{count}(w^{i-1}_{i-n+1})}, & \text{if count}(w^i_{i-n+1})>0 \\ \alpha_{w^{i-1}_{i-n+1}}\,P(w_i|w^{i-1}_{i-n+2}), & \text{otherwise} \end{cases}[/math]

where [math]\,d[/math] is a discounting factor that reserves probability mass for unseen n-grams and [math]\,\alpha[/math] is a back-off weight chosen so that the conditional probabilities sum to one.
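
A simplified sketch of this recursion in Python (not the paper's implementation): it uses a fixed discount and a constant back-off weight, whereas a real Katz back-off model computes [math]\,\alpha[/math] so that each conditional distribution sums to one:

<pre>
def backoff_prob(counts, context, word, vocab_size, discount=0.5, alpha=0.4):
    """Discounted back-off estimate of P(word | context).

    counts maps word tuples to corpus frequencies, with counts[()] the
    total number of tokens; the constant alpha stands in for the
    properly normalised back-off weight.
    """
    context = tuple(context)
    full = context + (word,)
    if counts.get(full, 0) > 0:
        # seen n-gram: discounted relative frequency
        return (counts[full] - discount) / counts[context]
    if not context:
        return 1.0 / vocab_size              # uniform floor for unseen words
    # unseen n-gram: back off to the shorter context
    return alpha * backoff_prob(counts, context[1:], word, vocab_size,
                                discount, alpha)
</pre>

In practice the discounts and back-off weights are estimated from the data (for example with Good-Turing or Kneser-Ney smoothing) so that the resulting distributions remain properly normalised.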

Theory