continuous space language models

From statwiki
Revision as of 15:04, 18 November 2015 by Rtwang (talk | contribs) (Created page with "= Introduction = In certain fields of study such as speech recognition or machine translation, for some acoustic signal <math>\,x</math> or the source sentence to be translated <...")
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search
The printable version is no longer supported and may have rendering errors. Please update your browser bookmarks and please use the default browser print function instead.

Introduction

In certain fields of study such as speech recognition or machine translation, for some acoustic signal [math]\displaystyle{ \,x }[/math] or the source sentence to be translated [math]\displaystyle{ \,e }[/math], it is common to model these problems as finding the sequence of words [math]\displaystyle{ \,w^* }[/math] that has the highest probability of occurring given [math]\displaystyle{ \,x }[/math] or [math]\displaystyle{ \,e }[/math]. This can be written as:

[math]\displaystyle{ w^* = arg\ \underset {w}{max} P(w|x) = arg\ \underset{w}{max} P(x|w)P(w) }[/math]

An acoustic or translation model can then be used for [math]\displaystyle{ \,P(x|w) }[/math], similar to the idea behind LDA and QDA, and it remains to create a language model [math]\displaystyle{ \,P(w) }[/math] to estimate the probability of any sequence of words [math]\displaystyle{ \,w }[/math].

This is commonly done through the back-off n-grams model and the purpose behind this research paper is to use a neural network to better estimate [math]\displaystyle{ \,P(w) }[/math].

Back-off n-grams Model

A sequence of words will be defined as [math]\displaystyle{ \,w^i_1=(w_1,w_2,\dots,w_i) }[/math] and the formula for the probability [math]\displaystyle{ \,P(w) }[/math] can be rewritten as:

[math]\displaystyle{ P(w^n_1)=P(w_1,w_2,\dots,w_n)=P(w_1)\prod_{i=2}^n P(w_i|w^{i-1}_1) }[/math]

It is common to estimate [math]\displaystyle{ \,P(w_i|w^{i-1}_1) }[/math] through:

[math]\displaystyle{ \,P(w_i|w^{i-1}_1)\approx\frac{number\ of\ occurrence\ of\ the\ sequence\ (w_1,\dots,w_i)}{number\ of\ occurrence\ of\ the\ sequence\ (w_1,\dots,w_{i-1})} }[/math]

However, it is practically impossible to have a training set large enough to contain every possible sequence of words if the sequence is long enough and some sequences would have an incorrect probability of 0 simply because it is not in the training set. This is known as the data sparseness problem. This problem is commonly resolved by considering only the last n-1 words instead of the whole context. However, even for small n, certain sequences could still be missing.

To solve this issue, a technique called back-off n-grams is used and the general formula goes as follows:

[math]\displaystyle{ \,P(w_i|w^{i-1}_1) = \begin{cases} n/2, & \mbox{if }n\mbox{ is even} \\ 3n+1, & \mbox{if }n\mbox{ is odd} \end{cases} }[/math]

Theory