continuous space language models

From statwiki
Revision as of 15:04, 18 November 2015 by Rtwang (talk | contribs) (Created page with "= Introduction = In certain fields of study such as speech recognition or machine translation, for some acoustic signal <math>\,x</math> or the source sentence to be translated <...")
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Introduction

In certain fields of study such as speech recognition or machine translation, for some acoustic signal [math]\displaystyle{ \,x }[/math] or the source sentence to be translated [math]\displaystyle{ \,e }[/math], it is common to model these problems as finding the sequence of words [math]\displaystyle{ \,w^* }[/math] that has the highest probability of occurring given [math]\displaystyle{ \,x }[/math] or [math]\displaystyle{ \,e }[/math]. This can be written as:

[math]\displaystyle{ w^* = arg\ \underset {w}{max} P(w|x) = arg\ \underset{w}{max} P(x|w)P(w) }[/math]

An acoustic or translation model can then be used for [math]\displaystyle{ \,P(x|w) }[/math], similar to the idea behind LDA and QDA, and it remains to create a language model [math]\displaystyle{ \,P(w) }[/math] to estimate the probability of any sequence of words [math]\displaystyle{ \,w }[/math].

This is commonly done through the back-off n-grams model and the purpose behind this research paper is to use a neural network to better estimate [math]\displaystyle{ \,P(w) }[/math].

Back-off n-grams Model

A sequence of words will be defined as [math]\displaystyle{ \,w^i_1=(w_1,w_2,\dots,w_i) }[/math] and the formula for the probability [math]\displaystyle{ \,P(w) }[/math] can be rewritten as:

[math]\displaystyle{ P(w^n_1)=P(w_1,w_2,\dots,w_n)=P(w_1)\prod_{i=2}^n P(w_i|w^{i-1}_1) }[/math]

It is common to estimate [math]\displaystyle{ \,P(w_i|w^{i-1}_1) }[/math] through:

[math]\displaystyle{ \,P(w_i|w^{i-1}_1)\approx\frac{number\ of\ occurrence\ of\ the\ sequence\ (w_1,\dots,w_i)}{number\ of\ occurrence\ of\ the\ sequence\ (w_1,\dots,w_{i-1})} }[/math]

However, it is practically impossible to have a training set large enough to contain every possible sequence of words if the sequence is long enough and some sequences would have an incorrect probability of 0 simply because it is not in the training set. This is known as the data sparseness problem. This problem is commonly resolved by considering only the last n-1 words instead of the whole context. However, even for small n, certain sequences could still be missing.

To solve this issue, a technique called back-off n-grams is used and the general formula goes as follows:

[math]\displaystyle{ \,P(w_i|w^{i-1}_1) = \begin{cases} n/2, & \mbox{if }n\mbox{ is even} \\ 3n+1, & \mbox{if }n\mbox{ is odd} \end{cases} }[/math]

Theory