stat946f15/Sequence to sequence learning with neural networks
Revision as of 17:29, 16 October 2015
Introduction
The emergence of the Internet and other modern technology has greatly increased people's ability to communicate across vast distances and barriers. However, the fundamental barrier of language remains, and as anyone who has attempted to learn a new language can attest, it takes a tremendous amount of work to learn more than one language past childhood. The ability to translate between languages quickly and efficiently would therefore be of great importance. This is an extremely difficult problem, however, because languages differ in grammar and context always plays an important role. For example, the word "back" means entirely different things in the following two sentences:
I am in the back of the car.
My back hurts.
Deep neural networks have proven very capable at solving other difficult problems, such as reproducing sound waves from videos (need source), and a sufficiently complex neural network might provide an excellent solution in this case as well. The purpose of the paper is to apply multi-layer long short-term memory neural networks to this machine translation problem and to assess the translation accuracy of this approach.
Model
Long Short-Term Memory Recurrent Neural Network (LSTM)
Recurrent neural networks are a variation of deep neural networks that are capable of storing information about previous hidden states in special memory layers. Unlike feed-forward neural networks, which take a single fixed-length vector as input and produce a fixed-length vector as output, recurrent neural networks can take a sequence of fixed-length vectors as input, because the memory layer stores information and maintains a connection between successive inputs. In a feed-forward neural network, previous inputs have no impact on the current output, whereas in a recurrent neural network they can.
This form of input fits naturally with language translation, since sentences are sequences of words, and it avoids many of the problems of representing variable-length sentences as fixed-length vectors. However, training recurrent neural networks to learn long time-lag dependencies, where inputs many time steps back can heavily influence the current output, is difficult and generally results in exploding or vanishing gradients. The paper therefore uses a variation of recurrent neural networks, the long short-term memory (LSTM) neural network, which does not suffer as much from the vanishing gradient problem.
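As a minimal sketch of how a recurrent network carries information between inputs, the Python snippet below runs a plain scalar recurrent unit (not an LSTM) over two sequences that end in the same input. The function names and the weights `w_x`, `w_h` and `bias` are illustrative assumptions and do not come from the paper.

```python
import math

def rnn_step(x, h, w_x, w_h, bias):
    """One step of a plain recurrent unit: the new state mixes input and old state."""
    return math.tanh(w_x * x + w_h * h + bias)

def run_rnn(xs, w_x=0.5, w_h=0.8, bias=0.0):
    """Feed a sequence of scalars through the unit, returning the state after each step."""
    h = 0.0  # the memory starts empty
    states = []
    for x in xs:
        h = rnn_step(x, h, w_x, w_h, bias)
        states.append(h)
    return states

# Both sequences end with the same input (1.0) but have different histories.
with_history = run_rnn([1.0, 0.0, 1.0])
without_history = run_rnn([0.0, 0.0, 1.0])

# The final outputs differ because the earlier input is still held in the state.
print(with_history[-1], without_history[-1])
```

A feed-forward network given only the last input would produce the same output for both sequences; here the final states differ because the first input still influences the stored state.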
The purpose of the LSTM in this case is to estimate the conditional probability of the output sequence, [math]\displaystyle{ \,(y_1,\cdots,y_{T'}) }[/math], given the input sequence, [math]\displaystyle{ \,(x_1,\cdots,x_{T}) }[/math], where [math]\displaystyle{ \,T }[/math] does not have to equal [math]\displaystyle{ \,T' }[/math].
Let [math]\displaystyle{ \,v }[/math] represent the state of the hidden layers after [math]\displaystyle{ \,(x_1,\cdots,x_{T}) }[/math] has been fed into the LSTM, i.e. what has been stored in the neural network's memory. Then
[math]\displaystyle{ \,p(y_1,\cdots,y_{T'}|x_1,\cdots,x_{T})=\prod_{t=1}^{T'} p(y_t|v,y_1,\cdots,y_{t-1}) }[/math]
For each [math]\displaystyle{ \,p(y_t|v,y_1,\cdots,y_{t-1}) }[/math], the LSTM neural network at step [math]\displaystyle{ \,t }[/math] outputs a vector of unnormalized scores, one for each word in the vocabulary, and the softmax function is applied to this vector to produce the corresponding probabilities.
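The factorization above can be illustrated with a short Python sketch. It assumes the per-step score vectors are already available; in the actual model they would be produced by the LSTM from [math]\displaystyle{ \,v }[/math] and the previous outputs. The function names and the toy three-word vocabulary are hypothetical.

```python
import math

def softmax(scores):
    """Turn a vector of unnormalized scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_log_prob(step_scores, target_ids):
    """Sum over t of log p(y_t | v, y_1..y_{t-1}), i.e. the log of the product."""
    log_p = 0.0
    for scores, y in zip(step_scores, target_ids):
        probs = softmax(scores)
        log_p += math.log(probs[y])
    return log_p

# Hypothetical scores for a 3-word vocabulary over T' = 3 output steps.
step_scores = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.0, 0.0, 3.0]]
target_ids = [0, 1, 2]
log_p = sequence_log_prob(step_scores, target_ids)
print(log_p, math.exp(log_p))
```

Working with the sum of log-probabilities rather than the raw product avoids numerical underflow for long output sequences.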
Input and Output Data Transformation
About 12 million English-French sentence pairs were used for training, with a vocabulary of 160,000 words for English and 80,000 words for French. Any word outside these vocabularies was replaced with a special unknown-word token. An <EOS> token was appended to every sentence to indicate the end of the sentence.
Additionally, input sentences were fed in reverse order, as the researchers found that this significantly increased accuracy. For example, the sentence "Today I went to lectures." would be entered in the order "lectures, to, went, I, Today". They suspect this is due to the reduced time lag between the beginning of each input sentence and the beginning of its translation, which thus allows the neural network to
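The transformation described in this section can be sketched in Python as follows. This is only an illustration: the whitespace tokenization, the token string `<UNK>`, the function names, and the choice to place `<EOS>` after the reversed words are all assumptions, not details taken from the text above.

```python
def prepare_source(sentence, vocab, unk="<UNK>", eos="<EOS>"):
    """Tokenize, replace out-of-vocabulary words, reverse the order, append <EOS>."""
    tokens = sentence.rstrip(".").split()  # naive whitespace tokenization (assumption)
    tokens = [t if t in vocab else unk for t in tokens]
    return tokens[::-1] + [eos]

def prepare_target(sentence, vocab, unk="<UNK>", eos="<EOS>"):
    """Same as the source, except that the word order is left unchanged."""
    tokens = sentence.rstrip(".").split()
    tokens = [t if t in vocab else unk for t in tokens]
    return tokens + [eos]

english_vocab = {"Today", "I", "went", "to", "lectures"}
print(prepare_source("Today I went to lectures.", english_vocab))
# ['lectures', 'to', 'went', 'I', 'Today', '<EOS>']
```

Only the input sentence is reversed; the output sentence keeps its natural order.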
Training and Results
Training Method
Two LSTM neural networks were used overall: one to generate a fixed vector representation from the input sequence, and another to generate the output sequence from that fixed vector representation. Each neural network had 4 layers and 1000 cells per layer. Thus, [math]\displaystyle{ \,v }[/math] can be represented by 8000 real numbers after the input sequence has been entered, which is consistent with two stored values (the memory cell state and the hidden output) for each of the 4 × 1000 cells. Stochastic gradient descent with a batch size of 128 and a learning rate of 0.7 was used.
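The arithmetic behind the size of [math]\displaystyle{ \,v }[/math] and the form of a mini-batch gradient step can be sketched in Python as below. The helper names are hypothetical, the factor of two values per cell is an assumption about how the 8000 figure arises, and averaging the gradients over the batch is one common convention rather than a detail stated above.

```python
def lstm_state_size(num_layers, cells_per_layer, values_per_cell=2):
    """Number of real values in the state, assuming two stored values per cell."""
    return num_layers * cells_per_layer * values_per_cell

def sgd_step(params, batch_grads, learning_rate=0.7):
    """Average the per-example gradients over the mini-batch, then take one step."""
    n = len(batch_grads)
    avg = [sum(g[i] for g in batch_grads) / n for i in range(len(params))]
    return [p - learning_rate * a for p, a in zip(params, avg)]

print(lstm_state_size(4, 1000))  # 8000

# Toy example: 2 parameters and a mini-batch of 2 gradients (the paper used 128).
new_params = sgd_step([1.0, 2.0], [[1.0, 0.0], [3.0, 2.0]])
print(new_params)
```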