# Introduction

The emergence of the Internet and other modern technologies has greatly increased people's ability to communicate across vast distances and barriers. However, the fundamental barrier of language remains, and as anyone who has attempted to learn a new language can attest, it takes a tremendous amount of work to learn more than one language past childhood. The ability to translate between languages quickly and efficiently would therefore be of great importance. This is an extremely difficult problem, however, because languages vary in grammar and context always plays an important role. For example, the word "back" means entirely different things in the following two sentences:

I am in the back of the car.

My back hurts.

Applying Deep Neural Networks (DNNs) to this problem is difficult because DNNs can only be applied to problems where the input and output vectors have fixed dimensions. This is suitable for applications such as image processing, where the dimensions are known a priori; in applications such as speech recognition, however, the sequence length is not known in advance. Thus, the goal of this paper is to introduce a domain-independent method that learns to map input sequences to output sequences. Sutskever et al. approach this problem with a multi-layer Long Short-Term Memory (LSTM) architecture, which they use to estimate the conditional probability of an output sequence given an input sequence. Specifically, they use one LSTM to encode the input sequence into a large fixed-dimensional vector representation and another to extract the output sequence from that vector.

The main result of this work is that on the WMT'14 English to French translation task, their model obtained a BLEU score of 34.81 by extracting translations from an ensemble of 5 LSTMs. According to the authors, this was by far the best result achieved by direct translation with an artificial neural network. Also, the LSTM model did not struggle with long sentences, contrary to the recent experience of researchers using similar architectures. Their model performed well on long sentences because they reversed the order of the words in the source sentences in both the training and test sets. Reversing the sentences is a simple trick, but it is one of the key contributions of their work.

# Model

### Long Short-Term Memory Recurrent Neural Network (LSTM)

Recurrent neural networks are a variation of deep neural networks that can store information about previous hidden states in special memory layers. Unlike feed-forward neural networks, which take a single fixed-length vector as input and produce a single fixed-length vector as output, recurrent neural networks can take a sequence of fixed-length vectors as input, because the memory layer lets them store information and maintain a connection between inputs. In a feed-forward neural network, previous inputs have no impact on the current output, whereas in a recurrent neural network they can.

This form of input fits naturally with language translation, since sentences are sequences of words, and it avoids many of the problems of representing variable-length sentences as fixed-length vectors. However, training recurrent neural networks to learn long time-lag dependencies, where inputs many time steps back can heavily influence the current output, is difficult and generally results in exploding or vanishing gradients. A variation of recurrent neural networks, the long short-term memory (LSTM) neural network, was used in this paper instead, as it does not suffer as much from the vanishing gradient problem.
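To make the role of the memory concrete, here is a minimal sketch of a plain recurrent unit with a single hidden value. It is not an LSTM and not the paper's model; the weights `w_x` and `w_h` are arbitrary illustrative values rather than trained parameters. The point is only that the same final input can give different outputs depending on what came before it.

```python
import math

def rnn_outputs(inputs, w_x=0.5, w_h=0.8):
    """Run a one-unit recurrent network over a sequence of numbers.

    w_x and w_h are arbitrary illustrative weights, not trained values.
    """
    h = 0.0  # hidden state: the network's memory of earlier inputs
    outputs = []
    for x in inputs:
        # The new state mixes the current input with the previous state.
        h = math.tanh(w_x * x + w_h * h)
        outputs.append(h)
    return outputs

# The final input is 1.0 in both sequences, but the histories differ.
out_a = rnn_outputs([0.0, 0.0, 1.0])
out_b = rnn_outputs([1.0, 1.0, 1.0])

# With w_h = 0 the memory is switched off and history no longer matters,
# which is the feed-forward behaviour described above.
out_no_memory = rnn_outputs([1.0, 1.0, 1.0], w_h=0.0)
```

Here `out_a[-1]` and `out_b[-1]` differ even though the last input is identical, while `out_no_memory[-1]` equals `out_a[-1]` because nothing is carried over between steps.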

The purpose of the LSTM in this case is to estimate the conditional probability of the output sequence, $\,(y_1,\cdots,y_{T'})$, given the input sequence, $\,(x_1,\cdots,x_{T})$, where $\,T$ does not have to equal $\,T'$.

Let $\,v$ represent the state of the hidden layers after $\,(x_1,\cdots,x_{T})$ has been fed into the LSTM, i.e. what has been stored in the neural network's memory. Then

$\,p(y_1,\cdots,y_{T'}|x_1,\cdots,x_{T})=\prod_{t=1}^{T'} p(y_t|v,y_1,\cdots,y_{t-1})$

For each $\,p(y_t|v,y_1,\cdots,y_{t-1})$, the LSTM at time step $\,t$, after $\,(x_1,\cdots,x_T,y_1,\cdots,y_{t-1})$ has been fed in, outputs a relative score for each word in the vocabulary. Applying the softmax function, $\,\frac{e^{x_b}}{\sum_{j=1}^N e^{x_j}}\,$, where $\,x_b$ is the score of word $\,b$ and $\,N$ is the vocabulary size, converts this output vector into the corresponding probabilities. From this, we can calculate any $\,p(y_1,\cdots,y_{T'}|x_1,\cdots,x_{T})$ by repeatedly feeding $\,y_t$ back into the LSTM to calculate the next set of probabilities.
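The factorization and the softmax step can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: `step_scores` is a hypothetical stand-in for the raw output vectors an LSTM would produce at each target time step, and the numbers are made up.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_log_prob(step_scores, target_ids):
    """Sum log p(y_t | v, y_1..y_{t-1}) over the target sequence.

    step_scores[t] stands in for the LSTM's raw output at step t; here it is
    simply a list of numbers supplied by the caller (hypothetical values).
    """
    log_p = 0.0
    for scores, y in zip(step_scores, target_ids):
        log_p += math.log(softmax(scores)[y])
    return log_p

# Toy 3-word vocabulary and a 2-step target sequence (word ids 0 then 2).
step_scores = [[2.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
log_p = sequence_log_prob(step_scores, [0, 2])
```

Working with log probabilities turns the product over time steps into a sum, which avoids numerical underflow on long sequences.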

The objective function used during the training process was:

$\,\frac{1}{|T_r|}\sum_{(S,T)\in T_r} \log(p(T|S))\,$

where $\,S$ is the source sentence, $\,T$ is the paired translation and $\,T_r$ is the training set. The objective is to maximize the average log probability of the correct translation $\,T$ given the source sentence $\,S$ over the entire training set. Once training is complete, translations are produced by finding the most likely translation according to the LSTM:

$\hat{T} = \underset{T}{\operatorname{arg\ max}}\ p(T|S)$

### Input and Output Data Transformation

About 12 million English-French sentence pairs were used for training, with a vocabulary of 160,000 words for English and 80,000 for French. Any out-of-vocabulary word was replaced with a special token. An <EOS> token was appended to every sentence to indicate the end of the sentence.

Additionally, input sentences were fed in backwards, as the researchers found this significantly increased accuracy. For example, for the sentence "Today I went to lectures.", the input order would be "lectures, to, went, I, Today". They suspect this is because reversal reduces the time lag between the beginning of the source sentence and the beginning of the target sentence.
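The transformations described in this section can be sketched as follows. This is a minimal illustration, not the authors' preprocessing pipeline: whitespace tokenization, the token name `<UNK>`, and placing `<EOS>` after the reversed source are all assumptions made for the example.

```python
EOS = "<EOS>"
UNK = "<UNK>"  # placeholder name; the text only says "a special token"

def prepare_source(sentence, vocab):
    """Split on whitespace, map unknown words to UNK, reverse, append EOS."""
    tokens = [w if w in vocab else UNK for w in sentence.split()]
    return tokens[::-1] + [EOS]

def prepare_target(sentence, vocab):
    """Same mapping, but the target keeps its original word order."""
    tokens = [w if w in vocab else UNK for w in sentence.split()]
    return tokens + [EOS]

# Tiny illustrative vocabulary.
vocab = {"Today", "I", "went", "to", "lectures"}
src = prepare_source("Today I went to lectures", vocab)
# src == ["lectures", "to", "went", "I", "Today", "<EOS>"]
```

Only the source side is reversed; the target sentence is left in its natural order so that the decoder still produces words left to right.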

To decode a translation after training, a simple left-to-right beam search is used. The process goes as follows: a small number of partial translations with the highest probabilities are kept at each step. Each partial translation is fed back into the LSTM independently and extended with every candidate next word, and only the small set of extensions with the highest probabilities is kept. This repeats until the <EOS> token is chosen for a translation, at which point the completed translation is added to the final translation set. The final set is then ranked and the highest-ranking translation is chosen.
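The decoding procedure can be sketched as below. This is an illustrative implementation under assumptions, not the paper's decoder: `next_log_probs` is a hypothetical callback standing in for the trained LSTM, and `toy_model` is a made-up three-step distribution over placeholder tokens, chosen so that a beam of size 1 (greedy search) and a beam of size 2 give different answers.

```python
import math

EOS = "<EOS>"

def beam_search(next_log_probs, beam_size=2, max_len=10):
    """Left-to-right beam search.

    next_log_probs(prefix) is a hypothetical stand-in for the trained model:
    it returns a dict mapping each candidate next token to its
    log-probability given the tuple of tokens generated so far.
    """
    beams = [((), 0.0)]  # (partial translation, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, log_p in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + log_p))
        # Keep only the highest-scoring extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for hyp, score in candidates[:beam_size]:
            if hyp[-1] == EOS:
                finished.append((hyp, score))  # complete: leaves the beam
            else:
                beams.append((hyp, score))
        if not beams:
            break
    # Rank the completed hypotheses (or the unfinished ones if none completed).
    return max(finished or beams, key=lambda c: c[1])

def toy_model(prefix):
    """Made-up next-token distribution over placeholder tokens."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, EOS: 0.1},
    }
    probs = table.get(prefix, {EOS: 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

greedy, greedy_score = beam_search(toy_model, beam_size=1)
best, best_score = beam_search(toy_model, beam_size=2)
```

With `beam_size=1` the search commits to "a" (probability 0.6) and ends with a sequence of total probability 0.3, while `beam_size=2` keeps "b" alive and finds the sequence "b z" with total probability 0.36.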

# Training and Results

### Training Method

Two LSTM neural networks were used overall: one to generate a fixed vector representation from the input sequence and another to generate the output sequence from that representation. Each neural network had 4 layers with 1000 cells per layer, and $\,v$ can be represented by 8000 real numbers, the contents of the network's memory after the input sequence has been entered. Stochastic gradient descent with a batch size of 128 and a learning rate of 0.7 was used. Initial parameters were drawn from a uniform distribution between -0.08 and 0.08. LSTMs tend not to suffer from the vanishing gradient problem, but they can be affected by exploding gradients; this is handled by enforcing a hard constraint on the norm of the gradient.
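The initialization and the gradient-norm constraint can be sketched as follows. This is a minimal illustration with plain Python lists rather than a real training loop. The learning rate and initialization range come from the text above; the clipping threshold of 5.0 is an assumed illustrative value, since this summary does not state one, and the function names are hypothetical.

```python
import math
import random

def init_uniform(n, low=-0.08, high=0.08, seed=0):
    """Draw n initial parameters uniformly from [low, high]."""
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(n)]

def clip_by_norm(grad, threshold):
    """Rescale grad so that its L2 norm is at most threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        return [g * threshold / norm for g in grad]
    return list(grad)

def sgd_step(params, grad, lr=0.7, threshold=5.0):
    """One SGD update on a loss gradient (e.g. of the negative log-likelihood).

    threshold=5.0 is an assumed value for illustration only.
    """
    clipped = clip_by_norm(grad, threshold)
    return [p - lr * g for p, g in zip(params, clipped)]

params = init_uniform(4)
# A gradient with norm 50 is scaled down to norm 5 before the update.
updated = sgd_step([0.0, 0.0], [30.0, 40.0])
```

Scaling the whole gradient vector keeps its direction unchanged and only limits the step size, which is what makes this a constraint on the norm rather than on individual components.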

### Scoring Method

Scoring was done using the BLEU (Bilingual Evaluation Understudy) metric, an algorithm created for evaluating the quality of machine-translated text. It uses a modified form of precision to compare a produced translation against a set of reference translations. The metric tends to correlate well with human judgement across a corpus, but performs badly when used to evaluate individual sentences. More information can be found in the BLEU paper and the Wikipedia article. Both resources state that the BLEU score is a number between 0 and 1, with values closer to 1 corresponding to a better translation. The LSTM paper reports scores greater than 1 without explaining how the score is scaled; the values are consistent with the common convention of reporting BLEU multiplied by 100, under which a so-called 'perfect' score would be 100.
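The "modified form of precision" can be illustrated with a short sketch. It computes only the clipped n-gram precision for a single candidate; full BLEU additionally combines several n-gram orders and applies a brevity penalty, both of which are omitted here. The function names are hypothetical, and the example sentences are the standard illustration of why clipping is needed.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n=1):
    """Clipped n-gram precision of one candidate against several references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the most times it appears in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # A candidate n-gram is credited at most as often as a reference uses it.
    clipped = sum(min(count, max_ref[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
p1 = modified_precision(candidate, references)  # 2 / 7
```

Ordinary precision would score this degenerate candidate 7/7, since every word appears in a reference; clipping at the maximum reference count of 2 reduces it to 2/7.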

### Results

The resulting LSTM neural networks outperformed a standard Statistical Machine Translation (SMT) system, with a BLEU score of 34.8 against 33.3, and with certain heuristics or modifications came very close to matching the best-performing system. Additionally, the model could recognize sentences in the active and passive voice as being similar, for example:

Active Voice: I ate an apple.

Passive Voice: The apple was eaten by me.

An interesting result is that reversing the source sentences (but not the target sentences) improved long-sentence decoding, which in turn increased the BLEU score from 25.9 to 30.6. While the authors do not have a complete explanation, they theorize that the improvement is due to the introduction of many short-term dependencies into the data set: reversing the source sentences minimizes the time lag between the first words of the source sentence and the first words of the target sentence. The authors believe this reduction in time lag helps the LSTM architecture establish a link between source and target and make better use of the network's memory.

For example, let "I saw the man" be the source sentence and "with the binoculars" be the target sentence. Concatenating the source and target sentences gives "I saw the man with the binoculars". After reversing the source sentence ("man the saw I"), the subject "I" sits directly before the target and is therefore closer to the target context "binoculars" than it would be without reversal.

In summary, the LSTM method has proven quite capable of translating long sentences despite the potentially long delay between input time steps. However, it still falls short of [Edinburgh's specialised statistical model](http://www.statmt.org/OSMOSES/sysdesc.pdf).

# Source

Sutskever, I., Vinyals, O. & Le, Q. V. Sequence to sequence learning with neural networks. In Proc. Advances in Neural Information Processing Systems 27, 3104–3112 (2014).