# Neural Machine Translation: Jointly Learning to Align and Translate


# Introduction

In this paper, Bahdanau et al. (2015) present a new way of using neural networks to perform machine translation. Rather than using the typical RNN encoder-decoder model with a fixed-length intermediate vector, they propose a method that jointly learns alignment and translation, and that does not restrict the intermediate encoded representation to any fixed length. The result is a translation method comparable in performance to phrase-based systems (the state-of-the-art models that do not use a neural network approach); moreover, the proposed method proves more effective than other neural network models when applied to long sentences.

# Previous methods

In order to better appreciate the value of this paper's contribution, it is important to understand how earlier techniques approached the problem of machine translation using neural networks.

In machine translation, the problem at hand is to identify the target sentence $y$ (in natural language $B$) that is the most likely corresponding translation to the source sentence $x$ (in natural language $A$). The authors compactly summarize this problem using the formula $\arg\max_{y} P(y|x)$.

Recent neural network approaches, proposed by researchers such as Kalchbrenner and Blunsom, Cho et al., and Sutskever et al., build a neural machine translation system that directly learns the conditional probability distribution between input $x$ and output $y$. Current experiments show that neural machine translation, or extensions of existing translation systems using RNNs, perform better than state-of-the-art systems.

## Encoding

Typically, the encoding step iterates through the input vectors in the representation of source sentence $x$ and updates a hidden state with each new token in the input: $h_t = f(x_t, h_{t-1})$, for some nonlinear function $f$. After the entire input is read, the resulting fixed-length representation of the entire input sentence $x$ is given by a nonlinear function $q$ of all of the hidden states: $c = q(\{h_1, \ldots, h_{T_x}\})$. Different methods would use different nonlinear functions and different neural networks, but the essence of the approach is common to all.
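The recurrence above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's architecture: the choice of $f$ as a single $\tanh$ layer and of $q$ as "take the last hidden state" are common but hypothetical simplifications, and all weight names are made up for the example.

```python
import numpy as np

def encode(xs, Wx, Wh, b):
    """Run a simple tanh RNN over the input vectors xs.
    The update rule plays the role of f; taking the last state plays
    the role of q (one common, illustrative choice)."""
    h = np.zeros(Wh.shape[0])
    states = []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + b)  # h_t = f(x_t, h_{t-1})
        states.append(h)
    c = states[-1]                        # c = q({h_1, ..., h_T})
    return states, c

# Toy usage with random weights and a 5-token "sentence".
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
xs = [rng.standard_normal(d_in) for _ in range(5)]
Wx = rng.standard_normal((d_h, d_in))
Wh = rng.standard_normal((d_h, d_h))
b = np.zeros(d_h)
states, c = encode(xs, Wx, Wh, b)
```

However $f$ and $q$ are chosen, the key property is the same: the whole variable-length input is compressed into the single fixed-length vector $c$.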

## Decoding

Decoding the fixed-length representation $c$ of $x$ is done by predicting one token of the target sentence $y$ at a time, using the knowledge of all previously predicted words so far. The decoder defines a probability distribution over the possible sentences using a product of conditional probabilities $P(y) = \prod_t P(y_t|\{y_1, \ldots, y_{t-1}\},c)$.

In the neural network approach, the conditional probability of the next output term given the previous ones, $P(y_t | \{y_1, \ldots, y_{t-1}\},c)$, is given by the evaluation of a nonlinear function $g(y_{t-1}, s_t, c)$, where $s_t$ is the hidden state of the decoder RNN.
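One decoder step might look like the following sketch, where $g$ is realized (hypothetically) as a $\tanh$ state update followed by a softmax over the target vocabulary; the weight matrices are stand-in parameters, not the paper's.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def decode_step(y_prev, s_prev, c, params):
    """One decoder step: update the hidden state, then evaluate
    g(y_{t-1}, s_t, c) as a distribution over the vocabulary.
    Ws, Wy, Wc, Wo are hypothetical weight matrices."""
    Ws, Wy, Wc, Wo = params
    s = np.tanh(Ws @ s_prev + Wy @ y_prev + Wc @ c)  # new state s_t
    probs = softmax(Wo @ s)                          # P(y_t | ..., c)
    return probs, s
```

Running this step repeatedly, each time feeding back the chosen word and the new state, yields the product of conditionals $P(y)$ above.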

# The proposed method

The method proposed here is different from the traditional approach because it bypasses the fixed-length context vector $c$ altogether, and instead aligns the tokens of the translated sentence $y$ directly with the corresponding tokens of source sentence $x$ as it decides which parts might be most relevant. To accommodate this, a different neural network structure needs to be set up.

## Encoding

The proposed model does not use an ordinary recurrent neural network to encode the source sentence $x$, but instead uses a bidirectional recurrent neural network (BiRNN): this is a model that consists of both a forward and a backward RNN, where the forward RNN reads the tokens of $x$ in their original order when computing hidden states, and the backward RNN reads them in reverse. Thus each token of $x$ is associated with two hidden states, corresponding to the states it produces in the two RNNs. The annotation vector $h_j$ of the token $x_j$ in $x$ is given by the concatenation of these two hidden state vectors.
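A minimal sketch of this encoding, assuming the per-direction state updates are supplied as functions (the actual updates in the paper are gated RNN units, which are abstracted away here):

```python
import numpy as np

def birnn_annotations(xs, step_f, step_b, d_h):
    """Build annotation vectors h_j by concatenating forward and
    backward RNN states. step_f / step_b are hypothetical stand-ins
    for the two directions' state-update functions."""
    fwd, h = [], np.zeros(d_h)
    for x in xs:                    # forward pass: original order
        h = step_f(x, h)
        fwd.append(h)
    bwd, h = [], np.zeros(d_h)
    for x in reversed(xs):          # backward pass: reversed order
        h = step_b(x, h)
        bwd.append(h)
    bwd.reverse()                   # realign with token positions
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

Because the backward states are reversed again before concatenation, annotation $h_j$ summarizes the sentence both up to and beyond position $j$.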

## Alignment

An alignment model (in the form of a neural network) is used to measure how well each annotation $h_j$ of the input sentence corresponds to the current state of constructing the translated sentence (represented by the vector $s_{i-1}$, the hidden state of the RNN that produces the tokens of the output sentence $y$). This is stored as the energy score $e_{ij} = a(s_{i-1}, h_j)$.

The energy scores from the alignment process are used to assign weights $\alpha_{ij}$ to the annotations, effectively trying to determine which of the words in the input is most likely to correspond to the next word that needs to be translated in the current stage of the output sequence:

$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}$

The weights are then applied to the annotations to obtain the current context vector input:

$c_i = \sum_j \alpha_{ij}h_j$
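The whole attention computation fits in a few lines. In the paper, the alignment model $a$ takes the additive form $a(s_{i-1}, h_j) = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$; the sketch below uses that form with illustrative weight names:

```python
import numpy as np

def attention_context(s_prev, annotations, W, U, v):
    """Compute alignment energies e_ij, softmax weights alpha_ij,
    and the resulting context vector c_i for one decoding step."""
    # Additive alignment model: a(s_{i-1}, h_j) = v^T tanh(W s + U h)
    e = np.array([v @ np.tanh(W @ s_prev + U @ h) for h in annotations])
    alpha = np.exp(e - e.max())     # stable softmax over positions j
    alpha /= alpha.sum()            # alpha_ij, sums to 1
    c = sum(a * h for a, h in zip(alpha, annotations))  # c_i
    return alpha, c
```

Note that the weights $\alpha_{ij}$ form a probability distribution over source positions, so $c_i$ is an expected annotation: a soft selection of the source words relevant to output position $i$.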

Note that this is where we see one major difference between the proposed method and the previous ones: the context vector, or the representation of the input sentence, is not one fixed-length static vector $c$; rather, every time we translate a new word in the sentence, a new representation vector $c_i$ is produced. This vector depends on the source words most relevant to the current state of the translation (hence it is automatically aligning), and it allows the input sentence to have a variable-length representation, since the number of annotations available to each context vector $c_i$ grows with the length of the source sentence.

## Decoding

The decoding is done by using an RNN to model the conditional probabilities

$P(y_i | y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)$

where $s_i$ is the decoder RNN hidden state for time step $i$, and $c_i$ is the current context vector representation as discussed above under Alignment.

Once the encoding and alignment are done, the decoding step is fairly straightforward and corresponds with the typical approach of neural network translation systems, although the context vector representation is now different at each step of the translation.

## Experiment Settings

The ACL WMT '14 English-to-French translation dataset was used to assess the performance of Bahdanau et al. (2015)'s <ref>Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).</ref> RNNsearch and the RNN Encoder-Decoder proposed by Cho et al. (2014) <ref>Cho, K., van Merrienboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio, Y. (2014a). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Empiricial Methods in Natural Language Processing (EMNLP 2014).</ref>.

The WMT '14 dataset comprises the following corpora, totaling 850M words:

• Europarl (61M words)
• News Commentary (5.5M words)
• UN (421M words)
• Crawled corpora (90M and 272.5M words)

This was reduced to 348M words using the data selection method described by Axelrod et al. (2011)<ref>Axelrod, A., He, X., and Gao, J. (2011). Domain adaptation via pseudo in-domain data selection. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 355–362. Association for Computational Linguistics.</ref>.

Both models were trained in the same manner, using minibatch stochastic gradient descent (SGD) with minibatch size 80 and AdaDelta. Once a model finished training, beam search was used to decode the computed probability distribution into a translation output.
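The decoding idea can be illustrated with a minimal beam search sketch. The `step` interface below — a function returning next-token probabilities given a prefix — is a hypothetical stand-in for the trained decoder, and the scoring (summed log-probability, no length normalization) is a simplification:

```python
import math

def beam_search(step, start_token, end_token, beam_width=3, max_len=10):
    """Keep the beam_width highest log-probability prefixes at each
    step; finished hypotheses (ending in end_token) are set aside."""
    beams = [([start_token], 0.0)]  # (prefix, log-probability)
    done = []
    for _ in range(max_len):
        candidates = []
        for prefix, lp in beams:
            for tok, p in step(prefix).items():
                candidates.append((prefix + [tok], lp + math.log(p)))
        candidates.sort(key=lambda cand: cand[1], reverse=True)
        beams = []
        for prefix, lp in candidates[:beam_width]:
            (done if prefix[-1] == end_token else beams).append((prefix, lp))
        if not beams:               # every surviving hypothesis finished
            break
    pool = done or beams
    return max(pool, key=lambda cand: cand[1])[0]
```

With `beam_width=1` this reduces to greedy decoding; wider beams trade computation for a better approximation of the most probable translation.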

# Results

The authors performed some experiments using the proposed model of machine translation, calling it "RNNsearch", in comparison with the previous kind of model, referred to as "RNNencdec". Both models were trained on the same datasets for translating English to French, with one dataset containing sentences of length up to 30 words, and the other containing sentences with at most 50.

Quantitatively, the RNNsearch scores exceed RNNencdec by a clear margin. The distinction is particularly strong in longer sentences, which the authors note to be a problem area for RNNencdec -- information gets lost when trying to "squash" long sentences into fixed-length vector representations.

The following graph, provided in the paper, shows the performance of RNNsearch compared with RNNencdec, based on the BLEU scores for evaluating machine translation.

Qualitatively, the RNNsearch method does a good job of aligning words in the translation process, even when they need to be rearranged in the translated sentence. Long sentences are also handled very well: while RNNencdec is shown to typically lose meaning and effectiveness after a certain number of words into the sentence, RNNsearch seems robust and reliable even for unusually long sentences.