stat946w18/Unsupervised Machine Translation Using Monolingual Corpora Only

== Introduction ==
Neural machine translation systems must be trained on large corpora consisting of pairs of pre-translated sentences. This paper proposes an unsupervised neural machine translation system, which can be trained without using any such parallel data.
== Overview of Standard Neural Machine Translation ==
Let <math display="inline">S</math> denote the set of words in the source language, and let <math display="inline">T</math> denote the set of words in the target language. Let <math display="inline">\mathcal{Z}_S \subset \mathbb{R}^{n_S}</math> and <math display="inline">\mathcal{Z}_T \subset \mathbb{R}^{n_T}</math> denote the word vectors corresponding to the words of the source and target language respectively. Let
\begin{align}
S'&:=\bigcup_{M=1}^\infty S^M\\
T'&:=\bigcup_{K=1}^\infty T^K\\
\mathcal{Z}_S'&:=\bigcup_{M=1}^\infty \mathcal{Z}_S^M\\
\mathcal{Z}_T'&:=\bigcup_{K=1}^\infty \mathcal{Z}_T^K.
\end{align}
In other words, <math display="inline">S'</math> and <math display="inline">T'</math> are the sets of finite sequences of words from the source and target language, and <math display="inline">\mathcal{Z}_S'</math> and <math display="inline">\mathcal{Z}_T'</math> are the sets of finite sequences of corresponding word vectors. For any set <math display="inline">Q</math>, we let <math display="inline">\mathbb{P}(Q)</math> denote the set of probability distributions over <math display="inline">Q</math> (eliding measure-theoretic details).

In a standard neural machine translation system (Sutskever et al., 2014), the encoder <math display="inline">E:S' \to \mathbb{R}^{n_H}</math> maps a sequence of source words to a code vector, while the decoder <math display="inline">D: \mathbb{R}^{n_H} \to \mathbb{P}(T')</math> maps a code vector to a probability distribution over sequences of target-language words. Typically <math display="inline">E</math> is a mono- or bi-directional LSTM; the code vector is the final hidden state <math display="inline">h_M</math> (or the concatenation of the two final hidden states); and the decoder is a mono-directional LSTM or GRU that at every step <math display="inline">k</math> accepts as input a hidden state <math display="inline">l_k</math> and the previous word in the sequence <math display="inline">y_{k-1}</math>, and outputs a member of <math display="inline">\mathbb{P}(T)</math>, a probability distribution over the possible next words in the sequence.

The probability that the decoder assigns to a given target sequence <math display="inline">(y_1, \ldots ,y_K) \in T'</math> given a source sequence <math display="inline">(x_1,\ldots, x_M)</math> is computed by initializing the decoder with the hidden state <math display="inline">E(x_1,\ldots,x_M)=c\in \mathbb{R}^{n_H}</math> and a special start-of-string token. At every subsequent step, the RNN is fed the hidden state output at the previous step as well as the previous word of the sentence whose probability is being computed. The probability of the sequence is then the product of the probabilities of its constituent words, where every sequence must terminate with a special end-of-string token.
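
To make this concrete, here is a minimal PyTorch sketch of such an encoder-decoder pair, computing <math display="inline">\log D(E(x))(y)</math> for a single sentence pair. The class names, dimensions, and token conventions are illustrative assumptions, not the architecture of any particular published system.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a sequence of source word vectors to a single code vector c = h_M."""
    def __init__(self, n_S, n_H):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_S, hidden_size=n_H, batch_first=True)

    def forward(self, src_vectors):              # src_vectors: (1, M, n_S)
        _, (h_M, _) = self.lstm(src_vectors)     # final hidden state of the LSTM
        return h_M.squeeze(0)                    # shape (1, n_H)

class Decoder(nn.Module):
    """Assigns a probability to a target sentence given the code vector."""
    def __init__(self, vocab_size, n_E, n_H):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, n_E)
        self.cell = nn.LSTMCell(n_E, n_H)
        self.out = nn.Linear(n_H, vocab_size)

    def sequence_log_prob(self, c, target_ids):
        """log D(c)(y): sum of log-probabilities of the target words, starting from
        the code vector c and a start-of-string token (assumed here to have id 0).
        target_ids has shape (1, K) and is assumed to end with an end-of-string id."""
        l, cell_state = c, torch.zeros_like(c)   # initialize the hidden state with c
        prev = torch.zeros(1, dtype=torch.long)  # <sos> token
        log_prob = torch.zeros(())
        for k in range(target_ids.size(1)):
            l, cell_state = self.cell(self.embed(prev), (l, cell_state))
            log_p = torch.log_softmax(self.out(l), dim=-1)   # distribution over next words
            log_prob = log_prob + log_p[0, target_ids[0, k]]
            prev = target_ids[:, k]              # teacher forcing with the true word
        return log_prob
</syntaxhighlight>
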
To train the model, we require a large corpus of parallel sentences <math display="inline">\{x^r \}_{r=1}^R \subset S'</math> and <math display="inline">\{y^r \}_{r=1}^R \subset T'</math>, where <math display="inline">x^r=(x^r_1,\ldots, x^r_{M(r)})</math> and <math display="inline">y^r=(y^r_1,\ldots, y^r_{K(r)})</math>.  We apply stochastic gradient descent, using as our loss function the negative log probability of each target-language sentence given the matching source-language sentence:
\begin{align}
\mathcal{L}( \{ x^r\}_{r=1}^R , \{ y^r\}_{r=1}^R  )= \sum_{r=1}^R - \log D(E(x^r))(y^r)
\end{align}
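
A corresponding training step, again as a hypothetical sketch reusing the Encoder and Decoder classes above, sums the negative log-probabilities over a batch of parallel sentence pairs and applies a gradient step; the sizes and learning rate are placeholders.

<syntaxhighlight lang="python">
# Hypothetical supervised training step for the loss above (sizes are placeholders).
encoder = Encoder(n_S=300, n_H=512)
decoder = Decoder(vocab_size=20000, n_E=300, n_H=512)
optimizer = torch.optim.SGD(list(encoder.parameters()) + list(decoder.parameters()), lr=0.1)

def supervised_step(parallel_batch):
    """parallel_batch: list of (source_word_vectors, target_word_ids) pairs."""
    loss = torch.zeros(())
    for src_vectors, tgt_ids in parallel_batch:
        c = encoder(src_vectors)                              # E(x^r)
        loss = loss - decoder.sequence_log_prob(c, tgt_ids)   # -log D(E(x^r))(y^r)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
</syntaxhighlight>
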
== Attention ==
The above-described translation model performs poorly on long sentences.  This is thought to be a consequence of the difficulty of compressing an arbitrarily long sequence into a fixed-length code vector <math display="inline">c</math>, and of the dilution of the influence of the code vector <math display="inline">c</math> as the decoder RNN gets further from it.  To resolve these problems, Bahdanau et al. (2014) introduced attention. I summarize their approach below.
In a neural machine translation system with attention, the fixed-length code vector <math display="inline">c</math> is replaced by <math display="inline"> (h_1,\ldots,h_M) \in \mathbb{R}^{M\times n_{H_s}} </math>, the concatenation of the hidden states produced by the encoder RNN. At time step <math display="inline">k</math>, in addition to accepting the previous hidden state <math display="inline">l_{k-1}</math> and previous word <math display="inline">y_{k-1}</math> as input, the decoder also accepts a context vector <math display="inline"> u_k \in \mathbb{R}^{n_{H_s}} </math>, where
\begin{align}
u_k= \sum_{m=1}^M \alpha_{k,m} h_m
\end{align}
is a weighted sum of the hidden states produced by the encoder on the input sentence, with
\begin{align}
\alpha_{k,m}&= \frac{\exp(e_{k,m})}{\sum_{m'=1}^M\exp(e_{k,m'})  },\\
e_{k,m} &= v^T \tanh (Wl_{k-1} + U h_m  ),
\end{align}
where <math display="inline">W,U</math> are learnable weight matrices and <math display="inline">v</math> is a learnable weight vector.  The idea is that the model can learn which hidden states of the source sentence are most relevant when producing each word of the target sentence. Empirically, the addition of attention significantly improves the quality of neural machine translation.
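
The following PyTorch sketch implements this additive attention computation for a single decoding step; the dimensions and class name are illustrative.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Context vector u_k = sum_m alpha_{k,m} h_m, with scores
    e_{k,m} = v^T tanh(W l_{k-1} + U h_m) as in the equations above."""
    def __init__(self, n_dec, n_enc, n_att):
        super().__init__()
        self.W = nn.Linear(n_dec, n_att, bias=False)
        self.U = nn.Linear(n_enc, n_att, bias=False)
        self.v = nn.Linear(n_att, 1, bias=False)

    def forward(self, l_prev, H):
        # l_prev: (1, n_dec), the previous decoder hidden state l_{k-1}
        # H:      (M, n_enc), the encoder hidden states h_1, ..., h_M
        e = self.v(torch.tanh(self.W(l_prev) + self.U(H)))   # (M, 1), broadcast over m
        alpha = torch.softmax(e, dim=0)                      # attention weights alpha_{k,m}
        u_k = (alpha * H).sum(dim=0, keepdim=True)           # (1, n_enc) context vector
        return u_k, alpha
</syntaxhighlight>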


== Overview of unsupervised translation system ==

The unsupervised translation system has the following plan:

* Sentences from both the source and target language are mapped to a common latent vector space.
* A de-noising auto-encoder loss encourages the latent space representations of sentences to be insensitive to noise.
* An adversarial loss encourages the latent space representations of source and target sentences to be indistinguishable from each other. The idea is that the latent space representations should reflect the meaning of a sentence, and not the particular language in which it is expressed.
* A reconstruction loss is computed as follows: sample a sentence from one of the languages, and apply the translation model of the previous epoch to translate it to the other language. Then corrupt this translation with noise. The reconstruction loss encourages the model to be able to recover the original sampled sentence from its corrupted translation by passing through the latent vector space.

In what follows I will discuss this plan in more detail.
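
As a schematic illustration of this plan, here is a sketch of a single training iteration. All components (the shared encoder and decoder, the discriminator, the noise function, the loss functions, and the frozen model from the previous epoch) are passed in as placeholders; this is a rough outline of the training logic, not the authors' implementation.

<syntaxhighlight lang="python">
def unsupervised_step(x_src, y_tgt, encoder, decoder, discriminator,
                      noise, rec_loss, adv_loss, prev_model):
    """One schematic training iteration.  `noise` corrupts a sentence, `rec_loss`
    compares a decoded sentence with a reference, `adv_loss` rewards latent codes
    that the discriminator cannot attribute to a language, and `prev_model` is the
    (frozen) translation model from the previous epoch."""
    # 1. Map noisy versions of sentences from both languages into the shared latent space.
    z_src = encoder(noise(x_src), lang="src")
    z_tgt = encoder(noise(y_tgt), lang="tgt")

    # 2. De-noising auto-encoder loss: reconstruct each sentence from its noisy encoding.
    l_auto = (rec_loss(decoder(z_src, lang="src"), x_src)
              + rec_loss(decoder(z_tgt, lang="tgt"), y_tgt))

    # 3. Adversarial loss: make source and target latent codes indistinguishable.
    l_adv = adv_loss(discriminator, z_src, z_tgt)

    # 4. Reconstruction (back-translation) loss: translate with last epoch's model,
    #    corrupt the translation, and require recovery of the original sentence.
    x_trans = prev_model.translate(x_src, to_lang="tgt")
    z_back = encoder(noise(x_trans), lang="tgt")
    l_back = rec_loss(decoder(z_back, lang="src"), x_src)

    # In practice the terms are combined with weighting hyperparameters.
    return l_auto + l_adv + l_back
</syntaxhighlight>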

== Word vector alignment ==

Conneau et al. (2017) describe an unsupervised method for aligning word vectors across languages. By "alignment", I mean that their method maps vectors corresponding to words with similar meanings close to one another, regardless of the language of the words. Moreover, if word C is the target-language literal translation of the source-language word B, then after alignment C's word vector tends to be the closest target-language word vector to the word vector of B. This unsupervised alignment method is crucial to the translation scheme of the current paper. From now on we denote by <math display="inline">A: S' \cup T' \to \mathcal{Z}'</math> the function that maps source- and target-language word sequences to their aligned word vectors.
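
Once the word vectors are aligned, single-word translation amounts to a nearest-neighbour lookup in the common space, roughly as in the sketch below. The dictionaries of pre-aligned vectors are assumed inputs, and Conneau et al. actually use a refined similarity measure, but plain cosine similarity conveys the idea.

<syntaxhighlight lang="python">
import numpy as np

def nearest_target_word(source_word, src_vecs, tgt_vecs):
    """Return the target-language word whose aligned vector is closest (by cosine
    similarity) to the aligned vector of `source_word`.  `src_vecs` and `tgt_vecs`
    are dicts mapping words to vectors assumed to already live in the common space."""
    z = src_vecs[source_word]
    z = z / np.linalg.norm(z)
    best_word, best_sim = None, -np.inf
    for word, vec in tgt_vecs.items():
        sim = float(np.dot(z, vec / np.linalg.norm(vec)))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word
</syntaxhighlight>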

== Encoder ==

The encoder <math display="inline">E</math> reads a sequence of word vectors <math display="inline">(z_1,\ldots, z_m)</math> and outputs a sequence of hidden states <math display="inline">(h_1,\ldots, h_m)</math> in the latent space. Crucially, because the word vectors of the two languages have been aligned, the same encoder can be applied to both. That is, to map a source sentence <math display="inline">x=(x_1,\ldots, x_M)</math> to the latent space, we compute <math display="inline">E(A(x))</math>, and to map a target sentence <math display="inline">y=(y_1,\ldots, y_K)</math> to the latent space, we compute <math display="inline">E(A(y))</math>.

The encoder consists of two LSTMs, one of which reads the word-vector sequence in the forward direction, and one of which reads it in the backward direction. The hidden state sequence is generated by concatenating the hidden states produced by the forward and backward LSTM at each word vector.
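
A minimal PyTorch sketch of such a shared bidirectional encoder is given below; sizes are illustrative and the input is assumed to already consist of aligned word vectors <math display="inline">A(x)</math> or <math display="inline">A(y)</math>.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Bidirectional LSTM encoder applied to aligned word vectors, so the same
    parameters serve both languages.  Sizes are illustrative."""
    def __init__(self, n_word, n_hidden):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_word, hidden_size=n_hidden,
                            batch_first=True, bidirectional=True)

    def forward(self, aligned_vectors):
        # aligned_vectors: (1, m, n_word), the output of A(.) for either language.
        # Returns (1, m, 2 * n_hidden): the forward and backward hidden states
        # concatenated at each position, i.e. the latent sequence (h_1, ..., h_m).
        latent, _ = self.lstm(aligned_vectors)
        return latent
</syntaxhighlight>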

== Decoder ==

The decoder is a mono-directional LSTM that reads a sequence of hidden states <math display="inline">(h_1,\ldots, h_m)</math> from the latent space and produces a sentence in a specified output language, emitting one word at a time until it generates an end-of-string token. As in the attention mechanism described above, at each step the decoder conditions on its previous hidden state, the previously emitted word, and a context vector computed from the latent sequence.
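
The following sketch shows what such an attention-based decoding loop might look like, using greedy decoding, a simplified score function in place of the one from the attention section, and invented token ids; it is an illustration rather than the authors' decoder.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class SharedDecoder(nn.Module):
    """Attention-based LSTM decoder: reads the latent sequence (h_1, ..., h_m) and
    emits one word at a time until an end-of-string token.  A sketch only; the
    vocabulary, token ids and sizes are assumptions."""
    def __init__(self, vocab_size, n_emb, n_latent, n_hidden, sos_id=0, eos_id=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, n_emb)
        self.cell = nn.LSTMCell(n_emb + n_latent, n_hidden)
        self.attend = nn.Linear(n_hidden + n_latent, 1)   # simplified score function
        self.out = nn.Linear(n_hidden, vocab_size)
        self.sos_id, self.eos_id = sos_id, eos_id

    def greedy_decode(self, latent, max_len=50):
        # latent: (m, n_latent) hidden states from the shared encoder.
        m = latent.size(0)
        l = torch.zeros(1, self.cell.hidden_size)
        c = torch.zeros_like(l)
        prev = torch.tensor([self.sos_id])
        words = []
        for _ in range(max_len):
            scores = self.attend(torch.cat([l.expand(m, -1), latent], dim=1))  # (m, 1)
            alpha = torch.softmax(scores, dim=0)
            u = (alpha * latent).sum(dim=0, keepdim=True)                      # context vector
            l, c = self.cell(torch.cat([self.embed(prev), u], dim=1), (l, c))
            next_id = self.out(l).argmax(dim=-1)
            if next_id.item() == self.eos_id:
                break
            words.append(next_id.item())
            prev = next_id
        return words
</syntaxhighlight>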


== Overview of objective ==

The objective function is the sum of three terms (written compactly as a weighted sum below):

# The de-noising auto-encoder loss
# The translation loss
# The adversarial loss
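
If we denote these terms by <math display="inline">\mathcal{L}_{auto}</math>, <math display="inline">\mathcal{L}_{trans}</math>, and <math display="inline">\mathcal{L}_{adv}</math>, the full objective can be written as a weighted sum,
\begin{align}
\mathcal{L} = \lambda_{auto}\,\mathcal{L}_{auto} + \lambda_{trans}\,\mathcal{L}_{trans} + \lambda_{adv}\,\mathcal{L}_{adv},
\end{align}
where the weights <math display="inline">\lambda_{auto}, \lambda_{trans}, \lambda_{adv} \geq 0</math> are hyperparameters balancing the three terms.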


== References ==

# Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
# Conneau, Alexis, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. "Word translation without parallel data." arXiv preprint arXiv:1710.04087 (2017).
# Mikolov, Tomas, Quoc V. Le, and Ilya Sutskever. "Exploiting similarities among languages for machine translation." arXiv preprint arXiv:1309.4168 (2013).
# Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in Neural Information Processing Systems. 2014.