stat946w18/Unsupervised Machine Translation Using Monolingual Corpora Only
Introduction
Neural machine translation systems must be trained on large corpora consisting of pairs of pre-translated sentences. This paper proposes an unsupervised neural machine translation system, which can be trained without using any such parallel data.
Overview of unsupervised translation system
The unsupervised translation system has the following plan:
- Sentences from both the source and target language are mapped to a common latent vector space.
- A de-noising auto-encoder loss encourages the latent space representations of sentences to be insensitive to noise.
- An adversarial loss encourages the latent space representations of source and target sentences to be indistinguishable from each other. The idea is that the latent space representations should reflect the meaning of a sentence, and not the particular language in which it is expressed.
- A reconstruction loss is computed as follows: sample a sentence from one of the languages, and apply the translation model of the previous epoch to translate it to the other language. Then corrupt this translation with noise. The reconstruction loss encourages the model to be able to recover the original sampled sentence from its corrupted translation by passing through the latent vector space.
In what follows I will discuss this plan in more detail.
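Before going into detail, the pseudocode below sketches how these pieces could fit together in a single training iteration. It is a schematic outline only, not the authors' implementation: the callables (A, C, E, D, R, translate_prev, nll, bce) stand for the components described in the following sections, and the loss weights are placeholders.
<syntaxhighlight lang="python">
# Schematic sketch of one training iteration. All callables are assumed to be
# supplied by the caller; they correspond to the components described below:
#   A: aligned word vectors, C: corruption process, E: encoder, D: decoder,
#   R: discriminator, translate_prev: last epoch's translation model,
#   nll: negative log-likelihood of a batch under a decoder distribution,
#   bce: binary cross-entropy between R's output and a label.
def training_step(batch_src, batch_tgt, A, C, E, D, R, translate_prev,
                  nll, bce, w_auto=1.0, w_cd=1.0, w_adv=1.0):
    total = 0.0
    for lang, other, batch in (("src", "tgt", batch_src), ("tgt", "src", batch_tgt)):
        # 1. De-noising auto-encoder loss: recover the batch from a corrupted copy.
        total += w_auto * nll(D(E(A(C(batch))), lang), batch)

        # 2. Reconstruction (translation) loss: translate with the previous model,
        #    corrupt the translation, and try to recover the original batch.
        pseudo = translate_prev(batch, other)
        total += w_cd * nll(D(E(A(C(pseudo))), lang), batch)

        # 3. Adversarial loss (encoder side): flipped labels push the encoder toward
        #    language-indistinguishable latent representations. (Training of the
        #    discriminator itself is a separate step, not shown.)
        total += w_adv * bce(R(E(A(batch))), 0.0 if lang == "src" else 1.0)
    return total
</syntaxhighlight>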
Notation
Let [math]\displaystyle{ S }[/math] denote the set of words in the source language, and let [math]\displaystyle{ T }[/math] denote the set of words in the target language. Let [math]\displaystyle{ H \subset \mathbb{R}^{n_H} }[/math] denote the latent vector space, and let [math]\displaystyle{ \mathcal{Z} \subset \mathbb{R}^{n_Z} }[/math] denote the space of aligned word vectors introduced in the next section. Moreover, let [math]\displaystyle{ S' }[/math] and [math]\displaystyle{ T' }[/math] denote the sets of finite sequences of words in the source and target language, and let [math]\displaystyle{ H' }[/math] and [math]\displaystyle{ \mathcal{Z}' }[/math] denote the sets of finite sequences of vectors in the latent space and the aligned word-vector space, respectively. For any set [math]\displaystyle{ X }[/math], elide measure-theoretic details and let [math]\displaystyle{ \mathcal{P}(X) }[/math] denote the set of probability distributions over [math]\displaystyle{ X }[/math].
Word vector alignment
Conneau et al. (2017) describe an unsupervised method for aligning word vectors across languages. By "alignment", I mean that their method groups vectors corresponding to words with similar meanings close to one another, regardless of the language of the words. Moreover, if word C is the literal target-language translation of the source-language word B, then, after alignment, C's word vector tends to be the closest target-language word vector to the word vector of B. This unsupervised alignment method is crucial to the translation scheme of the current paper. From now on, [math]\displaystyle{ A: S' \cup T' \to \mathcal{Z}' }[/math] denotes the function that maps source- and target-language word sequences to their aligned word vectors.
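To make the effect of alignment concrete, the toy snippet below reads off a word-level translation by nearest-neighbour search under cosine similarity. The vocabularies and embedding matrices are made-up stand-ins; in practice the vectors would come from the unsupervised alignment of Conneau et al. (2017).
<syntaxhighlight lang="python">
import numpy as np

# Hypothetical aligned embeddings: one row per word, already mapped into the
# shared space. These are random stand-ins, not real embeddings.
src_vocab = ["chat", "chien", "maison"]                  # toy source vocabulary
tgt_vocab = ["cat", "dog", "house"]                      # toy target vocabulary
rng = np.random.default_rng(0)
src_vecs = rng.normal(size=(3, 300))
tgt_vecs = src_vecs + 0.01 * rng.normal(size=(3, 300))   # pretend alignment worked

def nearest_target_word(src_word):
    """Return the target word whose aligned vector is closest (cosine) to src_word's."""
    v = src_vecs[src_vocab.index(src_word)]
    sims = tgt_vecs @ v / (np.linalg.norm(tgt_vecs, axis=1) * np.linalg.norm(v))
    return tgt_vocab[int(np.argmax(sims))]

print(nearest_target_word("chat"))  # -> "cat" with these toy vectors
</syntaxhighlight>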
Encoder
The encoder [math]\displaystyle{ E }[/math] reads a sequence of word vectors [math]\displaystyle{ (z_1,\ldots, z_m) \in \mathcal{Z}' }[/math] and outputs a sequence of hidden states [math]\displaystyle{ (h_1,\ldots, h_m) \in H' }[/math] in the latent space. Crucially, because the word vectors of the two languages have been aligned, the same encoder can be applied to both. That is, to map a source sentence [math]\displaystyle{ x=(x_1,\ldots, x_M)\in S' }[/math] to the latent space, we compute [math]\displaystyle{ E(A(x)) }[/math], and to map a target sentence [math]\displaystyle{ y=(y_1,\ldots, y_K)\in T' }[/math] to the latent space, we compute [math]\displaystyle{ E(A(y)) }[/math].
The encoder consists of two LSTMs, one of which reads the word-vector sequence in the forward direction, and one of which reads it in the backward direction. The hidden state sequence is generated by concatenating the hidden states produced by the forward and backward LSTM at each word vector.
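A minimal PyTorch sketch of such an encoder is given below; the dimensions and module structure are illustrative choices, not the authors' exact hyperparameters. Setting bidirectional=True gives exactly the forward/backward pair of LSTMs with concatenated hidden states described above.
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Shared encoder: a forward and a backward LSTM whose hidden states
    are concatenated at each position (bidirectional=True does exactly this)."""
    def __init__(self, n_z=300, n_h=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_z, hidden_size=n_h,
                            batch_first=True, bidirectional=True)

    def forward(self, word_vectors):            # (batch, seq_len, n_z)
        latent, _ = self.lstm(word_vectors)     # (batch, seq_len, 2 * n_h)
        return latent

# Because source and target word vectors live in the same aligned space,
# the same encoder instance is applied to sentences from both languages.
enc = BiLSTMEncoder()
dummy = torch.randn(1, 7, 300)                  # a 7-word "sentence" of aligned vectors
print(enc(dummy).shape)                         # torch.Size([1, 7, 512])
</syntaxhighlight>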
Decoder
The decoder is a mono-directional LSTM that accepts a sequence of hidden states [math]\displaystyle{ h=(h_1,\ldots, h_m) \in H' }[/math] from the latent space, together with a language, and outputs a probability distribution over word sequences in that language. We have
\begin{align} D: H' \times \{S,T \} \to \mathcal{P}(S') \cup \mathcal{P}(T'). \end{align}
In detail, the decoder is a mono-directional LSTM that makes use of the attention mechanism of Bahdanau et al. (2014). To compute the probability of a given sentence [math]\displaystyle{ y=(y_1,\ldots,y_K) }[/math], the LSTM processes the sentence one word at a time, accepting at each step [math]\displaystyle{ k }[/math] the aligned word vector of the previous word in the sentence, [math]\displaystyle{ A(y_{k-1}) }[/math], and a context vector [math]\displaystyle{ c_k\in H }[/math] computed from the hidden sequence [math]\displaystyle{ h\in H' }[/math]. The LSTM is initialized with a special, language-specific start-of-sequence token; otherwise, the decoder does not depend on the language of the sentence it is producing. The context vector is computed as described by Bahdanau et al. (2014), where [math]\displaystyle{ l_{k} }[/math] denotes the hidden state of the LSTM at step [math]\displaystyle{ k }[/math], [math]\displaystyle{ U,W }[/math] are learnable weight matrices, and [math]\displaystyle{ v }[/math] is a learnable weight vector: \begin{align} c_k&= \sum_{m=1}^M \alpha_{k,m} h_m,\\ \alpha_{k,m}&= \frac{\exp(e_{k,m})}{\sum_{m'=1}^M\exp(e_{k,m'}) },\\ e_{k,m} &= v^T \tanh (Wl_{k-1} + U h_m ). \end{align}
By learning [math]\displaystyle{ U,W }[/math] and [math]\displaystyle{ v }[/math], the decoder can learn which vectors in the sequence [math]\displaystyle{ h }[/math] are relevant to computing which words in the output sequence.
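The attention computation is small enough to write out directly. The snippet below evaluates the three equations above for a single decoding step [math]\displaystyle{ k }[/math]; the tensor sizes are illustrative, and the weights are random stand-ins for the learned [math]\displaystyle{ W, U, v }[/math].
<syntaxhighlight lang="python">
import torch

n_latent, n_dec, M = 512, 512, 7                 # illustrative sizes
h = torch.randn(M, n_latent)                     # latent sequence h_1..h_M from the encoder
l_prev = torch.randn(n_dec)                      # decoder hidden state l_{k-1}
W = torch.randn(n_latent, n_dec)                 # learned in practice; random here
U = torch.randn(n_latent, n_latent)
v = torch.randn(n_latent)

# e_{k,m} = v^T tanh(W l_{k-1} + U h_m), computed for every m at once
e = torch.tanh(l_prev @ W.T + h @ U.T) @ v       # shape (M,)
alpha = torch.softmax(e, dim=0)                  # attention weights alpha_{k,m}
c_k = alpha @ h                                  # context vector c_k = sum_m alpha_{k,m} h_m
print(c_k.shape)                                 # torch.Size([512])
</syntaxhighlight>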
At step [math]\displaystyle{ k }[/math], after receiving the context vector [math]\displaystyle{ c_k\in H }[/math] and the aligned word vector of the previous word in the sequence, [math]\displaystyle{ A(y_{k-1}) }[/math], the LSTM outputs a probability distribution over words, which should be interpreted as the distribution of the next word according to the decoder. The probability the decoder assigns to a sentence is then the product of the probabilities computed for each word in this manner.
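In symbols, writing [math]\displaystyle{ p_k(\cdot) }[/math] for the distribution output at step [math]\displaystyle{ k }[/math] (notation introduced here for convenience), the probability the decoder assigns to a target sentence [math]\displaystyle{ y=(y_1,\ldots,y_K) }[/math] is \begin{align} D(h, T)(y) = \prod_{k=1}^{K} p_k\left(y_k \mid l_{k-1},\, c_k,\, A(y_{k-1})\right). \end{align}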
Discriminator
The discriminator [math]\displaystyle{ R }[/math] maps sequences of latent-space vectors to probabilities: \begin{align} R: H' \to [0,1]. \end{align} Given the latent representation of a sentence, [math]\displaystyle{ R }[/math] estimates the probability that the sentence was encoded from one particular language rather than the other. It is the adversary in the adversarial loss term: the encoder is trained to make this prediction fail, so that the latent representations of the two languages become indistinguishable.
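One simple way to realize such a discriminator, shown below, is a small feed-forward network applied to each latent vector, with the per-step scores pooled into a sentence-level probability. The layer sizes and the mean-pooling are assumptions made for illustration; they are not claimed to match the paper's exact architecture.
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class LatentDiscriminator(nn.Module):
    """Scores a latent sequence: probability that it came from (say) the source
    language. Architecture details here are illustrative, not taken from the paper."""
    def __init__(self, n_latent=512, n_hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_latent, n_hidden), nn.LeakyReLU(),
            nn.Linear(n_hidden, n_hidden), nn.LeakyReLU(),
            nn.Linear(n_hidden, 1),
        )

    def forward(self, latent):                      # (batch, seq_len, n_latent)
        scores = self.mlp(latent).squeeze(-1)       # one score per time step
        return torch.sigmoid(scores.mean(dim=1))    # pooled into a single probability

disc = LatentDiscriminator()
print(disc(torch.randn(2, 7, 512)))                 # two probabilities in [0, 1]
</syntaxhighlight>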
Overview of objective
The objective function is the sum of three terms:
- The de-noising auto-encoder loss
- The translation loss
- The adversarial loss
I shall describe these in the following sections.
De-noising Auto-encoder
A de-noising auto-encoder is a function optimized to map a corrupted sample from some dataset back to the original un-corrupted sample. De-noising auto-encoders were introduced by Vincent et al. (2008), who provided numerous justifications, one of which is particularly illuminating: if we think of the dataset of interest as a thin manifold in a high-dimensional space, the corruption process is likely to perturb a datapoint off the manifold. To learn to restore the corrupted datapoint, the de-noising auto-encoder must learn the shape of the manifold.
Hill et al. (2016) used a de-noising auto-encoder to learn vectors representing sentences. They corrupted input sentences by randomly dropping and swapping words, and then trained a neural network to map the corrupted sentence to a vector, and then map the vector to the un-corrupted sentence. Interestingly, they found that sentence vectors learned this way were particularly effective when applied to tasks that involved generating paraphrases.
The present paper uses the principle of de-noising auto-encoders to compute one of the terms in its loss function. In each iteration, a sentence is sampled from the source or target language, and a corruption process [math]\displaystyle{ C }[/math] is applied to it. [math]\displaystyle{ C }[/math] works by deleting each word in the sentence with probability [math]\displaystyle{ p_C }[/math] and applying to the sentence a permutation randomly selected from those that do not move words more than [math]\displaystyle{ k_C }[/math] spots from their original positions. The authors select [math]\displaystyle{ p_C=0.1 }[/math] and [math]\displaystyle{ k_C=3 }[/math]. The corrupted sentence is then mapped to the latent space using the aligned word vectors [math]\displaystyle{ A }[/math] and the encoder [math]\displaystyle{ E }[/math]. The loss is then the negative log probability of the original uncorrupted sentence according to the decoder [math]\displaystyle{ D }[/math] applied to the latent-space sequence.
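A possible implementation of the corruption process [math]\displaystyle{ C }[/math] is sketched below. The word-drop step follows the description directly; for the bounded permutation, the sketch adds uniform noise in [math]\displaystyle{ [0, k_C + 1) }[/math] to each position index and sorts, which displaces no word by more than [math]\displaystyle{ k_C }[/math] positions (this particular trick is an implementation choice made here, not necessarily the authors').
<syntaxhighlight lang="python">
import random

def corrupt(sentence, p_c=0.1, k_c=3, rng=random):
    """Corruption process C: drop each word with probability p_c, then apply a
    random local permutation that moves no word more than k_c positions."""
    kept = [w for w in sentence if rng.random() > p_c]
    # Jitter each index by Uniform(0, k_c + 1) and sort; the resulting permutation
    # displaces each word by at most k_c positions.
    keys = [i + rng.uniform(0, k_c + 1) for i in range(len(kept))]
    return [w for _, w in sorted(zip(keys, kept))]

print(corrupt("the cat sat on the mat today".split()))
</syntaxhighlight>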
The explanation of Vincent et al. (2008) can help us understand this loss-function term. The de-noising auto-encoder loss forces the encoder-decoder pair to learn the "manifold" of well-formed sentences in each language: to undo word deletions and local reorderings, the model cannot simply copy its input, but must produce output that is fluent in the language being decoded.
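In the notation above, the contribution of a source-language sentence [math]\displaystyle{ x }[/math] to this term can be written as follows (the target-language case is symmetric, and the expectation runs over the sampled sentence and the randomness in [math]\displaystyle{ C }[/math]): \begin{align} \mathcal{L}_{auto} = \mathbb{E}_{x}\left[ -\log D\big(E(A(C(x))),\, S\big)(x) \right]. \end{align}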
References
- Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
- Conneau, Alexis, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. "Word translation without parallel data." arXiv preprint arXiv:1710.04087 (2017).
- Hill, Felix, Kyunghyun Cho, and Anna Korhonen. "Learning distributed representations of sentences from unlabelled data." arXiv preprint arXiv:1602.03483 (2016).
- Mikolov, Tomas, Quoc V Le, and Ilya Sutskever. "Exploiting similarities among languages for machine translation." arXiv preprint arXiv:1309.4168. (2013).
- Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.
- Vincent, Pascal, et al. "Extracting and composing robust features with denoising autoencoders." Proceedings of the 25th international conference on Machine learning. ACM, 2008.