stat946w18/Unsupervised Machine Translation Using Monolingual Corpora Only
Introduction
Neural machine translation systems must be trained on large corpora consisting of pairs of pre-translated sentences. This paper proposes an unsupervised neural machine translation system, which can be trained without using any such parallel data.
Overview of Standard Neural Machine Translation
Let [math]\displaystyle{ S }[/math] denote the set of words in the source language, and let [math]\displaystyle{ T }[/math] denote the set of words in the target language. Let [math]\displaystyle{ \mathcal{Z}_S \subset \mathbb{R}^{n_S} }[/math] and [math]\displaystyle{ \mathcal{Z}_T \subset \mathbb{R}^{n_T} }[/math] denote the word vectors corresponding to the words of the source and target language respectively. Let
\begin{align} S'&:=\bigcup_{M=1}^\infty S^M\\ T'&:=\bigcup_{K=1}^\infty T^K\\ \mathcal{Z}_S'&:=\bigcup_{M=1}^\infty \mathcal{Z}_S^M\\ \mathcal{Z}_T'&:=\bigcup_{K=1}^\infty \mathcal{Z}_T^K. \end{align}
In other words, [math]\displaystyle{ S' }[/math] and [math]\displaystyle{ T' }[/math] are the sets of finite sequences of words from the source and target language, and [math]\displaystyle{ \mathcal{Z}_S' }[/math] and [math]\displaystyle{ Z_T' }[/math] are the sets of finite sequences of corresponding word vectors. For any set [math]\displaystyle{ Q }[/math], we let [math]\displaystyle{ \mathbb{P}(S) }[/math] denote the set of probability distributions over [math]\displaystyle{ Q }[/math] (eliding measure-theoretic details). In standard neural machine translation system (Sutskever, 2014) the encoder [math]\displaystyle{ E:S' \to \mathbb{R}^{n_H} }[/math] maps a sequence of source words a code vector, while the decoder, [math]\displaystyle{ D: \mathbb{R}^{n_H} \to \mathbb{P}(T') }[/math] maps a code vector to a probability distribution over sequences of target-language words. Typically [math]\displaystyle{ E }[/math] is a mono- or bi- directional LSTM; the code vector is the final hidden state[math]\displaystyle{ h_M }[/math](or the concatenation of the two final hidden states); and the decoder is a monodirectional LSTM or GRU that at every step [math]\displaystyle{ k }[/math] accepts as input a hidden state [math]\displaystyle{ l_k }[/math] and the previous word in the sequence [math]\displaystyle{ y_{k-1} }[/math] and outputs a member of [math]\displaystyle{ \mathbb{P}(T) }[/math], a probability distribution over the possible next words in the sequence. The probability that the decoder assigns to a given target sequence [math]\displaystyle{ (y_1, \ldots ,y_K) \in T' }[/math] given a source sequence [math]\displaystyle{ (x_1,\ldots, x_M) }[/math] is computed by initializing the decoder with hidden state [math]\displaystyle{ E(x_1,\ldots,x_M)=c\in \mathbb{R}^{n_H} }[/math] and a special start-of-string token. At every subsequent step, the RNN is fed the hidden state output in the previous step as well the previous word of the sentence whose probability is being calculated. The probability of the sequence is then the product of the probabilities of its constituent words, where every sequence must terminates with a special end-of-string token.
To train the model, we requires a large corpus of parallels sentences [math]\displaystyle{ \{x^r \}_{r=1}^R \subset S' }[/math] and [math]\displaystyle{ \{y^r \}_{r=1}^R \subset T' }[/math] where [math]\displaystyle{ x^r=(x^r_1,\ldots, x^r_{M(r)}) }[/math] and [math]\displaystyle{ y^r=(y^r_1,\ldots, y^r_{K(r)}) }[/math]. We apply stochastic gradient descent, using as our loss function the negative log probability of each target language sentence given the matching source language sentence: \begin{align} \mathcal{L}( \{ x^r\}_{r=1}^R , \{ y^r\}_{r=1}^R )= \sum_{r=1}^R - \log D(E(x^r))(y^r) \end{align}
Attention
The above-described translation model performs poorly on long sentences. This is supposed to be a consequence of of the difficulty of compressing an arbitrarily-long sequence into a fixed-length code vector [math]\displaystyle{ c }[/math], and of the dilution of the impact of the code vector [math]\displaystyle{ c }[/math] as the decoder RNN gets further from it. To resolve these problems, Bahdanau et al. (2014) introduced attention. In summarize their paper below.
In a neural machine translation system with attention, the fixed-length code vector [math]\displaystyle{ c_r }[/math] is replaced by [math]\displaystyle{ (h_1,\ldots,h_M) \subset \mathbb{R}^{M\times n_{H_s}} }[/math], the concatenation for the hidden states produced by the encoder RNN. At time step [math]\displaystyle{ k }[/math] in addition to accepting the previous hidden state [math]\displaystyle{ l_k }[/math] and previous word [math]\displaystyle{ y_{k-1} }[/math] as input, the decoder also accepts a context vector [math]\displaystyle{ u_k \in \mathbb{R}^{n_{H_s}} }[/math] where
\begin{align} u_k= \sum_{m=1}^M \alpha_{k,m} h_m \end{align}
is a weighted sum of the hidden states produced by the encoder on the input sentence, with
\begin{align} \alpha_{k,m}&= \frac{\exp(e_{k,m})}{\sum_{m'=1}^M\exp(e_{k,m'}) },\\ e_{k,m} &= v^T \tanh (Wl_{k-1} + U h_m ), \end{align} where [math]\displaystyle{ W,U }[/math] are learnable weight matrices and [math]\displaystyle{ v }[/math] is a learnable weight vector. The idea is that the model can learn to decide which hidden states from source sentences are important for computing which words in the target sentences. Empirically, the addition of attention significantly improves neural machine translation
Overview of unsupervised translation system
The unsupervised translation system the following plan:
- Sentences from both the source and target language are mapped to a common latent vector space.
- A de-noising auto-encoder loss encourages the latent space representations of sentences to be insensitive to noise.
- An adverbial loss encourages the latent space representations of source and target sentences to be indistinguishable from each other. The idea is that the latent space representations should reflect the meaning of the sentences, and not the particular language in which it is expressed.
- A reconstruct loss works as follows: sample a sentence from one of the languages, and apply the translation model of the previous epoch to translate it to the other language. Then corrupt this translation with noise. The reconstruction loss encourages the model to able to recover the original sampled sentence from its corrupted translation by passing through the latent vector space.
- An unsupervised word-vector alignment system
- An encoder
- A decoder
- A discriminator
Word vector alignment
Methods like word2vec and GLOVE generate vectors in Euclidian space whose geometry corresponds to the semantics of the words they represent. For example, if f maps English words to vectors, then we have
[math]\displaystyle{ f(\text{king}) -f(\text{man}) +f(\text{woman}) \approx f(\text{queen}) }[/math]
Mikolov (2013), observed that this correspondence between semantics and geometry is invariant across languages. For example, if g maps French words to their corresponding vectors, then
[math]\displaystyle{ g(\text{roi}) -g(\text{homme}) +g(\text{femme}) \approx g(\text{reine}) }[/math]
It follows that if A maps English word vectors to the word vectors of their French translations, we should expect A to be linear.
Mikolov and many subsequent authors used this observation to devise methods for expanding small bilingual dictionaries. The main idea is that using given a small amount of parallel data, we can solve for the linear transformation A by least squares. We can then use this learned linear transformation to translate arbitrary words, assuming we have the corresponding word vectors. See my CS698 project for details (link). Essentially these schemes can be though of as methods for aligning the point cloud of source-language word-vectors with the point cloud of target language word-vectors. This work culminated recently with the paper of Conneau et al. (2017), which uses a GAN to align the word vectors of a source and target language using only considerations of point cloud geometry, and without relying on any bilingual dictionary. The present paper uses this unsupervised cross-lingual word-vector alignment scheme for two purposes: a) to initialize the unsupervised translation scheme; and b) to embed the word vectors of the two languages in the same space, so that at the same encoder and decoder can be applied to both languages.
Encoder
The encoder accepts as input the sequence of word vectors corresponding to the sentence of words to be translated. Although the lookup table matching words to word vectors differs between languages, the encoder itself is identical: the same encoder is used to process a sequence of word vectors from the target language and from the source language. This is possible because in the previous step, the word vectors of the two languages were aligned, so that word vectors correspond to words with similar meanings are nearby in Euclidian space, regardless of their language.
The encoder consists of two LSTMs. One reads the sentence in the forward direction, while the other reads it in the reverse direction. At each word, each LSTM produces an output vector. These two output vectors are concatenated. Thus the output of the encoder is a sequence of state vectors [math]\displaystyle{ z=(z_1,\ldots, z_m) }[/math]. one for each word in the input sentence.
Decoder
The decoder accepts as input the vector [math]\displaystyle{ z }[/math] produced by the encoder, as well as a language-specific start-of-sentence symbol, and produces a sequence of words terminated by a stop symbol. The decoder is a standard LSTM, except that it uses an attention mechanism, which I describe in the next section.
Attention Mechanisms
Bahdanau et al. (2014) introduced the attention mechanism
Overview of objective
The objective function is the sum of three terms:
- The de-noising auto-encoder loss
- The translation loss
- The adversarial loss
References
- Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
- Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, Hervé Jégou. "Word Translation without Parallel Data". arXiv:1710.04087, (2017)
- Tomas Mikolov, Quoc V Le, and Ilya Sutskever. "Exploiting similarities among languages for machine translation." arXiv preprint arXiv:1309.4168. (2013).
- Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.