Unsupervised Neural Machine Translation: Difference between revisions

From statwiki
= Methodology =


The model is a sequence-to-sequence model with attention, without input-feeding. Both the encoder and decoder are 3-layer LSTMs, and the encoder is bidirectional. The encoder and decoder are invariant to the language being used, as there is only 1 set of parameters for the encoder and another set for the decoder.


The objective function proposed by the paper is a combination of 3 component objective functions:
# Reconstruction loss of the denoising auto-encoder
# Cross domain translation loss of the auto-encoder
# Adversarial cross entropy loss of the discriminator


====Notations====
<math>\mathcal{W}_S, \mathcal{W}_T </math> are the sets of words in the source and target language domains.


<math>\mathcal{Z}^S , \mathcal{Z}^T </math> are the sets of word embeddings in the source and target language domains.


<math>\ell \in \{src, tgt\} </math> denotes the source or target language.


<math>x \in \mathbb{R}^m</math> is a vector of m words in a particular language <math>\ell</math>


<math>e_{\theta_{enc},\mathcal{Z}}(x, \ell)</math> is the encoder parameterized by <math>\theta_{enc}</math>. It takes as input <math>x</math> and <math>\ell</math> and computes <math>z \in \mathbb{R}^m</math>, a sequence of m hidden states, using embedding <math>\mathcal{Z}^{\ell} </math>.


<math>d_{\theta_{dec},\mathcal{Z}}(z, \ell)</math> is the decoder parameterized by <math>\theta_{dec}</math>. It takes as input <math>z</math> and <math>\ell</math> and computes <math>y \in \mathbb{R}^k</math>, which is a sequence of k words from vocabulary <math>\mathcal{W}^{\ell}</math>.


===Noise Model===


The noise model <math>C(x)</math> used throughout the paper is a randomly sampled noisy version of sentence <math>x</math>. Noise is added in 2 ways:
# Randomly dropping each word in the sentence with probability <math>p_{wd}</math>.
# Slightly shuffling the words in the sentence where each word can be at most <math>k</math> positions away from its original position.


The authors found in practice <math>p_{wd}= 0.1 </math> and <math>k=3</math> to be good parameters.
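
The two noise operations above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `add_noise`, the `rng` argument, and the fallback that keeps one word when every word is dropped are assumptions. The shuffle sorts positions by index plus uniform noise in [0, k+1), which is one common way to keep every word within <math>k</math> positions of where it started.

```python
import random

def add_noise(words, p_wd=0.1, k=3, rng=None):
    """Return a noisy copy of `words` (hypothetical helper, illustration only)."""
    rng = rng or random.Random()
    # 1. Word dropout: drop each word independently with probability p_wd.
    kept = [w for w in words if rng.random() >= p_wd]
    if not kept and words:
        # Assumption: never return an empty sentence.
        kept = [rng.choice(words)]
    # 2. Local shuffle: sort by (index + uniform noise in [0, k+1)).
    #    A word can then end up at most k positions from its original slot.
    keys = [i + rng.random() * (k + 1) for i in range(len(kept))]
    order = sorted(range(len(kept)), key=lambda i: keys[i])
    return [kept[i] for i in order]

if __name__ == "__main__":
    sentence = "the quick brown fox jumps over the lazy dog".split()
    print(add_noise(sentence, rng=random.Random(0)))
```

Passing a seeded `random.Random` makes the corruption reproducible, which is convenient when debugging.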


===Loss Component 1: Reconstruction Loss===


This component captures the expected cross entropy loss between <math>x</math> and the reconstructed <math>\hat{x}</math>, where <math>\hat{x}</math> is constructed as follows:
# Construct <math>C(x)</math>, noisy version of <math>x</math> from a language <math>\ell</math>
# Input <math>C(x)</math> and language <math>\ell</math> into the encoder parameterized with <math>\theta_{enc}</math>, to get <math>e(C(x),\ell)</math>.
# Input the <math>e(C(x),\ell)</math> and <math>\ell</math> into the decoder parameterized with <math>\theta_{dec}</math>, to get <math>\hat{x} \sim d(e(C(x),\ell),\ell)</math>.


\begin{align}
\mathcal{L}_{auto}(\theta_{enc}, \theta_{dec}, \mathcal{Z}, \ell) = E_{x\sim D_\ell, \hat{x}\sim d(e(C(x),\ell),\ell)}[\Delta(\hat{x},x)]
\end{align}


===Loss Component 2: Cross Domain Translation Loss===


This component captures the expected cross entropy loss between <math>x</math> and the reconstructed <math>\hat{x}</math> from the translation of <math>x</math>, where <math>\hat{x}</math> is constructed as follows:
# Using the current iteration of the translation model <math>M</math>, construct translation <math>M(x)</math> in <math>\ell_2</math>, where <math>x</math> is from a language <math>\ell_1</math>. (<math>M</math> is initialized using a different translation model, discussed later.)
# Construct <math>C(M(x))</math>, noisy version of translation <math>M(x)</math>.
# Input <math>C(M(x))</math> and language <math>\ell_2</math> into the encoder parameterized with <math>\theta_{enc}</math>, to get <math>e(C(M(x)),\ell_2)</math>.
# Input <math>e(C(M(x)),\ell_2)</math> and <math>\ell_1</math> into the decoder parameterized with <math>\theta_{dec}</math>, to get <math>\hat{x} \sim d(e(C(M(x)),\ell_2),\ell_1)</math>.


\begin{align}
\mathcal{L}_{cd}(\theta_{enc}, \theta_{dec}, \mathcal{Z}, \ell_1,\ell_2) = E_{x\sim D_{\ell_1}, \hat{x}\sim d(e(C(M(x)),\ell_2),\ell_1)}[\Delta(\hat{x},x)]
\end{align}


===Loss Component 3: Adversarial Loss===
 
A discriminator parameterized with <math>\theta_D</math> is trained to distinguish the language <math>\ell</math> given a vector <math>z</math> in the latent space. It is trained by minimizing the cross entropy loss <math>\mathcal{L}_D</math> between the predicted language and the ground truth language that produced the vector <math>z</math>.
 
The encoder is trained to fool the discriminator: the adversarial loss is minimized when, given an encoding of <math>x</math> in language <math>\ell_i</math>, the discriminator predicts that it comes from <math>\ell_j</math>.
 
The end result at convergence is that the representation in the latent space for language <math>\ell_1</math> is indistinguishable from language <math>\ell_2</math>.
 
\begin{align}
\mathcal{L}_{adv}(\theta_{enc}, \mathcal{Z}|\theta_D) = -E_{x_i,\ell_i}[\log p_D (\ell_j|e(x_i,\ell_i))]
\end{align}
with <math>\ell_j=\ell_1</math> if <math>\ell_i=\ell_2</math>, and vice versa.
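
As a small numeric illustration of the adversarial loss above, the sketch below averages <math>-\log p_D(\ell_j|z)</math> over a batch for the two-language case. The function name, the input format (one discriminator probability per encoding), and the clamping constant are assumptions made for the example, not details taken from the paper.

```python
import math

def adversarial_loss(p_src_given_z, true_langs, eps=1e-12):
    """Mean of -log p_D(other language | z) over a batch (illustrative only).

    p_src_given_z[i]: discriminator's probability that encoding i came from "src".
    true_langs[i]:    the language that actually produced encoding i ("src" or "tgt").
    """
    total = 0.0
    for p_src, lang in zip(p_src_given_z, true_langs):
        # Probability the discriminator assigns to the *wrong* language.
        p_wrong = (1.0 - p_src) if lang == "src" else p_src
        # Clamp to avoid log(0); eps is an assumption for numerical safety.
        total += -math.log(max(p_wrong, eps))
    return total / len(true_langs)

if __name__ == "__main__":
    # An undecided discriminator (p = 0.5) gives a loss of log 2 per example.
    print(adversarial_loss([0.5, 0.5], ["src", "tgt"]))
```

The loss falls toward 0 as the discriminator is pushed to predict the wrong language, which is the behaviour the encoder is trained to induce.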
 
==Final Objective Loss Function==
 
Combining all 3 components the following objective function is obtained:
 
\begin{align*}
\mathcal{L}(\theta_{enc}, \theta_{dec}, \mathcal{Z}) &= \lambda_{auto}[\mathcal{L}_{auto}(\theta_{enc}, \theta_{dec}, \mathcal{Z},src)+\mathcal{L}_{auto}(\theta_{enc}, \theta_{dec}, \mathcal{Z},tgt)]\\
&+ \lambda_{cd}[\mathcal{L}_{cd}(\theta_{enc}, \theta_{dec}, \mathcal{Z},src,tgt) +\mathcal{L}_{cd}(\theta_{enc}, \theta_{dec}, \mathcal{Z},tgt,src)]\\
&+\lambda_{adv}\mathcal{L}_{adv}(\theta_{enc}, \mathcal{Z}|\theta_D)
\end{align*}
 
<math>\lambda_{auto}</math>, <math>\lambda_{cd}</math>, and <math>\lambda_{adv}</math> are hyperparameters that represent the weights of each component. The discriminator loss <math>\mathcal{L}_D</math> is minimized in parallel, because its parameter <math>\theta_{D}</math> is used in the last component.
 
**insert Figure2 here**
 
==Training==
 
The training is iterative, where the translation model <math>M^{(t)}</math> improves at each time step <math>t</math>. To seed the training, <math>M^{(1)}</math> is a separate unsupervised word-by-word translation model, as proposed by [Conneau, 2017]. Each iteration of the training is as follows:
 
# Use <math>M^{(t)}</math> to obtain a translation <math>M^{(t)}(x)</math>.
# Use <math>M^{(t)}(x)</math> and <math>x</math> to train the auto-encoder, training the discriminator at the same time (i.e. minimizing the final objective function).
# Update <math>M^{(t+1)}</math>, repeat.
 
**insert Algorithm1 here**
 
==Model Selection Criterion==
 
In machine translation, the Bilingual Evaluation Understudy (BLEU) score is typically used to evaluate the quality of a translation against a reference (ground-truth) translation. However, since the training is unsupervised and uses no parallel corpora, BLEU cannot be used during training to select hyperparameters.
 
The paper proposes a scoring method that correlates with BLEU. The main idea is to compute the BLEU score between <math> x </math> and its back-translated version under the model (i.e. translate <math> x </math> to the target language, then translate it back to the source language). This makes it possible to score the quality of the translation model without supervision.


=Results=

Revision as of 15:52, 20 November 2018

Introduction

The paper presents an unsupervised method for machine translation that uses only monolingual corpora, without any alignment between sentences or documents. A monolingual corpus is a text corpus in a single language. This contrasts with the usual translation approach, which uses parallel corpora: two corpora that are direct translations of each other, with the translations aligned by words or sentences. The problem is important because many language pairs, e.g. German-Russian, lack parallel corpora.

The general approach of the methodology is to:

  1. Use monolingual corpora in the source and target languages to learn source and target word embeddings.
  2. Align the 2 sets of word embeddings in the same latent space.

Then iteratively perform:

  1. Train an auto-encoder to reconstruct noisy versions of sentence embeddings for both the source and target languages, where the encoder is shared and each language has its own decoder.
  2. Tune the decoder in each language by back-translating between the source and target language.

Background

Word Embedding Alignment

The paper uses word2vec [Mikolov, 2013] to convert each monolingual corpus into vector embeddings. These embeddings have been shown to capture contextual and syntactic features independent of language, so in theory there could exist a linear map from the embeddings of language L1 to those of language L2.

Figure 2 shows the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.

  • insert Figure2

The paper uses the methodology proposed by [Artetxe, 2017] to align cross-lingual embeddings in an unsupervised manner, without parallel data. Without going into the details, the general approach of that method is to start from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.) and iteratively learn the mapping between the 2 language embeddings, while concurrently improving the dictionary at each iteration.
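
For intuition about the mapping step, one standard way to fit a linear map from a dictionary is the orthogonal Procrustes problem. This is an assumption made for illustration: this summary does not state the exact objective used by the alignment method. Here X and Y (hypothetical names) are matrices whose i-th rows hold the L1 and L2 embeddings of the i-th dictionary pair, and W is the map being learned.

```latex
\begin{align}
W^{*} &= \arg\min_{W :\, W^{\top}W = I} \lVert XW - Y \rVert_F^2, \\
W^{*} &= UV^{\top}, \quad \text{where } X^{\top}Y = U\Sigma V^{\top} \text{ is a singular value decomposition.}
\end{align}
```

In an iterative scheme of the kind described above, the dictionary would then be re-induced from nearest neighbours under the current W, and the two steps repeated.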

Methodology

The corpora are first preprocessed in a standard way to tokenize and compress the words. The words are then converted to 300-dimensional word embeddings using word2vec, and the embeddings are aligned between languages using the method proposed by [Artetxe, 2017]. This alignment method is also used as a baseline for evaluating the model, as discussed later in Results.

The translation model uses a standard encoder-decoder architecture with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2-layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target languages, while each language has its own decoder.

  • insert Figure1*

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

Denoising

Random noise is added to the input sentences so that the model learns some of the structure of the languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both languages in a language-independent fashion, which are then decoded by the language-dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if [math]\displaystyle{ x }[/math] is a sentence in language L1:

  1. Construct [math]\displaystyle{ C(x) }[/math], noisy version of [math]\displaystyle{ x }[/math],
  2. Input [math]\displaystyle{ C(x) }[/math] into the current iteration of the shared encoder and use decoder for L1 to get reconstructed [math]\displaystyle{ \hat{x} }[/math].

The training objective is to minimize the cross entropy loss between [math]\displaystyle{ {x} }[/math] and [math]\displaystyle{ \hat{x} }[/math].

The proposed noise function is to perform [math]\displaystyle{ N/2 }[/math] random swaps of words that are near each other, where [math]\displaystyle{ N }[/math] is the number of words in the sentence.
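
The swap-based noise function can be sketched as follows. This is a minimal illustration under the assumption that "near each other" means adjacent words; the function name `swap_noise` and the `rng` argument are hypothetical.

```python
import random

def swap_noise(words, rng=None):
    """Return a copy of `words` after N/2 random swaps of adjacent words."""
    rng = rng or random.Random()
    noisy = list(words)           # do not modify the caller's sentence
    n = len(noisy)
    for _ in range(n // 2):       # N/2 swaps; no swaps when n < 2
        i = rng.randrange(n - 1)  # pick a position with a right-hand neighbour
        noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
    return noisy

if __name__ == "__main__":
    sentence = "the cat sat on the mat".split()
    print(swap_noise(sentence, rng=random.Random(0)))
```

The output always contains the same words as the input, so the decoder has to recover word order rather than word identity.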

Back-Translation

With only denoising, the system has no objective that improves the actual translation. Back-translation uses the decoder of the target language to create a translation, then encodes this translation and decodes it again with the source decoder to reconstruct the original sentence. In mathematical form, if [math]\displaystyle{ C(x) }[/math] is a noisy version of sentence [math]\displaystyle{ x }[/math] in language L1:

  1. Input [math]\displaystyle{ C(x) }[/math] into the current iteration of the shared encoder and the decoder for L2 to construct translation [math]\displaystyle{ y }[/math] in L2,
  2. Construct [math]\displaystyle{ C(y) }[/math], noisy version of translation [math]\displaystyle{ y }[/math],
  3. Input [math]\displaystyle{ C(y) }[/math] into the current iteration of shared encoder and the decoder in L1 to reconstruct [math]\displaystyle{ \hat{x} }[/math] in L1.

The training objective is to minimize the cross entropy loss between [math]\displaystyle{ {x} }[/math] and [math]\displaystyle{ \hat{x} }[/math].
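
The data flow of the back-translation steps above can be sketched with placeholder components. All names here (`make_back_translation_example`, `noise`, `encode`, `decode_l2`) are hypothetical stand-ins for the noise function, the shared encoder, and the L2 decoder; the toy dictionary lookup is only there to make the example runnable.

```python
def make_back_translation_example(x, noise, encode, decode_l2):
    """Build one (model_input, target) pair for a back-translation update.

    x:         original sentence in L1 (list of words)
    noise:     corruption function C(.)
    encode:    shared encoder (placeholder callable)
    decode_l2: decoder for L2 (placeholder callable)
    """
    y = decode_l2(encode(noise(x)))  # step 1: translate C(x) into L2
    noisy_y = noise(y)               # step 2: corrupt the translation
    # Step 3 would feed noisy_y through the shared encoder and the L1 decoder,
    # and minimize the cross entropy between the output and the original x.
    return noisy_y, x

if __name__ == "__main__":
    toy_dictionary = {"the": "le", "cat": "chat"}  # toy stand-in for a model
    pair = make_back_translation_example(
        ["the", "cat"],
        noise=lambda s: list(s),                   # identity "noise" for clarity
        encode=lambda s: s,
        decode_l2=lambda z: [toy_dictionary[w] for w in z],
    )
    print(pair)
```

The pair acts as a synthetic parallel example: the input is machine-generated L2 text and the target is a genuine L1 sentence.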

Training

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence. During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.
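
The alternation described above can be written as a small scheduling loop. This is a sketch only: `denoise_step`, `backtranslate_step`, and `has_converged` are hypothetical callables standing in for the real parameter updates and stopping rule, which are not specified here.

```python
def training_iteration(batch_l1, batch_l2, denoise_step, backtranslate_step):
    """Run the four mini-batch updates of one iteration, in order."""
    return [
        denoise_step(batch_l1, "L1"),              # denoising for L1
        denoise_step(batch_l2, "L2"),              # denoising for L2
        backtranslate_step(batch_l1, "L1", "L2"),  # back-translation L1 -> L2
        backtranslate_step(batch_l2, "L2", "L1"),  # back-translation L2 -> L1
    ]

def train(batches, denoise_step, backtranslate_step, has_converged):
    """Repeat iterations over (batch_l1, batch_l2) pairs until convergence."""
    history = []
    for batch_l1, batch_l2 in batches:
        history.append(
            training_iteration(batch_l1, batch_l2, denoise_step, backtranslate_step)
        )
        if has_converged(history):  # stopping rule supplied by the caller
            break
    return history

if __name__ == "__main__":
    # Dummy steps that return a constant loss, to show the call order only.
    log = train(
        [(["a"], ["b"])] * 3,
        denoise_step=lambda batch, lang: 0.0,
        backtranslate_step=lambda batch, src, tgt: 0.0,
        has_converged=lambda history: len(history) >= 2,
    )
    print(len(log))
```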

Optimizer choice and other hyperparameters can be found in the paper.

Results


Critique


Other Sources

References

  1. [Lample, 2018] Lample, G., Conneau, A., Ranzato, M., Denoyer, L., "Unsupervised Machine Translation Using Monolingual Corpora Only". arXiv:1711.00043
  2. [Conneau, 2017] Conneau, A., Lample, G., Ranzato, M., Denoyer, L., Jégou, H., "Word Translation without Parallel Data". arXiv:1710.04087