Unsupervised Machine Translation Using Monolingual Corpora Only


Introduction

The paper presents an unsupervised approach to machine translation that uses only monolingual corpora, without any alignment between sentences or documents. A monolingual corpus is a text corpus written in a single language. This contrasts with the usual translation approach, which uses parallel corpora: two corpora that are direct translations of each other and are aligned at the word or sentence level.

The general approach of the methodology is to first use the unsupervised word-by-word translation model proposed by [Conneau, 2017], then iteratively improve on the translation by utilizing two components:

  1. A denoising auto-encoder that reconstructs sentences from noisy versions of them, for both the source and target languages.
  2. A discriminator that aligns the distributions of the source and target languages in a latent space.

Background

Methodology

The model uses a sequence-to-sequence model with attention, without input-feeding. Both the encoder and decoder are 3-layer LSTMs, and the encoder is bidirectional. The encoder and decoder are invariant to the language being used: there is only one set of parameters for the encoder and another set for the decoder, shared across both languages.

The objective function proposed by the paper is a combination of three component objective functions:

  1. Reconstruction loss of the denoising auto-encoder
  2. Cross domain translation loss of the auto-encoder
  3. Adversarial cross entropy loss of the discriminator

Notations

[math]\displaystyle{ \mathcal{W}_S, \mathcal{W}_T }[/math] are the sets of words (vocabularies) in the source and target language domains.

[math]\displaystyle{ \mathcal{Z}^S , \mathcal{Z}^T }[/math] are the sets of word embeddings in the source and target language domains.

[math]\displaystyle{ \ell \in \{src, tgt\} }[/math] denotes the source or target language.

[math]\displaystyle{ x \in \mathbb{R}^m }[/math] is a sentence of m words in a particular language [math]\displaystyle{ \ell }[/math].

[math]\displaystyle{ e_{\theta_{enc},\mathcal{Z}}(x, \ell) }[/math] is the encoder parameterized by [math]\displaystyle{ \theta_{enc} }[/math]; it takes as input [math]\displaystyle{ x }[/math] and [math]\displaystyle{ \ell }[/math] and computes [math]\displaystyle{ z \in \mathbb{R}^m }[/math], a sequence of m hidden states, using the embeddings [math]\displaystyle{ \mathcal{Z}^{\ell} }[/math].

[math]\displaystyle{ d_{\theta_{dec},\mathcal{Z}}(z, \ell) }[/math] is the decoder parameterized by [math]\displaystyle{ \theta_{dec} }[/math]; it takes as input [math]\displaystyle{ z }[/math] and [math]\displaystyle{ \ell }[/math] and computes [math]\displaystyle{ y \in \mathbb{R}^k }[/math], a sequence of k words from the vocabulary [math]\displaystyle{ \mathcal{W}^{\ell} }[/math].
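
To make the shared architecture and the notation above concrete, here is a minimal PyTorch sketch (an illustration by this summary, not the authors' code): one encoder and one decoder are shared across languages, and only the embedding tables are language specific. The dimensions, the mean-pooled context used in place of attention, and the assumed <bos> index are illustrative choices.

  import torch
  import torch.nn as nn

  class Encoder(nn.Module):
      """Shared 3-layer bidirectional LSTM encoder e(x, l).
      Only the embedding tables Z^l are language specific."""
      def __init__(self, vocab_sizes, emb_dim=300, hid_dim=300):
          super().__init__()
          self.emb = nn.ModuleDict({l: nn.Embedding(v, emb_dim)
                                    for l, v in vocab_sizes.items()})
          self.lstm = nn.LSTM(emb_dim, hid_dim, num_layers=3,
                              bidirectional=True, batch_first=True)

      def forward(self, x, lang):
          # x: (batch, m) word indices in language `lang`
          z, _ = self.lstm(self.emb[lang](x))
          return z                                  # (batch, m, 2*hid_dim) hidden states

  class Decoder(nn.Module):
      """Shared 3-layer LSTM decoder d(z, l). A mean-pooled context vector is
      used in place of the attention mechanism to keep the sketch short."""
      def __init__(self, vocab_sizes, emb_dim=300, hid_dim=600):
          super().__init__()
          self.emb = nn.ModuleDict({l: nn.Embedding(v, emb_dim)
                                    for l, v in vocab_sizes.items()})
          self.lstm = nn.LSTM(emb_dim + hid_dim, hid_dim,
                              num_layers=3, batch_first=True)
          self.proj = nn.ModuleDict({l: nn.Linear(hid_dim, v)
                                     for l, v in vocab_sizes.items()})

      def forward(self, z, lang, teacher, bos_id=1):
          # teacher forcing: feed <bos> (assumed index) followed by all but the
          # last reference word, so that position t predicts reference word t
          inp = torch.cat([torch.full_like(teacher[:, :1], bos_id),
                           teacher[:, :-1]], dim=1)
          ctx = z.mean(dim=1, keepdim=True).expand(-1, inp.size(1), -1)
          h, _ = self.lstm(torch.cat([self.emb[lang](inp), ctx], dim=-1))
          return self.proj[lang](h)                 # (batch, k, |W^l|) word logits

Here vocab_sizes would be a dictionary such as {'src': |W_S|, 'tgt': |W_T|}, mapping each language name to its vocabulary size.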

Noise Model

The noise model [math]\displaystyle{ C(x) }[/math] used throughout the paper produces a randomly sampled noisy version of a sentence [math]\displaystyle{ x }[/math]. Noise is added in two ways:

  1. Randomly dropping each word in the sentence with probability [math]\displaystyle{ p_{wd} }[/math].
  2. Slightly shuffling the words in the sentence where each word can be at most [math]\displaystyle{ k }[/math] positions away from its original position.

The authors found [math]\displaystyle{ p_{wd}= 0.1 }[/math] and [math]\displaystyle{ k=3 }[/math] to work well in practice.
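
A minimal sketch of this noise model, assuming the sentence is given as a list of tokens (an illustration of the described procedure, not the authors' implementation):

  import random

  def add_noise(sentence, p_wd=0.1, k=3):
      """Return C(sentence): drop each word with probability p_wd, then apply a
      local shuffle in which each word moves only a few positions (roughly at
      most k) away from where it started."""
      kept = [w for w in sentence if random.random() > p_wd]   # word dropout
      # local shuffle: perturb each index by uniform noise in [0, k + 1) and sort
      keys = [i + random.uniform(0, k + 1) for i in range(len(kept))]
      return [w for _, w in sorted(zip(keys, kept))]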

Loss Component 1: Reconstruction Loss

This component captures the expected cross entropy loss between [math]\displaystyle{ x }[/math] and the reconstructed [math]\displaystyle{ \hat{x} }[/math], where [math]\displaystyle{ \hat{x} }[/math] is constructed as follows:

  1. Construct [math]\displaystyle{ C(x) }[/math], the noisy version of [math]\displaystyle{ x }[/math], where [math]\displaystyle{ x }[/math] is a sentence from a language [math]\displaystyle{ \ell }[/math].
  2. Input [math]\displaystyle{ C(x) }[/math] and language [math]\displaystyle{ \ell }[/math] into the encoder parameterized with [math]\displaystyle{ \theta_{enc} }[/math], to get [math]\displaystyle{ e(C(x),\ell) }[/math].
  3. Input the [math]\displaystyle{ e(C(x),\ell) }[/math] and [math]\displaystyle{ \ell }[/math] into the decoder parameterized with [math]\displaystyle{ \theta_{dec} }[/math], to get [math]\displaystyle{ \hat{x} \sim d(e(C(x),\ell),\ell) }[/math].

\begin{align} \mathcal{L}_{auto}(\theta_{enc}, \theta_{dec}, \mathcal{Z}, \ell) = E_{x\sim D_\ell, \hat{x}\sim d(e(C(x),\ell),\ell)}[\Delta(\hat{x},x)] \end{align}
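
Using the encoder/decoder sketch from the Methodology section, a one-batch estimate of this loss could look as follows (the batched corrupt helper, standing in for C, is an assumption of this sketch):

  import torch.nn.functional as F

  def l_auto(encoder, decoder, x, lang, corrupt):
      """One-batch estimate of L_auto(theta_enc, theta_dec, Z, l) for x ~ D_l.
      `corrupt` is a batched version of the noise model C sketched above."""
      z = encoder(corrupt(x), lang)                 # e(C(x), l)
      logits = decoder(z, lang, teacher=x)          # x-hat ~ d(e(C(x), l), l)
      # Delta(x-hat, x): token-level cross entropy against the clean sentence
      return F.cross_entropy(logits.reshape(-1, logits.size(-1)), x.reshape(-1))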

Loss Component 2: Cross Domain Translation Loss

This component captures the expected cross entropy loss between [math]\displaystyle{ x }[/math] and the reconstructed [math]\displaystyle{ \hat{x} }[/math] from the translation of [math]\displaystyle{ x }[/math], where [math]\displaystyle{ \hat{x} }[/math] is constructed as follows:

  1. Using the current iteration of the translation model [math]\displaystyle{ M }[/math], construct the translation [math]\displaystyle{ M(x) }[/math] in [math]\displaystyle{ \ell_2 }[/math], where [math]\displaystyle{ x }[/math] is from a language [math]\displaystyle{ \ell_1 }[/math]. (The initialization of [math]\displaystyle{ M }[/math] uses a different translation model, discussed later.)
  2. Construct [math]\displaystyle{ C(M(x)) }[/math], noisy version of translation [math]\displaystyle{ M(x) }[/math].
  3. Input [math]\displaystyle{ C(M(x)) }[/math] and language [math]\displaystyle{ \ell_2 }[/math] into the encoder parameterized with [math]\displaystyle{ \theta_{enc} }[/math], to get [math]\displaystyle{ e(C(M(x)),\ell_2) }[/math].
  4. Input [math]\displaystyle{ e(C(M(x)),\ell_2) }[/math] and [math]\displaystyle{ \ell_1 }[/math] into the decoder parameterized with [math]\displaystyle{ \theta_{dec} }[/math], to get [math]\displaystyle{ \hat{x} \sim d(e(C(M(x)),\ell_2),\ell_1) }[/math].

\begin{align} \mathcal{L}_{cd}(\theta_{enc}, \theta_{dec}, \mathcal{Z}, \ell_1,\ell_2) = E_{x\sim D_{\ell_1}, \hat{x}\sim d(e(C(M(x)),\ell_2),\ell_1)}[\Delta(\hat{x},x)] \end{align}
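
The corresponding sketch, where translate stands for the current translation model M (an assumed helper) and corrupt again stands in for C:

  import torch.nn.functional as F

  def l_cd(encoder, decoder, translate, x, l1, l2, corrupt):
      """One-batch estimate of L_cd(theta_enc, theta_dec, Z, l1, l2): translate
      x into l2 with the current model M, corrupt the translation, and train
      the model to recover the original x in l1."""
      y = translate(x, l1, l2)                      # M(x); no gradient through M
      z = encoder(corrupt(y), l2)                   # e(C(M(x)), l2)
      logits = decoder(z, l1, teacher=x)            # x-hat ~ d(e(C(M(x)), l2), l1)
      return F.cross_entropy(logits.reshape(-1, logits.size(-1)), x.reshape(-1))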

Loss Component 3: Adversarial Loss

A discriminator parameterized with [math]\displaystyle{ \theta_D }[/math] is trained to distinguish the language [math]\displaystyle{ \ell }[/math] given a vector [math]\displaystyle{ z }[/math] in the latent space. It is trained by minimizing the cross entropy loss [math]\displaystyle{ \mathcal{L}_D }[/math] between the predicted language and the ground-truth language that produced the vector [math]\displaystyle{ z }[/math].

The encoder is trained to fool the discriminator: the adversarial loss is minimized when, given an encoding of [math]\displaystyle{ x }[/math] in language [math]\displaystyle{ \ell_i }[/math], the discriminator predicts that it comes from the other language [math]\displaystyle{ \ell_j }[/math].

The end result at convergence is that the representation in the latent space for language [math]\displaystyle{ \ell_1 }[/math] is indistinguishable from language [math]\displaystyle{ \ell_2 }[/math].

\begin{align} \mathcal{L}_{adv}(\theta_{enc}, \mathcal{Z}|\theta_D) = -E_{x_i,\ell_i}[log p_D (\ell_j|e(x_i,\ell_i))] \end{align} with [math]\displaystyle{ \ell_j=\ell_1 }[/math] if [math]\displaystyle{ \ell_i=\ell_2 }[/math], and vice versa.
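
A sketch of the discriminator and the two losses it enters, given latent states z_src = e(x, src) and z_tgt = e(y, tgt) from the encoder sketched earlier (the layer sizes and the per-state binary formulation are illustrative choices of this summary):

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class Discriminator(nn.Module):
      """Small classifier p_D(l | z) applied to each latent state."""
      def __init__(self, hid_dim=600):
          super().__init__()
          self.net = nn.Sequential(nn.Linear(hid_dim, 1024), nn.LeakyReLU(0.2),
                                   nn.Linear(1024, 1))

      def forward(self, z):                         # z: (batch, m, hid_dim)
          return self.net(z).squeeze(-1)            # one "is src?" logit per state

  def disc_loss(disc, z, is_src):
      """Binary cross entropy of the per-state predictions against `is_src`."""
      logits = disc(z)
      return F.binary_cross_entropy_with_logits(logits,
                                                torch.full_like(logits, is_src))

  def l_discriminator(disc, z_src, z_tgt):
      """L_D: predict the true language of each latent state (encoder detached)."""
      return disc_loss(disc, z_src.detach(), 1.0) + disc_loss(disc, z_tgt.detach(), 0.0)

  def l_adv(disc, z_src, z_tgt):
      """L_adv: train the encoder so the discriminator predicts the *other*
      language, i.e. the labels are swapped relative to L_D."""
      return disc_loss(disc, z_src, 0.0) + disc_loss(disc, z_tgt, 1.0)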

Final Objective Loss Function

Combining all three components, the following objective function is obtained:

\begin{align*} \mathcal{L}(\theta_{enc}, \theta_{dec}, \mathcal{Z}) &= \lambda_{auto}[\mathcal{L}_{auto}(\theta_{enc}, \theta_{dec}, \mathcal{Z},src)+\mathcal{L}_{auto}(\theta_{enc}, \theta_{dec}, \mathcal{Z},tgt)]\\ &+ \lambda_{cd}[\mathcal{L}_{cd}(\theta_{enc}, \theta_{dec}, \mathcal{Z},src,tgt) +\mathcal{L}_{cd}(\theta_{enc}, \theta_{dec}, \mathcal{Z},tgt,src)]\\ &+\lambda_{adv}\mathcal{L}_{adv}(\theta_{enc}, \mathcal{Z}|\theta_D) \end{align*}

[math]\displaystyle{ \lambda_{auto} }[/math], [math]\displaystyle{ \lambda_{cd} }[/math], and [math]\displaystyle{ \lambda_{adv} }[/math] are hyperparameters that weight each component. The discriminator loss [math]\displaystyle{ \mathcal{L}_D }[/math] is minimized in parallel, since its parameters [math]\displaystyle{ \theta_{D} }[/math] appear in the last component.
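
Putting the previous sketches together, the combined objective for one batch pair might be computed as follows (the default weights are placeholders, not the paper's settings; [math]\displaystyle{ \mathcal{L}_D }[/math] itself is minimized by a separate update on [math]\displaystyle{ \theta_D }[/math], shown in the training sketch later):

  def total_loss(encoder, decoder, disc, x_src, x_tgt, translate, corrupt,
                 lam_auto=1.0, lam_cd=1.0, lam_adv=1.0):
      """Combined objective for one source batch and one target batch."""
      loss = lam_auto * (l_auto(encoder, decoder, x_src, 'src', corrupt) +
                         l_auto(encoder, decoder, x_tgt, 'tgt', corrupt))
      loss = loss + lam_cd * (l_cd(encoder, decoder, translate, x_src, 'src', 'tgt', corrupt) +
                              l_cd(encoder, decoder, translate, x_tgt, 'tgt', 'src', corrupt))
      loss = loss + lam_adv * l_adv(disc, encoder(x_src, 'src'), encoder(x_tgt, 'tgt'))
      return loss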

(Insert Figure 2 here.)

Training

The training is iterative: the translation model [math]\displaystyle{ M^{(t)} }[/math] improves at each time step [math]\displaystyle{ t }[/math]. To seed the training, [math]\displaystyle{ M^{(1)} }[/math] is the unsupervised word-by-word translation model proposed by [Conneau, 2017]. Each iteration of the training is as follows (a code sketch is given after the algorithm placeholder below):

  1. Use [math]\displaystyle{ M^{(t)} }[/math] to obtain a translation [math]\displaystyle{ M^{(t)}(x) }[/math].
  2. Use [math]\displaystyle{ M^{(t)}(x) }[/math] and [math]\displaystyle{ x }[/math] to train the auto-encoder, training the discriminator at the same time (i.e., minimizing the final objective function).
  3. Set [math]\displaystyle{ M^{(t+1)} }[/math] to the updated model and repeat.
(Insert Algorithm 1 here.)
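
A sketch of this loop built on the earlier loss sketches; the optimizer choices, the word_by_word_translate seed, and the greedy_decode inference helper are assumptions of this summary, not taken from the paper:

  import torch

  def train(encoder, decoder, disc, data_src, data_tgt,
            word_by_word_translate, corrupt, n_iterations):
      """Iterative training sketch. `word_by_word_translate` plays the role of
      M^(1) from [Conneau, 2017]; `greedy_decode` is a hypothetical inference
      helper (not shown) that runs the current encoder/decoder as a translator."""
      opt_model = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))
      opt_disc = torch.optim.RMSprop(disc.parameters())
      translate = word_by_word_translate                      # M^(1)
      for t in range(n_iterations):
          for x_src, x_tgt in zip(data_src, data_tgt):
              # steps 1-2: use M^(t) inside L_cd and minimize the final objective
              loss = total_loss(encoder, decoder, disc, x_src, x_tgt,
                                translate, corrupt)
              opt_model.zero_grad(); loss.backward(); opt_model.step()
              # train the discriminator on L_D in parallel
              d_loss = l_discriminator(disc, encoder(x_src, 'src'),
                                       encoder(x_tgt, 'tgt'))
              opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()
          # step 3: M^(t+1) is the updated encoder/decoder used for inference
          translate = lambda x, l1, l2: greedy_decode(encoder, decoder, x, l1, l2)
      return encoder, decoder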

Model Selection Criterion

In machine translation, the Bilingual Evaluation Understudy (BLEU) score is typically used to evaluate the quality of a translation against a reference (ground-truth) translation. However, since the training here is unsupervised and no parallel corpora are available, BLEU cannot be used during training to select hyperparameters.

The paper proposes a scoring method that correlates with BLEU. The main idea is to compute the BLEU score between [math]\displaystyle{ x }[/math] and its back-translated version (i.e., translate [math]\displaystyle{ x }[/math] to the other language with the model, then translate it back to the original language). With this it is possible to score the quality of the translation model without supervision.
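
A sketch of this criterion, averaging the round-trip BLEU over both translation directions (the bleu helper, e.g. a corpus-level BLEU implementation, is assumed):

  def unsupervised_score(translate, corpus_src, corpus_tgt, bleu):
      """Surrogate model-selection score: BLEU of the round-trip translation
      against the original sentences, averaged over both directions."""
      src_round_trip = [translate(translate(x, 'src', 'tgt'), 'tgt', 'src')
                        for x in corpus_src]
      tgt_round_trip = [translate(translate(y, 'tgt', 'src'), 'src', 'tgt')
                        for y in corpus_tgt]
      return 0.5 * (bleu(src_round_trip, corpus_src) +
                    bleu(tgt_round_trip, corpus_tgt))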

Results

Critique


Other Sources

References

  1. [Lample, 2018] Lample, G., Conneau, A., Ranzato, M., Denoyer, L., "Unsupervised Machine Translation Using Monolingual Corpora Only". arXiv:1711.00043
  2. [Conneau, 2017] Conneau, A., Lample, G., Ranzato, M., Denoyer, L., Jégou, H., "Word Translation without Parallel Data". arXiv:1710.04087