Unsupervised Machine Translation Using Monolingual Corpora Only

= Introduction =
The paper presents an unsupervised approach to machine translation that uses only monolingual corpora, without any alignment between sentences or documents. A monolingual corpus is a text corpus written in a single language. This contrasts with the usual translation setting, which relies on parallel corpora: two corpora that are direct translations of each other, aligned at the word or sentence level.


The general approach is to first use an unsupervised word-by-word translation model proposed by [Conneau, 2017], and then iteratively improve the translation by utilizing two components:


# A denoising auto-encoder to reconstruct noisy versions of sentences for both source and target languages.
# A discriminator to align the distributions of the source and target languages in a latent space.


= Background =


= Methodology =
 
The model is a sequence-to-sequence model with attention, without input-feeding. Both the encoder and the decoder are 3-layer LSTMs, and the encoder is bidirectional. The encoder and decoder are shared across languages: there is a single set of parameters for the encoder and another for the decoder, regardless of which language is being processed.
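The following is a minimal PyTorch sketch (not the authors' code) of this sharing idea: one 3-layer bidirectional LSTM encoder and one 3-layer LSTM decoder serve both languages, with the language determining only which embedding table and output projection are used. The dot-product attention, dimensions, and names are illustrative assumptions, not the paper's exact setup.

<pre>
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """A single encoder used for both languages; only the embedding table differs."""
    def __init__(self, vocab_sizes, emb_dim=300, hid_dim=300, n_layers=3):
        super().__init__()
        # one embedding table per language (Z^src, Z^tgt); the LSTM weights are shared
        self.emb = nn.ModuleDict({l: nn.Embedding(v, emb_dim) for l, v in vocab_sizes.items()})
        self.lstm = nn.LSTM(emb_dim, hid_dim, num_layers=n_layers,
                            bidirectional=True, batch_first=True)

    def forward(self, x, lang):
        # x: (batch, m) token ids; returns z: (batch, m, 2 * hid_dim)
        z, _ = self.lstm(self.emb[lang](x))
        return z

class SharedDecoder(nn.Module):
    """A single decoder used for both languages, with a simple dot-product attention."""
    def __init__(self, vocab_sizes, emb_dim=300, hid_dim=300, n_layers=3):
        super().__init__()
        self.emb = nn.ModuleDict({l: nn.Embedding(v, emb_dim) for l, v in vocab_sizes.items()})
        self.lstm = nn.LSTM(emb_dim, 2 * hid_dim, num_layers=n_layers, batch_first=True)
        self.out = nn.ModuleDict({l: nn.Linear(4 * hid_dim, v) for l, v in vocab_sizes.items()})

    def forward(self, y_prev, z, lang):
        # y_prev: (batch, k) previous (teacher-forced) target tokens; z: encoder states
        h, _ = self.lstm(self.emb[lang](y_prev))              # (batch, k, 2 * hid_dim)
        attn = torch.softmax(h @ z.transpose(1, 2), dim=-1)   # (batch, k, m) attention weights
        ctx = attn @ z                                        # (batch, k, 2 * hid_dim)
        return self.out[lang](torch.cat([h, ctx], dim=-1))    # logits over lang's vocabulary

# illustrative usage: the same encoder/decoder parameters serve both languages
vocab_sizes = {"src": 10000, "tgt": 10000}
enc, dec = SharedEncoder(vocab_sizes), SharedDecoder(vocab_sizes)
x = torch.randint(0, 10000, (2, 7))        # a batch of 2 sentences of 7 token ids
logits = dec(x, enc(x, "src"), "src")      # (2, 7, 10000)
</pre>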


The objective function proposed by the paper is a combination of three component objective functions:
# Reconstruction loss of the denoising auto-encoder
# Cross domain translation loss of the auto-encoder
# Adversarial cross entropy loss of the discriminator


==Notations==
<math>\mathcal{W}_S, \mathcal{W}_T </math> are the sets of words (vocabularies) in the source and target language domains.


<math>\mathcal{Z}^S , \mathcal{Z}^T </math> are the sets of word embeddings in the source and target language domains.


<math>\ell \in \{src, tgt\} </math> denotes the source or target language.


<math>x \in \mathbb{R}^m</math> is a vector of m words in a particular language <math>\ell</math>


<math>e_{\theta_{enc},\mathcal{Z}}(x, \ell)</math> is the encoder, parameterized by <math>\theta_{enc}</math>. It takes <math>x</math> and <math>\ell</math> as input and computes <math>z \in \mathbb{R}^m</math>, a sequence of m hidden states, using the embedding <math>\mathcal{Z}^{\ell} </math>.


<math>d_{\theta_{dec},\mathcal{Z}}(z, \ell)</math> is the decoder, parameterized by <math>\theta_{dec}</math>. It takes <math>z</math> and <math>\ell</math> as input and computes <math>y \in \mathbb{R}^k</math>, a sequence of k words from the vocabulary <math>\mathcal{W}^{\ell}</math>.


==Noise Model==


The noise model <math>C(x)</math> used throughout the paper produces a randomly sampled noisy version of sentence <math>x</math>. Noise is added in two ways:
# Randomly dropping each word in the sentence with probability <math>p_{wd}</math>.
# Slightly shuffling the words in the sentence where each word can be at most <math>k</math> positions away from its original position.


The authors found <math>p_{wd}= 0.1 </math> and <math>k=3</math> to work well in practice.
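A minimal sketch of such a noise model in plain Python is shown below. The word dropout follows the description above; the local shuffle is implemented here by adding uniform noise in <math>[0, k+1)</math> to each position and re-sorting, one common way to keep every word within roughly <math>k</math> positions of where it started (an implementation assumption, not a quote from the paper).

<pre>
import random

def add_noise(sentence, p_wd=0.1, k=3):
    """Return a noisy copy C(x) of a tokenized sentence x (a list of words).

    1. Each word is dropped independently with probability p_wd.
    2. Words are slightly shuffled: each word gets a score equal to its position
       plus uniform noise in [0, k + 1), and the words are re-ordered by score,
       which keeps every word within about k positions of its original place.
    """
    # word dropout
    kept = [w for w in sentence if random.random() > p_wd]
    if not kept:                       # avoid returning an empty sentence
        kept = [random.choice(sentence)] if sentence else []
    # local shuffle
    scores = [i + random.uniform(0, k + 1) for i in range(len(kept))]
    return [w for _, w in sorted(zip(scores, kept))]

# illustrative usage
print(add_noise("the cat sat on the mat".split(), p_wd=0.1, k=3))
</pre>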


==Loss Component 1: Reconstruction Loss==


This component captures the expected cross entropy loss between <math>x</math> and the reconstructed <math>\hat{x}</math>, where <math>\hat{x}</math> is constructed as follows:
# Construct <math>C(x)</math>, noisy version of <math>x</math> from a language <math>\ell</math>
# Input <math>C(x)</math> and language <math>\ell</math> into the encoder parameterized with <math>\theta_{enc}</math>, to get <math>e(C(x),\ell)</math>.
# Input <math>e(C(x),\ell)</math> and <math>\ell</math> into the decoder parameterized with <math>\theta_{dec}</math>, to get <math>\hat{x} \sim d(e(C(x),\ell),\ell)</math>.


\begin{align}
\mathcal{L}_{auto}(\theta_{enc}, \theta_{dec}, \mathcal{Z}, \ell) = E_{x\sim D_\ell,\, \hat{x}\sim d(e(C(x),\ell),\ell)}[\Delta(\hat{x},x)]
\end{align}
Here <math>D_\ell</math> is the monolingual dataset for language <math>\ell</math>, and <math>\Delta(\hat{x},x)</math> is the sum of token-level cross-entropy losses between the reconstruction <math>\hat{x}</math> and the original sentence <math>x</math>.
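As a rough illustration (PyTorch assumed; the tensors and names below are placeholders, not the paper's code), the term reduces to an ordinary token-level cross-entropy between the decoder's output on the corrupted input and the original sentence:

<pre>
import torch
import torch.nn.functional as F

def reconstruction_loss(decoder_logits, x):
    """L_auto term for one batch.

    decoder_logits: (batch, m, vocab) scores from d(e(C(x), l), l)
    x:              (batch, m) original token ids (before corruption)
    Delta is taken here as the mean token-level cross-entropy.
    """
    return F.cross_entropy(decoder_logits.transpose(1, 2), x)

# illustrative usage with random tensors standing in for the encoder/decoder
batch, m, vocab = 2, 7, 100
x = torch.randint(0, vocab, (batch, m))        # original sentence x
fake_logits = torch.randn(batch, m, vocab)     # stands in for d(e(C(x), l), l)
print(reconstruction_loss(fake_logits, x).item())
</pre>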


==Loss Component 2: Cross Domain Translation Loss==
 
This component captures the expected cross entropy loss between <math>x</math> and the reconstructed <math>\hat{x}</math> from the translation of <math>x</math>, where <math>\hat{x}</math> is constructed as follows:
# Using the current iteration of the translation model <math>M</math>, construct the translation <math>M(x)</math> in language <math>\ell_2</math>, where <math>x</math> comes from language <math>\ell_1</math>. (The initialization of <math>M</math> is a different translation model, discussed later.)
# Construct <math>C(M(x))</math>, noisy version of translation <math>M(x)</math>.
# Input <math>C(M(x))</math> and language <math>\ell_2</math> into the encoder parameterized with <math>\theta_{enc}</math>, to get <math>e(C(M(x)),\ell_2)</math>.
# Input <math>e(C(M(x)),\ell_2)</math> and <math>\ell_1</math> into the decoder parameterized with <math>\theta_{dec}</math>, to get <math>\hat{x} \sim d(e(C(M(x)),\ell_2),\ell_1)</math>.


\begin{align}
\mathcal{L}_{cd}(\theta_{enc}, \theta_{dec}, \mathcal{Z}, \ell_1,\ell_2) = E_{x\sim D_{\ell_1},\, \hat{x}\sim d(e(C(M(x)),\ell_2),\ell_1)}[\Delta(\hat{x},x)]
\end{align}
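The cross-domain term reuses the same machinery as the reconstruction term; the only difference is which language tag goes where. A hedged sketch of the composition is below, with the encoder, decoder, translation model, noise model, and token loss all passed in as placeholder callables (illustrative names, not the paper's API):

<pre>
def cross_domain_loss(x, l1, l2, M, C, encoder, decoder, token_loss):
    """L_cd(theta_enc, theta_dec, Z, l1, l2) for a sentence/batch x from language l1.

    M:            current translation model, maps x in l1 to a sentence in l2
    C:            noise model
    encoder(s, l), decoder(z, l): the shared encoder / decoder with a language tag
    token_loss:   cross-entropy between the decoder output and the original x (Delta)
    """
    y = M(x)                      # step 1: translate x into l2 with the current model
    y_noisy = C(y)                # step 2: corrupt the translation
    z = encoder(y_noisy, l2)      # step 3: encode it as an l2 sentence
    x_hat = decoder(z, l1)        # step 4: decode back into l1
    return token_loss(x_hat, x)   # compare the reconstruction with the original x
</pre>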


==Loss Component 3: Adversarial Loss==


A discriminator parameterized by <math>\theta_D</math> is trained to identify the language <math>\ell</math> that produced a vector <math>z</math> in the latent space. It is trained by minimizing the cross-entropy loss <math>\mathcal{L}_D</math> between the predicted language and the ground-truth language that produced <math>z</math>.


The encoder is trained to fool the discriminator: the adversarial loss is minimized when, given an encoding of <math>x</math> in language <math>\ell_i</math>, the discriminator predicts that it comes from <math>\ell_j</math>.


The end result at convergence is that the representation in the latent space for language <math>\ell_1</math> is indistinguishable from language <math>\ell_2</math>.


\begin{align}
\mathcal{L}_{adv}(\theta_{enc}, \mathcal{Z}|\theta_D) = -E_{x_i,\ell_i}[\log p_D (\ell_j|e(x_i,\ell_i))]
\end{align}
with <math>\ell_j=\ell_1</math> if <math>\ell_i=\ell_2</math>, and vice versa.
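A minimal PyTorch sketch of the two adversarial objectives is given below. The discriminator architecture, latent dimension, and the use of a single latent vector per example are illustrative assumptions (the paper's exact discriminator is not reproduced here); detaching <math>z</math> in <math>\mathcal{L}_D</math> is one common way to make sure that loss only updates the discriminator's parameters.

<pre>
import torch
import torch.nn as nn
import torch.nn.functional as F

# latent vectors z of dimension d; language ids: 0 = src, 1 = tgt (illustrative)
d = 600
discriminator = nn.Sequential(
    nn.Linear(d, 256), nn.LeakyReLU(),
    nn.Linear(256, 2),               # logits over {src, tgt}
)

def discriminator_loss(z, lang_id):
    """L_D: predict the true language that produced z (z is detached so this
    loss only updates the discriminator's parameters theta_D)."""
    logits = discriminator(z.detach())
    return F.cross_entropy(logits, lang_id)

def adversarial_loss(z, lang_id):
    """L_adv: the encoder is trained so that the discriminator predicts the
    *other* language (labels are flipped: l_j = 1 - l_i)."""
    logits = discriminator(z)
    return F.cross_entropy(logits, 1 - lang_id)

# illustrative usage with random latent states standing in for e(x, l)
z = torch.randn(8, d)                    # 8 latent vectors
lang_id = torch.randint(0, 2, (8,))      # their true languages
print(discriminator_loss(z, lang_id).item(), adversarial_loss(z, lang_id).item())
</pre>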


==Final Objective Loss Function==


Combining all three components, the following objective function is obtained:


\begin{align*}
\mathcal{L}(\theta_{enc}, \theta_{dec}, \mathcal{Z}) &= \lambda_{auto}[\mathcal{L}_{auto}(\theta_{enc}, \theta_{dec}, \mathcal{Z},src)+\mathcal{L}_{auto}(\theta_{enc}, \theta_{dec}, \mathcal{Z},tgt)]\\
&+ \lambda_{cd}[\mathcal{L}_{cd}(\theta_{enc}, \theta_{dec}, \mathcal{Z},src,tgt) +\mathcal{L}_{cd}(\theta_{enc}, \theta_{dec}, \mathcal{Z},tgt,src)]\\
&+\lambda_{adv}\mathcal{L}_{adv}(\theta_{enc}, \mathcal{Z}|\theta_D)
\end{align*}


<math>\lambda_{auto}</math>, <math>\lambda_{cd}</math>, and <math>\lambda_{adv}</math> are hyperparameters that weight each component. The discriminator loss <math>\mathcal{L}_D</math> is minimized in parallel (with respect to <math>\theta_{D}</math>), since <math>\theta_{D}</math> appears in the last component.
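For completeness, a sketch of how the weighted combination might look in code; the individual loss values come from whatever implements the three terms above, and the lambda values here are placeholders, not the paper's tuned settings.

<pre>
# hyperparameter weights for the three components (placeholder values)
lambda_auto, lambda_cd, lambda_adv = 1.0, 1.0, 1.0

def total_loss(L_auto_src, L_auto_tgt, L_cd_src_tgt, L_cd_tgt_src, L_adv):
    """Final objective minimized w.r.t. the encoder/decoder parameters.
    The discriminator loss L_D is minimized separately, w.r.t. theta_D only."""
    return (lambda_auto * (L_auto_src + L_auto_tgt)
            + lambda_cd * (L_cd_src_tgt + L_cd_tgt_src)
            + lambda_adv * L_adv)
</pre>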


**insert Figure 2 here**


==Training==


The training is iterative: the translation model <math>M^{(t)}</math> improves at each iteration <math>t</math>. To seed the training, <math>M^{(1)}</math> is the unsupervised word-by-word translation model proposed by [Conneau, 2017]. Each training iteration proceeds as follows (a sketch of the loop is given after the list):


# Use <math>M^{(t)}</math> to obtain a translation <math>M^{(t)}(x)</math>.
# Use <math>M^{(t)}(x)</math> and <math>x</math> to train the auto-encoder, training the discriminator at the same time (i.e. minimizing the final objective function).
# The updated encoder and decoder define <math>M^{(t+1)}</math>; repeat.
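A schematic of this loop in Python is shown below. Every argument is a placeholder callable standing in for a piece described above (batching, decoding, and optimizer details are omitted); this is a sketch of the procedure, not the paper's implementation.

<pre>
def train_unsupervised_mt(sample_batch, word_by_word_model, encoder, decoder,
                          train_step, n_iterations, steps_per_iteration):
    """Sketch of the iterative training loop; all arguments are placeholder callables.

    sample_batch(lang)            -> a batch of monolingual sentences from D_lang
    word_by_word_model(x, l1, l2) -> seed translation M^(1) from [Conneau, 2017]
    encoder(x, l), decoder(z, l)  -> the shared encoder / decoder
    train_step(...)               -> one gradient step on the final objective
                                     (with L_D minimized in parallel)
    """
    M = word_by_word_model                              # M^(1)
    for t in range(n_iterations):
        for _ in range(steps_per_iteration):
            x_src, x_tgt = sample_batch("src"), sample_batch("tgt")
            # 1. obtain translations with the current model M^(t)
            y_tgt, y_src = M(x_src, "src", "tgt"), M(x_tgt, "tgt", "src")
            # 2. train the auto-encoder and discriminator on (x, M^(t)(x)) pairs
            train_step(x_src, x_tgt, y_tgt, y_src)
        # 3. M^(t+1) is the updated encoder/decoder used as a translator
        #    (generation/decoding details are abstracted away here)
        M = lambda x, l_in, l_out: decoder(encoder(x, l_in), l_out)
    return M
</pre>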


**insert Algorithm 1 here**


==Model Selection Criterion==


In machine translation, the Bilingual Evaluation Understudy (BLEU) score is typically used to evaluate the quality of a translation against a reference (ground-truth) translation. However, since training here is unsupervised and uses no parallel corpora, BLEU cannot be used during training to select hyper-parameters.


The paper proposes a scoring method that correlates with BLEU. The main idea is to compute the BLEU score between <math> x </math> and its back-translated version produced by the model (i.e. translate <math> x </math> into the target language and then translate the result back into the source language). This makes it possible to score the quality of the translation model without supervision.
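A hedged sketch of this criterion follows: translate each monolingual sentence to the other language and back with the current model, then score the round trip against the original sentence. NLTK's <code>corpus_bleu</code> is used here as one available BLEU implementation, and the <code>translate</code> callable is a placeholder for the current model <math>M</math>.

<pre>
from nltk.translate.bleu_score import corpus_bleu

def round_trip_bleu(sentences, translate, src="src", tgt="tgt"):
    """Unsupervised model-selection score for one translation direction.

    sentences: list of tokenized monolingual sentences in the source language
    translate: placeholder callable translate(sentence, from_lang, to_lang)
               implemented by the current model M
    Returns the corpus BLEU between each sentence and its src -> tgt -> src
    round-trip reconstruction; the score can be averaged over both directions.
    """
    back_translated = [translate(translate(s, src, tgt), tgt, src) for s in sentences]
    references = [[s] for s in sentences]        # one reference per sentence
    return corpus_bleu(references, back_translated)

# illustrative usage with an identity "translator" (real use passes the model M)
toy = [["the", "cat", "sat", "on", "the", "mat"],
       ["this", "is", "a", "small", "toy", "example"]]
print(round_trip_bleu(toy, lambda s, a, b: s))   # identity round trip scores 1.0
</pre>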


= Results =


= Critique =




= Other Sources =


= References =
#'''[Lample, 2018]''' Lample, G., Conneau, A., Ranzato, M., Denoyer, L., "Unsupervised Machine Translation Using Monolingual Corpora Only". arXiv:1711.00043
 
#'''[Conneau, 2017]''' Conneau, A., Lample, G., Ranzato, M., Denoyer, L., Jégou, H., "Word Translation without Parallel Data". arXiv:1710.04087
