statwiki - User contributions [US]

DON'T DECAY THE LEARNING RATE , INCREASE THE BATCH SIZE

2018-11-29T20:15:24Z

Lwali: /* STOCHASTIC GRADIENT DESCENT AND CONVEX OPTIMIZATION */

Summary of the ICLR 2018 paper: '''Don't Decay the learning Rate, Increase the Batch Size '''

Link: [[https://arxiv.org/pdf/1711.00489.pdf]]

Summarized by: Afify, Ahmed [ID: 20700841]

==INTUITION==
Nowadays, it is a common practice to not have a singular steady learning rate for the learning phase of the neural network models. Instead, we use adaptive learning rates with the standard gradient descent method. The intuition behind this is that when we are far away from the minima it is beneficial for us to take large steps towards it as it would require a lesser number of steps to reach but as we approach it our step size should decrease otherwise we may just keep oscillating around the minima. In practice, this is generally achieved by methods like SGD with momentum, Nesterov momentum, and Adam. However, the core claim of this paper is that the same effect can be achieved by increasing the batch size during the gradient descent process while keeping the learning rate constant throughout. In addition, the paper argues that such an approach also reduces the parameter updates required to reach the minima, thus leading to greater parallelism and shorter training times.

== INTRODUCTION ==
Although stochastic gradient descent (SGD) is widely used in deep learning training process due to finding minima that generalizes well(Zhang et al., 2016; Wilson et al., 2017), the optimization process is slow and takes lots of time. According to (Goyal et al., 2017; Hoffer et al., 2017; You et al., 2017a), this has motivated researchers to try to speed up this optimization process by taking bigger steps, and hence reduce the number of parameter updates in training a model by using large batch training, which can be divided across many machines.

However, increasing the batch size leads to decreasing the test set accuracy (Keskar et al., 2016; Goyal et al., 2017). Smith and Le (2017) believed that SGD has a scale of random fluctuations <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, N number of training samples, and B batch size. They concluded that there is an optimal batch size proportional to the learning rate when <math> B \ll N </math>, and optimum fluctuation scale g for a maximum test set accuracy.

In this paper, the authors' main goal is to provide evidence that increasing the batch size is quantitatively equivalent to decreasing the learning rate with the same number of training epochs in decreasing the scale of random fluctuations, but with remarkably less number of parameter updates. Moreover, an additional reduction in the number of parameter updates can be attained by increasing the learning rate and scaling <math> B \propto \epsilon </math> or even more reduction by increasing the momentum coefficient and scaling <math> B \propto \frac{1}{1-m} </math> although the later decreases the test accuracy. This has been demonstrated by several experiments on the ImageNet and CIFAR-10 datasets using ResNet-50 and Inception-ResNet-V2 architectures respectively.

== STOCHASTIC GRADIENT DESCENT AND CONVEX OPTIMIZATION ==
As mentioned in the previous section, the drawback of SGD when compared to full-batch training is the noise that it introduces that hinders optimization. According to (Robbins & Monro, 1951), there are two equations that govern how to reach the minimum of a convex function: (<math> \epsilon_i </math> denotes the learning rate at the <math> i^{th} </math> gradient update)

<math> \sum_{i=1}^{\infty} \epsilon_i = \infty </math>. This equation guarantees that we will reach the minimum

<math> \sum_{i=1}^{\infty} \epsilon^2_i < \infty </math>. This equation, which is valid only for a fixed batch size, guarantees that learning rate decays fast enough allowing us to reach the minimum rather than bouncing due to noise.

These equations indicate that the learning rate must decay during training, and second equation is only available when the batch size is constant. To change the batch size, Smith and Le (2017) proposed to interpret SGD as integrating this stochastic differential equation <math> \frac{dw}{dt} = -\frac{dC}{dw} + \eta(t) </math>, where C represents cost function, w represents the parameters, and η represents the Gaussian random noise. Furthermore, they proved that noise scale g controls the magnitude of random fluctuations in the training dynamics by this formula: <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, N is the training set size and B is the batch size. As we usually have <math> B \ll N </math>, we can define <math> g \approx \epsilon \frac{N}{B} </math>. This explains why when the learning rate decreases, noise g decreases, enabling us to converge to the minimum of the cost function. However, increasing the batch size has the same effect and makes g decays with constant learning rate. In this work, the batch size is increased until <math> B \approx \frac{N}{10} </math>, then the conventional way of decaying the learning rate is followed.

== SIMULATED ANNEALING AND THE GENERALIZATION GAP ==
'''Simulated Annealing:''' Introducing random noise or fluctuations whose scale falls during training.

'''Generalization Gap:''' Small batch data generalizes better to the test set than large batch data.

Smith and Le (2017) found that there is an optimal batch size which corresponds to optimal noise scale g <math> (g \approx \epsilon \frac{N}{B}) </math> and concluded that <math> B_{opt} \propto \epsilon N </math> that corresponds to maximum test set accuracy. This means that gradient noise is helpful as it makes SGD escape sharp minima, which does not generalize well.

Simulated Annealing is a famous technique in non-convex optimization. Starting with noise in the training process helps us to discover a wide range of parameters then once we are near the optimum value, noise is reduced to fine tune our final parameters. However, more and more researches like to use the sharper decay schedules like cosine decay or step-function drops. In physical sciences, slowly annealing (or decaying) the temperature (which is the noise scale in this situation) helps to converge to the global minimum, which is sharp. But decaying the temperature in discrete steps can make the system stuck in a local minimum, which lead to higher cost and lower curvature. The authors think that deep learning has the same intuition.
.

== THE EFFECTIVE LEARNING RATE AND THE ACCUMULATION VARIABLE ==
'''The Effective Learning Rate''' <math> \epsilon_eff = \frac{\epsilon}{1-m} </math>

Smith and Le (2017) included momentum to the equation of the vanilla SGD noise scale that was defined above to be: <math> g = \frac{\epsilon}{1-m}(\frac{N}{B}-1)\approx \frac{\epsilon N}{B(1-m)} </math>, which is the same as the previous equation when m goes to 0. They found that increasing the learning rate and momentum coefficient and scaling <math> B \propto \frac{\epsilon }{1-m} </math> reduces the number of parameter updates, but the test accuracy decreases when the momentum coefficient is increased.

To understand the reasons behind this, we need to analyze momentum update equations below:

<math>
\Delta A = -(1-m)A + \frac{d\widehat{C}}{dw}
</math>

<math>
\Delta w = -A\epsilon
</math>

We can see that the Accumulation variable A, which is initially set to 0, then increases exponentially to reach its steady state value during <math> \frac{B}{N(1-m)} </math> training epochs while <math> \Delta w </math> is suppressed that can reduce the rate of convergence. Moreover, at high momentum, we have three challenges:

1- Additional epochs are needed to catch up with the accumulation.

2- Accumulation needs more time <math> \frac{B}{N(1-m)} </math> to forget old gradients.

3- After this time, however, the accumulation cannot adapt to changes in the loss landscape.

4- In the early stage, large batch size will lead to the instabilities.

== EXPERIMENTS ==
=== SIMULATED ANNEALING IN A WIDE RESNET ===

'''Dataset:''' CIFAR-10 (50,000 training images)

'''Network Architecture:''' “16-4” wide ResNet

'''Training Schedules used as in the below figure:'''

- Decaying learning rate: learning rate decays by a factor of 5 at a sequence of “steps”, and the batch size is constant

- Increasing batch size: learning rate is constant, and the batch size is increased by a factor of 5 at every step.

- Hybrid: At the beginning, the learning rate is constant and batch size is increased by a factor of 5. Then, the learning rate decays by a factor of 5 at each subsequent step, and the batch size is constant. This is the schedule that will be used if there is a hardware limit affecting a maximum batch size limit.

[[File:Paper_40_Fig_1.png | 800px|center]]

As shown in the below figure: in the left figure (2a), we can observe that for the training set, the three learning curves are exactly the same while in figure 2b, increasing the batch size has a huge advantage of reducing the number of parameter updates.
This concludes that noise scale is the one that needs to be decayed and not the learning rate itself
[[File:Paper_40_Fig_2.png | 800px|center]]

To make sure that these results are the same for the test set as well, in figure 3, we can see that the three learning curves are exactly the same for SGD with momentum, and Nesterov momentum
[[File:Paper_40_Fig_3.png | 800px|center]]

To check for other optimizers as well. the below figure shows the same experiment as in figure 3, which is the three learning curves for test set, but for vanilla SGD and Adam, and showing
[[File:Paper_40_Fig_4.png | 800px|center]]

'''Conclusion:''' Decreasing the learning rate and increasing the batch size during training are equivalent

=== INCREASING THE EFFECTIVE LEARNING RATE===

'''Dataset:''' CIFAR-10 (50,000 training images)

'''Network Architecture:''' “16-4” wide ResNet

'''Training Parameters:''' Optimization Algorithm: SGD with momentum / Maximum batch size = 5120

'''Training Schedules:'''

Four training schedules, all of which decay the noise scale by a factor of five in a series of three steps with the same number of epochs.

Original training schedule: initial learning rate of 0.1 which decays by a factor of 5 at each step, a momentum coefficient of 0.9, and a batch size of 128.

Increasing batch size: learning rate of 0.1, momentum coefficient of 0.9, initial batch size of 128 that increases by a factor of 5 at each step.

Increased initial learning rate: initial learning rate of 0.5, initial batch size of 640 that increase during training.

Increased momentum coefficient: increased initial learning rate of 0.5, initial batch size of 3200 that increase during training, and an increased momentum coefficient of 0.98.

The results of all training schedules, which are presented in the below figure, are documented in the following table:

[[File:Paper_40_Table_1.png | 800px|center]]

[[File:Paper_40_Fig_5.png | 800px|center]]

'''Conclusion:''' Increasing the effective learning rate and scaling the batch size results in further reduction in the number of parameter updates

=== TRAINING IMAGENET IN 2500 PARAMETER UPDATES===

'''A) Experiment Goal:''' Control Batch Size

'''Dataset:''' ImageNet (1.28 million training images)

The paper modified the setup of Goyal et al. (2017), and used the following configuration:

'''Network Architecture:''' Inception-ResNet-V2

'''Training Parameters:'''

90 epochs / noise decayed at epoch 30, 60, and 80 by a factor of 10 / Initial ghost batch size = 32 / Learning rate = 3 / momentum coefficient = 0.9 / Initial batch size = 8192

Two training schedules were used:

“Decaying learning rate”, where batch size is fixed and the learning rate is decayed

“Increasing batch size”, where batch size is increased to 81920 then the learning rate is decayed at two steps.

[[File:Paper_40_Table_2.png | 800px|center]]

[[File:Paper_40_Fig_6.png | 800px|center]]

'''Conclusion:''' Increasing the batch size resulted in reducing the number of parameter updates from 14,000 to 6,000.

'''B) Experiment Goal:''' Control Batch Size and Momentum Coefficient

'''Training Parameters:''' Ghost batch size = 64 / noise decayed at epoch 30, 60, and 80 by a factor of 10.

The below table shows the number of parameter updates and accuracy for different set of training parameters:

[[File:Paper_40_Table_3.png | 800px|center]]

[[File:Paper_40_Fig_7.png | 800px|center]]

'''Conclusion:''' Increasing the momentum reduces the number of parameter updates, but leads to a drop in the test accuracy.

=== TRAINING IMAGENET IN 30 MINUTES===

'''Dataset:''' ImageNet (Already introduced in the previous section)

'''Network Architecture:''' ResNet-50

The paper replicated the setup of Goyal et al. (2017) while modifying the number of TPU devices, batch size, learning rate, and then calculating the time to complete 90 epochs, and measuring the accuracy, and performed the following experiments below:

[[File:Paper_40_Table_4.png | 800px|center]]

'''Conclusion:''' Model training times can be reduced by increasing the batch size during training.

== RELATED WORK ==
Main related work mentioned in the paper is as follows:

- Smith & Le (2017) interpreted Stochastic gradient descent as stochastic differential equation, which the paper built on this idea to include decaying learning rate.

- Mandt et al. (2017) analyzed how SGD perform in Bayesian posterior sampling.

- Keskar et al. (2016) focused on the analysis of noise once the training is started.

- Moreover, the proportional relationship between batch size and learning rate was first discovered by Goyal et al. (2017) and successfully trained ResNet-50 on ImageNet in one hour after discovering the proportionality relationship between batch size and learning rate.

- Furthermore, You et al. (2017a) presented Layer-wise Adaptive Rate Scaling (LARS), which is appling different learning rates to train ImageNet in 14 minutes and 74.9% accuracy.

- Finally, another strategy called Asynchronous-SGD that allowed (Recht et al., 2011; Dean et al., 2012) to use multiple GPUs even with small batch sizes.

== CONCLUSIONS ==
Increasing batch size during training has the same benefits of decaying the learning rate in addition to reducing the number of parameter updates, which corresponds to faster training time. Experiments were performed on different image datasets and various optimizers with different training schedules to prove this result. The paper proposed to increase increase the learning rate and momentum parameter m, while scaling <math> B \propto \frac{\epsilon}{1-m} </math>, which achieves fewer parameter updates, but slightly less test set accuracy as mentioned in details in the experiments’ section. In summary, on ImageNet dataset, Inception-ResNet-V2 achieved 77% validation accuracy in under 2500 parameter updates, and ResNet-50 achieved 76.1% validation set accuracy on TPU in less than 30 minutes. One of the great findings of this paper is that literature parameters were used, and no hyper parameter tuning was needed.

== CRITIQUE ==
'''Pros:'''

- The paper showed empirically that increasing batch size and decaying learning rate are equivalent.

- Several experiments were performed on different optimizers such as SGD and Adam.

- Had several comparisons with previous experimental setups.

'''Cons:'''

- All datasets used are image datasets. Other experiments should have been done on datasets from different domains to ensure generalization.

- The number of parameter updates was used as a comparison criterion, but wall-clock times could have provided additional measurable judgment although they depend on the hardware used.

- Special hardware is needed for large batch training, which is not always feasible.

- In section 5.2 (Increasing the Effective Learning rate), the authors did not test a range of learning rate values and used only (0.1 and 0.5). Additional results from varying the initial learning rate values from 0.1 to 3.2 are provided in the appendix, which indicates that the test accuracy begins to fall for initial learning rates greater than ~0.4. The appended results do not show validation set accuracy curves like in Figure 6, however. It would be beneficial to see if they were similar to the original 0.1 and 0.5 initial learning rate baselines.

- Although the main idea of the paper is interesting, its results does not seem to be too surprising in comparison with other recent papers in the subject.

- The paper could benefit from using some other models to demonstrate its claim and generalize its idea by adding some comparisons with other models as well as other recent methods to increase batch size.

- The paper presents interesting ideas. However, it lacks of mathematical and theoretical analysis beyond the idea. Since the experiment is primary on image dataset and it does not provide sufficient theories, the paper itself presents limited applicability to other types.

== REFERENCES ==
- Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.

- Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates.arXiv preprint arXiv:1612.05086, 2016.

- L´eon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning.arXiv preprint arXiv:1606.04838, 2016.

- Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012.

- Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.

- Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.

- Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231, 2012.

- Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting.SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.

- Priya Goyal, Piotr Doll´ar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

- Sepp Hochreiter and J¨urgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.

- Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.

- Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. ACM, 2017.

- Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

- Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.

- Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. arXiv preprint arXiv:1511.06251, 2017.

- Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with restarts. arXiv preprint arXiv:1608.03983, 2016.

- Stephan Mandt, Matthew D Hoffman, and DavidMBlei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.

- James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.

- Yurii Nesterov. A method of solving a convex programming problem with convergence rate o (1/k2). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.

- Lutz Prechelt. Early stopping-but when? Neural Networks: Tricks of the trade, pp. 553–553, 1998.

- Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pp. 693–701, 2011.

- Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.

- Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.

- Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pp. 4278–4284, 2017.

- Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.

- Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.

- Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017a.

- Yang You, Zhao Zhang, C Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes. CoRR, abs/1709.05011, 2017b.

- Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Fix your classifier: the marginal value of training the last weight layer

2018-11-29T20:06:49Z

Lwali: /* Language Modelling */

=Introduction=

Deep neural networks have become a widely used model for machine learning, achieving state-of-the-art results on many tasks. The most common task these models are used for is to perform classification, as in the case of convolutional neural networks (CNNs) used to classify images to a semantic category. Typically, a learned affine transformation is placed at the end of such models, yielding a per-class value used for classification. This classifier can have
a vast number of parameters, which grows linearly with the number of possible classes, thus requiring increasingly more computational resources.

=Brief Overview=

In order to alleviate the aforementioned problem, the authors propose that the final layer of the classifier be fixed (upto a global scale constant). They argue that with little or no loss of accuracy for most classification tasks, the method provides significant memory and computational benefits. In addition, they show that by initializing the classifier with a Hadamard matrix the inference could be made faster as well.

=Previous Work=

Training NN models and using them for inference requires large amounts of memory and computational resources; thus, extensive amount of research has been done lately to reduce the size of networks which are as follows:

* Weight sharing and specification (Han et al., 2015)

* Mixed precision to reduce the size of the neural networks by half (Micikevicius et al., 2017)

* Low-rank approximations to speed up CNN (Tai et al., 2015)

* Quantization of weights, activations and gradients to further reduce computation during training (Hubara et al., 2016b; Li et al., 2016 and Zhou et al., 2016)

Some of the past works have also put forward the fact that predefined (Park & Sandberg, 1991) and random (Huang et al., 2006) projections can be used together with a learned affine transformation to achieve competitive results on many of the classification tasks. However, the authors' proposal in the current paper is quite reversed.

=Background=

Convolutional neural networks (CNNs) are commonly used to solve a variety of spatial and temporal tasks. CNNs are usually composed of a stack of convolutional parameterized layers, spatial pooling layers and fully connected layers, separated by non-linear activation functions. Earlier architectures of CNNs (LeCun et al., 1998; Krizhevsky et al., 2012) used a set of fully-connected layers at later stage of the network, presumably to allow classification based on global features of an image.

== Shortcomings of the final classification layer and its solution ==

Despite the enormous number of trainable parameters these layers added to the model, they are known to have a rather marginal impact on the final performance of the network (Zeiler & Fergus, 2014).

It has been shown previously that these layers could be easily compressed and reduced after a model was trained by simple means such as matrix decomposition and sparsification (Han et al., 2015). Modern architecture choices are characterized with the removal of most of the fully connected layers (Lin et al., 2013; Szegedy et al., 2015; He et al., 2016), that lead to better generalization and overall accuracy, together with a huge decrease in the number of trainable parameters. Additionally, numerous works showed that CNNs can be trained in a metric learning regime (Bromley et al., 1994; Schroff et al., 2015; Hoffer & Ailon, 2015), where no explicit classification layer was introduced and the objective regarded only distance measures between intermediate representations. Hardt & Ma (2017) suggested an all-convolutional network variant, where they kept the original initialization of the classification layer fixed with no negative impact on performance on the CIFAR-10 dataset.

=Proposed Method=

The aforementioned works provide evidence that fully-connected layers are in fact redundant and play a small role in learning and generalization. In this work, the authors have suggested that parameters used for the final classification transform are completely redundant, and can be replaced with a predetermined linear transform. This holds for even in large-scale models and classification tasks, such as recent architectures trained on the ImageNet benchmark (Deng et al., 2009).

==Using a fixed classifier==

Suppose the final representation obtained by the network (the last hidden layer) is represented as <math>x = F(z;\theta)</math> where <math>F</math> is assumed to be a deep neural network with input z and parameters θ, e.g., a convolutional network, trained by backpropagation.

In common NN models, this representation is followed by an additional affine transformation, <math>y = W^T x + b</math> ,where <math>W</math> and <math>b</math> are also trained by back-propagation.

For input <math>x</math> of <math>N</math> length, and <math>C</math> different possible outputs, <math>W</math> is required to be a matrix of <math>N ×
C</math>. Training is done using cross-entropy loss, by feeding the network outputs through a softmax activation

<math>
v_i = \frac{e^{y_i}}{\sum_{j}^{C}{e^{y_j}}}, i ∈ </math> { <math> {1, . . . , C} </math> }

and reducing the expected negative log likelihood with respect to ground-truth target <math> t ∈ </math> { <math> {1, . . . , C} </math> },
by minimizing the loss function:

<math>
L(x, t) = − log {v_t} = −{w_t}·{x} − b_t + log ({\sum_{j}^{C}e^{w_j . x + b_j}})
</math>

where <math>w_i</math> is the <math>i</math>-th column of <math>W</math>.

==Choosing the projection matrix==

To evaluate the conjecture regarding the importance of the final classification transformation, the trainable parameter matrix <math>W</math> is replaced with a fixed orthonormal projection <math> Q ∈ R^{N×C} </math>, such that <math> ∀ i ≠ j : q_i · q_j = 0 </math> and <math> || q_i ||_{2} = 1 </math>, where <math>q_i</math> is the <math>i</math>th column of <math>Q</math>. This is ensured by a simple random sampling and singular-value decomposition

As the rows of classifier weight matrix are fixed with an equally valued <math>L_{2}</math> norm, we find it beneficial
to also restrict the representation of <math>x</math> by normalizing it to reside on the <math>n</math>-dimensional sphere:

<center><math>
\hat{x} = \frac{x}{||x||_{2}}
</math></center>

This allows faster training and convergence, as the network does not need to account for changes in the scale of its weights. However, it has now an issue that <math>q_i · \hat{x} </math> is bounded between −1 and 1. This causes convergence issues, as the softmax function is scale sensitive, and the network is affected by the inability to re-scale its input. This issue is amended with a fixed scale <math>T</math> applied to softmax inputs <math>f(y) = softmax(\frac{1}{T}y)</math>, also known as the ''softmax temperature''. However, this introduces an additional hyper-parameter which may differ between networks and datasets. So, the authors propose to introduce a single scalar parameter <math>\alpha</math> to learn the softmax scale, effectively functioning as an inverse of the softmax temperature <math>\frac{1}{T}</math>. The normalized weights and an additional scale coefficient are also used, specially using a single scale for all entries in the weight matrix. The additional vector of bias parameters <math>b ∈ R^{C}</math> is kept the same and the model is trained using the traditional negative-log-likelihood criterion. Explicitly, the classifier output is now:

<center>
<math>
v_i=\frac{e^{\alpha q_i · \hat{x} + b_i}}{\sum_{j}^{C} e^{\alpha q_j · \hat{x} + b_j}}, i ∈ </math> { <math> {1,...,C} </math>}
</center>

and the loss to be minimized is:

<center><math>
L(x, t) = -\alpha q_t · \frac{x}{||x||_{2}} + b_t + log (\sum_{i=1}^{C} exp((\alpha q_i · \frac{x}{||x||_{2}} + b_i)))
</math></center>

where <math>x</math> is the final representation obtained by the network for a specific sample, and <math> t ∈ </math> { <math> {1, . . . , C} </math> } is the ground-truth label for that sample. The behaviour of the parameter <math> \alpha </math> over time, which is logarithmic in nature and has the same behavior exhibited by the norm of a learned classifier, is shown in
[[Media: figure1_log_behave.png| Figure 1]].

<center>[[File:figure1_log_behave.png]]</center>

When <math> -1 \le q_i · \hat{x} \le 1 </math>, a possible cosine angle loss is

<center>[[File:caloss.png]]</center>

But its final validation accuracy has slight decrease, compared to original models.

==Using a Hadmard matrix==

To recall, Hadmard matrix (Hedayat et al., 1978) <math> H </math> is an <math> n × n </math> matrix, where all of its entries are either +1 or −1.
Furthermore, <math> H </math> is orthogonal, such that <math> HH^{T} = nI_n </math> where <math>I_n</math> is the identity matrix. Instead of using the entire Hadmard matrix <math>H</math>, a truncated version, <math> \hat{H} ∈ </math> {<math> {-1, 1}</math>}<math>^{C \times N}</math> where all <math>C</math> rows are orthogonal as the final classification layer is such that:

<center><math>
y = \hat{H} \hat{x} + b
</math></center>

This usage allows two main benefits:
* A deterministic, low-memory and easily generated matrix that can be used for classification.
* Removal of the need to perform a full matrix-matrix multiplication - as multiplying by a Hadamard matrix can be done by simple sign manipulation and addition.

Here, <math>n</math> must be a multiple of 4, but it can be easily truncated to fit normally defined networks. Also, as the classifier weights are fixed to need only 1-bit precision, it is now possible to focus our attention on the features preceding it.

=Experimental Results=

The authors have evaluated their proposed model on the following datasets:

==CIFAR-10/100==

===About the dataset===

CIFAR-10 is an image classification benchmark dataset containing 50,000 training images and 10,000 test images. The images are in color and contain 32×32 pixels. There are 10 possible classes of various animals and vehicles. CIFAR-100 holds the same number of images of same size, but contains 100 different classes.

===Training Details===

The authors trained a residual network ( He et al., 2016) on the CIFAR-10 dataset. The network depth was 56 and the same hyper-parameters as in the original work were used. A comparison of the two variants, i.e., the learned classifier and the proposed classifier with a fixed transformation is shown in [[Media: figure1_resnet_cifar10.png | Figure 2]].

<center>[[File: figure1_resnet_cifar10.png]]</center>

These results demonstrate that although the training error is considerably lower for the network with learned classifier, both models achieve the same classification accuracy on the validation set. The authors conjecture is that with the new fixed parameterization, the network can no longer increase the
norm of a given sample’s representation - thus learning its label requires more effort. As this may happen for specific seen samples - it affects only training error.

The authors also compared using a fixed scale variable <math>\alpha </math> at different values vs. the learned parameter. Results for <math> \alpha = </math> {0.1, 1, 10} are depicted in [[Media: figure3_alpha_resnet_cifar.png| Figure 3]] for both training and validation error and as can be seen, similar validation accuracy can be obtained using a fixed scale value (in this case <math>\alpha </math>= 1 or 10 will suffice) at the expense of another hyper-parameter to seek. In all the further experiments the scaling parameter <math> \alpha </math> was regularized with the same weight decay coefficient used on original classifier.

<center>[[File: figure3_alpha_resnet_cifar.png]]</center>

The authors then train the model on CIFAR-100 dataset. They used the DenseNet-BC model from Huang et al. (2017) with depth of 100 layers and k = 12. The higher number of classes caused the number of parameters to grow and encompassed about 4% of the whole model. However, validation accuracy for the fixed-classifier model remained equally good as the original model, and the same training curve was observed as earlier.

==IMAGENET==

===About the dataset===

The Imagenet dataset introduced by Deng et al. (2009) spans over 1000 visual classes, and over 1.2 million samples. This is supposedly a more challenging dataset to work on as compared to CIFAR-10/100.

===Experiment Details===

The authors evaluated their fixed classifier method on Imagenet using Resnet50 by He et al. (2016) and Densenet169 model (Huang et al., 2017) as described in the original work. Using a fixed classifier removed approximately 2-million parameters were from the model, accounting for about 8% and 12 % of the model parameters respectively. The experiments revealed similar trends as observed on CIFAR-10.

For a more stricter evaluation, the authors also trained a Shufflenet architecture (Zhang et al., 2017b), which was designed to be used in low memory and limited computing platforms and has parameters making up the majority of the model. They were able to reduce the parameters to 0.86 million as compared to 0.96 million parameters in the final layer of the original model. Again, the proposed modification in the original model gave similar convergence results on validation accuracy.

The overall results of the fixed-classifier are summarized in [[Media: table1_fixed_results.png | Table 1]].

<center>[[File: table1_fixed_results.png]]</center>

==Language Modelling==

Recent works have empirically found that using the same weights for both word embedding and classifier can yield equal or better results than using a separate pair of weights. So the authors experimented with fix-classifiers on language modelling as it also requires classification of all possible tokens available in the task vocabulary. They trained a recurrent model with 2-layers of LSTM (Hochreiter & Schmidhuber, 1997) and embedding + hidden size of 512 on the WikiText2 dataset (Merity et al., 2016), using same settings as in Merity et al. (2017). However, using a random orthogonal transform yielded poor results compared to learned embedding. This was suspected to be due to semantic relationships captured in the embedding layer of language models, which is not the case in image classification task. The intuition was further confirmed by the much better results when pre-trained embeddings using word2vec algorithm by Mikolov et al. (2013) or PMI factorization as suggested by Levy & Goldberg (2014), were used.

=Discussion=

==Implications and use cases==

With the increasing number of classes in the benchmark datasets, computational demands for the final classifier will increase as well. In order to understand the problem better, the authors observe the work by Sun et al. (2017), which introduced JFT-300M - an internal Google dataset with over 18K different classes. Using a Resnet50 (He et al., 2016), with a 2048 sized representation led to a model with over 36M parameters meaning that over 60% of the model parameters resided in the final classification layer. Sun et al. (2017) also describe the difficulty in distributing so many parameters over the training servers involving a non-trivial overhead during synchronization of the model for update. The authors claim that the fixed-classifier would help considerably in this kind of scenario - where using a fixed classifier removes the need to do any gradient synchronization for the final layer. Furthermore, introduction of Hadamard matrix removes the need to save the transformation altogether, thereby, making it more efficient and allowing considerable memory and computational savings.

==Possible Caveats==

The good performance of fixed-classifiers relies on the ability of the preceding layers to learn separable representations. This could be affected when when the ratio between learned features and number of classes is small – that is, when <math> C > N</math>. However, they tested their method in such cases and their model performed well and provided good results.
Another factor that can affect the performance of their model using a fixed classifier is when the classes are highly correlated. In that case, the fixed classifier actually cannot support correlated classes and thus, the network could have some difficulty to learn. For a language model, word classes tend to have highly correlated instances, which also lead to difficult learning process.

==Future Work==

The use of fixed classifiers might be further simplified in Binarized Neural Networks (Hubara et al., 2016a), where the activations and weights are restricted to ±1 during propagations. In that case the norm of the last hidden layer would be constant for all samples (equal to the square root of the hidden layer width). The constant could then be absorbed into the scale constant <math>\alpha</math>, and there is no need in a per-sample normalization.

Additionally, more efficient ways to learn a word embedding should also be explored where similar redundancy in classifier weights may suggest simpler forms of token representations - such as low-rank or sparse versions.

A related paper was published that claims that fixing most of the parameters of the neural network achieves comparable results with learning all of them [A. Rosenfeld and J. K. Tsotsos]

=Conclusion=

In this work, the authors argue that the final classification layer in deep neural networks is redundant and suggest removing the parameters from the classification layer. The empirical results from experiments on the CIFAR and IMAGENET datasets suggest that such a change lead to little or almost no decline in the performance of the architecture. Furthermore, using a Hadmard matrix as classifier might lead to some computational benefits when properly implemented, and save memory otherwise spent on large amount of transformation coefficients.

Another possible scope of research that could be pointed out for future could be to find new efficient methods to create pre-defined word embeddings, which require huge amount of parameters that can possibly be avoided when learning a new task. Therefore, more emphasis should be given to the representations learned by the non-linear parts of the neural networks - upto the final classifier, as it seems highly redundant.

=Critique=

The paper proposes an interesting idea that has a potential use case when designing memory-efficient neural networks. The experiments shown in the paper are quite rigorous and provide support to the authors' claim. However, it would have been more helpful if the authors had described a bit more about efficient implementation of the Hadamard matrix and how to scale this method for larger datasets (cases with <math> C >N</math>).

The paper presents a very interesting idea which can be applied for most of networks in deep learning area. However, technical proofs of the effect of the algorithm should be considered in order to generalize and be appreciated further.

=References=

The code for the proposed model is available at https://github.com/eladhoffer/fix_your_classifier.

Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.

Peter Bartlett, Dylan J Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1706.08498, 2017.

Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Sackinger, and Roopak Shah. Signature verification using a” siamese” time delay neural network. In Advances in Neural Information Processing Systems, pp. 737–744, 1994.

Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131, 2015.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.

Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro. Implicit regularization in matrix factorization. arXiv preprint arXiv:1705.09280, 2017.

Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.

Moritz Hardt and Tengyu Ma. Identity matters in deep learning. 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.

A Hedayat, WD Wallis, et al. Hadamard matrices and their applications. The Annals of Statistics, 6
(6):1184–1238, 1978.

Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. ¨ Neural computation, 9(8): 1735–1780, 1997.

Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84–92. Springer, 2015.

Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. 2017.

Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: theory and applications. Neurocomputing, 70(1):489–501, 2006.

Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems 29 (NIPS’16), 2016a.

Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016b.

Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462, 2016.

Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.

Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.

Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to ´ document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998.

Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, pp. 2177–2185, 2014.

Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.

Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and Optimizing LSTM Language Models. arXiv preprint arXiv:1708.02182, 2017.

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed tations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119, 2013.

Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning. arXiv preprint arXiv:1706.08947, 2017.
Jooyoung Park and Irwin W Sandberg. Universal approximation using radial-basis-function networks. Neural computation, 3(2):246–257, 1991.

Ofir Press and Lior Wolf. Using the output embedding to improve language models. EACL 2017,
pp. 157, 2017.

Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pp. 774–782, 2016.

Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.

Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, 2015.

Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.

Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.

Daniel Soudry and Elad Hoffer. Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.

Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. 2018.

Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.

Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. arXiv preprint arXiv:1707.02968, 2017.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.

Cheng Tai, Tong Xiao, Yi Zhang, Xiaogang Wang, et al. Convolutional neural networks with lowrank regularization. arXiv preprint arXiv:1511.06067, 2015.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2017.
Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.

Bo Xie, Yingyu Liang, and Le Song. Diversity leads to generalization in neural networks. arXiv preprint arXiv:1611.03131, 2016.

Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017a. URL https://arxiv.org/abs/1611.03530.

Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017b.

Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.

A. Rosenfeld and J. K. Tsotsos, “Intriguing properties of randomly weighted networks: Generalizing while learning next to nothing,” arXiv preprint arXiv:1802.00844, 2018.

learn what not to learn

2018-11-29T19:45:36Z

Lwali: /* Action Elimination */

=Introduction=
In reinforcement learning, it is often difficult for an agent to learn when the action space is large, especially the difficulties from function approximation and exploration. In some cases many actions are irrelevant and it is sometimes easier for the algorithm to learn which action not to take. The paper proposes a new reinforcement learning approach for dealing with large action spaces based on action elimination by restricting the available actions in each state to a subset of the most likely ones. There is a core assumption that this method can be easier to predict which actions in each state are invalid or inferior and use that information to control. More specifically, it proposes a system that learns the approximation of a Q-function and concurrently learns to eliminate actions. The method utilizes an external elimination signal which incorporates domain-specific prior knowledge. For example, in parser-based text games, the parser gives feedback regarding irrelevant actions after the action is played (e.g., Player: "Climb the tree." Parser: "There are no trees to climb"). Then a machine learning model can be trained to generalize to unseen states.

The paper focuses on tasks where both states and the actions are natural language. It introduces a novel deep reinforcement learning approach which has a Deep Q-Network (DQN) and an Action Elimination Network (AEN), both using the Convolutional Neural Networks (CNN) for Natural Language Processing (NLP) tasks. The AEN is trained to predict invalid actions, supervised by the elimination signal from the environment. '''Note that the core assumption is that it is easy to predict which actions are invalid or inferior in each state and leverage that information for control.'''

The text-based game called "Zork", which lets players to interact with a virtual world through a text based interface, is tested by using the elimination framework. The AEN algorithm has achieved faster learning rate than the baseline agents through eliminating irrelevant actions.

Below shows an example for the Zork interface:

[[File:lnottol_fig1.png|500px|center]]

All states and actions are given in natural language. Input for the game contains more than a thousand possible actions in each state since player can type anything.

=Related Work=
Text-Based Games(TBG): The state of the environment in TBG is described by simple language. The player interacts with the environment with text command which respects a pre-defined grammar. A popular example is Zork which has been tested in the paper. TBG is a good research intersection of RL and NLP, it requires language understanding, long-term memory, planning, exploration, affordability extraction and common sense. It also often introduce stochastic dynamics to increase randomness.

Representations for TBG: Good word representation is necessary in order to learn control policies from texts. Previous work on TBG used pre-trained embeddings directly for control. other works combined pre-trained embedding with neural networks.

DRL with linear function approximation: DRL methods such as the DQN have achieved state-of-the-art results in a variety of challenging, high-dimensional domains. This is mainly because neural networks can learn rich domain representations for value function and policy. On the other hand, linear representation batch reinforcement learning methods are more stable and accurate, while feature engineering is necessary.

RL in Large Action Spaces: Prior work concentrated on factorizing the action space into binary subspace(Pazis and Parr, 2011; Dulac-Arnold et al., 2012; Lagoudakis and Parr, 2003), other works proposed to embed the discrete actions into a continuous space, then choose the nearest discrete action according to the optimal actions in the continuous space(Dulac-Arnold et al., 2015; Van Hasselt and Wiering, 2009). He et. al. (2015)extended DQN to unbounded(natural language) action spaces.
Learning to eliminate actions was first mentioned by (Even-Dar, Mannor, and Mansour, 2003). They proposed to learn confidence intervals around the value function in each state. Lipton et al.(2016a) proposed to learn a classifier that detects hazardous state and then use it to shape the reward. Fulda et al.(2017) presented a method for affordability extraction via inner products of pre-trained word embedding.

=Action Elimination=

The approach in the paper builds on the standard Reinforcement Learning formulation. At each time step <math>t</math>, the agent observes state <math display="inline">s_t </math> and chooses a discrete action <math display="inline">a_t\in\{1,...,|A|\} </math>. Then the agent obtains a reward <math display="inline">r_t(s_t,a_t) </math> and observes next state <math display="inline">s_{t+1} </math> according to a transition kernel <math>P(s_{t+1}|s_t,a_t)</math>. The goal of the algorithm is to learn a policy <math display="inline">\pi(a|s) </math> which maximizes the expected future discounted cumulative return <math display="inline">V^\pi(s)=E^\pi[\sum_{t=0}^{\infty}\gamma^tr(s_t,a_t)|s_0=s]</math>, where <math> 0< \gamma <1 </math>. The Q-function is <math display="inline">Q^\pi(s,a)=E^\pi[\sum_{t=0}^{\infty}\gamma^tr(s_t,a_t)|s_0=s,a_0=a]</math>, and it can be optimized by Q-learning algorithm.

After executing an action, the agent observes a binary elimination signal <math>e(s, a)</math> to determine which actions not to take. It equals 1 if action <math>a</math> may be eliminated in state <math>s</math> (and 0 otherwise). The signal helps mitigating the problem of large discrete action spaces. We start with the following definitions:

'''Definition 1:'''

Valid state-action pairs with respect to an elimination signal are state action pairs which the elimination process should not eliminate.

The set of valid state-action pairs contains all of the state-action pairs that are a part of some optimal policy, i.e., only strictly suboptimal state-actions can be invalid.

'''Definition 2:'''

Admissible state-action pairs with respect to an elimination algorithm are state action pairs which the elimination algorithm does not eliminate.

'''Definition 3:'''

Action Elimination Q-learning is a Q-learning algorithm which updates only admissible state-action pairs and chooses the best action in the next state from its admissible actions. We allow the base Q-learning algorithm to be any algorithm that converges to <math display="inline">Q^*</math> with probability 1 after observing each state-action infinitely often.

==Advantages of Action Elimination==
The main advantages of action elimination is that it allows the agent to overcome some of the main difficulties in large action spaces which are Function Approximation and Sample Complexity.

Function approximation: Errors in the Q-function estimates may cause the learning algorithm to converge to a suboptimal policy, this phenomenon becomes more noticeable when the action space is large. Action elimination mitigate this effect by taking the max operator only on valid actions, thus, reducing potential overestimation errors. Besides, by ignoring the invalid actions, the function approximation can also learn a simpler mapping (i.e., only the Q-values of the valid state-action pairs) leading to faster convergence and better solution.

Sample complexity: The sample complexity measures the number of steps during learning, in which the policy is not <math display="inline">\epsilon</math>-optimal. Assume that there are <math>A'</math> actions that should be eliminated and are <math>\epsilon</math>-optimal, i.e. their value is at least <math>V^*(s)-\epsilon</math>. The invalid action often returns no reward and doesn't change the state, (Lattimore and Hutter, 2012)resulting in an action gap of <math display="inline">\epsilon=(1-\gamma)V^*(s)</math>, and this translates to <math display="inline">V^*(s)^{-2}(1-\gamma)^{-5}log(1/\delta)</math> wasted samples for learning each invalid state-action pair. Practically, elimination algorithm can eliminate these invalid actions and therefore speed up the learning process approximately by <math display="inline">A/A'</math>.

Because it is difficult to embedde the elimination signal into the MDP, the authors use contextual multi-armed bandits to decouple the elimination signal from the MDP, which can correctly eliminate actions when applying standard Q learning into learning process.

==Action elimination with contextual bandits==

Let <math display="inline">x(s_t)\in R^d </math> be the feature representation of <math display="inline">s_t </math>. We assume that under this representation there exists a set of parameters <math display="inline">\theta_a^*\in R_d </math> such that the elimination signal in state <math display="inline">s_t </math> is <math display="inline">e_t(s_t,a) = \theta_a^Tx(s_t)+\eta_t </math>, where <math display="inline"> \Vert\theta_a^*\Vert_2\leq S</math>. <math display="inline">\eta_t</math> is an R-subgaussian random variable with zero mean that models additive noise to the elimination signal. When there is no noise in the elimination signal, R=0. Otherwise, <math display="inline">R\leq 1</math> since the elimination signal is bounded in [0,1]. Assume the elimination signal satisfies: <math display="inline">0\leq E[e_t(s_t,a)]\leq l </math> for any valid action and <math display="inline"> u\leq E[e_t(s_t, a)]\leq 1</math> for any invalid action. And <math display="inline"> l\leq u</math>. Denote by <math display="inline">X_{t,a}</math> as the matrix whose rows are the observed state representation vectors in which action a was chosen, up to time t. <math display="inline">E_{t,a}</math> as the vector whose elements are the observed state representation elimination signals in which action a was chosen, up to time t. Denote the solution to the regularized linear regression <math display="inline">\Vert X_{t,a}\theta_{t,a}-E_{t,a}\Vert_2^2+\lambda\Vert \theta_{t,a}\Vert_2^2 </math> (for some <math display="inline">\lambda>0</math>) by <math display="inline">\hat{\theta}_{t,a}=\bar{V}_{t,a}^{-1}X_{t,a}^TE_{t,a} </math>, where <math display="inline">\bar{V}_{t,a}=\lambda I + X_{t,a}^TX_{t,a}</math>.

According to Theorem 2 in (Abbasi-Yadkori, Pal, and Szepesvari, 2011), <math display="inline">|\hat{\theta}_{t,a}^{T}x(s_t)-\theta_a^{*T}x(s_t)|\leq\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)} \forall t>0</math>, where <math display="inline">\sqrt{\beta_t(\delta)}=R\sqrt{2log(det(\bar{V}_{t,a}^{1/2})det(\lambda I)^{-1/2}/\delta)}+\lambda^{1/2}S</math>, with probability of at least <math display="inline">1-\delta</math>. If <math display="inline">\forall s \Vert x(s)\Vert_2 \leq L</math>, then <math display="inline">\beta_t</math> can be bounded by <math display="inline">\sqrt{\beta_t(\delta)} \leq R \sqrt{dlog(1+tL^2/\lambda/\delta)}+\lambda^{1/2}S</math>. Next, define <math display="inline">\tilde{\delta}=\delta/k</math> and bound this probability for all the actions. i.e., <math display="inline">\forall a,t>0</math>

<math display="inline">Pr(|\hat{\theta}_{t,a}^{T}x(s_t)-\theta_a^{*T}x(s_t)|\leq\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)}) \leq 1-\delta</math>

Recall that <math display="inline">E[e_t(s,a)]=\theta_a^{*T}x(s_t)\leq l</math> if a is a valid action. Then we can eliminate action a at state <math display="inline">s_t</math> if it satisfies:

<math display="inline">\hat{\theta}_{t,a}^{T}x(s_t)-\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)})>l</math>

with probability <math display="inline">1-\delta</math> that we never eliminate any valid action. Note that <math display="inline">l, u</math> are not known. In practice, choosing <math display="inline">l</math> to be 0.5 should suffice.

==Concurrent Learning==
In fact, Q-learning and contextual bandit algorithms can learn simultaneously, resulting in the convergence of both algorithms, i.e., finding an optimal policy and a minimal valid action space.

If the elimination is done based on the concentration bounds of the linear contextual bandits, it can be ensured that Action Elimination Q-learning converges, as shown in Proposition 1.

'''Proposition 1:'''

Assume that all state action pairs (s,a) are visited infinitely often, unless eliminated according to <math display="inline">\hat{\theta}_{t-1,a}^Tx(s)-\sqrt{\beta_{t-1}(\tilde{\delta})x(s)^T\bar{V}_{t-1,a}^{-1}x(s))}>l</math>. Then, with a probability of at least <math display="inline">1-\delta</math>, action elimination Q-learning converges to the optimal Q-function for any valid state-action pairs. In addition, actions which should be eliminated are visited at most <math display="inline">T_{s,a}(t)\leq 4\beta_t/(u-l)^2
+1</math> times.

Notice that when there is no noise in the elimination signal(R=0), we correctly eliminate actions with probability 1. so invalid actions will be sampled a finite number of times.

=Method=
The assumption that <math display="inline">e_t(s_t,a)=\theta_a^{*T}x(s_t)+\eta_t </math> might not hold when using raw features like word2vec. So the paper proposes to use the neural network's last layer as features. A practical challenge here is that the features must be fixed over time when used by the contextual bandit. So batch-updates framework(Levine et al., 2017;Riquelme, Tucker, and Snoek, 2018) is used, where a new contextual bandit model is learned for every few steps that uses the last layer activation of the AEN as features.

==Architecture of action elimination framework==

[[File:lnottol_fig1b.png|300px|center]]

After taking action <math display="inline">a_t</math>, the agent observes <math display="inline">(r_t,s_{t+1},e_t)</math>. The agent use it to learn two function approximation deep neural networks: A DQN and an AEN. AEN provides an admissible actions set <math display="inline">A'</math> to the DQN, which uses this set to decide how to act and learn. The architecture for both the AEN and DQN is an NLP CNN(100 convolutional filters for AEN and 500 for DQN, with three different 1D kernels of length (1,2,3)), based on(Kim, 2014). The state is represented as a sequence of words, composed of the game descriptor and the player's inventory. These are truncated or zero padded to a length of 50 descriptor + 15 inventory words and each word is embedded into continuous vectors using word2vec in <math display="inline">R^{300}</math>. The features of the last four states are then concatenated together such that the final state representations s are in <math display="inline">R^{78000}</math>. The AEN is trained to minimize the MSE loss, using the elimination signal as a label. The code, the Zork domain, and the implementation of the elimination signal can be found [https://github.com/TomZahavy/CB_AE_DQN here.]

==Psuedocode of the Algorithm==

[[File:lnottol_fig2.png|750px|center]]

AE-DQN trains two networks: a DQN denoted by Q and an AEN denoted by E. The algorithm creates a linear contextual bandit model from it every L iterations with procedure AENUpdate(). This procedure uses the activations of the last hidden layer of E as features, which are then used to create a contextual linear bandit model.AENUpdate() then solved this model and plugin it into the target AEN. The contextual linear bandit model <math display="inline">(E^-,V)</math> is then used to eliminate actions via the ACT() and Target() functions. ACT() follows an <math display="inline">\epsilon</math>-greedy mechanism on the admissible actions set. For exploitation, it selects the action with highest Q-value by taking an argmax on Q-values among <math display="inline">A'</math>. For exploration, it selects an action uniformly from <math display="inline">A'</math>. The targets() procedure is estimating the value function by taking max over Q-values only among admissible actions, hence, reducing function approximation errors.

=Experiment=
==Zork domain==
The world of Zork presents a rich environment with a large state and action space.
Zork players describe their actions using natural language instructions. For example, "open the mailbox". Then their actions were processed by a sophisticated natural language parser. Based on the results, the game presents the outcome of the action. The goal of Zork is to collect the Twenty Treasures of Zork and install them in the trophy case. Points that are generated from the game's scoring system are given to the agent as the reward. For example, the player gets the points when solving the puzzles. Placing all treasures in the trophy will get 350 points. The elimination signal is given in two forms, "wrong parse" flag, and text feedback "you cannot take that". These two signals are grouped together into a single binary signal which then provided to the algorithm.

Experiments begin with the two subdomains of Zork domains: Egg Quest and the Troll Quest. For these subdomains, an additional reward signal is provided to guide the agent towards solving specific tasks and make the results more visible. A reward of -1 is applied at every time step to encourage the agent to favor short paths. Each trajectory terminates is upon completing the quest or after T steps are taken. The discounted factor for training is <math display="inline">\gamma=0.8</math> and <math display="inline">\gamma=1</math> during evaluation. Also <math display="inline">\beta=0.5, l=0.6</math> in all experiments.

===Egg Quest===
The goal for this quest is to find and open the jewel-encrusted egg hidden on a tree in the forest. The agent will get 100 points upon completing this task. For action space, there are 9 fixed actions for navigation, and a second subset which consisting <math display="inline">N_{Take}</math> actions for taking possible objects in the game. <math display="inline">N_{Take}=200 (set A_1), N_{Take}=300 (set A_2)</math> has been tested separately.
AE-DQN (blue) and a vanilla DQN agent (green) has been tested in this quest.

[[File:AEF_zork_comparison.png|1200px|thumb|center|Performance of agents in the egg quest.]]

Figure a) corresponds to the set <math display="inline">A_1</math>, with T=100, b) corresponds to the set <math display="inline">A_2</math>, with T=100, and c) corresponds to the set <math display="inline">A_2</math>, with T=200. Both agents has performed well on sets a and c. However the AE-DQN agent has learned much faster than the DQN on set b, which implies that action elimination is more robust to hyperparameter optimization when the action space is large. One important observation to note is that the three figures have different scales for the cumulative reward. While the AE-DQN outperformed the standard DQN in figure b, both models performed significantly better with the hyperparameter configuration in figure c.

===Troll Quest===
The goal of this quest is to find the troll. To do it the agent need to find the way to the house, use a lantern to expose the hidden entrance to the underworld. It will get 100 points upon achieving the goal. This quest is a larger problem than Egg Quest. The action set <math display="inline">A_1</math> is 200 take actions and 15 necessary actions, 215 in total.

[[File:AEF_troll_comparison.png|400px|thumb|center|Results in the Troll Quest.]]

The red line above is an "optimal elimination" baseline which consists of only 35 actions(15 essential, and 20 relevant take actions). We can see that AE-DQN still outperforms DQN and its improvement over DQN is more significant in the Troll Quest than the Egg quest. Also, it achieves compatible performance to the "optimal elimination" baseline.

===Open Zork===
Lastly, the "Open Zork" domain has been tested which only the environment reward has been used. 1M steps has been trained. Each trajectory terminates after T=200 steps. Two action sets have been used:<math display="inline">A_3</math>, the "Minimal Zork" action set, which is the minimal set of actions (131) that is required to solve the game. <math display="inline">A_4</math>, the "Open Zork" action set (1227) which composed of {Verb, Object} tuples for all the verbs and objects in the game.

[[]]

[[File:AEF_open_zork_comparison.png|600px|thumb|center|Results in "Open Zork".]]

The above Figure shows the learning curve for both AE-DQN and DQN. We can see that AE-DQN (blue) still outperform the DQN (blue) in terms of speed and cumulative reward.

=Conclusion=
In this paper, the authors proposed a Deep Reinforcement Learning model for sub-optimal actions while performing Q-learning. Moreover, they showed that by eliminating actions, using linear contextual bandits with theoretical guarantees of convergence, the size of the action space is reduced, exploration is more effective, and learning is improved when tested on Zork, a text-based game.

For future work the authors aim to investigate more sophisticated architectures and tackle learning shared representations for elimination and control which may boost performance on both tasks.

They also hope to to investigate other mechanisms for action elimination, such as eliminating actions that result from low Q-values as in Even-Dar, Mannor, and Mansour, 2003.

The authors also hope to generate elimination signals in real-world domains and achieve the purpose of eliminating the signal implicitly.

=Critique=
The paper is not a significant algorithmic contribution and it merely adds an extra layer of complexity to the most famous DQN algorithm. All the experimental domains considered in the paper are discrete action problems that have so many actions that it could have been easily extended to a continuous action problem. In continuous action space there are several policy gradient based RL algorithms that have provided stronger performances. The authors should have ideally compared their methods to such algorithms like PPO or DRPO.
Even with the critique above, the paper presents mathematical/theoretical justifications of the methodology. Moreover, since the methodology is built on the standard RL framework, this means that other variant RL algorithms can apply the idea to decrease the complexity and increase the performance. Moreover, the there are some rooms for applying technical variantions for the algorithm.

=Reference=

CapsuleNets

2018-11-29T16:56:20Z

Lwali: /* Dynamic Routing */

The paper "Dynamic Routing Between Capsules" was written by three researchers at Google Brain: Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. This paper was published and presented at the 31st Conference on Neural Information Processing Systems (NIPS 2017) in Long Beach, California. The same three researchers recently published a highly related paper "Matrix Capsules with EM Routing" for ICLR 2018.

=Motivation=

Ever since AlexNet eclipsed the performance of competing architectures in the 2012 ImageNet challenge, convolutional neural networks have maintained their dominance in computer vision applications. Despite the recent successes and innovations brought about by convolutional neural networks, some assumptions made in these networks are perhaps unwarranted and deficient. Using a novel neural network architecture, the authors create CapsuleNets, a network that they claim is able to learn image representations in a more robust, human-like manner. With only a 3 layer capsule network, they achieved near state-of-the-art results on MNIST.
==Adversarial Examples==

First discussed by Christian Szegedy et. al. in late 2013, adversarial examples have been heavily discussed by the deep learning community as a potential security threat to AI learning. Adversarial examples are defined as inputs that an attacker creates intentionally fool a machine learning model. An example of an adversarial example is shown below:

[[File:adversarial_img_1.png ‎|center]]
To the human eye, the image appears to be a panda both before and after noise is injected into the image, whereas the trained ConvNet model discerns the noisy image as a Gibbon with almost 100% certainty. The fact that the network is unable to classify the above image as a panda after the epsilon perturbation leads to many potential security risks in AI dependent systems such as self-driving vehicles. Although various methods have been suggested to combat adversarial examples, robust defences are hard to construct due to the inherent difficulties in constructing theoretical models for the adversarial example crafting process. However, beyond the fact that these examples may serve as a security threat, it emphasizes that these convolutional neural networks do not learn image classification/object detection patterns the same way that a human would. Rather than identifying the core features of a panda such as: its eyes, mouth, nose, and the gradient changes in its black/white fur, the convolutional neural network seems to be learning image representations in a completely different manner. Deep learning researchers often attempt to model neural networks after human learning, and it is clear that further steps must be taken to robustify ConvNets against targeted noise perturbations.

==Drawbacks of CNNs==
Hinton claims that the key fault with traditional CNNs lies within the pooling function. Although pooling builds translational invariance into the network, it fails to preserve spatial relationships between objects. When we pool, we effectively reduce a kxk kernel of convolved cells into a scalar input. This results in a desired local invariance without inhibiting the network's ability to detect features, but causes valuable spatial information to be lost.

In the example below, the network is able to detect the similar features (eyes, mouth, nose, etc) within both images, but fails to recognize that one image is a human face, while the other is a Picasso-esque due to the CNN's inability to encode spatial relationships after multiple pooling layers.

[[File:Equivariance Face.png ‎|center]]

Conversely, we hope that a CNN can recognize that both of the following pictures contain a kitten. Unfortunately, when we feed the two images into a ResNet50 architecture, only the first image is correctly classified, while the second image is predicted to be a guinea pig.

[[File:kitten.jpeg ‎|center]]

[[File:kitten-rotated-180.jpg ‎|center]]

For a more in depth discussion on the problems with ConvNets, please listen to Geoffrey Hinton's talk "What is wrong with convolutional neural nets?" given at MIT during the Brain & Cognitive Sciences - Fall Colloquium Series (December 4, 2014).

==Intuition for Capsules==
Human vision ignores irrelevant details by using a carefully determined sequence of fixation points to ensure that only a tiny fraction of the optic array is ever processed at the highest resolution. Hinton argues that our brains reason visual information by deconstructing it into a hierarchical representation which we then match to familiar patterns and relationships from memory. The key difference between this understanding and the functionality of CNNs is that recognition of an object should not depend on the angle from which it is viewed.

To enforce rotational and translational equivariance, Capsule Networks store and preserve hierarchical pose relationships between objects. The core idea behind capsule theory is the explicit numerical representations of relative relationships between different objects within an image. Building these relationships into the Capsule Networks model, the network is able to recognize newly seen objects as a rotated view of a previously seen object. For example, the below image shows the Statue of Liberty under five different angles. If a person had only seen the Statue of Liberty from one angle, they would be able to ascertain that all five pictures below contain the same object (just from a different angle).

[[File:Rotational Invariance.jpeg ‎|center]]

Building on this idea of hierarchical representation of spatial relationships between key entities within an image, the authors introduce Capsule Networks. Unlike traditional CNNs, Capsule Networks are better equipped to classify correctly under rotational invariance. Furthermore, the authors managed to achieve state of the art results on MNIST using a fraction of the training samples that alternative state of the art networks require.

=Background, Notation, and Definitions=

==What is a Capsule==
"Each capsule learns to recognize an implicitly defined visual entity over a limited domain of viewing conditions and deformations and it outputs both the probability that the entity is present within its limited domain and a set of “instantiation parameters” that may include the precise pose, lighting and deformation of the visual entity relative to an implicitly defined canonical version of that entity. When the capsule is working properly, the probability of the visual entity being present is locally invariant — it does not change as the entity moves over the manifold of possible appearances within the limited domain covered by the capsule. The instantiation parameters, however, are “equivariant” — as the viewing conditions change and the entity moves over the appearance manifold, the instantiation parameters change by a corresponding amount because they are representing the intrinsic coordinates of the entity on the appearance manifold."

In essence, capsules store object properties in a vector form; probability of detection is encoded as the vector's length, while spatial properties are encoded as the individual vector components. Thus, when a feature is present but the image captures it under a different angle, the probability of detection remains unchanged.

A brief overview/understanding of capsules can be found in other papers from the author. To quote from [https://openreview.net/pdf?id=HJWLfGWRb this paper]:

<blockquote>
A capsule network consists of several layers of capsules. The set of capsules in layer L is denoted
as <math>\Omega_L</math>. Each capsule has a 4x4 pose matrix, <math>M</math>, and an activation probability, <math>a</math>. These are like the
activities in a standard neural net: they depend on the current input and are not stored. In between
each capsule i in layer L and each capsule j in layer L + 1 is a 4x4 trainable transformation matrix,
<math>W_{ij}</math> . These <math>W_{ij}</math>'s (and two learned biases per capsule) are the only stored parameters and they
are learned discriminatively. The pose matrix of capsule i is transformed by <math>W_{ij}</math> to cast a vote
<math>V_{ij} = M_iW_{ij}</math> for the pose matrix of capsule j. The poses and activations of all the capsules in layer
L + 1 are calculated by using a non-linear routing procedure which gets as input <math>V_{ij}</math> and <math>a_i</math> for all
<math>i \in \Omega_L, j \in \Omega_{L+1}</math>
</blockquote>
<math></math>

==Notation==

We want the length of the output vector of a capsule to represent the probability that the entity represented by the capsule is present in the current input. The paper performs a non-linear squashing operation to ensure that vector length falls between 0 and 1, with shorter vectors (less likely to exist entities) being shrunk towards 0.

\begin{align} \mathbf{v}_j &= \frac{||\mathbf{s}_j||^2}{1+ ||\mathbf{s}_j||^2} \frac{\mathbf{s}_j}{||\mathbf{s}_j||} \end{align}

where <math>\mathbf{v}_j</math> is the vector output of capsule <math>j</math> and <math>s_j</math> is its total input.

For all but the first layer of capsules, the total input to a capsule <math>s_j</math> is a weighted sum over all “prediction vectors” <math>\hat{\mathbf{u}}_{j|i}</math> from the capsules in the layer below and is produced by multiplying the output <math>\mathbf{u}i</math> of a capsule in the layer below by a weight matrix <math>\mathbf{W}ij</math>

\begin{align}
\mathbf{s}_j = \sum_i c_{ij}\hat{\mathbf{u}}_{j|i}, ~\hspace{0.5em} \hat{\mathbf{u}}_{j|i}= \mathbf{W}_{ij}\mathbf{u}_i
\end{align}
where the <math>c_{ij}</math> are coupling coefficients that are determined by the iterative dynamic routing process.

The coupling coefficients between capsule <math>i</math> and all the capsules in the layer above sum to 1 and are determined by a “routing softmax” whose initial logits <math>b_{ij}</math> are the log prior probabilities that capsule <math>i</math> should be coupled to capsule <math>j</math>.

\begin{align}
c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}
\end{align}

=Network Training and Dynamic Routing=

==Understanding Capsules==
The notation can get somewhat confusing, so I will provide intuition behind the computational steps within a capsule. The following image is taken from naturomic's talk on Capsule Networks.

[[File:CapsuleNets.jpeg|center|800px]]

The above image illustrates the key mathematical operations happening within a capsule (and compares them to the structure of a neuron). Although the operations are rather straightforward, it's crucial to note that the capsule computes an affine transformation onto each input vector. The length of the input vectors <math>\mathbf{u}_{i}</math> represent the probability of entity <math>i</math> existing in a lower level. This vector is then reoriented with an affine transform using <math>\mathbf{W}_{ij}</math> matrices that encode spatial relationships between entity <math>\mathbf{u}_{i}</math> and other lower level features.

We illustrate the intuition behind vector-to-vector matrix multiplication within capsules using the following example: if vectors <math>\mathbf{u}_{1}</math>, <math>\mathbf{u}_{2}</math>, and <math>\mathbf{u}_{3}</math> represent detection of eyes, nose, and mouth respectively, then after multiplication with trained weight matrices <math>\mathbf{W}_{ij}</math> (where j denotes existence of a face), we should get a general idea of the general location of the higher level feature (face), similar to the image below.

[[File:Predictions.jpeg ‎|center]]

==Dynamic Routing==
A capsule <math>i</math> in a lower-level layer needs to decide how to send its output vector to higher-level capsules <math>j</math>. This decision is made with probability proportional to <math>c_{ij}</math>. If there are <math>K</math> capsules in the level that capsule <math>i</math> routes to, then we know the following properties about <math>c_{ij}</math>: <math>\sum_{j=1}^M c_{ij} = 1, c_{ij} \geq 0</math>

In essence, the <math>\{c_{ij}\}_{j=1}^M</math> denotes a discrete probability distribution with respect to capsule <math>i</math>'s output location. Lower level capsules decide which higher level capsules to send vectors into by adjusting the corresponding routing weights <math>\{c_{ij}\}_{j=1}^M</math>. After a few iterations in training, numerous vectors will have already been sent to all higher level capsules. Based on the similarity between the current vector being routed and all vectors already sent into the higher level capsules, we decide which capsule to send the current vector into.
[[File:Dynamic Routing.png|center|900px]]

In the image above, we notice that a cluster of points similar to the current vector has already been routed into capsule K, while most points in capsule J are highly dissimilar. It thus makes more sense to route the current observation into capsule K; we adjust the corresponding weight upwards during training.

These weights are determined through the dynamic routing procedure:
[[File:Routing Algo.png‎|900px]]

Although dynamic routing is not the only manner in which we can encode relationships between capsules, the premise of the paper is to demonstrate the capabilities of capsules under a simple implementation. Since the paper's release in 2017, numerous alternative routing implementations have been released including an EM matrix routing algorithm by the same authors (ICLR 2018).

=Architecture=
The capsule network architecture given by the authors has 11.36 million trainable parameters. The paper itself is not very detailed on exact implementation of each architectural layer, and hence it leaves some degree of ambiguity on coding various aspects of the original network. The capsule network has 6 overall layers, with the first three layers denoting components of the encoder, and the last 3 denoting components of the decoder.

==Loss Function==
[[File:Loss Function.png‎|900px]]

The cost function looks very complicated, but can be broken down into intuitive components. Before diving into the equation, remember that the length of the vector denotes the probability of object existence. The left side of the equation denotes loss when the network classifies an observation correctly; the term becomes zero when the classification is incorrect. To compute loss when the network correctly classifies the label, we subtract the vector norm from a fixed quantity <math>m^+ := 0.9</math>. On the other hand, when the network classifies a label incorrectly, we penalize the loss based on the network's confidence in the incorrect label; we compute the loss by subtracting <math>m^- := 0.1</math> from the vector norm.

A graphical representation of loss function values under varying vector norms is given below.
[[File:Loss function chart.png|900px]]

==Encoder Layers==
All experiments within this paper were conducted on the MNIST dataset, and thus the architecture is built to classify the corresponding dataset. For more complex datasets, the experiments were less promising.

[[File:Architecture.png|center|900px]]

The encoder layer takes in a 28x28 MNIST image and learns a 16 dimensional representation of instantiation parameters.

'''Layer 1: Convolution''':
This layer is a standard convolution layer. Using kernels with size 9x9x1, a stride of 1, and a ReLU activation function, we detect the 2D features within the network.

'''Layer 2: PrimaryCaps''':
We represent the low level features detected during convolution as 32 primary capsules. Each capsule applies eight convolutional kernels with stride 2 to the output of the convolution layer and feeds the corresponding transformed tensors into the DigiCaps layer.

'''Layer 3: DigiCaps''':
This layer contains 10 digit capsules, one for each digit. As explained in the dynamic routing procedure, each input vector from the PrimaryCaps layer has its own corresponding weight matrix <math>W_{ij}</math>. Using the routing coefficients <math>c_{ij}</math> and temporary coefficients <math>b_{ij}</math>, we train the DigiCaps layer to output a ten 16 dimensional vectors. The length of the <math>i^{th}</math> vector in this layer corresponds to the probability of detection of digit <math>i</math>.

==Decoder Layers==
The decoder layer aims to train the capsules to extract meaningful features for image detection/classification. During training, it takes the 16 layer instantiation vector of the correct (not predicted) DigiCaps layer, and attempts to recreate the 28x28 MNIST image as best as possible. Setting the loss function as reconstruction error (Euclidean distance between the reconstructed image and original image), we tune the capsules to encode features that are meaningful within the actual image.

[[File:Decoder.png|center|900px]]

The layer consists of three fully connected layers, and transforms a 16x1 vector from the encoder layer into a 28x28 image.

In addition to the digicaps loss function, we add reconstruction error as a form of regularization. We minimize the Euclidean distance between the outputs of the logistic units and the pixel intensities of the original and reconstructed images. We scale down this reconstruction loss by 0.0005 so that it does not dominate the margin loss during training. As illustrated below, reconstructions from the 16D output of the CapsNet are robust while keeping only important details.

[[File:Reconstruction.png|center|900px]]

=MNIST Experimental Results=

==Accuracy==
The paper tests on the MNIST dataset with 60K training examples, and 10K testing. Wan et al. [2013] achieves 0.21% test error with ensembling and augmenting the data with rotation and scaling. They achieve 0.39% without them. As shown in Table 1, the authors manage to achieve 0.25% test error with only a 3 layer network; the previous state of the art only beat this number with very deep networks. This example shows the importance of routing and reconstruction regularizer, which boosts the performance. On the other hand, while the accuracies are very high, the number of parameters is much smaller compared to the baseline model.

[[File:Accuracies.png|center|900px]]

==What Capsules Represent for MNIST==
The following figure shows the digit representation under capsules. Each row shows the reconstruction when one of the 16 dimensions in the DigitCaps representation is tweaked by intervals of 0.05 in the range [−0.25, 0.25]. By tweaking the values, we notice how the reconstruction changes, and thus get a sense for what each dimension is representing. The authors found that some dimensions represent global properties of the digits, while other represent localized properties.
[[File:CapsuleReps.png|center|900px]]

One example the authors provide is: different dimensions are used for the length of the ascender of a 6 and the size of the loop. The variations include stroke thickness, skew and width, as well as digit-specific variations. The authors are able to show dimension representations using a decoder network by feeding a perturbed vector.

==Robustness of CapsNet==
The authors conclude that DigitCaps capsules learn more robust representations for each digit class than traditional CNNs. The trained CapsNet becomes moderately robust to small affine transformations in the test data.

To compare the robustness of CapsNet to affine transformations against traditional CNNs, both models (CapsNet and a traditional CNN with MaxPooling and DropOut) were trained on a padded and translated MNIST training set, in which each example is an MNIST digit placed randomly on a black background of 40 × 40 pixels. The networks were then tested on the [http://www.cs.toronto.edu/~tijmen/affNIST/ affNIST] dataset (MNIST digits with random affine transformation). An under-trained CapsNet which achieved 99.23% accuracy on the MNIST test set achieved a corresponding 79% accuracy on the affnist test set. A traditional CNN achieved similar accuracy (99.22%) on the mnist test set, but only 66% on the affnist test set.

=MultiMNIST & Other Experiments=

==MultiMNIST==
To evaluate the performance of the model on highly overlapping digits, the authors generate a 'MultiMNIST' dataset. In MultiMNIST, images are two overlaid MNIST digits of the same set(train or test) but different classes. The results indicate a classification error rate of 5%. Additionally, CapsNet can be used to segment the image into the two digits that compose it. Moreover, the model is able to deal with the overlaps and reconstruct digits correctly since each digit capsule can learn the style from the votes of PrimaryCapsules layer (Figure 5).

There are some additional steps to generating the MultiMNIST dataset.

1. Both images are shifted by up to 4 pixels in each direction resulting in a 36 × 36 image. Bounding boxes of digits in MNIST overlap by approximately 80%, so this is used to make both digits identifiable (since there is no RGB difference learnable by the network to separate the digits)

2. The label becomes a vector of two numbers, representing the original digit and the randomly generated (and overlaid) digit.

[[File:CapsuleNets MultiMNIST.PNG|600px|thumb|center|Figure 5: Sample reconstructions of a CapsNet with 3 routing iterations on MultiMNIST test dataset.
The two reconstructed digits are overlayed in green and red as the lower image. The upper image
shows the input image. L:(l1; l2) represents the label for the two digits in the image and R:(r1; r2)
represents the two digits used for reconstruction. The two right most columns show two examples
with wrong classification reconstructed from the label and from the prediction (P). In the (2; 8)
example the model confuses 8 with a 7 and in (4; 9) it confuses 9 with 0. The other columns have
correct classifications and show that the model accounts for all the pixels while being able to assign
one pixel to two digits in extremely difficult scenarios (column 1 − 4). Note that in dataset generation
the pixel values are clipped at 1. The two columns with the (*) mark show reconstructions from a
digit that is neither the label nor the prediction. These columns suggest that the model is not just
finding the best fit for all the digits in the image including the ones that do not exist. Therefore in case
of (5; 0) it cannot reconstruct a 7 because it knows that there is a 5 and 0 that fit best and account for
all the pixels. Also, in the case of (8; 1) the loop of 8 has not triggered 0 because it is already accounted
for by 8. Therefore it will not assign one pixel to two digits if one of them does not have any other
support.]]

==Other datasets==
The authors also tested the proposed capsule model on CIFAR10 dataset and achieved an error rate of 10.6%. The model tested was an ensemble of 7 models. Each of the models in the ensemble had the same architecture as the model used for MNIST (apart from 3 additional channels and 64 different types of primary capsules being used). These 7 models were trained on 24x24 patches of the training images for 3 iterations. During experimentation, the authors also found out that adding an additional none-of-the-above category helped improved the overall performance. The error rate achieved is comparable to the error rate achieved by a standard CNN model. According to the authors, one of the reasons for low performance is the fact that background in CIFAR-10 images are too varied for it to be adequately modeled by reasonably sized capsule net.

The proposed model was also evaluated using a small subset of SVHN dataset. The network trained was much smaller and trained using only 73257 training images. The network still managed to achieve an error rate of 4.3% on the test set.

=Critique=
Although the network performs incredibly favorably in the author's experiments, it has a long way to go on more complex datasets. On CIFAR 10, the network achieved subpar results, and the experimental results seem to be worse when the problem becomes more complex. This is anticipated, since these networks are still in their early stage; later innovations might come in the upcoming decades/years. It could also be wise to apply the model to other datasets with larger sizes to make the functionality more acceptable. MNIST dataset has simple patterns and even if the model wanted to be presented with only one dataset, it was better not to be MNIST dataset especially in this case that the focus is on human-eye detection and numbers are not that regular in real-life experiences.

Hinton talks about CapsuleNets revolutionizing areas such as self-driving, but such groundbreaking innovations are far away from CIFAR10, and even further from MNIST. Only time can tell if CapsNets will live up to their hype.

Capsules inherently segment images and learn a lower dimensional embedding in a new manner, which makes them likely to perform well on segmentation and computer vision tasks once further research is done.

Additionally, these networks are more interpretable than CNNs, and have strong theoretical reasoning for why they could work. Naturally, it would be hard for a new architecture to beat the heavily researched/modified CNNs.

* ([https://openreview.net/forum?id=HJWLfGWRb]) it's not fully clear how effective it can be performed / how scalable it is. Evaluation is performed on a small dataset for shape recognition. The approach will need to be tested on larger, more challenging datasets.

=Future Work=
The same authors [N. F. Geoffrey E Hinton, Sara Sabour] presented another paper "MATRIX CAPSULES WITH EM ROUTING" in ICLR 2018, which achieved better results than the work presented in this paper. They presented a new multi-layered capsule network architecture, implemented an EM routing procedure, and introduced "Coordinate Addition". This new type reduced number of errors by 45%, and performed better than standard CNN on white box adversarial attacks. Capsule architectures are gaining interest because of their ability to achieve equivariance of parts, and employ a new form of pooling called "routing" (as opposed to max pooling) which groups parts that make similar predictions of the whole to which they belong, rather than relying on spatial co-locality.
Moreover, we may try to change the curvature and sensitivities to various factors by introducing new form of loss function. It may improve the performance of the model for more complicated data set which is one of the model's drawback.

Moreover, as mentioned in critiques, a good future work for this group would be making the model more robust to the dataset and achieve acceptable performance on datasets with more regularly seen images in real life experiences.

=References=
#N. F. Geoffrey E Hinton, Sara Sabour. Matrix capsules with em routing. In International Conference on Learning Representations, 2018.
#S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” arXiv preprint arXiv:1710.09829v2, 2017
# Hinton, G. E., Krizhevsky, A. and Wang, S. D. (2011), Transforming Auto-encoders
#Geoffrey Hinton's talk: What is wrong with convolutional neural nets? - Talk given at MIT. Brain & Cognitive Sciences - Fall Colloquium Series. [https://www.youtube.com/watch?v=rTawFwUvnLE ]
#Understanding Hinton’s Capsule Networks - Max Pechyonkin's series [https://medium.com/ai%C2%B3-theory-practice-business/understanding-hintons-capsule-networks-part-i-intuition-b4b559d1159b]
#ReferencesMartín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg SCorrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machinelearning on heterogeneous distributed systems.arXiv preprint arXiv:1603.04467, 2016.
#Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visualattention.arXiv preprint arXiv:1412.7755, 2014.
#Jia-Ren Chang and Yong-Sheng Chen. Batch-normalized maxout network in network.arXiv preprintarXiv:1511.02583, 2015.
#Dan C Cire ̧san, Ueli Meier, Jonathan Masci, Luca M Gambardella, and Jürgen Schmidhuber. High-performance neural networks for visual object classification.arXiv preprint arXiv:1102.0183,2011.
#Ian J Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay Shet. Multi-digit numberrecognition from street view imagery using deep convolutional neural networks.arXiv preprintarXiv:1312.6082, 2013.

DETECTING STATISTICAL INTERACTIONS FROM NEURAL NETWORK WEIGHTS

2018-11-29T16:31:43Z

Lwali: /* Related Work */

=Introduction=
With the growth in the computational power available Neural Networks have been able to solve many of the complex tasks in a wide variety of fields. This is mainly due to their ability to model complex and non-linear interactions. However, within several areas, like eg: computation social science, interpretability is utmost. Since we do not understand how a neural network comes to its decision, practitioners in these areas tend to prefer simpler models like linear regression, decision trees, etc. which are much more interpretable. In this paper, we are going to present one way of implementing interpretability in a neural network.

Note that in this paper, we only consider one specific types of neural network, Feed-Forward Neural Network. Based on the methodology discussed here, the authors suggest that we can build an interpretation methodology for other types of networks also.

=Related Work=

1. Interaction Detection approaches:
* Conduct individual tests for all features' combination such as ANOVA and Additive Groves.
* Define all interaction forms of interest, then later finds the important ones.
- The paper's goal is to detect interactions without compromising the functional forms.

2. Interpretability: A lot of work has also been done in this particular area and it can be divided it the following broad categories:
* Feature Importance through Decomposition: Methods like Input Gradient(Sundararajan et al., 2017) learns the importance of features through a gradient-based approach similar to backpropagation. Works like Li et al(2017), Murdoch(2017) and Murdoch(2018) study interpretability of LSTMs by looking at phrase and word level importance scores. Bach et al. 2015 and Shrikumar et al. 2016 (DeepLift) study pixel importance in CNNs.
* Studying Visualizations in Models - Karpathy et al. (2015) worked with character generating LSTMs and tried to study activation and firing in certain hidden units for meaningful attributes. (Yosinski et al., 2015 studies feature map visualizations.
* Attention-Based Models: Bahdanau et al. (2014) - These are a different class of models which use attention modules(different architectures) to help focus the neural network to decide the parts of the input that it should look more closely or give more importance to. Looking at the results of these type of model an indirect sense of interpretability can be gauged.

The approach in this paper is to extract interactions between variables from the neural network weights.

=Notations=
Before we dive in to methodology, we are going to define a few notations here. Most of them will be trivial.

1. Vector: Vectors are defined with bold-lowercases, '''v, w'''

2. Matrix: Matrice are defined with blod-uppercases, '''V, W'''

3. Interger Set: For some interger p <math>\in</math> Z, we define [p] := {1,2,3,...,p}

=Interaction=
First of all, in order to explain the model, we need to be able to explain the interactions and their effects to output. Therefore, we define 'interacion' between variables as below.

[[File:def_interaction.PNG|900px|center]]

From the definition above, for a function like, <math>x_1x_2 + sin(x_3 + x_4 + x_5)</math>, we have <math>{[x_1, x_2]}</math> and <math>{[x_3, x_4, x_5]}</math> interactions. And we say that the latter interaction to be 3-way interaction.

Note that from the definition above, we can naturally deduce that d-way interaction can exist if and only if all of its (d-1) interactions exist. For example, 3-way interaction above shows that we have 2-way interactions <math>{[3,4], [4,5]}</math> and <math>{[3,5]}</math>.

One thing that we need to keep in mind is that for models like neural network, most of interactions are happening within hidden layers. This means that we needa proper way of measuring interaction strength.

The key observation is that for any kinds of interaction, at a some hidden unit of some hidden layer, two interacting features the ancestors. In graph-theoretical language, interaction map can be viewed as an associated directed graph and for any interaction <math>\Gamma \in [p]</math>, there exists at least one vertix that has all of features of <math>\Gamma</math> as ancestors. The statement can be rigorized as the following:

[[File:prop2.PNG|900px|center]]

Now, the above mathematical statement gurantees us to measure interaction strengths at ANY hidden layers. For example, if we want to study about interactions at some specific hidden layer, now we now that there exists corresponding vertices between the hidden layer and output layer. Therefore all we need to do is now to find approprite measure which can summarize the information between those two layers.
}
Before doing so, let's think about a single-layered neural network. For any one hidden unit, we can have possibly, <math>2^{||W_i,:||}</math>, number of interactions. This means that our search space might be too huge for multi-layered networks. Therefore, we need a some descent way of approximate out search space.

[[File:network1.PNG|500px|center]]

==Measuring influence in hidden layers==
As we discussed above, in order to consider interaction between units in any layers, we need to think about their out-going paths. However, we soon encountered the fact that for some fully-connected multi-layer neural network, the search space might be too huge to compare. Therefore, we use information about out-going paths gredient upper bond. To represent the influence of out-going paths at <math>l</math>-hidden layer, we define cumulative impact of weights between output layer and <math>l+1</math>. We define aggregated weights as,

[[File:def3.PNG|900px|center]]

Note that <math>z^{(l)} \in R^{(p_l)}</math> where <math>p_l</math> is the number of hidden units in <math>l</math>-layer.
Moreover, this is the lipschitz constant of gredients. Gredient has been an import variable of measuring influence of features, especially when we consider that input layer's derivative computes the direction normal to decision boundaries.

==Quantifying influence==
For some <math>i</math> hidden unit at the first hidden layer, which is the closet layer to the input layer, we define the influence strength of some interaction as,

[[File:measure1.PNG|900px|center]]

The function <math>\mu</math> will be defined later. Essentially, the formula shows that the strength of influence is defined as the product of the aggregated weight on the first hidden layer and some measure of influence between the first hidden layer and the input layer.

For the function, <math>\mu</math>, any positive-real valued functions such as max, min and average can be candidates. The effects of those candidates will be tested later.

Now based on the specifications above, the author suggested the algorithm for searching influential interactions between input layer units as follows:

[[File:algorithm1.PNG|850px|center]]

=Cut off Model=
Now using the greedy algorithm defined above, we can rank the interactions by their strength. However, in order to access true interactions, we are building the cut off model which is a generalized additive model (GAM) as below,

[[File:gam1.PNG|900px|center]]

From the above model, each <math>g</math> and <math>g^*</math> are Feed-Forward neural network. We are keep adding interactions until the performance reaches plateaus.

=Experiment=
For the experiment, we are going to compare three neural network model with traditional statistical interaction detecting algorithms. For the nueral network models, first model will be MLP, second model will be MLP-M, which is MLP with additional univariate network at the output. The last one is the cut-off model defined above, which is denoted by MLP-cutoff. MLP-M model is graphically represented below.

[[File:output11.PNG|300px|center]]

For the experiment, we are going to test on 10 synthetic functions.

[[File:synthetic.PNG|900px|center]]

And the author also reported the results of comparisons between the models. As you can see, neural network based models are performing better in average. Compare to the traditional methods liek ANOVA, MLP and MLP-M method shows 20% increases in performance.

[[File:performance_mlpm.PNG|900px|center]]

[[File:performance2_mlpm.PNG|900px|center]]

The above result shows that MLP-M almost perfectly catch the most influential pair-wise interactions.

=Limitations=
Even though for the above synthetic experiment MLP methods showed superior performances, the method still have some limitations. For example, fir the function like, <math>x_1x_2 + x_2x_3 + x_1x_3</math>, neural network fails to distinguish between interlinked interactions to single higher order interaction. Moreoever, correlation between features deteriorates the ability of the network to distinguish interactions. However, correlation issues are presented most of interaction detection algorithms.

Because this method relies on the neural network fitting the data well, there are some additional concerns. Notably, if the NN is unable to make an appropriate fit (under/overfitting), the resulting interactions will be flawed. This can occur if the datasets that are too small or too noisy, which often occurs in practical settings.

=Conclusion=
Here we presented the method of detecting interactions using MLP. Compared to other state-of-the-art methods like Additive Groves (AG), the performances are competitive yet computational powers required is far less. Therefore, it is safe to claim that the method will be extremly useful for practitioners with (comparably) less computational powers. Moreover, the NIP algorithm successfully reduced the computation sizes. After all, the most important aspect of this algorithm is that now users of nueral networks can impose interpretability in the model usage, which will change the level of usability to another level for most of practitioners outside of those working in machine learning and deep learning areas.

=Critique=
1. Authors need to do large-scale experiments, instead of just conducting experiments on some synthetic dataset with small feature dimensionality, to make their claim stronger.

2. Although the method proposed in this paper is interesting, the paper would benefit from providing some more explanations to support its idea and fill the possible gaps in its experimental evaluation. In some parts there are repetitive explanations that could be replaced by other essential clarifications.

=Reference=
[1] Jacob Bien, Jonathan Taylor, and Robert Tibshirani. A lasso for hierarchical interactions. Annals of statistics, 41(3):1111, 2013.

[2] G David Garson. Interpreting neural-network connection weights. AI Expert, 6(4):46–51, 1991.

[3] Yotam Hechtlinger. Interpretation of prediction models using the input gradient. arXiv preprint arXiv:1611.07634, 2016.

[4] Shiyu Liang and R Srikant. Why deep neural networks for function approximation? 2016.

[5] David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. International Conference on Learning Representations, 2018.

[6] Daria Sorokina, Rich Caruana, and Mirek Riedewald. Additive groves of regression trees. Machine Learning: ECML 2007, pp. 323–334, 2007.

[7] Simon Wood. Generalized additive models: an introduction with R. CRC press, 2006

stat946F18/Autoregressive Convolutional Neural Networks for Asynchronous Time Series

2018-11-29T15:02:33Z

Lwali: /* Experiments */

This page is a summary of the paper "[http://proceedings.mlr.press/v80/binkowski18a/binkowski18a.pdf Autoregressive Convolutional Neural Networks for Asynchronous Time Series]" by Mikołaj Binkowski, Gautier Marti, Philippe Donnat. It was published at ICML in 2018. The code for this paper is provided [https://github.com/mbinkowski/nntimeseries here].

=Introduction=
In this paper, the authors propose a deep convolutional network architecture called Significance-Offset Convolutional Neural Network for regression of multivariate asynchronous time series. The model is inspired by standard autoregressive(AR) models and gating systems used in recurrent neural networks. The model is evaluated on various time series data including:
# Hedge fund proprietary dataset of over 2 million quotes for a credit derivative index,
# An artificially generated noisy auto-regressive series,
# A UCI household electricity consumption dataset.

This paper focuses on time series with multi-variate and noisy signals, especially financial data. Financial time series is challenging to predict due to their low signal-to-noise ratio and heavy-tailed distributions. For example, the same signal (e.g. price of a stock) is obtained from different sources (e.g. financial news, an investment bank, financial analyst etc.) asynchronously. Each source may have a different bias or noise. (Figure 1) The investment bank with more clients can update their information more precisely than the investment bank with fewer clients, then the significance of each past observations may depend on other factors that change in time. Therefore, the traditional econometric models such as AR, VAR, VARMA[1] might not be sufficient. However, their relatively good performance could allow us to combine such linear econometric models with deep neural networks that can learn highly nonlinear relationships. This model is inspired by the gating mechanism which is successful in RNNs and Highway Networks.

The time series forecasting problem can be expressed as a conditional probability distribution below,
<div style="text-align: center;"><math>p(X_{t+d}|X_t,X_{t-1},...) = f(X_t,X_{t-1},...)</math></div>
Thus, we focus on modeling the predictors of future values of time series given their past values.

The reasons that financial time series are particularly challenging:
* low signal-to-noise ratio and heavy-tailed distributions.
* Being observed different sources (e.g. financial news, analysts, portfolio managers in hedge funds, market-makers in investment banks) in asynchronous
moments of time. Each of these sources may have a different bias and noise with respect to the original signal that needs to be recovered.
* Data sources are usually strongly correlated and lead-lag relationships are possible (e.g. a market-maker with more clients can update its view more frequently and precisely than one with fewer clients).
* The significance of each of the available past observations might be dependent on some other factors that can change in time. Hence, the traditional econometric models such as AR, VAR, VARMA might not be sufficient.

The predictability of financial dataset still remains an open problem and is discussed in various publications [2].

[[File:Junyi1.png | 500px|thumb|center|Figure 1: Quotes from four different market participants (sources) for the same credit default swaps (CDS) throughout one day. Each trader displays from time to time the prices for which he offers to buy (bid) and sell (ask) the underlying CDS. The filled area marks the difference between the best sell and buy offers (spread) at each time.]]

The paper also provides empirical evidence that their model which combines linear models with deep learning models could perform better than just DL models like CNN, LSTMs and Phased LSTMs.

=Related Work=
===Time series forecasting===
From recent proceedings in main machine learning venues i.e. ICML, NIPS, AISTATS, UAI, we can notice that time series are often forecast using Gaussian processes[3,4], especially for irregularly sampled time series[5]. Though still largely independent, combined models have started to appear, for example, the Gaussian Copula Process Volatility model[6]. For this paper, the authors use coupling AR models and neural networks to achieve such combined models.

Although deep neural networks have been applied into many fields and produced satisfactory results, there still is little literature on deep learning for time series forecasting. More recently, the papers include Sirignano (2016)[7] that used 4-layer perceptrons in modeling price change distributions in Limit Order Books, and Borovykh et al. (2017)[8] who applied more recent WaveNet architecture to several short univariate and bivariate time-series (including financial ones). Heaton et al. (2016)[9] claimed to use autoencoders with a single hidden layer to compress multivariate financial data. Neil et al. (2016)[10] presented augmentation of LSTM architecture suitable for asynchronous series, which stimulates learning dependencies of different frequencies through time gate.

In this paper, the authors examine the capabilities of several architectures (CNN, residual network, multi-layer LSTM, and phase LSTM) on AR-like artificial asynchronous and noisy time series, household electricity consumption dataset, and on real financial data from the credit default swap market with some inefficiencies.

====AR Model====

An autoregressive (AR) model describes the next value in a time-series as a combination of previous values, scaling factors, a bias, and noise [https://onlinecourses.science.psu.edu/stat501/node/358/ (source)]. For a p-th order (relating the current state to the p last states), the equation of the model is:

<math> X_t = c + \sum_{i=1}^p \varphi_i X_{t-i}+ \varepsilon_t \,</math> [https://en.wikipedia.org/wiki/Autoregressive_model#Definition (equation source)]

With parameters/coefficients <math>\varphi_i</math>, constant <math>c</math>, and noise <math>\varepsilon_t</math> This can be extended to vector form to create the VAR model mentioned in the paper.

===Gating and weighting mechanisms===
Gating mechanisms for neural networks has ability to overcome the problem of vanishing gradient, and can be expressed as <math display="inline">f(x)=c(x) \otimes \sigma(x)</math>, where <math>f</math> is the output function, <math>c</math> is a "candidate output" (a nonlinear function of <math>x</math>), <math>\otimes</math> is an element-wise matrix product, and <math>\sigma : \mathbb{R} \rightarrow [0,1] </math> is a sigmoid nonlinearity that controls the amount of output passed to the next layer. Different composition of functions of the same type as described above have proven to be an essential ingredient in popular recurrent architecture such as LSTM and GRU[11].

The main purpose of the proposed gating system is to weight the outputs of the intermediate layers within neural networks, and is most closely related to softmax gating used in MuFuRu(Multi-Function Recurrent Unit)[12], i.e.
<math display="inline"> f(x) = \sum_{l=1}^L p^l(x) \otimes f^l(x)\text{,}\ p(x)=\text{softmax}(\widehat{p}(x)), </math>, where <math>(f^l)_{l=1}^L </math>are candidate outputs (composition operators in MuFuRu), <math>(\widehat{p}^l)_{l=1}^L </math>are linear functions of inputs.

This idea is also successfully used in attention networks[13] such as image captioning and machine translation. In this paper, the proposed method is similar as the separate inputs (time series steps in this case) are weighted in accordance with learned functions of these inputs. The difference is that the functions are being modeled using multi-layer CNNs. Another difference is that the proposed method is not using recurrent layers, which enables the network to remember parts of the sentence/image already translated/described.

=Motivation=
There are mainly five motivations that are stated in the paper by the authors:
#The forecasting problem in this paper has been done almost independently by econometrics and machine learning communities. Unlike in machine learning, research in econometrics is more likely to explain variables rather than improving out-of-sample prediction power. These models tend to 'over-fit' on financial time series, their parameters are unstable and have poor performance on out-of-sample prediction.
#It is difficult for the learning algorithms to deal with time series data where the observations have been made irregularly. Although Gaussian processes provide a useful theoretical framework that is able to handle asynchronous data, they are not suitable for financial datasets, which often follow heavy-tailed distribution .
#Predictions of autoregressive time series may involve highly nonlinear functions if sampled irregularly. For AR time series with higher order and have more past observations, the expectation of it <math display="inline">\mathbb{E}[X(t)|{X(t-m), m=1,...,M}]</math> may involve more complicated functions that in general may not allow closed-form expression.
#In practice, the dimensions of multivariate time series are often observed separately and asynchronously, such series at fixed frequency may lead to lose information or enlarge the dataset, which is shown in Figure 2(a). Therefore, the core of the proposed architecture SOCNN represents separate dimensions as a single one with dimension and duration indicators as additional features(Figure 2(b)).
#Given a series of pairs of consecutive input values and corresponding durations, <math display="inline"> x_n = (X(t_n),t_n-t_{n-1}) </math>. One may expect that LSTM may memorize the input values in each step and weight them at the output according to the duration, but this approach may lead to an imbalance between the needs for memory and for linearity. The weights that are assigned to the memorized observations potentially require several layers of nonlinearity to be computed properly, while past observations might just need to be memorized as they are.

[[File:Junyi2.png | 550px|thumb|center|Figure 2: (a) Fixed sampling frequency and its drawbacks; keep- ing all available information leads to much more datapoints. (b) Proposed data representation for the asynchronous series. Consecutive observations are stored together as a single value series, regardless of which series they belong to; this information, however, is stored in indicator features, alongside durations between observations.]]

=Model Architecture=
Suppose there exists a multivariate time series <math display="inline">(x_n)_{n=0}^{\infty} \subset \mathbb{R}^d </math>, we want to predict the conditional future values of a subset of elements of <math>x_n</math>
<div style="text-align: center;"><math>y_n = \mathbb{E} [x_n^I | \{x_{n-m}, m=1,2,...\}], </math></div>
where <math> I=\{i_1,i_2,...i_{d_I}\} \subset \{1,2,...,d\} </math> is a subset of features of <math>x_n</math>.

Let <math> \textbf{x}_n^{-M} = (x_{n-m})_{m=1}^M </math>.

The estimator of <math>y_n</math> can be expressed as:
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M [F(\textbf{x}_n^{-M}) \otimes \sigma(S(\textbf{x}_n^{-M}))].,_m ,</math></div>
The estimate is the summation of the columns of the matrix in bracket. Here
#<math>F,S : \mathbb{R}^{d \times M} \rightarrow \mathbb{R}^{d_I \times M}</math> are neural networks.
#* <math>S</math> is a fully convolutional network which is composed of convolutional layers only.
#* <math display="inline">F(\textbf{x}_n^{-M}) = W \otimes [\text{off}(x_{n-m}) + x_{n-m}^I)]_{m=1}^M </math>
#** <math> W \in \mathbb{R}^{d_I \times M}</math>
#** <math> \text{off}: \mathbb{R}^d \rightarrow \mathbb{R}^{d_I} </math> is a multilayer perceptron.

#<math>\sigma</math> is a normalized activation function independent at each row, i.e. <math display="inline"> \sigma ((a_1^T, ..., a_{d_I}^T)^T)=(\sigma(a_1)^T,..., \sigma(a_{d_I})^T)^T </math>
#* for any <math>a_{i} \in \mathbb{R}^{M}</math>
#* and <math>\sigma </math> is defined such that <math>\sigma(a)^{T} \mathbf{1}_{M}=1</math> for any <math>a \in \mathbb{R}^M</math>.
# <math>\otimes</math> is element-wise matrix multiplication (also known as Hadamard matrix multiplication).
#<math>A.,_m</math> denotes the m-th column of a matrix A.

Since <math>\sum_{m=1}^M W.,_m=W\cdot(1,1,...,1)^T</math> and <math>\sum_{m=1}^M S.,_m=S\cdot(1,1,...,1)^T</math>, we can express <math>\hat{y}_n</math> as:
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M W.,_m \otimes (off(x_{n-m}) + x_{n-m}^I) \otimes \sigma(S.,_m(\textbf{x}_n^{-M}))</math></div>
This is the proposed network, Significance-Offset Convolutional Neural Network, <math>\text{off}</math> and <math>S</math> in the equation are corresponding to Offset and Significance in the name respectively.
Figure 3 shows the scheme of network.

[[File:Junyi3.png | 600px|thumb|center|Figure 3: A scheme of the proposed SOCNN architecture. The network preserves the time-dimension up to the top layer, while the number of features per timestep (filters) in the hidden layers is custom. The last convolutional layer, however, has the number of filters equal to dimension of the output. The Weighting frame shows how outputs from offset and significance networks are combined in accordance with Eq. of <math>\hat{y}_n</math>.]]

The form of <math>\hat{y}_n</math> ensures the separation of the temporal dependence (obtained in weights <math>W_m</math>). <math>S</math>, which represents the local significance of observations, is determined by its filters which capture local dependencies and are independent of the relative position in time, and the predictors <math>\text{off}(x_{n-m})</math> are completely independent of position in time. An adjusted single regressor for the target variable is provided by each past observation through the offset network. Since in asynchronous sampling procedure, consecutive values of x come from different signals and might be heterogeneous, therefore adjustment of offset network is important. In addition, significance network provides data-dependent weight for each regressor and sums them up in an autoregressive manner.

===Relation to asynchronous data===
One common problem of time series is that durations are varying between consecutive observations, the paper states two ways to solve this problem
#Data preprocessing: aligning the observations at some fixed frequency e.g. duplicating and interpolating observations as shown in Figure 2(a). However, as mentioned in the figure, this approach will tend to loss of information and enlarge the size of the dataset and model complexity.
#Add additional features: Treating the duration or time of the observations as additional features, it is the core of SOCNN, which is shown in Figure 2(b).

===Loss function===
The L2 error is a natural loss function for the estimators of expected value: <math>L^2(y,y')=||y-y'||^2</math>

The output of the offset network is series of separate predictors of changes between corresponding observations <math>x_{n-m}^I</math> and the target value<math>y_n</math>, this is the reason why we use auxiliary loss function, which equals to mean squared error of such intermediate predictions:
<div style="text-align: center;"><math>L^{aux}(\textbf{x}_n^{-M}, y_n)=\frac{1}{M} \sum_{m=1}^M ||off(x_{n-m}) + x_{n-m}^I -y_n||^2 </math></div>
The total loss for the sample <math> \textbf{x}_n^{-M},y_n) </math> is then given by:
<div style="text-align: center;"><math>L^{tot}(\textbf{x}_n^{-M}, y_n)=L^2(\widehat{y}_n, y_n)+\alpha L^{aux}(\textbf{x}_n^{-M}, y_n)</math></div>
where <math>\widehat{y}_n</math> was mentioned before, <math>\alpha \geq 0</math> is a constant.

=Experiments=
The paper evaluated SOCNN architecture on three datasets: artificially generated datasets, [https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption household electric power consumption dataset], and the financial dataset of bid/ask quotes sent by several market participants active in the credit derivatives market. Comparing its performance with simple CNN, single and multiplayer LSTM and 25-layer ResNet. Apart from the evaluation of the SOCNN architecture the paper also discusses the impact of network components such as auxiliary
loss and the depth of the offset sub-network. The code and datasets are available [https://github.com/mbinkowski/nntimeseries here]

==Datasets==
Artificial data: They generated 4 artificial series, <math> X_{K \times N}</math>, where <math>K \in \{16,64\} </math>. Therefore there is a synchronous and an asynchronous series for each K value.

Electricity data: This UCI dataset contains 7 different features excluding date and time. The features include global active power, global reactive power, voltage, global intensity, sub-metering 1, sub-metering 2 and sub-metering 3, recorded every minute for 47 months. The data has been altered so that one observation contains only one value of 7 features, while durations between consecutive observations are ranged from 1 to 7 minutes. The goal is to predict all 7 features for the next time step.

Non-anonymous quotes: The dataset contains 2.1 million quotes from 28 different sources from different market participants such as analysts, banks etc. Each quote is characterized by 31 features: the offered price, 28 indicators of the quoting source, the direction indicator (the quote refers to either a buy or a sell offer) and duration from the previous quote. For each source and direction, we want to predict the next quoted price from this given source and direction considering the last 60 quotes.

==Training details==
They applied grid search on some hyperparameters in order to get the significance of its components. The hyperparameters include the offset sub-network's depth and the auxiliary weight <math>\alpha</math>. For offset sub-network's depth, they use 1, 10,1 for artificial, electricity and quotes dataset respectively; and they compared the values of <math>\alpha</math> in {0,0.1,0.01}.

They chose LeakyReLU as activation function for all networks:
<div style="text-align: center;"><math>\sigma^{LeakyReLU}(x) = x</math> if <math>x\geq 0</math>, and <math>0.1x</math> otherwise </div>
They use the same number of layers, same stride and similar kernel size structure in CNN. In each trained CNN, they applied max pooling with the pool size of 2 every 2 convolutional layers.

Table 1 presents the configuration of network hyperparameters used in comparison

[[File:Junyi4.png | 400px|center|]]

===Network Training===
The training and validation data were sampled randomly from the first 80% of timesteps in each series, with ratio of 3 to 1. The remaining 20% of data was used as a test set.

All models were trained using Adam optimizer because the authors found that its rate of convergence was much faster than standard Stochastic Gradient Descent in early tests.

They used a batch size of 128 for artificial and electricity data, and 256 for quotes dataset, and applied batch normalization between each convolution and the following activation.

At the beginning of each epoch, the training samples were randomly sampled. To prevent overfitting, they applied dropout and early stopping.

Weights were initialized using the normalized uniform procedure proposed by Glorot & Bengio (2010).[14]

The authors carried out the experiments on Tensorflow and Keras and used different GPU to optimize the model for different datasets.

==Results==
Table 2 shows all results performed from all datasets.
[[File:Junyi5.png | 600px|center|]]
We can see that SOCNN outperforms in all asynchronous artificial, electricity and quotes datasets. For synchronous data, LSTM might be slightly better, but SOCNN almost has the same results with LSTM. Phased LSTM and ResNet have performed really bad on artificial asynchronous dataset and quotes dataset respectively. Notice that having more than one layer of offset network would have negative impact on results. Also, the higher weights of auxiliary loss(<math>\alpha</math>considerably improved the test error on asynchronous dataset, see Table 3. However, for other datasets, its impact was negligible.
[[File:Junyi6.png | 400px|center|]]
In general, SOCNN has significantly lower variance of the test and validation errors, especially in the early stage of the training process and for quotes dataset. This effect can be seen in the learning curves for Asynchronous 64 artificial dataset presented in Figure 5.
[[File:Junyi7.png | 500px|thumb|center|Figure 5: Learning curves with different auxiliary weights for SOCNN model trained on Asynchronous 64 dataset. The solid lines indicate the test error while the dashed lines indicate the training error.]]

Finally, we want to test the robustness of the proposed model SOCNN, adding noise terms to asynchronous 16 dataset and check how these networks perform. The result is shown in Figure 6.
[[File:Junyi8.png | 600px|thumb|center|Figure 6: Experiment comparing robustness of the considered networks for Asynchronous 16 dataset. The plots show how the error would change if an additional noise term was added to the input series. The dotted curves show the total significance and average absolute offset (not to scale) outputs for the noisy observations. Interestingly, the significance of the noisy observations increases with the magnitude of noise; i.e. noisy observations are far from being discarded by SOCNN.]]
From Figure 6, the purple line and green line seems staying at the same position in training and testing process. SOCNN and single-layer LSTM are most robust compared to other networks, and least prone to overfitting.

=Conclusion and Discussion=
In this paper, the authors have proposed a new architecture called Significance-Offset Convolutional Neural Network, which combines AR-like weighting mechanism and convolutional neural network. This new architecture is designed for high-noise asynchronous time series and achieves outperformance in forecasting several asynchronous time series compared to popular convolutional and recurrent networks.

The SOCNN can be extended further by adding intermediate weighting layers of the same type in the network structure. Another possible extension but needs further empirical studies is that we consider not just <math>1 \times 1</math> convolutional kernels on the offset sub-network. Also, this new architecture might be tested on other real-life datasets with relevant characteristics in the future, especially on econometric datasets and more generally for time series (stochastic processes) regression.

=Critiques=
#The paper is most likely an application paper, and the proposed new architecture shows improved performance over baselines in the asynchronous time series.
#The quote data cannot be reached, only two datasets available.
#The 'Significance' network was described as critical to the model in paper, but they did not show how the performance of SOCNN with respect to the significance network.
#The transform of the original data to asynchronous data is not clear.
#The experiments on the main application are not reproducible because the data is proprietary.
#The way that train and test data were split is unclear. This could be important in the case of the financial data set.
#Although the auxiliary loss function was mentioned as an important part, the advantages of it was not too clear in the paper. Maybe it is better that the paper describes a little more about its effectiveness.
#It was not mentioned clearly in the paper whether the model training was done on a rolling basis for time series forecasting.
#The noise term used in section 5's model robustness analysis uses evenly distributed noise (see Appendix B). While the analysis is a good start, analysis with different noise distributions would make the findings more generalizable.
#The paper uses financial/economic data as one of its testing data set. Instead of comparing neural network models such as CNN which is known to work badly on time series data, it would be much better if the author compared to well-known econometric time series models such as GARCH and VAR.

=References=
[1] Hamilton, J. D. Time series analysis, volume 2. Princeton university press Princeton, 1994.

[2] Fama, E. F. Efficient capital markets: A review of theory and empirical work. The journal of Finance, 25(2):383–417, 1970.

[3] Petelin, D., Sˇindela ́ˇr, J., Pˇrikryl, J., and Kocijan, J. Financial modeling using gaussian process models. In Intelligent Data Acquisition and Advanced Computing Systems (IDAACS), 2011 IEEE 6th International Conference on, volume 2, pp. 672–677. IEEE, 2011.

[4] Tobar, F., Bui, T. D., and Turner, R. E. Learning stationary time series using gaussian processes with nonparametric kernels. In Advances in Neural Information Processing Systems, pp. 3501–3509, 2015.

[5] Hwang, Y., Tong, A., and Choi, J. Automatic construction of nonparametric relational regression models for multiple time series. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

[6] Wilson, A. and Ghahramani, Z. Copula processes. In Advances in Neural Information Processing Systems, pp. 2460–2468, 2010.

[7] Sirignano, J. Extended abstract: Neural networks for limit order books, February 2016.

[8] Borovykh, A., Bohte, S., and Oosterlee, C. W. Condi- tional time series forecasting with convolutional neural networks, March 2017.

[9] Heaton, J. B., Polson, N. G., and Witte, J. H. Deep learn- ing in finance, February 2016.

[10] Neil, D., Pfeiffer, M., and Liu, S.-C. Phased lstm: Acceler- ating recurrent network training for long or event-based sequences. In Advances In Neural Information Process- ing Systems, pp. 3882–3890, 2016.

[11] Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Em- pirical evaluation of gated recurrent neural networks on sequence modeling, December 2014.

[12] Weissenborn, D. and Rockta ̈schel, T. MuFuRU: The Multi-Function recurrent unit, June 2016.

[13] Cho, K., Courville, A., and Bengio, Y. Describing multi- media content using attention-based Encoder–Decoder networks. IEEE Transactions on Multimedia, 17(11): 1875–1886, July 2015. ISSN 1520-9210.

[14] Glorot, X. and Bengio, Y. Understanding the dif- ficulty of training deep feedforward neural net- works. In In Proceedings of the International Con- ference on Artificial Intelligence and Statistics (AIS- TATSaˆ10). Society for Artificial Intelligence and Statistics, 2010.

Unsupervised Neural Machine Translation

2018-11-26T01:19:29Z

Lwali: /* Back-Translation */

This paper was published in ICLR 2018, authored by Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. Open source implementation of this paper is available [https://github.com/artetxem/undreamt here]

= Introduction =
The paper presents an unsupervised Neural Machine Translation(NMT) method that uses monolingual corpora (single language texts) only. This contrasts with the usual supervised NMT approach which relies on parallel corpora (aligned text) from the source and target languages being available for training. This problem is important because parallel pairing for a majority of languages, e.g. for German-Russian, do not exist.

Other authors have recently tried to address this problem using semi-supervised approaches (small set of parallel corpora). However, these methods still require a strong cross-lingual signal. The proposed method eliminates the need for cross-lingual information all together and relies solely on monolingual data. The proposed method builds upon the work done recently on unsupervised cross-lingual embeddings by Artetxe et al., 2017 and Zhang et al., 2017.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn single language word embeddings for both languages separately.
# Align the 2 sets of word embeddings into a single cross lingual (language independent) embedding.
Then iteratively perform:
# Train an encoder-decoder model to reconstruct noisy versions of sentences in both source and target languages separately. The model uses a single encoder and different decoders for each language. The encoder uses cross lingual word embedding.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monolingual corpora to vector embeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so, in theory, there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 1 shows an example of aligning the word embeddings in English and French.

[[File:Figure1_lwali.png|frame|400px|center|Figure 1: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

Most cross-lingual word embedding methods use bilingual signals in the form of parallel corpora. Usually, the embedding mapping methods train the embeddings in different languages using monolingual corpora, then use a linear transformation to map them into a shared space based on a bilingual dictionary.

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration.

===Other related work and inspirations===
====Statistical Decipherment for Machine Translation====
There has been significant work in statistical deciphering techniques (decipherment is the discovery of the meaning of texts written in ancient or obscure languages or scripts) to develop a machine translation model from monolingual data (Ravi & Knight, 2011; Dou & Knight, 2012). These techniques treat the source language as ciphertext (known as encrypted or encoded information because it contains a form of the original plain text that is unreadable by a human or computer without the proper cipher to decrypt it) and model the generation process of the ciphertext as a two-stage process including the generation of the original English sequence and the probabilistic replacement of the words in it. This approach is able to take the advantage of the incorporation of syntactic knowledge of the languages.

====Low-Resource Neural Machine Translation====
There are also proposals that use techniques other than direct parallel corpora to do neural machine translation(NMT). Some use a third intermediate language that is well connected to the source and target languages independently. For example, if we want to translate German into Russian, we can use English as an intermediate language(German-English and then English-Russian) since there are plenty of resources to connect English and other languages. Johnson et al. (2017) show that a multilingual extension of a standard NMT architecture performs reasonably well for language pairs when no parallel data for the source and target data was used during training.

Other works use monolingual data in combination with scarce parallel corpora. A simple but effective technique is back-translation [Sennrich et al, 2016]. First, a synthetic parallel corpus in the target language is created. Translated sentence and back translated to the source language and compared with the original sentence.

The most important contribution to the problem of training an NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directions (e.g. French → English and English → French) and teach each other through reinforcement learning. However, this approach still required a large parallel corpus for a warm start, while our paper does not use parallel data.

= Methodology =

The corpora data is first preprocessed in a standard way to tokenize and case the words. The authors also experiment with an alternative way of tokenizing words by using Byte-Pair Encoding (BPE) [Sennrich, 2016]. BPE has been shown to improve embeddings of rare-words. The vocabulary was limited to the most frequent 50,000 tokens (BPE tokens or words).

The tokens are then converted to word embeddings using word2vec with 300 dimensions and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different for each language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT usually are built for one direction translations English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time translating English<math>\leftrightarrow</math>French.
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in the background.

[[File:Figure2_lwali.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works by reconstructing a noisy version of a sentence back into the original sentence in the same langugae. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence. This noise model also helps reduce reliance of the model on the order of words in a sentence which may be different in the source and target languages.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L2,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

Contrary to standard back-translation that uses an independent model to back-translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy (BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (ground-truth) translation.

The paper trains translation model under 3 different settings to compare the performance (Table 1). All training and testing data used was from a standard NMT dataset, WMT'14.

[[File:Table1_lwali.png|600px|center]]

The results show that backtranslation is essential for the proposed system to work properly. The denoising technique alone is below the baseline while big improvements appear when introducing backtranslation.

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper adds each component piece-wise when doing an evaluation to test the impact each piece has on the final score. As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise, back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detract in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from the addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both the semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014, which includes Europarl, Common Crawl and News Commentary for both language pairs plus the UN and the Gigaword corpus for French- English. Moreover, the authors use the same subsets of News Commentary alone to run the separate experiments in order to compare with the semi-supervised scenario.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently, it was trained without denoising and back-translation. The proposed model under a supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance. To improve these results, the authors also suggest to use larger models, longer training times, and incorporating several well-known NMT techniques.

===Qualitative Analysis===

[[File:Table2_lwali.png|600px|center]]

Table 2 shows 4 examples of French to English translations, which shows that the high-quality translations are produces by the proposed system, and this system adequately models non-trivial translation relations. Example 1 and 2 show that the model is able to not only go beyond a literal word-by-word substitution but also model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high-quality translations of long and more complex sentences. However, in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures, which means that the proposed system has limitations. Specially, the authors points that the proposed model has difficulties to preserve some concrete details from source sentences.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention-based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings at the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =

While the idea is interesting and the results are impressive for an unsupervised approach, much of the model had actually already been proposed by other papers that are referenced. The paper doesn't add a lot of new ideas but only builds on existing techniques and combines them in a different way to achieve good experimental results. The paper is not a significant algorithmic contribution.

The results showed that the proposed system performed far worse than the state of the art when used in a supervised setting, which is concerning and shows that the techniques used creates a limitation and a ceiling for performance.

Additionally, there was no rigorous hyperparameter exploration/optimization for the model. As a result, it is difficult to conclude whether the performance limit observed in the constrained supervised model is the absolute limit, or whether this could be overcome in both supervised/unsupervised models with the right constraints to achieve more competitive results.

The best results shown are between two very closely related languages(English and French), and does much worse for English - German, even though English and German are also closely related (but less so than English and French) which suggests that the model may not be successful at translating between distant language pairs. More testing would be interesting to see.

The results comparison could have shown how the semi-supervised version of the model scores compared to other semi-supervised approaches as touched on in the other works section.

Their qualitative analysis just checks whether their proposed unsupervised NMT generates sensible translation. It is limited and it needs further detailed analysis regarding the characteristics and properties of translation which is generated by unsupervised NMT.

* (As pointed out by an annonymous reviewer [https://openreview.net/forum?id=Sy2ogebAW])Future work is vague: “we would like to detect and mitigate the specific causes…” “We also think that a better handling of rare words…” That’s great, but how will you do these things? Do you have specific reasons to think this, or ideas on how to approach them? Otherwise, this is just hand-waving.

= References =
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".
#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."
#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."
#'''[Ravi & Knight, 2011]''' Sujith Ravi and Kevin Knight, "Deciphering foreign language."
#'''[Dou & Knight, 2012]''' Qing Dou and Kevin Knight, "Large scale decipherment for out-of-domain machine translation."
#'''[Johnson et al. 2017]''' Melvin Johnson,et al, "Google’s multilingual neural machine translation system: Enabling zero-shot translation."
#'''[Zhang et al. 2017]''' Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. "Adversarial training for unsupervised bilingual lexicon induction"

Unsupervised Neural Machine Translation

2018-11-21T17:19:10Z

Lwali:

This paper was published in ICLR 2018, authored by Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho.

= Introduction =
The paper presents an unsupervised Neural Machine Translation(NMT) method to machine translation using only monolingual corpora without any alignment between sentences or documents. Monolingual corpora are text corpora that is made up of one language only. This contrasts with the usual Supvervised NMT approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as NMT often requires large parallel corpora to achieve good results, however in reality there are a number of languages that lack parallel pairing, e.g. for German-Russian.

Other authors have recently tried to address this problem as well as semi-supervised approaches but these methods still require a strong cross-lingual signal. The proposed method eliminates the need for a cross-lingual information, relying solely on monolingual data.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an encoder-decoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monolingual corpora to vector embeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 1 shows an example of aligning the word embeddings in English and French.

[[File:Figure1_lwali.png|frame|400px|center|Figure 1: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration.

===Other related work and inspirations===

There have been significant work in statistical deciphering technique to induce a machine translation model from monolingual data. These techniques treat the source language as ciphertext and models the distribution of the ciphertext.

There are also proposals that use techniques other than direct parallel corpora to do machine translation. Some use a third intermediate language that is well connected to 2 other languages that otherwise have little direct resources. Other works use monolingual data in combination with scarce parallel corpora.

The most important contribution to the problem of training a NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directions (e.g. French → English and English → French) and teach each other through reinforcement learning. However this approach still required a large parallel corpus for a warm start, while our paper does not use parallel data.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations. To test the effectiveness of BPE, they limited the vocabulary to the most frequent 50,000 BPE tokens.

The words or BPEs are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT usually are built for one direction translations English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time translating English<math>\leftrightarrow</math>French.
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in background.

[[File:Figure2_lwali.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

Contrary to standard back-translation that uses an independent model to back translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (ground-truth) translation.

The paper trains translation model under 3 different settings to compare the performance (Table 1). All training and testing data used was from a standard NMT dataset, WMT'14.

[[File:Table1_lwali.png|600px|center]]

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

===Qualitative Analysis===

[[File:Table2_lwali.png|600px|center]]

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings in the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =

While the idea is interesting and results are impressive for an unsupervised approach, much of the model had actually already been proposed by other papers that are referenced. The paper doesn't add a lot of new ideas but only builds on existing techniques and combines them in a different way to achieve good experimental results. However it is a great step in this direction.

The results showed that the proposed system performed far worse than state of the art when used in a supervised setting, which is concerning and shows that the techniques used creates a limitation and a ceiling for performance.

The best results shown are between two very closely related languages(English and French), and does much worse for English - German, even though English and German are also closely related (but less so than English and French) which suggests that the model may not be successful at translating between distant language pairs. More testing would be interesting to see.

The results comparison could have shown how the semi-supervised version of the model scores compared to other semi-supervised approaches as touched on in the other works section.

= References =
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".
#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."
#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

Unsupervised Neural Machine Translation

2018-11-21T17:15:29Z

Lwali: /* Other related work and inspirations */

This paper was published in ICLR 2018, authored by Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho.

= Introduction =
The paper presents an unsupervised Neural Machine Translation(NMT) method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual Supvervised NMT approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as NMT often requires large parallel corpora to achieve good results, however in reality there are a number of languages that lack parallel pairing, e.g. for German-Russian.

Other authors have recently tried to address this problem as well as semi-supervised approaches but these methods still require a strong cross-lingual signal. The proposed method eliminates the need for a cross-lingual information, relying solely on monolingual data.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an encoder-decoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 1 shows an example of aligning the word embeddings in English and French.

[[File:Figure1_lwali.png|frame|400px|center|Figure 1: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration.

===Other related work and inspirations===

There have been significant work in statistical deciphering technique to induce a machine translation model from monolingual data. These techniques treat the source language as ciphertext and models the distribution of the ciphertext.

There are also proposals that use techniques other than direct parallel corpora to do machine translation. Some use a third intermediate language that is well connected to 2 other languages that otherwise have little direct resources. Other works use monolingual data in combination with scarce parallel corpora.

The most important contribution to the problem of training a NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directions (e.g. French → English and English → French) and teach each other through reinforcement learning. However this approach still required a large parallel corpus for a warm start, while our paper does not use parallel data.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations. To test the effectiveness of BPE, they limited the vocabulary to the most frequent 50,000 BPE tokens.

The words or BPEs are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT usually are built for one direction translations English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time translating English<math>\leftrightarrow</math>French.
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in background.

[[File:Figure2_lwali.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

Contrary to standard back-translation that uses an independent model to back translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper trains translation model under 3 different settings to compare the performance (Table 1). All training and testing data used was from a standard NMT dataset, WMT'14.

[[File:Table1_lwali.png|600px|center]]

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

===Qualitative Analysis===

[[File:Table2_lwali.png|600px|center]]

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings in the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =

While the idea is interesting and results are impressive for an unsupervised approach, much of the model had actually already been proposed by other papers that are referenced. The paper doesn't add a lot of new ideas but only builds on existing techniques and combines them in a different way to achieve good experimental results. However it is a great step in this direction.

The results showed that the proposed system performed far worse than state of the art when used in a supervised setting, which is concerning and shows that the techniques used creates a limitation and a ceiling for performance.

The best results shown are between two very closely related languages(English and French), and does much worse for English - German, even though English and German are also closely related (but less so than English and French) which suggests that the model may not be successful at translating between distant language pairs. More testing would be interesting to see.

The results comparison could have shown how the semi-supervised version of the model scores compared to other semi-supervised approaches as touched on in the other works section.

= References =
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".
#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."
#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

File:Table2 lwali.png

2018-11-21T15:07:27Z

Lwali:

Unsupervised Neural Machine Translation

2018-11-21T15:06:26Z

Lwali: /* Qualitative Analysis */

This paper was published in ICLR 2018, authored by Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho.

= Introduction =
The paper presents an unsupervised Neural Machine Translation(NMT) method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual Supvervised NMT approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as NMT often requires large parallel corpora to achieve good results, however in reality there are a number of languages that lack parallel pairing, e.g. for German-Russian.

Other authors have recently tried to address this problem as well as semi-supervised approaches but these methods still require a strong cross-lingual signal. The proposed method eliminates the need for a cross-lingual information, relying solely on monolingual data.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an encoder-decoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 1 shows an example of aligning the word embeddings in English and French.

[[File:Figure1_lwali.png|frame|400px|center|Figure 1: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration.

===Other related work and inspirations===

There have been signifiant work in statistical deciphering technique to induce a machine translation model from monolingual data. These techniques treat the source language as ciphertext and models the distribution of the ciphertext.

There are also proposals that use techiniques other than direct parallel corpora to do machine translation. Some use a third intermediate language that is well connected to 2 other languages that otherwise have little direct resources. Other works use monolingual data in combination with scarce parallel corpora.

The most important contribution to training a NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directors and teach each other through reinforcement learning. However this approach still required a large parallel corpus for a warm start.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations. To test the effectiveness of BPE, they limited the vocabulary to the most frequent 50,000 BPE tokens.

The words or BPEs are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT usually are built for one direction translations English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time translating English<math>\leftrightarrow</math>French.
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in background.

[[File:Figure2_lwali.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

Contrary to standard back-translation that uses an independent model to back translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper trains translation model under 3 different settings to compare the performance (Table 1). All training and testing data used was from a standard NMT dataset, WMT'14.

[[File:Table1_lwali.png|600px|center]]

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

===Qualitative Analysis===

[[File:Table2_lwali.png|600px|center]]

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings in the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =

While the idea is interesting and results are impressive for an unsupervised approach, much of the model had actually already been proposed by other papers that are referenced. The paper doesn't add a lot of new ideas but only builds on existing techniques and combines them in a different way to achieve good experimental results. However it is a great step in this direction.

The results showed that the proposed system performed far worse than state of the art when used in a supervised setting, which is concerning and shows that the techniques used creates a limitation and a ceiling for performance.

The best results shown are between two very closely related languages(English and French), and does much worse for English - German, even though English and German are also closely related (but less so than English and French) which suggests that the model may not be successful at translating between distant language pairs. More testing would be interesting to see.

The results comparison could have shown how the semi-supervised version of the model scores compared to other semi-supervised approaches as touched on in the other works section.

= References =
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".
#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."
#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

Unsupervised Neural Machine Translation

2018-11-21T02:44:52Z

Lwali: /* Experiments and Results */

This paper was published in ICLR 2018, authored by Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho.

= Introduction =
The paper presents an unsupervised Neural Machine Translation(NMT) method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual Supvervised NMT approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as NMT often requires large parallel corpora to achieve good results, however in reality there are a number of languages that lack parallel pairing, e.g. for German-Russian.

Other authors have recently tried to address this problem as well as semi-supervised approaches but these methods still require a strong cross-lingual signal. The proposed method eliminates the need for a cross-lingual information, relying solely on monolingual data.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an encoder-decoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 1 shows an example of aligning the word embeddings in English and French.

[[File:Figure1_lwali.png|frame|400px|center|Figure 1: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration.

===Other related work and inspirations===

There have been signifiant work in statistical deciphering technique to induce a machine translation model from monolingual data. These techniques treat the source language as ciphertext and models the distribution of the ciphertext.

There are also proposals that use techiniques other than direct parallel corpora to do machine translation. Some use a third intermediate language that is well connected to 2 other languages that otherwise have little direct resources. Other works use monolingual data in combination with scarce parallel corpora.

The most important contribution to training a NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directors and teach each other through reinforcement learning. However this approach still required a large parallel corpus for a warm start.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations. To test the effectiveness of BPE, they limited the vocabulary to the most frequent 50,000 BPE tokens.

The words or BPEs are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT usually are built for one direction translations English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time translating English<math>\leftrightarrow</math>French.
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in background.

[[File:Figure2_lwali.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

Contrary to standard back-translation that uses an independent model to back translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper trains translation model under 3 different settings to compare the performance (Table 1). All training and testing data used was from a standard NMT dataset, WMT'14.

[[File:Table1_lwali.png|600px|center]]

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

===Qualitative Analysis===

[[File:Table2.png|600px|center]]

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings in the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =

While the idea is interesting and results are impressive for an unsupervised approach, much of the model had actually already been proposed by other papers that are referenced. The paper doesn't add a lot of new ideas but only builds on existing techniques and combines them in a different way to achieve good experimental results. However it is a great step in this direction.

The results showed that the proposed system performed far worse than state of the art when used in a supervised setting, which is concerning and shows that the techniques used creates a limitation and a ceiling for performance.

The best results shown are between two very closely related languages(English and French), and does much worse for English - German, even though English and German are also closely related (but less so than English and French) which suggests that the model may not be successful at translating between distant language pairs. More testing would be interesting to see.

The results comparison could have shown how the semi-supervised version of the model scores compared to other semi-supervised approaches as touched on in the other works section.

= References =
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".
#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."
#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

stat946F18

2018-11-21T02:26:57Z

Lwali: /* Record your contributions here [https://docs.google.com/spreadsheets/d/1SxkjNfhOg_eXWpUnVHuIP93E6tEiXEdpm68dQGencgE/edit?usp=sharing] */

== [[F18-STAT946-Proposal| Project Proposal ]] ==

=Paper presentation=

[https://goo.gl/forms/8NucSpF36K6IUZ0V2 Your feedback on presentations]

= Record your contributions here [https://docs.google.com/spreadsheets/d/1SxkjNfhOg_eXWpUnVHuIP93E6tEiXEdpm68dQGencgE/edit?usp=sharing]=

Use the following notations:

P: You have written a summary/critique on the paper.

T: You had a technical contribution on a paper (excluding the paper that you present).

E: You had an editorial contribution on a paper (excluding the paper that you present).

{| class="wikitable"

{| border="1" cellpadding="3"
|-
|width="60pt"|Date
|width="100pt"|Name
|width="30pt"|Paper number
|width="700pt"|Title
|width="30pt"|Link to the paper
|width="30pt"|Link to the summary
|-
|Feb 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Unsupervised_Machine_Translation_Using_Monolingual_Corpora_Only Summary]]
|-
|Oct 25 || Dhruv Kumar || 1 || Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs || [https://openreview.net/pdf?id=rkRwGg-0Z Paper] ||
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Beyond_Word_Importance_Contextual_Decomposition_to_Extract_Interactions_from_LSTMs Summary]
[https://wiki.math.uwaterloo.ca/statwiki/images/e/ea/Beyond_Word_Importance.pdf Slides]
|-
|Oct 25 || Amirpasha Ghabussi || 2 || DCN+: Mixed Objective And Deep Residual Coattention for Question Answering || [https://openreview.net/pdf?id=H1meywxRW Paper] ||
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DCN_plus:_Mixed_Objective_And_Deep_Residual_Coattention_for_Question_Answering Summary]
|-
|Oct 25 || Juan Carrillo || 3 || Hierarchical Representations for Efficient Architecture Search || [https://arxiv.org/abs/1711.00436 Paper] ||
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Hierarchical_Representations_for_Efficient_Architecture_Search Summary]
[https://wiki.math.uwaterloo.ca/statwiki/images/1/15/HierarchicalRep-slides.pdf Slides]
|-
|Oct 30 || Manpreet Singh Minhas || 4 || End-to-end Active Object Tracking via Reinforcement Learning || [http://proceedings.mlr.press/v80/luo18a/luo18a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=End_to_end_Active_Object_Tracking_via_Reinforcement_Learning Summary]
|-
|Oct 30 || Marvin Pafla || 5 || Fairness Without Demographics in Repeated Loss Minimization || [http://proceedings.mlr.press/v80/hashimoto18a.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization Summary]
|-
|Oct 30 || Glen Chalatov || 6 || Pixels to Graphs by Associative Embedding || [http://papers.nips.cc/paper/6812-pixels-to-graphs-by-associative-embedding Paper] ||
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Pixels_to_Graphs_by_Associative_Embedding Summary]
|-
|Nov 1 || Sriram Ganapathi Subramanian || 7 ||Differentiable plasticity: training plastic neural networks with backpropagation || [http://proceedings.mlr.press/v80/miconi18a.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/differentiableplasticity Summary]
[https://wiki.math.uwaterloo.ca/statwiki/images/3/3c/Deep_learning_course_presentation.pdf Slides]
|-
|Nov 1 || Hadi Nekoei || 8 || Synthesizing Programs for Images using Reinforced Adversarial Learning || [http://proceedings.mlr.press/v80/ganin18a.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Synthesizing_Programs_for_Images_usingReinforced_Adversarial_Learning Summary]
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Synthesizing_Programs_for_Images_using_Reinforced_Adversarial_Learning.pdf Slides]
|-
|Nov 1 || Henry Chen || 9 || DeepVO: Towards end-to-end visual odometry with deep Recurrent Convolutional Neural Networks || [https://ieeexplore.ieee.org/abstract/document/7989236 Paper] ||
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DeepVO_Towards_end_to_end_visual_odometry_with_deep_RNN Summary]
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:DeepVO_Presentation_Henry.pdf Slides]
|-
|Nov 6 || Nargess Heydari || 10 ||Wavelet Pooling For Convolutional Neural Networks Networks || [https://openreview.net/pdf?id=rkhlb8lCZ Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Wavelet_Pooling_For_Convolutional_Neural_Networks Summary] [https://wiki.math.uwaterloo.ca/statwiki/images/1/1a/Wavelet_Pooling_for_Convolutional_Neural_Networks.pptx Slides]
|-
|Nov 6 || Aravind Ravi || 11 || Towards Image Understanding from Deep Compression Without Decoding || [https://openreview.net/forum?id=HkXWCMbRW Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Towards_Image_Understanding_From_Deep_Compression_Without_Decoding Summary]
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:DL_STAT946_PPT_AravindRavi.pdf Slides]
|-
|Nov 6 || Ronald Feng || 12 || Learning to Teach || [https://openreview.net/pdf?id=HJewuJWCZ Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_to_Teach Summary]
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:946_L2T_slides.pdf Slides]
|-
|Nov 8 || Neel Bhatt || 13 || Annotating Object Instances with a Polygon-RNN || [https://www.cs.utoronto.ca/~fidler/papers/paper_polyrnn.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Annotating_Object_Instances_with_a_Polygon_RNN Summary] [https://wiki.math.uwaterloo.ca/statwiki/images/a/af/ANNOTATING_OBJECT_INSTANCES_NEEL_BHATT.pdf Slides]
|-
|Nov 8 || Jacob Manuel || 14 || Co-teaching: Robust Training Deep Neural Networks with Extremely Noisy Labels || [https://arxiv.org/pdf/1804.06872.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Co-Teaching Summary] [https://wiki.math.uwaterloo.ca/statwiki/images/3/33/Co-Teaching.pdf Slides]
|-
|Nov 8 || Charupriya Sharma|| 15 || A Bayesian Perspective on Generalization and Stochastic Gradient Descent|| [https://openreview.net/pdf?id=BJij4yg0Z Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_Bayesian_Perspective_on_Generalization_and_Stochastic_Gradient_Descent Summary]
|-
|NOv 13 || Sagar Rajendran || 16 || Zero-Shot Visual Imitation || [https://openreview.net/pdf?id=BkisuzWRW Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Zero-Shot_Visual_Imitation Summary]
|-

|Nov 13 || Ruijie Zhang || 17 || Searching for Efficient Multi-Scale Architectures for Dense Image Prediction || [https://arxiv.org/pdf/1809.04184.pdf Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Searching_For_Efficient_Multi_Scale_Architectures_For_Dense_Image_Prediction Summary]
|-
|Nov 13 || Neil Budnarain || 18 || Predicting Floor Level For 911 Calls with Neural Networks and Smartphone Sensor Data || [https://openreview.net/pdf?id=ryBnUWb0b Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Predicting_Floor_Level_For_911_Calls_with_Neural_Network_and_Smartphone_Sensor_Data Summary]
|-
|NOv 15 || Zheng Ma || 19 || Reinforcement Learning of Theorem Proving || [https://arxiv.org/abs/1805.07563 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Reinforcement_Learning_of_Theorem_Proving Summary] [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:zheng_946_presentation.pdf Slides]
|-
|Nov 15 || Abdul Khader Naik || 20 || Multi-View Data Generation Without View Supervision || [https://openreview.net/pdf?id=ryRh0bb0Z Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=MULTI-VIEW_DATA_GENERATION_WITHOUT_VIEW_SUPERVISION Summary]
|-
|Nov 15 || Johra Muhammad Moosa || 21 || Attend and Predict: Understanding Gene Regulation by Selective Attention on Chromatin || [https://papers.nips.cc/paper/7255-attend-and-predict-understanding-gene-regulation-by-selective-attention-on-chromatin.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Attend_and_Predict:_Understanding_Gene_Regulation_by_Selective_Attention_on_Chromatin Summary] [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Attend_and_Predict.pdf Slides]
|-
|NOv 20 || Zahra Rezapour Siahgourabi || 22 ||Robot Learning in Homes: Improving Generalization and Reducing Dataset Bias ||[https://arxiv.org/pdf/1807.07049 Paper] ||
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Robot_Learning_in_Homes:_Improving_Generalization_and_Reducing_Dataset_Bias Summary]
|-
|Nov 20 || Shubham Koundinya || 23 || Countering Adversarial Images Using Input Transformations ||[https://openreview.net/pdf?id=SyJ7ClWCb paper] ||
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Countering_Adversarial_Images_Using_Input_Transformations Summary]
|-
|Nov 20 || Salman Khan || 24 || Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples || [http://proceedings.mlr.press/v80/athalye18a.html paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Obfuscated_Gradients_Give_a_False_Sense_of_Security_Circumventing_Defenses_to_Adversarial_Examples Summary]
|-
|NOv 22 ||Soroush Ameli || 25 || Learning to Navigate in Cities Without a Map || [https://arxiv.org/abs/1804.00168 paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_to_Navigate_in_Cities_Without_a_Map Summary]
|-
|Nov 22 ||Ivan Li || 26 || Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction || [https://arxiv.org/pdf/1802.05451v3.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Mapping_Images_to_Scene_Graphs_with_Permutation-Invariant_Structured_Prediction Summary]
|-
|Nov 22 ||Sigeng Chen || 27 ||Conditional Neural Processes || [http://proceedings.mlr.press/v80/garnelo18a/garnelo18a.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=conditional_neural_process Summary]
|-
|Nov 27 || Aileen Li || 28 || Unsupervised Neural Machine Translation ||[https://openreview.net/pdf?id=Sy2ogebAW Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Unsupervised_Neural_Machine_Translation Summary]
|-
|Nov 27 ||Xudong Peng || 29 || Visual Reinforcement Learning with Imagined Goals || [https://arxiv.org/abs/1807.04742 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Visual_Reinforcement_Learning_with_Imagined_Goals Summary]
|-
|Nov 27 ||Xinyue Zhang || 30 || Policy Optimization with Demonstrations || [http://proceedings.mlr.press/v80/kang18a/kang18a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=policy_optimization_with_demonstrations Summary]
|-
|-
|NOv 29 ||Junyi Zhang || 31 || Autoregressive Convolutional Neural Networks for Asynchronous Time Series || [http://proceedings.mlr.press/v80/binkowski18a/binkowski18a.pdf Paper] ||
|-
|Nov 29 ||Travis Bender || 32 || Automatic Goal Generation for Reinforcement Learning Agents || [http://proceedings.mlr.press/v80/florensa18a/florensa18a.pdf Paper] ||
|-
|Nov 29 ||Patrick Li || 33 || Matrix Capsules with EM Routing || [https://openreview.net/pdf?id=HJWLfGWRb Paper] ||
|-
|Nov 30 || Jiazhen Chen || 34 || Learn What Not to Learn: Action Elimination with Deep Reinforcement Learning || [https://arxiv.org/abs/1809.02121 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=learn_what_not_to_learn Summary]
|-
|Nov 30 || Gaurav Sahu || 35 || TBD || ||
|-
|Nov 23 || Kashif Khan || 36 || Wasserstein Auto-Encoders || [https://arxiv.org/pdf/1711.01558.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wasserstein_Auto-encoders Summary]
|-
|Nov 23 || Shala Chen || 37 || A Neural Representation of Sketch Drawings || [https://arxiv.org/pdf/1704.03477.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_neural_representation_of_sketch_drawings Summary]
|-
|Nov 30 || Ki Beom Lee || 38 || Detecting Statistical Interactions from Neural Network Weights|| [https://openreview.net/forum?id=ByOfBggRZ Paper] ||
|-
|Nov 23 || Wesley Fisher || 39 || Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling || [http://proceedings.mlr.press/v80/lee18b/lee18b.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Reinforcement_Learning_in_Continuous_Action_Spaces_a_Case_Study_in_the_Game_of_Simulated_Curling Summary]
|-
||Nov 30|| Ahmed Afify || 40 ||Don't Decay the Learning Rate, Increase the Batch Size || [https://openreview.net/pdf?id=B1Yy1BxCZ Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DON'T_DECAY_THE_LEARNING_RATE_,_INCREASE_THE_BATCH_SIZE Summary]
|

Unsupervised Neural Machine Translation

2018-11-21T02:25:42Z

Lwali:

This paper was published in ICLR 2018, authored by Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho.

= Introduction =
The paper presents an unsupervised Neural Machine Translation(NMT) method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual Supvervised NMT approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as NMT often requires large parallel corpora to achieve good results, however in reality there are a number of languages that lack parallel pairing, e.g. for German-Russian.

Other authors have recently tried to address this problem as well as semi-supervised approaches but these methods still require a strong cross-lingual signal. The proposed method eliminates the need for a cross-lingual information, relying solely on monolingual data.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an encoder-decoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 1 shows an example of aligning the word embeddings in English and French.

[[File:Figure1_lwali.png|frame|400px|center|Figure 1: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration.

===Other related work and inspirations===

There have been signifiant work in statistical deciphering technique to induce a machine translation model from monolingual data. These techniques treat the source language as ciphertext and models the distribution of the ciphertext.

There are also proposals that use techiniques other than direct parallel corpora to do machine translation. Some use a third intermediate language that is well connected to 2 other languages that otherwise have little direct resources. Other works use monolingual data in combination with scarce parallel corpora.

The most important contribution to training a NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directors and teach each other through reinforcement learning. However this approach still required a large parallel corpus for a warm start.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations. To test the effectiveness of BPE, they limited the vocabulary to the most frequent 50,000 BPE tokens.

The words or BPEs are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT usually are built for one direction translations English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time translating English<math>\leftrightarrow</math>French.
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in background.

[[File:Figure2_lwali.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

Contrary to standard back-translation that uses an independent model to back translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper runs the translation model under 3 different settings to compare the performance (Table 1):

[[File:Table1_lwali.png|600px|center]]

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

===Qualitative Analysis===

[[File:Table2.png|600px|center]]

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings in the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =

While the idea is interesting and results are impressive for an unsupervised approach, much of the model had actually already been proposed by other papers that are referenced. The paper doesn't add a lot of new ideas but only builds on existing techniques and combines them in a different way to achieve good experimental results. However it is a great step in this direction.

The results showed that the proposed system performed far worse than state of the art when used in a supervised setting, which is concerning and shows that the techniques used creates a limitation and a ceiling for performance.

The best results shown are between two very closely related languages(English and French), and does much worse for English - German, even though English and German are also closely related (but less so than English and French) which suggests that the model may not be successful at translating between distant language pairs. More testing would be interesting to see.

The results comparison could have shown how the semi-supervised version of the model scores compared to other semi-supervised approaches as touched on in the other works section.

= References =
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".
#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."
#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

Unsupervised Neural Machine Translation

2018-11-21T02:25:12Z

Lwali:

This paper was published in ICLR 2018, authored by Mikel Artetxe, Gorka Labaka & Eneko Agirre.

= Introduction =
The paper presents an unsupervised Neural Machine Translation(NMT) method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual Supvervised NMT approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as NMT often requires large parallel corpora to achieve good results, however in reality there are a number of languages that lack parallel pairing, e.g. for German-Russian.

Other authors have recently tried to address this problem as well as semi-supervised approaches but these methods still require a strong cross-lingual signal. The proposed method eliminates the need for a cross-lingual information, relying solely on monolingual data.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an encoder-decoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 1 shows an example of aligning the word embeddings in English and French.

[[File:Figure1_lwali.png|frame|400px|center|Figure 1: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration.

===Other related work and inspirations===

There have been signifiant work in statistical deciphering technique to induce a machine translation model from monolingual data. These techniques treat the source language as ciphertext and models the distribution of the ciphertext.

There are also proposals that use techiniques other than direct parallel corpora to do machine translation. Some use a third intermediate language that is well connected to 2 other languages that otherwise have little direct resources. Other works use monolingual data in combination with scarce parallel corpora.

The most important contribution to training a NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directors and teach each other through reinforcement learning. However this approach still required a large parallel corpus for a warm start.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations. To test the effectiveness of BPE, they limited the vocabulary to the most frequent 50,000 BPE tokens.

The words or BPEs are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT usually are built for one direction translations English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time translating English<math>\leftrightarrow</math>French.
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in background.

[[File:Figure2_lwali.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

Contrary to standard back-translation that uses an independent model to back translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper runs the translation model under 3 different settings to compare the performance (Table 1):

[[File:Table1_lwali.png|600px|center]]

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

===Qualitative Analysis===

[[File:Table2.png|600px|center]]

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings in the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =

While the idea is interesting and results are impressive for an unsupervised approach, much of the model had actually already been proposed by other papers that are referenced. The paper doesn't add a lot of new ideas but only builds on existing techniques and combines them in a different way to achieve good experimental results. However it is a great step in this direction.

The results showed that the proposed system performed far worse than state of the art when used in a supervised setting, which is concerning and shows that the techniques used creates a limitation and a ceiling for performance.

The best results shown are between two very closely related languages(English and French), and does much worse for English - German, even though English and German are also closely related (but less so than English and French) which suggests that the model may not be successful at translating between distant language pairs. More testing would be interesting to see.

The results comparison could have shown how the semi-supervised version of the model scores compared to other semi-supervised approaches as touched on in the other works section.

= References =
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".
#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."
#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

Unsupervised Neural Machine Translation

2018-11-21T02:23:21Z

Lwali: /* Experiments and Results */

= Introduction =
The paper presents an unsupervised Neural Machine Translation(NMT) method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual Supvervised NMT approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as NMT often requires large parallel corpora to achieve good results, however in reality there are a number of languages that lack parallel pairing, e.g. for German-Russian.

Other authors have recently tried to address this problem as well as semi-supervised approaches but these methods still require a strong cross-lingual signal. The proposed method eliminates the need for a cross-lingual information, relying solely on monolingual data.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an encoder-decoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 1 shows an example of aligning the word embeddings in English and French.

[[File:Figure1_lwali.png|frame|400px|center|Figure 1: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration.

===Other related work and inspirations===

There have been signifiant work in statistical deciphering technique to induce a machine translation model from monolingual data. These techniques treat the source language as ciphertext and models the distribution of the ciphertext.

There are also proposals that use techiniques other than direct parallel corpora to do machine translation. Some use a third intermediate language that is well connected to 2 other languages that otherwise have little direct resources. Other works use monolingual data in combination with scarce parallel corpora.

The most important contribution to training a NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directors and teach each other through reinforcement learning. However this approach still required a large parallel corpus for a warm start.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations. To test the effectiveness of BPE, they limited the vocabulary to the most frequent 50,000 BPE tokens.

The words or BPEs are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT usually are built for one direction translations English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time translating English<math>\leftrightarrow</math>French.
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in background.

[[File:Figure2_lwali.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

Contrary to standard back-translation that uses an independent model to back translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper runs the translation model under 3 different settings to compare the performance (Table 1):

[[File:Table1_lwali.png|600px|center]]

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

===Qualitative Analysis===

[[File:Table2.png|600px|center]]

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings in the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =

While the idea is interesting and results are impressive for an unsupervised approach, much of the model had actually already been proposed by other papers that are referenced. The paper doesn't add a lot of new ideas but only builds on existing techniques and combines them in a different way to achieve good experimental results. However it is a great step in this direction.

The results showed that the proposed system performed far worse than state of the art when used in a supervised setting, which is concerning and shows that the techniques used creates a limitation and a ceiling for performance.

The best results shown are between two very closely related languages(English and French), and does much worse for English - German, even though English and German are also closely related (but less so than English and French) which suggests that the model may not be successful at translating between distant language pairs. More testing would be interesting to see.

The results comparison could have shown how the semi-supervised version of the model scores compared to other semi-supervised approaches as touched on in the other works section.

= References =
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".
#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."
#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

File:Table1 lwali.png

2018-11-21T02:23:05Z

Lwali:

File:Table1.png

2018-11-21T02:22:31Z

Lwali: Lwali uploaded a new version of File:Table1.png

Unsupervised Neural Machine Translation

2018-11-21T02:22:16Z

Lwali: /* Methodology */

= Introduction =
The paper presents an unsupervised Neural Machine Translation(NMT) method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual Supvervised NMT approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as NMT often requires large parallel corpora to achieve good results, however in reality there are a number of languages that lack parallel pairing, e.g. for German-Russian.

Other authors have recently tried to address this problem as well as semi-supervised approaches but these methods still require a strong cross-lingual signal. The proposed method eliminates the need for a cross-lingual information, relying solely on monolingual data.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an encoder-decoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 1 shows an example of aligning the word embeddings in English and French.

[[File:Figure1_lwali.png|frame|400px|center|Figure 1: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration.

===Other related work and inspirations===

There have been signifiant work in statistical deciphering technique to induce a machine translation model from monolingual data. These techniques treat the source language as ciphertext and models the distribution of the ciphertext.

There are also proposals that use techiniques other than direct parallel corpora to do machine translation. Some use a third intermediate language that is well connected to 2 other languages that otherwise have little direct resources. Other works use monolingual data in combination with scarce parallel corpora.

The most important contribution to training a NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directors and teach each other through reinforcement learning. However this approach still required a large parallel corpus for a warm start.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations. To test the effectiveness of BPE, they limited the vocabulary to the most frequent 50,000 BPE tokens.

The words or BPEs are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT usually are built for one direction translations English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time translating English<math>\leftrightarrow</math>French.
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in background.

[[File:Figure2_lwali.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

Contrary to standard back-translation that uses an independent model to back translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper runs the translation model under 3 different settings to compare the performance (Table 1):

[[File:Table1.png|600px|center]]

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

===Qualitative Analysis===

[[File:Table2.png|600px|center]]

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings in the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =

While the idea is interesting and results are impressive for an unsupervised approach, much of the model had actually already been proposed by other papers that are referenced. The paper doesn't add a lot of new ideas but only builds on existing techniques and combines them in a different way to achieve good experimental results. However it is a great step in this direction.

The results showed that the proposed system performed far worse than state of the art when used in a supervised setting, which is concerning and shows that the techniques used creates a limitation and a ceiling for performance.

The best results shown are between two very closely related languages(English and French), and does much worse for English - German, even though English and German are also closely related (but less so than English and French) which suggests that the model may not be successful at translating between distant language pairs. More testing would be interesting to see.

The results comparison could have shown how the semi-supervised version of the model scores compared to other semi-supervised approaches as touched on in the other works section.

= References =
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".
#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."
#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

Unsupervised Neural Machine Translation

2018-11-21T02:22:00Z

Lwali: /* Word Embedding Alignment */

= Introduction =
The paper presents an unsupervised Neural Machine Translation(NMT) method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual Supvervised NMT approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as NMT often requires large parallel corpora to achieve good results, however in reality there are a number of languages that lack parallel pairing, e.g. for German-Russian.

Other authors have recently tried to address this problem as well as semi-supervised approaches but these methods still require a strong cross-lingual signal. The proposed method eliminates the need for a cross-lingual information, relying solely on monolingual data.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an encoder-decoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 1 shows an example of aligning the word embeddings in English and French.

[[File:Figure1_lwali.png|frame|400px|center|Figure 1: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration.

===Other related work and inspirations===

There have been signifiant work in statistical deciphering technique to induce a machine translation model from monolingual data. These techniques treat the source language as ciphertext and models the distribution of the ciphertext.

There are also proposals that use techiniques other than direct parallel corpora to do machine translation. Some use a third intermediate language that is well connected to 2 other languages that otherwise have little direct resources. Other works use monolingual data in combination with scarce parallel corpora.

The most important contribution to training a NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directors and teach each other through reinforcement learning. However this approach still required a large parallel corpus for a warm start.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations. To test the effectiveness of BPE, they limited the vocabulary to the most frequent 50,000 BPE tokens.

The words or BPEs are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT usually are built for one direction translations English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time translating English<math>\leftrightarrow</math>French.
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in background.

[[File:Figure1.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

Contrary to standard back-translation that uses an independent model to back translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper runs the translation model under 3 different settings to compare the performance (Table 1):

[[File:Table1.png|600px|center]]

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

===Qualitative Analysis===

[[File:Table2.png|600px|center]]

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings in the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =

While the idea is interesting and results are impressive for an unsupervised approach, much of the model had actually already been proposed by other papers that are referenced. The paper doesn't add a lot of new ideas but only builds on existing techniques and combines them in a different way to achieve good experimental results. However it is a great step in this direction.

The results showed that the proposed system performed far worse than state of the art when used in a supervised setting, which is concerning and shows that the techniques used creates a limitation and a ceiling for performance.

The best results shown are between two very closely related languages(English and French), and does much worse for English - German, even though English and German are also closely related (but less so than English and French) which suggests that the model may not be successful at translating between distant language pairs. More testing would be interesting to see.

The results comparison could have shown how the semi-supervised version of the model scores compared to other semi-supervised approaches as touched on in the other works section.

= References =
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".
#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."
#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

File:Figure1 lwali.png

2018-11-21T02:21:07Z

Lwali:

File:Figure2 lwali.png

2018-11-21T02:20:45Z

Lwali:

File:Figure2.png

2018-11-21T02:19:30Z

Lwali: Lwali uploaded a new version of File:Figure2.png

File:Figure1.png

2018-11-21T02:16:23Z

Lwali: Lwali uploaded a new version of File:Figure1.png

Unsupervised Neural Machine Translation

2018-11-21T02:12:59Z

Lwali: /* Word Embedding Alignment */

= Introduction =
The paper presents an unsupervised Neural Machine Translation(NMT) method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual Supvervised NMT approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as NMT often requires large parallel corpora to achieve good results, however in reality there are a number of languages that lack parallel pairing, e.g. for German-Russian.

Other authors have recently tried to address this problem as well as semi-supervised approaches but these methods still require a strong cross-lingual signal. The proposed method eliminates the need for a cross-lingual information, relying solely on monolingual data.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an encoder-decoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 1 shows an example of aligning the word embeddings in English and French.

[[File:Figure2.png|frame|400px|center|Figure 1: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration.

===Other related work and inspirations===

There have been signifiant work in statistical deciphering technique to induce a machine translation model from monolingual data. These techniques treat the source language as ciphertext and models the distribution of the ciphertext.

There are also proposals that use techiniques other than direct parallel corpora to do machine translation. Some use a third intermediate language that is well connected to 2 other languages that otherwise have little direct resources. Other works use monolingual data in combination with scarce parallel corpora.

The most important contribution to training a NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directors and teach each other through reinforcement learning. However this approach still required a large parallel corpus for a warm start.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations. To test the effectiveness of BPE, they limited the vocabulary to the most frequent 50,000 BPE tokens.

The words or BPEs are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT usually are built for one direction translations English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time translating English<math>\leftrightarrow</math>French.
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in background.

[[File:Figure1.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

Contrary to standard back-translation that uses an independent model to back translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper runs the translation model under 3 different settings to compare the performance (Table 1):

[[File:Table1.png|600px|center]]

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

===Qualitative Analysis===

[[File:Table2.png|600px|center]]

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings in the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =

While the idea is interesting and results are impressive for an unsupervised approach, much of the model had actually already been proposed by other papers that are referenced. The paper doesn't add a lot of new ideas but only builds on existing techniques and combines them in a different way to achieve good experimental results. However it is a great step in this direction.

The results showed that the proposed system performed far worse than state of the art when used in a supervised setting, which is concerning and shows that the techniques used creates a limitation and a ceiling for performance.

The best results shown are between two very closely related languages(English and French), and does much worse for English - German, even though English and German are also closely related (but less so than English and French) which suggests that the model may not be successful at translating between distant language pairs. More testing would be interesting to see.

The results comparison could have shown how the semi-supervised version of the model scores compared to other semi-supervised approaches as touched on in the other works section.

= References =
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".
#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."
#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

Unsupervised Neural Machine Translation

2018-11-21T02:12:19Z

Lwali: /* Critique */

= Introduction =
The paper presents an unsupervised Neural Machine Translation(NMT) method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual Supvervised NMT approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as NMT often requires large parallel corpora to achieve good results, however in reality there are a number of languages that lack parallel pairing, e.g. for German-Russian.

Other authors have recently tried to address this problem as well as semi-supervised approaches but these methods still require a strong cross-lingual signal. The proposed method eliminates the need for a cross-lingual information, relying solely on monolingual data.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an encoder-decoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 2 shows an example of aligning the word embeddings in English and French.

[[File:Figure2.png|frame|400px|center|Figure 2: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration.

===Other related work and inspirations===

There have been signifiant work in statistical deciphering technique to induce a machine translation model from monolingual data. These techniques treat the source language as ciphertext and models the distribution of the ciphertext.

There are also proposals that use techiniques other than direct parallel corpora to do machine translation. Some use a third intermediate language that is well connected to 2 other languages that otherwise have little direct resources. Other works use monolingual data in combination with scarce parallel corpora.

The most important contribution to training a NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directors and teach each other through reinforcement learning. However this approach still required a large parallel corpus for a warm start.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations. To test the effectiveness of BPE, they limited the vocabulary to the most frequent 50,000 BPE tokens.

The words or BPEs are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT usually are built for one direction translations English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time translating English<math>\leftrightarrow</math>French.
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in background.

[[File:Figure1.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

Contrary to standard back-translation that uses an independent model to back translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper runs the translation model under 3 different settings to compare the performance (Table 1):

[[File:Table1.png|600px|center]]

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

===Qualitative Analysis===

[[File:Table2.png|600px|center]]

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings in the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =

While the idea is interesting and results are impressive for an unsupervised approach, much of the model had actually already been proposed by other papers that are referenced. The paper doesn't add a lot of new ideas but only builds on existing techniques and combines them in a different way to achieve good experimental results. However it is a great step in this direction.

The results showed that the proposed system performed far worse than state of the art when used in a supervised setting, which is concerning and shows that the techniques used creates a limitation and a ceiling for performance.

The best results shown are between two very closely related languages(English and French), and does much worse for English - German, even though English and German are also closely related (but less so than English and French) which suggests that the model may not be successful at translating between distant language pairs. More testing would be interesting to see.

The results comparison could have shown how the semi-supervised version of the model scores compared to other semi-supervised approaches as touched on in the other works section.

= References =
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".
#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."
#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

Unsupervised Neural Machine Translation

2018-11-21T01:41:56Z

Lwali: /* Methodology */

= Introduction =
The paper presents an unsupervised Neural Machine Translation(NMT) method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual Supvervised NMT approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as NMT often requires large parallel corpora to achieve good results, however in reality there are a number of languages that lack parallel pairing, e.g. for German-Russian.

Other authors have recently tried to address this problem as well as semi-supervised approaches but these methods still require a strong cross-lingual signal. The proposed method eliminates the need for a cross-lingual information, relying solely on monolingual data.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an encoder-decoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 2 shows an example of aligning the word embeddings in English and French.

[[File:Figure2.png|frame|400px|center|Figure 2: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration.

===Other related work and inspirations===

There have been signifiant work in statistical deciphering technique to induce a machine translation model from monolingual data. These techniques treat the source language as ciphertext and models the distribution of the ciphertext.

There are also proposals that use techiniques other than direct parallel corpora to do machine translation. Some use a third intermediate language that is well connected to 2 other languages that otherwise have little direct resources. Other works use monolingual data in combination with scarce parallel corpora.

The most important contribution to training a NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directors and teach each other through reinforcement learning. However this approach still required a large parallel corpus for a warm start.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations. To test the effectiveness of BPE, they limited the vocabulary to the most frequent 50,000 BPE tokens.

The words or BPEs are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT usually are built for one direction translations English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time translating English<math>\leftrightarrow</math>French.
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in background.

[[File:Figure1.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

Contrary to standard back-translation that uses an independent model to back translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper runs the translation model under 3 different settings to compare the performance (Table 1):

[[File:Table1.png|600px|center]]

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

===Qualitative Analysis===

[[File:Table2.png|600px|center]]

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings in the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =
#

= References =
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".
#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."
#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

Unsupervised Neural Machine Translation

2018-11-21T01:41:41Z

Lwali: /* References */

= Introduction =
The paper presents an unsupervised Neural Machine Translation(NMT) method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual Supvervised NMT approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as NMT often requires large parallel corpora to achieve good results, however in reality there are a number of languages that lack parallel pairing, e.g. for German-Russian.

Other authors have recently tried to address this problem as well as semi-supervised approaches but these methods still require a strong cross-lingual signal. The proposed method eliminates the need for a cross-lingual information, relying solely on monolingual data.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an encoder-decoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 2 shows an example of aligning the word embeddings in English and French.

[[File:Figure2.png|frame|400px|center|Figure 2: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration.

===Other related work and inspirations===

There have been signifiant work in statistical deciphering technique to induce a machine translation model from monolingual data. These techniques treat the source language as ciphertext and models the distribution of the ciphertext.

There are also proposals that use techiniques other than direct parallel corpora to do machine translation. Some use a third intermediate language that is well connected to 2 other languages that otherwise have little direct resources. Other works use monolingual data in combination with scarce parallel corpora.

The most important contribution to training a NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directors and teach each other through reinforcement learning. However this approach still required a large parallel corpus for a warm start.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations. To test the effectiveness of BPE, they limited the vocabulary to the most frequent 50,000 BPE tokens.

The words or BPEs are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT usually are built for one direction translations English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time translating English<math>\leftrightarrow</math>French.

#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.

#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in background.

[[File:Figure1.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

Contrary to standard back-translation that uses an independent model to back translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper runs the translation model under 3 different settings to compare the performance (Table 1):

[[File:Table1.png|600px|center]]

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

===Qualitative Analysis===

[[File:Table2.png|600px|center]]

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings in the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =
#

= References =
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".
#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."
#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

Unsupervised Neural Machine Translation

2018-11-21T01:40:14Z

Lwali: /* Word Embedding Alignment */

= Introduction =
The paper presents an unsupervised Neural Machine Translation(NMT) method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual Supvervised NMT approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as NMT often requires large parallel corpora to achieve good results, however in reality there are a number of languages that lack parallel pairing, e.g. for German-Russian.

Other authors have recently tried to address this problem as well as semi-supervised approaches but these methods still require a strong cross-lingual signal. The proposed method eliminates the need for a cross-lingual information, relying solely on monolingual data.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an encoder-decoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 2 shows an example of aligning the word embeddings in English and French.

[[File:Figure2.png|frame|400px|center|Figure 2: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration.

===Other related work and inspirations===

There have been signifiant work in statistical deciphering technique to induce a machine translation model from monolingual data. These techniques treat the source language as ciphertext and models the distribution of the ciphertext.

There are also proposals that use techiniques other than direct parallel corpora to do machine translation. Some use a third intermediate language that is well connected to 2 other languages that otherwise have little direct resources. Other works use monolingual data in combination with scarce parallel corpora.

The most important contribution to training a NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directors and teach each other through reinforcement learning. However this approach still required a large parallel corpus for a warm start.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations. To test the effectiveness of BPE, they limited the vocabulary to the most frequent 50,000 BPE tokens.

The words or BPEs are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT usually are built for one direction translations English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time translating English<math>\leftrightarrow</math>French.

#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.

#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in background.

[[File:Figure1.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

Contrary to standard back-translation that uses an independent model to back translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper runs the translation model under 3 different settings to compare the performance (Table 1):

[[File:Table1.png|600px|center]]

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

===Qualitative Analysis===

[[File:Table2.png|600px|center]]

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings in the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =
#

= References =
#'''[Mikolov, 2013]'''Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."

#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".

#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."

#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

Unsupervised Neural Machine Translation

2018-11-21T01:25:45Z

Lwali: /* Methodology */

= Introduction =
The paper presents an unsupervised Neural Machine Translation(NMT) method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual Supvervised NMT approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as NMT often requires large parallel corpora to achieve good results, however in reality there are a number of languages that lack parallel pairing, e.g. for German-Russian.

Other authors have recently tried to address this problem as well as semi-supervised approaches but these methods still require a strong cross-lingual signal. The proposed method eliminates the need for a cross-lingual information, relying solely on monolingual data.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an encoder-decoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 2 shows an example of aligning the word embeddings in English and French.

[[File:Figure2.png|frame|400px|center|Figure 2: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations. To test the effectiveness of BPE, they limited the vocabulary to the most frequent 50,000 BPE tokens.

The words or BPEs are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT usually are built for one direction translations English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time translating English<math>\leftrightarrow</math>French.

#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.

#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in background.

[[File:Figure1.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

Contrary to standard back-translation that uses an independent model to back translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper runs the translation model under 3 different settings to compare the performance (Table 1):

[[File:Table1.png|600px|center]]

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

===Qualitative Analysis===

[[File:Table2.png|600px|center]]

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings in the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =
#

= References =
#'''[Mikolov, 2013]'''Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."

#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".

#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."

#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

Unsupervised Neural Machine Translation

2018-11-21T01:25:22Z

Lwali: /* Methodology */

= Introduction =
The paper presents an unsupervised Neural Machine Translation(NMT) method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual Supvervised NMT approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as NMT often requires large parallel corpora to achieve good results, however in reality there are a number of languages that lack parallel pairing, e.g. for German-Russian.

Other authors have recently tried to address this problem as well as semi-supervised approaches but these methods still require a strong cross-lingual signal. The proposed method eliminates the need for a cross-lingual information, relying solely on monolingual data.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an encoder-decoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 2 shows an example of aligning the word embeddings in English and French.

[[File:Figure2.png|frame|400px|center|Figure 2: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations. To test the effectiveness of BPE, they limited the vocabulary to the most frequent 50,000 tokens.

The words or BPEs are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT usually are built for one direction translations English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time translating English<math>\leftrightarrow</math>French.

#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.

#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in background.

[[File:Figure1.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

Contrary to standard back-translation that uses an independent model to back translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper runs the translation model under 3 different settings to compare the performance (Table 1):

[[File:Table1.png|600px|center]]

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

===Qualitative Analysis===

[[File:Table2.png|600px|center]]

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings in the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =
#

= References =
#'''[Mikolov, 2013]'''Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."

#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".

#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."

#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

Unsupervised Neural Machine Translation

2018-11-21T01:22:10Z

Lwali: /* Unsupervised */

= Introduction =
The paper presents an unsupervised Neural Machine Translation(NMT) method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual Supvervised NMT approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as NMT often requires large parallel corpora to achieve good results, however in reality there are a number of languages that lack parallel pairing, e.g. for German-Russian.

Other authors have recently tried to address this problem as well as semi-supervised approaches but these methods still require a strong cross-lingual signal. The proposed method eliminates the need for a cross-lingual information, relying solely on monolingual data.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an encoder-decoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 2 shows an example of aligning the word embeddings in English and French.

[[File:Figure2.png|frame|400px|center|Figure 2: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The words are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT usually are built for one direction translations English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time translating English<math>\leftrightarrow</math>French.

#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.

#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in background.

[[File:Figure1.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

Contrary to standard back-translation that uses an independent model to back translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper runs the translation model under 3 different settings to compare the performance (Table 1):

[[File:Table1.png|600px|center]]

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

===Qualitative Analysis===

[[File:Table2.png|600px|center]]

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings in the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =
#

= References =
#'''[Mikolov, 2013]'''Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."

#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".

#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."

#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

Unsupervised Neural Machine Translation

2018-11-21T00:45:11Z

Lwali: /* Methodology */

= Introduction =
The paper presents an unsupervised Neural Machine Translation(NMT) method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual Supvervised NMT approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as NMT often requires large parallel corpora to achieve good results, however in reality there are a number of languages that lack parallel pairing, e.g. for German-Russian.

Other authors have recently tried to address this problem as well as semi-supervised approaches but these methods still require a strong cross-lingual signal. The proposed method eliminates the need for a cross-lingual information, relying solely on monolingual data.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an encoder-decoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 2 shows an example of aligning the word embeddings in English and French.

[[File:Figure2.png|frame|400px|center|Figure 2: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The words are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT usually are built for one direction translations English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time translating English<math>\leftrightarrow</math>French.

#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.

#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in background.

[[File:Figure1.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

Contrary to standard back-translation that uses an independent model to back translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper runs the translation model under 3 different settings to compare the performance (Table 1):

[[File:Table1.png|600px|center]]

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper also adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations.

As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

===Qualitative Analysis===

[[File:Table2.png|600px|center]]

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings in the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =
#

= References =
#'''[Mikolov, 2013]'''Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."

#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".

#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."

#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

Unsupervised Neural Machine Translation

2018-11-21T00:14:56Z

Lwali: /* Introduction */

= Introduction =
The paper presents an unsupervised Neural Machine Translation(NMT) method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual Supvervised NMT approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as NMT often requires large parallel corpora to achieve good results, however in reality there are a number of languages that lack parallel pairing, e.g. for German-Russian.

Other authors have recently tried to address this problem as well as semi-supervised approaches but these methods still require a strong cross-lingual signal. The proposed method eliminates the need for a cross-lingual information, relying solely on monolingual data.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an encoder-decoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 2 shows an example of aligning the word embeddings in English and French.

[[File:Figure2.png|frame|400px|center|Figure 2: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The words are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

[[File:Figure1.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper runs the translation model under 3 different settings to compare the performance (Table 1):

[[File:Table1.png|600px|center]]

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper also adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations.

As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

===Qualitative Analysis===

[[File:Table2.png|600px|center]]

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings in the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =
#

= References =
#'''[Mikolov, 2013]'''Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."

#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".

#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."

#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

Unsupervised Neural Machine Translation

2018-11-20T23:56:48Z

Lwali: /* Word Embedding Alignment */

= Introduction =
The paper presents an unsupervised method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual translation approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as there are a large number of languages that lack parallel pairing, e.g. for German-Russian.

The general approach of the methodology is to:

# Using monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an auto-encoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 2 shows an example of aligning the word embeddings in English and French.

[[File:Figure2.png|frame|400px|center|Figure 2: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The words are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

[[File:Figure1.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper runs the translation model under 3 different settings to compare the performance (Table 1):

[[File:Table1.png|600px|center]]

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper also adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations.

As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

===Qualitative Analysis===

[[File:Table2.png|600px|center]]

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings in the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =
#

= References =
#'''[Mikolov, 2013]'''Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."

#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".

#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."

#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

Unsupervised Neural Machine Translation

2018-11-20T23:51:18Z

Lwali: /* Word Embedding Alignment */

= Introduction =
The paper presents an unsupervised method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual translation approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as there are a large number of languages that lack parallel pairing, e.g. for German-Russian.

The general approach of the methodology is to:

# Using monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an auto-encoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 2 shows an example of aligning the word embeddings in English and French.

[[File:Figure2.png|frame|400px|center|Figure 2: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.]]

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The words are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

[[File:Figure1.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper runs the translation model under 3 different settings to compare the performance (Table 1):

[[File:Table1.png|600px|center]]

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper also adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations.

As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

===Qualitative Analysis===

[[File:Table2.png|600px|center]]

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings in the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =
#

= References =
#'''[Mikolov, 2013]'''Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."

#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".

#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."

#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

Unsupervised Neural Machine Translation

2018-11-20T23:16:55Z

Lwali: /* Experiments and Results */

= Introduction =
The paper presents an unsupervised method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual translation approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as there are a large number of languages that lack parallel pairing, e.g. for German-Russian.

The general approach of the methodology is to:

# Using monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an auto-encoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 2 shows the word embeddings in English and French.

[[File:Figure2.png|frame|400px|center|Figure 2: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.]]

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary at each iteration.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The words are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

[[File:Figure1.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper runs the translation model under 3 different settings to compare the performance (Table 1):

[[File:Table1.png|600px|center]]

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper also adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations.

As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

===Qualitative Analysis===

[[File:Table2.png|600px|center]]

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings in the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =
#

= References =
#'''[Mikolov, 2013]'''Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."

#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".

#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."

#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

Unsupervised Neural Machine Translation

2018-11-20T23:15:58Z

Lwali: /* Methodology */

= Introduction =
The paper presents an unsupervised method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual translation approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as there are a large number of languages that lack parallel pairing, e.g. for German-Russian.

The general approach of the methodology is to:

# Using monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an auto-encoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 2 shows the word embeddings in English and French.

[[File:Figure2.png|frame|400px|center|Figure 2: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.]]

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary at each iteration.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The words are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

[[File:Figure1.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper runs the translation model under 3 different settings to compare the performance (Table 1):

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper also adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations.

As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

*Insert Table1*

===Qualitative Analysis===

*Insert Table2*

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings in the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =
#

= References =
#'''[Mikolov, 2013]'''Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."

#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".

#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."

#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

Unsupervised Neural Machine Translation

2018-11-20T23:15:04Z

Lwali: /* Word Embedding Alignment */

= Introduction =
The paper presents an unsupervised method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual translation approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as there are a large number of languages that lack parallel pairing, e.g. for German-Russian.

The general approach of the methodology is to:

# Using monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an auto-encoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 2 shows the word embeddings in English and French.

[[File:Figure2.png|frame|400px|center|Figure 2: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.]]

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary at each iteration.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The words are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

*insert Figure1*

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper runs the translation model under 3 different settings to compare the performance (Table 1):

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper also adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations.

As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

*Insert Table1*

===Qualitative Analysis===

*Insert Table2*

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings in the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =
#

= References =
#'''[Mikolov, 2013]'''Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."

#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".

#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."

#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

File:Table2.png

2018-11-20T23:10:32Z

Lwali:

File:Table1.png

2018-11-20T23:10:25Z

Lwali: Lwali uploaded a new version of File:Table1.png

File:Figure2.png

2018-11-20T23:10:17Z

Lwali: Lwali uploaded a new version of File:Figure2.png

File:Figure1.png

2018-11-20T23:10:08Z

Lwali: Lwali uploaded a new version of File:Figure1.png

Unsupervised Neural Machine Translation

2018-11-20T23:09:19Z

Lwali: /* Conclusions and Future Work */

= Introduction =
The paper presents an unsupervised method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual translation approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as there are a large number of languages that lack parallel pairing, e.g. for German-Russian.

The general approach of the methodology is to:

# Using monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an auto-encoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 2 shows the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.

*insert Figure2

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary at each iteration.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The words are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

*insert Figure1*

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper runs the translation model under 3 different settings to compare the performance (Table 1):

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper also adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations.

As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

*Insert Table1*

===Qualitative Analysis===

*Insert Table2*

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings in the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =
#

= References =
#'''[Mikolov, 2013]'''Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."

#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".

#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."

#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

Unsupervised Neural Machine Translation

2018-11-20T22:49:19Z

Lwali: /* Experiments and Results */

= Introduction =
The paper presents an unsupervised method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual translation approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as there are a large number of languages that lack parallel pairing, e.g. for German-Russian.

The general approach of the methodology is to:

# Using monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an auto-encoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 2 shows the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.

*insert Figure2

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary at each iteration.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The words are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

*insert Figure1*

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper runs the translation model under 3 different settings to compare the performance (Table 1):

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper also adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations.

As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

*Insert Table1*

===Qualitative Analysis===

*Insert Table2*

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

= Critique =
#

= References =
#'''[Mikolov, 2013]'''Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."

#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".

#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."

#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

Unsupervised Neural Machine Translation

2018-11-20T22:49:02Z

Lwali: /* Other Sources */

= Introduction =
The paper presents an unsupervised method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual translation approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as there are a large number of languages that lack parallel pairing, e.g. for German-Russian.

The general approach of the methodology is to:

# Using monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an auto-encoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 2 shows the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.

*insert Figure2

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary at each iteration.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The words are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

*insert Figure1*

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper runs the translation model under 3 different settings to compare the performance (Table 1):

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper also adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations.

As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

*Insert Table1*

===Qualitative Analysis===

*Insert Table2*

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

= Critique =
#

= References =
#'''[Mikolov, 2013]'''Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."

#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".

#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."

#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

Unsupervised Neural Machine Translation

2018-11-20T22:48:50Z

Lwali: /* Results */

= Introduction =
The paper presents an unsupervised method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual translation approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as there are a large number of languages that lack parallel pairing, e.g. for German-Russian.

The general approach of the methodology is to:

# Using monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an auto-encoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 2 shows the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.

*insert Figure2

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary at each iteration.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The words are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

*insert Figure1*

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper runs the translation model under 3 different settings to compare the performance (Table 1):

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper also adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations.

As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

*Insert Table1*

===Qualitative Analysis===

*Insert Table2*

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

= Critique =
#

= Other Sources =
#

= References =
#'''[Mikolov, 2013]'''Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."

#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".

#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."

#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

Unsupervised Neural Machine Translation

2018-11-20T22:48:14Z

Lwali: /* Experiments and Results */

= Introduction =
The paper presents an unsupervised method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual translation approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as there are a large number of languages that lack parallel pairing, e.g. for German-Russian.

The general approach of the methodology is to:

# Using monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an auto-encoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 2 shows the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.

*insert Figure2

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary at each iteration.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The words are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

*insert Figure1*

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper runs the translation model under 3 different settings to compare the performance (Table 1):

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper also adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations.

As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

*Insert Table1*

===Qualitative Analysis===

*Insert Table2*

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

=Results=

= Critique =
#

= Other Sources =
#

= References =
#'''[Mikolov, 2013]'''Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."

#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".

#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."

#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

Unsupervised Neural Machine Translation

2018-11-20T22:47:42Z

Lwali: /* Unsupervised */

= Introduction =
The paper presents an unsupervised method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual translation approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as there are a large number of languages that lack parallel pairing, e.g. for German-Russian.

The general approach of the methodology is to:

# Using monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an auto-encoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 2 shows the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.

*insert Figure2

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary at each iteration.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The words are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

*insert Figure1*

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper runs the translation model under 3 different settings to compare the performance (Table 1):

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper also adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations.

As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

#'''Supervised''': This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

*Insert Table1*

===Qualitative Analysis===

*Insert Table2*

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

=Results=

= Critique =
#

= Other Sources =
#

= References =
#'''[Mikolov, 2013]'''Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."

#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".

#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."

#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."

Unsupervised Neural Machine Translation

2018-11-20T22:47:28Z

Lwali: /* Results */

= Introduction =
The paper presents an unsupervised method to machine translation using only monoligual corpora without any alignment between sentences or documents. Monoligual corpora are text corpora that is made up of one language only. This contrasts with the usual translation approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as there are a large number of languages that lack parallel pairing, e.g. for German-Russian.

The general approach of the methodology is to:

# Using monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an auto-encoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monoligual corpora to vector enbeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so in theory there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 2 shows the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.

*insert Figure2

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary at each iteration.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The words are then converted to word embeddings using word2vec with 300 dimensions, and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.

*insert Figure1*

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (groud-truth) translation.

The paper runs the translation model under 3 different settings to compare the performance (Table 1):

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper also adds each component piece-wise when doing evaluation to test the impact each piece has on the final score. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations.

As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detracts in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

#'''Supervised''': This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently it was trained without denoising and back-translation. The proposed model under supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.

*Insert Table1*

===Qualitative Analysis===

*Insert Table2*

Table 2 shows 4 examples of French to English translations. Example 1 and 2 show that the model is able to model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high quality translations of long and more complex sentences. However in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures.

=Conclusions and Future Work=

=Results=

= Critique =
#

= Other Sources =
#

= References =
#'''[Mikolov, 2013]'''Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."

#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".

#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."

#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."