Unsupervised Machine Translation Using Monolingual Corpora Only
Introduction
The paper presents an unsupervised approach to machine translation that uses only monolingual corpora, without any alignment between sentences or documents. A monolingual corpus is a text corpus made up of a single language. This contrasts with the usual translation approach, which uses parallel corpora: two corpora that are direct translations of each other, with the translations aligned by words or sentences.
The general approach of the methodology is to first use an unsupervised word-by-word translation model proposed by [Conneau, 2017], then iteratively improve on the translation by utilizing two architectures:
- A denoising auto-encoder to reconstruct noisy versions of sentences for both source and target languages.
- A discriminator to align the distributions of the source and target languages in a latent space.
Background
Methodology
The objective function proposed by the paper is a combination of three component objective functions:
- Reconstruction loss of the denoising auto-encoder
- Cross domain loss of the auto-encoder
- Adversarial cross entropy loss of the discriminator
Notations
[math]\displaystyle{ \mathcal{W}_S, \mathcal{W}_T }[/math] are the sets of words in the source and target language domains.
[math]\displaystyle{ \mathcal{Z}^S , \mathcal{Z}^T }[/math] are the sets of word embeddings in the source and target language domain.
[math]\displaystyle{ \ell \in \{src, tgt\} }[/math] denotes the source or target language.
[math]\displaystyle{ x \in \mathbb{R}^m }[/math] is a sentence of m words in a particular language [math]\displaystyle{ \ell }[/math].
[math]\displaystyle{ e_{\theta_{enc},\mathcal{Z}}(x, \ell) }[/math] is the encoder; it takes as input [math]\displaystyle{ x }[/math] and [math]\displaystyle{ \ell }[/math] and computes [math]\displaystyle{ z \in \mathbb{R}^m }[/math], which is a sequence of m hidden states, using embedding [math]\displaystyle{ \mathcal{Z}^{\ell} }[/math].
[math]\displaystyle{ d_{\theta_{dec},\mathcal{Z}}(z, \ell) }[/math] is the decoder; it takes as input [math]\displaystyle{ z }[/math] and [math]\displaystyle{ \ell }[/math] and computes [math]\displaystyle{ y \in \mathbb{R}^k }[/math], which is a sequence of k words from vocabulary [math]\displaystyle{ \mathcal{W}^{\ell} }[/math].
Noise Model
The noise model [math]\displaystyle{ C(x) }[/math] used throughout the paper produces a randomly sampled noisy version of sentence [math]\displaystyle{ x }[/math]. Noise is added in two ways:
- Randomly dropping each word in the sentence with probability [math]\displaystyle{ p_{wd} }[/math].
- Slightly shuffling the words in the sentence where each word can be at most [math]\displaystyle{ k }[/math] positions away from its original position.
The authors found [math]\displaystyle{ p_{wd}= 0.1 }[/math] and [math]\displaystyle{ k=3 }[/math] to be good parameters in practice.
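The two corruptions above can be sketched as follows. This is a minimal illustration on tokenized sentences, not the authors' implementation; the local shuffle uses the common trick of perturbing each index by a uniform offset and re-sorting, which moves each word at most k positions.

```python
import random

def add_noise(words, p_wd=0.1, k=3):
    """Sketch of the noise model C(x) on a tokenized sentence.

    - each word is dropped independently with probability p_wd
    - words are slightly shuffled so that no word ends up more than
      k positions away from where it started
    """
    # Word dropout.
    kept = [w for w in words if random.random() >= p_wd]
    # Local shuffle: perturb each index by a uniform offset in [0, k+1)
    # and sort; this displaces each word by at most k positions.
    keys = [i + random.uniform(0, k + 1) for i in range(len(kept))]
    return [w for _, w in sorted(zip(keys, kept), key=lambda t: t[0])]
```

With `p_wd=0` and `k=0` the sentence comes back unchanged, since each perturbed index stays below the next word's base index.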
Reconstruction Loss
This component captures the expected cross entropy loss between [math]\displaystyle{ \hat{x} }[/math] and [math]\displaystyle{ x }[/math], where [math]\displaystyle{ \hat{x} }[/math] is constructed as follows:
- Construct [math]\displaystyle{ C(x) }[/math], noisy version of [math]\displaystyle{ x }[/math] from a language [math]\displaystyle{ \ell }[/math]
- Input [math]\displaystyle{ C(x) }[/math] and language [math]\displaystyle{ \ell }[/math] into the encoder parameterized with [math]\displaystyle{ \theta_{enc} }[/math]
- Input the output from previous step and [math]\displaystyle{ \ell }[/math] into the decoder parameterized with [math]\displaystyle{ \theta_{dec} }[/math]
\begin{align} \mathcal{L}_{auto}(\theta_{enc}, \theta_{dec}, \mathcal{Z}, \ell) = E_{x\sim D_\ell, \hat{x}\sim d(e(C(x),\ell),\ell)}[\Delta(\hat{x},x)] \end{align}
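The three steps above compose into a thin pipeline. The sketch below is illustrative only: the corruption function, encoder, and decoder are placeholders supplied by the caller (the paper uses a sequence-to-sequence model with language-specific embeddings, not shown here), and [math]\displaystyle{ \Delta }[/math] is taken to be token-level cross entropy.

```python
import math

def reconstruction_loss(x_ids, corrupt, encode, decode, lang):
    """Sketch of L_auto: encode a corrupted copy of a sentence, decode it
    back in the same language, and score against the clean original.

    `decode` is assumed to return, for each output position, a list of
    probabilities over the vocabulary.
    """
    z = encode(corrupt(x_ids), lang)      # e(C(x), l)
    pred_dists = decode(z, lang)          # x-hat as per-position distributions
    # Delta(x-hat, x): negative log-likelihood of the original tokens
    return -sum(math.log(dist[t]) for dist, t in zip(pred_dists, x_ids))
```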
Cross Domain Training
This component captures the expected cross entropy loss between [math]\displaystyle{ \hat{x} }[/math] and [math]\displaystyle{ x }[/math], where [math]\displaystyle{ \hat{x} }[/math] is constructed as follows:
- Using the current iteration of the translation model [math]\displaystyle{ M }[/math], construct translation [math]\displaystyle{ M(x) }[/math] in [math]\displaystyle{ \ell_2 }[/math], where [math]\displaystyle{ x }[/math] is from language [math]\displaystyle{ \ell_1 }[/math]. (M is initialized with a separate translation model, discussed later.)
- Construct [math]\displaystyle{ C(M(x)) }[/math], noisy version of translation [math]\displaystyle{ M(x) }[/math].
- Input [math]\displaystyle{ C(M(x)) }[/math] and language [math]\displaystyle{ \ell_2 }[/math] into the encoder parameterized with [math]\displaystyle{ \theta_{enc} }[/math]
- Input the output from previous step and [math]\displaystyle{ \ell_1 }[/math] into the decoder parameterized with [math]\displaystyle{ \theta_{dec} }[/math]
\begin{align} \mathcal{L}_{cd}(\theta_{enc}, \theta_{dec}, \mathcal{Z}, \ell_1,\ell_2) = E_{x\sim D_{\ell_1}, \hat{x}\sim d(e(C(M(x)),\ell_2),\ell_1)}[\Delta(\hat{x},x)] \end{align}
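The cross-domain steps can be sketched the same way. Again, all callables here are placeholders standing in for the paper's translation model M, corruption C, encoder, and decoder, with [math]\displaystyle{ \Delta }[/math] taken to be token-level cross entropy.

```python
import math

def cross_domain_loss(x_ids, translate, corrupt, encode, decode, l1, l2):
    """Sketch of L_cd: translate x from l1 into l2 with the current model M,
    corrupt the translation, encode it as an l2 sentence, decode back into
    l1, and compare against the original x.
    """
    y = translate(x_ids, l1, l2)          # M(x)
    z = encode(corrupt(y), l2)            # e(C(M(x)), l2)
    pred_dists = decode(z, l1)            # decode back into l1
    # Delta(x-hat, x): negative log-likelihood of the original tokens
    return -sum(math.log(dist[t]) for dist, t in zip(pred_dists, x_ids))
```

The only differences from the reconstruction loss are the extra translation step at the front and the language switch between encoding ([math]\displaystyle{ \ell_2 }[/math]) and decoding ([math]\displaystyle{ \ell_1 }[/math]).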
Adversarial Training
Critique
Other Sources
References
- [Lample, 2018] Lample, G., Conneau, A., Ranzato, M., Denoyer, L., "Unsupervised Machine Translation Using Monolingual Corpora Only". arXiv:1711.00043
- [Conneau, 2017] Conneau, A., Lample, G., Ranzato, M., Denoyer, L., Jégou, H., "Word Translation without Parallel Data". arXiv:1710.04087