on using very large target vocabulary for neural machine translation

From statwiki
Revision as of 19:38, 19 November 2015 by Mgohari2 (talk | contribs)
Jump to: navigation, search


This is a summary of the paper by S. Jean, K. Cho, R Memisevic, and Y. Bengio entitled "On Using Very Large Target Vocabulary for Neural Machine Translation". The paper presents the application of importance sampling for neural machine translation with very large target vocabulary. Despite the advantages of neural networks in translation over the statistical machine translation systems, such as the phrase-based system, they suffer from some technical problems. Most importantly they are limited to work with a small vocabulary data set because of complexity and number of parameters have to be trained. The performance of the current neural nets are rapidly decrease if the number of unidentified words in this target vocabulary increase. In this paper Jean and his colleagues proposed a method of training based on the importance sampling which can uses a large target vocabulary without increasing training complexity. The proposed algorithm demonstrate better performance without losing the efficiency in time or speed.


Recall that the classic neural machine learning plays as encoder-decoder network. The encoder reads the source sentence x and encode it into a sequence of hidden states of h where [math]h_t=f(x_t,h_{t-1})[/math]. In the decoder step, another neural network generates the translation vector of y based on the encoded sequence of hidden states h: [math]p(y_t\,|\,y_{\lt t},x)\propto exp\{q(y_{t-1}, z_t, c_t)\}[/math] where [math]z_t=g(y_{t-1}, z_{t-1}, c_t)[/math] and [math]c_t=r(z_{t-1}, h_1,..., H_T)[/math]

The objective function which have to be maximized represented by [math]\theta=argmax\sum_{n=1}^{N}\sum_{t=1}^{T_n}logp(y_t^n\,|\,y_{\lt t}^n, x^n)[/math]

where [math](x^n, y^n)[/math] is the n-th training pair of sentence, and [math]T_n[/math] is the length of n-th target sentence [math]y^n[/math]. The proposed model is based on specific implementation of neural machine translation that uses an attention mechanism, as recently proposed in <ref> Bahdanau et al., 2014 </ref>. In that the encoder is implemented by a bi-directional recurrent neural network,[math]h_t=[h_i^\leftarrow; h_t^\rightarrow][/math]. The decoder, at each time, computes the context vector ct as a convex sum of the hidden states (h1, . . . , hT ) with the coefficients (α1, . . . , αT) computed by

[math]\alpha_t=\frac{exp\{a(h_t, z_t)\}}{\sum_{k}exp\{a(h_t, z_t)\}}[/math] where a is a feedforward neural network with a single hidden layer. Then the probability of the next target word is

[math]p(y_t\ y_{\lt t}, x)=\frac{1}{Z} exp\{W_t^T\phi(y_{t-1}, z_t, c_t)+b_t\}[/math]. In that [math]\phi[/math] is an affine transformation followed by a nonlinear activation, [math]w_t[/math] and [math]b_t[/math] are the target word vector and the target word bias, respectively. Z is the normalization constant computed by

[math] Z=\sum_{k:y_k\in V}exp\{W_t^T\phi(y_{t-1}, z_t, c_t)+b_t[/math] where V is set of all the target words.

The dot product between the feature [math]\phi(y_{t-1}, z_t, c_t)[/math] and [math]w_t[/math] is required to be done for all words in target vocabulary that is computationally complex and time consuming. The approach of this paper uses only a subset of sampled target words as a align vector to maximize Eq (6), instead of all the likely target words. The most naïve way to select a subset of target words is selection of K most frequent words. However, This skipping words from training processes is in contrast with using a large vocabulary, because practically we removed a bunch of words from target dictionary. Jean et al., proposed using an existing word alignment model to align the source and target words in the training corpus and build a dictionary. With the dictionary, for each source sentence, we construct a target word set consisting of the K-most frequent words (according to the estimated unigram probability) and, using the dictionary, at most [math]k\prime[/math] likely target words for each source word. K and [math]k\prime[/math] may be chosen either to meet the computational requirement or to maximize the translation performance on the development set. In order to avoid the growing complexity of computing the normalization constant, the authors proposed to use only a small subset [math]v\prime[/math] of the target vocabulary at each update<ref> Bengio and Sen´ ecal, 2008 </ref>. Let us consider the gradient of the log probability of the output in conditional probability of [math]y_t[/math]. The gradient is composed of a positive and negative part:

[math]\bigtriangledown=logp(y_t|Y_{\lt t}, x_t)=\bigtriangledown \mathbf\varepsilon(y_t)-\sum_{k:y_k\in V} p(y_k|y_{\lt t}, x) \bigtriangledown \mathbf\varepsilon(y_t) [/math] where the energy [math]\mathbf\varepsilon[/math] is defined as [math]\mathbf\varepsilon(y_i)=W_j^T\phi(y_{j-1}, Z_j, C_j)+b_j[/math]. The second term of gradiant is in essence the expected gradiant of the energy as [math]\mathbb E_P[\bigtriangledown \epsilon(y)][/math] where P denotes [math]p(y|y_{\lt t}, x)[/math]. The idea of the proposed approach is to approximate this expectation of the gradient by importance sampling with a small number of samples. Given a predefined proposal distribution Q and a set [math]v\prime[/math] of samples from Q, we approximate the expectation with

[math]\mathbb E_P[\bigtriangledown \epsilon(y)][/math] where P denotes [math]p(y|y_{\lt t}, x)\approx \sum_{k:y_k\in V\prime} \frac{w_k}{\sum_{k\prime:y_k\prime\in V\prime}w_k\prime}\epsilon(y_k)[/math] where [math]w_k=exp{\epsilon(y_k)-log Q(y_k)}[/math]

In practice, the training corpus is partitioned and a subset [math]v\prime[/math] of the target vocabulary is defined for each partition prior to training. Before training begins, each target sentence in the training corpus is sequentially examined and accumulate unique target words until the number of unique target words reaches the predefined threshold τ . The accumulated vocabulary will be used for this partition of the corpus during training. This processes is repeated until the end of the training set is reached. 0 In this approach the alignments between the target words and source locations via the alignment model is obtained. This is useful when the model generated an Un token. Once a translation is generated given a source sentence, each Un may be replaced using a translation-specific technique based on the aligned source word. The authors in the experiment, replaced each Un token with the aligned source word or its most likely translation determined by another word alignment model. The proposed approach was evaluated in English->French and English-German translation. The neural machine translation model was trained by the bilingual, parallel corpora made available as part of WMT’14. The data sets were used for English to French were European v7, Common Crawl, UN, News Commentary, Gigaword. The data sets for English-German were Europarl v7, Common Crawl, News Commentary. The models were evaluated on the WMT’14 test set (news-test 2014)3 , while the concatenation of news-test-2012 and news-test-2013 is used for model selection (development set). Table 1 presents data coverage w.r.t. the vocabulary size, on the target side.


As a baseline for English→French translation, the authors used the RNNsearch model proposed by (Bahdanau et al., 2014), with 30,000 source and target words and also another RNNsearch was trained for English→German translation with 50,000 source and target words. Using the proposed approach another set of RNNsearch models with much larger vocabularies of 500,000 source and target words was trained for each language pair. Different shortlist sizes used during training: 15,000 and 30,000 for English→French, and 15,000 and 50,000 for English→German. The best performance on the development set were evaluated and reported every twelve hours. For both language pairs, new models were trained with shortlist size of 15, 000 and 50, 000 by reshuffling the dataset at the beginning of each epoch. While this causes a non-negligible amount of overhead, such a change allows words to be contrasted with different sets of other words each epoch. The beam search was used to generate a translation given a source. The authors keep a set of 12 hypotheses and normalize probabilities by the length of the candidate sentences which was chosen to maximize the performance on the development set, for K ∈ {15k, 30k, 50k} and K0 ∈ {10, 20}. They test using a bilingual dictionary to accelerate decoding and to replace unknown words in translations.


The results for English-> French translation obtained by the trained models with very large target vocabularies compared with results of previous models reported in Table below.

Method RNNsearch RNNsearch-LV Google Phrase-based SMT (cHO et al) Phrase-based SMT (Durrani et al)
BASIC NMT 29.97 (26.58) 32.68 (28.76) 30.6 33.3 37.03
+ Candidate List

+ UNK Replace

33.08 (29.08) 33.36 (29.32)

34.11 (29.98)



33.3 37.03
+ Reshuffle (tau=50) - 34.6 (30.53) - 33.3 37.03
+ Ensemble - 37.19 (31.98) 37.5 33.3 3703

And the results for English->German translation in Table below.

Method RNNsearch RNNsearch-LV Phrase-based SMT
BASIC NMT 16.46 (17.13) 16.95 (17.85) 20.67
+ Candidate List

+ UNK Replace

18.97 (19.16) 17.46 (18.00)

18.89 (19.03)

+ Reshuffle (tau=50) - 19.4 20.67
+ Ensemble - 21.59 20.67

It is clear that the RNNsearch-LV outperforms the baseline RNNsearch. In the case of the English→French task, RNNsearch-LV approached the performance level of the previous best single neural machine translation (NMT) model, even without any translationspecific techniques. With these, however, the RNNsearch-LV outperformed it. The performance of the RNNsearch-LV is also better than that of a standard phrase-based translation system. For English→German, the RNNsearch-LV outperformed the baseline before unknown word replacement, but after doing so, the two systems performed similarly. A higher large vocabulary single-model performance is achieved by reshuffling the dataset. In this case, we were able to surpass the previously reported best translation result on this task by building an ensemble of 8 models. With τ = 15, 000, the RNNsearch-LV performance worsened a little, with best BLEU scores, without reshuffling, of 33.76 and 18.59 respectively for English→French and English→German.

The timing information of decoding for different models were presented in Table 3. While decoding from RNNsearch-LV with the full target vocabulary is slowest, the speed substantially improves if a candidate list for decoding each translation is used. The influence of the target vocabulary when translating the test sentences by using the union of a fixed set of 30, 000 common words and (at most) K0 likely candidates for each source word was evaluated for English→French with size of 30, 000. The performance of the system is comparable to the baseline when Uns not replaced, but there is not as much improvement when doing so. The authors found that K is inversely correlated with t. The single-model test BLEU scores for English→French with respect to the number of dictionary entries [math]k\prime[/math] allowed for each source word is presented in figure below


Using the importance sampling an approach was proposed to be used in machine translation with a large target vocabulary without any substantial increase in computational complexity. The BLUE values for the proposed model showed translation performance comparable to the state-of-the-art translation systems on both the English→French task and English→German task. On English→French and English→German translation tasks, the neural machine translation models trained using the proposed method performed as well as, or better than, those using only limited sets of target words, even when replacing unknown words.