on using very large target vocabulary for neural machine translation
Overview
This is a summary of the paper by S. Jean, K. Cho, R Memisevic, and Y. Bengio entitled "On Using Very Large Target Vocabulary for Neural Machine Translation" <ref>S. Jean, K. Cho, R Memisevic, and Y. Bengio. "On Using Very Large Target Vocabulary for Neural Machine Translation", 2015.</ref> The paper presents the application of importance sampling for neural machine translation with very large target vocabulary. Despite the advantages of neural networks in translation over the statistical machine translation systems, such as the phrasebased system, they suffer from some technical problems. Most importantly they are limited to work with a small vocabulary data set because of complexity and number of parameters have to be trained. The performance of the current neural nets are rapidly decrease if the number of unidentified words in this target vocabulary increase. In this paper Jean and his colleagues proposed a method of training based on the importance sampling which can uses a large target vocabulary without increasing training complexity. The proposed algorithm demonstrate better performance without losing the efficiency in time or speed.
Methods
Recall that the classic neural machine learning plays as encoderdecoder network. The encoder reads the source sentence x and encode it into a sequence of hidden states of h where [math]h_t=f(x_t,h_{t1})[/math]. In the decoder step, another neural network generates the translation vector of y based on the encoded sequence of hidden states h: [math]p(y_t\,\,y_{\lt t},x)\propto exp\{q(y_{t1}, z_t, c_t)\}[/math] where [math]z_t=g(y_{t1}, z_{t1}, c_t)[/math] and [math]c_t=r(z_{t1}, h_1,..., H_T)[/math]
The objective function which have to be maximized represented by [math]\theta=argmax\sum_{n=1}^{N}\sum_{t=1}^{T_n}logp(y_t^n\,\,y_{\lt t}^n, x^n)[/math]
where [math](x^n, y^n)[/math] is the nth training pair of sentence, and [math]T_n[/math] is the length of nth target sentence [math]y^n[/math]. The proposed model is based on specific implementation of neural machine translation that uses an attention mechanism, as recently proposed in <ref> Bahdanau et al.,NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE, 2014 </ref>. In that the encoder is implemented by a bidirectional recurrent neural network,[math]h_t=[h_i^\leftarrow; h_t^\rightarrow][/math]. The decoder, at each time, computes the context vector ct as a convex sum of the hidden states (h1, . . . , hT ) with the coefficients (α1, . . . , αT) computed by
[math]\alpha_t=\frac{exp\{a(h_t, z_t)\}}{\sum_{k}exp\{a(h_t, z_t)\}}[/math] where a is a feedforward neural network with a single hidden layer. Then the probability of the next target word is
[math]p(y_t\ y_{\lt t}, x)=\frac{1}{Z} exp\{W_t^T\phi(y_{t1}, z_t, c_t)+b_t\}[/math]. In that [math]\phi[/math] is an affine transformation followed by a nonlinear activation, [math]w_t[/math] and [math]b_t[/math] are the target word vector and the target word bias, respectively. Z is the normalization constant computed by
[math] Z=\sum_{k:y_k\in V}exp\{W_t^T\phi(y_{t1}, z_t, c_t)+b_t[/math] where V is set of all the target words.
The dot product between the feature [math]\phi(y_{t1}, z_t, c_t)[/math] and [math]w_t[/math] is required to be done for all words in target vocabulary that is computationally complex and time consuming.
The approach of this paper uses only a subset of sampled target words as a align vector to maximize Eq (6), instead of all the likely target words. The most naïve way to select a subset of target words is selection of K most frequent words. However, This skipping words from training processes is in contrast with using a large vocabulary, because practically we removed a bunch of words from target dictionary. Jean et al., proposed using an existing word alignment model to align the source and target words in the training corpus and build a dictionary. With the dictionary, for each source sentence, we construct a target word set consisting of the Kmost frequent words (according to the estimated unigram probability) and, using the dictionary, at most [math]k\prime[/math] likely target words for each source word. K and [math]k\prime[/math] may be chosen either to meet the computational requirement or to maximize the translation performance on the development set.
In order to avoid the growing complexity of computing the normalization constant, the authors proposed to use only a small subset [math]v\prime[/math] of the target vocabulary at each update<ref>
Bengio and Sen´ et al, Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model ,IEEEXplor, 2008
</ref>.
Let us consider the gradient of the log probability of the output in conditional probability of [math]y_t[/math]. The gradient is composed of a positive and negative part:
[math]\bigtriangledown=logp(y_tY_{\lt t}, x_t)=\bigtriangledown \mathbf\varepsilon(y_t)\sum_{k:y_k\in V} p(y_ky_{\lt t}, x) \bigtriangledown \mathbf\varepsilon(y_t) [/math]
where the energy [math]\mathbf\varepsilon[/math] is defined as [math]\mathbf\varepsilon(y_i)=W_j^T\phi(y_{j1}, Z_j, C_j)+b_j[/math]. The second term of gradiant is in essence the expected gradiant of the energy as [math]\mathbb E_P[\bigtriangledown \epsilon(y)][/math] where P denotes [math]p(yy_{\lt t}, x)[/math].
The idea of the proposed approach is to approximate this expectation of the gradient by importance sampling with a small number of samples. Given a predefined proposal distribution Q and a set [math]v\prime[/math] of samples from Q, we approximate the expectation with
[math]\mathbb E_P[\bigtriangledown \epsilon(y)][/math] where P denotes [math]p(yy_{\lt t}, x)\approx \sum_{k:y_k\in V\prime} \frac{w_k}{\sum_{k\prime:y_k\prime\in V\prime}w_k\prime}\epsilon(y_k)[/math] where [math]w_k=exp{\epsilon(y_k)log Q(y_k)}[/math]
In practice, the training corpus is partitioned and a subset [math]v\prime[/math] of the target vocabulary is defined for each partition prior to training. Before training begins, each target sentence in the training corpus is sequentially examined and accumulate unique target words until the number of unique target words reaches the predefined threshold τ . The accumulated vocabulary will be used for this partition of the corpus during training. This processes is repeated until the end of the training set is reached. 0 In this approach the alignments between the target words and source locations via the alignment model is obtained. This is useful when the model generated an Un token. Once a translation is generated given a source sentence, each Un may be replaced using a translationspecific technique based on the aligned source word. The authors in the experiment, replaced each Un token with the aligned source word or its most likely translation determined by another word alignment model. The proposed approach was evaluated in English>French and EnglishGerman translation. The neural machine translation model was trained by the bilingual, parallel corpora made available as part of WMT’14. The data sets were used for English to French were European v7, Common Crawl, UN, News Commentary, Gigaword. The data sets for EnglishGerman were Europarl v7, Common Crawl, News Commentary. The models were evaluated on the WMT’14 test set (newstest 2014)3 , while the concatenation of newstest2012 and newstest2013 is used for model selection (development set). Table 1 presents data coverage w.r.t. the vocabulary size, on the target side.
Setting
As a baseline for English→French translation, the authors used the RNNsearch model proposed by (Bahdanau et al., 2014), with 30,000 source and target words and also another RNNsearch was trained for English→German translation with 50,000 source and target words. Using the proposed approach another set of RNNsearch models with much larger vocabularies of 500,000 source and target words was trained for each language pair. Different shortlist sizes used during training: 15,000 and 30,000 for English→French, and 15,000 and 50,000 for English→German. The best performance on the development set were evaluated and reported every twelve hours. For both language pairs, new models were trained with shortlist size of 15, 000 and 50, 000 by reshuffling the dataset at the beginning of each epoch. While this causes a nonnegligible amount of overhead, such a change allows words to be contrasted with different sets of other words each epoch. The beam search was used to generate a translation given a source. The authors keep a set of 12 hypotheses and normalize probabilities by the length of the candidate sentences which was chosen to maximize the performance on the development set, for K ∈ {15k, 30k, 50k} and K0 ∈ {10, 20}. They test using a bilingual dictionary to accelerate decoding and to replace unknown words in translations.
Results
The results for English> French translation obtained by the trained models with very large target vocabularies compared with results of previous models reported in Table below.
Method  RNNsearch  RNNsearchLV  Phrasebased SMT (cHO et al)  Phrasebased SMT (Durrani et al)  

BASIC NMT  29.97 (26.58)  32.68 (28.76)  30.6  33.3  37.03 
+ Candidate List
+ UNK Replace 
33.08 (29.08)  33.36 (29.32)
34.11 (29.98) 

33.1 
33.3  37.03 
+ Reshuffle (tau=50)    34.6 (30.53)    33.3  37.03 
+ Ensemble    37.19 (31.98)  37.5  33.3  3703 
And the results for English>German translation in Table below.
Method  RNNsearch  RNNsearchLV  Phrasebased SMT 

BASIC NMT  16.46 (17.13)  16.95 (17.85)  20.67 
+ Candidate List
+ UNK Replace 
18.97 (19.16)  17.46 (18.00)
18.89 (19.03) 
20.67 
+ Reshuffle (tau=50)    19.4  20.67 
+ Ensemble    21.59  20.67 
It is clear that the RNNsearchLV outperforms the baseline RNNsearch. In the case of the English→French task, RNNsearchLV approached the performance level of the previous best single neural machine translation (NMT) model, even without any translationspecific techniques. With these, however, the RNNsearchLV outperformed it. The performance of the RNNsearchLV is also better than that of a standard phrasebased translation system. For English→German, the RNNsearchLV outperformed the baseline before unknown word replacement, but after doing so, the two systems performed similarly. A higher large vocabulary singlemodel performance is achieved by reshuffling the dataset. In this case, we were able to surpass the previously reported best translation result on this task by building an ensemble of 8 models. With τ = 15, 000, the RNNsearchLV performance worsened a little, with best BLEU scores, without reshuffling, of 33.76 and 18.59 respectively for English→French and English→German.
The timing information of decoding for different models were presented in Table 3. While decoding from RNNsearchLV with the full target vocabulary is slowest, the speed substantially improves if a candidate list for decoding each translation is used. The influence of the target vocabulary when translating the test sentences by using the union of a fixed set of 30, 000 common words and (at most) K0 likely candidates for each source word was evaluated for English→French with size of 30, 000. The performance of the system is comparable to the baseline when Uns not replaced, but there is not as much improvement when doing so. The authors found that K is inversely correlated with t. The singlemodel test BLEU scores for English→French with respect to the number of dictionary entries [math]k\prime[/math] allowed for each source word is presented in figure below
Conclusion
Using the importance sampling an approach was proposed to be used in machine translation with a large target vocabulary without any substantial increase in computational complexity. The BLUE values for the proposed model showed translation performance comparable to the stateoftheart translation systems on both the English→French task and English→German task. On English→French and English→German translation tasks, the neural machine translation models trained using the proposed method performed as well as, or better than, those using only limited sets of target words, even when replacing unknown words.
Bibliography
<references />