1. liu, wenqing

# Introduction

Generative adversarial networks(GANs)(Goodfellow et al., 2014) have been enjoying considerable success as a framework of generative models in recent years. The concept is to consecutively train the model distribution and the discriminator in turn, with the goal of reducing the difference between the model distribution and the target distribution measured by the best discriminator possible at each step of the training.

A persisting challenge challenge in the training of GANs is the performance control of the discriminator. When the support of the model distribution and the support of target distribution are disjoint, there exists a discriminator that can perfectly distinguish the model distribution from the target. (Arjovsky & Bottou, 2017). One such discriminator is produced in this situation, the training of the generator comes to complete stop, because the derivative of the so-produced discriminator with respect to the input turns out to be 0. This motivates us to introduce some form of restriction to the choice of discriminator.

In this paper, we propose a novel weight normalization method called spectral normalization that can stabilize the training of discriminator networks. Our normalization enjoys following favorable properties. In this study, we provide explanations of the effectiveness of spectral normalization against other regularization or normalization techniques.

# Model

Let us consider a simple discriminator made of a neural network of the following form, with the input x: $f(x,\theta) = W^{L+1}a_L(W^L(a_{L-1}(W^{L-1}(\cdots a_1(W^1x)\cdots))))$ where $\theta:=W^1,\cdots,W^L, W^{L+1}$ is the learning parameters set, $W^l\in R^{d_l*d_{l-1}}, W^{L+1}\in R^{1*d_L}$, and $a_l$ is an element-wise non-linear activation function. The final output of the discriminator function is given by $D(x,\theta) = A(f(x,\theta))$. The standard formulation of GANs is given by $\min_{G}\max_{D}V(G,D)$ where min and max of G and D are taken over the set of generator and discriminator functions, respectively. The conventional form of $V(G,D)$ is given by $E_{x\sim q_{data}}[\log D(x)] + E_{x'\sim p_G}[\log(1-D(x')$ where $q_{data}$ is the data distribution and $p_G(x)$ is the model generator distribution to be learned through the adversarial min-max optimization. It is known that, for a fixed generator G, the optimal discriminator for this form of $V(G,D)$ is given by $D_G^{*}(x):=q_{data}(x)/(q_{data}(x)+p_G(x))$. We search for the discriminator D from the set of K-lipshitz continuous functions, that is, $\arg\max_{||f||_{Lip}\le k}V(G,D)$, where we mean by $||f||_{lip}$ the smallest value M such that $||f(x)-f(x')||/||x-x'||\le M$ for any x,x', with the norm being the $l_2$ norm. Our spectral normalization controls the Lipschitz constant of the discriminator function $f$ by literally constraining the spectral norm of each layer $g: h_{in}\rightarrow h_{out}$. By definition, Lipschitz norm $||g||_{Lip}$ is equal to $\sup_h\sigma(\nabla g(h))$, where $\sigma(A)$ is the spectral norm of the matrix A, which is equivalent to the largest singular value of A. Therefore, for a linear layer $g(h)=Wh$, the norm is given by $||g||_{Lip}=\sigma(W)$. Observing the following bound:

$||f||_{Lip}\le ||(h_L\rightarrow W^{L+1}h_{L})||_{Lip}*||a_{L}||_{Lip}*||(h_{L-1}\rightarrow W^{L}h_{L-1})||_{Lip}\cdots ||a_1||_{Lip}*||(h_0\rightarrow W^1h_0)||_{Lip}=\prod_{l=1}^{L+1}\sigma(W^l) *\prod_{l=1}^{L} ||a_l||_{Lip}$

Our spectral normalization normalizes the spectral norm of the weight matrix W so that it satisfies the Lipschitz constraint $\sigma(W)=1$:

$\bar{W_{SN}}:= W/\sigma(W)$

In summary, just like what weight normalization does, we reparameterize weight matrix $\bar{W_{SN}}$ as $W/\sigma(W)$ to fix the singular value of weight matrix. Now we can calculate the gradient of new parameter W by chain rule:

$\frac{\partial V(G,D)}{\partial W} = \frac{\partial V(G,D)}{\partial \bar{W_{SN}}}*\frac{\partial \bar{W_{SN}}}{\partial W}$

$\frac{\partial \bar{W_{SN}}}{\partial W_{ij}} = \frac{1}{\sigma(W)}E_{ij}-\frac{1}{\sigma(W)^2}*\frac{\partial \sigma(W)}{\partial(W_{ij})}W=\frac{1}{\sigma(W)}E_{ij}-\frac{[u_1v_1^T]_{ij}}{\sigma(W)^2}W=\frac{1}{\sigma(W)}(E_{ij}-[u_1v_1^T]_{ij}\bar{W_{SN}})$

where $E_{ij}$ is the matrix whose (i,j)-th entry is 1 and zero everywhere else, and $u_1, v_1$ are respectively the first left and right singular vectors of W.

# Training and Results

### Training Method

Two LSTM neural networks were used overall; one to generate a fixed vector representation from the input sequence and another to generate the output sequence from the fixed vector representation. Each neural network had 4 layers and 1000 cells per layer and $\,v$ can be represented by the 8000 real numbers in each cell's memory after the input sequence has been entered. Stochastic gradient descent with a batch size of 128 and learning rate of 0.7 was used. Initial parameters were set using a uniform distribution between -0.08 and 0.08. LSTM does not suffer from the vanishing gradient problem, but it can be affected by exploding gradients which is taken into account by enforcing a hard constraint on the norm of the gradient.

### Scoring Method

Scoring was done using the BLEU (Bilingual Evaluation Understudy) metric. This is an algorithm created for evaluating the quality of machine translated text. This is done by using a modified form of precision to compare a produced translation against a set of reference translations. This metric tends to correlate well with human judgement across a corpus, but performs badly if used to evaluate individual sentences. More information can be found in the BLEU paper and the wikipedia article. These resources both state that the BLEU score is a number between 0 and 1, with closer to 1 corresponding to a better translation. The LSTM paper reports BLEU scores in percentage values.

### Results

The resulting LSTM neural networks outperformed standard Statistical Machine Translation (SMT) with a BLEU score of 34.8 against 33.3 and with certain heuristics or modification, was very close to matching the best performing system. Additionally, it could recognize sentences in both active and passive voice as being similar.

Active Voice: I ate an apple.

Passive Voice: The apple was eaten by me.

An interesting result is the fact that reversing the source sentences (not test sentences) improved the long sentence decoding, which in turn increased the BLEU score from 25.9 to 30.6. While the authors do not have a complete explanation, they theorize the improvement in performance is due to the introduction of many short term dependencies to the data-set, by reversing the source sentences they minimize the time lag between the end of the source and the start of the target sentence. This reduction in the time lag is what the authors believe help the LSTM architecture establish a link between source and target and utilize the memory feature of the network better. Note, that the mean time lag does not change. Given the input sequence $(x_1, \dots, x_T)$ and the target sequence $(y_1, \dots, y_T)$, the sequence of time lags is $\Delta t = (T, \dots, T)$ and $\frac{1}{T} \sum_{\Delta t_i} T = T$. If, however, the input is reversed, the sequence of time lags of corresponding inputs is $\Delta t = (1, 3, \dots, 2T - 1)$ which still has a mean time lag of $\frac{1}{T} \sum_{\Delta t_i} (2i - 1) = \frac{1}{T} \sum_{i = 1}^{T/2} (2i + 2(T-i)) = T$ (assuming even T, but odd T can be shown similarly). Thus, half of the time lags are shorter with the reversed input sequence.

For example, let "I saw the man" be the source sentence, "with the binoculars" be the target sentence, if we concatenate both source and target sentences we have "I saw the man with the binoculars". By reversing the source sentence ("man the saw I") the subject "man" is now closer to the context target "binoculars", compared to if the source sentence is not reversed. Reversing the input sentences leads to more confident predictions in the early parts of the target sentence and to less confident predictions in the later parts. Also, it results in LSTMs with better memory utilization.

In summary the LSTM method has proven to be quite capable of translating long sentences despite potentially long delay between input time steps. However, it still falls short of [Edinburgh's specialised statistical model http://www.statmt.org/OSMOSES/sysdesc.pdf].

### Some developments of LSTM

Dealing with the very rare words are a challenge for the LSTM. A weakness of LSTM is its inability to (correctly) translate the very rare words that comes up with out-of-vocabulary words, i.e. no translation. The long short-term dependencies that are induced in this method can lead to less likelihood for a long sequence words come after the unknown words. In the other hand, these untranslated words impose a longer temporal memory that decrease the overall efficiency of network. Sutskever I ([1]) suggested a method to address the rare word problem. They assumed that if the origin of any undefined word is known then the word can be look up by introducing a post-processing step. This step would replace each unknown word in the system’s output with a translation of its source word. They proposed three strategies to track the source and translate it using either a dictionary or the identity translation.

### Open questions

The results of the paper pose some interesting questions which are not discussed in the paper itself:

1. Instead of reversing the input sequence the target sequence could be reversed. This would change the time lags between corresponding words in a similar way, but instead of reducing the time lag between the first half of corresponding words, it is reduced between the last half of the words. This might allow conclusions about whether the improved performance is purely due to the reduced minimal time lag or whether structure in natural language is also important (e.g. when a short time lag between the first few words is better than a short time lag between the last few words of sentence).
2. For half of the words the time lag increases to more than the average. Thus, they might have only a minor contribution to the model performance. It could be interesting to see how much the performance is affected by leaving those words out of the input sequence. Or more generally, one could ask, how does the performance related to the number of used input words?

# More Formulations of Recurrent Neural Networks

The standard RNN is formalized as follows

$\,h_t=\tanh(W_{hx}x_t+W_{hh}h_{t-1}+b_h)$
$\,o_t=W_{oh}h_t+b_o$

Given sequence of input vectors $\,(x_1,\cdots,x_{T})$, the RNN computes a sequence of hidden states $\,(h_1,\cdots,h_{T})$ and a sequence of output $\,(o_1,\cdots,o_{T})$ by iterating the above equations. $\,W_{hx}$ is the input to hidden weight matrix, $\,W_{hh}$ is the hidden to hidden weight matrix, $\,W_{oh}$ is the hidden to output weight matrix. Vector $\,b_{h}$ and $\,b_{o}$ are the biases. When t=1, the undefined $\,W_{hh}h_{t-1}$ is replace with a special initial bias vector, $\,h_{init}$.

It may seem to train RNNs with gradient descent, but in reality, gradient decays exponentially as it is backpropagated through time. The relation between parameter and dynamics of the RNN is highly unstable, which makes gradient descent ineffective. Thus, it argues that RNN can not learn long-range temporal dependencies when gradient descent is used for training. A good way to deal with inability of gradient descent to learn long-range temporal structure in RNN is known as "Long-Short Term memory". (http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf)

There are different variants of LSTM<ref name=grave> </ref><ref> Gers, Felix, and Jürgen Schmidhuber. "Recurrent nets that time and count." Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on. Vol. 3. IEEE, 2000. </ref><ref> Cho, Kyunghyun, et al. "Learning phrase representations using rnn encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014). </ref> other than the original one proposed by Hochreiter et al.<ref name=lstm> </ref> Greff et al. compare the performance of some different popular variants in their work<ref> Greff, Klaus, et al. "LSTM: A Search Space Odyssey." arXiv preprint arXiv:1503.04069 (2015). </ref> and draw the conclusion that they are about the same. While Jozefowicz, et al. suggest that some architecture can perform better than LSTM on certain tasks<ref> Jozefowicz, Rafal, Wojciech Zaremba, and Ilya Sutskever. "An Empirical Exploration of Recurrent Network Architectures." Proceedings of the 32nd International Conference on Machine Learning (ICML-15). 2015. </ref>.

# Criticisms

There is some concern regarding whether this model will be able to provide a truly scalable solution to MT. In particular, it is not obvious that this model will be able to sufficiently scale to long sentences as is evident in the reported results. The model is severely limited, in general, by working only in the absence of infrequent words. These theoretical limitations alongside sparse experimental results give rise to skepticism about the overarching validity of the model.

# Source

Sutskever, I. Vinyals, O. & Le. Q. V. Sequence to sequence learning with neural networks. In Proc. Advances in Neural Information Processing Systems 27 3104–3112 (2014). <references />