1. liu, wenqing

# Introduction

Generative adversarial networks (GANs) (Goodfellow et al., 2014) have enjoyed considerable success as a framework for generative models in recent years. The idea is to train the model distribution and the discriminator in turn, with the goal of reducing the difference between the model distribution and the target distribution as measured by the best discriminator available at each step of training.

A persisting challenge in the training of GANs is controlling the performance of the discriminator. When the support of the model distribution and the support of the target distribution are disjoint, there exists a discriminator that can perfectly distinguish the model distribution from the target (Arjovsky & Bottou, 2017). Once such a discriminator is produced, the training of the generator comes to a complete stop, because the derivative of the so-produced discriminator with respect to the input turns out to be 0. This motivates us to introduce some form of restriction on the choice of discriminator.

In this paper, we propose a novel weight normalization method called spectral normalization that can stabilize the training of discriminator networks. Our normalization enjoys several favorable properties. In this study, we also explain the effectiveness of spectral normalization in comparison with other regularization and normalization techniques.

# Model

Let us consider a simple discriminator made of a neural network of the following form, with input $x$:

$f(x,\theta) = W^{L+1}a_L(W^L(a_{L-1}(W^{L-1}(\cdots a_1(W^1x)\cdots))))$

where $\theta:=\{W^1,\cdots,W^L, W^{L+1}\}$ is the set of learnable parameters, $W^l\in R^{d_l\times d_{l-1}}$, $W^{L+1}\in R^{1\times d_L}$, and $a_l$ is an element-wise non-linear activation function. The final output of the discriminator is given by $D(x,\theta) = A(f(x,\theta))$ for some activation function $A$ such as the sigmoid.

The standard formulation of GANs is given by $\min_{G}\max_{D}V(G,D)$, where the min and max over G and D are taken over the sets of generator and discriminator functions, respectively. The conventional form of $V(G,D)$ is given by $E_{x\sim q_{data}}[\log D(x)] + E_{x'\sim p_G}[\log(1-D(x'))]$, where $q_{data}$ is the data distribution and $p_G$ is the model (generator) distribution to be learned through the adversarial min-max optimization. It is known that, for a fixed generator G, the optimal discriminator for this form of $V(G,D)$ is given by $D_G^{*}(x):=q_{data}(x)/(q_{data}(x)+p_G(x))$.

We search for the discriminator D within the set of K-Lipschitz continuous functions, that is, $\arg\max_{||f||_{Lip}\le K}V(G,D)$, where $||f||_{Lip}$ denotes the smallest value M such that $||f(x)-f(x')||/||x-x'||\le M$ for any x, x', with the norm being the $l_2$ norm. Our spectral normalization controls the Lipschitz constant of the discriminator function $f$ by literally constraining the spectral norm of each layer $g: h_{in}\rightarrow h_{out}$. By definition, the Lipschitz norm $||g||_{Lip}$ is equal to $\sup_h\sigma(\nabla g(h))$, where $\sigma(A)$ is the spectral norm of the matrix A, i.e., its largest singular value. Therefore, for a linear layer $g(h)=Wh$, the norm is given by $||g||_{Lip}=\sigma(W)$. We observe the following bound:

$||f||_{Lip}\le ||(h_L\rightarrow W^{L+1}h_{L})||_{Lip}\cdot||a_{L}||_{Lip}\cdot||(h_{L-1}\rightarrow W^{L}h_{L-1})||_{Lip}\cdots ||a_1||_{Lip}\cdot||(h_0\rightarrow W^1h_0)||_{Lip}=\prod_{l=1}^{L+1}\sigma(W^l) \cdot\prod_{l=1}^{L} ||a_l||_{Lip}$

Our spectral normalization normalizes the spectral norm of the weight matrix W so that it satisfies the Lipschitz constraint $\sigma(W)=1$:

$\bar{W_{SN}}:= W/\sigma(W)$

In summary, just as in weight normalization, we reparameterize the weight matrix as $\bar{W_{SN}} = W/\sigma(W)$ to fix its largest singular value to one. We can then calculate the gradient with respect to the underlying parameter W by the chain rule:

$\frac{\partial V(G,D)}{\partial W} = \frac{\partial V(G,D)}{\partial \bar{W_{SN}}}*\frac{\partial \bar{W_{SN}}}{\partial W}$

$\frac{\partial \bar{W_{SN}}}{\partial W_{ij}} = \frac{1}{\sigma(W)}E_{ij}-\frac{1}{\sigma(W)^2}\cdot\frac{\partial \sigma(W)}{\partial W_{ij}}W=\frac{1}{\sigma(W)}E_{ij}-\frac{[u_1v_1^T]_{ij}}{\sigma(W)^2}W=\frac{1}{\sigma(W)}(E_{ij}-[u_1v_1^T]_{ij}\bar{W_{SN}})$

where $E_{ij}$ is the matrix whose (i,j)-th entry is 1 and zero everywhere else, and $u_1, v_1$ are respectively the first left and right singular vectors of W.
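As a concrete illustration (our own sketch, not the paper's released code), $\sigma(W)$ can be approximated with a few steps of power iteration, and dividing $W$ by this estimate yields a matrix whose largest singular value is approximately one. The function names below are our own:

```python
import numpy as np

def spectral_norm(W, n_iters=100):
    """Estimate the largest singular value of W by power iteration."""
    rng = np.random.default_rng(0)
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    # With unit vectors u, v, the Rayleigh-style quotient u^T W v -> sigma_1.
    return float(u @ W @ v)

def spectrally_normalize(W):
    """Return W_SN = W / sigma(W)."""
    return W / spectral_norm(W)

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 32))
W_sn = spectrally_normalize(W)
print(np.linalg.norm(W_sn, 2))  # largest singular value, close to 1.0
```

In practice (as in the paper), the estimate is kept between SGD updates, so a single power-iteration step per update suffices; the 100 iterations here are just for a from-scratch demonstration.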

# Spectral Normalization vs. Other Regularization Techniques

The weight normalization introduced by Salimans & Kingma (2016) normalizes the $l_2$ norm of each row vector in the weight matrix. Mathematically, this is equivalent to requiring the following condition on the weight-normalized matrix $\bar{W_{WN}}$:

$\sigma_1(\bar{W_{WN}})^2+\cdots+\sigma_T(\bar{W_{WN}})^2=d_0, \text{where } T=\min(d_i,d_0)$, where $\sigma_t(A)$ is the $t$-th singular value of matrix A.

Note that if $\bar{W_{WN}}$ is the weight-normalized matrix of dimension $d_i\times d_0$, then for a fixed unit vector $h$ the norm $||\bar{W_{WN}}h||_2$ is maximized at $\sqrt{d_0}$ when $\sigma_1(\bar{W_{WN}})=\sqrt{d_0}$ and $\sigma_t(\bar{W_{WN}})=0$ for $t=2, \cdots, T$, which means that $\bar{W_{WN}}$ is of rank one. In order to retain as much of the norm of the input as possible, and hence make the discriminator more sensitive, one would hope to make the norm of $\bar{W_{WN}}h$ large. For weight normalization, however, this comes at the cost of reducing the rank, and hence the number of features to be used by the discriminator. Thus, there is a conflict of interests between weight normalization and our desire to use as many features as possible to distinguish the generator distribution from the target distribution. The former interest often reigns over the latter, inadvertently diminishing the number of features used by the discriminator. Consequently, the algorithm would produce a rather arbitrary model distribution that matches the target distribution only at a select few features.
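This trade-off can be checked numerically (our own illustration, with arbitrary shapes): row normalization fixes the total budget of squared singular values, and the rank-one extreme spends the entire budget on $\sigma_1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 8, 16

# Weight normalization: l2-normalize each row of the weight matrix.
W = rng.standard_normal((n_rows, n_cols))
W_wn = W / np.linalg.norm(W, axis=1, keepdims=True)

# The sum of squared singular values equals the number of unit-norm rows
# (it is the squared Frobenius norm), regardless of how W is rotated.
s = np.linalg.svd(W_wn, compute_uv=False)
print(np.sum(s**2))  # == n_rows

# Rank-one extreme: all rows identical. sigma_1 takes the whole budget,
# so ||W h||_2 can reach sqrt(n_rows), but only one feature direction survives.
r = rng.standard_normal(n_cols)
W_rank1 = np.tile(r / np.linalg.norm(r), (n_rows, 1))
s1 = np.linalg.svd(W_rank1, compute_uv=False)
print(s1[0])  # == sqrt(n_rows); the remaining singular values are ~0
```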

Brock et al. (2016) introduced orthonormal regularization on each weight matrix to stabilize the training of GANs. In their work, they augmented the adversarial objective function by adding the following term:

$||W^TW-I||^2_F$

While this seems to serve the same purpose as spectral normalization, orthonormal regularization is mathematically quite different from our spectral normalization: orthonormal regularization destroys the information about the spectrum by pushing all the singular values toward one, whereas spectral normalization only scales the spectrum so that its maximum is one.
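The contrast can be made concrete with a small sketch (our own, with illustrative names): the orthonormal penalty vanishes only when every singular value is 1, while spectral normalization rescales the whole spectrum by a common factor.

```python
import numpy as np

def orthonormal_penalty(W):
    """The regularizer ||W^T W - I||_F^2 from Brock et al. (2016)."""
    d = W.shape[1]
    return np.linalg.norm(W.T @ W - np.eye(d), 'fro')**2

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8))

# A matrix with orthonormal columns (all singular values 1) has zero penalty.
Q, _ = np.linalg.qr(W)
print(orthonormal_penalty(Q))  # ~0

# Spectral normalization keeps the *shape* of the spectrum, only rescaled
# so that the largest singular value is 1.
s = np.linalg.svd(W, compute_uv=False)
s_sn = np.linalg.svd(W / s[0], compute_uv=False)
print(np.allclose(s_sn, s / s[0]))  # True: same spectrum up to scale
```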

Gulrajani et al. (2017) used the gradient penalty method in combination with WGAN. In their work, they imposed a K-Lipschitz constraint on the discriminator by augmenting the objective function with a regularizer that rewards the function for having a local 1-Lipschitz constant (i.e., $||\nabla_{\hat{x}}f||_2 = 1$) at discrete sets of points of the form $\hat{x}:=\epsilon \tilde{x} + (1-\epsilon)x$ generated by interpolating a generator sample $\tilde{x}$ and a data sample $x$.
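A minimal sketch of this penalty term, assuming a toy linear discriminator $f(x)=w^\top x$ whose input gradient is simply $w$ (for a real neural network one would compute $\nabla_{\hat{x}}f$ with automatic differentiation instead):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
w = rng.standard_normal(dim)        # toy linear discriminator f(x) = w @ x
x_real = rng.standard_normal(dim)   # sample x from the data distribution
x_fake = rng.standard_normal(dim)   # sample x_tilde from the generator

# Interpolated point: x_hat = eps * x_fake + (1 - eps) * x_real, eps ~ U[0,1].
eps = rng.uniform()
x_hat = eps * x_fake + (1 - eps) * x_real
f_val = w @ x_hat                   # discriminator output at x_hat

# For f(x) = w @ x the input gradient at x_hat is w itself, so the
# WGAN-GP regularizer (lambda weight omitted) is:
grad = w
penalty = (np.linalg.norm(grad) - 1.0)**2
print(penalty >= 0.0)  # True; zero exactly when ||grad||_2 = 1
```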

# More Formulations of Recurrent Neural Networks

The standard RNN is formulated as follows:

$\,h_t=\tanh(W_{hx}x_t+W_{hh}h_{t-1}+b_h)$
$\,o_t=W_{oh}h_t+b_o$

Given a sequence of input vectors $\,(x_1,\cdots,x_{T})$, the RNN computes a sequence of hidden states $\,(h_1,\cdots,h_{T})$ and a sequence of outputs $\,(o_1,\cdots,o_{T})$ by iterating the above equations. $\,W_{hx}$ is the input-to-hidden weight matrix, $\,W_{hh}$ is the hidden-to-hidden weight matrix, and $\,W_{oh}$ is the hidden-to-output weight matrix. The vectors $\,b_{h}$ and $\,b_{o}$ are the biases. When t=1, the undefined $\,W_{hh}h_{t-1}$ is replaced with a special initial bias vector, $\,h_{init}$.
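The iteration above can be sketched directly in numpy (our own minimal version; the shapes and names are illustrative, and $h_{init}$ is taken to be the zero vector for simplicity):

```python
import numpy as np

def rnn_forward(xs, W_hx, W_hh, W_oh, b_h, b_o, h_init):
    """Run a vanilla RNN over a sequence of input vectors xs."""
    hs, os = [], []
    h_prev = h_init
    for x_t in xs:
        h_t = np.tanh(W_hx @ x_t + W_hh @ h_prev + b_h)  # hidden update
        o_t = W_oh @ h_t + b_o                           # output
        hs.append(h_t)
        os.append(o_t)
        h_prev = h_t
    return hs, os

rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 3, 5, 2, 4
xs = [rng.standard_normal(d_in) for _ in range(T)]
hs, os = rnn_forward(
    xs,
    W_hx=rng.standard_normal((d_h, d_in)),
    W_hh=rng.standard_normal((d_h, d_h)),
    W_oh=rng.standard_normal((d_out, d_h)),
    b_h=np.zeros(d_h), b_o=np.zeros(d_out),
    h_init=np.zeros(d_h),
)
print(len(hs), os[0].shape)  # 4 (2,)
```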

It may seem straightforward to train RNNs with gradient descent, but in reality the gradient decays exponentially as it is backpropagated through time. The relation between the parameters and the dynamics of the RNN is highly unstable, which makes gradient descent ineffective. Thus, it is argued that RNNs can hardly learn long-range temporal dependencies when gradient descent is used for training. A well-known way to deal with this inability of gradient descent to learn long-range temporal structure in RNNs is the "Long Short-Term Memory" (LSTM) architecture. (http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf)
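The exponential decay can be seen in a short numpy experiment (our own illustration): backpropagating through the recurrence multiplies the gradient by the transposed Jacobian $W_{hh}^T\,\mathrm{diag}(1-h_t^2)$ at every step, so with contractive recurrent weights the gradient norm collapses.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, T = 8, 50
W_hh = 0.1 * rng.standard_normal((d_h, d_h))  # small (contractive) weights

# Forward pass of the bare recurrence h_t = tanh(W_hh h_{t-1}), inputs omitted.
hs = [rng.standard_normal(d_h)]
for _ in range(T):
    hs.append(np.tanh(W_hh @ hs[-1]))

# Backpropagate a unit gradient from h_T down to h_0, recording its norm.
g = np.ones(d_h)
norms = []
for h_t in reversed(hs[1:]):
    g = W_hh.T @ ((1.0 - h_t**2) * g)  # Jacobian of tanh(W_hh h), transposed
    norms.append(np.linalg.norm(g))

print(norms[0] > norms[-1])  # True: the gradient shrinks as it goes back in time
```

With weights this small the norm drops by many orders of magnitude over 50 steps, which is precisely the vanishing-gradient problem the LSTM architecture was designed to mitigate.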

There are different variants of the LSTM<ref name=grave> </ref><ref> Gers, Felix, and Jürgen Schmidhuber. "Recurrent nets that time and count." Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on. Vol. 3. IEEE, 2000. </ref><ref> Cho, Kyunghyun, et al. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014). </ref> other than the original one proposed by Hochreiter et al.<ref name=lstm> </ref> Greff et al. compare the performance of several popular variants in their work<ref> Greff, Klaus, et al. "LSTM: A Search Space Odyssey." arXiv preprint arXiv:1503.04069 (2015). </ref> and conclude that they perform about the same, while Jozefowicz et al. suggest that some architectures can perform better than the LSTM on certain tasks<ref> Jozefowicz, Rafal, Wojciech Zaremba, and Ilya Sutskever. "An Empirical Exploration of Recurrent Network Architectures." Proceedings of the 32nd International Conference on Machine Learning (ICML-15). 2015. </ref>.

# Criticisms

There is some concern regarding whether this model will be able to provide a truly scalable solution to MT. In particular, as the reported results suggest, it is not obvious that this model will be able to scale sufficiently to long sentences. The model is also severely limited by working only in the absence of infrequent words. These limitations, alongside sparse experimental results, give rise to skepticism about the overarching validity of the model.

# Source

Sutskever, I., Vinyals, O., & Le, Q. V. Sequence to sequence learning with neural networks. In Proc. Advances in Neural Information Processing Systems 27, 3104–3112 (2014). <references />