1. liu, wenqing

Introduction

Generative adversarial networks (GANs) (Goodfellow et al., 2014) have enjoyed considerable success as a framework for generative models in recent years. The idea is to train the model distribution and the discriminator in alternation, with the goal of reducing the difference between the model distribution and the target distribution, as measured by the best possible discriminator at each step of training.

A persistent challenge in the training of GANs is controlling the performance of the discriminator. When the support of the model distribution and the support of the target distribution are disjoint, there exists a discriminator that can perfectly distinguish the model distribution from the target (Arjovsky & Bottou, 2017). Once such a discriminator is produced, the training of the generator comes to a complete stop, because the derivative of the so-produced discriminator with respect to the input turns out to be 0. This motivates us to introduce some form of restriction on the choice of discriminator.

In this paper, we propose a novel weight normalization method called spectral normalization that can stabilize the training of discriminator networks. In this study, we also explain the effectiveness of spectral normalization relative to other regularization and normalization techniques.

Model

Let us consider a simple discriminator made of a neural network of the following form, with input $x$: $f(x,\theta) = W^{L+1}a_L(W^L(a_{L-1}(W^{L-1}(\cdots a_1(W^1x)\cdots))))$, where $\theta:=\{W^1,\cdots,W^L, W^{L+1}\}$ is the set of learning parameters, $W^l\in R^{d_l\times d_{l-1}}$, $W^{L+1}\in R^{1\times d_L}$, and $a_l$ is an element-wise non-linear activation function. The final output of the discriminator is given by $D(x,\theta) = A(f(x,\theta))$, where $A$ is an activation function corresponding to the divergence or distance measure of the user's choice.

The standard formulation of GANs is given by $\min_{G}\max_{D}V(G,D)$, where the min and max are taken over the sets of generator and discriminator functions, respectively. The conventional form of $V(G,D)$ is $E_{x\sim q_{data}}[\log D(x)] + E_{x'\sim p_G}[\log(1-D(x'))]$, where $q_{data}$ is the data distribution and $p_G$ is the model (generator) distribution to be learned through the adversarial min-max optimization. It is known that, for a fixed generator $G$, the optimal discriminator for this form of $V(G,D)$ is $D_G^{*}(x):=q_{data}(x)/(q_{data}(x)+p_G(x))$.

We search for the discriminator $D$ from the set of $K$-Lipschitz continuous functions, that is, $\arg\max_{||f||_{Lip}\le K}V(G,D)$, where $||f||_{Lip}$ denotes the smallest value $M$ such that $||f(x)-f(x')||/||x-x'||\le M$ for any $x, x'$, with the norm being the $l_2$ norm.

Our spectral normalization controls the Lipschitz constant of the discriminator function $f$ by literally constraining the spectral norm of each layer $g: h_{in}\rightarrow h_{out}$. By definition, the Lipschitz norm $||g||_{Lip}$ is equal to $\sup_h\sigma(\nabla g(h))$, where $\sigma(A)$ is the spectral norm of the matrix $A$, which is equivalent to the largest singular value of $A$. Therefore, for a linear layer $g(h)=Wh$, the norm is given by $||g||_{Lip}=\sigma(W)$. Using the inequality $||g_1\circ g_2||_{Lip}\le ||g_1||_{Lip}\cdot||g_2||_{Lip}$, we observe the following bound:

$||f||_{Lip}\le ||(h_L\rightarrow W^{L+1}h_{L})||_{Lip}\cdot||a_{L}||_{Lip}\cdot||(h_{L-1}\rightarrow W^{L}h_{L-1})||_{Lip}\cdots ||a_1||_{Lip}\cdot||(h_0\rightarrow W^1h_0)||_{Lip}=\prod_{l=1}^{L+1}\sigma(W^l) \cdot\prod_{l=1}^{L} ||a_l||_{Lip}$
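The bound can be checked numerically on a toy network. The following is a minimal sketch, assuming ReLU activations (which are 1-Lipschitz, so the activation factors drop out) and small random Gaussian weights; the helper names `matvec`, `spectral_norm`, and `f` are illustrative and not taken from the paper.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))


def spectral_norm(W, iters=500):
    """Estimate the largest singular value of W by power iteration on W^T W."""
    Wt = [list(col) for col in zip(*W)]
    v = [1.0] * len(W[0])
    for _ in range(iters):
        v = matvec(Wt, matvec(W, v))
        scale = l2_norm(v)
        v = [c / scale for c in v]
    return l2_norm(matvec(W, v))


def relu(h):
    return [max(0.0, c) for c in h]


def f(x, weights):
    """f(x) = W^{L+1} a_L(... a_1(W^1 x) ...) with ReLU activations."""
    h = x
    for W in weights[:-1]:
        h = relu(matvec(W, h))
    return matvec(weights[-1], h)[0]


rng = random.Random(0)


def gauss_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy discriminator: 3 -> 4 -> 4 -> 1.
weights = [gauss_matrix(4, 3), gauss_matrix(4, 4), gauss_matrix(1, 4)]

# With 1-Lipschitz activations the bound is the product of spectral norms.
bound = math.prod(spectral_norm(W) for W in weights)

# Largest observed |f(x) - f(y)| / ||x - y|| over random input pairs.
worst_ratio = 0.0
for _ in range(2000):
    x = [rng.gauss(0.0, 1.0) for _ in range(3)]
    y = [rng.gauss(0.0, 1.0) for _ in range(3)]
    dist = l2_norm([a - b for a, b in zip(x, y)])
    worst_ratio = max(worst_ratio, abs(f(x, weights) - f(y, weights)) / dist)

print(f"empirical ratio {worst_ratio:.4f}, bound {bound:.4f}")
```

The empirical ratio only samples a lower estimate of the true Lipschitz constant, so it should stay below the product of spectral norms.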

Our spectral normalization normalizes the spectral norm of the weight matrix W so that it satisfies the Lipschitz constraint $\sigma(W)=1$:

$\bar{W_{SN}}:= W/\sigma(W)$
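As an illustration of the normalization above, here is a minimal sketch that estimates $\sigma(W)$ by power iteration and then divides $W$ by it. The function names and the idea of passing in a running estimate `u` of the first left singular vector are assumptions made for this example, not a reproduction of the authors' code.

```python
import math


def matvec(W, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def normalize(x):
    n = math.sqrt(sum(v * v for v in x))
    return [v / n for v in x]


def power_iteration(W, u, n_steps=1):
    """Refine an estimate u of the first left singular vector of W.

    Returns (sigma, u), where sigma approximates the spectral norm of W.
    n_steps must be at least 1.
    """
    Wt = [list(col) for col in zip(*W)]
    for _ in range(n_steps):
        v = normalize(matvec(Wt, u))  # v <- W^T u / ||W^T u||
        u = normalize(matvec(W, v))   # u <- W v / ||W v||
    sigma = sum(a * b for a, b in zip(u, matvec(W, v)))  # u^T W v
    return sigma, u


def spectral_normalize(W, u, n_steps=1):
    """Return (W / sigma(W), sigma, updated u)."""
    sigma, u = power_iteration(W, u, n_steps)
    W_sn = [[w / sigma for w in row] for row in W]
    return W_sn, sigma, u


# Toy example: the singular values of this diagonal matrix are 2 and 1.
W = [[2.0, 0.0], [0.0, 1.0]]
u0 = normalize([1.0, 1.0])
W_sn, sigma, u = spectral_normalize(W, u0, n_steps=50)

# After normalization the largest singular value should be close to 1.
sigma_sn, _ = power_iteration(W_sn, u, n_steps=50)
print(sigma, sigma_sn, W_sn)
```

Many steps are used here only to make the toy result accurate; in training one would typically reuse `u` across updates and take few steps per update.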

In summary, just as weight normalization does, we reparameterize the weight matrix $\bar{W_{SN}}$ as $W/\sigma(W)$ to fix its largest singular value at one. We can now calculate the gradient with respect to the new parameter $W$ by the chain rule:

$\frac{\partial V(G,D)}{\partial W} = \frac{\partial V(G,D)}{\partial \bar{W_{SN}}}*\frac{\partial \bar{W_{SN}}}{\partial W}$

$\frac{\partial \bar{W_{SN}}}{\partial W_{ij}} = \frac{1}{\sigma(W)}E_{ij}-\frac{1}{\sigma(W)^2}*\frac{\partial \sigma(W)}{\partial(W_{ij})}W=\frac{1}{\sigma(W)}E_{ij}-\frac{[u_1v_1^T]_{ij}}{\sigma(W)^2}W=\frac{1}{\sigma(W)}(E_{ij}-[u_1v_1^T]_{ij}\bar{W_{SN}})$

where $E_{ij}$ is the matrix whose (i,j)-th entry is 1 and zero everywhere else, and $u_1, v_1$ are respectively the first left and right singular vectors of W.
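The key step in this derivation is the identity $\partial \sigma(W)/\partial W_{ij} = [u_1v_1^T]_{ij}$. The sketch below compares it with central finite differences on a small example matrix; the matrix, step size, and helper names are arbitrary choices for illustration.

```python
import math


def matvec(W, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def normalize(x):
    n = math.sqrt(sum(v * v for v in x))
    return [v / n for v in x]


def top_singular_triplet(W, iters=200):
    """Return (sigma, u1, v1) for W, estimated by power iteration."""
    Wt = [list(col) for col in zip(*W)]
    v = normalize([1.0] * len(W[0]))
    for _ in range(iters):
        u = normalize(matvec(W, v))
        v = normalize(matvec(Wt, u))
    u = normalize(matvec(W, v))
    sigma = sum(a * b for a, b in zip(u, matvec(W, v)))
    return sigma, u, v


W = [[2.0, 0.5], [0.3, 1.0]]
sigma, u1, v1 = top_singular_triplet(W)

eps = 1e-6
max_err = 0.0
for i in range(2):
    for j in range(2):
        W_plus = [row[:] for row in W]
        W_minus = [row[:] for row in W]
        W_plus[i][j] += eps
        W_minus[i][j] -= eps
        # Central finite-difference estimate of d sigma / d W_ij.
        fd = (top_singular_triplet(W_plus)[0]
              - top_singular_triplet(W_minus)[0]) / (2 * eps)
        # Analytic value from the derivation: [u1 v1^T]_ij.
        max_err = max(max_err, abs(fd - u1[i] * v1[j]))

print("largest deviation from u1 v1^T:", max_err)
```

The agreement relies on the largest singular value being distinct, which holds for this example matrix.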

Spectral Normalization vs. Other Regularization Techniques

The weight normalization introduced by Salimans & Kingma (2016) is a method that normalizes the $l_2$ norm of each row vector in the weight matrix. Mathematically, this is equivalent to requiring the weight-normalized matrix $\bar{W_{WN}}$ to satisfy:

$\sigma_1(\bar{W_{WN}})^2+\cdots+\sigma_T(\bar{W_{WN}})^2=d_o, \text{ where } T=\min(d_i,d_o)$

Here $\sigma_t(A)$ is the $t$-th singular value of the matrix $A$, and $d_i$ and $d_o$ are the input and output dimensions of the layer.

Note that if $\bar{W_{WN}}$ is the weight-normalized matrix of dimension $d_i\times d_o$, the norm $||\bar{W_{WN}}h||_2$ for a fixed unit vector $h$ is maximized at $||\bar{W_{WN}}h||_2=\sqrt{d_o}$ when $\sigma_1(\bar{W_{WN}})=\sqrt{d_o}$ and $\sigma_t(\bar{W_{WN}})=0$ for $t=2, \cdots, T$, which means that $\bar{W_{WN}}$ is of rank one. In order to retain as much of the norm of the input as possible, and hence to make the discriminator more sensitive, one would hope to make the norm of $\bar{W_{WN}}h$ large. For weight normalization, however, this comes at the cost of reducing the rank and hence the number of features available to the discriminator. Thus, there is a conflict of interest between weight normalization and our desire to use as many features as possible to distinguish the generator distribution from the target distribution. The former often prevails, inadvertently diminishing the number of features used by the discriminator. Consequently, the algorithm would produce a rather arbitrary model distribution that matches the target distribution only at a select few features.
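The trade-off described above can be seen on two tiny row-normalized matrices with two rows each. This is a sketch with hand-picked toy matrices; `row_normalize` is a simplified stand-in for weight normalization that ignores any learned scale parameter.

```python
import math


def matvec(W, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))


def row_normalize(W):
    """Scale every row of W to unit l2 norm."""
    return [[w / l2_norm(row) for w in row] for row in W]


def frobenius_sq(W):
    """Squared Frobenius norm, equal to the sum of squared singular values."""
    return sum(w * w for row in W for w in row)


h = [1.0, 0.0]  # fixed unit input vector

# Both rows point along h: rank one, singular values sqrt(2) and 0.
W_rank_one = row_normalize([[3.0, 0.0], [5.0, 0.0]])
# Rows are orthogonal: full rank, singular values 1 and 1.
W_full_rank = row_normalize([[2.0, 0.0], [0.0, 7.0]])

gain_rank_one = l2_norm(matvec(W_rank_one, h))
gain_full_rank = l2_norm(matvec(W_full_rank, h))

print(frobenius_sq(W_rank_one), frobenius_sq(W_full_rank))
print(gain_rank_one, gain_full_rank)
```

Both matrices satisfy the same constraint (squared singular values summing to the number of rows), but the rank-one matrix passes more of the norm of $h$ through while using only a single feature direction.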

Brock et al. (2016) introduced orthonormal regularization on each weight to stabilize the training of GANs. In their work, they augmented the adversarial objective function by adding the following term:

$||W^TW-I||^2_F$

While this seems to serve the same purpose as spectral normalization, orthonormal regularization is mathematically quite different from our spectral normalization, because it destroys the information about the spectrum by setting all the singular values to one. Spectral normalization, on the other hand, only scales the spectrum so that its maximum is one.
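To make the contrast concrete, the sketch below evaluates the regularization term $||W^TW-I||^2_F$ on a toy diagonal matrix and on its spectrally normalized version. The matrices are illustrative, and the spectral norm is read off directly because the example matrix is diagonal.

```python
def orthonormal_penalty(W):
    """Squared Frobenius norm of W^T W - I for a matrix given as a list of rows."""
    cols = list(zip(*W))
    n = len(cols)
    total = 0.0
    for i in range(n):
        for j in range(n):
            gram = sum(a * b for a, b in zip(cols[i], cols[j]))
            target = 1.0 if i == j else 0.0
            total += (gram - target) ** 2
    return total


# Toy weight with singular values 2 and 0.5.
W = [[2.0, 0.0], [0.0, 0.5]]
penalty_w = orthonormal_penalty(W)  # (4 - 1)^2 + (0.25 - 1)^2

# Spectral normalization divides by the largest singular value (2 here),
# giving singular values 1 and 0.25: the 4:1 ratio of the spectrum is kept.
W_sn = [[w / 2.0 for w in row] for row in W]
penalty_sn = orthonormal_penalty(W_sn)

# The penalty vanishes only when all singular values equal one,
# for example for a permutation matrix.
penalty_orth = orthonormal_penalty([[0.0, 1.0], [1.0, 0.0]])

print(penalty_w, penalty_sn, penalty_orth)
```

The spectrally normalized matrix still incurs a nonzero penalty, which reflects that orthonormal regularization pushes every singular value toward one rather than only the largest.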

Gulrajani et al. (2017) used the gradient penalty method in combination with WGAN. In their work, they placed a K-Lipschitz constraint on the discriminator by augmenting the objective function with a regularizer that rewards the function for having a local 1-Lipschitz constant (i.e., $||\nabla_{\hat{x}} f ||_2 = 1$) at discrete sets of points of the form $\hat{x}:=\epsilon \tilde{x} + (1-\epsilon)x$, generated by interpolating a sample $\tilde{x}$ from the generative distribution and a sample $x$ from the data distribution. This approach has the obvious weakness of being heavily dependent on the support of the current generative distribution. Moreover, WGAN-GP has a higher computational cost than our spectral normalization with single-step power iteration, because the computation of $||\nabla_{\hat{x}} f ||_2$ requires one whole round of forward and backward propagation.

Experimental settings and results

Objective function

For all methods other than WGAN-GP, we use $V(G,D) := E_{x\sim q_{data}(x)}[\log D(x)] + E_{z\sim p(z)}[\log (1-D(G(z)))]$ to update $D$; for the updates of $G$, we use $-E_{z\sim p(z)}[\log(D(G(z)))]$. Alternatively, we test the performance of the algorithm with the so-called hinge loss, which is given by $V_D(\hat{G},D)= E_{x\sim q_{data}(x)}[\min(0,-1+D(x))] + E_{z\sim p(z)}[\min(0,-1-D(\hat{G}(z)))]$ and $V_G(G,\hat{D})=-E_{z\sim p(z)}[\hat{D}(G(z))]$.
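The hinge objectives can be evaluated directly from batches of discriminator outputs. The following sketch replaces the expectations with batch means; the function names and the sample values are made up for illustration.

```python
def hinge_v_d(d_real, d_fake):
    """V_D: mean of min(0, -1 + D(x)) plus mean of min(0, -1 - D(G(z)))."""
    real_term = sum(min(0.0, -1.0 + d) for d in d_real) / len(d_real)
    fake_term = sum(min(0.0, -1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


def hinge_v_g(d_fake):
    """V_G: negative mean of the discriminator output on generated samples."""
    return -sum(d_fake) / len(d_fake)


# Hypothetical discriminator outputs on two real and two generated samples.
d_real = [2.0, 0.5]
d_fake = [-2.0, 0.0]

v_d = hinge_v_d(d_real, d_fake)
v_g = hinge_v_g(d_fake)
print(v_d, v_g)
```

Outputs beyond the margin (a real sample scored above 1, a generated sample scored below -1) contribute nothing to $V_D$, so only the samples inside the margin affect the discriminator objective.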

For WGAN-GP, we choose $V(G,D):=E_{x\sim q_{data}}[D(x)]-E_{z\sim p(z)}[D(G(z))]-\lambda E_{\hat{x}\sim p(\hat{x})}[(||\nabla_{\hat{x}}D(\hat{x})||_2-1)^2]$.
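As a worked instance of this objective, the sketch below evaluates it for a hypothetical linear critic $D(x)=w\cdot x$, whose gradient norm is $||w||_2$ everywhere. The critic, the finite-difference gradient, and the value $\lambda=10$ are assumptions made for this example; a real implementation would obtain the gradient by backpropagation.

```python
import math


def critic(x, w=(3.0, 4.0)):
    """Hypothetical linear critic D(x) = w . x, used only for illustration."""
    return sum(a * b for a, b in zip(w, x))


def grad_norm(fn, x, eps=1e-4):
    """l2 norm of the gradient of fn at x, by central finite differences."""
    sq = 0.0
    for k in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[k] += eps
        x_minus[k] -= eps
        sq += ((fn(x_plus) - fn(x_minus)) / (2 * eps)) ** 2
    return math.sqrt(sq)


def wgan_gp_objective(real, fake, epsilons, lam=10.0):
    """E[D(x)] - E[D(G(z))] - lam * E[(||grad D(x_hat)||_2 - 1)^2] on a batch."""
    n = len(real)
    mean_real = sum(critic(x) for x in real) / n
    mean_fake = sum(critic(x) for x in fake) / n
    penalty = 0.0
    for x, x_tilde, e in zip(real, fake, epsilons):
        # Interpolate between a generated sample and a data sample.
        x_hat = [e * t + (1.0 - e) * r for r, t in zip(x, x_tilde)]
        penalty += (grad_norm(critic, x_hat) - 1.0) ** 2
    return mean_real - mean_fake - lam * penalty / n


real = [[1.0, 0.0]]  # D = 3
fake = [[0.0, 1.0]]  # D = 4
v = wgan_gp_objective(real, fake, epsilons=[0.5])
print(v)  # approximately 3 - 4 - 10 * (5 - 1)^2
```

Because the gradient norm of this critic is 5 rather than 1, the penalty term dominates the objective, which is the pressure that pushes the critic toward a unit gradient norm at the interpolated points.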
