stat946w18/Spectral normalization for generative adversial network: Difference between revisions

From statwiki
Jump to navigation Jump to search
 
(102 intermediate revisions by 15 users not shown)
Line 4: Line 4:


= Introduction =
= Introduction =
Generative adversarial networks(GANs)(Goodfellow et al., 2014) have been enjoying considerable success as a framework of generative models in recent years. The concept is to consecutively train the model distribution and the discriminator in turn, with the goal of reducing the difference between the model distribution and the target distribution measured by the best discriminator possible at each step of the training.
Generative adversarial networks (GANs) (Goodfellow et al., 2014) have been enjoying considerable success as a framework of generative models in recent years. The concept is to consecutively train the model distribution and the discriminator in turn, with the goal of reducing the difference between the model distribution and the target distribution measured by the best discriminator possible at each step of the training.


A persisting challenge challenge in the training of GANs is the performance control of the discriminator. When the support of the model distribution and the support of target distribution are disjoint, there exists a discriminator that can perfectly distinguish the model distribution from the target. (Arjovsky & Bottou, 2017). One such discriminator is produced in this situation, the training of the generator comes to complete  stop, because the derivative of the so-produced discriminator with respect to the input turns out to be 0. This motivates us to introduce some form of restriction to the choice of discriminator.
A persisting challenge in the training of GANs is the performance control of the discriminator. When the support of the model distribution and the support of target distribution are disjoint, there exists a discriminator that can perfectly distinguish the model distribution from the target (Arjovsky & Bottou, 2017). One such discriminator is produced in this situation, the training of the generator comes to complete  stop, because the derivative of the so-produced discriminator with respect to the input turns out to be 0. This motivates us to introduce some form of restriction to the choice of discriminator.


In this paper, we propose a novel weight normalization method called ''spectral normalization'' that can stabilize the training of discriminator networks. Our normalization enjoys following favorable properties.
In this paper, the authors propose a novel weight normalization method called ''spectral normalization'' that can stabilize the training of discriminator networks. The normalization enjoys following favorable properties:
1. Lipshitz constant is the only hyper-parameter to be tuned, and algorithm does not require intensive tuning of the only hyper-parameter for satisfactory purpose.
2. Implementation of is simple and the additional computational cost is small.


In this study, we provide explanations of the effectiveness of spectral normalization against other regularization or normalization techniques.
* The only hyper-parameter that needs to be tuned is the Lipschitz constant, and the algorithm is not too sensitive to this constant's value
* The additional computational needed to implement spectral normalization is small 
 
In this study, they provide explanations of the effectiveness of spectral normalization against other regularization or normalization techniques.


= Model =
= Model =


Let us consider a simple discriminator made of a neural network of the following form, with the input x:
\[f(x,\theta) = W^{L+1}a_L(W^L(a_{L-1}(W^{L-1}(\cdots a_1(W^1x)\cdots))))\]
where <math> \theta:=W^1,\cdots,W^L, W^{L+1} </math> is the learning parameters set, <math>W^l\in R^{d_l*d_{l-1}}, W^{L+1}\in R^{1*d_L} </math>, and <math>a_l </math> is an element-wise non-linear activation function.The final output of the discriminator function is given by <math>D(x,\theta) = A(f(x,\theta)) </math>. The standard formulation of GANs is given by <math>\min_{G}\max_{D}V(G,D)</math> where min and max of G and D are taken over the set of generator and discriminator functions, respectively.
The conventional form of <math>V(G,D) </math> is given by:
\[E_{x\sim q_{data}}[\log D(x)] + E_{x'\sim p_G}[\log(1-D(x'))]\]
where <math>q_{data}</math> is the data distribution and <math>p_G(x)</math> is the model generator distribution to be learned through the adversarial min-max optimization. It is known that, for a fixed generator G, the optimal discriminator for this form of <math>V(G,D) </math> is given by
\[ D_G^{*}(x):=q_{data}(x)/(q_{data}(x)+p_G(x)) \]
Also, the machine learning community has pointed out recently that the function space from which the discriminators can affect the performance of GANs. A number of works advocate the importance of Lipschitz continuity in assuring the boundedness of statistics. One example is given below:
\[ D_G^{*}(x):=q_{data}(x)/(q_{data}(x)+p_G(x)) = sigmoid(f^{*} (x))\] Where  \[f^{*} (x) = log q_{data} (x) - log p_{G} (x)\]
Thus we will be having:
\[\triangledown_{x} f^{*}(x) = \frac{1}{q_{data}(x)} \triangledown_{x}q_{data}(x) - \frac{1}{P_{G}(x)} \triangledown_{x}P_{G}(x)\]


=== Theory of Recurrent Neural Networks ===
which can be unbounded or even incomputable. This allows us to introduce some regularity condition to the derivative of f(x).


Reall that an Artifical Neural Network (ANN) is a nonlinear function <math>\mathcal{N}:\mathbb{R}^{p_0}
Now we can look back to the discriminator: we search for the discriminator D from the set of K-lipshitz continuous functions, that is,  
\rightarrow \mathbb{R}</math> that computes an output through iterative nonlinear function applications to an input vector <math>\mathbf{z}_0\in
\mathbb{R}^{p_0}</math>. The updates are said to occur in ''layers'', producing the sequence of vectors associated with each layer as <math>\left\{\mathbf{z}_0,\mathbf{z}_1,\dots,\mathbf{z}_N\right\}</math>. Any <math>\mathbf{z}_n</math> for <math>n \in
\left\{1,\ldots,N\right\}</math> is computed recursively from <math>\mathbf{z}_0</math> with the ''weight matrices'' and ''bias vectors'' as per the rule


<math>
\[ \arg\max_{||f||_{Lip}\le k}V(G,D)\]
    \mathbf{z}_n =
                \sigma_n\left(\mathbf{W}_{n}\mathbf{z}_{n-1} + \mathbf{b}_{n}\right), \quad 1 \leq n \leq N</math>


where <math>\mathbf{W}_{n} \in \mathbb{R}^{p_n \times p_{n-1}}</math> is the weight matrix and <math>\mathbf{b}_{n} \in \mathbb{R}^{p_n}</math> is the bias vector associated with the <math>n</math>th layer in the network.
where we mean by <math> ||f||_{lip}</math> the smallest value M such that <math> ||f(x)-f(x')||/||x-x'||\le M </math> for any x,x', with the norm being the <math> l_2 </math> norm.


The element-wise vector function <math>\sigma_n\left(\cdot\right)</math> is sigmoid-like for each component in its domain, outputing a value that ranges in <math>[0,1]</math>. Typically, the functions <math>\left\{\sigma_n\left(\cdot\right)\right\}</math> are the same for <math>n < N</math>, but the final output <math>\sigma_N(\cdot)</math> depends on the network architecture—for instance, it may be a softmax function for multi-label classification. Thus the network is completed characterized by its weights and biases as the tuple <math>(\left\{\mathbf{W}_{n}\right\},\left\{\mathbf{b}_{n}\right\})</math>.
Our spectral normalization controls the Lipschitz constant of the discriminator function <math> f </math> by literally constraining the spectral norm of each layer <math> g: h_{in}\rightarrow h_{out}</math>. By definition, Lipschitz norm <math> ||g||_{Lip} </math> is equal to <math> \sup_h\sigma(\nabla g(h)) </math>, where <math> \sigma(A) </math> is the spectral norm of the matrix A, which is equivalent to the largest singular value of A. Therefore, for a linear layer <math> g(h)=Wh </math>, the norm is given by <math> ||g||_{Lip}=\sigma(W) </math>. Observing the following bound:


A sample network for <math>N = 2</math> is depicted below: a graph of an ANN with <math>N = 2</math>, where the vertices represent the vectors <math>\left\{\mathbf{z}_n\right\}_{n=0}^2</math> in their respective layers. Edges denote computation, where the vector transformations have been overlaid to show sample dimensions of <math>\mathbf{W}_1</math> and <math>\mathbf{W}_2</math>, such that they match the vectors <math>\mathbf{z}_0</math> and <math>\mathbf{z}_1</math>.
\[ ||f||_{Lip}\le ||(h_L\rightarrow W^{L+1}h_{L})||_{Lip}*||a_{L}||_{Lip}*||(h_{L-1}\rightarrow W^{L}h_{L-1})||_{Lip}\cdots ||a_1||_{Lip}*||(h_0\rightarrow W^1h_0)||_{Lip}=\prod_{l=1}^{L+1}\sigma(W^l) *\prod_{l=1}^{L} ||a_l||_{Lip} \]
<center>
[[File:ann.png | frame | center |Fig 1. Graph of an ANN with <math>N = 2</math>, where the vertices represent the vectors <math>\left\{\mathbf{z}_n\right\}_{n=0}^2</math> in their respective layers. Edges denote computation, where the vector transformations have been overlaid to show sample dimensions of <math>\mathbf{W}_1</math> and <math>\mathbf{W}_2</math>, such that they match the vectors <math>\mathbf{z}_0</math> and <math>\mathbf{z}_1</math>.
]]
</center>


A Recurrent Neural Network is a generalization of an ANN for a ''sequence'' of inputs <math>\left\{\mathbf{z}_0^{[t]}\right\}</math> where <math>t \in
Our spectral normalization normalizes the spectral norm of the weight matrix W so that it satisfies the Lipschitz constraint <math> \sigma(W)=1 </math>:
\left\{1,\ldots,T\right\}</math> such that there are ''recurrent'' connections between the intermediary vectors <math>\left\{\mathbf{z}_n\right\}</math> for different so-called ''time steps''. These connections are made to represent conditioning on the previous vectors in the sequence: supposing the sequence were a vectorized representation of the words, an input to the network could be: <math>\left\{\mathbf{z}_0^{[1]},\mathbf{z}_0^{[2]},\mathbf{z}_0^{[3]}\right\} =
\left\{\text{pass}, \text{the}, \text{sauce}\right\}</math>. In a language modelling problem for predictive text, the probability of obtaining <math>\mathbf{z}_0^{[3]}</math> is strongly conditioned on the previous words in the sequence. As such, additional recurrence weight matrices are added to the update rule for <math>1 \leq n \leq N</math> and <math>t > 1</math> to produce the recurrent update rule


<math>
\[ \bar{W_{SN}}:= W/\sigma(W) \]
    \mathbf{z}_n^{[t]} =  
        \sigma_n\left(
            \mathbf{b}_{n}
        +  \mathbf{W}_{n}\mathbf{z}_{n-1}^{[t]}
        +  \mathbf{R}_n\mathbf{z}_{n}^{[t-1]}
        \right)</math>


where <math>\mathbf{R}_n \in \mathbb{R}^{p_n \times p_n}</math> is the recurrence matrix that relates the <math>n</math>th layer’s output for item <math>t</math> to its previous output for item <math>t-1</math>. The network architecture for a single layer <math>n</math> at step <math>t</math> is pictured below. This is a schematic of an RNN layer <math>n</math> at step <math>t</math> with recurrence on the output of <math>\mathbf{z}_n^{[t-1]}</math>, with the dimensions of the matrices <math>\mathbf{R}_{n}</math> and <math>\mathbf{W}_{n}</math> pictured.
In summary, just like what weight normalization does, we reparameterize weight matrix <math> \bar{W_{SN}} </math> as <math> W/\sigma(W) </math> to fix the singular value of weight matrix.  Now we can calculate the gradient of new parameter W by chain rule:
<center>
[[File:rnn.png | frame | center |Fig 2. Schematic of an RNN layer <math>n</math> at step <math>t</math> with recurrence on the output of <math>\mathbf{z}_n^{[t-1]}</math>, with the dimensions of the matrices <math>\mathbf{R}_{n}</math> and <math>\mathbf{W}_{n}</math> pictured. ]]
</center>


=== RNN Architecture by Graves, 2013 ===
\[ \frac{\partial V(G,D)}{\partial W} = \frac{\partial V(G,D)}{\partial \bar{W_{SN}}}*\frac{\partial \bar{W_{SN}}}{\partial W} \]


The RNN update rule used by Sutskever et al. comes from a paper by Graves (2013). The connections between layers are denser in this case. The final layer is fully connected to every preceding layer execept for the input <math>\mathbf{z}_0^{[t]}</math>, and follows the update rule
\[ \frac{\partial \bar{W_{SN}}}{\partial W_{ij}} = \frac{1}{\sigma(W)}E_{ij}-\frac{1}{\sigma(W)^2}*\frac{\partial \sigma(W)}{\partial(W_{ij})}W=\frac{1}{\sigma(W)}E_{ij}-\frac{[u_1v_1^T]_{ij}}{\sigma(W)^2}W=\frac{1}{\sigma(W)}(E_{ij}-[u_1v_1^T]_{ij}\bar{W_{SN}}) \]


<math>
where <math> E_{ij} </math> is the matrix whose (i,j)-th entry is 1 and zero everywhere else, and <math> u_1, v_1</math> are respectively the first left and right singular vectors of W.
        \mathbf{z}_{N}^{[t]} = \sigma_n\left(
            \mathbf{b}_N
        + \displaystyle\sum_{n' = 1}^{N-1} \mathbf{W}_{N,n'}\mathbf{z}_{n'}^{[t]}
    \right)</math>


where <math>\mathbf{W}_{N,n'} \in \mathbb{R}^{p_N\times p_{n'}}</math> denotes the weight matrix between layer <math>n'</math> and <math>N</math>.
To understand the above computation in more detail, note that
\begin{align}
\sigma(W)= \sup_{||u||=1, ||v||=1} \langle Wv, u \rangle = \sup_{||u||=1, ||v||=1} \text{trace} ( (uv^T)^T W).
\end{align}
By Theorem 4.4.2 in Lemaréchal and Hiriart-Urruty (1996), the sub-differential of a convex function defined as the  the maximum of a set of differentiable convex functions over a compact index set is the convex hull of the gradients of the maximizing functions. Thus we have the sub-differential:


The layers 2 through <math>N-1</math> have additional connections to <math>\mathbf{z}_0^{[t]}</math> as
\begin{align}
\partial \sigma = \text{convex hull} \{ u v^T:  u,v \text{ are left/right singular vectors associated with }  \sigma(W)  \}.
\end{align}


<math>
However, the authors assume that the maximum singular value of W has only one left and one right normalized singular vector.  Thus <math> \sigma </math> is differentiable and
        \mathbf{z}_n^{[t]} = \sigma_n\left(
\begin{align}
            \mathbf{b}_{n}
\nabla_W \sigma(W) =u_1v_1^T,
        +  \mathbf{W}_{n}\mathbf{z}_{n-1}^{[t]}
\end{align}
        +  \mathbf{W}_{n,0}\mathbf{z}_0^{[t]}
which explains the above computation.
        +  \mathbf{R}_n\mathbf{z}_{n}^{[t-1]}
    \right),</math>


where, again, <math>\mathbf{W}_{n,n'}</math> must be of size <math>\mathbb{R}^{p_n\times
= Spectral Normalization VS Other Regularization Techniques =
    p_{n'}}</math>. The first layer has the typical RNN input rule as before,


<math>
The weight normalization introduced by Salimans & Kingma (2016) is a method that normalizes the <math> l_2 </math> norm of each row vector in the weight matrix. Mathematically it is equivalent to require the weight by the weight normalization <math> \bar{W_{WN}} </math>:
        \mathbf{z}_{1}^{[t]} = \sigma_1\left(
            \mathbf{b}_{1}
        +  \mathbf{W}_{1}\mathbf{z}_{0}^{[t]}
        +  \mathbf{R}_{1}\mathbf{z}_{1}^{[t-1]}
    \right).
</math>


=== Long Short-Term Memory Recurrent Neural Network (LSTM) ===
<math> \sigma_1(\bar{W_{WN}})^2+\cdots+\sigma_T(\bar{W_{WN}})^2=d_0, \text{where } T=\min(d_i,d_0) </math> where <math> \sigma_t(A) </math> is a t-th singular value of matrix A.
[http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/ Recurrent neural networks] are a variation of deep neural networks that are capable of storing information about previous hidden states in special memory layers.<ref name=lstm>
Hochreiter, Sepp, and Jürgen Schmidhuber. [http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf "Long short-term memory."] Neural computation 9.8 (1997): 1735-1780.
</ref> Unlike feed forward neural networks that take in a single fixed length vector input and output a fixed length vector output, recurrent neural networks can take in a sequence of fixed length vectors as input, because of their ability to store information and maintain a connection between inputs through this memory layer. By comparison, previous inputs would have no impact on current output for feed forward neural networks, whereas they can impact current input in a recurrent neural network. (This paper used the LSTM formulation from Graves<ref name=grave>
Graves, Alex. [http://arxiv.org/pdf/1308.0850.pdf "Generating sequences with recurrent neural networks."] arXiv preprint arXiv:1308.0850 (2013).
</ref>)


Note, if <math> \bar{W_{WN}} </math> is the weight normalized matrix of dimension <math> d_i*d_0 </math>, the norm <math> ||\bar{W_{WN}}h||_2 </math> for a fixed unit vector <math> h </math> is maximized at <math> ||\bar{W_{WN}}h||_2 \text{ when } \sigma_1(\bar{W_{WN}})=\sqrt{d_0} \text{ and } \sigma_t(\bar{W_{WN}})=0, t=2, \cdots, T </math> which means that <math> \bar{W_{WN}} </math> is of rank one. In order to retain as much norm of the input as possible and hence to make the discriminator more sensitive, one would hope to make the norm of <math> \bar{W_{WN}}h </math> large. For weight normalization, however, this comes at the cost of reducing the rank and hence the number of features to be used for the discriminator. Thus, there is a conflict of interests between weight normalization and our desire to use as many features as possible to distinguish the generator distribution from the target distribution. The former interest often reigns over the other in many cases, inadvertently diminishing the number of features to be used by the discriminators. Consequently, the algorithm would produce a rather arbitrary model distribution that matches the target distribution only at select few features.


This form of input fits naturally with language translation, as sentences are sequences of words, and many problems regarding representing variable length sentences as fixed length vectors can be avoided. However, training recurrent neural networks to learn long time lag dependencies--where inputs many time steps back can heavily influence current output--is difficult, and generally results in exploding or vanishing gradients. A variation of recurrent neural networks, the [http://colah.github.io/posts/2015-08-Understanding-LSTMs/ long short-term memory neural network], was used instead for this paper, as they do not suffer as much from the vanishing gradient problem.
Brock et al. (2016) introduced orthonormal regularization on each weight to stabilize the training of GANs. In their work, Brock et al. (2016) augmented the adversarial objective function by adding the following term:


<math> ||W^TW-I||^2_F </math>


The purpose of LSTM in this case is to estimate the conditional probability of the output sequence, <math>\,(y_1,\cdots,y_{T'})</math>, based on the input sequence, <math>\,(x_1,\cdots,x_{T})</math>, where <math>\,T</math> does not have to equal <math>\,T'</math>
While this seems to serve the same purpose as spectral normalization, orthonormal regularization is mathematically quite different from our spectral normalization because orthonormal regulation requires weight matrix to be orthogonal which coerce singular values to be one, therefore, the orthonormal regularization destroys the information about the spectrum by setting all the singular values to one. On the other hand, spectral normalization only scales the spectrum so that its maximum will be one.


Gulrajani et al. (2017) used gradient penalty method in combination with WGAN. In their work, they placed K-Lipschitz constant on the discriminator by augmenting the objective function with the regularizer that rewards the function for having local 1-Lipschitz constant(i.e <math> ||\nabla_{\hat{x}} f ||_2 = 1 </math>) at discrete sets of points of the form <math> \hat{x}:=\epsilon \tilde{x} + (1-\epsilon)x </math> generated by interpolating a sample <math> \tilde{x} </math> from generative distribution and a sample <math> x </math> from the data distribution. This approach has an obvious weakness of being heavily dependent on the support of the current generative distribution. Moreover, WGAN-GP requires more computational cost than our spectral normalization with single-step power iteration, because the computation of <math> ||\nabla_{\hat{x}} f ||_2 </math> requires one whole round of forward and backward propagation.


Let <math>\,v</math> represent the state of hidden layers after <math>\,(x_1,\cdots,x_{T})</math> have been inputted into the LSTM, i.e. what has been stored in the neural network's memory. Then
= Experimental settings and results =
== Objective function ==
For all methods other than WGAN-GP, we use
<math> V(G,D) := E_{x\sim q_{data}(x)}[\log D(x)] + E_{z\sim p(z)}[\log (1-D(G(z)))]</math>
to update D, for the updates of G, use <math> -E_{z\sim p(z)}[\log(D(G(z)))] </math>. Alternatively, test performance of the algorithm with so-called hinge loss, which is given by
<math> V_D(\hat{G},D)= E_{x\sim q_{data}(x)}[\min(0,-1+D(x))] + E_{z\sim p(z)}[\min(0,-1-D(\hat{G}(z)))] </math>, <math> V_G(G,\hat{D})=-E_{z\sim p(z)}[\hat{D}(G(z))] </math>
For WGAN-GP, we choose
<math> V(G,D):=E_{x\sim q_{data}}[D(x)]-E_{z\sim p(z)}[D(G(z))]- \lambda E_{\hat{x}\sim p(\hat{x})}[(||\nabla_{\hat{x}}D(\hat{x}||-1)^2)]</math>


:<math>\,p(y_1,\cdots,y_{T'}|x_1,\cdots,x_{T})=\prod_{t=1}^{T'} p(y_t|v,y_1,\cdots,y_{t-1})</math>
== Optimization ==
Adam optimizer: 6 settings in total, related to
* <math> n_{dis} </math>, the number of updates of the discriminator per one update of Adam.
* learning rate <math> \alpha </math>
* the first and second momentum parameters <math> \beta_1, \beta_2 </math> of Adam


For each <math>\,p(y_t|v,y_1,\cdots,y_{t-1})</math>, The LSTM neural network at time step <math>\,t</math> after <math>\,(x_1,\cdots,x_T,y_1,\cdots,y_{t-1})</math> have been inputted would output the relative probability of each word in the vocabulary and softmax function, <math>\,\frac{e^{x_b}}{\sum_{t=1}^N e^{x_t}}\,</math> can be applied to this output vector to generate the corresponding probability. From this, we can calculate any <math>\,p(y_1,\cdots,y_{T'}|x_1,\cdots,x_{T})</math> by repeatedly adding <math>\,y_t</math> as input into the LSTM neural network to calculate the new set of probabilities.
[[File:inception score.png]]


The objective function used during the training process was:
[[File:FID score.png]]


:<math>\,\frac{1}{|T_r|}\sum_{(S,T)\in T_r} \log(p(T|S))\,</math>
The above image show the inception core and FID score of with settings A-F, and table show the inception scores of the different methods with optimal settings on CIFAR-10 and STL-10 dataset.


Where <math>\,S</math> is the base/source sentence, <math>\,T</math> is the paired translated sentence and <math>\,T_r</math> is the total training set. This objective function is to maximize the log probability of a correct translation <math>\,T</math> given the base/source sentence <math>\,S</math> over the entire training set. Once the training is complete, translations are produced by finding the most likely translation according to LSTM:
== Singular values analysis on the weights of the discriminator D ==
[[File:singular value.png]]


:<math>\hat{T} = \underset{T}{\operatorname{arg\ max}}\ p(T|S)</math>
In above figure, we show the squared singular values of the weight matrices in the final discriminator D produced by each method using the parameter that yielded the best inception score. As we predicted before, the singular values of the first fifth layers trained with weight clipping and weight normalization concentrate on a few components. On the other hand, the singular values of the weight matrices in those layers trained with spectral normalization is more broadly distributed.


<br />It has been showed that Long Short-Term Memory recurrent neural networks have the ability to generate both discrete and real-valued sequences with complex, long-range structure using next-step prediction <ref name=grave>
== Training time ==
Reference
On CIFAR-10, SN-GANs is slightly slower than weight normalization at 31 seconds for 100 generations compared to that of weighted normalization at 29 seconds for 100 generations as seen in figure 10. However, SN-GANs are significantly faster than WGAN-GP at 40 seconds for 100 generations as seen in figure 10. As we mentioned in section 3, WGAN-GP is slower than other methods because WGAN-GP needs to calculate the gradient of gradient norm. For STL-10, the computational time of SN-GANs is almost the same as vanilla GANs at approximately 61 seconds for 100 generations.
</ref>.


=== Input and Output Data Transformation ===
[[File:trainingTime.png|center]]
About 12 million English-French sentence pairs were used during the training with a vocabulary of 160,000 for English and 80,000 for French. Any unknown words were replaced with a special token. All sentences were attached with an <EOS> token to indicate end of sentence.


Additionally, input sentences were entered backwards as the researchers found this significantly increased accuracy. For example, using the sentence "Today I went to lectures.", the input order would be "lectures,to,went,I,Today". They suspect this is due to reduction of time lag between the beginning of each sentence.
== Comparison between GN-GANs and orthonormal regularization ==
[[File:comparison.png]]
Above we explained in Section 3, orthonormal regularization is different from our method in that it destroys the spectral information and puts equal emphasis on all feature dimensions, including the ones that shall be weeded out in the training process. To see the extent of its possibly detrimental effect, we experimented by increasing the dimension of the feature space, especially at the final layer for which the training with our spectral normalization prefers relatively small feature space. Above figure shows the result of our experiments. As we predicted, the performance of the orthonormal regularization deteriorates as we increase the dimension of the feature maps at the final layer. SN-GANs, on the other hand, does not falter with this modification of the architecture.


To decode a translation after training, a simple left to right beam search algorithm is used. This process goes as follows, a small number of initial translations with highest probabilities are picked at the start. Each translation is then re-entered into the LSTM independently and a new small set of words with highest probabilities are appended to the end of each translation. This repeats until <EOS> token is chosen and the completely translated sentence is added to the final translation set which is then ranked and highest ranking translation chosen.
We also applied our method to the training of class conditional GANs on ILSVRC2012 dataset with 1000 classes, each consisting of approximately 1300 images, which we compressed to 128*128 pixels. GAN without normalization and GAN with layer normalization collapsed in the beginning of training and failed to produce any meaningful images. Above picture shows that the inception score of the orthonormal normalization plateaued around 20k iterations, while SN kept improving even afterward.


= Training and Results =
[[File:sngan.jpg]]
=== Training Method ===
Two LSTM neural networks were used overall; one to generate a fixed vector representation from the input sequence and another to generate the output sequence from the fixed vector representation. Each neural network had 4 layers and 1000 cells per layer and <math>\,v</math> can be represented by the 8000 real numbers in each cell's memory after the input sequence has been entered. Stochastic gradient descent with a batch size of 128 and learning rate of 0.7 was used. Initial parameters were set using a uniform distribution between -0.08 and 0.08. LSTM does not suffer from the vanishing gradient problem, but it can be affected by exploding gradients which is taken into account by enforcing a hard constraint on the norm of the gradient.


=== Scoring Method ===
Samples generated by various networks trained on CIFAR10. Momentum and training rates are increasing to the right. We can see that for high learning rates and momentum the Wasserstein-GAN does not generate good images, while weight and spectral normalization generate good samples.
Scoring was done using the BLEU (Bilingual Evaluation Understudy) metric. This is an algorithm created for evaluating the quality of machine translated text. This is done by using a modified form of precision to compare a produced translation against a set of reference translations. This metric tends to correlate well with human judgement across a corpus, but performs badly if used to evaluate individual sentences. More information can be found in the [http://www.aclweb.org/anthology/P02-1040.pdf BLEU paper] and the [https://en.wikipedia.org/wiki/BLEU wikipedia article]. These resources both state that the BLEU score is a number between 0 and 1, with closer to 1 corresponding to a better translation. The LSTM paper reports BLEU scores in percentage values.


=== Results ===
= Algorithm of spectral normalization  =
The resulting LSTM neural networks outperformed standard Statistical Machine Translation (SMT) with a BLEU score of 34.8 against 33.3 and with certain heuristics or modification, was very close to matching the best performing system. Additionally, it could recognize sentences in both active and passive voice as being similar.
To calculate the largest singular value of matrix <math> W </math> to implement spectral normalization, we appeal to power iterations. Algorithm is executed as follows:
<blockquote>
Active Voice: I ate an apple.
</blockquote>
<blockquote>
Passive Voice: The apple was eaten by me.
</blockquote>


An interesting result is the fact that reversing the source sentences (not test sentences) improved the long sentence decoding, which in turn increased the BLEU score from 25.9 to 30.6. While the authors do not have a complete explanation, they theorize the improvement in performance is due to the introduction of many short term dependencies to the data-set, by reversing the source sentences they minimize the time lag between the end of the source and the start of the target sentence. This reduction in the time lag is what the authors believe help the LSTM architecture establish a link between source and target and utilize the memory feature of the network better. Note, that the mean time lag does not change. Given the input sequence <math>(x_1, \dots, x_T)</math> and the target sequence <math>(y_1, \dots, y_T)</math>, the sequence of time lags is <math>\Delta t = (T, \dots, T)</math> and <math>\frac{1}{T} \sum_{\Delta t_i} T = T</math>. If, however, the input is reversed, the sequence of time lags of corresponding inputs is <math>\Delta t = (1, 3, \dots, 2T - 1)</math> which still has a mean time lag of <math>\frac{1}{T} \sum_{\Delta t_i} (2i - 1) = \frac{1}{T} \sum_{i = 1}^{T/2} (2i + 2(T-i)) = T</math> (assuming even T, but odd T can be shown similarly). Thus, half of the time lags are shorter with the reversed input sequence.
* Initialize <math>\tilde{u}_{l}\in R^{d_l} \text{for} l=1,\cdots,L </math> with a random vector (sampled from isotropic distribution)
* For each update and each layer l:
** Apply power iteration method to a unnormalized weight <math> W^l </math>:


For example, let "I saw the man" be the source sentence, "with the binoculars" be the target sentence, if we concatenate both source and target sentences we have "I saw the man with the binoculars". By reversing the source sentence ("man the saw I") the subject "man" is now closer to the context target "binoculars", compared to if the source sentence is not reversed. Reversing the input sentences leads to more confident predictions in the early parts of the target sentence and to less confident predictions in the later parts.  Also, it results in LSTMs with better memory utilization.
\begin{align}
\tilde{v_l}\leftarrow (W^l)^T\tilde{u_l}/||(W^l)^T\tilde{u_l}||_2
\end{align}


In summary the LSTM method has proven to be quite capable of translating long sentences despite potentially long delay between input time steps. However, it still falls short of [Edinburgh's specialised statistical model http://www.statmt.org/OSMOSES/sysdesc.pdf].
\begin{align}
\tilde{u_l}\leftarrow (W^l)^T\tilde{v_l}/||(W^l)^T\tilde{v_l}||
\end{align}


=== Some developments of LSTM ===
* Calculate <math> \bar{W_{SN}} </math> with the spectral norm :
Dealing with the very rare words are a challenge for the LSTM. A weakness of LSTM is its inability to (correctly) translate the very rare words that comes up with out-of-vocabulary words, i.e. no translation. The long short-term dependencies that are induced in this method can lead to less likelihood for a long sequence words come after the unknown words. In the other hand, these untranslated words impose a longer temporal memory that decrease the overall efficiency of network. Sutskever I ([http://www.aclweb.org/anthology/P15-1002.pdf]) suggested a method to address the rare word problem. They assumed that if the origin of any undefined word is known then the word can be look up by introducing a post-processing step. This step would replace each unknown word in the system’s output with a translation of its source word. They proposed three strategies to track the source and translate it using either a dictionary or the identity translation.


=== Open questions ===
\begin{align}
\bar{W_{SN}}(W^l)=W^l/\sigma(W^l)
\end{align}


The results of the paper pose some interesting questions which are not discussed in the paper itself:
where


# Instead of reversing the input sequence the target sequence could be reversed. This would change the time lags between corresponding words in a similar way, but instead of reducing the time lag between the first half of corresponding words, it is reduced between the last half of the words. This might allow conclusions about whether the improved performance is purely due to the reduced minimal time lag or whether structure in natural language is also important (e.g. when a short time lag between the first few words is better than a short time lag between the last few words of sentence).
\begin{align}
# For half of the words the time lag increases to more than the average. Thus, they might have only a minor contribution to the model performance. It could be interesting to see how much the performance is affected by leaving those words out of the input sequence. Or more generally, one could ask, how does the performance related to the number of used input words?
\sigma(W^l)=\tilde{u_l}^TW^l\tilde{v_l}
\end{align}


= More Formulations of Recurrent Neural Networks =
* Update <math>W^l </math> with SGD on mini-batch dataset <math> D_M </math> with a learning rate <math> \alpha </math>
The standard RNN is formalized as follows


:<math>\,h_t=\tanh(W_{hx}x_t+W_{hh}h_{t-1}+b_h)</math>
:<math>\,o_t=W_{oh}h_t+b_o</math>


Given sequence of input vectors <math>\,(x_1,\cdots,x_{T})</math>, the RNN computes a sequence of hidden states <math>\,(h_1,\cdots,h_{T})</math> and a sequence of output <math>\,(o_1,\cdots,o_{T})</math> by iterating the above equations. <math>\,W_{hx}</math> is the input to hidden weight matrix, <math>\,W_{hh}</math> is the hidden to hidden weight matrix, <math>\,W_{oh}</math> is the hidden to output weight matrix. Vector <math>\,b_{h}</math> and <math>\,b_{o}</math> are the biases. When t=1, the undefined <math>\,W_{hh}h_{t-1}</math> is replace with a special initial bias vector, <math>\,h_{init}</math>.
\begin{align}
W^l\leftarrow W^l-\alpha\nabla_{W^l}l(\bar{W_{SN}^l}(W^l),D_M)
\end{align}


It may seem to train RNNs with gradient descent, but in reality, gradient decays exponentially as it is backpropagated through time. The relation between parameter and dynamics of the RNN is highly unstable, which makes gradient descent ineffective. Thus, it argues that RNN can not learn long-range temporal dependencies when gradient descent is used for training. A good way to deal with inability of gradient descent to learn long-range temporal structure in RNN is known as "Long-Short Term memory". (http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf)


There are different variants of LSTM<ref name=grave>
== Conclusions ==
</ref><ref>
This paper proposes spectral normalization as a stabilizer for the training of GANs. When spectral normalization is applied to GANs on image generation tasks, the generated examples are more diverse than when using conventional weight normalization and achieve better or comparative inception scores relative to previous studies. The method imposes global regularization on the discriminator as opposed to local regularization introduced by WGAN-GP. In future work, the authors would like to investigate how their method compares analytically to other methods, while also further comparing it empirically by conducting experiments with their algorithm on larger and more complex datasets.
Gers, Felix, and Jürgen Schmidhuber. [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=861302&tag=1 "Recurrent nets that time and count."] Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on. Vol. 3. IEEE, 2000.
</ref><ref>
Cho, Kyunghyun, et al. [http://arxiv.org/pdf/1406.1078v3.pdf "Learning phrase representations using rnn encoder-decoder for statistical machine translation."] arXiv preprint arXiv:1406.1078 (2014).
</ref> other than the original one proposed by Hochreiter et al.<ref name=lstm>
</ref> Greff et al. compare the performance of some different popular variants in their work<ref>
Greff, Klaus, et al. [http://arxiv.org/pdf/1503.04069.pdf "LSTM: A Search Space Odyssey."] arXiv preprint arXiv:1503.04069 (2015).
</ref> and draw the conclusion that they are about the same. While Jozefowicz, et al. suggest that some architecture can perform better than LSTM on certain tasks<ref>
Jozefowicz, Rafal, Wojciech Zaremba, and Ilya Sutskever. [http://jmlr.org/proceedings/papers/v37/jozefowicz15.pdf "An Empirical Exploration of Recurrent Network Architectures."] Proceedings of the 32nd International Conference on Machine Learning (ICML-15). 2015.
</ref>.


= Criticisms =
== Open Source Code ==
There is some concern regarding whether this model will be able to provide a truly scalable solution to MT. In particular, it is not obvious that this model will be able to sufficiently scale to long sentences as is evident in the reported results. The model is severely limited, in general, by working only in the absence of infrequent words. These theoretical limitations alongside sparse experimental results give rise to skepticism about the overarching validity of the model.  
The open source code for this paper can be found at https://github.com/pfnet-research/sngan_projection.


= Source =
== References ==
Sutskever, I. Vinyals, O. & Le. Q. V. Sequence to sequence learning
# Lemaréchal, Claude, and J. B. Hiriart-Urruty. "Convex analysis and minimization algorithms I." Grundlehren der mathematischen Wissenschaften 305 (1996).
with neural networks. In Proc. Advances in Neural Information
Processing Systems 27 3104–3112 (2014).
<references />

Latest revision as of 09:31, 18 April 2018

Presented by

1. liu, wenqing

Introduction

Generative adversarial networks (GANs) (Goodfellow et al., 2014) have been enjoying considerable success as a framework of generative models in recent years. The concept is to consecutively train the model distribution and the discriminator in turn, with the goal of reducing the difference between the model distribution and the target distribution measured by the best discriminator possible at each step of the training.

A persisting challenge in the training of GANs is the performance control of the discriminator. When the support of the model distribution and the support of target distribution are disjoint, there exists a discriminator that can perfectly distinguish the model distribution from the target (Arjovsky & Bottou, 2017). One such discriminator is produced in this situation, the training of the generator comes to complete stop, because the derivative of the so-produced discriminator with respect to the input turns out to be 0. This motivates us to introduce some form of restriction to the choice of discriminator.

In this paper, the authors propose a novel weight normalization method called spectral normalization that can stabilize the training of discriminator networks. The normalization enjoys following favorable properties:

  • The only hyper-parameter that needs to be tuned is the Lipschitz constant, and the algorithm is not too sensitive to this constant's value
  • The additional computational needed to implement spectral normalization is small

In this study, they provide explanations of the effectiveness of spectral normalization against other regularization or normalization techniques.

Model

Let us consider a simple discriminator made of a neural network of the following form, with the input x:

\[f(x,\theta) = W^{L+1}a_L(W^L(a_{L-1}(W^{L-1}(\cdots a_1(W^1x)\cdots))))\]

where [math]\displaystyle{ \theta:=W^1,\cdots,W^L, W^{L+1} }[/math] is the learning parameters set, [math]\displaystyle{ W^l\in R^{d_l*d_{l-1}}, W^{L+1}\in R^{1*d_L} }[/math], and [math]\displaystyle{ a_l }[/math] is an element-wise non-linear activation function.The final output of the discriminator function is given by [math]\displaystyle{ D(x,\theta) = A(f(x,\theta)) }[/math]. The standard formulation of GANs is given by [math]\displaystyle{ \min_{G}\max_{D}V(G,D) }[/math] where min and max of G and D are taken over the set of generator and discriminator functions, respectively.

The conventional form of [math]\displaystyle{ V(G,D) }[/math] is given by:

\[E_{x\sim q_{data}}[\log D(x)] + E_{x'\sim p_G}[\log(1-D(x'))]\]

where [math]\displaystyle{ q_{data} }[/math] is the data distribution and [math]\displaystyle{ p_G(x) }[/math] is the model generator distribution to be learned through the adversarial min-max optimization. It is known that, for a fixed generator G, the optimal discriminator for this form of [math]\displaystyle{ V(G,D) }[/math] is given by

\[ D_G^{*}(x):=q_{data}(x)/(q_{data}(x)+p_G(x)) \]

Also, the machine learning community has pointed out recently that the function space from which the discriminators can affect the performance of GANs. A number of works advocate the importance of Lipschitz continuity in assuring the boundedness of statistics. One example is given below:

\[ D_G^{*}(x):=q_{data}(x)/(q_{data}(x)+p_G(x)) = sigmoid(f^{*} (x))\] Where \[f^{*} (x) = log q_{data} (x) - log p_{G} (x)\]

Thus we will be having:

\[\triangledown_{x} f^{*}(x) = \frac{1}{q_{data}(x)} \triangledown_{x}q_{data}(x) - \frac{1}{P_{G}(x)} \triangledown_{x}P_{G}(x)\]

which can be unbounded or even incomputable. This allows us to introduce some regularity condition to the derivative of f(x).

Now we can look back to the discriminator: we search for the discriminator D from the set of K-lipshitz continuous functions, that is,

\[ \arg\max_{||f||_{Lip}\le k}V(G,D)\]

where we mean by [math]\displaystyle{ ||f||_{lip} }[/math] the smallest value M such that [math]\displaystyle{ ||f(x)-f(x')||/||x-x'||\le M }[/math] for any x,x', with the norm being the [math]\displaystyle{ l_2 }[/math] norm.

Our spectral normalization controls the Lipschitz constant of the discriminator function [math]\displaystyle{ f }[/math] by literally constraining the spectral norm of each layer [math]\displaystyle{ g: h_{in}\rightarrow h_{out} }[/math]. By definition, Lipschitz norm [math]\displaystyle{ ||g||_{Lip} }[/math] is equal to [math]\displaystyle{ \sup_h\sigma(\nabla g(h)) }[/math], where [math]\displaystyle{ \sigma(A) }[/math] is the spectral norm of the matrix A, which is equivalent to the largest singular value of A. Therefore, for a linear layer [math]\displaystyle{ g(h)=Wh }[/math], the norm is given by [math]\displaystyle{ ||g||_{Lip}=\sigma(W) }[/math]. Observing the following bound:

\[ ||f||_{Lip}\le ||(h_L\rightarrow W^{L+1}h_{L})||_{Lip}*||a_{L}||_{Lip}*||(h_{L-1}\rightarrow W^{L}h_{L-1})||_{Lip}\cdots ||a_1||_{Lip}*||(h_0\rightarrow W^1h_0)||_{Lip}=\prod_{l=1}^{L+1}\sigma(W^l) *\prod_{l=1}^{L} ||a_l||_{Lip} \]

Our spectral normalization normalizes the spectral norm of the weight matrix W so that it satisfies the Lipschitz constraint [math]\displaystyle{ \sigma(W)=1 }[/math]:

\[ \bar{W_{SN}}:= W/\sigma(W) \]

In summary, just like what weight normalization does, we reparameterize weight matrix [math]\displaystyle{ \bar{W_{SN}} }[/math] as [math]\displaystyle{ W/\sigma(W) }[/math] to fix the singular value of weight matrix. Now we can calculate the gradient of new parameter W by chain rule:

\[ \frac{\partial V(G,D)}{\partial W} = \frac{\partial V(G,D)}{\partial \bar{W_{SN}}}*\frac{\partial \bar{W_{SN}}}{\partial W} \]

\[ \frac{\partial \bar{W_{SN}}}{\partial W_{ij}} = \frac{1}{\sigma(W)}E_{ij}-\frac{1}{\sigma(W)^2}*\frac{\partial \sigma(W)}{\partial(W_{ij})}W=\frac{1}{\sigma(W)}E_{ij}-\frac{[u_1v_1^T]_{ij}}{\sigma(W)^2}W=\frac{1}{\sigma(W)}(E_{ij}-[u_1v_1^T]_{ij}\bar{W_{SN}}) \]

where [math]\displaystyle{ E_{ij} }[/math] is the matrix whose (i,j)-th entry is 1 and zero everywhere else, and [math]\displaystyle{ u_1, v_1 }[/math] are respectively the first left and right singular vectors of W.

To understand the above computation in more detail, note that \begin{align} \sigma(W)= \sup_{||u||=1, ||v||=1} \langle Wv, u \rangle = \sup_{||u||=1, ||v||=1} \text{trace} ( (uv^T)^T W). \end{align} By Theorem 4.4.2 in Lemaréchal and Hiriart-Urruty (1996), the sub-differential of a convex function defined as the the maximum of a set of differentiable convex functions over a compact index set is the convex hull of the gradients of the maximizing functions. Thus we have the sub-differential:

\begin{align} \partial \sigma = \text{convex hull} \{ u v^T: u,v \text{ are left/right singular vectors associated with } \sigma(W) \}. \end{align}

However, the authors assume that the maximum singular value of W has only one left and one right normalized singular vector. Thus [math]\displaystyle{ \sigma }[/math] is differentiable and \begin{align} \nabla_W \sigma(W) =u_1v_1^T, \end{align} which explains the above computation.

Spectral Normalization VS Other Regularization Techniques

The weight normalization introduced by Salimans & Kingma (2016) is a method that normalizes the [math]\displaystyle{ l_2 }[/math] norm of each row vector in the weight matrix. Mathematically it is equivalent to require the weight by the weight normalization [math]\displaystyle{ \bar{W_{WN}} }[/math]:

[math]\displaystyle{ \sigma_1(\bar{W_{WN}})^2+\cdots+\sigma_T(\bar{W_{WN}})^2=d_0, \text{where } T=\min(d_i,d_0) }[/math] where [math]\displaystyle{ \sigma_t(A) }[/math] is a t-th singular value of matrix A.

Note, if [math]\displaystyle{ \bar{W_{WN}} }[/math] is the weight normalized matrix of dimension [math]\displaystyle{ d_i*d_0 }[/math], the norm [math]\displaystyle{ ||\bar{W_{WN}}h||_2 }[/math] for a fixed unit vector [math]\displaystyle{ h }[/math] is maximized at [math]\displaystyle{ ||\bar{W_{WN}}h||_2 \text{ when } \sigma_1(\bar{W_{WN}})=\sqrt{d_0} \text{ and } \sigma_t(\bar{W_{WN}})=0, t=2, \cdots, T }[/math] which means that [math]\displaystyle{ \bar{W_{WN}} }[/math] is of rank one. In order to retain as much norm of the input as possible and hence to make the discriminator more sensitive, one would hope to make the norm of [math]\displaystyle{ \bar{W_{WN}}h }[/math] large. For weight normalization, however, this comes at the cost of reducing the rank and hence the number of features to be used for the discriminator. Thus, there is a conflict of interests between weight normalization and our desire to use as many features as possible to distinguish the generator distribution from the target distribution. The former interest often reigns over the other in many cases, inadvertently diminishing the number of features to be used by the discriminators. Consequently, the algorithm would produce a rather arbitrary model distribution that matches the target distribution only at select few features.

Brock et al. (2016) introduced orthonormal regularization on each weight to stabilize the training of GANs. In their work, Brock et al. (2016) augmented the adversarial objective function by adding the following term:

[math]\displaystyle{ ||W^TW-I||^2_F }[/math]

While this seems to serve the same purpose as spectral normalization, orthonormal regularization is mathematically quite different from our spectral normalization because orthonormal regulation requires weight matrix to be orthogonal which coerce singular values to be one, therefore, the orthonormal regularization destroys the information about the spectrum by setting all the singular values to one. On the other hand, spectral normalization only scales the spectrum so that its maximum will be one.

Gulrajani et al. (2017) used gradient penalty method in combination with WGAN. In their work, they placed K-Lipschitz constant on the discriminator by augmenting the objective function with the regularizer that rewards the function for having local 1-Lipschitz constant(i.e [math]\displaystyle{ ||\nabla_{\hat{x}} f ||_2 = 1 }[/math]) at discrete sets of points of the form [math]\displaystyle{ \hat{x}:=\epsilon \tilde{x} + (1-\epsilon)x }[/math] generated by interpolating a sample [math]\displaystyle{ \tilde{x} }[/math] from generative distribution and a sample [math]\displaystyle{ x }[/math] from the data distribution. This approach has an obvious weakness of being heavily dependent on the support of the current generative distribution. Moreover, WGAN-GP requires more computational cost than our spectral normalization with single-step power iteration, because the computation of [math]\displaystyle{ ||\nabla_{\hat{x}} f ||_2 }[/math] requires one whole round of forward and backward propagation.

Experimental settings and results

Objective function

For all methods other than WGAN-GP, we use [math]\displaystyle{ V(G,D) := E_{x\sim q_{data}(x)}[\log D(x)] + E_{z\sim p(z)}[\log (1-D(G(z)))] }[/math] to update D, for the updates of G, use [math]\displaystyle{ -E_{z\sim p(z)}[\log(D(G(z)))] }[/math]. Alternatively, test performance of the algorithm with so-called hinge loss, which is given by [math]\displaystyle{ V_D(\hat{G},D)= E_{x\sim q_{data}(x)}[\min(0,-1+D(x))] + E_{z\sim p(z)}[\min(0,-1-D(\hat{G}(z)))] }[/math], [math]\displaystyle{ V_G(G,\hat{D})=-E_{z\sim p(z)}[\hat{D}(G(z))] }[/math]

For WGAN-GP, we choose [math]\displaystyle{ V(G,D):=E_{x\sim q_{data}}[D(x)]-E_{z\sim p(z)}[D(G(z))]- \lambda E_{\hat{x}\sim p(\hat{x})}[(||\nabla_{\hat{x}}D(\hat{x}||-1)^2)] }[/math]

Optimization

Adam optimizer: 6 settings in total, related to

  • [math]\displaystyle{ n_{dis} }[/math], the number of updates of the discriminator per one update of Adam.
  • learning rate [math]\displaystyle{ \alpha }[/math]
  • the first and second momentum parameters [math]\displaystyle{ \beta_1, \beta_2 }[/math] of Adam

The above image show the inception core and FID score of with settings A-F, and table show the inception scores of the different methods with optimal settings on CIFAR-10 and STL-10 dataset.

Singular values analysis on the weights of the discriminator D

In above figure, we show the squared singular values of the weight matrices in the final discriminator D produced by each method using the parameter that yielded the best inception score. As we predicted before, the singular values of the first fifth layers trained with weight clipping and weight normalization concentrate on a few components. On the other hand, the singular values of the weight matrices in those layers trained with spectral normalization is more broadly distributed.

Training time

On CIFAR-10, SN-GANs is slightly slower than weight normalization at 31 seconds for 100 generations compared to that of weighted normalization at 29 seconds for 100 generations as seen in figure 10. However, SN-GANs are significantly faster than WGAN-GP at 40 seconds for 100 generations as seen in figure 10. As we mentioned in section 3, WGAN-GP is slower than other methods because WGAN-GP needs to calculate the gradient of gradient norm. For STL-10, the computational time of SN-GANs is almost the same as vanilla GANs at approximately 61 seconds for 100 generations.

Comparison between GN-GANs and orthonormal regularization

Above we explained in Section 3, orthonormal regularization is different from our method in that it destroys the spectral information and puts equal emphasis on all feature dimensions, including the ones that shall be weeded out in the training process. To see the extent of its possibly detrimental effect, we experimented by increasing the dimension of the feature space, especially at the final layer for which the training with our spectral normalization prefers relatively small feature space. Above figure shows the result of our experiments. As we predicted, the performance of the orthonormal regularization deteriorates as we increase the dimension of the feature maps at the final layer. SN-GANs, on the other hand, does not falter with this modification of the architecture.

We also applied our method to the training of class conditional GANs on ILSVRC2012 dataset with 1000 classes, each consisting of approximately 1300 images, which we compressed to 128*128 pixels. GAN without normalization and GAN with layer normalization collapsed in the beginning of training and failed to produce any meaningful images. Above picture shows that the inception score of the orthonormal normalization plateaued around 20k iterations, while SN kept improving even afterward.

Samples generated by various networks trained on CIFAR10. Momentum and training rates are increasing to the right. We can see that for high learning rates and momentum the Wasserstein-GAN does not generate good images, while weight and spectral normalization generate good samples.

Algorithm of spectral normalization

To calculate the largest singular value of matrix [math]\displaystyle{ W }[/math] to implement spectral normalization, we appeal to power iterations. Algorithm is executed as follows:

  • Initialize [math]\displaystyle{ \tilde{u}_{l}\in R^{d_l} \text{for} l=1,\cdots,L }[/math] with a random vector (sampled from isotropic distribution)
  • For each update and each layer l:
    • Apply power iteration method to a unnormalized weight [math]\displaystyle{ W^l }[/math]:

\begin{align} \tilde{v_l}\leftarrow (W^l)^T\tilde{u_l}/||(W^l)^T\tilde{u_l}||_2 \end{align}

\begin{align} \tilde{u_l}\leftarrow (W^l)^T\tilde{v_l}/||(W^l)^T\tilde{v_l}|| \end{align}

  • Calculate [math]\displaystyle{ \bar{W_{SN}} }[/math] with the spectral norm :

\begin{align} \bar{W_{SN}}(W^l)=W^l/\sigma(W^l) \end{align}

where

\begin{align} \sigma(W^l)=\tilde{u_l}^TW^l\tilde{v_l} \end{align}

  • Update [math]\displaystyle{ W^l }[/math] with SGD on mini-batch dataset [math]\displaystyle{ D_M }[/math] with a learning rate [math]\displaystyle{ \alpha }[/math]


\begin{align} W^l\leftarrow W^l-\alpha\nabla_{W^l}l(\bar{W_{SN}^l}(W^l),D_M) \end{align}


Conclusions

This paper proposes spectral normalization as a stabilizer for the training of GANs. When spectral normalization is applied to GANs on image generation tasks, the generated examples are more diverse than when using conventional weight normalization and achieve better or comparative inception scores relative to previous studies. The method imposes global regularization on the discriminator as opposed to local regularization introduced by WGAN-GP. In future work, the authors would like to investigate how their method compares analytically to other methods, while also further comparing it empirically by conducting experiments with their algorithm on larger and more complex datasets.

Open Source Code

The open source code for this paper can be found at https://github.com/pfnet-research/sngan_projection.

References

  1. Lemaréchal, Claude, and J. B. Hiriart-Urruty. "Convex analysis and minimization algorithms I." Grundlehren der mathematischen Wissenschaften 305 (1996).