http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=Ws2chen&feedformat=atomstatwiki - User contributions [US]2023-01-29T16:20:23ZUser contributionsMediaWiki 1.28.3http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Audio_Synthesis_of_Musical_Notes_with_WaveNet_autoencoders&diff=36467Neural Audio Synthesis of Musical Notes with WaveNet autoencoders2018-04-21T04:19:28Z<p>Ws2chen: /* Conclusion & Future Directions */</p>
<hr />
<div>= Introduction =<br />
The authors point out that most musical notes are currently synthesized with hand-designed instruments, which modify pitch, velocity, and filter parameters to produce the required tone, timbre, and dynamics of a sound. They regard this reliance on hand design as a limitation and propose a data-driven approach to audio synthesis instead. They demonstrate how to generate new types of expressive and realistic instrument sounds using a neural network model rather than specific arrangements of oscillators or algorithms for sample playback. The model learns semantically meaningful hidden representations that can be used as control signals for manipulating tone, timbre, and dynamics during playback. Because training such a data-hungry model requires a large dataset, the authors highlight the need for a musical counterpart to ImageNet. The motivation for this work stems from recent advances in autoregressive models such as WaveNet <sup>[[#References|[5]]]</sup> and SampleRNN<sup>[[#References|[6]]]</sup>. These models are effective at modeling short- and medium-scale (~500ms) signals, but rely on external conditioning for long-term dependencies; the proposed model removes the need for external conditioning.<br />
<br />
= Contributions =<br />
This paper has two main contributions, one theoretical and one empirical: <br />
<br />
=== Theoretical contribution ===<br />
A WaveNet-style autoencoder that learns to encode temporal data over long-term audio structure without requiring external conditioning.<br />
<br />
=== Empirical contribution === <br />
The NSynth data set, which the authors constructed from scratch. It is a large data set of musical notes inspired by the emergence of large image data sets, and it serves as a valuable training and test resource for future work.<br />
<br />
= Models =<br />
<br />
[[File:paper26-figure1-models.png|center]]<br />
<br />
== WaveNet Autoencoder ==<br />
<br />
While the proposed autoencoder structure is very similar to that of WaveNet, the authors argue that the algorithm is novel in two ways:<br />
* It attains consistent long-term structure without any external conditioning.<br />
* It creates a meaningful embedding that can be interpolated between.<br />
<br />
In the original WaveNet architecture the authors use a stack of dilated convolutions to predict the next sample of audio given the previous samples. This approach was prone to "babbling" since it did not take into account the long-term structure of the audio. In this model the joint probability of generating audio <math>x</math> is:<br />
<br />
\begin{align}<br />
p(x) = \prod_{i=1}^N p(x_i | x_1, \ldots , x_{i-1})<br />
\end{align}<br />
<br />
The authors try to capture long-term structure by passing the raw audio through the encoder to produce an embedding <math>Z = f(x) </math>, and then shifting the input and feeding it into the decoder, which reproduces the input. The resulting probability distribution is: <br />
<br />
\begin{align}<br />
p(x) = \prod_{i=1}^N p(x_i | x_1, \ldots , x_{i-1}, f(x))<br />
\end{align}<br />
<br />
<br />
A detailed block diagram of the modified WaveNet structure can be seen in figure 1b. The diagram shows the encoder as a 30-layer network in which each node is a ReLU nonlinearity followed by a non-causal dilated convolution. Dilated convolution (also known as convolution with holes) is a type of convolution in which the filter skips input values with a certain step (a step size of 1 is equivalent to the standard convolution), effectively allowing the network to operate at a coarser scale than traditional convolutional layers and to have very large receptive fields. The resulting convolution has 128 channels, all of which are fed into another ReLU nonlinearity and then into a 1x1 convolution, before being downsampled with average pooling to produce a 16-dimensional <math>Z </math> distribution. Each <math>Z </math> encoding covers a specific temporal resolution, which the authors tuned to 32ms. This means that there are 125 16-dimensional <math>Z </math> encodings for each 4 second note present in the NSynth database (1984 embeddings). <br />
Before the <math>Z </math> embedding enters the decoder it is first upsampled to the original audio rate using nearest neighbor interpolation. The embedding then passes through the decoder to recreate the original audio note. The input audio data is first quantized using 8-bit mu-law encoding into 256 possible values, and the output prediction is the softmax over the possible values.<br />
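<br />
As a rough illustration of the 8-bit mu-law quantization mentioned above, the following Python sketch compands a sample in [-1, 1] and maps it to one of 256 classes. The function names and the rounding scheme are assumptions made for illustration; they are not taken from the authors' code.<br />

```python
import math

def mu_law_encode(x, mu=255):
    """Map a sample in [-1, 1] to an integer class in [0, mu] (illustrative)."""
    x = max(-1.0, min(1.0, x))  # clip to the valid range
    # Logarithmic companding: sign(x) * ln(1 + mu*|x|) / ln(1 + mu)
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Shift from [-1, 1] to [0, mu] and round to the nearest integer class
    return int((compressed + 1) / 2 * mu + 0.5)

def mu_law_decode(q, mu=255):
    """Approximate inverse: map a class index back to a sample in [-1, 1]."""
    compressed = 2 * q / mu - 1
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed)

if __name__ == "__main__":
    for sample in (-1.0, -0.1, 0.0, 0.1, 1.0):
        q = mu_law_encode(sample)
        print(sample, q, round(mu_law_decode(q), 4))
```

Because the companding curve is logarithmic, the quantization steps are finer near zero amplitude and coarser near full scale.<br />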
<br />
== Baseline: Spectral Autoencoder ==<br />
Unable to find an alternative fully deep model against which to compare their proposed WaveNet autoencoder, the authors built a strong baseline of their own: a spectral autoencoder. The block diagram of its architecture can be seen in figure 1a. The baseline network is 10 layers deep. Each layer has 4x4 kernels with 2x2 strides, followed by a leaky ReLU (0.1) and batch normalization. The size of the final hidden vector (Z) was set to 1984 to exactly match the hidden vector of the WaveNet autoencoder. <br />
<br />
Given the simple architecture, the authors first attempted to train the baseline on raw waveforms as input, with a mean-squared error (MSE) cost. This did not work well and highlighted the inadequacy of the independent Gaussian assumption. Spectral representations from the FFT worked better, but had low perceptual quality despite a low MSE cost after training. Training on the log magnitude of the power spectra, normalized between 0 and 1, was found to correlate best with perceptual distortion. The authors also explored several representations of phase, and found that estimating the magnitude and using established iterative techniques to reconstruct the phase was most effective. (The technique for reconstructing the phase from the magnitude comes from Griffin and Lim (1984). It can be summarized as follows. In each iteration, generate a Fourier signal z by taking the Short Time Fourier transform of the current estimate of the complete time-domain signal, and replacing its magnitude component with the known true magnitude. Then find the time-domain signal whose Short Time Fourier transform is closest to z in the least-squares sense. This is the estimate of the complete signal for the next iteration.) A final heuristic that the authors used to increase the accuracy of the baseline was to weight the MSE loss, starting at 10 for 0 Hz and decreasing linearly to 1 at 4000 Hz and above. This is justified because the fundamental frequencies of most instruments lie at lower frequencies. <br />
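<br />
The frequency weighting heuristic can be sketched as follows. The function names and the per-bin averaging are illustrative assumptions; only the endpoints (a weight of 10 at 0 Hz, falling linearly to 1 at 4000 Hz and above) come from the description above.<br />

```python
def frequency_weight(freq_hz, low_weight=10.0, high_weight=1.0, cutoff_hz=4000.0):
    """Weight that falls linearly from low_weight at 0 Hz to high_weight at cutoff_hz."""
    if freq_hz >= cutoff_hz:
        return high_weight
    return low_weight + (high_weight - low_weight) * (freq_hz / cutoff_hz)

def weighted_mse(pred, target, freqs_hz):
    """Mean of squared errors, each scaled by the weight of its frequency bin."""
    total = 0.0
    for p, t, f in zip(pred, target, freqs_hz):
        total += frequency_weight(f) * (p - t) ** 2
    return total / len(pred)

if __name__ == "__main__":
    # The same error of 1.0 costs ten times more at 0 Hz than at 4000 Hz.
    print(weighted_mse([1.0, 1.0], [0.0, 0.0], [0.0, 4000.0]))
```

With this weighting, an error in a low-frequency bin contributes more to the loss than an equal error in a high-frequency bin.<br />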
<br />
== Training ==<br />
Both the modified WaveNet and the baseline autoencoder were trained using stochastic gradient descent with the Adam optimizer. The authors trained the baseline autoencoder asynchronously for 1,800,000 iterations with a batch size of 8 and a learning rate of 1e-4, whereas the WaveNet models were trained synchronously for 250,000 iterations with a batch size of 32 and a decaying learning rate ranging from 2e-4 to 6e-6. Here synchronous training refers to the process of training both the encoder and decoder at the same time.<br />
<br />
= The NSynth Dataset =<br />
To evaluate the WaveNet autoencoder model, the authors wanted an audio dataset that let them explore the learned embeddings. Musical notes are an ideal setting for this study. Prior to this paper, the existing music datasets included the RWC music database (Goto et al., 2003)<sup>[[#References|[8]]]</sup> and the dataset from Romani Picas et al.<sup>[[#References|[9]]]</sup> However, the authors wanted to develop a larger dataset.<br />
<br />
The NSynth dataset has 306,043 unique musical notes (each with a unique pitch, timbre, and envelope), all 4 seconds in length and sampled at 16,000 Hz. The data set consists of 1006 different instruments, each playing on average 65.4 different pitches at on average 4.75 different velocities. Averages are reported because not all instruments can reach all 88 MIDI pitches or all 5 velocities desired by the authors. The dataset has the following split: training set with 289,205 notes, validation set with 12,678 notes, and test set with 4,096 notes.<br />
<br />
Along with each note the authors also included the following annotations:<br />
* Source - The way each sound was produced. There were 3 classes ‘acoustic’, ‘electronic’ and ‘synthetic’.<br />
* Family - The family class of instruments that produced each note. There are 11 classes, which include ‘bass’, ‘brass’, ‘vocal’, etc.<br />
* Qualities - Sonic qualities about each note<br />
<br />
The full dataset is publicly available here: https://magenta.tensorflow.org/datasets/nsynth as TFRecord files with training and holdout splits.<br />
<br />
[[File:nsynth_table.png | 400px|thumb|center|Full details of the NSynth dataset.]]<br />
<br />
= Evaluation =<br />
<br />
To fully analyze all aspects of WaveNet the authors proposed three evaluations:<br />
* Reconstruction - Both Quantitative and Qualitative analysis were considered<br />
* Interpolation in Timbre and Dynamics<br />
* Entanglement of Pitch and Timbre <br />
<br />
Sound is historically very difficult to quantify from a picture representation as it requires training and expertise to analyze. Even with expertise it can be difficult to complete a full analysis as two very different sounds can look quite similar in their respective pictorial representations. This is why the authors recommend all readers to listen to the created notes which can be found here: https://magenta.tensorflow.org/nsynth.<br />
<br />
Even so, because it is hard to publish a paper with sound included, the authors illustrate the differences between the two proposed algorithms and the original note pictorially. Each note is displayed using the constant-Q transform (CQT), which captures the dynamics of timbre while also representing the frequencies of the sound.<br />
<br />
== Reconstruction ==<br />
<br />
[[File:paper27-figure2-reconstruction.png|center]]<br />
<br />
The authors attempted to show magnitude and phase on the same plot above. Instantaneous frequency is the derivative of the phase, and the intensity of the solid lines is proportional to the log magnitude of the power spectrum. If the harmonic frequency <math>f_{harm}</math> and the frequency of an FFT bin <math>f_{bin}</math> are not the same, then there will be a constant phase shift: <br />
<math><br />
\Delta \phi = (f_{bin} - f_{harm}) \dfrac{hopsize}{samplerate}<br />
</math>.<br />
<br />
=== Qualitative Comparison ===<br />
In Figure 2, CQT spectrograms are displayed for 3 different instruments, including the original note spectrograms and the model reconstruction spectrograms. For the reconstructions, the baseline is shown alongside WaveNet for comparison. Each note contains some noise, a fundamental frequency with a series of harmonics, and a decay. For the glockenspiel, the WaveNet autoencoder is able to reproduce the magnitude and phase of the fundamental frequency (A and C in Figure 2) and the attack (B in Figure 2) of the instrument, whereas the baseline autoencoder introduces nonexistent harmonics (D in Figure 2). The flugelhorn, on the other hand, presents the starkest difference between the WaveNet and baseline autoencoders. The WaveNet, while not perfect, is able to reproduce the vibrato (I and J in Figure 2) across multiple frequencies, which results in a natural-sounding note. The baseline not only fails to do this but also adds extra noise (K in Figure 2). The authors do add that the WaveNet produces some strikes (L in Figure 2); however, they argue that these are inaudible.<br />
<br />
[[File:paper27-table1.png|center]]<br />
<br />
Mu-law encoding was used in the original WaveNet [https://arxiv.org/pdf/1609.03499.pdf paper] to make the problem "more tractable" compared to raw 16-bit integer values. In that paper, they note that "especially for speech, this non-linear quantization produces a significantly better reconstruction" compared to a linear scheme. This might be expected considering that the mu-law companding transformation was designed to [https://www.cisco.com/c/en/us/support/docs/voice/h323/8123-waveform-coding.html#t4 encode speech]. In this application though, using this encoding creates perceptible distortion that sounds similar to clipping.<br />
<br />
=== Quantitative Comparison ===<br />
For a quantitative comparison the authors trained a separate multi-task classifier to predict the pitch or quality of a note. The reconstructions from both the baseline and the WaveNet autoencoder were then fed to this classifier. As seen in table 1, WaveNet significantly outperformed the baseline in both metrics, posting a ~70% increase when only considering pitch.<br />
<br />
== Interpolation in Timbre and Dynamics ==<br />
<br />
[[File:paper27-figure3-interpolation.png|center]]<br />
<br />
For this evaluation the authors reconstructed audio from linear interpolations in Z space between different instruments and compared these to the superposition of the original two instruments. Not surprisingly, the model fuses aspects of both instruments during the recreation. The authors claim, however, that WaveNet produces much more realistic-sounding results. <br />
To support their claim, the authors point to WaveNet's ability to create a dynamic mixing of overtones in time, even jumping to higher harmonics (A in Figure 3), capturing the timbre and dynamics of both the bass and the flute. This can be seen again in (B in Figure 3), where WaveNet adds additional harmonics as well as sub-harmonics to the original flute note.<br />
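<br />
Linear interpolation in Z space can be sketched in a few lines of Python. This assumes each embedding is stored as a list of frames (for example 125 frames of 16 values each); the function name is hypothetical, and the decoder that turns the mixed embedding back into audio is not shown.<br />

```python
def interpolate_embeddings(z_a, z_b, alpha):
    """Frame-wise linear interpolation: alpha=0 gives z_a, alpha=1 gives z_b."""
    return [
        [(1 - alpha) * a + alpha * b for a, b in zip(frame_a, frame_b)]
        for frame_a, frame_b in zip(z_a, z_b)
    ]

if __name__ == "__main__":
    bass = [[0.0, 2.0], [1.0, 1.0]]   # toy embedding: 2 frames, 2 dimensions
    flute = [[4.0, 6.0], [3.0, 5.0]]
    print(interpolate_embeddings(bass, flute, 0.5))
```

Note that this mixes the embeddings before decoding, which is different from superimposing the two decoded waveforms.<br />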
<br />
== Entanglement of Pitch and Timbre ==<br />
<br />
[[File:paper27-table2.png|center]]<br />
<br />
[[File:paper27-figure4-entanglement.png|center]]<br />
<br />
To study the entanglement between pitch and Z space the authors constructed a classifier which was expected to drop in accuracy if the representations of pitch and timbre are disentangled, as it relies heavily on the pitch information. This is clearly demonstrated by the first two rows of table 2, where WaveNet relies more strongly on pitch than the baseline algorithm. The authors provide a more qualitative demonstration in figure 4. They show a situation in which a classifier may be confused: a note with a pitch of +12 is almost exactly the same as the original apart from an emergence of sub-harmonics.<br />
<br />
Further insight into the relationship between pitch and timbre can be gained by studying how the network embeddings vary across pitches for specific instruments. This is depicted in figure 5 for several instruments across their entire 88 note range at 127 velocity. It can be noted from the figure that the instruments have a unique separation into two or more registers over which the embeddings of notes with different pitches are similar. This is expected since instrumental dynamics and timbre vary dramatically over the range of the instrument.<br />
<br />
= Conclusion & Future Directions =<br />
<br />
This paper presents a WaveNet autoencoder model, which is built on top of the WaveNet model, and evaluates it on the NSynth dataset. The paper also introduces a new large-scale dataset of musical notes: NSynth. NSynth was inspired by the image recognition datasets that have been core to recent progress in deep learning. The authors encourage the broader community to use NSynth as a benchmark and entry point into audio machine learning. They also view NSynth as a building block for future datasets and envision a high-quality multi-note dataset for tasks like generation and transcription that involve learning complex language-like dependencies.<br />
<br />
One significant area in which the authors say great improvement is needed is the large memory requirement of their algorithm. Because of it, the current WaveNet must rely on downsampling and is thus unable to fully capture the global context. This is an area where model compression techniques could be beneficial. That is, quantization and pruning could be effective: with 4-bit quantization during the entire process (weights, activations, gradients, and errors, as in the work of Wu et al., 2018<sup>[[#References|[7]]]</sup>), the memory requirement could be reduced by at least 8 times. The authors also state that research using different input representations (instead of mu-law) to minimize distortion is ongoing.<br />
<br />
= Acknowledgments = <br />
A huge thanks to Hans Bernhard for help with the dataset, Colin Raffel for crucial conversations, and Sageev Oore for thoughtful analysis.<br />
<br />
= Critique = <br />
* The authors did not conduct a human study of sound similarity between the original, baseline, and WaveNet outputs.<br />
* Architecture is not very novel.<br />
* In order to have a comparison, they set out to create a straight-forward baseline for the neural audio synthesis experiments.<br />
<br />
= Open Source Code =<br />
<br />
Google has released all code related to this paper at the following open source repository: https://github.com/tensorflow/magenta/tree/master/magenta/models/nsynth<br />
<br />
= References =<br />
<br />
# Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D. & Simonyan, K. (2017). Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. Proceedings of the 34th International Conference on Machine Learning, in PMLR 70:1068-1077.<br />
# Griffin, Daniel, and Jae Lim. "Signal estimation from modified short-time Fourier transform." IEEE Transactions on Acoustics, Speech, and Signal Processing 32.2 (1984): 236-243.<br />
# NSynth: Neural Audio Synthesis. (2017, April 06). Retrieved March 19, 2018, from https://magenta.tensorflow.org/nsynth <br />
# The NSynth Dataset. (2017, April 05). Retrieved March 19, 2018, from https://magenta.tensorflow.org/datasets/nsynth<br />
# Oord, Aaron van den, Nal Kalchbrenner, and Koray Kavukcuoglu. "Pixel recurrent neural networks." arXiv preprint arXiv:1601.06759 (2016).<br />
# Mehri, Soroush, et al. "SampleRNN: An unconditional end-to-end neural audio generation model." arXiv preprint arXiv:1612.07837 (2016).<br />
# Wu, S., Li, G., Chen, F., & Shi, L. (2018). Training and Inference with Integers in Deep Neural Networks. arXiv preprint arXiv:1802.04680.<br />
# Goto, Masataka, Hashiguchi, Hiroki, Nishimura, Takuichi, and Oka, Ryuichi. Rwc music database: Music genre database and musical instrument sound database. 2003.<br />
# Romani Picas, Oriol, Parra Rodriguez, Hector, Dabiri, Dara, Tokuda, Hiroshi, Hariya, Wataru, Oishi, Koji, and Serra, Xavier. A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society, 2015</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Synthetic_and_natural_noise_both_break_neural_machine_translation&diff=36464stat946w18/Synthetic and natural noise both break neural machine translation2018-04-21T04:15:53Z<p>Ws2chen: /* Conclusion */</p>
<hr />
<div>== Introduction ==<br />
Humans have surprisingly robust language processing systems that can easily cope with disordered words. For example, a human reader can recognize the meaning of the following sentence without much difficulty:<br />
<br />
* "Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae."<br />
<br />
A person's ability to read this text comes as no surprise to the psychology literature:
# Saberi & Perrott (1999) found that this robustness extends to audio as well.<br />
# Rayner et al. (2006) found that in noisier settings reading comprehension only slowed by 11%.<br />
# McCusker et al. (1981) found that the common case of swapping letters could often go unnoticed by the reader.<br />
# Mayall et al. (1997) showed that we rely on word shape.<br />
# Reicher (1969) and Pelli et al. (2003) found that we can switch between whole word recognition, but the first and last letter positions are required to stay constant for comprehension.<br />
<br />
However, neural machine translation (NMT) systems are brittle. For example, the Arabic word
[[File:Good_morning.PNG]] means a blessing for good morning, whereas [[File:Hunt.PNG]] means hunt or slaughter. <br />
<br />
Facebook's MT system mistakenly confused two words that only differ by one character, a situation that is challenging for a character-based NMT system.<br />
<br />
The figure below shows the performance translating German to English as a function of the percent of German words modified. Here two types of noise are shown: (1) In blue, random permutation of the word and (2) In green, swapping a pair of adjacent letters that does not include the first or last letter of the word. The important thing to note is that even small amounts of noise lead to substantial drops in performance.<br />
<br />
[[File:BLEU_plot.PNG|center]] <br />
<br />
BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is". A BLEU score lies between 0 and 1. BLEU computes the scores for individual translated segments and then computes an average accuracy score for the whole corpus.<br />
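<br />
To make the idea of correspondence with a reference translation concrete, the Python sketch below computes the clipped (modified) n-gram precision on which BLEU is built. It is a simplification with hypothetical function names: full BLEU combines several n-gram orders with a geometric mean and applies a brevity penalty, neither of which is shown here.<br />

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with clipped counts."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    if not cand_counts:
        return 0.0
    # Each n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

if __name__ == "__main__":
    reference = "the cat is on the mat".split()
    print(modified_precision("the the the the the the the".split(), reference, 1))
    print(modified_precision("the cat sat".split(), reference, 2))
```

Clipping prevents a candidate from scoring well simply by repeating a common reference word.<br />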
<br />
This paper explores two simple strategies for increasing model robustness:<br />
# using structure-invariant representations (character CNN representation)<br />
# robust training on noisy data, a form of adversarial training.<br />
<br />
The goal of the paper is two-fold:<br />
# to initiate a conversation on robust training and modeling techniques in NMT<br />
# to promote the creation of better and more linguistically accurate artificial noise to be applied to new languages and tasks<br />
<br />
== Adversarial examples ==<br />
The growing literature on adversarial examples has demonstrated how dangerous it can be to have brittle machine learning systems being used so pervasively in the real world. Small changes to the input can lead to dramatic<br />
failures of deep learning models. This leads to a potential for malicious attacks using adversarial examples. An important distinction is often drawn between white-box attacks, where adversarial examples are generated with<br />
access to the model parameters, and black-box attacks, where examples are generated without such access.<br />
<br />
The paper devises simple methods for generating adversarial examples for NMT. The authors do not assume any access to the NMT models' gradients, instead relying on cognitively-informed and naturally occurring language errors to generate noise.<br />
<br />
== MT system ==<br />
The authors experiment with three different NMT systems with access to character information at different levels.<br />
# Use <code>char2char</code>, the fully character-level model of Lee et al. (2017). This model processes a sentence as a sequence of characters. The encoder works as follows: the characters are embedded as vectors, and then the sequence of vectors is fed to a convolutional layer. The sequence output by the convolutional layer is then shortened by max pooling in the time dimension. The output of the max-pooling layer is then fed to a four-layer highway network (Srivastava et al. 2015), and the output of the highway network is in turn fed to a bidirectional GRU, producing a sequence of hidden units. The sequence of hidden units is then processed by the decoder, a GRU with attention, to produce probabilities over sequences of output characters.<br />
# Use <code>Nematus</code> (Sennrich et al., 2017), a popular NMT toolkit. It is another sequence-to-sequence model with several architecture modifications, most notably operating on sub-word units obtained with byte-pair encoding. Byte-pair encoding (Sennrich et al. 2015, Gage 1994) is an algorithm in which we begin with a list of characters as our symbols, and repeatedly fuse common combinations to create new symbols. For example, if we begin with the letters a to z as our symbol list, and we find that "th" is the most common two-letter combination in a corpus, then we would add "th" to our symbol list in the first iteration. After we have used this algorithm to create a symbol list of the desired size, we apply a standard encoder-decoder with attention.<br />
# Use an attentional sequence-to-sequence model with a word representation based on a character convolutional neural network (<code>charCNN</code>). The <code>charCNN</code> model is similar to <code>char2char</code>, but uses a shallower highway network and, although it reads the input sentence as characters, it produces as output a probability distribution over words, not characters.<br />
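<br />
The byte-pair encoding procedure described for <code>Nematus</code> can be illustrated with a single merge step. The sketch below is a toy version with hypothetical function names and a made-up three-word corpus; it is not the implementation used by the toolkit.<br />

```python
from collections import Counter

def most_frequent_pair(words):
    """words maps a tuple of symbols to its corpus frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Fuse every occurrence of `pair` into a single new symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

if __name__ == "__main__":
    # Toy corpus: "the" x5, "that" x3, "cat" x2, split into characters.
    corpus = {("t", "h", "e"): 5, ("t", "h", "a", "t"): 3, ("c", "a", "t"): 2}
    pair = most_frequent_pair(corpus)
    print(pair, merge_pair(corpus, pair))
```

Repeating these two steps until the symbol list reaches the desired size yields the sub-word vocabulary.<br />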
<br />
== Data ==<br />
=== MT Data ===<br />
The authors use the TED talks parallel corpus prepared for IWSLT 2016 (Cettolo et al., 2012) for testing all of the NMT systems.<br />
<br />
[[File:Table1x.PNG|center]]<br />
<br />
=== Natural and Artificial Noise ===<br />
==== Natural Noise ====<br />
The three languages, French, German, and Czech, each have their own frequent natural errors. The corpora of edits used for these languages are:<br />
<br />
# French : Wikipedia Correction and Paraphrase Corpus (WiCoPaCo)<br />
# German : RWSE Wikipedia Correction Dataset and The MERLIN corpus<br />
# Czech : CzeSL Grammatical Error Correction Dataset (CzeSL-GEC) which is a manually annotated dataset of essays written by both non-native learners of Czech and Czech pupils<br />
<br />
The authors harvested naturally occurring errors (typos, misspellings, etc.) corresponding to these three languages from available corpora of edits to build a look-up table of possible lexical replacements.<br />
<br />
They insert these errors into the source side of the parallel data by replacing every word in the corpus with an error if one exists in the authors' dataset. When there is more than one possible replacement to choose from, one is sampled uniformly. Words for which there is no error are kept as is.<br />
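<br />
The replacement procedure amounts to a dictionary lookup per word. The Python sketch below assumes that, when several replacements exist, one is drawn uniformly at random; the function name and the tiny error table are hypothetical and not taken from the corpora listed above.<br />

```python
import random

def add_natural_noise(tokens, error_table, rng=random):
    """Replace each token with a harvested error if one exists; keep it otherwise."""
    noisy = []
    for token in tokens:
        candidates = error_table.get(token)
        # Uniform choice among the known errors; unknown words pass through unchanged.
        noisy.append(rng.choice(candidates) if candidates else token)
    return noisy

if __name__ == "__main__":
    table = {"the": ["teh"], "their": ["thier", "there"]}  # made-up look-up table
    print(add_natural_noise("the cat ate their food".split(), table, random.Random(0)))
```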
<br />
==== Synthetic Noise ====<br />
In addition to naturally collected sources of error, the authors also experiment with four types of synthetic noise: Swap, Middle Random, Fully Random, and Key Typo. <br />
# <code>Swap</code>: The first and simplest source of noise is swapping two adjacent letters (the first and last letters are not altered; only applied to words of length >=4).<br />
# <code>Middle Random</code>: Randomize the order of all the letters in a word except for the first and last (only applied to words of length >=4).<br />
# <code>Fully Random</code>: Completely randomize the order of the letters in a word.<br />
# <code>Key Typo</code>: Randomly replace one letter in each word with an adjacent key.<br />
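<br />
The four synthetic noise types can be sketched in Python as below. The function names are hypothetical, and the keyboard adjacency map is a deliberately partial QWERTY example chosen for illustration; a real experiment would need a full map for the keyboard layout of the source language.<br />

```python
import random

# Partial, illustrative QWERTY adjacency map (an assumption, not the authors' table).
NEIGHBORS = {"a": "qwsz", "e": "wrsd", "o": "ipkl", "t": "rygf", "n": "bhjm"}

def swap(word, rng):
    """Swap two adjacent interior letters; words shorter than 4 are unchanged."""
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word) - 2)  # i and i + 1 are both interior positions
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def middle_random(word, rng):
    """Shuffle everything except the first and last letters."""
    if len(word) < 4:
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def fully_random(word, rng):
    """Shuffle all letters of the word."""
    chars = list(word)
    rng.shuffle(chars)
    return "".join(chars)

def key_typo(word, rng):
    """Replace one letter that has known neighbors with an adjacent key."""
    positions = [i for i, c in enumerate(word) if c in NEIGHBORS]
    if not positions:
        return word
    i = rng.choice(positions)
    return word[:i] + rng.choice(NEIGHBORS[word[i]]) + word[i + 1:]

if __name__ == "__main__":
    rng = random.Random(0)
    for noise in (swap, middle_random, fully_random, key_typo):
        print(noise.__name__, noise("letter", rng))
```

The first three functions only permute letters, so the multiset of characters in each word is preserved; only the key typo introduces a character that was not in the original word.<br />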
<br />
[[File:Table3x.PNG|center]]<br />
<br />
Table 3 shows BLEU scores of models trained on clean (Vanilla) texts and tested on clean and noisy<br />
texts. All models suffer a significant drop in BLEU when evaluated on noisy texts. This is true<br />
for both natural noise and all kinds of synthetic noise. The more noise in the text, the worse the<br />
translation quality, with random scrambling producing the lowest BLEU scores.<br />
<br />
In contrast to the poor performance of these methods in the presence of noise, humans can perform very well, as mentioned in the introduction. The table below shows the translations produced by a native German speaker who was not familiar with the meme, and by three machine translation methods. Clearly, the machine translation methods failed. <br />
<br />
[[File:paper16_tab4.png|center]]<br />
<br />
The authors also examined whether a simple spell checker would help. They corrected errors with Google's spell checker by simply accepting the first suggestion for each detected mistake. There was a small improvement in the French and German translations, and a small drop in accuracy for the Czech translation, attributed to its more complex grammar. The authors concluded that existing spell checkers would not raise the accuracy to a level comparable with vanilla text. The results are shown in the table below.<br />
<br />
<br />
[[File:paper16_tab5.png|center]]<br />
<br />
== Dealing with noise ==<br />
=== Structure Invariant Representations ===<br />
The three NMT models are all sensitive to word structure. The <code>char2char</code> and <code>charCNN</code> models both have convolutional layers on character sequences, designed to capture character n-grams (which are sequences of characters or words, of length n). The model in <code>Nematus</code> is based on sub-word units obtained with byte pair encoding (where common consecutive characters are replaced with a unique byte that does not occur in the data). It thus relies on character order.<br />
<br />
The simplest way to improve such a model is to take the average character embeddings as a word representation. This model, referred to as <code>meanChar</code>, first generates a word representation by averaging character embeddings, and then proceeds with a word-level encoder similar to the <code>charCNN</code> model.<br />
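<br />
A minimal Python sketch of the averaging idea is given below, using made-up two-dimensional character embeddings and a hypothetical function name. It only illustrates why the representation ignores character order; the word-level encoder that follows is not shown.<br />

```python
def mean_char(word, embeddings):
    """Average the embedding vectors of a word's characters, dimension by dimension."""
    vectors = [embeddings[c] for c in word]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

if __name__ == "__main__":
    # Toy embeddings (assumed values for illustration only).
    emb = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0], "x": [0.0, 0.0]}
    print(mean_char("abc", emb), mean_char("cba", emb))  # same vector
    print(mean_char("abx", emb))                         # different vector
```

Any permutation of a word's characters gives the same average, whereas substituting a character changes it.<br />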
<br />
[[File:Table5x.PNG|center]]<br />
<br />
<code>meanChar</code> handles the three scrambling errors (Swap, Middle Random and Fully Random) well, but performs poorly on Keyboard errors and Natural errors.<br />
<br />
=== Black-Box Adversarial Training ===<br />
<br />
<code>charCNN</code> Performance<br />
[[File:Table6x.PNG|center]]<br />
<br />
Here is the result of the translation of the scrambled meme:<br />
“According to a study of Cambridge University, it doesn’t matter which technology in a word is going to get the letters in a word that is the only important thing for the first and last letter.”<br />
<br />
== Analysis ==<br />
=== Learning Multiple Kinds of Noise in <code>charCNN</code> ===<br />
<br />
As Table 6 above shows, <code>charCNN</code> models performed quite well across different noise types on the test set when they are trained on a mix of noise types, which led the authors to speculate that filters from different convolutional layers learned to be robust to different types of noises. To test this hypothesis, they analyzed the weights learned by <code>charCNN</code> models trained on two kinds of input: completely scrambled words (Rand) without other kinds of noise, and a mix of Rand+Key+Nat kinds of noise. For each model, they computed the variance across the filter dimension for each one of the 1000 filters and for each one of the 25 character embedding dimensions, which were then averaged across the filters to yield 25 variances. <br />
<br />
As Figure 2 below shows, the variances for the ensemble model are higher and more varied, which indicates that the filters learned different patterns and the model differentiated between different character embedding dimensions. Under the random scrambling scheme, there should be no patterns for the model to learn, so it makes sense for the filter weights to stay close to uniform weights, hence the consistently lower variance measures.<br />
<br />
[[File:Table7x.PNG|center]]<br />
<br />
=== Richness of Natural Noise ===<br />
[[File:SNNoise_NatNoiseExp.png|750px|right]]<br />
The synthetic noise used in this paper appears to be very different from natural noise. This is evident because none of the models trained only on synthetic noise demonstrated good performance on natural noise. Therefore, the authors say that the noise models used in this paper are not representative of real noise and that a more sophisticated model using explicit phonemic and linguistic knowledge is required if an error-free corpus is to be augmented with errors for training. The synthetic noise analysed lacks two common types of typos: inserting a character that is adjacent (on the keyboard) to a letter, and omitting letters.<br />
<br />
During a manual analysis of a small subset of the German dataset, the natural noise was found to be comprised of:<br />
* 34% Phonetic error<br />
* 32% Character omissions<br />
* 34% Other: Morphological, Key swap, etc.<br />
<br />
Examples of these types of errors can be seen in Table 8.<br />
<br />
== Conclusion ==<br />
In this work, the authors have shown that character-based NMT models are extremely brittle and tend to break when presented with both natural and synthetic kinds of noise. After a comparison of the models, they found that a character-based CNN can learn to address multiple types of errors that are seen in training. For future work, the authors suggested generating more realistic synthetic noise by using phonetic and syntactic structure. The authors believe that more work is necessary to immunize NMT models against natural noise. As corpora with natural noise are limited, another direction is to design better NMT architectures that are robust to noise without seeing it in the training data. New psychology results on how humans cope with natural noise might point to possible solutions to this problem.<br />
<br />
== Criticism ==<br />
According to the [https://openreview.net/forum?id=BJ8vJebC- OpenReview thread], a major critique of this paper is that the solutions presented do not adequately solve the problem. The response to the meanChar architecture has been mostly negative and the method of noise injection has been seen as a simple start. However, the authors have acknowledged these critiques, stating that they realize their solution is just a starting point. They argue that this paper has opened the discussion on dealing with noise in machine translation, which has been mostly left untouched. In addition, these solutions/models still do not tackle the problem of natural noise, as the models trained on synthetic noise don't generalize well to natural noise. A minor issue is that Table 4 does not include the results of machine translation on noise-free text as a comparison.<br />
<br />
== References ==<br />
# Yonatan Belinkov and Yonatan Bisk. Synthetic and Natural Noise Both Break Neural Machine Translation. In ''International Conference on Learning Representations (ICLR)'', 2018.<br />
# Mauro Cettolo, Christian Girardi, and Marcello Federico. WIT: Web Inventory of Transcribed and Translated Talks. In ''Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT)'', pp. 261–268, Trento, Italy, May 2012.<br />
# Jason Lee, Kyunghyun Cho, and Thomas Hofmann. Fully Character-Level Neural Machine Translation without Explicit Segmentation. ''Transactions of the Association for Computational Linguistics (TACL)'', 2017.<br />
# Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Laubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. Nematus: a Toolkit for Neural Machine Translation. In ''Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics'', pp. 65–68, Valencia, Spain, April 2017. Association for Computational Linguistics. URL http://aclweb.org/anthology/E17-3017.<br />
# Aurélien Max and Guillaume Wisniewski. Mining Naturally-occurring Corrections and Paraphrases from Wikipedia's Revision History. In ''Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC’10)'', Valletta, Malta, May 2010. European Language Resources Association (ELRA). ISBN 2-9517408-6-7. URL https://wicopaco.limsi.fr.<br />
# Katrin Wisniewski, Karin Schöne, Lionel Nicolas, Chiara Vettori, Adriane Boyd, Detmar Meurers, Andrea Abel, and Jirka Hana. MERLIN: An online trilingual learner corpus empirically grounding the European Reference Levels in authentic learner data, 10 2013. URL https://www.ukp.tu-darmstadt.de/data/spelling-correction/rwse-datasets.<br />
# Torsten Zesch. Measuring Contextual Fitness Using Error Contexts Extracted from the Wikipedia Revision History. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 529–538, Avignon, France, April 2012. Association for Computational Linguistics.<br />
# Suranjana Samanta and Sameep Mehta. Towards Crafting Text Adversarial Samples. arXiv preprint arXiv:1707.02812, 2017. Karel Sebesta, Zuzanna Bedrichova, Katerina Sormov́a, Barbora Stindlov́a, Milan Hrdlicka, Tereza Hrdlickov́a, Jiŕı Hana, Vladiḿır Petkevic, Toḿas Jeĺınek, Svatava Skodov́a, Petr Janes, Katerina Lund́akov́a, Hana Skoumalov́a, Simon Sĺadek, Piotr Pierscieniak, Dagmar Toufarov́a, Milan Straka, Alexandr Rosen, Jakub Ńaplava, and Marie Poĺackova. CzeSL grammatical error correction dataset (CzeSL-GEC). Technical report, LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University, 2017. URL https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2143.</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/MaskRNN:_Instance_Level_Video_Object_Segmentation&diff=36453stat946w18/MaskRNN: Instance Level Video Object Segmentation2018-04-21T04:00:39Z<p>Ws2chen: /* Conclusion */</p>
<hr />
<div>== Introduction ==<br />
Deep Learning has produced state of the art results in many computer vision tasks like image classification, object localization, object detection, object segmentation, semantic segmentation and instance level video object segmentation. Image classification classifies the image based on the prominent objects. Object localization is the task of finding the objects’ location in the frame. Object segmentation involves providing a pixel map which represents the pixel-wise location of the objects in the image. Semantic segmentation attempts to segment the image into meaningful parts. Instance level video object segmentation is the task of consistent object segmentation in video sequences. Deforming shapes, fast movements, and occlusion from multiple objects are just some of the significant challenges in instance level video object segmentation.<br />
<br />
There are 2 different types of video object segmentation, Unsupervised and Semi-supervised. <br />
* In unsupervised video object segmentation, the task is to find the salient objects and track the main objects in the video. <br />
* In a semi-supervised setting, the ground truth mask of the salient objects is provided for the first frame. The task is thus simplified to only track the objects required. <br />
<br />
In this paper, the authors look at a semi-supervised video object segmentation technique: the ground truth mask for the first frame is provided, as described in the overview below.<br />
<br />
== Background Papers ==<br />
Video object segmentation has been performed using spatio-temporal graphs [[https://pdfs.semanticscholar.org/7221/c3470fa89879aab3ef270570ced15cde28de.pdf 5], [http://ieeexplore.ieee.org/abstract/document/5539893/ 6], [http://openaccess.thecvf.com/content_iccv_2013/papers/Li_Video_Segmentation_by_2013_ICCV_paper.pdf 7], [https://link.springer.com/content/pdf/10.1007/s11263-011-0512-5.pdf 8]] and deep learning. The graph based methods construct 3D spatio-temporal graphs in order to model the inter- and intra-frame relationships of pixels or superpixels in a video. Hence they are computationally slower than deep learning methods and are unable to run in real time. There are 2 main deep learning techniques for semi-supervised video object segmentation: One Shot Video Object Segmentation (OSVOS) and Learning Video Object Segmentation from Static Images (MaskTrack). The following is a brief description of the new techniques introduced by these papers for the semi-supervised video object segmentation task.<br />
<br />
=== OSVOS (One-Shot Video Object Segmentation) ===<br />
<br />
[[File:OSVOS.jpg | 1000px]]<br />
<br />
This paper introduces the technique of frame-by-frame object segmentation without any temporal information from the previous frames of the video. The paper uses a VGG-16 network with pre-trained weights from an image classification task. This network is then converted into a fully convolutional network (FCN) by removing the fully connected dense layers at the end and adding convolution layers to generate a segment mask of the input. This network is then trained on the DAVIS 2016 dataset.<br />
<br />
During testing, the trained VGG-16 FCN is fine-tuned using the first frame of the video using the ground truth. Because this is a semi-supervised case, the segmented mask (ground truth) for the first frame is available. The first frame data is augmented by zooming/rotating/flipping the first frame and the associated segment mask.<br />
<br />
=== MaskTrack (Learning Video Object Segmentation from Static Images) ===<br />
<br />
[[File:MaskTrack.jpg | 500px]]<br />
<br />
MaskTrack takes the output of the previous frame to improve its predictions and to generate the segmentation mask for the next frame. Thus the input to the network is 4 channel wide (3 RGB channels from the frame at time <math>t</math> plus one binary segmentation mask from frame <math>t-1</math>). The output of the network is the binary segmentation mask for frame at time <math>t</math>. Using the binary segmentation mask (referred to as guided object segmentation in the paper), the network is able to use some temporal information from the previous frame to improve its segmentation mask prediction for the next frame.<br />
<br />
The model of the MaskTrack network is similar to a modular VGG-16 and is referred to as MaskTrack ConvNet in the paper. The network is trained offline on saliency segmentation datasets: ECSSD, MSRA 10K, SOD and PASCAL-S. The input mask for the binary segmentation mask channel is generated via non-rigid deformation and affine transformation of the ground truth segmentation mask. Similar data-augmentation techniques are also used during online training. Just like OSVOS, MaskTrack uses the first frame as ground truth (with augmented images) to fine-tune the network to improve prediction score for the particular video sequence.<br />
<br />
A parallel ConvNet network is used to generate a predicted segment mask based on the optical flow magnitude. The optical flow between 2 frames is calculated using the EpicFlow algorithm. The output of the two networks is combined using an averaging operation to generate the final predicted segmented mask.<br />
<br />
Table 1 below gives a summary comparison of the different state of the art algorithms. The noteworthy information included in this table is that the technique presented in this paper is the only one which takes into account long-term temporal information. This is accomplished with a recurrent neural net. Furthermore, the bounding box is also estimated instead of just a segmentation mask. The authors claim that this allows the incorporation of a location prior from the tracked object.<br />
<br />
[[File:Paper19-SegmentationComp.png]]<br />
<br />
== Dataset ==<br />
The three major datasets used in this paper are DAVIS-2016, DAVIS-2017 and Segtrack v2. The DAVIS-2016 dataset provides video sequences with only one segment mask for all salient objects. DAVIS-2017 improves the ground truth data by providing a segmentation mask for each salient object as a separate color segment mask. Segtrack v2 also provides multiple segmentation masks for the salient objects in the video sequence. These datasets try to recreate real-life scenarios like occlusions, low resolution videos, background clutter, motion blur, fast motion etc.<br />
<br />
== MaskRNN: Introduction ==<br />
Most techniques mentioned above don’t work directly on instance level segmentation of the objects through the video sequence. They focus on segmenting each frame individually, using additional information (mask propagation and optical flow) from the preceding frame to make predictions for the current frame. To address the instance level segmentation problem, MaskRNN proposes a framework where the salient objects are tracked and segmented by capturing the temporal information in the video sequence using a recurrent neural network.<br />
<br />
== MaskRNN: Overview ==<br />
In a video sequence <math>I = \{I_1, I_2, \ldots, I_T\}</math>, the sequence of <math>T</math> frames is given as input to the network, where the video sequence contains <math>N</math> salient objects. The ground truth for the first frame <math>y_1^*</math> is also provided for the <math>N</math> salient objects.<br />
In this paper, the problem is formulated as a time dependency problem: using a recurrent neural network, the prediction for the previous frame influences the prediction for the next frame. The approach also computes the optical flow between frames (optical flow is the apparent motion of objects between two consecutive frames in the form of a 2D vector field representing the displacement in brightness patterns for each pixel, apparent because it depends on the relative motion between the observer and the scene) and uses that as an input to the neural network. The optical flow is also used to align the output of the predicted mask. “The warped prediction, the optical flow itself, and the appearance of the current frame are then used as input for <math>N</math> deep nets, one for each of the <math>N</math> objects.”[1 - MaskRNN] Each deep net is made of an object localization network and a binary segmentation network. The binary segmentation network is used to generate the segmentation mask for an object. The object localization network is used to remove outliers from the predictions. The final prediction of the segmentation mask is generated by merging the predictions of the 2 networks. For <math>N</math> objects, there are <math>N</math> deep nets which predict the mask for each salient object. The predictions are then merged into a single prediction using an <math>\text{argmax}</math> operation at test time.<br />
<br />
== MaskRNN: Multiple Instance Level Segmentation ==<br />
<br />
[[File:2ObjectSeg.jpg | 850px]]<br />
<br />
Image segmentation requires producing a pixel level segmentation mask and this can become a multi-class problem. Instead, using the approach from [2- Mask R-CNN], this problem is converted into multiple binary segmentation problems. A separate segmentation mask is predicted for each salient object, so each prediction is a binary segmentation problem. The binary segments are combined using an <math>\text{argmax}</math> operation where each pixel is assigned to the object with the largest predicted probability.<br />
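The <math>\text{argmax}</math> merge can be illustrated with a minimal NumPy sketch. The function name <code>merge_instance_masks</code> and the background rule (a pixel is background when no object exceeds a 0.5 probability) are assumptions made for this example, not details taken from the paper.<br />

```python
import numpy as np

def merge_instance_masks(probs, threshold=0.5):
    """Merge per-object foreground probabilities into one label map.

    probs: array of shape (N, H, W), one probability map per object.
    Returns an (H, W) integer array: 0 is background, k is object k (1-based).
    """
    best_object = probs.argmax(axis=0)   # index of the most likely object
    best_prob = probs.max(axis=0)        # its probability at each pixel
    # Assumed rule: pixels where no object is confident stay background.
    return np.where(best_prob > threshold, best_object + 1, 0)

# Two hypothetical objects on a 2x2 frame.
probs = np.array([[[0.9, 0.2],
                   [0.6, 0.1]],
                  [[0.3, 0.8],
                   [0.7, 0.4]]])
labels = merge_instance_masks(probs)
```

In this toy input the bottom-left pixel is claimed by both objects (0.6 and 0.7) and is assigned to the second one, while the bottom-right pixel stays background because neither probability exceeds the threshold.<br />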
<br />
=== MaskRNN: Binary Segmentation Network ===<br />
<br />
[[File:MaskRNNDeepNet.jpg | 850px]]<br />
<br />
The above picture shows a single deep net employed for predicting the segment mask for one salient object in the video frame. The deep net consists of 2 networks: a binary segmentation network and an object localization network. The binary segmentation network is split into two streams: the appearance stream and the flow stream. The input of the appearance stream is the RGB frame at time <math>t</math> and the warped prediction of the binary segmentation mask from time <math>t-1</math>. The warping function uses the optical flow between frame <math>t-1</math> and frame <math>t</math> to generate a new binary segmentation mask for frame <math>t</math>. The input to the flow stream is the concatenation of the optical flow magnitude between frames <math>t-1</math> to <math>t</math> and frames <math>t</math> to <math>t+1</math> and the warped prediction of the segmentation mask from frame <math>t-1</math>. The magnitude of the optical flow is replicated into an RGB format before feeding it to the flow stream. The network architecture closely resembles a VGG-16 network without the pooling or fully connected layers at the end. The fully connected layers are replaced with convolutional and bilinear interpolation upsampling layers which are then linearly combined to form a feature representation that is the same size as the input image. This feature representation is then used to generate a binary segment mask. This technique is borrowed from the Fully Convolutional Network mentioned above. The outputs of the flow stream and the appearance stream are linearly combined and a sigmoid function is applied to the result to generate the binary mask for the <math>i</math>th object. All parts of the network are fully differentiable and thus it can be fully trained in every pass.<br />
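To make the warping step concrete, here is a minimal NumPy sketch of warping the mask from frame <math>t-1</math> onto frame <math>t</math>. It is a simplification under stated assumptions: the flow is taken to be a backward flow (for each pixel of frame <math>t</math>, the displacement to its source pixel in frame <math>t-1</math>), sampling is nearest-neighbour, and the name <code>warp_mask</code> is hypothetical. The paper's actual warping function may differ.<br />

```python
import numpy as np

def warp_mask(prev_mask, backward_flow):
    """Warp the mask of frame t-1 onto frame t.

    prev_mask: (H, W) array, mask predicted for frame t-1.
    backward_flow: (H, W, 2) array; backward_flow[y, x] = (dx, dy) points from
        pixel (x, y) in frame t to its source in frame t-1 (assumed convention).
    """
    h, w = prev_mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbour source coordinates, clipped to the image borders.
    src_x = np.clip(np.rint(xs + backward_flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + backward_flow[..., 1]).astype(int), 0, h - 1)
    return prev_mask[src_y, src_x]

# One foreground pixel at (row 1, col 1) that moves one pixel to the right.
prev_mask = np.zeros((4, 4), dtype=np.uint8)
prev_mask[1, 1] = 1
flow = np.zeros((4, 4, 2))
flow[..., 0] = -1.0          # every pixel in frame t came from one pixel to the left
warped = warp_mask(prev_mask, flow)
```

After warping, the foreground pixel sits at row 1, column 2, which is where the object is expected in frame <math>t</math>.<br />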
<br />
=== MaskRNN: Object Localization Network: ===<br />
Using a similar technique to the Fast-RCNN method of object localization, where the region of interest (RoI) pooling of the features of the region proposals (i.e. the bounding box proposals here) is performed and passed through fully connected layers to perform regression, the Object localization network generates a bounding box of the salient object in the frame. This bounding box is enlarged by a factor of 1.25 and combined with the output of binary segmentation mask. Only the segment mask available in the bounding box is used for prediction and the pixels outside of the bounding box are marked as zero. MaskRNN uses the convolutional feature output of the appearance stream as the input to the RoI-pooling layer to generate the predicted bounding box. A pixel is classified as foreground if it is both predicted to be in the foreground by the binary segmentation net and within the enlarged estimated bounding box from the object localization net.<br />
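The way the enlarged bounding box restricts the segmentation mask can be sketched as follows. The box format <code>(x0, y0, x1, y1)</code> with exclusive upper bounds, the enlargement about the box centre, and the helper names are assumptions for illustration; only the factor 1.25 and the zeroing of pixels outside the box come from the description above.<br />

```python
import numpy as np

def enlarge_box(box, scale, height, width):
    """Scale a box (x0, y0, x1, y1) about its centre and clip it to the image."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w = scale * (x1 - x0) / 2.0
    half_h = scale * (y1 - y0) / 2.0
    return (max(int(np.floor(cx - half_w)), 0),
            max(int(np.floor(cy - half_h)), 0),
            min(int(np.ceil(cx + half_w)), width),
            min(int(np.ceil(cy + half_h)), height))

def restrict_to_box(mask, box, scale=1.25):
    """Keep the binary mask inside the enlarged box and zero it elsewhere."""
    h, w = mask.shape
    x0, y0, x1, y1 = enlarge_box(box, scale, h, w)
    restricted = np.zeros_like(mask)
    restricted[y0:y1, x0:x1] = mask[y0:y1, x0:x1]
    return restricted

# A mask that is (wrongly) foreground everywhere, and a 4x4 predicted box.
mask = np.ones((10, 10), dtype=np.uint8)
restricted = restrict_to_box(mask, box=(4, 4, 8, 8))
```

Here the 4x4 box grows to cover columns and rows 3 to 8, so spurious foreground pixels far from the object are discarded while a small margin around the predicted box is kept.<br />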
<br />
=== Training and Finetuning ===<br />
For training the network depicted in Figure 1, backpropagation through time is used in order to preserve the recurrence relationship connecting the frames of the video sequence. Predictive performance is further improved by following the algorithm for semi-supervised setting for video object segmentation with fine-tuning achieved by using the first frame segmentation mask of the ground truth. In this way, the network is further optimized using the ground truth data.<br />
<br />
== MaskRNN: Implementation Details ==<br />
=== Offline Training ===<br />
The deep net is first trained offline on a set of static images. The ground truth is randomly perturbed locally to generate the imperfect mask from frame <math>t-1</math>. Two different networks are trained offline separately for the DAVIS-2016 and DAVIS-2017 datasets for a fair evaluation on both datasets. After both the object localization net and the binary segmentation networks have been trained, the temporal information in the network is used to further improve the segmented prediction results. Because of GPU memory constraints, the gradients are only backpropagated through 7 frames when the RNN is trained to learn long-term temporal information. <br />
<br />
For optical flow, a pre-trained flowNet2.0 is used to compute the optical flow between frames. (A flowNet (Dosovitskiy 2015) is a deep neural network trained to predict optical flow. The simplest form of flowNet has an architecture consisting of two parts. The first part accepts as input the two images between which the optical flow is to be computed, and applies a sequence of convolution and max-pooling operations, as in a standard convolutional neural network. In the second part, repeated up-convolution operations are applied, increasing the dimensions of the feature maps. Besides the output of the previous up-convolution, each up-convolution is also fed the output of the corresponding down-convolution from the first part of the network. Thus part of the architecture resembles that of a U-net (Ronneberger, 2015). The output of the network is the predicted optical flow.) <br />
<br />
=== Online Finetuning ===<br />
The deep nets (without the RNN) are then fine-tuned during test time by online training the networks on the ground truth of the first frame and some augmentations of the first frame data. The learning rate is set to <math>10^{-5}</math> for online training for 200 iterations and the learning rate is gradually decayed over time. Data augmentation techniques similar to those in offline training, namely random resizing, rotating, cropping and flipping, are applied. Also, it should be noted that the RNN is ''not'' employed during online finetuning since only a single frame of training data is available.<br />
<br />
== MaskRNN: Experimental Results ==<br />
=== Evaluation Metrics ===<br />
There are 3 different techniques for performance analysis for Video Object Segmentation techniques:<br />
<br />
1. Region Similarity (Jaccard Index): Region similarity or Intersection-over-union is used to capture precision of the area covered by the prediction segmentation mask compared to the ground truth segmentation mask. It calculates the average across all frames of the dataset. This is particularly challenging for small sized foreground objects.<br />
<br />
<math>IoU = \frac{|M \cap G|}{|M| + |G| - |M \cap G|}</math><br />
<br />
where <math>M</math> denotes the predicted segmentation mask and <math>G</math> the ground truth segmentation mask.<br />
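As a small worked example of the region similarity formula, the following NumPy sketch computes the Jaccard index for a single frame. The function name <code>region_similarity</code> and the convention of returning 1.0 when both masks are empty are assumptions made for this illustration.<br />

```python
import numpy as np

def region_similarity(pred, gt):
    """Jaccard index |M ∩ G| / (|M| + |G| - |M ∩ G|) of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    intersection = np.logical_and(pred, gt).sum()
    union = pred.sum() + gt.sum() - intersection
    # Assumed convention: two empty masks agree perfectly.
    return 1.0 if union == 0 else intersection / union

pred = [[1, 1],
        [0, 0]]
gt = [[1, 0],
      [1, 0]]
iou = region_similarity(pred, gt)   # 1 shared pixel, 3 pixels in the union
```

The two masks share one pixel and cover three pixels together, so the score is 1/3. Averaging this quantity over all frames gives the dataset-level measure described above.<br />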
<br />
2. Contour Accuracy (F-score): This metric measures the accuracy of the boundary of the predicted segment mask relative to the ground truth segment mask, by calculating the precision and the recall of the two sets of points on the contours of the ground truth segment and the output segment via a bipartite graph matching. It is a measure of accurate delineation of the foreground objects. <br />
<br />
[[File:Fscore.jpg | 200px|center]]<br />
<br />
3. Temporal Stability : This estimates the degree of deformation needed to transform the segmentation masks from one frame to the next and is measured by the dissimilarity of the set of points on the contours of the segmentation between two adjacent frames.<br />
<br />
Region similarity measures the true segmented area in the prediction, while Contour Accuracy measures the accuracy of the contours/segmented mask boundary.<br />
<br />
=== Ablation Study ===<br />
<br />
The ablation study summarized how the different components contributed to the algorithm evaluated on DAVIS-2016 and DAVIS-2017 datasets.<br />
<br />
[[File:MaskRNNTable2.jpg | 700px|center]]<br />
<br />
The above table presents the contribution of each component of the network to the final prediction score. Online fine-tuning improves the performance by a large margin, as the network becomes adjusted to the appearance of the specific object being tracked. Addition of RNN/Localization Net and FStream all seem to positively affect the performance of the deep net. The FStream provides information on motion boundaries which help in videos with cluttered backgrounds, the RNN provides more consistent segmentation masks over time. The localization net has a more ambiguous effect on the network; adding the bounding box regression loss decreases the performance of the segmentation net but applying the bounding box to restrict the segmentation mask improves the results over those achieved by only using the segmentation net. In other words the localization net should only be used in conjunction with the segmentation net while the segmentation net can be used by itself.<br />
<br />
=== Quantitative Evaluation ===<br />
<br />
The authors use DAVIS-2016, DAVIS-2017 and Segtrack v2 to compare the performance of the proposed approach to other methods based on foreground-background video object segmentation and multiple instance-level video object segmentation.<br />
<br />
[[File:MaskRNNTable3.jpg | 700px]]<br />
<br />
The above table shows the results for contour accuracy mean and region similarity. The MaskRNN method seems to outperform all previously proposed methods. A significant performance gain is obtained by employing a recurrent neural network to learn the recurrence relationship and an object localization network to improve the prediction results.<br />
<br />
The following table shows the improvements in the state of the art achieved by MaskRNN on the DAVIS-2017 and the SegTrack v2 dataset.<br />
<br />
[[File:MaskRNNTable4.jpg | 700px]]<br />
<br />
=== Qualitative Evaluation ===<br />
The authors showed example qualitative results from the DAVIS and Segtrack datasets. <br />
<br />
Below are some success cases of object segmentation under complex motion, cluttered background, and/or multiple object occlusion.<br />
<br />
[[File:maskrnn_example.png | 700px]]<br />
<br />
Below are a few failure cases. The authors explain two reasons for failure: a) when similar objects of interest are contained in the frame (left two images), and b) when there are large variations in scale and viewpoint (right two images).<br />
<br />
[[File:maskrnn_example_fail.png | 700px]]<br />
<br />
== Conclusion ==<br />
In this paper a novel approach to instance level video object segmentation task is presented which performs better than current state of the art. The long-term recurrence relationship is learnt using an RNN. The object localization network is added to improve accuracy of the system. Due to the recurrent component and the combination of segmentation and localization nets, the approach takes advantage of the long-term temporal information and the location prior to improve the results. Using online fine-tuning the network is adjusted to predict better for the current video sequence.<br />
<br />
== Critique ==<br />
The paper provides a technique to track multiple objects in a video. The novelty is to add back-propagation through time to improve the tracking accuracy and to use a localization network to remove any outliers in the segmented binary mask. However, the network architecture is too large and isn't able to run in real time. There are <math>N</math> deep nets for <math>N</math> objects and each deep net contains 2 parallel VGG-16 convolutional networks.<br />
<br />
== Implementation ==<br />
<br />
The implementation of this paper was produced as part of the NIPS Paper Implementation Challenge. This implementation can be found at the following open source project: https://github.com/philferriere/tfvos.<br />
<br />
== References ==<br />
# Dosovitskiy, Alexey, et al. "Flownet: Learning optical flow with convolutional networks." Proceedings of the IEEE International Conference on Computer Vision. 2015.<br />
# Hu, Y., Huang, J., & Schwing, A. "MaskRNN: Instance level video object segmentation". Conference on Neural Information Processing Systems (NIPS). 2017<br />
# Ferriere, P. (n.d.). Semi-Supervised Video Object Segmentation (VOS) with Tensorflow. Retrieved March 20, 2018, from https://github.com/philferriere/tfvos<br />
# Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." International Conference on Medical image computing and computer-assisted intervention. Springer, Cham, 2015.<br />
# Lee, Yong Jae, Jaechul Kim, and Kristen Grauman. "Key-segments for video object segmentation." Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.<br />
# Grundmann, Matthias, et al. "Efficient hierarchical graph-based video segmentation." Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.<br />
# Li, Fuxin, et al. "Video segmentation by tracking many figure-ground segments." Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE, 2013.<br />
# Tsai, David, et al. "Motion coherent tracking using multi-label MRF optimization." International journal of computer vision 100.2 (2012): 190-202.</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Implicit_Causal_Models_for_Genome-wide_Association_Studies&diff=36452stat946w18/Implicit Causal Models for Genome-wide Association Studies2018-04-21T03:52:35Z<p>Ws2chen: /* Critique */</p>
<hr />
<div>==Introduction and Motivation==<br />
There is currently much progress in probabilistic models which could lead to the development of rich generative models. The models have been applied with neural networks, implicit densities, and with scalable algorithms to very large data for their Bayesian inference. However, most of the models are focused on capturing statistical relationships rather than causal relationships. Causal relationships are relationships where one event is a result of another event, i.e. a cause and effect. Causal models give us a sense of how manipulating the generative process could change the final results. <br />
<br />
Genome-wide association studies (GWAS) are an example of a problem built on causal relationships. The genome is the complete set of DNA in an organism and contains information about the organism's attributes. Specifically, GWAS aims to figure out how genetic factors cause disease among humans. Here the genetic factors we are referring to are single nucleotide polymorphisms (SNPs), and getting a particular disease is treated as a trait, i.e., the outcome. To understand why a disease develops and how to cure it, the causation between SNPs and diseases is investigated: first, predict which one or more SNPs cause the disease; second, target the selected SNPs to cure the disease. <br />
<br />
The figure below depicts an example Manhattan plot for a GWAS. Each dot represents an SNP. The x-axis is the chromosome location, and the y-axis is the negative log of the association p-value between the SNP and the disease, so points with the largest values represent strongly associated risk loci.<br />
<br />
[[File:gwas-example.jpg|500px|center]]<br />
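The y-axis transformation of a Manhattan plot is simple to reproduce. The sketch below uses made-up p-values purely for illustration and assumes the logarithm is base 10, which is the usual convention for such plots.<br />

```python
import numpy as np

# Hypothetical association p-values for three SNPs.
p_values = np.array([0.5, 1e-3, 5e-8])

# Negative log10 p-value: smaller p-values map to taller points.
scores = -np.log10(p_values)
```

A p-value of <math>10^{-3}</math> maps to a height of 3, so each additional unit on the y-axis corresponds to a tenfold smaller p-value.<br />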
<br />
This paper focuses on two challenges in combining modern probabilistic models and causality. The first one is how to build rich causal models tailored to the specific needs of GWAS. In general, probabilistic causal models involve a function <math>f</math> and a noise <math>n</math>. For simplicity, <math>f</math> is usually assumed to be a linear model with Gaussian noise. However, problems like GWAS require models with nonlinear, learnable interactions among the inputs and the noise.<br />
<br />
The second challenge is how to address latent population-based confounders. Latent confounders are a problem when applying causal models because we can neither observe them nor know their underlying structure. For example, in GWAS, both latent population structure, i.e., subgroups in the population with ancestry differences, and relatedness among sample individuals produce spurious correlations between SNPs and the trait of interest. The existing methods cannot easily accommodate the complex latent structure.<br />
<br />
For the first challenge, the authors develop implicit causal models, a class of causal models that leverages neural architectures with an implicit density. With GWAS, implicit causal models generalize previous methods to capture important nonlinearities, such as gene-gene and gene-population interaction. Building on this, for the second challenge, they describe an implicit causal model that adjusts for population-confounders by sharing strength across examples (genes).<br />
<br />
There has been an increasing number of works on causal models which focus on causal discovery and typically have strong assumptions such as Gaussian processes on noise variable or nonlinearities for the main function.<br />
<br />
==Implicit Causal Models==<br />
Implicit causal models are an extension of probabilistic causal models. Probabilistic causal models will be introduced first.<br />
<br />
=== Probabilistic Causal Models ===<br />
Probabilistic causal models have two parts: deterministic functions of noise and other variables. Consider background noise <math>\epsilon</math>, representing unknown background quantities which are jointly independent, and a global variable <math>\beta</math>, some function of this noise, where<br />
<br />
[[File: eq1.1.png|800px|center]]<br />
<br />
Each <math>\beta</math> and <math>x</math> is a function of noise; <math>y</math> is a function of noise and <math>x</math>:<br />
<br />
[[File: eqt1.png|800px|center]]<br />
<br />
The target is the causal mechanism <math>f_y</math>, from which the causal effect <math>p(y|do(X=x),\beta)</math> can be calculated. Here <math>do(X=x)</math> means that we set <math>X</math> to a specific value under the fixed structure <math>\beta</math>. Following prior work, it is assumed that <math>p(y|do(x),\beta) = p(y|x, \beta)</math>.<br />
<br />
[[File: f_1.png|650px|center|]]<br />
<br />
<br />
An example of a probabilistic causal model is the additive noise model.<br />
<br />
[[File: eq2.1.png|800px|center]]<br />
<br />
<math>f(\cdot)</math> is usually a linear function, or a spline function to capture nonlinearities. <math>\epsilon</math> is assumed to be standard normal, as well as <math>y</math>. Thus the posterior <math>p(\theta | x, y, \beta)</math> can be represented as <br />
<br />
[[File: eqt2.png|800px|center]]<br />
<br />
where <math>p(\theta)</math> is the prior which is known. Then, variational inference or MCMC can be applied to calculate the posterior distribution.<br />
<br />
===Implicit Causal Models===<br />
The difference between implicit causal models and probabilistic causal models lies in the noise variable. Instead of using an additive noise term, an implicit causal model takes the noise <math>\epsilon</math> directly as input and outputs <math>x</math> given parameters <math>\theta</math>.<br />
<br />
<math><br />
x=g(\epsilon | \theta), \quad \epsilon \sim s(\cdot)<br />
</math><br />
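<br />
The following is a minimal sketch of such a mechanism, not the authors' implementation: noise drawn from <math>s(\cdot)</math> is pushed through a deterministic function <math>g</math>. The one-hidden-layer tanh network, the choice of a standard normal for <math>s(\cdot)</math>, and all names and sizes are assumptions made for illustration.<br />

```python
import math
import random

def g(eps, theta):
    """Hypothetical mechanism: a one-hidden-layer network mapping noise eps to x."""
    w1, b1, w2, b2 = theta
    hidden = [math.tanh(w * eps + b) for w, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

rng = random.Random(0)
hidden_units = 8  # arbitrary size chosen for the sketch
theta = (
    [rng.gauss(0, 1) for _ in range(hidden_units)],  # input-to-hidden weights
    [rng.gauss(0, 1) for _ in range(hidden_units)],  # hidden biases
    [rng.gauss(0, 1) for _ in range(hidden_units)],  # hidden-to-output weights
    rng.gauss(0, 1),                                 # output bias
)

# epsilon ~ s(.), assumed standard normal here; x = g(epsilon | theta)
samples = [g(rng.gauss(0, 1), theta) for _ in range(1000)]
```

The density of <math>x</math> is never written down: it is defined only implicitly through sampling, which is why likelihood-free inference is needed later.<br />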
<br />
The causal diagram has changed to:<br />
<br />
[[File: f_2.png|650px|center|]]<br />
<br />
<br />
The authors used fully connected neural networks with a sufficiently large number of hidden units to approximate each causal mechanism. Below is the formal description: <br />
<br />
[[File: theorem.png|650px|center|]]<br />
<br />
==Implicit Causal Models with Latent Confounders==<br />
So far, the global structure was assumed to be observed. This section considers the case where it is unobserved.<br />
<br />
===Causal Inference with a Latent Confounder===<br />
As before, the quantity of interest is the causal effect <math>p(y|do(x_m), x_{-m})</math>. Here, the SNPs other than <math>x_m</math> are also taken into account. However, the effect is confounded by the unobserved confounder <math>z_n</math>. As a result, the standard inference method cannot be used in this case.<br />
<br />
The paper proposes a new method which includes the latent confounders. For each subject <math>n=1,…,N</math> and each SNP <math>m=1,…,M</math>,<br />
<br />
[[File: eqt4.png|800px|center]]<br />
<br />
<br />
The mechanism for latent confounder <math>z_n</math> is assumed to be known. SNPs depend on the confounders and the trait depends on all the SNPs and the confounders as well. <br />
<br />
The posterior of <math>\theta</math> needs to be calculated in order to estimate the mechanism <math>g_y</math> as well as the causal effect <math>p(y|do(x_m), x_{-m})</math>, which explains how changes to each SNP <math>X_m</math> cause changes to the trait <math>Y</math>.<br />
<br />
[[File: eqt5.png|800px|center]]<br />
<br />
Note that the latent structure <math>p(z|x, y)</math> is assumed known.<br />
<br />
In general, causal inference with latent confounders can be dangerous: it uses the data twice, and thus it may bias the estimates of each arrow <math>X_m \to Y</math>. Why is this justified? This is answered below:<br />
<br />
'''Proposition 1'''. Assume the causal graph of Figure 2 (left) is correct and that the true distribution resides in some configuration of the parameters of the causal model (Figure 2 (right)). Then the posterior <math>p(\theta | x, y)</math> provides a consistent estimator of the causal mechanism <math>f_y</math>.<br />
<br />
Proposition 1 rigorizes previous methods in the framework of probabilistic causal models. The intuition is that as more SNPs arrive (“M → ∞, N fixed”), the posterior concentrates at the true confounders <math>z_n</math>, and thus we can estimate the causal mechanism given each data point’s confounder <math>z_n</math>. As more data points arrive (“N → ∞, M fixed”), we can estimate the causal mechanism given any confounder <math>z_n</math> as there is an infinity of them.<br />
<br />
===Implicit Causal Model with a Latent Confounder===<br />
This section describes the algorithm and functions used to implement an implicit causal model for GWAS.<br />
<br />
====Generative Process of Confounders <math>z_n</math>.====<br />
The distribution of the confounders is set to a standard normal, with <math>z_n \in R^K</math>, where <math>K</math> is the dimension of <math>z_n</math>. <math>K</math> should be chosen so that the latent space is as close as possible to the true population structure. <br />
<br />
====Generative Process of SNPs <math>x_{nm}</math>.====<br />
Each SNP is coded as follows:<br />
<br />
[[File: SNP.png|300px|center]]<br />
<br />
The authors defined a <math>Binomial(2,\pi_{nm})</math> distribution on <math>x_{nm}</math> and used logistic factor analysis to model the SNP matrix.<br />
<br />
[[File: gpx.png|800px|center]]<br />
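<br />
A minimal simulation sketch of this generative process is given below. Since the equation above is only available as an image, the form <math>\text{logit}(\pi_{nm}) = z_n^\top w_m</math> is an assumption based on standard logistic factor analysis; the sizes <code>N</code>, <code>M</code>, <code>K</code> and the standard normal draws for <code>w</code> are illustrative only.<br />

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def sample_snps(z, w, rng):
    """Draw x[n][m] ~ Binomial(2, pi[n][m]) with logit(pi[n][m]) = z[n] . w[m]."""
    x = []
    for z_n in z:
        row = []
        for w_m in w:
            pi_nm = sigmoid(sum(a * b for a, b in zip(z_n, w_m)))
            # Binomial(2, pi) realised as the sum of two Bernoulli(pi) draws
            row.append(sum(rng.random() < pi_nm for _ in range(2)))
        x.append(row)
    return x

rng = random.Random(0)
N, M, K = 5, 4, 2  # individuals, SNPs, confounder dimension (toy sizes)
z = [[rng.gauss(0, 1) for _ in range(K)] for _ in range(N)]  # confounders z_n
w = [[rng.gauss(0, 1) for _ in range(K)] for _ in range(M)]  # per-SNP variables w_m
snp_matrix = sample_snps(z, w, rng)
```

Every entry of <code>snp_matrix</code> is 0, 1, or 2, matching the SNP coding described above.<br />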
<br />
A SNP matrix looks like this:<br />
[[File: SNP_matrix.png|200px|center]]<br />
<br />
<br />
Since logistic factor analysis makes strong assumptions, this paper suggests using a neural network to relax these assumptions,<br />
<br />
[[File: gpxnn.png|800px|center]]<br />
<br />
This makes the output a full <math>N \times M</math> matrix due to the variables <math>w_m</math>, which act like principal components in PCA. Here, <math>\phi</math> has a standard normal prior distribution. The weights <math>w</math> and biases <math>\phi</math> are shared over the <math>m</math> SNPs and <math>n</math> individuals, which makes it possible to learn nonlinear interactions between <math>z_n</math> and <math>w_m</math>.<br />
<br />
====Generative Process of Traits <math>y_n</math>.====<br />
Previously, each trait was modeled by a linear regression,<br />
<br />
[[File: gpy.png|800px|center]]<br />
<br />
This also has very strong assumptions on SNPs, interactions, and additive noise. It can also be replaced by a neural network which only outputs a scalar,<br />
<br />
[[File: gpynn.png|800px|center]]<br />
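<br />
A rough sketch of such a trait network follows. It is an illustration under assumptions, not the authors' code: the input is taken to be the concatenation of an individual's SNPs, the confounder <math>z_n</math>, and a scalar noise draw, and the layer sizes and initialisation are arbitrary choices. The helpers <code>dense</code>, <code>init_layer</code>, and <code>trait_network</code> are hypothetical names.<br />

```python
import random

def dense(inputs, weights, biases, relu=True):
    """One fully connected layer, optionally followed by ReLU."""
    out = [sum(w * v for w, v in zip(row, inputs)) + b
           for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out

def init_layer(n_in, n_out, rng):
    scale = n_in ** -0.5  # simple scaled Gaussian initialisation (an assumption)
    weights = [[rng.gauss(0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def trait_network(x_n, z_n, eps, layers):
    """Map (SNPs, confounder, noise) to a scalar trait value."""
    h = list(x_n) + list(z_n) + [eps]
    for i, (weights, biases) in enumerate(layers):
        # ReLU on hidden layers, linear output layer
        h = dense(h, weights, biases, relu=(i < len(layers) - 1))
    return h[0]

rng = random.Random(0)
M, K = 6, 2                    # toy numbers of SNPs and confounder dimensions
sizes = [M + K + 1, 8, 8, 1]   # two hidden layers, scalar output
layers = [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]
x_n = [rng.choice([0, 1, 2]) for _ in range(M)]
z_n = [rng.gauss(0, 1) for _ in range(K)]
y_n = trait_network(x_n, z_n, rng.gauss(0, 1), layers)
```

Because the noise enters as an input rather than as an additive term, the network can represent interactions between SNPs, confounders, and noise.<br />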
<br />
<br />
==Likelihood-free Variational Inference==<br />
Calculating the posterior of <math>\theta</math> is the key to applying the implicit causal model with latent confounders.<br />
<br />
[[File: eqt5.png|800px|center]]<br />
<br />
can be reduced to <br />
<br />
[[File: lfvi1.png|800px|center]]<br />
<br />
However, with implicit models, integrating over a nonlinear function is intractable. The authors therefore applied likelihood-free variational inference (LFVI). LFVI posits a family of distributions over the latent variables. Here the variables <math>w_m</math> and <math>z_n</math> are all assumed to be Normal,<br />
<br />
[[File: lfvi2.png|700px|center]]<br />
<br />
For LFVI applied to GWAS, an algorithm similar to the EM algorithm is used:<br />
[[File: em.png|800px|center]]<br />
<br />
==Empirical Study==<br />
The authors performed simulation on 100,000 SNPs, 940 to 5,000 individuals, and across 100 replications of 11 settings. <br />
Four methods were compared: <br />
<br />
* implicit causal model (ICM);<br />
* PCA with linear regression (PCA); <br />
* a linear mixed model (LMM); <br />
* logistic factor analysis with inverse regression (GCAT).<br />
<br />
The feedforward neural networks for traits and SNPs are fully connected with two hidden layers using ReLU activation function, and batch normalization. <br />
<br />
===Simulation Study===<br />
Based on real genomic data, a true model is applied to generate the SNPs and traits for each configuration. <br />
There are four datasets used in this simulation study: <br />
<br />
# HapMap [Balding-Nichols model]<br />
# 1000 Genomes Project (TGP) [PCA]<br />
#* Human Genome Diversity project (HGDP) [PCA]<br />
#* HGDP [Pritchard-Stephens-Donnelly model] <br />
# A latent spatial position of individuals for population structure [spatial]<br />
<br />
<br />
The table shows the prediction accuracy, measured as precision: the number of true positives divided by the number of true positives plus false positives. A true positive is a SNP that is correctly identified as having a causal relation with the trait; a false positive is a SNP that is reported as causal when it is not. The closer the rate is to 1, the better the model, since false positives are wrong predictions.<br />
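<br />
As a small worked example of the rate described above, the sketch below computes true positives divided by true positives plus false positives. The SNP identifiers are hypothetical and do not come from the paper's data.<br />

```python
def precision(predicted, causal):
    """True positives / (true positives + false positives)."""
    predicted, causal = set(predicted), set(causal)
    true_pos = len(predicted & causal)    # flagged and truly causal
    false_pos = len(predicted - causal)   # flagged but not causal
    if true_pos + false_pos == 0:
        return 0.0  # nothing was flagged; convention chosen for this sketch
    return true_pos / (true_pos + false_pos)

# Hypothetical SNP identifiers, for illustration only.
flagged = ["rs1", "rs2", "rs3", "rs4"]
truly_causal = ["rs1", "rs2", "rs3", "rs9"]
rate = precision(flagged, truly_causal)
```

Here three of the four flagged SNPs are truly causal, so the rate is 3/4.<br />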
<br />
[[File: table_1.png|650px|center|]]<br />
<br />
The results above show that the implicit causal model has the best performance among the four models in every situation. In particular, the other models tend to do poorly on PSD and Spatial when <math>a</math> is small, whereas the ICM achieves a significantly higher rate. The only method comparable to ICM is GCAT, and only on the simpler configurations.<br />
<br />
<br />
===Real-data Analysis===<br />
The authors also applied ICM to a GWAS of the Northern Finland Birth Cohorts, which measure 10 metabolic traits and contain 324,160 SNPs for 5,027 individuals. The data came from the database of Genotypes and Phenotypes (dbGaP) and were preprocessed in the same way as in Song et al. Ten implicit causal models were fitted, one for each trait. For each of them, the dimension of the confounders was set to six, the same as in Song et al. The SNP network used 512 hidden units in both layers, and the trait network used 32 and 256. The results are compared with those of comparable models in Table 2.<br />
<br />
[[File: table_2.png|650px|center|]]<br />
<br />
The numbers in the above table are the numbers of significant loci for each of the 10 traits. The numbers for the other methods, namely GCAT, LMM, PCA, and "uncorrected" (association tests without accounting for hidden relatedness of study samples), are taken from other papers. By comparison, the ICM reached the level of the best previous model for each trait.<br />
<br />
==Conclusion==<br />
This paper introduced implicit causal models in order to account for nonlinear, complex causal relationships, and applied the method to GWAS. The model can capture important interactions between genes within an individual and at the population level, and it can also adjust for latent confounders by including latent variables in the model.<br />
<br />
In the simulation study, the authors showed that the implicit causal model could outperform other methods by 15-45.3% on a variety of datasets and parameter settings.<br />
<br />
The authors also believed this GWAS application is only the start of the usage of implicit causal models. The authors suggest that it might also be successfully used in the design of dynamic theories in high-energy physics or for modeling discrete choices in economics.<br />
<br />
==Critique==<br />
This paper is an interesting and novel work. Its main contribution is to connect statistical genetics with machine learning methodology. The method is technically sound and does indeed generalize techniques currently used in statistical genetics. While the authors focus on GWAS applications in the paper, they also believe implicit causal models have significant potential in other sciences: for example, to design new dynamical theories in high energy physics, and to accurately model structural equations of discrete choices in economics.<br />
<br />
The neural network used in this paper is a very simple feed-forward 2 hidden-layer neural network, but the idea of where to use the neural network is crucial and might be significant in GWAS. <br />
<br />
It has limitations as well. The empirical example in this paper is too easy, and far away from the realistic situation. Despite the simulation study showing some competing results, the Northern Finland Birth Cohort Data application did not demonstrate the advantage of using implicit causal model over the previous methods, such as GCAT or LMM.<br />
<br />
Another limitation, which the authors also acknowledge, concerns linkage disequilibrium. SNPs are not completely independent of each other; alleles at nearby loci are usually correlated. The authors did not consider this more complex case and instead treated only the simplest case, in which all SNPs are assumed independent.<br />
<br />
Furthermore, a single SNP may not have enough power to explain the causal relationship. Recent papers indicate that causation of a trait may involve multiple SNPs.<br />
This could be addressed in future work as well.<br />
<br />
==References==<br />
Tran D, Blei D M. Implicit Causal Models for Genome-wide Association Studies[J]. arXiv preprint arXiv:1710.10742, 2017.<br />
<br />
Patrik O Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. In Neural Information Processing Systems, 2009.<br />
<br />
Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick, and David Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904–909, 2006.<br />
<br />
Minsun Song, Wei Hao, and John D Storey. Testing for genetic associations in arbitrarily structured populations. Nature Genetics, 47(5):550–554, 2015.<br />
<br />
Dustin Tran, Rajesh Ranganath, and David M Blei. Hierarchical implicit models and likelihood-free variational inference. In Neural Information Processing Systems, 2017.<br />
<br />
== Implicit causal model in Edward ==<br />
The author provides an example of an implicit causal model written in the Edward language.<br />
[[File: coddde.png|600px]]</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/IMPROVING_GANS_USING_OPTIMAL_TRANSPORT&diff=36446stat946w18/IMPROVING GANS USING OPTIMAL TRANSPORT2018-04-21T03:26:51Z<p>Ws2chen: /* Critique */</p>
<hr />
<div>== Introduction ==<br />
Recently, the problem of learning models that generate media such as images, video, audio and text has become very popular; it is known as Generative Modeling. One of the main benefits of this approach is that generative models can be trained on unlabeled data, which is readily available. Therefore, generative networks have huge potential in the field of deep learning.<br />
<br />
Generative Adversarial Networks (GANs) are powerful generative models for unsupervised learning in which two agents compete in a zero-sum game. A GAN model consists of a generator and a discriminator, or critic. The generator is a neural network trained to generate data whose distribution matches the distribution of the real data. The critic is also a neural network, trained to separate the generated data from the real data. A loss function that measures the distance between the distribution of the generated data and that of the real data is important for training the generator.<br />
<br />
Optimal transport theory, another approach to measuring distances between distributions, evaluates the distance between the distribution of the generated data and that of the training data based on a metric, which provides another method for training the generator. The main advantage of optimal transport theory over the distance measurement in GANs is that it has a closed-form solution, which makes the training process tractable. However, it can also lead to inconsistent statistical estimation, because the gradients are biased when mini-batches are used (Bellemare et al., 2017).<br />
<br />
This paper presents a GAN variant named OT-GAN, which incorporates a discriminative metric called 'Mini-batch Energy Distance' into its critic in order to overcome the issue of biased gradients.<br />
<br />
== GANs and Optimal Transport ==<br />
<br />
===Generative Adversarial Nets===<br />
The original GAN is reviewed first. The objective function of the GAN is: <br />
<br />
[[File:equation1.png|700px]]<br />
<br />
The goal of GANs is to train the generator g and the discriminator d to find a pair (g,d) that achieves a Nash equilibrium (such that neither of them can reduce its cost without changing the other's parameters). However, training may fail to converge, since the generator and the discriminator are trained with gradient descent techniques.<br />
<br />
===Wasserstein Distance (Earth-Mover Distance)===<br />
<br />
In order to solve the problem of convergence failure, Arjovsky et al. (2017) suggested the Wasserstein distance (Earth-Mover distance), based on optimal transport theory.<br />
<br />
[[File:equation2.png|600px]]<br />
<br />
where <math> \Pi (p,g) </math> is the set of all joint distributions <math> \gamma (x,y) </math> with marginals <math> p(x) </math> (real data), <math> g(y) </math> (generated data). <math> c(x,y) </math> is a cost function; Arjovsky et al. used the Euclidean distance in their paper. <br />
<br />
The Wasserstein distance can be considered as moving the minimum amount of points between distribution <math> g(y) </math> and <math> p(x) </math> such that the generator distribution <math> g(y) </math> is similar to the real data distribution <math> p(x) </math>.<br />
<br />
Computing the Wasserstein distance is intractable. The proposed Wasserstein GAN (W-GAN) provides an estimated solution by switching the optimal transport problem into Kantorovich-Rubinstein dual formulation using a set of 1-Lipschitz functions. A neural network can then be used to obtain an estimation.<br />
<br />
[[File:equation3.png|600px]]<br />
<br />
W-GAN helps to solve the unstable training process of original GAN and it can solve the optimal transport problem approximately, but it is still intractable.<br />
<br />
===Sinkhorn Distance===<br />
Genevay et al. (2017) proposed using the primal formulation of optimal transport instead of the dual formulation for generative modeling. They introduced the Sinkhorn distance, which is a smoothed generalization of the Wasserstein distance.<br />
[[File: equation4.png|600px]]<br />
<br />
It imposes an entropy restriction (<math> \beta </math>) on the set of joint distributions <math> \Pi_{\beta} (p,g) </math>. This distance can be approximated on mini-batches of data <math> X, Y </math>, each consisting of <math> K </math> vectors <math> x, y </math>. The <math> (i, j) </math>-th entry of the cost matrix <math> C </math> is the cost of transporting <math> x_i </math> in mini-batch <math> X </math> to <math> y_j </math> in mini-batch <math> Y </math>. The resulting distance is:<br />
<br />
[[File: equation5.png|550px]]<br />
<br />
where <math> M </math> is a <math> K \times K </math> matrix with positive entries that plays the role of the joint distribution <math> \gamma (x,y) </math>. Each row and each column of <math> M </math> sums to 1. <br />
<br />
This mini-batch Sinkhorn distance is not only fully tractable but also capable of solving the instability problem of GANs. However, the expectation of <math> \mathcal{W}_{c} </math> is not a valid metric over probability distributions, and the gradients are biased when the mini-batch size is fixed.<br />
<br />
===Energy Distance (Cramer Distance)===<br />
In order to solve the above problem, Bellemare et al. proposed Energy distance:<br />
<br />
[[File: equation6.png|700px]]<br />
<br />
where <math> x, x' </math> and <math> y, y'</math> are independent samples from the data distribution <math> p </math> and the generator distribution <math> g </math>, respectively. Cramer GAN trains the generator by minimizing this energy distance.<br />
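<br />
A sample-based sketch of the energy distance is given below. Because the equation above is only available as an image, the squared form <math>2\,E\|x-y\| - E\|x-x'\| - E\|y-y'\|</math> is assumed here; the one-dimensional Gaussian samples and the function names are illustrative only.<br />

```python
import math
import random

def mean_distance(a, b, skip_diagonal=False):
    """Average Euclidean distance over pairs (a[i], b[j])."""
    total, count = 0.0, 0
    for i, p in enumerate(a):
        for j, q in enumerate(b):
            if skip_diagonal and i == j:
                continue  # drop the pair of a point with itself
            total += math.dist(p, q)
            count += 1
    return total / count

def energy_distance_sq(xs, ys):
    """Estimate 2 E|x - y| - E|x - x'| - E|y - y'| from two samples."""
    return (2.0 * mean_distance(xs, ys)
            - mean_distance(xs, xs, skip_diagonal=True)
            - mean_distance(ys, ys, skip_diagonal=True))

rng = random.Random(0)
data = [[rng.gauss(0, 1)] for _ in range(200)]   # stands in for p
close = [[rng.gauss(0, 1)] for _ in range(200)]  # generator matching p
far = [[rng.gauss(3, 1)] for _ in range(200)]    # generator far from p
d_close = energy_distance_sq(data, close)
d_far = energy_distance_sq(data, far)
```

The estimate is near zero when the two samples come from the same distribution and grows as the generator distribution moves away from the data.<br />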
<br />
==Mini-Batch Energy Distance==<br />
Salimans et al. (2016) observed that, compared with using distributions over individual images, a GAN is more powerful when it works with distributions over mini-batches <math> g(X), p(X) </math>. The distance measure is therefore defined over mini-batches.<br />
<br />
===Generalized Energy Distance===<br />
The generalized energy distance allows the use of non-Euclidean distance functions <math>d</math>. It is also valid for mini-batches, which is considered better than working with individual data points.<br />
<br />
[[File: equation7.png|670px]]<br />
<br />
As in the definition of the energy distance, <math> X, X' </math> and <math> Y, Y'</math> can be independent samples from the data distribution <math> p </math> and the generator distribution <math> g </math>, respectively. In the generalized energy distance, <math> X, X' </math> and <math> Y, Y'</math> can also be mini-batches. <math> D_{GED}(p,g) </math> is a metric whenever <math> d </math> is a metric. Thus, taking the triangle inequality of <math> d </math> into account, <math> D(p,g) \geq 0,</math> and <math> D(p,g)=0 </math> when <math> p=g </math>.<br />
<br />
===Mini-Batch Energy Distance===<br />
Since <math> d </math> can be chosen freely, the authors proposed the Mini-batch Energy Distance, which uses the entropy-regularized Wasserstein distance as <math> d </math>. <br />
<br />
[[File: equation8.png|650px]]<br />
<br />
where <math> X, X' </math> and <math> Y, Y'</math> are independently sampled mini-batches from the data distribution <math> p </math> and the generator distribution <math> g </math>, respectively. This distance metric combines the energy distance with the primal form of optimal transport over mini-batch distributions <math> g(Y) </math> and <math> p(X) </math>. Inside the generalized energy distance, the Sinkhorn distance is a valid metric between mini-batches. By adding the <math> - \mathcal{W}_c (Y,Y')</math> and <math> \mathcal{W}_c (X,Y)</math> to equation (5) and using energy distance, the objective becomes statistically consistent (meaning the objective converges to the true parameter value for large sample sizes) and mini-batch gradients are unbiased.<br />
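<br />
The structure of this estimate can be sketched as follows, assuming the squared distance has the form <math>2\,E[d(X,Y)] - E[d(X,X')] - E[d(Y,Y')]</math> and that two mini-batches are drawn from each distribution. To keep the sketch short, <code>stand_in_distance</code> is a hypothetical placeholder for <math>d</math> (distance between batch means); OT-GAN would use the Sinkhorn distance <math>\mathcal{W}_c</math> in its place.<br />

```python
import math

def batch_mean(batch):
    dim = len(batch[0])
    return [sum(row[k] for row in batch) / len(batch) for k in range(dim)]

def stand_in_distance(a, b):
    """Placeholder d(A, B): Euclidean distance between batch means (NOT Sinkhorn)."""
    return math.dist(batch_mean(a), batch_mean(b))

def minibatch_energy_distance_sq(x, x2, y, y2, d):
    """Estimate 2 E[d(X, Y)] - E[d(X, X')] - E[d(Y, Y')] from four mini-batches."""
    # Average the four available cross terms to estimate E[d(X, Y)].
    cross = (d(x, y) + d(x, y2) + d(x2, y) + d(x2, y2)) / 4.0
    return 2.0 * cross - d(x, x2) - d(y, y2)

# Toy mini-batches: X, X2 play the role of data; Y, Y2 of generated samples.
X = [[0.0, 0.0], [0.2, 0.0]]
X2 = [[0.1, 0.1], [0.1, -0.1]]
Y = [[5.0, 0.0], [5.2, 0.0]]
Y2 = [[5.1, 0.1], [5.1, -0.1]]
far = minibatch_energy_distance_sq(X, X2, Y, Y2, stand_in_distance)
same = minibatch_energy_distance_sq(X, X, X, X, stand_in_distance)
```

The two subtracted terms are what distinguish this objective from the plain Sinkhorn distance between <code>X</code> and <code>Y</code>.<br />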
<br />
==Optimal Transport GAN (OT-GAN)==<br />
<br />
The proposed mini-batch energy distance depends on the transport cost function <math>c(x,y)</math>. One possibility would be to choose <math>c</math> to be some fixed function over vectors, like the Euclidean distance, but the authors found this to perform poorly in preliminary experiments. For simple fixed cost functions like the Euclidean distance, there exist many bad distributions <math>g</math> in higher dimensions for which the mini-batch energy distance is close to zero, so that it is difficult to tell <math>p</math> and <math>g</math> apart if the sample size is not big enough. To solve this, the authors propose learning the cost function adversarially, so that it can adapt to the generator distribution <math>g</math> and thereby become more discriminative. <br />
<br />
In practice, in order to secure the statistical efficiency (i.e. being able to tell <math>p</math> and <math>g</math> apart without requiring an enormous sample size when their distance is close to zero), authors suggested using cosine distance between vectors <math> v_\eta (x) </math> and <math> v_\eta (y) </math> based on the deep neural network that maps the mini-batch data to a learned latent space. Here is the transportation cost:<br />
<br />
[[File: euqation9.png|370px]]<br />
<br />
where the <math> v_\eta </math> is chosen to maximize the resulting minibatch energy distance.<br />
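<br />
A small sketch of this transport cost is shown below, assuming the usual cosine distance <math>1 - \frac{v_\eta(x) \cdot v_\eta(y)}{\|v_\eta(x)\| \, \|v_\eta(y)\|}</math>. The <code>identity_embed</code> function is a hypothetical stand-in for the learned critic <math>v_\eta</math>, which in the paper is a deep neural network.<br />

```python
import math

def cosine_cost(u, v):
    """Cosine distance between two embedding vectors: 1 - cos(angle)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

def cost_matrix(xs, ys, embed):
    """Pairwise transport costs between two mini-batches under embedding `embed`."""
    ex = [embed(x) for x in xs]
    ey = [embed(y) for y in ys]
    return [[cosine_cost(a, b) for b in ey] for a in ex]

def identity_embed(x):
    """Hypothetical stand-in for the learned critic v_eta."""
    return x

C = cost_matrix([[1.0, 0.0], [0.0, 2.0]],
                [[3.0, 0.0], [-1.0, 0.0]],
                identity_embed)
```

The cost depends only on the angle between embeddings, not on their lengths, so it ranges from 0 (same direction) to 2 (opposite directions).<br />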
<br />
Unlike common practice with the original GANs, the generator is trained more often than the critic, which keeps the cost function from degenerating. The resulting generator in OT-GAN has a well-defined and statistically consistent objective throughout the training process.<br />
<br />
The algorithm is defined below. Gradients are not backpropagated through the computation of the transport matrix <math>M</math>; ignoring this gradient flow is justified by the envelope theorem (i.e. when changing the parameters of the objective function, changes in the optimizer do not contribute to a change in the objective function). Stochastic gradient descent is used as the optimization method in Algorithm 1 below, although other optimizers are also possible. In fact, Adam was used in the experiments. <br />
<br />
[[File: al.png|600px]]<br />
<br />
<br />
[[File: al_figure.png|600px]]<br />
<br />
==Experiments==<br />
<br />
In order to demonstrate the superior performance of OT-GAN, the authors compared it with the original GAN and other popular models in four experiments: dataset recovery, a CIFAR-10 test, an ImageNet test, and a conditional image synthesis test.<br />
<br />
===Mixture of Gaussian Dataset===<br />
Unlike the original GAN (DC-GAN), OT-GAN has a statistically consistent objective, so the generator does not update in a wrong direction even if the signal the cost function provides to the generator is poor. To demonstrate this advantage, the authors compared OT-GAN with the original GAN loss (DAN-S) on a simple task: recovering all 8 modes of a mixture of 8 Gaussians whose means are arranged in a circle. MLPs with ReLU activation functions were used. The critic was updated for only 15K iterations, after which the generator distribution was tracked for another 25K iterations. The results showed that the original GAN experiences mode collapse once the discriminator is fixed, while OT-GAN recovers all 8 modes of the Gaussian mixture.<br />
<br />
[[File: 5_1.png|600px]]<br />
<br />
===CIFAR-10===<br />
<br />
The CIFAR-10 dataset was then used to inspect the effect of batch size on the training process and on image quality. OT-GAN and four other methods were compared using the "inception score" as the criterion. Figure 3 shows how the inception score (y-axis) changes as the number of iterations increases, for four different batch sizes (200, 800, 3200 and 8000). The results show that a larger batch size, which is more likely to cover more modes of the data distribution, leads to a more stable model with a higher inception score. However, a large batch size also requires a high-performance computational environment. The sample quality of all 5 methods, run with a batch size of 8000, is compared in Table 1, where OT-GAN has the best score.<br />
<br />
OT-GAN was trained using the Adam optimizer. The learning rate was set to 0.0003, with <math> \beta_1 = 0.5 </math> and <math> \beta_2 = 0.999 </math>. The OT-GAN algorithm also includes two additional hyperparameters for the Sinkhorn algorithm. The first is the number of iterations to run the algorithm and the second, <math> 1 / \lambda </math>, is the entropy penalty of alignments. The authors found that a value of 500 worked well for both hyperparameters. The network uses the following architecture:<br />
<br />
[[File: cf10gc.png|600px]]<br />
<br />
[[File: 5_2.png|600px]]<br />
<br />
Figure 4 below shows samples generated by the OT-GAN trained with a batch size of 8000. Figure 5 below shows random samples from a model trained with the same architecture and hyperparameters, but with random matching of samples in place of optimal transport.<br />
<br />
[[File: ot_gan_cifar_10_samples.png|600px]]<br />
<br />
<br />
In order to show the advantage of learning the cost function adversarially, the CIFAR-10 experiment was re-run with the cost as follows:<br />
<br />
[[File: OTGAN_CosineDist.png|250px]]<br />
<br />
When using this fixed cost and keeping the other experiment settings constant, the maximum inception score dropped from 8.47 with the learned cost function to 4.93 with the fixed one. The results for the fixed cost are shown in Figure 8 below.<br />
<br />
[[File: OTGAN_fixedDist.png|600px]]<br />
<br />
===ImageNet Dogs===<br />
<br />
In order to investigate the performance of OT-GAN on high-quality images, the dog subset of ImageNet (<math>128 \times 128</math>) was used to train the model. Figure 6 shows that OT-GAN produces fewer nonsensical images and has a higher inception score than DC-GAN. <br />
<br />
[[File: 5_3.png|600px]]<br />
<br />
<br />
To analyze mode collapse in GANs, the authors trained both types of GANs for a large number of epochs. They found that DCGAN shows mode collapse after as few as 900 epochs. They trained OT-GAN for 13000 epochs and saw no evidence of mode collapse or reduced diversity in the samples. Samples can be viewed in Figure 9.<br />
<br />
[[File: ModelCollapseImageNetDogs.png|600px]]<br />
<br />
===Conditional Generation of Birds===<br />
<br />
The last experiment compared OT-GAN with three popular GAN models on text-to-image generation, to demonstrate its performance on conditional image synthesis. As Table 2 shows, OT-GAN achieved a higher inception score than the other three models. <br />
<br />
[[File: 5_4.png|600px]]<br />
<br />
The algorithm used to obtain the results above is conditional generation generalized from '''Algorithm 1''' to include conditional information <math>s</math> such as some text description of an image. The modified algorithm is outlined in '''Algorithm 2'''.<br />
<br />
[[File: paper23_alg2.png|600px]]<br />
<br />
==Conclusion==<br />
<br />
In this paper, the OT-GAN method was proposed based on optimal transport theory. A distance metric that combines the primal form of optimal transport with the energy distance was presented as the basis of OT-GAN. The results showed OT-GAN to be uniquely stable when trained with large mini-batches, and state-of-the-art results were achieved on some datasets. One of the advantages of OT-GAN over other GAN models is that it can stay on the correct track with an unbiased gradient even if training of the critic is stopped or the critic provides a weak cost signal. The performance of OT-GAN is maintained as the batch size increases, though the computational cost has to be taken into consideration.<br />
<br />
==Critique==<br />
<br />
The paper presents a variant of GANs by defining a new distance metric based on the primal form of optimal transport and the mini-batch energy distance. Its stability was demonstrated through four experiments comparing OT-GAN with other popular methods. However, limitations in computational efficiency were not discussed much, beyond the acknowledgement that OT-GAN requires large amounts of computation and memory. Furthermore, in section 2, the paper does not explain why mini-batches rather than single vectors are used as input when applying the Sinkhorn distance. The explanation in section 4 of how <math> M </math> is chosen to minimize <math> \mathcal{W}_c </math> is also confusing. Lastly, the paper lacks a side-by-side comparison with existing GAN variants, so readers may feel they are jumping from one algorithm to another without the necessary explanations.<br />
<br />
= Discussion =<br />
We have presented OT-GAN, a new variant of GANs where the generator is trained to minimize a novel distance metric over probability distributions. This metric, which we call mini-batch energy distance, combines optimal transport in primal form with an energy distance defined in an adversarially learned feature space, resulting in a highly discriminative distance function with unbiased mini-batch gradients. OT-GAN was shown to be uniquely stable when trained with large mini-batches and to achieve state-of-the-art results on several common benchmarks.<br />
<br />
One downside of OT-GAN, as currently proposed, is that it requires large amounts of computation and memory. We achieve the best results when using very large mini-batches, which increases the time required for each update of the parameters. All experiments in this paper, except for the mixture of Gaussians toy example, were performed using 8 GPUs and trained for several days. In future work we hope to make the method more computationally efficient, as well as to scale up our approach to multi-machine training to enable generation of even more challenging and high resolution image data sets.<br />
<br />
A unique property of OT-GAN is that the mini-batch energy distance remains a valid training objective even when we stop training the critic. Our implementation of OT-GAN updates the generative model more often than the critic, where GANs typically do this the other way around (see e.g. Gulrajani et al., 2017). As a result we learn a relatively stable transport cost function <math>c(x, y)</math>, describing how (dis)similar two images are, as well as an image embedding function <math>v_\eta(x)</math> capturing the geometry of the training data. Preliminary experiments suggest these learned functions can be used successfully for unsupervised learning and other applications, which we plan to investigate further in future work.<br />
<br />
==Reference==<br />
Salimans, Tim, Han Zhang, Alec Radford, and Dimitris Metaxas. "Improving GANs using optimal transport." (2018).</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Spherical_CNNs&diff=36386Spherical CNNs2018-04-21T02:05:53Z<p>Ws2chen: /* Correlations on the Sphere and Rotation Group */</p>
<hr />
<div>= Introduction =<br />
Convolutional Neural Networks (CNNs), or network architectures involving CNNs, are the current state of the art for learning 2D image processing tasks such as semantic segmentation and object detection. CNNs work well in large part due to the property of being translationally equivariant. This property allows a network trained to detect a certain type of object to still detect the object even if it is translated to another position in the image. However, this does not correspond well to spherical signals since projecting a spherical signal onto a plane will result in distortions, as demonstrated in Figure 1. There are many different types of spherical projections onto a 2D plane, as most people know from the various types of world maps, none of which provide all the necessary properties for rotation-invariant learning. Applications where spherical CNNs can be applied include omnidirectional vision for robots, molecular regression problems, and weather/climate modelling.<br />
<br />
[[File:paper26-fig1.png|center]]<br />
<br />
The implementation of a spherical CNN is challenging mainly because no perfectly symmetrical grid for the sphere exists, which makes it difficult both to define the rotation of a spherical filter by one pixel and to keep the computation efficient.<br />
<br />
The main contributions of this paper are the following:<br />
# The theory of spherical CNNs. The authors provide mathematical foundations for rotation equivariance under a spherical framework.<br />
# The first automatically differentiable implementation of the generalized Fourier transform for <math>S^2</math> and SO(3). The provided PyTorch code by the authors is easy to use, fast, and memory efficient.<br />
# The first empirical support for the utility of spherical CNNs for rotation-invariant learning problems. They apply it to spherical MNIST, 3D shape classification, and molecular energy regression.<br />
<br />
=== Note: Translationally equivariant === <br />
<br />
Equivariant to translation means that a translation of input features results in an equivalent translation of outputs. So if your pattern 0,3,2,0,0 on the input results in 0,1,0,0 in the output, then the pattern 0,0,3,2,0 might lead to 0,0,1,0.<br />
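<br />
The property can be checked numerically. The following is a minimal sketch, not taken from the paper: it assumes a 1D circular correlation and a hypothetical two-tap filter, and verifies that shifting the input by one position shifts the output by one position.<br />
<br />
```python
import numpy as np

def circular_correlate(signal, kernel):
    """Circular cross-correlation: out[x] = sum_j kernel[j] * signal[(x + j) % n]."""
    n = len(signal)
    return np.array([
        sum(kernel[j] * signal[(x + j) % n] for j in range(len(kernel)))
        for x in range(n)
    ])

signal = np.array([0, 3, 2, 0, 0])
kernel = np.array([1, -1])          # hypothetical filter, chosen only for illustration

shifted = np.roll(signal, 1)        # translate the input by one position
out = circular_correlate(signal, kernel)
out_shifted = circular_correlate(shifted, kernel)

# Equivariance: correlating the shifted input equals shifting the original output.
equivariant = np.array_equal(out_shifted, np.roll(out, 1))
print(equivariant)
```
<br />
The circular boundary is a simplifying assumption; with zero padding the identity holds only away from the borders.<br />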
<br />
= Notation =<br />
Below are listed several important terms:<br />
* '''Unit Sphere''' <math>S^2</math> is defined as the set of points at distance 1 from the origin. The unit sphere can be parameterized by the spherical coordinates <math>\alpha \in [0, 2\pi]</math> and <math>\beta \in [0, \pi]</math>. This is a two-dimensional manifold with respect to <math>\alpha</math> and <math>\beta</math>.<br />
* '''<math>S^2</math> Sphere''' The two-dimensional surface of a ball in three-dimensional space.<br />
* '''Spherical Signals''' In the paper spherical images and filters are modeled as continuous functions <math>f : S^2 \to \mathbb{R}^K</math>, where K is the number of channels. Just as RGB images have 3 channels, a spherical signal can have numerous channels describing the data. Examples of the channels used can be found in the experiments section.<br />
* '''Rotations - SO(3)''' The group of 3D rotations on an <math>S^2</math> sphere. Sometimes called the "special orthogonal group". In this paper the ZYZ-Euler parameterization is used to represent SO(3) rotations with <math>\alpha, \beta</math>, and <math>\gamma</math>. Any rotation can be broken down into first a rotation (<math>\alpha</math>) about the Z-axis, then a rotation (<math>\beta</math>) about the new Y-axis (Y'), followed by a rotation (<math>\gamma</math>) about the new Z axis (Z"). [In the rest of this paper, to integrate functions on SO(3), the authors use a rotationally invariant probability measure on the Borel subsets of SO(3). This measure is an example of a Haar measure. Haar measures generalize the idea of rotationally invariant probability measures to general topological groups. For more on Haar measures, see (Feldman 2002) ]<br />
<br />
= Related Work =<br />
The related work presented in this paper is very brief, in large part due to the novelty of spherical CNNs and the length of the rest of the paper. The authors enumerate numerous papers which attempt to exploit larger groups of symmetries such as the translational symmetries of CNNs but do not go into specific details for any of these attempts. They do state that all the previous works are limited to discrete groups with the exception of SO(2)-steerable networks.<br />
The authors also mention that previous works exist that analyze spherical images but that these do not have an equivariant architecture. They claim that Spherical CNNs are "the first to achieve equivariance to a continuous, non-commutative group (SO(3))". They also claim to be the first to use the generalized Fourier transform for fast computation of group correlations.<br />
<br />
= Correlations on the Sphere and Rotation Group =<br />
Spherical correlation is like planar correlation except instead of translation, there is rotation. The definitions for each are provided as follows:<br />
<br />
'''Planar correlation''' The value of the output feature map at translation <math>\small x ∈ Z^2</math> is computed as an inner product between the input feature map and a filter, shifted by <math>\small x</math>.<br />
<br />
'''The unit sphere''' <math>S^2</math> can be defined as the set of points <math>x ∈ R^3</math> with norm 1. It is a two-dimensional manifold, which can be parameterized by spherical coordinates α ∈ [0, 2π] and β ∈ [0, π]. <br />
<br />
'''Spherical Signals''' We model spherical images and filters as continuous functions f : <math>S^2</math> → <math>R^K</math>, where K is the number of channels.<br />
<br />
'''Rotations''' The set of rotations in three dimensions is called SO(3), the “special orthogonal group”. Rotations can be represented by 3 × 3 matrices that preserve distance (i.e. ||Rx|| = ||x||) and orientation (det(R) = +1). If we represent points on the sphere as 3D unit vectors x, we can perform a rotation using the matrix-vector product Rx. The rotation group SO(3) is a three-dimensional manifold, and can be parameterized by ZYZ-Euler angles α ∈ [0, 2π], β ∈ [0, π], and γ ∈ [0, 2π].<br />
<br />
'''Spherical correlation''' The value of the output feature map evaluated at rotation <math>\small R ∈ SO(3)</math> is computed as an inner product between the input feature map and a filter, rotated by <math>\small R</math>.<br />
<br />
'''Rotation of Spherical Signals''' The paper introduces the rotation operator <math>L_R</math>, which rotates a function (and thus lets us rotate the spherical filters) by evaluating it at <math>R^{-1}x</math>: <math>[L_R f](x) = f(R^{-1}x)</math>. With this definition we have the property that <math>L_{RR'} = L_R L_{R'}</math>.<br />
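<br />
The rotation matrices and the rotation operator can be sketched in a few lines. The code below is an illustration under stated assumptions: it uses the ZYZ convention <math>R = Z(\alpha) Y(\beta) Z(\gamma)</math>, a hypothetical single-channel test signal <code>f</code>, and arbitrary angles. It checks that the matrix is a rotation and that <math>L_{RR'} = L_R L_{R'}</math> holds at a sample point.<br />
<br />
```python
import numpy as np

def rot_z(angle):
    """Rotation by `angle` radians about the Z axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_y(angle):
    """Rotation by `angle` radians about the Y axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def zyz_rotation(alpha, beta, gamma):
    """Assumed convention: R = Z(alpha) Y(beta) Z(gamma)."""
    return rot_z(alpha) @ rot_y(beta) @ rot_z(gamma)

def rotate_signal(R, f):
    """Return L_R f, where [L_R f](x) = f(R^{-1} x); for rotations R^{-1} = R^T."""
    return lambda x: f(R.T @ x)

def f(x):
    """Hypothetical single-channel spherical signal used only for this check."""
    return x[0] + 2.0 * x[1] ** 2 - x[2] * x[0]

R1 = zyz_rotation(0.3, 1.1, 2.0)
R2 = zyz_rotation(1.7, 0.4, 5.2)
x = np.array([1.0, 2.0, 2.0]) / 3.0   # a point on the unit sphere

# A rotation matrix is orthogonal with determinant +1 and preserves norms.
is_rotation = bool(np.allclose(R1.T @ R1, np.eye(3)) and np.isclose(np.linalg.det(R1), 1.0))
norm_after = np.linalg.norm(R1 @ x)

# Homomorphism property: L_{R1 R2} f == L_{R1} (L_{R2} f).
lhs = rotate_signal(R1 @ R2, f)(x)
rhs = rotate_signal(R1, rotate_signal(R2, f))(x)
print(is_rotation, np.isclose(lhs, rhs))
```
<br />
Using <math>R^{-1}</math> inside the operator is what makes the composition order work out; evaluating at <math>Rx</math> instead would reverse it.<br />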
<br />
'''Inner Products''' The inner product of two spherical signals is the integral over the entire sphere of their channel-wise products, summed over channels.<br />
<br />
<math>\langle\psi , f \rangle = \int_{S^2} \sum_{k=1}^K \psi_k (x)f_k (x)dx</math><br />
<br />
<math>dx</math> here is the SO(3) rotation-invariant measure and is equivalent to <math>d \alpha \sin(\beta) d \beta / 4 \pi </math> in spherical coordinates. This comes from the ZYZ-Euler parameterization where any rotation can be broken down into first a rotation about the Z-axis, then a rotation about the new Y-axis (Y'), followed by a rotation about the new Z axis (Z"). More details on this are given in Appendix A in the paper.<br />
<br />
By this definition, the invariance of the inner product is then guaranteed for any rotation <math>R ∈ SO(3)</math>. In other words, when subjected to rotations, the volume under a spherical heightmap does not change. The following equations show that <math>L_R</math> has a distinct adjoint (<math>L_{R^{-1}}</math>) and that <math>L_R</math> is unitary and thus preserves orthogonality and distances.<br />
<br />
<math>\langle L_R \psi \,, f \rangle = \int_{S^2} \sum_{k=1}^K \psi_k (R^{-1} x)f_k (x)dx</math><br />
<br />
::::<math>= \int_{S^2} \sum_{k=1}^K \psi_k (x)f_k (Rx)dx</math><br />
<br />
::::<math>= \langle \psi , L_{R^{-1}} f \rangle</math><br />
<br />
'''Spherical Correlation''' With the above knowledge the definition of spherical correlation of two signals <math>f</math> and <math>\psi</math> is:<br />
<br />
<math>[\psi \star f](R) = \langle L_R \psi \,, f \rangle = \int_{S^2} \sum_{k=1}^K \psi_k (R^{-1} x)f_k (x)dx</math><br />
<br />
The output of the above equation is a function on SO(3): each rotation, i.e. each combination of <math>\alpha , \beta , \gamma </math>, yields a different value of the correlation. The authors make a point of noting that previous work by Driscoll and Healy only ensures circular symmetries about the Z axis, whereas their new formulation ensures symmetry about any rotation.<br />
<br />
'''Rotation of SO(3) Signals''' The first layer of a Spherical CNN takes a function on the sphere (<math>S^2</math>) and outputs a function on SO(3). Therefore, if a Spherical CNN with more than one layer is going to be built, there needs to be a way to find the correlation between two signals on SO(3). The authors therefore generalize the rotation operator (<math>L_R</math>) to act on signals on SO(3). This new definition of <math>L_R</math> is as follows (where <math>R^{-1}Q</math> is a composition of rotations, i.e. multiplication of rotation matrices):<br />
<br />
<math>[L_Rf](Q)=f(R^{-1} Q)</math><br />
<br />
'''Rotation Group Correlation''' The correlation of two signals (<math>f,\psi</math>) on SO(3) with K channels is defined as the following:<br />
<br />
<math>[\psi \star f](R) = \langle L_R \psi , f \rangle = \int_{SO(3)} \sum_{k=1}^K \psi_k (R^{-1} Q)f_k (Q)dQ</math><br />
<br />
where dQ represents the normalized rotation-invariant (Haar) measure on SO(3), which in ZYZ-Euler angles is <math>d \alpha \sin(\beta) d \beta d \gamma / (8 \pi^2) </math>. A complete derivation of this can be found in Appendix A.<br />
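<br />
The normalization constants <math>4\pi</math> and <math>8\pi^2</math> can be sanity-checked numerically. This is a small sketch using a midpoint rule in <math>\beta</math> (the grid size <code>n</code> is an arbitrary choice); since the integrand does not depend on <math>\alpha</math> or <math>\gamma</math>, each of those integrals contributes a factor of <math>2\pi</math>.<br />
<br />
```python
import numpy as np

n = 400                                        # number of midpoint samples in beta
d_beta = np.pi / n
beta = (np.arange(n) + 0.5) * d_beta           # midpoints of [0, pi]
sin_integral = np.sum(np.sin(beta)) * d_beta   # approximates the integral of sin(beta), i.e. 2

# Total mass of the measure on S^2: integrate d_alpha sin(beta) d_beta / (4 pi).
sphere_total = (2 * np.pi) * sin_integral / (4 * np.pi)

# Total mass of the measure on SO(3): integrate d_alpha sin(beta) d_beta d_gamma / (8 pi^2).
so3_total = (2 * np.pi) ** 2 * sin_integral / (8 * np.pi ** 2)

print(sphere_total, so3_total)   # both should be close to 1
```
<br />
Both totals come out close to 1, which is why the measures can be treated as probability measures on <math>S^2</math> and SO(3).<br />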
<br />
'''Equivariance''' The equivariance for the rotation group correlation is similarly demonstrated. A layer is equivariant if for some operator <math>T_R</math>, <math>\Phi \circ L_R = T_R \circ \Phi</math>, and: <br />
<br />
<math>[\psi \star [L_Qf]](R) = \langle L_R \psi , L_Qf \rangle = \langle L_{Q^{-1} R} \psi , f \rangle = [\psi \star f](Q^{-1}R) = [L_Q[\psi \star f]](R) </math>.<br />
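<br />
The same chain of equalities can be verified exactly on a finite group. The sketch below is an analogy rather than the authors' implementation: it replaces SO(3) with the six-element permutation group <math>S_3</math> (also non-commutative), replaces the integral with a sum, and uses random single-channel signals.<br />
<br />
```python
import itertools
import numpy as np

# Elements of S_3 as permutation tuples; compose(p, q) applies q first, then p.
group = list(itertools.permutations(range(3)))

def compose(p, q):
    return tuple(p[q[i]] for i in range(3))

def inverse(p):
    inv = [0] * 3
    for i, image in enumerate(p):
        inv[image] = i
    return tuple(inv)

def rotate(q, f):
    """[L_q f](p) = f(q^{-1} p); signals are dicts keyed by group element."""
    q_inv = inverse(q)
    return {p: f[compose(q_inv, p)] for p in group}

def correlate(psi, f):
    """[psi * f](r) = sum over p of psi(r^{-1} p) f(p)."""
    return {
        r: sum(psi[compose(inverse(r), p)] * f[p] for p in group)
        for r in group
    }

rng = np.random.default_rng(0)
f = {p: rng.normal() for p in group}
psi = {p: rng.normal() for p in group}

# Largest deviation between psi * (L_q f) and L_q (psi * f) over all q and r.
max_gap = max(
    abs(correlate(psi, rotate(q, f))[r] - rotate(q, correlate(psi, f))[r])
    for q in group
    for r in group
)
print(max_gap)
```
<br />
The gap is zero up to floating-point rounding, mirroring the continuous argument: the substitution <math>Q \to Q'Q</math> inside the sum is valid because the sum runs over the whole group.<br />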
<br />
= Implementation with GFFT =<br />
The authors leverage the Generalized Fourier Transform (GFT) and Generalized Fast Fourier Transform (GFFT) algorithms to compute the correlations outlined in the previous section. The Fast Fourier Transform (FFT) can compute correlations and convolutions efficiently by means of the Fourier theorem. The Fourier theorem states that a continuous periodic function can be expressed as a sum of a series of sine or cosine terms (whose weights are called Fourier coefficients). The FT can be generalized to <math>S^2</math> and SO(3) and is then called the GFT. The GFT is a linear projection of a function onto orthogonal basis functions. The basis functions are a set of irreducible unitary representations for a group (such as for <math>S^2</math> or SO(3)). For <math>S^2</math> the basis functions are the spherical harmonics <math>Y_m^l(x)</math>. For SO(3) these basis functions are called the Wigner D-functions <math>D_{mn}^l(R)</math>. For both sets of functions the indices are restricted to <math>l\geq0</math> and <math>-l \leq m,n \leq l</math>. The Wigner D-functions are also orthogonal so the Fourier coefficients can be computed by the inner product with the Wigner D-functions (See Appendix C for complete proof). The Wigner D-functions are complete which means that any function (which is well behaved) on SO(3) can be expressed as a linear combination of the Wigner D-functions. The GFT of a function on SO(3) is thus:<br />
<br />
<math>\hat{f^l} = \int_X f(x) D^l(x)dx</math><br />
<br />
where <math>\hat{f}</math> represents the Fourier coefficients. For <math>S^2</math> we have the same equation but with the basis functions <math>Y^l</math>.<br />
<br />
The inverse SO(3) Fourier transform is:<br />
<br />
<math>f(R)=[\mathcal{F}^{-1} \hat{f}](R) = \sum_{l=0}^b (2l + 1) \sum_{m=-l}^l \sum_{n=-l}^l \hat{f_{mn}^l} D_{mn}^l(R) </math><br />
<br />
The bandwidth b represents the maximum frequency and is related to the resolution of the spatial grid. Kostelec and Rockmore are referenced for more knowledge on this topic.<br />
<br />
The authors give proofs (Appendix D) that the SO(3) correlation satisfies the Fourier theorem and the <math>S^2</math> correlation of spherical signals can be computed by the outer products of the <math>S^2</math>-FTs (Shown in Figure 2).<br />
<br />
[[File:paper26-fig2.png|center]]<br />
<br />
A high-level, approximately-correct, somewhat intuitive explanation of the above figure is that the spherical signal <math> f </math> parameterized over <math> \alpha </math> and <math> \beta </math> having <math> k </math> channels is being correlated with a single filter <math> \psi </math> with the end result being a 3-D feature map on SO(3) (parameterized by Euler angles). The size in <math> \alpha </math> and <math> \beta </math> is the kernel size. The index <math> l </math> going from 0 to 3 corresponds to the degree of the basis functions used in the Fourier transform. As the degree goes up, so does the dimensionality of the vector-valued (for spheres) basis functions. The signals involved are discrete, so the maximum degree (analogous to the number of Fourier coefficients) depends on the resolution of the signal. The SO(3) basis functions are matrix-valued, but because <math> S^2 = SO(3)/SO(2) </math>, it ends up that the sphere basis functions correspond to one column in the matrix-valued SO(3) basis functions, which is why the outer product in the figure works.<br />
<br />
The GFFT algorithm details are taken from Kostelec and Rockmore. The authors claim they have the first automatically differentiable implementation of the GFT for <math>S^2</math> and SO(3). The authors do not provide any run time comparisons for real time applications (they just mention that the FFT can be computed in <math>O(n \log n)</math> time as opposed to <math>O(n^2)</math> for the FT) or any comparisons of training times with and without the GFFT. However, they do provide the source code of their implementation at: https://github.com/jonas-koehler/s2cnn.<br />
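<br />
The idea of computing a correlation in the spectral domain can be illustrated with the ordinary FFT on a circle, which plays the role that the GFFT plays for <math>S^2</math> and SO(3). This is only an analogy with assumed random real signals; it does not use the authors' code or the Wigner D-functions.<br />
<br />
```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
f = rng.normal(size=n)      # input signal on a circle with n samples
psi = rng.normal(size=n)    # filter on the same grid

# Direct O(n^2) correlation: out[x] = sum_y psi[(y - x) % n] * f[y],
# i.e. the inner product of f with psi shifted by x.
direct = np.array([
    sum(psi[(y - x) % n] * f[y] for y in range(n))
    for x in range(n)
])

# Spectral O(n log n) version: for real psi, the transform of the correlation
# is conj(FFT(psi)) * FFT(f), so one pointwise product replaces the double loop.
spectral = np.fft.ifft(np.conj(np.fft.fft(psi)) * np.fft.fft(f)).real

print(np.allclose(direct, spectral))
```
<br />
On SO(3) the pointwise product of scalar coefficients is replaced by products of coefficient matrices, one block per degree <math>l</math>, but the structure of the computation (transform, multiply, inverse transform) is the same.<br />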
<br />
= Experiments =<br />
The authors provide several experiments. The first set of experiments are designed to show the numerical stability and accuracy of the outlined methods. The second group of experiments demonstrates how the algorithms can be applied to current problem domains.<br />
<br />
==Equivariance Error==<br />
In this experiment the authors try to show experimentally that their theory of equivariance holds. Equivariance was proven only for the continuous case, so they had doubts about whether it would hold in practice given potential discretization artifacts; if it did not hold, the weight sharing scheme would become less effective. The experiment is set up by first testing the equivariance of the SO(3) correlation at different resolutions. 500 random rotations and feature maps (with 10 channels) are sampled. They then calculate the approximation error <math>\small\Delta = \dfrac{1}{n} \sum_{i=1}^n std(L_{R_i} \Phi(f_i) - \Phi(L_{R_i} f_i))/std(\Phi(f_i))</math><br />
Note: The authors do not state what the std function is; it is most likely the standard deviation, as 'std' is the usual name of that function in numerical environments such as MATLAB.<br />
<math>\Phi</math> is a composition of SO(3) correlation layers with randomly initialized filters. The authors mention that they were expecting <math>\Delta</math> to be zero in the case of perfect equivariance, because, as proven earlier, the two terms <math>\small L_{R_i} \Phi(f_i)</math> and <math>\small \Phi(L_{R_i} f_i)</math> are equal in the continuous case. The results are shown in Figure 3. <br />
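<br />
The error measure itself is easy to sketch. The code below is a stand-in under explicit assumptions: rotations are replaced by circular shifts of a 1D signal, <math>\Phi</math> is a single circular correlation followed by ReLU, and a second, deliberately non-equivariant layer (with a position-dependent bias) is included for contrast. None of these layers come from the paper.<br />
<br />
```python
import numpy as np

def equivariance_error(phi, transform, signals, params):
    """Mean over samples of std(T_r phi(f) - phi(T_r f)) / std(phi(f))."""
    ratios = []
    for f, r in zip(signals, params):
        out = phi(f)
        gap = transform(r, out) - phi(transform(r, f))
        ratios.append(np.std(gap) / np.std(out))
    return float(np.mean(ratios))

rng = np.random.default_rng(0)
n, num_samples = 32, 50
kernel = rng.normal(size=n)                       # hypothetical random filter

def shift(r, f):
    """Stand-in for the rotation operator: circular shift by r samples."""
    return np.roll(f, r)

def phi_equivariant(f):
    """Circular correlation followed by ReLU (exactly shift-equivariant)."""
    corr = np.fft.ifft(np.conj(np.fft.fft(kernel)) * np.fft.fft(f)).real
    return np.maximum(corr, 0.0)

def phi_broken(f):
    """Same layer plus a position-dependent bias, which breaks equivariance."""
    return phi_equivariant(f) + 0.1 * np.arange(n)

signals = [rng.normal(size=n) for _ in range(num_samples)]
shifts = rng.integers(1, n, size=num_samples)     # non-trivial shifts only

delta_equivariant = equivariance_error(phi_equivariant, shift, signals, shifts)
delta_broken = equivariance_error(phi_broken, shift, signals, shifts)
print(delta_equivariant, delta_broken)
```
<br />
In this toy setting the shifts are exact on the grid, so the equivariant layer gives an error at the level of floating-point rounding. In the spherical case the rotation of a sampled feature map is itself approximate, which is the effect the experiment measures.<br />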
<br />
[[File:paper26-fig3.png|center]]<br />
<br />
<math>\Delta</math> only grows with resolution/layers when there is no activation function. With ReLU activation the error stays flat across all but the very lowest resolutions. The authors indicate that the error must therefore come from the feature map rotation, since this rotation is exact only for bandlimited functions.<br />
<br />
==MNIST Data==<br />
The experiment using MNIST data was created by projecting MNIST digits onto a sphere using stereographic projection to create the resulting images as seen in Figure 4.<br />
<br />
[[File:paper26-fig4.png|center]]<br />
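<br />
As an illustration of the projection step, the sketch below maps a planar pixel grid onto the unit sphere with the inverse stereographic projection. The scaling of the 28x28 grid to <math>[-1, 1]^2</math> and the choice of projection pole are assumptions made here for illustration; the summary does not state the exact settings used by the authors.<br />
<br />
```python
import numpy as np

def inverse_stereographic(u, v):
    """Map plane coordinates (u, v) to the unit sphere.

    Uses projection from the north pole (0, 0, 1), so the plane origin
    lands on the south pole (0, 0, -1).
    """
    rho = u ** 2 + v ** 2
    return np.stack(
        [2 * u / (1 + rho), 2 * v / (1 + rho), (rho - 1) / (1 + rho)],
        axis=-1,
    )

size = 28                                    # MNIST image side length
coords = np.linspace(-1.0, 1.0, size)        # assumed scaling of pixel centres
u, v = np.meshgrid(coords, coords, indexing="xy")

points = inverse_stereographic(u, v)         # shape (28, 28, 3)
norms = np.linalg.norm(points, axis=-1)      # every point should lie on the sphere

print(points.shape, norms.min(), norms.max())
```
<br />
Each pixel intensity would then be assigned to its point on the sphere and resampled onto the spherical grid used by the network; that resampling step is not shown here.<br />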
<br />
The authors created two datasets, one with the projected digits and the other with the same projected digits which were then subjected to a random rotation. The spherical CNN architecture used was <math>\small S^2</math>conv-ReLU-SO(3)conv-ReLU-FC-softmax and was attempted with bandwidths of 30,10,6 and 20,40,10 channels for each layer respectively. This model was compared to a baseline CNN with layers conv-ReLU-conv-ReLU-FC-softmax with 5x5 filters, 32,64,10 channels and stride of 3. For comparison this leads to approximately 68K parameters for the baseline and 58K parameters for the spherical CNN. Results can be seen in Table 1. It is clear from the results that the spherical CNN architecture made the network rotationally invariant. Performance on the rotated set is almost identical to the non-rotated set. This is true even when trained on the non-rotated set and tested on the rotated set. Compare this to the non-spherical architecture which becomes unusable when rotating the digits.<br />
<br />
[[File:paper26-tab1.png|center]]<br />
<br />
==SHREC17==<br />
The SHREC dataset contains 3D models from the ShapeNet dataset which are classified into categories. It consists of a regularly aligned dataset and a rotated dataset. The models from the SHREC17 dataset were projected onto a sphere by means of raycasting. Different properties of the objects obtained from the raycast of the original model and the convex hull of the model make up the different channels which are input into the spherical CNN.<br />
<br />
<br />
[[File:paper26-fig5.png|center]]<br />
<br />
<br />
The network architecture used is an initial <math>\small S^2</math>conv-BN-ReLU block which is followed by two SO(3)conv-BN-ReLU blocks. The output is then fed into a MaxPool-BN block then a linear layer to the output for final classification. An important note is that the max pooling happens over the group SO(3): if <math>f_k</math> is the <math>\small k</math>-th filter in the final layer, the result of pooling is <math>\max_{x \in SO(3)} f_k(x)</math>. 50 features were used for the <math>\small S^2</math> layer, while the two SO(3) layers used 70 and 350 features. Additionally, for each layer the resolution <math>\small b</math> was reduced from 128,32,22 to 7 in the final layer. The architecture for this experiment has ~1.4M parameters, far exceeding the scale of the spherical CNNs in the other experiments.<br />
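<br />
Pooling over the group can be sketched as a maximum over all sampled Euler angles. The example below assumes a feature map stored as an array of shape (channels, <math>2b</math>, <math>2b</math>, <math>2b</math>) over <math>(\alpha, \beta, \gamma)</math>, with random values and a small hypothetical channel count; the storage layout is an assumption, not something stated in this summary.<br />
<br />
```python
import numpy as np

rng = np.random.default_rng(0)
b = 7                    # bandwidth of the final layer
channels = 4             # hypothetical; kept small for the example

# Assumed layout: one value per channel and per sampled (alpha, beta, gamma).
features = rng.normal(size=(channels, 2 * b, 2 * b, 2 * b))

# Max pooling over SO(3): one scalar per channel.
pooled = features.max(axis=(1, 2, 3))

# A rotation about the Z axis by a multiple of the grid spacing acts as a
# cyclic shift along the alpha axis, and the pooled descriptor is unchanged.
rotated = np.roll(features, shift=3, axis=1)
pooled_rotated = rotated.max(axis=(1, 2, 3))

print(pooled.shape, np.array_equal(pooled, pooled_rotated))
```
<br />
Because the maximum discards where on SO(3) the response occurred, the pooled vector is invariant rather than equivariant, which is what a classifier needs.<br />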
<br />
This architecture achieves state of the art results on the SHREC17 tasks. The model places 2nd or 3rd in all categories but was not submitted as the SHREC17 task is closed. Table 2 shows the comparison of results with the top 3 submissions in each category. In the table, P@N stands for precision, R@N stands for recall, F1@N stands for F-score, mAP stands for mean average precision, and NDCG stands for normalized discounted cumulative gain in relevance based on whether the category and subcategory labels are predicted correctly. The authors claim the results show empirical proof of the usefulness of spherical CNNs. They elaborate that this is largely due to the fact that most architectures on the SHREC17 competition are highly specialized whereas their model is fairly general.<br />
<br />
<br />
[[File:paper26-tab2.png|center]]<br />
<br />
==Molecular Atomization==<br />
In this experiment a spherical CNN is implemented with an architecture resembling that of ResNet. They use the QM7 dataset (Blum et al. 2009) which has the task of predicting atomization energy of molecules. The QM7 dataset is a subset of GDB-13 (database of organic molecules) composed of all molecules up to 23 atoms. The positions and charges given in the dataset are projected onto the sphere using potential functions. For each atom, a sphere is defined around its position with the radius of the sphere kept uniform across all atoms. Next, the radius is chosen as the minimal radius so no intersections between atoms occur in the training set. Finally, using potential functions, a T channel spherical signal is produced for each atom in the molecule as shown in the figure below. A summary of their results is shown in Table 3 along with some of the spherical CNN architecture details. It shows the different RMSE obtained from different methods. The results from this final experiment also seem to be promising as the network the authors present achieves the second best score. They also note that the first place method grows exponentially with the number of atoms per molecule so is unlikely to scale well.<br />
<br />
[[File:paper26-tab3.png|center]]<br />
<br />
[[File:paper26-f6.png|center]]<br />
<br />
= Conclusions =<br />
This paper presents a novel architecture called Spherical CNNs, evaluates it on 2 important learning problems, and introduces a trainable signal representation for spherical signals that is rotationally equivariant by design. The paper defines <math>\small S^2</math> and SO(3) cross-correlations, shows the theory behind their rotational equivariance for continuous functions, and demonstrates that the equivariance also holds approximately in the discrete case. An effective GFFT algorithm was implemented and evaluated on two very different datasets with close to state-of-the-art results, demonstrating that there are practical applications for Spherical CNNs. The network is able to generalize across rotations and produce competitive results in the process.<br />
<br />
For future work the authors believe that improvements can be obtained by generalizing the algorithms to the SE(3) group (SE(3) simply adds translations in 3D space to the SO(3) group). The authors also briefly mention their excitement for applying Spherical CNNs to omnidirectional vision such as in drones and autonomous cars. They state that there is very little publicly available omnidirectional image data which could be why they did not conduct any experiments in this area.<br />
<br />
= Commentary =<br />
The reviews on Spherical CNNs are very positive and it is ranked in the top 1% of papers submitted to ICLR 2018. Positive points are the novelty of the architecture, the wide variety of experiments performed, and the writing. One critique of the original submission is that the related works section only lists, instead of describing, previous methods and that a description of the methods would have provided more clarity. The authors have since expanded the section; however, I found that it is still limited, which the authors attribute to length limitations. Another critique is that the evaluation does not provide enough depth. For example, it would have been great to see an example of omnidirectional vision for spherical networks. However, this is to be expected as it is just the introduction of spherical CNNs and more work is sure to come.<br />
<br />
= Source Code =<br />
Source code is available at:<br />
https://github.com/jonas-koehler/s2cnn<br />
<br />
= Sources =<br />
* T. Cohen et al. Spherical CNNs, 2018.<br />
* J. Feldman. Haar Measure. http://www.math.ubc.ca/~feldman/m606/haar.pdf<br />
* P. Kostelec, D. Rockmore. FFTs on the Rotation Group, 2008.</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Spherical_CNNs&diff=36384Spherical CNNs2018-04-21T02:04:17Z<p>Ws2chen: /* Correlations on the Sphere and Rotation Group */</p>
<hr />
<div>= Introduction =<br />
Convolutional Neural Networks (CNNs), or network architectures involving CNNs, are the current state of the art for learning 2D image processing tasks such as semantic segmentation and object detection. CNNs work well in large part due to the property of being translationally equivariant. This property allows a network trained to detect a certain type of object to still detect the object even if it is translated to another position in the image. However, this does not correspond well to spherical signals since projecting a spherical signal onto a plane will result in distortions, as demonstrated in Figure 1. There are many different types of spherical projections onto a 2D plane, as most people know from the various types of world maps, none of which provide all the necessary properties for rotation-invariant learning. Applications where spherical CNNs can be applied include omnidirectional vision for robots, molecular regression problems, and weather/climate modelling.<br />
<br />
[[File:paper26-fig1.png|center]]<br />
<br />
The implementation of a spherical CNN is challenging mainly because no perfectly symmetrical grid for the sphere exists, which makes it difficult both to define the rotation of a spherical filter by one pixel and to keep the computation efficient.<br />
<br />
The main contributions of this paper are the following:<br />
# The theory of spherical CNNs. The authors provide mathematical foundations for rotation equivariance under a spherical framework.<br />
# The first automatically differentiable implementation of the generalized Fourier transform for <math>S^2</math> and SO(3). The provided PyTorch code by the authors is easy to use, fast, and memory efficient.<br />
# The first empirical support for the utility of spherical CNNs for rotation-invariant learning problems. They apply it to spherical MNIST, 3D shape classification, and molecular energy regression.<br />
<br />
=== Note: Translationally equivariant === <br />
<br />
Equivariant to translation means that a translation of input features results in an equivalent translation of outputs. So if your pattern 0,3,2,0,0 on the input results in 0,1,0,0 in the output, then the pattern 0,0,3,2,0 might lead to 0,0,1,0.<br />
<br />
= Notation =<br />
Below are listed several important terms:<br />
* '''Unit Sphere''' <math>S^2</math> is defined as the set of points at distance 1 from the origin. The unit sphere can be parameterized by the spherical coordinates <math>\alpha \in [0, 2\pi]</math> and <math>\beta \in [0, \pi]</math>. This is a two-dimensional manifold with respect to <math>\alpha</math> and <math>\beta</math>.<br />
* '''<math>S^2</math> Sphere''' The two-dimensional surface of a ball in three-dimensional space.<br />
* '''Spherical Signals''' In the paper spherical images and filters are modeled as continuous functions <math>f : S^2 \to \mathbb{R}^K</math>, where K is the number of channels. Just as RGB images have 3 channels, a spherical signal can have numerous channels describing the data. Examples of the channels used can be found in the experiments section.<br />
* '''Rotations - SO(3)''' The group of 3D rotations on an <math>S^2</math> sphere. Sometimes called the "special orthogonal group". In this paper the ZYZ-Euler parameterization is used to represent SO(3) rotations with <math>\alpha, \beta</math>, and <math>\gamma</math>. Any rotation can be broken down into first a rotation (<math>\alpha</math>) about the Z-axis, then a rotation (<math>\beta</math>) about the new Y-axis (Y'), followed by a rotation (<math>\gamma</math>) about the new Z axis (Z"). [In the rest of this paper, to integrate functions on SO(3), the authors use a rotationally invariant probability measure on the Borel subsets of SO(3). This measure is an example of a Haar measure. Haar measures generalize the idea of rotationally invariant probability measures to general topological groups. For more on Haar measures, see (Feldman 2002) ]<br />
<br />
= Related Work =<br />
The related work presented in this paper is very brief, in large part due to the novelty of spherical CNNs and the length of the rest of the paper. The authors enumerate numerous papers which attempt to exploit larger groups of symmetries such as the translational symmetries of CNNs but do not go into specific details for any of these attempts. They do state that all the previous works are limited to discrete groups with the exception of SO(2)-steerable networks.<br />
The authors also mention that previous works exist that analyze spherical images but that these do not have an equivariant architecture. They claim that Spherical CNNs are "the first to achieve equivariance to a continuous, non-commutative group (SO(3))". They also claim to be the first to use the generalized Fourier transform for fast computation of group correlations.<br />
<br />
= Correlations on the Sphere and Rotation Group =<br />
Spherical correlation is like planar correlation except instead of translation, there is rotation. The definitions for each are provided as follows:<br />
<br />
'''Planar correlation''' The value of the output feature map at translation <math>\small x ∈ Z^2</math> is computed as an inner product between the input feature map and a filter, shifted by <math>\small x</math>.<br />
<br />
'''The unit sphere''' <math>S^2</math> can be defined as the set of points <math>x ∈ R^3</math> with norm 1. It is a two-dimensional manifold, which can be parameterized by spherical coordinates α ∈ [0, 2π] and β ∈ [0, π]. <br />
<br />
'''Spherical Signals''' We model spherical images and filters as continuous functions f : <math>S^2</math> → <math>R^K</math>, where K is the number of channels.<br />
<br />
'''Spherical correlation''' The value of the output feature map evaluated at rotation <math>\small R ∈ SO(3)</math> is computed as an inner product between the input feature map and a filter, rotated by <math>\small R</math>.<br />
<br />
'''Rotation of Spherical Signals''' The paper introduces the rotation operator <math>L_R</math>, which rotates a function (and thus lets us rotate the spherical filters) by evaluating it at <math>R^{-1}x</math>: <math>[L_R f](x) = f(R^{-1}x)</math>. With this definition we have the property that <math>L_{RR'} = L_R L_{R'}</math>.<br />
<br />
'''Inner Products''' The inner product of two spherical signals is the integral over the entire sphere of their channel-wise product, summed over channels.<br />
<br />
<math>\langle\psi , f \rangle = \int_{S^2} \sum_{k=1}^K \psi_k (x)f_k (x)dx</math><br />
<br />
<math>dx</math> here is SO(3) rotation invariant and is equivalent to <math>d \alpha \sin(\beta) d \beta / 4 \pi </math> in spherical coordinates. This comes from the ZYZ-Euler parameterization where any rotation can be broken down into first a rotation about the Z-axis, then a rotation about the new Y-axis (Y'), followed by a rotation about the new Z axis (Z"). More details on this are given in Appendix A in the paper.<br />
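<br />
To make the measure concrete, the sketch below approximates the integral over the sphere with a simple midpoint rule in spherical coordinates. It is an illustration only (single channel, hypothetical function names and grid sizes), not the quadrature used in the paper.<br />

```python
import math

def sphere_integral(f, n_alpha=200, n_beta=100):
    """Midpoint-rule estimate of the integral of f over S^2
    with respect to d(alpha) sin(beta) d(beta) / (4 pi)."""
    d_alpha = 2.0 * math.pi / n_alpha
    d_beta = math.pi / n_beta
    total = 0.0
    for i in range(n_alpha):
        alpha = (i + 0.5) * d_alpha
        for j in range(n_beta):
            beta = (j + 0.5) * d_beta
            x = (math.sin(beta) * math.cos(alpha),
                 math.sin(beta) * math.sin(alpha),
                 math.cos(beta))
            total += f(x) * math.sin(beta)
    return total * d_alpha * d_beta / (4.0 * math.pi)

def height(x):
    # Hypothetical test signal: squared z-coordinate.
    return x[2] ** 2

def rotated_height(x, b=0.9):
    # The same signal rotated about the Y axis by angle b:
    # [L_R f](x) = f(R^{-1} x), where the z-component of R^{-1} x
    # is sin(b) * x[0] + cos(b) * x[2].
    z = math.sin(b) * x[0] + math.cos(b) * x[2]
    return z ** 2

total_measure = sphere_integral(lambda x: 1.0)
original = sphere_integral(height)
rotated = sphere_integral(rotated_height)
print(total_measure, original, rotated)
```

Up to discretization error, the constant function integrates to 1 (the measure is normalized) and the signal and its rotated copy give the same value, which is the rotation invariance used in the text.<br />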
<br />
By this definition, the invariance of the inner product is then guaranteed for any rotation <math>R ∈ SO(3)</math>. In other words, when subjected to rotations, the volume under a spherical heightmap does not change. The following equations show that <math>L_R</math> has a distinct adjoint (<math>L_{R^{-1}}</math>) and that <math>L_R</math> is unitary and thus preserves orthogonality and distances.<br />
<br />
<math>\langle L_R \psi \,, f \rangle = \int_{S^2} \sum_{k=1}^K \psi_k (R^{-1} x)f_k (x)dx</math><br />
<br />
::::<math>= \int_{S^2} \sum_{k=1}^K \psi_k (x)f_k (Rx)dx</math><br />
<br />
::::<math>= \langle \psi , L_{R^{-1}} f \rangle</math><br />
<br />
'''Spherical Correlation''' With the above knowledge the definition of spherical correlation of two signals <math>f</math> and <math>\psi</math> is:<br />
<br />
<math>[\psi \star f](R) = \langle L_R \psi \,, f \rangle = \int_{S^2} \sum_{k=1}^K \psi_k (R^{-1} x)f_k (x)dx</math><br />
<br />
The output of the above equation is a function on SO(3). This can be thought of as follows: for each rotation, given by a combination of <math>\alpha , \beta , \gamma </math>, there is a different volume under the correlation. The authors make a point of noting that previous work by Driscoll and Healy only ensures circular symmetries about the Z axis and their new formulation ensures symmetry about any rotation.<br />
<br />
'''Rotation of SO(3) Signals''' The first layer of a Spherical CNN takes a function on the sphere (<math>S^2</math>) and outputs a function on SO(3). Therefore, if a Spherical CNN with more than one layer is going to be built there needs to be a way to find the correlation between two signals on SO(3). The authors then generalize the rotation operator (<math>L_R</math>) to act on signals on SO(3). This new definition of <math>L_R</math> is as follows: (where <math>R^{-1}Q</math> is a composition of rotations, i.e. multiplication of rotation matrices)<br />
<br />
<math>[L_Rf](Q)=f(R^{-1} Q)</math><br />
<br />
'''Rotation Group Correlation''' The correlation of two signals (<math>f,\psi</math>) on SO(3) with K channels is defined as the following:<br />
<br />
<math>[\psi \star f](R) = \langle L_R \psi , f \rangle = \int_{SO(3)} \sum_{k=1}^K \psi_k (R^{-1} Q)f_k (Q)dQ</math><br />
<br />
where dQ is the rotation-invariant measure on SO(3), which in ZYZ-Euler angles is <math>d \alpha \sin(\beta) d \beta d \gamma / 8 \pi^2 </math>. A complete derivation of this can be found in Appendix A.<br />
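<br />
As a quick check of the normalizing constant, integrating this measure over the full ranges of the Euler angles (<math>\alpha, \gamma \in [0, 2\pi]</math>, <math>\beta \in [0, \pi]</math>) gives total volume 1:<br />
<br />
<math>\int_{SO(3)} dQ = \frac{1}{8 \pi^2} \int_0^{2\pi} d\alpha \int_0^{\pi} \sin(\beta) d\beta \int_0^{2\pi} d\gamma = \frac{(2\pi)(2)(2\pi)}{8 \pi^2} = 1</math><br />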
<br />
'''Equivariance''' The equivariance for the rotation group correlation is similarly demonstrated. A layer is equivariant if for some operator <math>T_R</math>, <math>\Phi \circ L_R = T_R \circ \Phi</math>, and: <br />
<br />
<math>[\psi \star [L_Qf]](R) = \langle L_R \psi , L_Qf \rangle = \langle L_{Q^{-1} R} \psi , f \rangle = [\psi \star f](Q^{-1}R) = [L_Q[\psi \star f]](R) </math>.<br />
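<br />
The second equality in the chain above combines two facts stated earlier, here applied to signals on SO(3): the adjoint of <math>L_Q</math> is <math>L_{Q^{-1}}</math>, and rotation operators compose as <math>L_{RR'} = L_R L_{R'}</math>. Written out step by step:<br />
<br />
<math>\langle L_R \psi , L_Q f \rangle = \langle L_{Q^{-1}} L_R \psi , f \rangle = \langle L_{Q^{-1} R} \psi , f \rangle</math><br />
<br />
The final equality of the chain is the definition <math>[L_Q h](R) = h(Q^{-1} R)</math> applied to <math>h = \psi \star f</math>.<br />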
<br />
= Implementation with GFFT =<br />
The authors leverage the Generalized Fourier Transform (GFT) and Generalized Fast Fourier Transform (GFFT) algorithms to compute the correlations outlined in the previous section. The Fast Fourier Transform (FFT) can compute correlations and convolutions efficiently by means of the Fourier theorem. The Fourier theorem states that the Fourier transform of a convolution (or correlation) of two functions can be computed as a product of their Fourier transforms. The FT can be generalized to <math>S^2</math> and SO(3) and is then called the GFT. The GFT is a linear projection of a function onto orthogonal basis functions. The basis functions are matrix elements of irreducible unitary representations of the underlying group. For <math>S^2</math> the basis functions are the spherical harmonics <math>Y_m^l(x)</math>. For SO(3) these basis functions are called the Wigner D-functions <math>D_{mn}^l(R)</math>. For both sets of functions the indices are restricted to <math>l\geq0</math> and <math>-l \leq m,n \leq l</math>. The Wigner D-functions are also orthogonal so the Fourier coefficients can be computed by the inner product with the Wigner D-functions (See Appendix C for complete proof). The Wigner D-functions are complete which means that any function (which is well behaved) on SO(3) can be expressed as a linear combination of the Wigner D-functions. The GFT of a function on SO(3) is thus:<br />
<br />
<math>\hat{f^l} = \int_X f(x) D^l(x)dx</math><br />
<br />
where <math>\hat{f}</math> represents the Fourier coefficients. For <math>S^2</math> we have the same equation but with the basis functions <math>Y^l</math>.<br />
<br />
The inverse SO(3) Fourier transform is:<br />
<br />
<math>f(R)=[\mathcal{F}^{-1} \hat{f}](R) = \sum_{l=0}^b (2l + 1) \sum_{m=-l}^l \sum_{n=-l}^l \hat{f_{mn}^l} D_{mn}^l(R) </math><br />
<br />
The bandwidth b represents the maximum frequency and is related to the resolution of the spatial grid. Kostelec and Rockmore are referenced for more knowledge on this topic.<br />
<br />
The authors give proofs (Appendix D) that the SO(3) correlation satisfies the Fourier theorem and the <math>S^2</math> correlation of spherical signals can be computed by the outer products of the <math>S^2</math>-FTs (Shown in Figure 2).<br />
<br />
[[File:paper26-fig2.png|center]]<br />
<br />
A high-level, approximately-correct, somewhat intuitive explanation of the above figure is that the spherical signal <math> f </math> parameterized over <math> \alpha </math> and <math> \beta </math> having <math> k </math> channels is being correlated with a single filter <math> \psi </math> with the end result being a 3-D feature map on SO(3) (parameterized by Euler angles). The size in <math> \alpha </math> and <math> \beta </math> is the kernel size. The index <math> l </math> going from 0 to 3 corresponds to the degree of the basis functions used in the Fourier transform. As the degree goes up, so does the dimensionality of the vector-valued (for spheres) basis functions. The signals involved are discrete, so the maximum degree (analogous to number of Fourier coefficients) depends on the resolution of the signal. The SO(3) basis functions are matrix-valued, but because <math> S^2 = SO(3)/SO(2) </math>, it ends up that the sphere basis functions correspond to one column in the matrix-valued SO(3) basis functions, which is why the outer product in the figure works.<br />
<br />
The GFFT algorithm details are taken from Kostelec and Rockmore. The authors claim they have the first automatically differentiable implementation of the GFT for <math>S^2</math> and SO(3). The authors do not provide any run time comparisons for real time applications (they only mention that the FFT can be computed in <math>O(n \log n)</math> time as opposed to <math>O(n^2)</math> for the FT) or any comparisons of training times with/without GFFT. However, they do provide the source code of their implementation at: https://github.com/jonas-koehler/s2cnn.<br />
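<br />
The idea of computing a correlation in the Fourier domain can be illustrated on the simplest group, the cyclic group of shifts, where the GFT reduces to the ordinary discrete Fourier transform. The sketch below is an analogy only and does not implement the <math>S^2</math> or SO(3) transforms. For self-containment it uses a naive <math>O(n^2)</math> DFT; an FFT would replace <code>dft</code> and <code>idft</code> to obtain the <math>O(n \log n)</math> cost. All function names are illustrative.<br />

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def correlation_direct(psi, f):
    """[psi * f](s) = sum_t psi[(t - s) mod n] f[t], computed directly."""
    n = len(f)
    return [sum(psi[(t - s) % n] * f[t] for t in range(n)) for s in range(n)]

def correlation_fourier(psi, f):
    """Same correlation via the Fourier theorem (real-valued psi assumed):
    transform, multiply by the conjugated filter spectrum, transform back."""
    F, P = dft(f), dft(psi)
    product = [F[k] * P[k].conjugate() for k in range(len(f))]
    return [v.real for v in idft(product)]

f = [0.0, 3.0, 2.0, 0.0, 0.0, 1.0, -1.0, 0.5]
psi = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.5]

direct = correlation_direct(psi, f)
via_fourier = correlation_fourier(psi, f)
print(max(abs(a - b) for a, b in zip(direct, via_fourier)))
```

The two results agree up to rounding error. On SO(3) the same pattern holds, with the scalar product of Fourier coefficients replaced by products of the matrix-valued coefficient blocks.<br />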
<br />
= Experiments =<br />
The authors provide several experiments. The first set of experiments are designed to show the numerical stability and accuracy of the outlined methods. The second group of experiments demonstrates how the algorithms can be applied to current problem domains.<br />
<br />
==Equivariance Error==<br />
In this experiment the authors try to show experimentally that their theory of equivariance holds. Equivariance was proven only for the continuous case, so they had doubts about whether it would hold in practice given potential discretization artifacts; if it did not hold, the weight sharing scheme would become less effective. The experiment is set up by first testing the equivariance of the SO(3) correlation at different resolutions. 500 random rotations and feature maps (with 10 channels) are sampled. They then calculate the approximation error <math>\small\Delta = \dfrac{1}{n} \sum_{i=1}^n \operatorname{std}(L_{R_i} \Phi(f_i) - \Phi(L_{R_i} f_i))/\operatorname{std}(\Phi(f_i))</math><br />
Note: The authors do not define the std function; it is most likely the standard deviation, as 'std' is the command for standard deviation in MATLAB.<br />
<math>\Phi</math> is a composition of SO(3) correlation layers with randomly initialized filters. In the case of perfect equivariance <math>\Delta</math> would be zero, because, as proven earlier, the two terms <math>\small L_{R_i} \Phi(f_i)</math> and <math>\small \Phi(L_{R_i} f_i)</math> are equal in the continuous case. The results are shown in Figure 3. <br />
<br />
[[File:paper26-fig3.png|center]]<br />
<br />
<math>\Delta</math> only grows with resolution/layers when there is no activation function. With ReLU activation the error stays roughly constant for all but the lowest resolutions. The authors conclude that the error must therefore come from the feature map rotation, which is exact only for bandlimited functions.<br />
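<br />
The error measure <math>\Delta</math> can be illustrated on a toy problem. The sketch below replaces SO(3) with cyclic shifts of a 1D signal, where correlation is exactly equivariant, and assumes that std denotes the population standard deviation. The layers, sizes and names are hypothetical and are not taken from the authors' experiment.<br />

```python
import random
import statistics

def shift(f, s):
    """Cyclic analogue of L_s: [L_s f](t) = f[(t - s) mod n]."""
    n = len(f)
    return [f[(t - s) % n] for t in range(n)]

def correlate(psi, f):
    """Cyclic correlation: out[s] = sum_t psi[(t - s) mod n] f[t]."""
    n = len(f)
    return [sum(psi[(t - s) % n] * f[t] for t in range(n)) for s in range(n)]

def equivariance_error(layer, signals, shifts):
    """Mean over samples of std(L Phi(f) - Phi(L f)) / std(Phi(f))."""
    ratios = []
    for f, s in zip(signals, shifts):
        out = layer(f)
        diff = [a - b for a, b in zip(shift(out, s), layer(shift(f, s)))]
        ratios.append(statistics.pstdev(diff) / statistics.pstdev(out))
    return sum(ratios) / len(ratios)

random.seed(0)
n = 16
psi = [random.gauss(0.0, 1.0) for _ in range(n)]
signals = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(50)]
shifts = [random.randrange(n) for _ in range(50)]

def equivariant_layer(f):
    # Correlation followed by ReLU: commutes with cyclic shifts.
    return [max(0.0, v) for v in correlate(psi, f)]

def broken_layer(f):
    # Position-dependent gain deliberately breaks equivariance.
    return [v * (t + 1) for t, v in enumerate(correlate(psi, f))]

delta_equivariant = equivariance_error(equivariant_layer, signals, shifts)
delta_broken = equivariance_error(broken_layer, signals, shifts)
print(delta_equivariant, delta_broken)
```

In this toy setting the equivariant layer gives an error at rounding level, while the deliberately broken layer gives a clearly non-zero error. On the discretized sphere the error is not exactly zero, which is what the authors measure.<br />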
<br />
==MNIST Data==<br />
The experiment using MNIST data was created by projecting MNIST digits onto a sphere using stereographic projection to create the resulting images as seen in Figure 4.<br />
<br />
[[File:paper26-fig4.png|center]]<br />
<br />
The authors created two datasets, one with the projected digits and the other with the same projected digits which were then subjected to a random rotation. The spherical CNN architecture used was <math>\small S^2</math>conv-ReLU-SO(3)conv-ReLU-FC-softmax and was attempted with bandwidths of 30,10,6 and 20,40,10 channels for each layer respectively. This model was compared to a baseline CNN with layers conv-ReLU-conv-ReLU-FC-softmax with 5x5 filters, 32,64,10 channels and stride of 3. For comparison this leads to approximately 68K parameters for the baseline and 58K parameters for the spherical CNN. Results can be seen in Table 1. It is clear from the results that the spherical CNN architecture made the network rotationally invariant. Performance on the rotated set is almost identical to the non-rotated set. This is true even when trained on the non-rotated set and tested on the rotated set. Compare this to the non-spherical architecture which becomes unusable when rotating the digits.<br />
<br />
[[File:paper26-tab1.png|center]]<br />
<br />
==SHREC17==<br />
The SHREC dataset contains 3D models from the ShapeNet dataset which are classified into categories. It consists of a regularly aligned dataset and a rotated dataset. The models from the SHREC17 dataset were projected onto a sphere by means of raycasting. Different properties of the objects obtained from the raycast of the original model and the convex hull of the model make up the different channels which are input into the spherical CNN.<br />
<br />
<br />
[[File:paper26-fig5.png|center]]<br />
<br />
<br />
The network architecture used is an initial <math>\small S^2</math>conv-BN-ReLU block which is followed by two SO(3)conv-BN-ReLU blocks. The output is then fed into a MaxPool-BN block then a linear layer to the output for final classification. An important note is that the max pooling happens over the group SO(3): if <math>f_k</math> is the <math>\small k</math>-th filter in the final layer, the result of pooling is <math>\max_{x \in SO(3)} f_k(x)</math>. 50 features were used for the <math>\small S^2</math> layer, while the two SO(3) layers used 70 and 350 features. Additionally, for each layer the resolution <math>\small b</math> was reduced from 128,32,22 to 7 in the final layer. The architecture for this experiment has ~1.4M parameters, far exceeding the scale of the spherical CNNs in the other experiments.<br />
<br />
This architecture achieves state of the art results on the SHREC17 tasks. The model places 2nd or 3rd in all categories but was not submitted as the SHREC17 task is closed. Table 2 shows the comparison of results with the top 3 submissions in each category. In the table, P@N stands for precision, R@N stands for recall, F1@N stands for F-score, mAP stands for mean average precision, and NDCG stands for normalized discounted cumulative gain in relevance based on whether the category and subcategory labels are predicted correctly. The authors claim the results show empirical proof of the usefulness of spherical CNNs. They elaborate that this is largely due to the fact that most architectures on the SHREC17 competition are highly specialized whereas their model is fairly general.<br />
<br />
<br />
[[File:paper26-tab2.png|center]]<br />
<br />
==Molecular Atomization==<br />
In this experiment a spherical CNN is implemented with an architecture resembling that of ResNet. They use the QM7 dataset (Blum et al. 2009) which has the task of predicting atomization energy of molecules. The QM7 dataset is a subset of GDB-13 (database of organic molecules) composed of all molecules up to 23 atoms. The positions and charges given in the dataset are projected onto the sphere using potential functions. For each atom, a sphere is defined around its position with the radius of the sphere kept uniform across all atoms. Next, the radius is chosen as the minimal radius so no intersections between atoms occur in the training set. Finally, using potential functions, a T channel spherical signal is produced for each atom in the molecule as shown in the figure below. A summary of their results is shown in Table 3 along with some of the spherical CNN architecture details. It shows the different RMSE obtained from different methods. The results from this final experiment also seem to be promising as the network the authors present achieves the second best score. They also note that the first place method grows exponentially with the number of atoms per molecule so is unlikely to scale well.<br />
<br />
[[File:paper26-tab3.png|center]]<br />
<br />
[[File:paper26-f6.png|center]]<br />
<br />
= Conclusions =<br />
This paper presents a novel architecture called Spherical CNNs, a trainable representation for spherical signals that is rotationally equivariant by design, and evaluates it on two important learning problems. The paper defines <math>\small S^2</math> and SO(3) cross correlations, shows the theory behind their rotational invariance for continuous functions, and demonstrates that the invariance also applies to the discrete case. An effective GFFT algorithm was implemented and evaluated on two very different datasets with close to state of the art results, demonstrating that there are practical applications to Spherical CNNs. The network is able to generalize across rotation and generate competitive results in the process.<br />
<br />
For future work the authors believe that improvements can be obtained by generalizing the algorithms to the SE(3) group (SE(3) simply adds translations in 3D space to the SO(3) group). The authors also briefly mention their excitement for applying Spherical CNNs to omnidirectional vision such as in drones and autonomous cars. They state that there is very little publicly available omnidirectional image data which could be why they did not conduct any experiments in this area.<br />
<br />
= Commentary =<br />
The reviews on Spherical CNNs are very positive and it is ranked in the top 1% of papers submitted to ICLR 2018. Positive points are the novelty of the architecture, the wide variety of experiments performed, and the writing. One critique of the original submission is that the related works section only lists, instead of describing, previous methods and that a description of the methods would have provided more clarity. The authors have since expanded the section; however, I found that it is still limited, which the authors attribute to length limitations. Another critique is that the evaluation does not provide enough depth. For example, it would have been great to see an example of omnidirectional vision for spherical networks. However, this is to be expected as it is just the introduction of spherical CNNs and more work is sure to come.<br />
<br />
= Source Code =<br />
Source code is available at:<br />
https://github.com/jonas-koehler/s2cnn<br />
<br />
= Sources =<br />
* T. Cohen et al. Spherical CNNs, 2018.<br />
* J. Feldman. Haar Measure. http://www.math.ubc.ca/~feldman/m606/haar.pdf<br />
* P. Kostelec, D. Rockmore. FFTs on the Rotation Group, 2008.</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Spherical_CNNs&diff=36383Spherical CNNs2018-04-21T02:02:09Z<p>Ws2chen: /* Correlations on the Sphere and Rotation Group */</p>
<hr />
<div>= Introduction =<br />
Convolutional Neural Networks (CNNs), or network architectures involving CNNs, are the current state of the art for learning 2D image processing tasks such as semantic segmentation and object detection. CNNs work well in large part due to the property of being translationally equivariant. This property allows a network trained to detect a certain type of object to still detect the object even if it is translated to another position in the image. However, this does not correspond well to spherical signals since projecting a spherical signal onto a plane will result in distortions, as demonstrated in Figure 1. There are many different types of spherical projections onto a 2D plane, as most people know from the various types of world maps, none of which provide all the necessary properties for rotation-invariant learning. Applications where spherical CNNs can be applied include omnidirectional vision for robots, molecular regression problems, and weather/climate modelling.<br />
<br />
[[File:paper26-fig1.png|center]]<br />
<br />
The implementation of a spherical CNN is challenging for two main reasons: no perfectly symmetrical grid for the sphere exists, which makes it difficult to define the rotation of a spherical filter by one pixel, and the system must remain computationally efficient.<br />
<br />
The main contributions of this paper are the following:<br />
# The theory of spherical CNNs. The authors provide mathematical foundations for rotation equivariance under a spherical framework.<br />
# The first automatically differentiable implementation of the generalized Fourier transform for <math>S^2</math> and SO(3). The provided PyTorch code by the authors is easy to use, fast, and memory efficient.<br />
# The first empirical support for the utility of spherical CNNs for rotation-invariant learning problems. They apply it to spherical MNIST, 3D shape classification, and molecular energy regression.<br />
<br />
=== Note: Translationally equivariant === <br />
<br />
Equivariant to translation means that a translation of input features results in an equivalent translation of outputs. So if your pattern 0,3,2,0,0 on the input results in 0,1,0,0 in the output, then the pattern 0,0,3,2,0 might lead to 0,0,1,0.<br />
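<br />
The numeric example above can be reproduced with a tiny detector. This is a minimal sketch: the filter <code>(1, 1)</code> and the threshold <code>4</code> are hypothetical choices made only so that the outputs match the patterns quoted in this note.<br />

```python
def detect(signal, filt=(1, 1), threshold=4):
    """Slide the filter over the signal ("valid" correlation) and
    mark positions whose response exceeds the threshold."""
    k = len(filt)
    responses = [sum(filt[j] * signal[i + j] for j in range(k))
                 for i in range(len(signal) - k + 1)]
    return [1 if r > threshold else 0 for r in responses]

original = detect([0, 3, 2, 0, 0])  # -> [0, 1, 0, 0]
shifted = detect([0, 0, 3, 2, 0])   # -> [0, 0, 1, 0]
print(original, shifted)
```

Shifting the input pattern by one position shifts the detection by one position, which is exactly what translation equivariance means.<br />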
<br />
= Notation =<br />
Below are listed several important terms:<br />
* '''Unit Sphere''' <math>S^2</math> is defined as the set of points in <math>\mathbb{R}^3</math> at distance 1 from the origin. The unit sphere can be parameterized by the spherical coordinates <math>\alpha \in [0, 2\pi]</math> and <math>\beta \in [0, \pi]</math>. This is a two-dimensional manifold with respect to <math>\alpha</math> and <math>\beta</math>.<br />
* '''<math>S^2</math> Sphere''' The two-dimensional surface of a ball in 3D space.<br />
* '''Spherical Signals''' In the paper spherical images and filters are modeled as continuous functions <math>f : S^2 \to \mathbb{R}^K</math>, where K is the number of channels. Just as RGB images have 3 channels, a spherical signal can have numerous channels describing the data. Examples of channels which were used can be found in the experiments section.<br />
* '''Rotations - SO(3)''' The group of 3D rotations on an <math>S^2</math> sphere. Sometimes called the "special orthogonal group". In this paper the ZYZ-Euler parameterization is used to represent SO(3) rotations with <math>\alpha, \beta</math>, and <math>\gamma</math>. Any rotation can be broken down into first a rotation (<math>\alpha</math>) about the Z-axis, then a rotation (<math>\beta</math>) about the new Y-axis (Y'), followed by a rotation (<math>\gamma</math>) about the new Z axis (Z"). [In the rest of this paper, to integrate functions on SO(3), the authors use a rotationally invariant probability measure on the Borel subsets of SO(3). This measure is an example of a Haar measure. Haar measures generalize the idea of rotationally invariant probability measures to general topological groups. For more on Haar measures, see (Feldman 2002) ]<br />
<br />
= Related Work =<br />
The related work presented in this paper is very brief, in large part due to the novelty of spherical CNNs and the length of the rest of the paper. The authors enumerate numerous papers which attempt to exploit larger groups of symmetries such as the translational symmetries of CNNs but do not go into specific details for any of these attempts. They do state that all the previous works are limited to discrete groups with the exception of SO(2)-steerable networks.<br />
The authors also mention that previous works exist that analyze spherical images but that these do not have an equivariant architecture. They claim that Spherical CNNs are "the first to achieve equivariance to a continuous, non-commutative group (SO(3))". They also claim to be the first to use the generalized Fourier transform for fast computation of group correlations.<br />
<br />
= Correlations on the Sphere and Rotation Group =<br />
Spherical correlation is like planar correlation except instead of translation, there is rotation. The definitions for each are provided as follows:<br />
<br />
'''Planar correlation''' The value of the output feature map at translation <math>\small x ∈ Z^2</math> is computed as an inner product between the input feature map and a filter, shifted by <math>\small x</math>.<br />
<br />
'''The unit sphere''' <math>S^2</math> can be defined as the set of points <math>x ∈ R^3</math> with norm 1. It is a two-dimensional manifold, which can be parameterized by spherical coordinates α ∈ [0, 2π] and β ∈ [0, π]. <br />
<br />
'''Spherical correlation''' The value of the output feature map evaluated at rotation <math>\small R ∈ SO(3)</math> is computed as an inner product between the input feature map and a filter, rotated by <math>\small R</math>.<br />
<br />
'''Rotation of Spherical Signals''' The paper introduces the rotation operator <math>L_R</math>. The rotation operator rotates a function (which allows us to rotate the spherical filters) by evaluating it at <math>R^{-1}x</math>, i.e. <math>[L_R f](x) = f(R^{-1}x)</math>. With this definition we have the property that <math>L_{RR'} = L_R L_{R'}</math>.<br />
<br />
'''Inner Products''' The inner product of two spherical signals is the integral over the entire sphere of their channel-wise product, summed over channels.<br />
<br />
<math>\langle\psi , f \rangle = \int_{S^2} \sum_{k=1}^K \psi_k (x)f_k (x)dx</math><br />
<br />
<math>dx</math> here is SO(3) rotation invariant and is equivalent to <math>d \alpha \sin(\beta) d \beta / 4 \pi </math> in spherical coordinates. This comes from the ZYZ-Euler parameterization where any rotation can be broken down into first a rotation about the Z-axis, then a rotation about the new Y-axis (Y'), followed by a rotation about the new Z axis (Z"). More details on this are given in Appendix A in the paper.<br />
<br />
By this definition, the invariance of the inner product is then guaranteed for any rotation <math>R ∈ SO(3)</math>. In other words, when subjected to rotations, the volume under a spherical heightmap does not change. The following equations show that <math>L_R</math> has a distinct adjoint (<math>L_{R^{-1}}</math>) and that <math>L_R</math> is unitary and thus preserves orthogonality and distances.<br />
<br />
<math>\langle L_R \psi \,, f \rangle = \int_{S^2} \sum_{k=1}^K \psi_k (R^{-1} x)f_k (x)dx</math><br />
<br />
::::<math>= \int_{S^2} \sum_{k=1}^K \psi_k (x)f_k (Rx)dx</math><br />
<br />
::::<math>= \langle \psi , L_{R^{-1}} f \rangle</math><br />
<br />
'''Spherical Correlation''' With the above knowledge the definition of spherical correlation of two signals <math>f</math> and <math>\psi</math> is:<br />
<br />
<math>[\psi \star f](R) = \langle L_R \psi \,, f \rangle = \int_{S^2} \sum_{k=1}^K \psi_k (R^{-1} x)f_k (x)dx</math><br />
<br />
The output of the above equation is a function on SO(3). This can be thought of as follows: for each rotation, given by a combination of <math>\alpha , \beta , \gamma </math>, there is a different volume under the correlation. The authors make a point of noting that previous work by Driscoll and Healy only ensures circular symmetries about the Z axis and their new formulation ensures symmetry about any rotation.<br />
<br />
'''Rotation of SO(3) Signals''' The first layer of a Spherical CNN takes a function on the sphere (<math>S^2</math>) and outputs a function on SO(3). Therefore, if a Spherical CNN with more than one layer is going to be built there needs to be a way to find the correlation between two signals on SO(3). The authors then generalize the rotation operator (<math>L_R</math>) to act on signals on SO(3). This new definition of <math>L_R</math> is as follows: (where <math>R^{-1}Q</math> is a composition of rotations, i.e. multiplication of rotation matrices)<br />
<br />
<math>[L_Rf](Q)=f(R^{-1} Q)</math><br />
<br />
'''Rotation Group Correlation''' The correlation of two signals (<math>f,\psi</math>) on SO(3) with K channels is defined as the following:<br />
<br />
<math>[\psi \star f](R) = \langle L_R \psi , f \rangle = \int_{SO(3)} \sum_{k=1}^K \psi_k (R^{-1} Q)f_k (Q)dQ</math><br />
<br />
where dQ is the rotation-invariant measure on SO(3), which in ZYZ-Euler angles is <math>d \alpha \sin(\beta) d \beta d \gamma / 8 \pi^2 </math>. A complete derivation of this can be found in Appendix A.<br />
<br />
'''Equivariance''' The equivariance for the rotation group correlation is similarly demonstrated. A layer is equivariant if for some operator <math>T_R</math>, <math>\Phi \circ L_R = T_R \circ \Phi</math>, and: <br />
<br />
<math>[\psi \star [L_Qf]](R) = \langle L_R \psi , L_Qf \rangle = \langle L_{Q^{-1} R} \psi , f \rangle = [\psi \star f](Q^{-1}R) = [L_Q[\psi \star f]](R) </math>.<br />
<br />
= Implementation with GFFT =<br />
The authors leverage the Generalized Fourier Transform (GFT) and Generalized Fast Fourier Transform (GFFT) algorithms to compute the correlations outlined in the previous section. The Fast Fourier Transform (FFT) can compute correlations and convolutions efficiently by means of the Fourier theorem. The Fourier theorem states that the Fourier transform of a convolution (or correlation) of two functions can be computed as a product of their Fourier transforms. The FT can be generalized to <math>S^2</math> and SO(3) and is then called the GFT. The GFT is a linear projection of a function onto orthogonal basis functions. The basis functions are matrix elements of irreducible unitary representations of the underlying group. For <math>S^2</math> the basis functions are the spherical harmonics <math>Y_m^l(x)</math>. For SO(3) these basis functions are called the Wigner D-functions <math>D_{mn}^l(R)</math>. For both sets of functions the indices are restricted to <math>l\geq0</math> and <math>-l \leq m,n \leq l</math>. The Wigner D-functions are also orthogonal so the Fourier coefficients can be computed by the inner product with the Wigner D-functions (See Appendix C for complete proof). The Wigner D-functions are complete which means that any function (which is well behaved) on SO(3) can be expressed as a linear combination of the Wigner D-functions. The GFT of a function on SO(3) is thus:<br />
<br />
<math>\hat{f^l} = \int_X f(x) D^l(x)dx</math><br />
<br />
where <math>\hat{f}</math> represents the Fourier coefficients. For <math>S^2</math> we have the same equation but with the basis functions <math>Y^l</math>.<br />
<br />
The inverse SO(3) Fourier transform is:<br />
<br />
<math>f(R)=[\mathcal{F}^{-1} \hat{f}](R) = \sum_{l=0}^b (2l + 1) \sum_{m=-l}^l \sum_{n=-l}^l \hat{f_{mn}^l} D_{mn}^l(R) </math><br />
<br />
The bandwidth b represents the maximum frequency and is related to the resolution of the spatial grid. Kostelec and Rockmore are referenced for more knowledge on this topic.<br />
<br />
The authors give proofs (Appendix D) that the SO(3) correlation satisfies the Fourier theorem and the <math>S^2</math> correlation of spherical signals can be computed by the outer products of the <math>S^2</math>-FTs (Shown in Figure 2).<br />
<br />
[[File:paper26-fig2.png|center]]<br />
<br />
A high-level, approximately-correct, somewhat intuitive explanation of the above figure is that the spherical signal <math> f </math> parameterized over <math> \alpha </math> and <math> \beta </math> having <math> k </math> channels is being correlated with a single filter <math> \psi </math> with the end result being a 3-D feature map on SO(3) (parameterized by Euler angles). The size in <math> \alpha </math> and <math> \beta </math> is the kernel size. The index <math> l </math> going from 0 to 3 corresponds to the degree of the basis functions used in the Fourier transform. As the degree goes up, so does the dimensionality of the vector-valued (for spheres) basis functions. The signals involved are discrete, so the maximum degree (analogous to number of Fourier coefficients) depends on the resolution of the signal. The SO(3) basis functions are matrix-valued, but because <math> S^2 = SO(3)/SO(2) </math>, it ends up that the sphere basis functions correspond to one column in the matrix-valued SO(3) basis functions, which is why the outer product in the figure works.<br />
<br />
The GFFT algorithm details are taken from Kostelec and Rockmore. The authors claim they have the first automatically differentiable implementation of the GFT for <math>S^2</math> and SO(3). The authors do not provide any run time comparisons for real time applications (they only mention that the FFT can be computed in <math>O(n \log n)</math> time as opposed to <math>O(n^2)</math> for the FT) or any comparisons of training times with/without GFFT. However, they do provide the source code of their implementation at: https://github.com/jonas-koehler/s2cnn.<br />
<br />
= Experiments =<br />
The authors provide several experiments. The first set of experiments are designed to show the numerical stability and accuracy of the outlined methods. The second group of experiments demonstrates how the algorithms can be applied to current problem domains.<br />
<br />
==Equivariance Error==<br />
In this experiment the authors try to show experimentally that their theory of equivariance holds. Equivariance was proven only for the continuous case, so they had doubts about whether it would hold in practice given potential discretization artifacts; if it did not hold, the weight sharing scheme would become less effective. The experiment is set up by first testing the equivariance of the SO(3) correlation at different resolutions. 500 random rotations and feature maps (with 10 channels) are sampled. They then calculate the approximation error <math>\small\Delta = \dfrac{1}{n} \sum_{i=1}^n \operatorname{std}(L_{R_i} \Phi(f_i) - \Phi(L_{R_i} f_i))/\operatorname{std}(\Phi(f_i))</math><br />
Note: The authors do not define the std function; it is most likely the standard deviation, as 'std' is the command for standard deviation in MATLAB.<br />
<math>\Phi</math> is a composition of SO(3) correlation layers with randomly initialized filters. In the case of perfect equivariance <math>\Delta</math> would be zero, because, as proven earlier, the two terms <math>\small L_{R_i} \Phi(f_i)</math> and <math>\small \Phi(L_{R_i} f_i)</math> are equal in the continuous case. The results are shown in Figure 3. <br />
<br />
[[File:paper26-fig3.png|center]]<br />
<br />
<math>\Delta</math> grows with the resolution and the number of layers when there is no activation function. With ReLU activations the error is higher but stays roughly constant. The authors conclude that the error comes from the feature map rotation rather than from the network layers, since that rotation is exact only for bandlimited functions.<br />
<br />
==MNIST Data==<br />
The experiment using MNIST data was created by projecting MNIST digits onto a sphere using stereographic projection to create the resulting images as seen in Figure 4.<br />
<br />
[[File:paper26-fig4.png|center]]<br />
<br />
The authors created two datasets, one with the projected digits and the other with the same projected digits which were then subjected to a random rotation. The spherical CNN architecture used was <math>\small S^2</math>conv-ReLU-SO(3)conv-ReLU-FC-softmax and was attempted with bandwidths of 30,10,6 and 20,40,10 channels for each layer respectively. This model was compared to a baseline CNN with layers conv-ReLU-conv-ReLU-FC-softmax with 5x5 filters, 32,64,10 channels and stride of 3. For comparison this leads to approximately 68K parameters for the baseline and 58K parameters for the spherical CNN. Results can be seen in Table 1. It is clear from the results that the spherical CNN architecture made the network rotationally invariant. Performance on the rotated set is almost identical to the non-rotated set. This is true even when trained on the non-rotated set and tested on the rotated set. Compare this to the non-spherical architecture which becomes unusable when rotating the digits.<br />
<br />
[[File:paper26-tab1.png|center]]<br />
<br />
==SHREC17==<br />
The SHREC dataset contains 3D models from the ShapeNet dataset which are classified into categories. It consists of a regularly aligned dataset and a rotated dataset. The models from the SHREC17 dataset were projected onto a sphere by means of raycasting. Different properties of the objects obtained from the raycast of the original model and the convex hull of the model make up the different channels which are input into the spherical CNN.<br />
<br />
<br />
[[File:paper26-fig5.png|center]]<br />
<br />
<br />
The network architecture used is an initial <math>\small S^2</math>conv-BN-ReLU block which is followed by two SO(3)conv-BN-ReLU blocks. The output is then fed into a MaxPool-BN block then a linear layer to the output for final classification. An important note is that the max pooling happens over the group SO(3): if <math>f_k</math> is the <math>\small k</math>-th filter in the final layer, the result of pooling is <math>\max_{x \in SO(3)} f_k(x)</math>. 50 features were used for the <math>\small S^2</math> layer, while the two SO(3) layers used 70 and 350 features. Additionally, for each layer the resolution <math>\small b</math> was reduced from 128,32,22 to 7 in the final layer. The architecture for this experiment has ~1.4M parameters, far exceeding the scale of the spherical CNNs in the other experiments.<br />
<br />
This architecture achieves results competitive with the state of the art on the SHREC17 tasks. The model places 2nd or 3rd in all categories but was not submitted as the SHREC17 task is closed. Table 2 shows the comparison of results with the top 3 submissions in each category. In the table, P@N stands for precision, R@N stands for recall, F1@N stands for F-score, mAP stands for mean average precision, and NDCG stands for normalized discounted cumulative gain in relevance based on whether the category and subcategory labels are predicted correctly. The authors claim the results show empirical support for the usefulness of spherical CNNs. They elaborate that this is largely due to the fact that most architectures on the SHREC17 competition are highly specialized whereas their model is fairly general.<br />
<br />
<br />
[[File:paper26-tab2.png|center]]<br />
<br />
==Molecular Atomization==<br />
In this experiment a spherical CNN is implemented with an architecture resembling that of ResNet. They use the QM7 dataset (Blum et al. 2009) which has the task of predicting atomization energy of molecules. The QM7 dataset is a subset of GDB-13 (database of organic molecules) composed of all molecules up to 23 atoms. The positions and charges given in the dataset are projected onto the sphere using potential functions. For each atom, a sphere is defined around its position with the radius of the sphere kept uniform across all atoms. Next, the radius is chosen as the minimal radius so no intersections between atoms occur in the training set. Finally, using potential functions, a T channel spherical signal is produced for each atom in the molecule as shown in the figure below. A summary of their results is shown in Table 3 along with some of the spherical CNN architecture details. It shows the different RMSE obtained from different methods. The results from this final experiment also seem to be promising as the network the authors present achieves the second best score. They also note that the first place method grows exponentially with the number of atoms per molecule so is unlikely to scale well.<br />
<br />
[[File:paper26-tab3.png|center]]<br />
<br />
[[File:paper26-f6.png|center]]<br />
<br />
= Conclusions =<br />
This paper presents a novel architecture called Spherical CNNs, a trainable representation for spherical signals that is rotationally equivariant by design, and evaluates it on two important learning problems. The paper defines <math>\small S^2</math> and SO(3) cross correlations, shows the theory behind their rotational equivariance for continuous functions, and demonstrates empirically that the equivariance approximately carries over to the discrete case. An efficient GFFT algorithm was implemented and evaluated on two very different datasets with close to state of the art results, demonstrating that there are practical applications for Spherical CNNs. The network is able to generalize across rotations and produce competitive results in the process.<br />
<br />
For future work the authors believe that improvements can be obtained by generalizing the algorithms to the SE(3) group (SE(3) simply adds translations in 3D space to the SO(3) group). The authors also briefly mention their excitement for applying Spherical CNNs to omnidirectional vision such as in drones and autonomous cars. They state that there is very little publicly available omnidirectional image data which could be why they did not conduct any experiments in this area.<br />
<br />
= Commentary =<br />
The reviews on Spherical CNNs are very positive and it is ranked in the top 1% of papers submitted to ICLR 2018. Positive points are the novelty of the architecture, the wide variety of experiments performed, and the writing. One critique of the original submission is that the related works section only lists, instead of describing, previous methods and that a description of the methods would have provided more clarity. The authors have since expanded the section; however, I found that it is still limited, which the authors attribute to length limitations. Another critique is that the evaluation does not provide enough depth. For example, it would have been great to see an example of omnidirectional vision for spherical networks. However, this is to be expected as it is just the introduction of spherical CNNs and more work is sure to come.<br />
<br />
= Source Code =<br />
Source code is available at:<br />
https://github.com/jonas-koehler/s2cnn<br />
<br />
= Sources =<br />
* T. Cohen et al. Spherical CNNs, 2018.<br />
* J. Feldman. Haar Measure. http://www.math.ubc.ca/~feldman/m606/haar.pdf<br />
* P. Kostelec, D. Rockmore. FFTs on the Rotation Group, 2008.</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Spherical_CNNs&diff=36382Spherical CNNs2018-04-21T02:01:44Z<p>Ws2chen: /* Correlations on the Sphere and Rotation Group */</p>
<hr />
<div>= Introduction =<br />
Convolutional Neural Networks (CNNs), or network architectures involving CNNs, are the current state of the art for learning 2D image processing tasks such as semantic segmentation and object detection. CNNs work well in large part due to the property of being translationally equivariant. This property allows a network trained to detect a certain type of object to still detect the object even if it is translated to another position in the image. However, this does not correspond well to spherical signals since projecting a spherical signal onto a plane will result in distortions, as demonstrated in Figure 1. There are many different types of spherical projections onto a 2D plane, as most people know from the various types of world maps, none of which provide all the necessary properties for rotation-invariant learning. Applications where spherical CNNs can be applied include omnidirectional vision for robots, molecular regression problems, and weather/climate modelling.<br />
<br />
[[File:paper26-fig1.png|center]]<br />
<br />
The implementation of a spherical CNN is challenging for two main reasons: no perfectly symmetrical grid for the sphere exists, which makes it difficult to define the rotation of a spherical filter by one pixel, and the system must remain computationally efficient.<br />
<br />
The main contributions of this paper are the following:<br />
# The theory of spherical CNNs. The authors provide mathematical foundations for rotation equivariance under a spherical framework.<br />
# The first automatically differentiable implementation of the generalized Fourier transform for <math>S^2</math> and SO(3). The provided PyTorch code by the authors is easy to use, fast, and memory efficient.<br />
# The first empirical support for the utility of spherical CNNs for rotation-invariant learning problems. They apply it to spherical MNIST, 3D shape classification, and molecular energy regression.<br />
<br />
=== Note: Translationally equivariant === <br />
<br />
Equivariant to translation means that a translation of input features results in an equivalent translation of outputs. So if your pattern 0,3,2,0,0 on the input results in 0,1,0,0 in the output, then the pattern 0,0,3,2,0 might lead to 0,0,1,0.<br />
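<br />
The note above can be checked numerically. The following is a minimal sketch, not taken from the paper: it uses a circular 1D cross-correlation with a hypothetical two-tap filter (an assumption chosen only for illustration) and confirms that shifting the input and then correlating gives the same result as correlating and then shifting the output.<br />

```python
import numpy as np

def correlate_circular(signal, kernel):
    """Circular cross-correlation: out[x] = sum_j kernel[j] * signal[(x + j) % n]."""
    n = len(signal)
    return np.array([
        sum(kernel[j] * signal[(x + j) % n] for j in range(len(kernel)))
        for x in range(n)
    ])

signal = np.array([0, 3, 2, 0, 0])   # the input pattern from the note above
kernel = np.array([1, -1])           # hypothetical filter, for illustration only
shift = 1

# Correlate first and translate the output afterwards ...
out_then_shift = np.roll(correlate_circular(signal, kernel), shift)
# ... or translate the input first and correlate afterwards.
shift_then_out = correlate_circular(np.roll(signal, shift), kernel)

# Translation equivariance: both orders give the same feature map.
print(np.array_equal(out_then_shift, shift_then_out))
```

The circular boundary is a simplifying assumption; with zero padding the equality holds only away from the borders.<br />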
<br />
= Notation =<br />
Below are listed several important terms:<br />
* '''Unit Sphere''' <math>S^2</math> is defined as a sphere where all of its points are distance of 1 from the origin. The unit sphere can be parameterized by the spherical coordinates <math>\alpha ∈ [0, 2π]</math> and <math>β ∈ [0, π]</math>. This is a two-dimensional manifold with respect to <math>\alpha</math> and <math>β</math>.<br />
* '''<math>S^2</math> Sphere''' The two-dimensional surface of a ball in 3D space.<br />
* '''Spherical Signals''' In the paper spherical images and filters are modeled as continuous functions <math>f : S^2 → \mathbb{R}^K</math>, where K is the number of channels. Just as RGB images have 3 channels, a spherical signal can have numerous channels describing the data. Examples of channels which were used can be found in the experiments section.<br />
* '''Rotations - SO(3)''' The group of 3D rotations on an <math>S^2</math> sphere. Sometimes called the "special orthogonal group". In this paper the ZYZ-Euler parameterization is used to represent SO(3) rotations with <math>\alpha, \beta</math>, and <math>\gamma</math>. Any rotation can be broken down into first a rotation (<math>\alpha</math>) about the Z-axis, then a rotation (<math>\beta</math>) about the new Y-axis (Y'), followed by a rotation (<math>\gamma</math>) about the new Z axis (Z"). [In the rest of this paper, to integrate functions on SO(3), the authors use a rotationally invariant probability measure on the Borel subsets of SO(3). This measure is an example of a Haar measure. Haar measures generalize the idea of rotationally invariant probability measures to general topological groups. For more on Haar measures, see (Feldman 2002) ]<br />
<br />
= Related Work =<br />
The related work presented in this paper is very brief, in large part due to the novelty of spherical CNNs and the length of the rest of the paper. The authors enumerate numerous papers which attempt to exploit larger groups of symmetries such as the translational symmetries of CNNs but do not go into specific details for any of these attempts. They do state that all the previous works are limited to discrete groups with the exception of SO(2)-steerable networks.<br />
The authors also mention that previous works exist that analyze spherical images but that these do not have an equivariant architecture. They claim that Spherical CNNs are "the first to achieve equivariance to a continuous, non-commutative group (SO(3))". They also claim to be the first to use the generalized Fourier transform for fast computation of group correlations.<br />
<br />
= Correlations on the Sphere and Rotation Group =<br />
Spherical correlation is like planar correlation except instead of translation, there is rotation. The definitions for each are provided as follows:<br />
<br />
'''Planar correlation''' The value of the output feature map at translation <math>\small x ∈ Z^2</math> is computed as an inner product between the input feature map and a filter, shifted by <math>\small x</math>.<br />
<br />
'''The unit sphere''' <math>S^2</math> can be defined as the set of points <math>x ∈ R^3</math> with norm 1. It is a two-dimensional manifold, which can be parameterized by spherical coordinates α ∈ [0, 2π] and β ∈ [0, π]. <br />
<br />
'''Spherical correlation''' The value of the output feature map evaluated at rotation <math>\small R ∈ SO(3)</math> is computed as an inner product between the input feature map and a filter, rotated by <math>\small R</math>.<br />
<br />
'''Rotation of Spherical Signals''' The paper introduces the rotation operator <math>L_R</math>, which takes a function <math>f</math> (such as a spherical filter) and produces its rotated version by composing it with <math>R^{-1}</math>: <math>[L_R f](x) = f(R^{-1}x)</math>. With this definition we have the property that <math>L_{RR'} = L_R L_{R'}</math>.<br />
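<br />
As a sanity check of the composition property, the sketch below implements the rotation operator as "evaluate the function at the inversely rotated point" and compares both sides at one point of the sphere. The matrix convention Z(alpha) Y(beta) Z(gamma) for the ZYZ-Euler angles and the test signal <code>f</code> are assumptions made for illustration, not code from the paper.<br />

```python
import numpy as np

def rot_z(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_y(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def zyz(alpha, beta, gamma):
    """Rotation matrix Z(alpha) Y(beta) Z(gamma) (assumed ZYZ convention)."""
    return rot_z(alpha) @ rot_y(beta) @ rot_z(gamma)

def rotate(R, f):
    """Return L_R f, i.e. the function x -> f(R^{-1} x)."""
    R_inv = R.T  # the inverse of a rotation matrix is its transpose
    return lambda x: f(R_inv @ x)

def f(x):
    """Hypothetical single-channel signal, evaluated at a point x of the sphere."""
    return x[0] + 2.0 * x[1] * x[2]

R1 = zyz(0.3, 1.1, -0.7)
R2 = zyz(-1.2, 0.4, 2.0)
x = np.array([1.0, 2.0, 2.0]) / 3.0   # a unit vector, so x lies on S^2

lhs = rotate(R1 @ R2, f)(x)           # [L_{R1 R2} f](x)
rhs = rotate(R1, rotate(R2, f))(x)    # [L_{R1} L_{R2} f](x)
print(np.isclose(lhs, rhs))
```

Rotating by <math>R^{-1}</math> inside the argument is what makes the operators compose in the same order as the rotations themselves.<br />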
<br />
'''Inner Products''' The inner product of spherical signals is simply the integral summation on the vector space over the entire sphere.<br />
<br />
<math>\langle\psi , f \rangle = \int_{S^2} \sum_{k=1}^K \psi_k (x)f_k (x)dx</math><br />
<br />
<math>dx</math> here is SO(3) rotation invariant and is equivalent to <math>d \alpha \sin(\beta) d \beta / 4 \pi </math> in spherical coordinates. This comes from the ZYZ-Euler parameterization where any rotation can be broken down into first a rotation about the Z-axis, then a rotation about the new Y-axis (Y'), followed by a rotation about the new Z axis (Z"). More details on this are given in Appendix A in the paper.<br />
<br />
By this definition, the invariance of the inner product is then guaranteed for any rotation <math>R ∈ SO(3)</math>. In other words, when subjected to rotations, the volume under a spherical heightmap does not change. The following equations show that <math>L_R</math> has a distinct adjoint (<math>L_{R^{-1}}</math>) and that <math>L_R</math> is unitary and thus preserves orthogonality and distances.<br />
<br />
<math>\langle L_R \psi \,, f \rangle = \int_{S^2} \sum_{k=1}^K \psi_k (R^{-1} x)f_k (x)dx</math><br />
<br />
::::<math>= \int_{S^2} \sum_{k=1}^K \psi_k (x)f_k (Rx)dx</math><br />
<br />
::::<math>= \langle \psi , L_{R^{-1}} f \rangle</math><br />
<br />
'''Spherical Correlation''' With the above knowledge the definition of spherical correlation of two signals <math>f</math> and <math>\psi</math> is:<br />
<br />
<math>[\psi \star f](R) = \langle L_R \psi \,, f \rangle = \int_{S^2} \sum_{k=1}^K \psi_k (R^{-1} x)f_k (x)dx</math><br />
<br />
The output of the above equation is a function on SO(3): each rotation, given by a combination of <math>\alpha , \beta , \gamma </math>, yields its own correlation value. The authors make a point of noting that previous work by Driscoll and Healy only ensures circular symmetries about the Z axis, whereas their new formulation ensures symmetry about any rotation.<br />
<br />
'''Rotation of SO(3) Signals''' The first layer of Spherical CNNs take a function on the sphere (<math>S^2</math>) and output a function on SO(3). Therefore, if a Spherical CNN with more than one layer is going to be built there needs to be a way to find the correlation between two signals on SO(3). The authors then generalize the rotation operator (<math>L_R</math>) to encompass acting on signals from SO(3). This new definition of <math>L_R</math> is as follows: (where <math>R^{-1}Q</math> is a composition of rotations, i.e. multiplication of rotation matrices)<br />
<br />
<math>[L_Rf](Q)=f(R^{-1} Q)</math><br />
<br />
'''Rotation Group Correlation''' The correlation of two signals (<math>f,\psi</math>) on SO(3) with K channels is defined as the following:<br />
<br />
<math>[\psi \star f](R) = \langle L_R \psi , f \rangle = \int_{SO(3)} \sum_{k=1}^K \psi_k (R^{-1} Q)f_k (Q)dQ</math><br />
<br />
where <math>dQ</math> is the invariant measure on SO(3), which in ZYZ-Euler angles is <math>d \alpha \sin(\beta) d \beta d \gamma / 8 \pi^2 </math>. A complete derivation of this can be found in Appendix A.<br />
<br />
'''Equivariance''' The equivariance for the rotation group correlation is similarly demonstrated. A layer is equivariant if for some operator <math>T_R</math>, <math>\Phi \circ L_R = T_R \circ \Phi</math>, and: <br />
<br />
<math>[\psi \star [L_Qf]](R) = \langle L_R \psi , L_Qf \rangle = \langle L_{Q^{-1} R} \psi , f \rangle = [\psi \star f](Q^{-1}R) = [L_Q[\psi \star f]](R) </math>.<br />
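<br />
The second equality in this chain is the step that carries the argument, so it may help to spell it out. Assuming only the adjoint relation <math>\langle L_R \psi , f \rangle = \langle \psi , L_{R^{-1}} f \rangle</math> and the composition rule <math>L_{RR'} = L_R L_{R'}</math> from the definitions above, a worked version reads:<br />
<br />
<math>\langle L_R \psi , L_Q f \rangle = \langle L_{Q^{-1}} L_R \psi , f \rangle = \langle L_{Q^{-1}R} \psi , f \rangle</math><br />
<br />
The first step moves <math>L_Q</math> to the other side of the inner product as its adjoint <math>L_{Q^{-1}}</math>, and the second merges the two rotation operators into one. The final equality of the chain is then just the definition of the rotation operator on SO(3), <math>[L_Q g](R) = g(Q^{-1}R)</math>, applied to <math>g = \psi \star f</math>.<br />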
<br />
= Implementation with GFFT =<br />
The authors leverage the Generalized Fourier Transform (GFT) and Generalized Fast Fourier Transform (GFFT) algorithms to compute the correlations outlined in the previous section. The Fast Fourier Transform (FFT) can compute correlations and convolutions efficiently by means of the Fourier theorem, which states that a correlation in the spatial domain corresponds to a pointwise product of Fourier coefficients. The underlying Fourier transform expresses a function as a linear combination of sine and cosine basis functions, whose weights are the Fourier coefficients. The FT can be generalized to <math>S^2</math> and SO(3) and is then called the GFT. The GFT is a linear projection of a function onto orthogonal basis functions. The basis functions are a set of irreducible unitary representations for a group (such as for <math>S^2</math> or SO(3)). For <math>S^2</math> the basis functions are the spherical harmonics <math>Y_m^l(x)</math>. For SO(3) these basis functions are called the Wigner D-functions <math>D_{mn}^l(R)</math>. For both sets of functions the indices are restricted to <math>l\geq0</math> and <math>-l \leq m,n \leq l</math>. The Wigner D-functions are also orthogonal so the Fourier coefficients can be computed by the inner product with the Wigner D-functions (See Appendix C for complete proof). The Wigner D-functions are complete which means that any function (which is well behaved) on SO(3) can be expressed as a linear combination of the Wigner D-functions. The GFT of a function on SO(3) is thus:<br />
<br />
<math>\hat{f^l} = \int_X f(x) D^l(x)dx</math><br />
<br />
where <math>\hat{f}</math> represents the Fourier coefficients. For <math>S^2</math> we have the same equation but with the basis functions <math>Y^l</math>.<br />
<br />
The inverse SO(3) Fourier transform is:<br />
<br />
<math>f(R)=[\mathcal{F}^{-1} \hat{f}](R) = \sum_{l=0}^b (2l + 1) \sum_{m=-l}^l \sum_{n=-l}^l \hat{f_{mn}^l} D_{mn}^l(R) </math><br />
<br />
The bandwidth b represents the maximum frequency and is related to the resolution of the spatial grid. Kostelec and Rockmore are referenced for more knowledge on this topic.<br />
<br />
The authors give proofs (Appendix D) that the SO(3) correlation satisfies the Fourier theorem and the <math>S^2</math> correlation of spherical signals can be computed by the outer products of the <math>S^2</math>-FTs (Shown in Figure 2).<br />
<br />
[[File:paper26-fig2.png|center]]<br />
<br />
A high-level, approximately-correct, somewhat intuitive explanation of the above figure is that the spherical signal <math> f </math> parameterized over <math> \alpha </math> and <math> \beta </math> having <math> k </math> channels is being correlated with a single filter <math> \psi </math> with the end result being a 3-D feature map on SO(3) (parameterized by Euler angles). The size in <math> \alpha </math> and <math> \beta </math> is the kernel size. The index <math> l </math> going from 0 to 3 corresponds to the degree of the basis functions used in the Fourier transform. As the degree goes up, so does the dimensionality of the vector-valued (for spheres) basis functions. The signals involved are discrete, so the maximum degree (analogous to number of Fourier coefficients) depends on the resolution of the signal. The SO(3) basis functions are matrix-valued, but because <math> S^2 = SO(3)/SO(2) </math>, it turns out that the sphere basis functions correspond to one column in the matrix-valued SO(3) basis functions, which is why the outer product in the figure works.<br />
<br />
The GFFT algorithm details are taken from Kostelec and Rockmore. The authors claim they have the first automatically differentiable implementation of the GFT for <math>S^2</math> and SO(3). The authors do not provide any run time comparisons for real time applications (they just mentioned that FFT can be computed in <math>O(n\mathrm{log}n)</math> time as opposed to <math>O(n^2)</math> for FT) or any comparisons on training times with/without GFFT. However, they do provide the source code of their implementation at: https://github.com/jonas-koehler/s2cnn.<br />
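<br />
The idea behind the speed-up can be illustrated on the ordinary planar (here 1D, circular) analogue. The sketch below is not the authors' GFFT and does not use their library; it only shows, with NumPy's standard FFT, that a correlation computed directly in <math>O(n^2)</math> matches the one obtained from a pointwise product of Fourier coefficients. The function names are hypothetical.<br />

```python
import numpy as np

def correlate_direct(psi, f):
    """O(n^2) circular cross-correlation: out[x] = sum_y psi[(y - x) % n] * f[y]."""
    n = len(f)
    return np.array([
        sum(psi[(y - x) % n] * f[y] for y in range(n))
        for x in range(n)
    ])

def correlate_fft(psi, f):
    """O(n log n) version: multiply Fourier coefficients pointwise, then invert."""
    # For real psi, shifting and correlating corresponds to conj(FFT(psi)) * FFT(f).
    return np.fft.ifft(np.conj(np.fft.fft(psi)) * np.fft.fft(f)).real

rng = np.random.default_rng(0)
psi = rng.standard_normal(16)   # filter
f = rng.standard_normal(16)     # signal

print(np.allclose(correlate_direct(psi, f), correlate_fft(psi, f)))
```

On <math>S^2</math> and SO(3) the complex exponentials are replaced by spherical harmonics and Wigner D-functions, and the pointwise product by the block-wise products described above.<br />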
<br />
= Experiments =<br />
The authors provide several experiments. The first set of experiments is designed to show the numerical stability and accuracy of the outlined methods. The second group of experiments demonstrates how the algorithms can be applied to current problem domains.<br />
<br />
==Equivariance Error==<br />
In this experiment the authors test empirically whether their theory of equivariance holds in practice. Equivariance was proven only for the continuous case, so discretization artifacts could break it, and the consequence would be a less effective weight sharing scheme. The experiment tests the equivariance of the SO(3) correlation at different resolutions. 500 random rotations and feature maps (with 10 channels) are sampled, and the approximation error is calculated as <math>\small\Delta = \dfrac{1}{n} \sum_{i=1}^n std(L_{R_i} \Phi(f_i) - \Phi(L_{R_i} f_i))/std(\Phi(f_i))</math><br />
Note: The authors do not define the std function; it is most likely the standard deviation, as 'std' is the name of the standard deviation function in MATLAB.<br />
<math>\Phi</math> is a composition of SO(3) correlation layers with randomly initialized filters. Under perfect equivariance <math>\Delta</math> would be zero because, as proven earlier, the two terms in the difference <math>\small L_{R_i} \Phi(f_i) - \Phi(L_{R_i} f_i)</math> are equal in the continuous case. The results are shown in Figure 3. <br />
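<br />
To make the error measure concrete, here is a minimal sketch that evaluates the same quantity on a planar stand-in: a single circular correlation layer plays the role of the network, cyclic shifts play the role of the rotations, and std is taken to be the standard deviation (an assumption, as noted above). This is not the authors' SO(3) experiment; in this discrete setting the shift is exact, so the measured error should sit at floating-point level.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_samples = 32, 500

def phi(f, psi):
    """Planar stand-in for the network: circular cross-correlation via the FFT."""
    return np.fft.ifft(np.conj(np.fft.fft(psi)) * np.fft.fft(f)).real

psi = rng.standard_normal(n_points)   # randomly initialized filter

ratios = []
for _ in range(n_samples):
    f = rng.standard_normal(n_points)          # random feature map
    shift = int(rng.integers(n_points))        # random group element
    # Difference between "transform the output" and "transform the input".
    residual = np.roll(phi(f, psi), shift) - phi(np.roll(f, shift), psi)
    ratios.append(np.std(residual) / np.std(phi(f, psi)))

delta = float(np.mean(ratios))
print(delta < 1e-10)
```

On the sphere the rotation of a sampled feature map involves interpolation, which is where a non-zero error can enter.<br />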
<br />
[[File:paper26-fig3.png|center]]<br />
<br />
<math>\Delta</math> grows with the resolution and the number of layers when there is no activation function. With ReLU activations the error is higher but stays roughly constant. The authors conclude that the error comes from the feature map rotation rather than from the network layers, since that rotation is exact only for bandlimited functions.<br />
<br />
==MNIST Data==<br />
The experiment using MNIST data was created by projecting MNIST digits onto a sphere using stereographic projection to create the resulting images as seen in Figure 4.<br />
<br />
[[File:paper26-fig4.png|center]]<br />
<br />
The authors created two datasets, one with the projected digits and the other with the same projected digits which were then subjected to a random rotation. The spherical CNN architecture used was <math>\small S^2</math>conv-ReLU-SO(3)conv-ReLU-FC-softmax and was attempted with bandwidths of 30,10,6 and 20,40,10 channels for each layer respectively. This model was compared to a baseline CNN with layers conv-ReLU-conv-ReLU-FC-softmax with 5x5 filters, 32,64,10 channels and stride of 3. For comparison this leads to approximately 68K parameters for the baseline and 58K parameters for the spherical CNN. Results can be seen in Table 1. It is clear from the results that the spherical CNN architecture made the network rotationally invariant. Performance on the rotated set is almost identical to the non-rotated set. This is true even when trained on the non-rotated set and tested on the rotated set. Compare this to the non-spherical architecture which becomes unusable when rotating the digits.<br />
<br />
[[File:paper26-tab1.png|center]]<br />
<br />
==SHREC17==<br />
The SHREC dataset contains 3D models from the ShapeNet dataset which are classified into categories. It consists of a regularly aligned dataset and a rotated dataset. The models from the SHREC17 dataset were projected onto a sphere by means of raycasting. Different properties of the objects obtained from the raycast of the original model and the convex hull of the model make up the different channels which are input into the spherical CNN.<br />
<br />
<br />
[[File:paper26-fig5.png|center]]<br />
<br />
<br />
The network architecture used is an initial <math>\small S^2</math>conv-BN-ReLU block which is followed by two SO(3)conv-BN-ReLU blocks. The output is then fed into a MaxPool-BN block then a linear layer to the output for final classification. An important note is that the max pooling happens over the group SO(3): if <math>f_k</math> is the <math>\small k</math>-th filter in the final layer, the result of pooling is <math>\max_{x \in SO(3)} f_k(x)</math>. 50 features were used for the <math>\small S^2</math> layer, while the two SO(3) layers used 70 and 350 features. Additionally, for each layer the resolution <math>\small b</math> was reduced from 128,32,22 to 7 in the final layer. The architecture for this experiment has ~1.4M parameters, far exceeding the scale of the spherical CNNs in the other experiments.<br />
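<br />
Pooling over the whole group is what turns an equivariant feature map into an invariant descriptor. The sketch below illustrates this with a small cyclic group standing in for SO(3) (an assumption made to keep the example short): transforming the feature map only permutes its values along the group axis, so the per-channel maximum does not change.<br />

```python
import numpy as np

rng = np.random.default_rng(1)

# 4 channels, each a function on the cyclic group Z_12 (stand-in for SO(3)).
feature_map = rng.standard_normal((4, 12))

# Acting with a group element permutes the samples along the group axis.
transformed = np.roll(feature_map, 5, axis=1)

# Max pooling over the group: one number per channel.
pooled = feature_map.max(axis=1)
pooled_transformed = transformed.max(axis=1)

print(np.array_equal(pooled, pooled_transformed))
```

The same reasoning applies to any pooling function that ignores the order of the group samples, such as the mean.<br />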
<br />
This architecture achieves results competitive with the state of the art on the SHREC17 tasks. The model places 2nd or 3rd in all categories but was not submitted as the SHREC17 task is closed. Table 2 shows the comparison of results with the top 3 submissions in each category. In the table, P@N stands for precision, R@N stands for recall, F1@N stands for F-score, mAP stands for mean average precision, and NDCG stands for normalized discounted cumulative gain in relevance based on whether the category and subcategory labels are predicted correctly. The authors claim the results show empirical support for the usefulness of spherical CNNs. They elaborate that this is largely due to the fact that most architectures on the SHREC17 competition are highly specialized whereas their model is fairly general.<br />
<br />
<br />
[[File:paper26-tab2.png|center]]<br />
<br />
==Molecular Atomization==<br />
In this experiment a spherical CNN is implemented with an architecture resembling that of ResNet. They use the QM7 dataset (Blum et al. 2009) which has the task of predicting atomization energy of molecules. The QM7 dataset is a subset of GDB-13 (database of organic molecules) composed of all molecules up to 23 atoms. The positions and charges given in the dataset are projected onto the sphere using potential functions. For each atom, a sphere is defined around its position with the radius of the sphere kept uniform across all atoms. Next, the radius is chosen as the minimal radius so no intersections between atoms occur in the training set. Finally, using potential functions, a T channel spherical signal is produced for each atom in the molecule as shown in the figure below. A summary of their results is shown in Table 3 along with some of the spherical CNN architecture details. It shows the different RMSE obtained from different methods. The results from this final experiment also seem to be promising as the network the authors present achieves the second best score. They also note that the first place method grows exponentially with the number of atoms per molecule so is unlikely to scale well.<br />
<br />
[[File:paper26-tab3.png|center]]<br />
<br />
[[File:paper26-f6.png|center]]<br />
<br />
= Conclusions =<br />
This paper presents a novel architecture called Spherical CNNs, a trainable representation for spherical signals that is rotationally equivariant by design, and evaluates it on two important learning problems. The paper defines <math>\small S^2</math> and SO(3) cross correlations, shows the theory behind their rotational equivariance for continuous functions, and demonstrates empirically that the equivariance approximately carries over to the discrete case. An efficient GFFT algorithm was implemented and evaluated on two very different datasets with close to state of the art results, demonstrating that there are practical applications for Spherical CNNs. The network is able to generalize across rotations and produce competitive results in the process.<br />
<br />
For future work the authors believe that improvements can be obtained by generalizing the algorithms to the SE(3) group (SE(3) simply adds translations in 3D space to the SO(3) group). The authors also briefly mention their excitement for applying Spherical CNNs to omnidirectional vision such as in drones and autonomous cars. They state that there is very little publicly available omnidirectional image data which could be why they did not conduct any experiments in this area.<br />
<br />
= Commentary =<br />
The reviews on Spherical CNNs are very positive and it is ranked in the top 1% of papers submitted to ICLR 2018. Positive points are the novelty of the architecture, the wide variety of experiments performed, and the writing. One critique of the original submission is that the related works section only lists, instead of describing, previous methods and that a description of the methods would have provided more clarity. The authors have since expanded the section; however, I found that it is still limited, which the authors attribute to length limitations. Another critique is that the evaluation does not provide enough depth. For example, it would have been great to see an example of omnidirectional vision for spherical networks. However, this is to be expected as it is just the introduction of spherical CNNs and more work is sure to come.<br />
<br />
= Source Code =<br />
Source code is available at:<br />
https://github.com/jonas-koehler/s2cnn<br />
<br />
= Sources =<br />
* T. Cohen et al. Spherical CNNs, 2018.<br />
* J. Feldman. Haar Measure. http://www.math.ubc.ca/~feldman/m606/haar.pdf<br />
* P. Kostelec, D. Rockmore. FFTs on the Rotation Group, 2008.</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wasserstein_Auto-Encoders&diff=36375Wasserstein Auto-Encoders2018-04-20T22:12:31Z<p>Ws2chen: /* GANs and WAEs */</p>
<hr />
<div><br />
= Introduction =<br />
Recent years have seen a convergence of two previously distinct approaches: representation learning from high dimensional data, and unsupervised generative modeling. In the field that formed at their intersection, Variational Auto-Encoders (VAEs) and Generative Adversarial Networks (GANs) have emerged to become well-established. VAEs are theoretically elegant but with the drawback that they tend to generate blurry samples when applied to natural images. GANs on the other hand produce better visual quality of sampled images, but come without an encoder, are harder to train and suffer from the mode-collapse problem when the trained model is unable to capture all the variability in the true data distribution. There has been recent research in generating encoder-decoder GANs where an encoder is trained in parallel with the generator, based on the intuition that this will allow the GAN to learn meaningful mapping from the compressed representation to the original image; however, these models also suffer from mode-collapse and perform comparable to vanilla GANs. Thus there has been a push to come up with the best way to combine them together, but a principled unifying framework is yet to be discovered.<br />
<br />
This work proposes a new family of regularized auto-encoders called the Wasserstein Auto-Encoder (WAE). The proposed method provides a novel theoretical insight into setting up an objective function for auto-encoders from the point of view of optimal transport (OT). This theoretical formulation leads the authors to examine adversarial and maximum mean discrepancy based regularizers for matching a prior and the distribution of encoded data points in the latent space. An empirical evaluation is performed on the MNIST and CelebA datasets, where WAE is found to generate samples of better quality than VAE while preserving training stability, the encoder-decoder structure and a nice latent manifold structure.<br />
<br />
The main contribution of the proposed algorithm is to provide theoretical foundations for using optimal transport cost as the auto-encoder objective function, while blending auto-encoders and GANs in a principled way. It also theoretically and experimentally explores the interesting relationships between WAEs, VAEs and adversarial auto-encoders.<br />
<br />
= Proposed Approach =<br />
==Theory of Optimal Transport and Wasserstein Distance==<br />
Wasserstein Distance is a measure of the distance between two probability distributions. It is also called Earth Mover’s distance, short for EM distance, because informally it can be interpreted as moving piles of dirt that follow one probability distribution at a minimum cost to follow the other distribution. The cost is quantified by the amount of dirt moved times the moving distance. <br />
A simple case where the probability domain is discrete is presented below.<br />
<br />
<br />
[[File:em_distance.PNG|thumb|upright=1.4|center|Step-by-step plan of moving dirt between piles in ''P'' and ''Q'' to make them match (''W'' = 5).]]<br />
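The discrete case can be sketched in a few lines of Python. The pile sizes below are hypothetical values chosen so that the result matches the ''W'' = 5 in the caption; they are not read off the figure. The sketch assumes evenly spaced piles, so that pushing one unit of dirt to the neighbouring pile costs 1.<br />

```python
def em_distance_1d(p, q):
    """Earth Mover's distance between two histograms on the same evenly spaced bins."""
    if len(p) != len(q) or sum(p) != sum(q):
        raise ValueError("histograms must have the same length and total mass")
    carried = 0  # dirt that still has to be pushed on to the next pile
    cost = 0
    for p_i, q_i in zip(p, q):
        carried += p_i - q_i   # surplus (or deficit) after fixing this pile
        cost += abs(carried)   # pushing it one bin further costs |carried|
    return cost


# Hypothetical piles, chosen for illustration only.
P = [3, 2, 1, 4]
Q = [1, 2, 4, 3]
print(em_distance_1d(P, Q))  # 5
```

Sweeping left to right and carrying the running surplus is enough in one dimension; in higher dimensions the minimization over transport plans has to be solved explicitly.<br />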
<br />
<br />
In the continuous probability domain, the EM distance is the minimum cost over all possible dirt-moving plans:<br />
\begin{align}<br />
\small W(p_r, p_g) = \underset{\gamma\in\Pi(p_r, p_g)} {\inf}\pmb{\mathbb{E}}_{(x,y)\sim\gamma}[\parallel x-y\parallel]<br />
\end{align}<br />
<br />
where <math>\Pi(p_r, p_g)</math> is the set of all joint probability distributions with marginals <math>p_r</math> and <math>p_g</math>. Here the distribution <math>\gamma</math> is called a transport plan because <math>\gamma(x, y)</math> can be read as the amount of probability mass to be moved from <math>x</math> to <math>y</math>. This intuition follows from the marginal constraints:<br />
<br />
\begin{align}<br />
\int\gamma(x, y)dx = p_g(y)<br />
\end{align}<br />
This means that the total amount of dirt moved to point <math>y</math> is <math>p_g(y)</math>. Similarly, we have:<br />
<br />
\begin{align}<br />
\int\gamma(x, y)dy = p_r(x)<br />
\end{align}<br />
This means that the total amount of dirt moved out of point <math>x</math> is <math>p_r(x)</math>.<br />
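The two marginal constraints can be checked on a small discrete transport plan. The matrix below is a hypothetical plan, written in units of dirt rather than probabilities (dividing every entry by the total mass of 10 would give a proper joint distribution); rows index the source point <math>x</math> and columns the destination <math>y</math>.<br />

```python
# gamma[x][y] = amount of dirt moved from pile x to pile y (hypothetical values).
gamma = [
    [1, 2, 0, 0],
    [0, 0, 2, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 3],
]


def mass_leaving(plan):
    """Sum over destinations y: total dirt taken out of each source x, i.e. p_r(x)."""
    return [sum(row) for row in plan]


def mass_arriving(plan):
    """Sum over sources x: total dirt delivered to each destination y, i.e. p_g(y)."""
    return [sum(column) for column in zip(*plan)]


def transport_cost(plan):
    """Amount moved times distance moved, with piles one unit apart."""
    return sum(mass * abs(x - y)
               for x, row in enumerate(plan)
               for y, mass in enumerate(row))


print(mass_leaving(gamma))    # [3, 2, 1, 4]
print(mass_arriving(gamma))   # [1, 2, 4, 3]
print(transport_cost(gamma))  # 5
```

Any other matrix with the same row and column sums is also a valid plan; the Wasserstein distance is the smallest cost among all of them.<br />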
<br />
The Wasserstein distance, or the cost of Optimal Transport (OT), induces a much weaker topology than ''f''-divergences, which informally means that it is easier for a sequence of distributions to converge under it. This is particularly important in applications where data is supported on low dimensional manifolds in the input space. In that setting, stronger notions of distance such as the KL-divergence often max out, providing no useful gradients for training. In contrast, OT has a much nicer linear behaviour even upon saturation. It can be shown that, under mild assumptions, the Wasserstein distance has guarantees of continuity and differentiability (Arjovsky et al., 2017). Moreover, Arjovsky et al. show that the magnitude of the Wasserstein distance is a meaningful measure of how far apart two distributions are: a smaller value corresponds to distributions that are closer together, and vice versa.<br />
<br />
==Problem Formulation and Notation==<br />
In this paper, calligraphic letters, e.g. <math>\small {\mathcal{X}}</math>, are used for sets, capital letters, e.g. <math>\small X</math>, are used for random variables and lower case letters, e.g. <math>\small x</math>, for their values. Probability distributions are denoted with capital letters, e.g. <math>\small P(X)</math>, and corresponding densities with lower case letters, e.g. <math>\small p(x)</math>.<br />
<br />
This work aims to minimize the OT cost <math>\small W_c(P_X, P_G)</math> between the true (but unknown) data distribution <math>\small P_X</math> and a latent variable model <math>\small P_G</math> specified by the prior distribution <math>\small P_Z</math> of latent codes <math>\small Z \in {\mathcal{Z}}</math> and the generative model <math>\small P_G(X|Z)</math> of the data points <math>\small X \in {\mathcal{X}}</math> given <math>\small Z</math>. <br />
<br />
Kantorovich's formulation of the OT problem is given by:<br />
\begin{align}<br />
\small W_c(P_X, P_G) := \underset{\Gamma\in {\mathcal{P}}(X \sim P_X, Y \sim P_G)}{\inf} {\pmb{\mathbb{E}}_{(X,Y)\sim\Gamma}[c(X,Y)]}<br />
\end{align}<br />
where <math>\small c(x,y)</math> is any measurable cost function and <math>\small {\mathcal{P}(X \sim P_X,Y \sim P_G)}</math> is the set of all joint distributions of <math>\small (X,Y)</math> with marginals <math>\small P_X</math> and <math>\small P_G</math>. When <math>\small c(x,y)=d(x,y)</math> for a metric <math>\small d</math> on <math>\small {\mathcal{X}}</math>, <math>\small W_c</math> is the 1-Wasserstein distance <math>\small W_1</math> and the following Kantorovich-Rubinstein duality holds:<br />
\begin{align}<br />
\small W_1(P_X, P_G) = \underset{f \in {\mathcal{F_L}}} {\sup} {\pmb{\mathbb{E}}_{X \sim P_X}[f(X)]} -{\pmb{\mathbb{E}}_{Y \sim P_G}[f(Y)]}<br />
\end{align}<br />
where <math>\small {\mathcal{F_L}}</math> is the class of all bounded [https://en.wikipedia.org/wiki/Lipschitz_continuity Lipschitz continuous functions] with Lipschitz constant at most 1. A reference that provides an intuitive explanation for how the Kantorovich-Rubinstein duality was applied in this case is [https://vincentherrmann.github.io/blog/wasserstein/ here].<br />
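<br />
As a quick sanity check of the duality (a worked example that is not taken from the paper), let <math>\small P_X = \delta_a</math> and <math>\small P_G = \delta_b</math> be point masses on the real line with <math>\small d(x,y) = |x-y|</math>. The only coupling moves all of the mass from <math>\small a</math> to <math>\small b</math>, so the primal problem gives <math>\small W_1 = |a-b|</math>. On the dual side, every 1-Lipschitz <math>\small f</math> satisfies <math>\small f(a) - f(b) \leq |a-b|</math>, and the bounded 1-Lipschitz function <math>\small f(x) = \max(-|a-b|, -|x-a|)</math> attains this bound:<br />
\begin{align}<br />
\small f(a) - f(b) = 0 - (-|a-b|) = |a-b| = W_1(\delta_a, \delta_b)<br />
\end{align}<br />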
<br />
==Wasserstein Auto-Encoders==<br />
The proposed method focuses on latent variable models <math>\small P_G </math> defined by a two-step procedure, where first a code <math>\small Z</math> is sampled from a fixed prior distribution <math>\small P_Z</math> on a latent space <math>\small {\mathcal{Z}}</math> and then <math>\small Z</math> is mapped to the image <math>\small X \in {\mathcal{X}}</math> with a transformation. This results in a density of the form<br />
\begin{align}<br />
\small p_G(x) := \int_{{\mathcal{Z}}} p_G(x|z)p_z(z)dz, \forall x\in{\mathcal{X}}<br />
\end{align}<br />
assuming all the densities are properly defined. It turns out that if the focus is only on generative models deterministically mapping <math>\small Z </math> to <math>\small X = G(Z) </math>, then the OT cost takes a much simpler form as stated below by Theorem 1.<br />
<br />
'''Theorem 1''' For any function <math>\small G:{\mathcal{Z}} \rightarrow {\mathcal{X}}</math>, where <math>\small Q_Z </math> is the marginal distribution of <math>\small Z </math> when <math>\small X \sim P_X </math> and <math>\small Z \sim Q(Z|X) </math>,<br />
\begin{align}<br />
\small \underset{\Gamma\in {\mathcal{P}}(X \sim P_X, Y \sim P_G)}{\inf} {\pmb{\mathbb{E}}_{(X,Y)\sim\Gamma}[c(X,Y)]} = \underset{Q : Q_Z=P_Z}{\inf} {{\pmb{\mathbb{E}}_{P_X}}{\pmb{\mathbb{E}}_{Q(Z|X)}}[c(X,G(Z))]}<br />
\end{align}<br />
This essentially means that instead of finding a coupling <math>\small \Gamma </math> between two random variables living in the <math>\small {\mathcal{X}} </math> space, one distributed according to <math>\small P_X </math> and the other one according to <math>\small P_G </math>, it is sufficient to find a conditional distribution <math>\small Q(Z|X) </math> such that its <math>\small Z </math> marginal <math>\small Q_Z(Z) := {\pmb{\mathbb{E}}_{X \sim P_X}[Q(Z|X)]} </math> is identical to the prior distribution <math>\small P_Z </math>. In order to implement a numerical solution to Theorem 1, the constraint <math>\small Q_Z = P_Z </math> is relaxed and a penalty is added to the objective, leading to the WAE objective function given by:<br />
<br />
\begin{align}<br />
\small D_{WAE}(P_X, P_G):= \underset{Q(Z|X) \in {\mathcal{Q}}}{\inf} {{\pmb{\mathbb{E}}_{P_X}}{\pmb{\mathbb{E}}_{Q(Z|X)}}[c(X,G(Z))]} + {\lambda} {{\mathcal{D}}_Z(Q_Z,P_Z)}<br />
\end{align}<br />
where <math>\small {\mathcal{Q}} </math> is any non-parametric set of probabilistic encoders, <math>\small {\mathcal{D}}_Z </math> is an arbitrary divergence between <br />
<math>\small Q_Z </math> and <math>\small P_Z </math>, and <math>\small \lambda > 0 </math> is a hyperparameter. The authors propose two different penalties <math>\small {\mathcal{D}}_Z(Q_Z,P_Z) </math> based on adversarial training (GANs) and maximum mean discrepancy (MMD). The authors note that, in earlier work on W-GAN, a numerical solution to the dual formulation of the problem has been attempted by clipping the weights of the network (to satisfy the Lipschitz condition) or by penalizing the objective with <math>\small \lambda \mathbb{E}(\parallel \nabla f(X) \parallel - 1)^2 </math>.<br />
<br />
===WAE-GAN: GAN-based===<br />
The first option is to choose <math>\small {\mathcal{D}}_Z(Q_Z,P_Z) = D_{JS}(Q_Z,P_Z)</math>, where <math>\small D_{JS} </math> is the Jensen-Shannon divergence, and use adversarial training to estimate it. Specifically, a discriminator is introduced in the latent space <math>\small {\mathcal{Z}} </math> that tries to separate true points sampled from <math>\small P_Z </math> from fake ones sampled from <math>\small Q_Z </math>. This results in Algorithm 1. It is interesting that the min-max problem is moved from the input pixel space to the latent space.<br />
<br />
<br />
[[File:wae-gan.PNG|270px|center]]<br />
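<br />
Since Algorithm 1 is only shown as an image, the two quantities that it alternates between can be sketched as follows. This is a minimal sketch based on my reading of the algorithm, not the authors' code: the function names are hypothetical, and the discriminator outputs are assumed to be probabilities in (0, 1) that have already been computed for one mini-batch.<br />

```python
import math


def discriminator_objective(d_prior, d_encoded, lam):
    """Value the latent discriminator tries to increase.

    d_prior[i]   = D(z_i)   for z_i sampled from the prior P_Z
    d_encoded[i] = D(z~_i)  for z~_i = encoder(x_i)
    """
    n = len(d_prior)
    return lam / n * sum(math.log(dp) + math.log(1.0 - de)
                         for dp, de in zip(d_prior, d_encoded))


def autoencoder_objective(recon_costs, d_encoded, lam):
    """Value the encoder and decoder try to decrease.

    recon_costs[i] = c(x_i, G(z~_i)), the reconstruction cost of example i.
    The second term rewards codes that the discriminator mistakes for prior samples.
    """
    n = len(recon_costs)
    return sum(cost - lam * math.log(de)
               for cost, de in zip(recon_costs, d_encoded)) / n


# A discriminator that cannot tell the two sets of codes apart outputs 0.5 everywhere.
print(discriminator_objective([0.5, 0.5], [0.5, 0.5], lam=1.0))
print(autoencoder_objective([1.0, 3.0], [0.5, 0.5], lam=1.0))
```

In a real implementation both quantities would be differentiated with respect to the network parameters; the sketch only shows how the reconstruction cost and the latent-space penalty are combined.<br />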
<br />
===WAE-MMD: MMD-based===<br />
For a positive definite reproducing kernel <math>\small k: {\mathcal{Z}} \times {\mathcal{Z}} \rightarrow {\mathbb{R}}</math>, the following expression is called the maximum mean discrepancy:<br />
\begin{align}<br />
\small {MMD}_k(P_Z,Q_Z) = \parallel \int_{{\mathcal{Z}}} k(z,\cdot)dP_Z(z) - \int_{{\mathcal{Z}}} k(z,\cdot)dQ_Z(z) \parallel_{\mathcal{H}_k},<br />
\end{align}<br />
<br />
where <math>\mathcal{H}_k</math> is the reproducing kernel Hilbert space of real-valued functions mapping <math>\mathcal{Z}</math> to <math>\mathbb{R}</math>. This can be used as a divergence measure, and the authors propose to use <math>\small {\mathcal{D}}_Z(Q_Z,P_Z) = MMD_k(P_Z,Q_Z) </math>, which leads to Algorithm 2.<br />
<br />
<br />
[[File:wae-mmd.PNG|270px|center]]<br />
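<br />
In practice the MMD has to be estimated from mini-batches of samples. The sketch below uses the standard unbiased U-statistic estimate of the squared MMD, which is my understanding of what Algorithm 2 computes; the RBF kernel and all function names are assumptions made for illustration.<br />

```python
import math


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rbf_kernel(a, b, sigma2=1.0):
    """Gaussian RBF kernel, used here only as an example of a positive definite kernel."""
    return math.exp(-squared_distance(a, b) / (2.0 * sigma2))


def mmd2_estimate(prior_codes, encoded_codes, kernel=rbf_kernel):
    """Unbiased estimate of the squared MMD between two equally sized sample sets."""
    n = len(prior_codes)
    if n < 2 or len(encoded_codes) != n:
        raise ValueError("need two sample sets of equal size n >= 2")
    # Within-set terms skip the diagonal (i == j), which makes the estimate unbiased.
    within_prior = sum(kernel(prior_codes[i], prior_codes[j])
                       for i in range(n) for j in range(n) if i != j)
    within_encoded = sum(kernel(encoded_codes[i], encoded_codes[j])
                         for i in range(n) for j in range(n) if i != j)
    cross = sum(kernel(p, e) for p in prior_codes for e in encoded_codes)
    return (within_prior + within_encoded) / (n * (n - 1)) - 2.0 * cross / n ** 2


prior = [[0.0], [0.1]]
print(mmd2_estimate(prior, [[0.05], [0.15]]))  # close to zero: the two sets overlap
print(mmd2_estimate(prior, [[5.0], [5.1]]))    # much larger: the two sets are far apart
```

Because the estimate is unbiased rather than exact, it can come out slightly negative when the two sample sets are very similar.<br />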
<br />
= Comparison with Related Work =<br />
==Auto-Encoders, VAEs and WAEs==<br />
Classical unregularized auto-encoders only minimized the reconstruction cost, which resulted in training points being chaotically scattered across the latent space with holes in between, where the decoder had never been trained. They were hard to sample from and did not provide a useful representation. VAEs circumvented this problem by maximizing a variational lower bound consisting of a reconstruction cost and a KL-divergence term that captures how distinct the encoding of each training example is from the prior <math>\small P_Z</math>. This, however, does not guarantee that the overall encoded distribution <math>\small {{\pmb{\mathbb{E}}_{P_X}}}[Q(Z|X)]</math> matches <math>\small P_Z</math>. WAE does enforce this match, as a direct consequence of the objective function derived from Theorem 1; the difference is visually represented in the figure below. It is also interesting to note that this allows WAE to have deterministic encoder-decoder pairs.<br />
<br />
<br />
[[File:vae-wae.PNG|500px|thumb|center|WAE and VAE regularization]]<br />
<br />
<br />
It is also shown that if <math>\small c(x,y)={\parallel x-y \parallel}_2^2</math>, WAE-GAN is equivalent to adversarial auto-encoders (AAEs). Thus the theory suggests that AAEs minimize the 2-Wasserstein distance between <math>\small P_X</math> and <math>\small P_G</math>.<br />
<br />
==OT, W-GAN and WAE==<br />
The literature on OT addresses computing the OT cost at large scale using SGD and sampling. These works approach the task either through the dual formulation or via a regularized version of the primal, and they do not discuss any implications for generative modeling. The authors' approach is based on the primal form of OT, arrives at very different regularizers, and focuses mainly on generative modeling.<br />
The Wasserstein GAN (W-GAN) minimizes the 1-Wasserstein distance <math>\small W_1(P_X,P_G)</math> for generative modeling. The W-GAN formulation is approached from the dual form and thus cannot be applied to any other cost <math>\small W_c</math>, as the neat form of the Kantorovich-Rubinstein duality holds only for <math>\small W_1</math>. WAE approaches the same problem from the primal form, can be applied to any cost function <math>\small c</math> and comes naturally with an encoder. The constraint on OT in Theorem 1 is relaxed, in line with the theory on unbalanced optimal transport, by adding a penalty or additional divergences to the objective.<br />
<br />
==GANs and WAEs==<br />
Many of the GAN variations, including f-GAN and W-GAN, come without an encoder. They are therefore not applicable when it is desirable to reconstruct the latent codes and use the learned manifold.<br />
<br />
There have been many other approaches that try to blend the adversarial training of GANs with auto-encoder architectures. In these models the encoder and decoder have no incentive to be reciprocal. The approach most relevant to WAE works around this by including an additional reconstruction term in the objective. WAE is similar but differs in that it does not necessarily lead to a min-max game, uses a different penalty, and has a clear theoretical foundation. Several works have used reproducing kernels in the context of GANs. WAE-MMD uses MMD to match <math>\small Q_Z</math> to the prior <math>\small P_Z</math> in the latent space <math>\small {\mathcal{Z}}</math>. Typically <math>\small {\mathcal{Z}}</math> has no more than 100 dimensions and <math>\small P_Z</math> is Gaussian, which allows regular mini-batch sizes to be used to accurately estimate MMD.<br />
<br />
=Experimental Results=<br />
The authors empirically evaluate the proposed WAE generative model by specifically testing if data points are accurately reconstructed, if the latent manifold has reasonable geometry, and if random samples of good visual quality are generated. <br />
<br />
'''Experimental setup:'''<br />
A Gaussian prior distribution <math> \small P_Z</math> and the squared cost function <math> \small c(x,y)={\parallel x-y \parallel}_2^2</math> are used for data points. The encoder-decoder pairs were deterministic. The convolutional deep neural networks for the encoder and decoder mappings are similar to DC-GAN with batch normalization. Two real world datasets, MNIST with 70k images and CelebA with 203k images, were used for training and testing. For interpolations, a pair of held-out images <math>(x,y)</math> from the test set are auto-encoded (separately) to produce <math>(z_x, z_y)</math> in the latent space. The elements of the latent space are linearly interpolated and decoded to produce the images below. <br />
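<br />
The interpolation step can be sketched as follows. The function name and the number of steps are hypothetical; the trained encoder and decoder are not shown, and the latent codes are plain lists of floats.<br />

```python
def interpolate_latents(z_x, z_y, steps):
    """Return `steps` evenly spaced codes on the straight line from z_x to z_y."""
    if steps < 2:
        raise ValueError("need at least the two end points")
    path = []
    for k in range(steps):
        t = k / (steps - 1)  # t runs from 0 (at z_x) to 1 (at z_y)
        path.append([(1 - t) * a + t * b for a, b in zip(z_x, z_y)])
    return path


# Toy 2-dimensional codes; each point on the path would then be passed to the decoder.
print(interpolate_latents([0.0, 2.0], [1.0, 0.0], steps=3))
# [[0.0, 2.0], [0.5, 1.0], [1.0, 0.0]]
```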
<br />
'''WAE-GAN and WAE-MMD:'''<br />
In WAE-GAN, the discriminator <math> \small D </math> is composed of several fully connected layers with ReLU activations. For WAE-MMD, the RBF kernel failed to penalize outliers, and thus the authors resorted to using the inverse multiquadratics kernel <math> \small k(x,y)=C/(C+\parallel x-y \parallel_2^2) </math>. Trained models are presented in the figure below.<br />
As far as randomly sampled results are concerned, WAE-GAN training seems to be highly unstable, but it leads to the best matching scores among WAE-GAN, WAE-MMD and VAE. WAE-MMD, on the other hand, has much more stable training and fairly good quality of sampled results.<br />
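<br />
One way to see why the choice of kernel matters for outliers is to compare how quickly the two kernels decay with distance. The sketch below is only an illustration: the bandwidth and the constant <math>\small C</math> are arbitrary values, not the ones used by the authors.<br />

```python
import math


def rbf(sq_dist, sigma2=1.0):
    """Gaussian RBF kernel as a function of the squared distance."""
    return math.exp(-sq_dist / (2.0 * sigma2))


def inverse_multiquadratic(sq_dist, c=1.0):
    """Inverse multiquadratics kernel k = C / (C + ||x - y||^2)."""
    return c / (c + sq_dist)


for sq_dist in (0.0, 1.0, 25.0, 100.0):
    print(sq_dist, rbf(sq_dist), inverse_multiquadratic(sq_dist))
```

Both kernels equal 1 at distance zero, but at a squared distance of 100 the RBF value is numerically negligible while the inverse multiquadratics value is still about 0.01, so a far-away code continues to contribute to the penalty.<br />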
<br />
'''Quantitative assessment:'''<br />
In order to quantitatively assess the quality of the generated images, the authors use the Fréchet Inception Distance (FID) and report the results on CelebA. The FID measures the similarity between two sets of images by comparing the Fréchet distance of multivariate Gaussian distributions fitted to their feature representations; lower is better. In more detail, let <math> (m,C) </math> denote the mean vector and covariance matrix of the features of the inception network (Szegedy et al. 2017) applied to model samples. Let <math>(m_w,C_w) </math> denote the mean vector and covariance matrix of the features of the inception network applied to real data. Then the FID between the model samples and the real data is <math> ||m-m_w||^2 +\mathrm{tr}(C+C_w-2(CC_w)^{\frac{1}{2}} )\,</math> (Heusel et al. 2017). The results confirm that the sampled images from WAE are of better quality than those from VAE (score: 82), and that WAE-GAN gets a slightly better score (score: 42) than WAE-MMD (score: 55), which correlates with visual inspection of the images.<br />
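<br />
The formula can be illustrated in the simplified case where both covariance matrices are diagonal, so that the matrix square root reduces to an element-wise square root. This simplification is an assumption made to keep the sketch short; the actual FID uses full covariance matrices of inception features, and the numbers below are made up.<br />

```python
import math


def fid_diagonal(mean_a, var_a, mean_b, var_b):
    """Frechet distance between two Gaussians with diagonal covariances.

    var_a and var_b hold the diagonal entries of C and C_w.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))
    # tr(C + C_w - 2 (C C_w)^(1/2)) for diagonal matrices, one entry at a time.
    trace_term = sum(va + vb - 2.0 * math.sqrt(va * vb)
                     for va, vb in zip(var_a, var_b))
    return mean_term + trace_term


# Made-up 2-dimensional feature statistics.
print(fid_diagonal([0.0, 0.0], [1.0, 4.0], [1.0, 0.0], [4.0, 1.0]))  # 3.0
```

The distance is zero only when both the means and the variances agree, which is why a lower FID indicates that model samples are statistically closer to real data.<br />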
<br />
[[File:results.png|800px|thumb|center|Results on MNIST and Celeb-A dataset. In "test reconstructions" (middle row of images), odd rows correspond to the real test points.]]<br />
<br />
<br />
<br />
The authors also heuristically evaluate the sharpness of generated samples using the Laplace filter. The numbers, summarized in the table below, show that WAE-MMD has samples of slightly better quality than VAE, while WAE-GAN achieves the best results overall.<br />
[[File: paper17_Table.png|300px|thumb|center|Qualitative Assessment of Images]]<br />
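<br />
A Laplace-filter sharpness score can be sketched as below. The 3x3 kernel and the use of the variance of the filter responses as the score are assumptions on my part, since the summary above does not spell out the exact procedure; images are plain nested lists of greyscale values.<br />

```python
# A common discrete Laplace kernel (assumed here, not taken from the paper).
LAPLACE_KERNEL = [
    [0, 1, 0],
    [1, -4, 1],
    [0, 1, 0],
]


def laplace_sharpness(image):
    """Variance of the Laplace filter responses over the interior pixels."""
    height, width = len(image), len(image[0])
    if height < 3 or width < 3:
        raise ValueError("image must be at least 3x3")
    responses = []
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            responses.append(sum(LAPLACE_KERNEL[dr][dc] * image[r - 1 + dr][c - 1 + dc]
                                 for dr in range(3) for dc in range(3)))
    mean = sum(responses) / len(responses)
    return sum((value - mean) ** 2 for value in responses) / len(responses)


flat = [[1] * 4 for _ in range(4)]
checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]
print(laplace_sharpness(flat))          # 0.0: no edges at all
print(laplace_sharpness(checkerboard))  # 16.0: strong edges everywhere
```

Blurry images have weak edges and therefore a small variance of filter responses, so a higher score is read as a sharper image.<br />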
<br />
'''Network structures:'''<br />
<br />
The encoder, decoder, and adversary architectures used for the MNIST and CelebA datasets are shown in the following two images:<br />
<br />
[[File:WAE_MNIST.png|700px|thumb|center|Network architectures used to evaluate on the MNIST dataset.]]<br />
<br />
[[File:WAE_CelebA.png|700px|thumb|center|Network architectures used to evaluate on the CelebA dataset.]]<br />
<br />
= Commentary and Conclusion =<br />
This paper presents an interesting theoretical justification for a new family of auto-encoders called Wasserstein Auto-Encoders (WAE). The objective function minimizes the optimal transport cost in the form of the Wasserstein distance, but relaxes theoretical constraints to separate it into a reconstruction cost and a regularization penalty. The regularization penalizes divergences between a prior and the distribution of encoded latent space training data, and is estimated by means of adversarial training (WAE-GAN), or kernel-based techniques (WAE-MMD). They show that they achieve samples of better visual quality than VAEs, while achieving stable training at the same time. They also theoretically show that WAEs are a generalization of adversarial auto-encoders (AAEs).<br />
<br />
Although the paper mentions that encoder-decoder pairs can be deterministic, they do not show the geometry of the latent space that is obtained. It is necessary to study the effect of randomness of encoders on the quality of obtained samples. While this method is evaluated on MNIST and CelebA datasets, it is also important to see their performance on other real world data distributions. The authors do not provide a comprehensive evaluation of WAE-GAN regularization, thus making it hard to comment on whether moving an adversarial problem to the latent space results in less instability. Reasons for better sample quality of WAE-GAN over WAE-MMD also need to be inspected. In the future it would be interesting to investigate different ways to compute the divergences between the encoded distribution and the prior distribution.<br />
<br />
=Open Source Code=<br />
1. https://github.com/tolstikhin/wae <br />
<br />
2. https://github.com/maitek/waae-pytorch<br />
<br />
=Sources=<br />
1. M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN, 2017<br />
<br />
2. Martin Heusel et al. "Gans trained by a two time-scale update rule converge to a local nash equilibrium." Advances in Neural Information Processing Systems. 2017.<br />
<br />
3. Christian Szegedy et al. "Inception-v4, inception-resnet and the impact of residual connections on learning." AAAI. Vol. 4. 2017.<br />
<br />
4. Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, Bernhard Scholkopf. Wasserstein Auto-Encoders, 2017<br />
<br />
5. https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wasserstein_Auto-Encoders&diff=36374Wasserstein Auto-Encoders2018-04-20T22:10:45Z<p>Ws2chen: /* GANs and WAEs */</p>
<hr />
<div><br />
= Introduction =<br />
Recent years have seen a convergence of two previously distinct approaches: representation learning from high dimensional data, and unsupervised generative modeling. In the field that formed at their intersection, Variational Auto-Encoders (VAEs) and Generative Adversarial Networks (GANs) have emerged to become well-established. VAEs are theoretically elegant but have the drawback that they tend to generate blurry samples when applied to natural images. GANs, on the other hand, produce sampled images of better visual quality, but come without an encoder, are harder to train and suffer from the mode-collapse problem, in which the trained model is unable to capture all the variability in the true data distribution. There has been recent research on encoder-decoder GANs, where an encoder is trained in parallel with the generator, based on the intuition that this will allow the GAN to learn a meaningful mapping from the compressed representation to the original image; however, these models also suffer from mode-collapse and perform comparably to vanilla GANs. Thus there has been a push to come up with the best way to combine VAEs and GANs, but a principled unifying framework is yet to be discovered.<br />
<br />
This work proposes a new family of regularized auto-encoders called the Wasserstein Auto-Encoder (WAE). The proposed method provides a novel theoretical insight into setting up an objective function for auto-encoders from the point of view of optimal transport (OT). This theoretical formulation leads the authors to examine adversarial and maximum mean discrepancy based regularizers for matching a prior and the distribution of encoded data points in the latent space. An empirical evaluation is performed on the MNIST and CelebA datasets, where WAE is found to generate samples of better quality than VAE while preserving training stability, the encoder-decoder structure and a nice latent manifold structure.<br />
<br />
The main contribution of the proposed algorithm is to provide theoretical foundations for using optimal transport cost as the auto-encoder objective function, while blending auto-encoders and GANs in a principled way. It also theoretically and experimentally explores the interesting relationships between WAEs, VAEs and adversarial auto-encoders.<br />
<br />
= Proposed Approach =<br />
==Theory of Optimal Transport and Wasserstein Distance==<br />
Wasserstein Distance is a measure of the distance between two probability distributions. It is also called Earth Mover’s distance, short for EM distance, because informally it can be interpreted as moving piles of dirt that follow one probability distribution at a minimum cost to follow the other distribution. The cost is quantified by the amount of dirt moved times the moving distance. <br />
A simple case where the probability domain is discrete is presented below.<br />
<br />
<br />
[[File:em_distance.PNG|thumb|upright=1.4|center|Step-by-step plan of moving dirt between piles in ''P'' and ''Q'' to make them match (''W'' = 5).]]<br />
<br />
<br />
In the continuous probability domain, the EM distance is the minimum cost over all possible dirt-moving plans:<br />
\begin{align}<br />
\small W(p_r, p_g) = \underset{\gamma\in\Pi(p_r, p_g)} {\inf}\pmb{\mathbb{E}}_{(x,y)\sim\gamma}[\parallel x-y\parallel]<br />
\end{align}<br />
<br />
where <math>\Pi(p_r, p_g)</math> is the set of all joint probability distributions with marginals <math>p_r</math> and <math>p_g</math>. Here the distribution <math>\gamma</math> is called a transport plan because <math>\gamma(x, y)</math> can be read as the amount of probability mass to be moved from <math>x</math> to <math>y</math>. This intuition follows from the marginal constraints:<br />
<br />
\begin{align}<br />
\int\gamma(x, y)dx = p_g(y)<br />
\end{align}<br />
This means that the total amount of dirt moved to point <math>y</math> is <math>p_g(y)</math>. Similarly, we have:<br />
<br />
\begin{align}<br />
\int\gamma(x, y)dy = p_r(x)<br />
\end{align}<br />
This means that the total amount of dirt moved out of point <math>x</math> is <math>p_r(x)</math>.<br />
<br />
The Wasserstein distance, or the cost of Optimal Transport (OT), induces a much weaker topology than ''f''-divergences, which informally means that it is easier for a sequence of distributions to converge under it. This is particularly important in applications where data is supported on low dimensional manifolds in the input space. In that setting, stronger notions of distance such as the KL-divergence often max out, providing no useful gradients for training. In contrast, OT has a much nicer linear behaviour even upon saturation. It can be shown that, under mild assumptions, the Wasserstein distance has guarantees of continuity and differentiability (Arjovsky et al., 2017). Moreover, Arjovsky et al. show that the magnitude of the Wasserstein distance is a meaningful measure of how far apart two distributions are: a smaller value corresponds to distributions that are closer together, and vice versa.<br />
<br />
==Problem Formulation and Notation==<br />
In this paper, calligraphic letters, e.g. <math>\small {\mathcal{X}}</math>, are used for sets, capital letters, e.g. <math>\small X</math>, are used for random variables and lower case letters, e.g. <math>\small x</math>, for their values. Probability distributions are denoted with capital letters, e.g. <math>\small P(X)</math>, and corresponding densities with lower case letters, e.g. <math>\small p(x)</math>.<br />
<br />
This work aims to minimize the OT cost <math>\small W_c(P_X, P_G)</math> between the true (but unknown) data distribution <math>\small P_X</math> and a latent variable model <math>\small P_G</math> specified by the prior distribution <math>\small P_Z</math> of latent codes <math>\small Z \in {\mathcal{Z}}</math> and the generative model <math>\small P_G(X|Z)</math> of the data points <math>\small X \in {\mathcal{X}}</math> given <math>\small Z</math>. <br />
<br />
Kantorovich's formulation of the OT problem is given by:<br />
\begin{align}<br />
\small W_c(P_X, P_G) := \underset{\Gamma\in {\mathcal{P}}(X \sim P_X, Y \sim P_G)}{\inf} {\pmb{\mathbb{E}}_{(X,Y)\sim\Gamma}[c(X,Y)]}<br />
\end{align}<br />
where <math>\small c(x,y)</math> is any measurable cost function and <math>\small {\mathcal{P}(X \sim P_X,Y \sim P_G)}</math> is the set of all joint distributions of <math>\small (X,Y)</math> with marginals <math>\small P_X</math> and <math>\small P_G</math>. When <math>\small c(x,y)=d(x,y)</math> for a metric <math>\small d</math> on <math>\small {\mathcal{X}}</math>, <math>\small W_c</math> is the 1-Wasserstein distance <math>\small W_1</math> and the following Kantorovich-Rubinstein duality holds:<br />
\begin{align}<br />
\small W_1(P_X, P_G) = \underset{f \in {\mathcal{F_L}}} {\sup} {\pmb{\mathbb{E}}_{X \sim P_X}[f(X)]} -{\pmb{\mathbb{E}}_{Y \sim P_G}[f(Y)]}<br />
\end{align}<br />
where <math>\small {\mathcal{F_L}}</math> is the class of all bounded [https://en.wikipedia.org/wiki/Lipschitz_continuity Lipschitz continuous functions] with Lipschitz constant at most 1. A reference that provides an intuitive explanation for how the Kantorovich-Rubinstein duality was applied in this case is [https://vincentherrmann.github.io/blog/wasserstein/ here].<br />
<br />
==Wasserstein Auto-Encoders==<br />
The proposed method focuses on latent variable models <math>\small P_G </math> defined by a two-step procedure, where first a code <math>\small Z</math> is sampled from a fixed prior distribution <math>\small P_Z</math> on a latent space <math>\small {\mathcal{Z}}</math> and then <math>\small Z</math> is mapped to the image <math>\small X \in {\mathcal{X}}</math> with a transformation. This results in a density of the form<br />
\begin{align}<br />
\small p_G(x) := \int_{{\mathcal{Z}}} p_G(x|z)p_z(z)dz, \forall x\in{\mathcal{X}}<br />
\end{align}<br />
assuming all the densities are properly defined. It turns out that if the focus is only on generative models deterministically mapping <math>\small Z </math> to <math>\small X = G(Z) </math>, then the OT cost takes a much simpler form as stated below by Theorem 1.<br />
<br />
'''Theorem 1''' For any function <math>\small G:{\mathcal{Z}} \rightarrow {\mathcal{X}}</math>, where <math>\small Q_Z </math> is the marginal distribution of <math>\small Z </math> when <math>\small X \sim P_X </math> and <math>\small Z \sim Q(Z|X) </math>,<br />
\begin{align}<br />
\small \underset{\Gamma\in {\mathcal{P}}(X \sim P_X, Y \sim P_G)}{\inf} {\pmb{\mathbb{E}}_{(X,Y)\sim\Gamma}[c(X,Y)]} = \underset{Q : Q_Z=P_Z}{\inf} {{\pmb{\mathbb{E}}_{P_X}}{\pmb{\mathbb{E}}_{Q(Z|X)}}[c(X,G(Z))]}<br />
\end{align}<br />
This essentially means that instead of finding a coupling <math>\small \Gamma </math> between two random variables living in the <math>\small {\mathcal{X}} </math> space, one distributed according to <math>\small P_X </math> and the other one according to <math>\small P_G </math>, it is sufficient to find a conditional distribution <math>\small Q(Z|X) </math> such that its <math>\small Z </math> marginal <math>\small Q_Z(Z) := {\pmb{\mathbb{E}}_{X \sim P_X}[Q(Z|X)]} </math> is identical to the prior distribution <math>\small P_Z </math>. In order to implement a numerical solution to Theorem 1, the constraint <math>\small Q_Z = P_Z </math> is relaxed and a penalty is added to the objective, leading to the WAE objective function given by:<br />
<br />
\begin{align}<br />
\small D_{WAE}(P_X, P_G):= \underset{Q(Z|X) \in {\mathcal{Q}}}{\inf} {{\pmb{\mathbb{E}}_{P_X}}{\pmb{\mathbb{E}}_{Q(Z|X)}}[c(X,G(Z))]} + {\lambda} {{\mathcal{D}}_Z(Q_Z,P_Z)}<br />
\end{align}<br />
where <math>\small {\mathcal{Q}} </math> is any non-parametric set of probabilistic encoders, <math>\small {\mathcal{D}}_Z </math> is an arbitrary divergence between <br />
<math>\small Q_Z </math> and <math>\small P_Z </math>, and <math>\small \lambda > 0 </math> is a hyperparameter. The authors propose two different penalties <math>\small {\mathcal{D}}_Z(Q_Z,P_Z) </math> based on adversarial training (GANs) and maximum mean discrepancy (MMD). The authors note that, in earlier work on W-GAN, a numerical solution to the dual formulation of the problem has been attempted by clipping the weights of the network (to satisfy the Lipschitz condition) or by penalizing the objective with <math>\small \lambda \mathbb{E}(\parallel \nabla f(X) \parallel - 1)^2 </math>.<br />
<br />
===WAE-GAN: GAN-based===<br />
The first option is to choose <math>\small {\mathcal{D}}_Z(Q_Z,P_Z) = D_{JS}(Q_Z,P_Z)</math>, where <math>\small D_{JS} </math> is the Jensen-Shannon divergence, and use adversarial training to estimate it. Specifically, a discriminator is introduced in the latent space <math>\small {\mathcal{Z}} </math> that tries to separate true points sampled from <math>\small P_Z </math> from fake ones sampled from <math>\small Q_Z </math>. This results in Algorithm 1. It is interesting that the min-max problem is moved from the input pixel space to the latent space.<br />
<br />
<br />
[[File:wae-gan.PNG|270px|center]]<br />
<br />
===WAE-MMD: MMD-based===<br />
For a positive definite reproducing kernel <math>\small k: {\mathcal{Z}} \times {\mathcal{Z}} \rightarrow {\mathbb{R}}</math>, the following expression is called the maximum mean discrepancy:<br />
\begin{align}<br />
\small {MMD}_k(P_Z,Q_Z) = \parallel \int_{{\mathcal{Z}}} k(z,\cdot)dP_Z(z) - \int_{{\mathcal{Z}}} k(z,\cdot)dQ_Z(z) \parallel_{\mathcal{H}_k},<br />
\end{align}<br />
<br />
where <math>\mathcal{H}_k</math> is the reproducing kernel Hilbert space of real-valued functions mapping <math>\mathcal{Z}</math> to <math>\mathbb{R}</math>. This can be used as a divergence measure, and the authors propose to use <math>\small {\mathcal{D}}_Z(Q_Z,P_Z) = MMD_k(P_Z,Q_Z) </math>, which leads to Algorithm 2.<br />
<br />
<br />
[[File:wae-mmd.PNG|270px|center]]<br />
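
The MMD penalty can be estimated from two mini-batches of latent codes. The following is a minimal sketch rather than the authors' implementation: it assumes the inverse multiquadratic kernel mentioned in the experiments, a U-statistic estimate of the squared MMD, and plain Python lists as vectors. The function names <code>imq_kernel</code> and <code>mmd_penalty</code> and the default <code>c=1.0</code> are illustrative choices.

```python
def imq_kernel(x, y, c=1.0):
    """Inverse multiquadratic kernel k(x, y) = C / (C + ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return c / (c + sq_dist)


def mmd_penalty(prior_samples, encoded_samples, c=1.0):
    """U-statistic estimate of the squared MMD between two equally sized batches.

    prior_samples are draws from P_Z, encoded_samples are codes from Q_Z.
    """
    n = len(prior_samples)
    if n < 2 or n != len(encoded_samples):
        raise ValueError("need two batches of equal size n >= 2")

    def within(batch):
        # Mean kernel value over ordered pairs with distinct indices.
        total = sum(
            imq_kernel(batch[i], batch[j], c)
            for i in range(n)
            for j in range(n)
            if i != j
        )
        return total / (n * (n - 1))

    # Mean kernel value over all pairs drawn from the two different batches.
    cross = sum(
        imq_kernel(z, z_tilde, c)
        for z in prior_samples
        for z_tilde in encoded_samples
    ) / (n * n)
    return within(prior_samples) + within(encoded_samples) - 2.0 * cross
```

Because the diagonal terms are excluded from the within-batch sums, this estimate can be slightly negative for small batches; it is meant to be compared across batches, not read as an exact distance.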
<br />
= Comparison with Related Work =<br />
==Auto-Encoders, VAEs and WAEs==<br />
Classical unregularized auto-encoders only minimized the reconstruction cost, which resulted in training points being chaotically scattered across the latent space, with holes in between where the decoder had never been trained. They were hard to sample from and did not provide a useful representation. VAEs circumvented this problem by maximizing a variational lower bound comprising a reconstruction cost and a KL-divergence term that captures how distinct the encoding of each training example is from the prior <math>\small P_Z</math>. This, however, does not guarantee that the overall encoded distribution <math>\small {{\pmb{\mathbb{E}}_{P_X}}}[Q(Z|X)]</math> matches <math>\small P_Z</math>. WAE does ensure this match, as a direct consequence of the objective function derived from Theorem 1; the difference is illustrated in the figure below. It also allows WAE to use deterministic encoder-decoder pairs.<br />
<br />
<br />
[[File:vae-wae.PNG|500px|thumb|center|WAE and VAE regularization]]<br />
<br />
<br />
It is also shown that if <math>\small c(x,y)={\parallel x-y \parallel}_2^2</math>, WAE-GAN is equivalent to adversarial auto-encoders (AAE). Thus the theory suggests that AAEs minimize the 2-Wasserstein distance between <math>\small P_X</math> and <math>\small P_G</math>.<br />
<br />
==OT, W-GAN and WAE==<br />
The literature on OT addresses computing the OT cost at large scale using SGD and sampling. These works approach the task either through the dual formulation or via a regularized version of the primal, and they do not discuss any implications for generative modeling. The authors' approach is based on the primal form of OT, arrives at very different regularizers, and focuses mainly on generative modeling.<br />
The Wasserstein GAN (W-GAN) minimizes the 1-Wasserstein distance <math>\small W_1(P_X,P_G)</math> for generative modeling. The W-GAN formulation is approached from the dual form and thus cannot be applied to any other cost <math>\small W_c</math>, as the neat form of the Kantorovich-Rubinstein duality holds only for <math>\small W_1</math>. WAE approaches the same problem from the primal form, can be applied to any cost function <math>\small c</math>, and comes naturally with an encoder. The constraint on OT in Theorem 1 is relaxed in line with the theory of unbalanced optimal transport by adding a penalty or additional divergences to the objective.<br />
<br />
==GANs and WAEs==<br />
Many of the GAN variations including f-GAN and W-GAN come without an encoder. Often it may be desirable to reconstruct the latent codes and use the learned manifold in which case they won't be applicable. For works which try to blend adversarial auto-encoder structures, encoders and decoders do not have incentive to be reciprocal. WAE does not necessarily lead to a min-max game and has a clear theoretical foundation for using penalties for regularization.<br />
<br />
<br />
<br />
There have been many other approaches that try to blend the adversarial training of GANs with auto-encoder architectures. In some of them the encoder and decoder lose their incentive to be reciprocal, and as a workaround an additional reconstruction term is included in the objective. In contrast, WAE does not necessarily lead to a min-max game, uses a different penalty, and has a clear theoretical foundation. <br />
<br />
Several works have used reproducing kernels in the context of GANs. [23, 24] use MMD with a fixed kernel <math>\small k</math> to match <math>\small P_X</math> and <math>\small P_G</math> directly in the input space <math>\small {\mathcal{X}}</math>. These methods have been criticised for requiring larger mini-batches during training: estimating <math>\small MMD_k(P_X,P_G)</math> requires a number of samples roughly proportional to the dimensionality of the input space <math>\small {\mathcal{X}}</math> [25], which is typically larger than <math>\small 10^3</math>. [26] take a similar approach but further train <math>\small k</math> adversarially so as to arrive at a meaningful loss function. WAE-MMD instead uses MMD to match <math>\small Q_Z</math> to the prior <math>\small P_Z</math> in the latent space <math>\small {\mathcal{Z}}</math>. Typically <math>\small {\mathcal{Z}}</math> has no more than 100 dimensions and <math>\small P_Z</math> is Gaussian, which allows regular mini-batch sizes to be used to estimate MMD accurately.<br />
<br />
=Experimental Results=<br />
The authors empirically evaluate the proposed WAE generative model by specifically testing if data points are accurately reconstructed, if the latent manifold has reasonable geometry, and if random samples of good visual quality are generated. <br />
<br />
'''Experimental setup:'''<br />
A Gaussian prior distribution <math> \small P_Z</math> and a squared cost function <math> \small c(x,y)</math> are used for data points. The encoder-decoder pairs are deterministic. The convolutional deep neural networks used for the encoder and decoder mappings are similar to DC-GAN with batch normalization. Two real-world datasets, MNIST (70k images) and CelebA (203k images), were used for training and testing. For interpolations, a pair of held-out images <math>(x,y)</math> from the test set is auto-encoded (separately) to produce <math>(z_x, z_y)</math> in the latent space. The latent codes are then linearly interpolated and decoded to produce the images below. <br />
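
The interpolation step can be sketched as follows. This is an illustrative sketch only: the function name <code>interpolate_latents</code> and the use of plain Python lists for latent codes are assumptions, and the encoder and decoder themselves are not shown.

```python
def interpolate_latents(z_x, z_y, steps):
    """Return `steps` points evenly spaced on the segment from z_x to z_y."""
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both endpoints")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # t runs from 0 (at z_x) to 1 (at z_y)
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_x, z_y)])
    return path
```

Each point on the returned path would then be passed through the decoder to obtain one interpolated image.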
<br />
'''WAE-GAN and WAE-MMD:'''<br />
In WAE-GAN, the discriminator <math> \small D </math> is composed of several fully connected layers with ReLU activations. For WAE-MMD, the RBF kernel failed to penalize outliers, so the authors resorted to the inverse multiquadratics kernel <math> \small k(x,y)=C/(C+\parallel x-y \parallel_2^2) </math>. Trained models are presented in the figure below.<br />
As far as randomly sampled results are concerned, WAE-GAN training appears to be highly unstable but leads to the best matching scores among WAE-GAN, WAE-MMD and VAE. WAE-MMD, on the other hand, trains much more stably and produces samples of fairly good quality.<br />
<br />
'''Quantitative assessment:'''<br />
In order to quantitatively assess the quality of the generated images, the authors use the Fréchet Inception Distance (FID) and report the results on CelebA. The FID measures the similarity between two sets of images by comparing the Fréchet distance of multivariate Gaussian distributions fitted to their feature representations. In more detail, let <math> (m,C) </math> denote the mean vector and covariance matrix of the features of the inception network (Szegedy et al. 2017) applied to model samples, and let <math>(m_w,C_w) </math> denote the mean vector and covariance matrix of the features of the inception network applied to real data. Then the FID between the model samples and the real data is <math> ||m-m_w||^2 +\mathrm{tr}(C+C_w-2(CC_w)^{\frac{1}{2}} )\,</math> (Heusel et al. 2017); lower values are better. The results confirm that images sampled from WAE are of better quality than those from VAE (score: 82), and that WAE-GAN obtains a slightly better score (42) than WAE-MMD (55), which correlates with visual inspection of the images.<br />
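
To make the formula concrete, the sketch below evaluates it in the special case where both covariance matrices are diagonal, so that the trace term reduces to a sum over per-feature variances. This simplification is an assumption made here to keep the example free of matrix square roots; the full FID uses complete covariance matrices. The name <code>fid_diagonal</code> is illustrative.

```python
import math


def fid_diagonal(mean, var, mean_w, var_w):
    """Frechet distance between two Gaussians with diagonal covariances.

    mean, var     : per-feature mean and variance of the model samples
    mean_w, var_w : per-feature mean and variance of the real data
    """
    # Squared Euclidean distance between the mean vectors: ||m - m_w||^2.
    mean_term = sum((a - b) ** 2 for a, b in zip(mean, mean_w))
    # For diagonal C and C_w, tr(C + C_w - 2 (C C_w)^(1/2)) is a per-feature sum.
    trace_term = sum(c + cw - 2.0 * math.sqrt(c * cw) for c, cw in zip(var, var_w))
    return mean_term + trace_term
```

For identical statistics the distance is zero, and it grows as either the means or the variances move apart.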
<br />
[[File:results.png|800px|thumb|center|Results on MNIST and Celeb-A dataset. In "test reconstructions" (middle row of images), odd rows correspond to the real test points.]]<br />
<br />
<br />
<br />
The authors also heuristically evaluate the sharpness of generated samples using the Laplace filter. The numbers, summarized in Table 1, show that WAE-MMD has samples of slightly better quality than VAE, while WAE-GAN achieves the best results overall.<br />
[[File: paper17_Table.png|300px|thumb|center|Qualitative Assessment of Images]]<br />
<br />
'''Network structures:'''<br />
<br />
The Encoder, Decoder, and Adversary architectures used for the MNIST and CelebA datasets are shown in the following two images:<br />
<br />
[[File:WAE_MNIST.png|700px|thumb|center|Network architectures used to evaluate on the MNIST dataset.]]<br />
<br />
[[File:WAE_CelebA.png|700px|thumb|center|Network architectures used to evaluate on the CelebA dataset.]]<br />
<br />
= Commentary and Conclusion =<br />
This paper presents an interesting theoretical justification for a new family of auto-encoders called Wasserstein Auto-Encoders (WAE). The objective function minimizes the optimal transport cost in the form of the Wasserstein distance, but relaxes theoretical constraints to separate it into a reconstruction cost and a regularization penalty. The regularization penalizes divergences between a prior and the distribution of encoded latent space training data, and is estimated by means of adversarial training (WAE-GAN), or kernel-based techniques (WAE-MMD). They show that they achieve samples of better visual quality than VAEs, while achieving stable training at the same time. They also theoretically show that WAEs are a generalization of adversarial auto-encoders (AAEs).<br />
<br />
Although the paper mentions that encoder-decoder pairs can be deterministic, the authors do not show the geometry of the latent space that is obtained. It is necessary to study the effect of randomness of encoders on the quality of obtained samples. While the method is evaluated on the MNIST and CelebA datasets, it is also important to see its performance on other real-world data distributions. The authors do not provide a comprehensive evaluation of WAE-GAN regularization, which makes it hard to comment on whether moving an adversarial problem to the latent space results in less instability. The reasons for the better sample quality of WAE-GAN over WAE-MMD also need to be investigated. In the future it would be interesting to investigate different ways to compute the divergences between the encoded distribution and the prior distribution.<br />
<br />
=Open Source Code=<br />
1. https://github.com/tolstikhin/wae <br />
<br />
2. https://github.com/maitek/waae-pytorch<br />
<br />
=Sources=<br />
1. M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN, 2017<br />
<br />
2. Martin Heusel et al. "Gans trained by a two time-scale update rule converge to a local nash equilibrium." Advances in Neural Information Processing Systems. 2017.<br />
<br />
3. Christian Szegedy et al. "Inception-v4, inception-resnet and the impact of residual connections on learning." AAAI. Vol. 4. 2017.<br />
<br />
4. Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, Bernhard Scholkopf. Wasserstein Auto-Encoders, 2017<br />
<br />
5. https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Multi-scale_Dense_Networks_for_Resource_Efficient_Image_Classification&diff=36373Multi-scale Dense Networks for Resource Efficient Image Classification2018-04-20T21:39:18Z<p>Ws2chen: /* Training of Early Classifiers Interferes with Later Classifiers */</p>
<hr />
<div>= Introduction = <br />
<br />
Multi-Scale Dense Networks, MSDNets, are designed to address the growing demand for efficient object recognition. The issue with existing recognition networks is that they are either efficient networks that do not do well on hard examples, or large networks that do well on all examples but require a large amount of resources. For example, the winner of the COCO 2016 competition was an [http://image-net.org/challenges/talks/2016/GRMI-COCO-slidedeck.pdf ensemble of CNNs], which is likely far too resource-heavy to be used in any resource-limited application.<br />
<br />
Note: <br />
* There are two kinds of efficiency in this context, computational efficiency and resource efficiency.<br />
* Examples can be hard for several reasons, such as a large number of classification labels, randomly occluded or zoomed images, or complicated backgrounds that make image recognition more difficult. <br />
<br />
In order to be efficient on all difficulties MSDNets propose a structure that can accurately output classifications for varying levels of computational requirements. The two cases that are used to evaluate the network are:<br />
* Anytime Prediction: What is the best prediction the network can provide when suddenly prompted?<br />
* Budget Batch Predictions: Given a maximum amount of computational resources, how well does the network do on the batch?<br />
<br />
= Related Networks =<br />
<br />
== Computationally Efficient Networks ==<br />
<br />
Much of the existing work on convolutional networks that are computationally efficient at test time focuses on reducing model size after training. Existing methods for refining an accurate network to be more efficient include weight pruning [3,4,5], quantization of weights [6,7] (during or after training), and knowledge distillation [8,9], which trains smaller student networks to reproduce the output of a much larger teacher network. The proposed work differs from these approaches in that it trains a single model which trades computational efficiency for accuracy at test time without re-training or fine-tuning.<br />
<br />
== Resource Efficient Networks == <br />
<br />
Unlike the above, resource efficient concepts consider limited resources as a part of the structure/loss.<br />
Examples of work in this area include: <br />
* Efficient variants to existing state of the art networks<br />
* Gradient boosted decision trees, which incorporate computational limitations into the training<br />
* Fractal nets<br />
* Adaptive computation time method<br />
<br />
== Related architectures ==<br />
<br />
MSDNets draw on concepts from a number of existing networks:<br />
* Neural fabrics and others are used to quickly establish a low-resolution feature map, which is integral for classification.<br />
* Deeply supervised nets introduced the incorporation of multiple classifiers throughout the network. (For example, a BranchyNet (Teerapittayanon et al., 2016) is a deeply supervised network explicitly designed for efficiency. A BranchyNet has multiple exit branches at various depths, each leading to a softmax classifier. At test time, if a classifier on an early exit branch makes a confident prediction, the rest of the network need not be evaluated. However, unlike in MSDNets, early classifiers in BranchyNets do not have access to low-resolution features.)<br />
* The feature concatenation method from DenseNets (a DenseNet is a CNN with shorter connections between layers close to the input and layers close to the output) allows the later classifiers to not be disrupted by the weight updates from earlier classifiers.<br />
<br />
= Problem Setup =<br />
The authors consider two settings that impose computational constraints at prediction time.<br />
<br />
== Anytime Prediction ==<br />
In the anytime prediction setting (Grubb & Bagnell, 2012), there is a finite computational budget <math>B > 0</math> available for each test example <math>x</math>. Once the budget is exhausted, the prediction for the class is output using early exit. The budget is nondeterministic and varies per test instance.<br />
They assume that the budget is drawn from some joint distribution <math>P(x,B)</math>. They denote the loss of a model <math>f(x)</math> that has to produce a prediction for instance <math>x</math> with a budget of <math>B</math> by <math>L(f(x),B)</math>. The goal of the anytime learner is to minimize the expected loss under the budget distribution <math>L(f)=\mathop{\mathbb{E}}[L(f(x),B)]_{P(x,B)}</math>.<br />
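
The anytime setting can be illustrated with a small sketch. It assumes a hypothetical model exposed as a list of exits, each given as a pair of the cumulative cost needed to reach that exit and a function that computes its prediction; the name <code>anytime_predict</code> and this interface are illustrative and not taken from the paper.

```python
def anytime_predict(exits, budget):
    """Return the prediction of the deepest exit reachable within `budget`.

    exits  : list of (cumulative_cost, predict_fn) pairs, ordered by depth,
             with cumulative_cost increasing along the list
    budget : computational budget B available for this test example
    """
    prediction = None  # stays None if even the first exit is unaffordable
    for cumulative_cost, predict_fn in exits:
        if cumulative_cost > budget:
            break  # budget exhausted: keep the most recent prediction
        prediction = predict_fn()
    return prediction
```

Using functions for the exits models the fact that deeper classifiers are simply never evaluated once the budget runs out.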
<br />
== Budgeted Batch Classification ==<br />
In the budgeted batch classification setting, the model needs to classify a set of examples <math>D_{test} = \{x_1, \ldots, x_M\}</math> within a finite computational budget <math>B > 0</math> that is known in advance. The learner aims to minimize the loss across all examples in <math>D_{test}</math>, within a cumulative cost bounded by <math>B</math>, which is denoted as <math>L(f(D_{test}),B)</math> for some suitable loss function <math>L</math>.<br />
<br />
= Multi-Scale Dense Networks =<br />
Two solutions to the problems mentioned above: <br />
<br />
* Train multiple networks of increasing capacity, and evaluate them at test time.<br />
**Anytime setting: the evaluation can be stopped at any point in time, and the most recent prediction is returned.<br />
**Batch setting: the evaluation is stopped early, as soon as a network's prediction is good enough.<br />
* Build a deep network with a cascade of classifiers operating on the features of internal layers.<br />
<br />
== Integral Contributions ==<br />
<br />
MSDNets aim to provide efficient classification at varying computational costs by creating one network that outputs results at multiple depths. While this may seem trivial, as intermediate classifiers can be inserted into any existing network, two major problems arise.<br />
<br />
=== Coarse Level Features Needed For Classification ===<br />
<br />
[[File:paper29 fig3.png | 700px|thumb|center]]<br />
<br />
The term coarse level feature refers to a set of feature maps in a CNN with low resolution. There are several ways to create such features, and these methods are typically referred to as down-sampling. Some examples of layers that perform this function are max pooling, average pooling and strided convolution. In this architecture, strided convolution is used to create coarse features. <br />
<br />
'''Concern:''' Coarse level features are needed to gain context of scene. In typical CNN based networks, the features propagate from fine to coarse. Classifiers added to the early, fine featured, layers do not output accurate predictions due to the lack of context.<br />
<br />
Figure 3 depicts the relative accuracies of the intermediate classifiers and shows that the accuracy of a classifier is highly correlated with its position in the network. It is easy to see, particularly in the case of ResNet, that the classifiers improve in a staircase pattern. All of the experiments were performed on the CIFAR-100 dataset, and it can be seen that the intermediate classifiers perform worse than the final classifiers, which highlights the problem caused by the lack of coarse level features early on.<br />
<br />
'''Solution:''' To address this issue, MSDNets propose an architecture that uses multi-scale feature maps. The feature maps at a particular layer and scale are computed by concatenating results from up to two convolutions: a standard convolution is first applied to same-scale features from the previous layer to pass on high-resolution information that subsequent layers can use to construct better coarse features, and, if possible, a strided convolution is also applied to the finer-scale feature map from the previous layer to produce coarser features amenable to classification. The network quickly builds up a set number of scales ranging from fine to coarse. These scales are propagated throughout, so that along the whole length of the network there are always coarse level features for classification and fine features for learning more difficult representations.<br />
<br />
=== Training of Early Classifiers Interferes with Later Classifiers ===<br />
<br />
'''Concern:''' When training a network containing intermediate classifiers, the training of early classifiers will cause the early layers to focus on features for that classifier. These learned features may not be as useful to the later classifiers and degrade their accuracy.<br />
<br />
'''Solution:''' MSDNets use dense connectivity to avoid this issue, since DenseNets suffer much less from this effect. Dense connectivity connects each layer with all subsequent layers and allows later layers to bypass features optimized for the short term, which maintains the high accuracy of the final classifier. Because each layer receives the concatenation of all prior layers, gradient propagation is spread across all available features. Later layers therefore do not rely on any single prior layer and have the opportunity to learn new features that prior layers have ignored. This means that if an earlier layer collapses information to generate short-term features, the lost information can be recovered through the direct connection to its preceding layer. The final classifier's performance becomes (more or less) independent of the location of the intermediate classifier.<br />
<br />
== Architecture ==<br />
<br />
[[File:MSDNet_arch.png | 700px|thumb|center|Left: the MSDNet architecture. Right: example calculations for each output given 3 scales and 4 layers.]]<br />
<br />
The architecture of MSDNet is a structure of convolutions with a set number of layers and a set number of scales. Layers allow the network to build on the previous information to generate more accurate predictions, while the scales allow the network to maintain coarse level features throughout.<br />
<br />
The first layer is a special, mini-CNN-network, that quickly fills all required scales with features. The following layers are generated through the convolutions of the previous layers and scales.<br />
<br />
Each output at a given scale <math>s</math> is given by the convolution of all prior outputs of the same scale, and the strided convolution of all prior outputs from the previous (finer) scale. <br />
<br />
The classifiers consist of two convolutional layers, an average pooling layer and a linear layer, and are run on the concatenation of all of the coarsest outputs from the preceding layers.<br />
<br />
=== Loss Function ===<br />
<br />
The loss is calculated as a weighted sum of each classifier's logistic loss: <br />
<br />
<math>\frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} \sum_{k}w_k L(f_k) </math><br />
<br />
Here <math>w_k</math> is the weight of the <math>k</math>th classifier and <math>L(f_k)</math> is its logistic loss. The weighted loss is averaged over the set of training samples <math>\mathcal{D}</math>. The weights can be determined from a budget of computational power, but the results show that setting all of them to 1 is also acceptable.<br />
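
The weighted multi-classifier loss can be sketched as below. The sketch assumes that each classifier outputs a probability vector over classes and that the logistic loss is the negative log-probability of the true label; the function name and the nested-list layout are illustrative choices, not the authors' code.

```python
import math


def weighted_multi_exit_loss(exit_probs, labels, weights):
    """Average over samples of sum_k w_k * L(f_k).

    exit_probs : exit_probs[n][k] is the class-probability list that
                 classifier k assigns to training sample n
    labels     : labels[n] is the index of the true class of sample n
    weights    : weights[k] is the weight w_k of classifier k
    """
    total = 0.0
    for per_exit, label in zip(exit_probs, labels):
        for w_k, probs in zip(weights, per_exit):
            # Negative log-likelihood of the true label under classifier k.
            total += w_k * -math.log(probs[label])
    return total / len(labels)
```

With all weights set to 1, every classifier contributes equally, which is the setting the summary reports as acceptable.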
<br />
=== Computational Limit Inclusion ===<br />
<br />
When running in a budgeted batch scenario, the network attempts to provide the best overall accuracy. To do this with a set limit on computational resources, it spends less of the budget on easy examples so that more can be spent on hard ones. <br />
To facilitate this, the classifiers are designed to exit when the confidence of the classification exceeds a preset threshold. The thresholds must be chosen so that <math>|D_{test}|\sum_{k}(q_k C_k) \leq B </math> holds, where <math>|D_{test}|</math> is the total number of test samples, <math>C_k</math> is the computational cost of obtaining an output from the <math>k</math>th classifier, and <math>q_k </math> is the probability that a sample exits at the <math>k</math>th classifier. Assuming that all classifiers share the same base exit probability <math>q</math>, the resulting <math>q_k</math> can be used to find the threshold for each classifier.<br />
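
The budget constraint can be explored numerically. The summary does not spell out how <math>q_k</math> depends on <math>q</math>, so the sketch below assumes a normalised geometric form, <math>q_k \propto (1-q)^{k-1} q</math>, which is our reading of the paper and should be checked against it. The function names and the bisection search are illustrative, and the costs <math>C_k</math> are assumed to increase with <math>k</math>.

```python
def exit_probabilities(q, num_classifiers):
    """Exit distribution q_k proportional to (1 - q)^(k-1) * q, normalised to sum to 1."""
    raw = [(1.0 - q) ** k * q for k in range(num_classifiers)]
    z = sum(raw)
    return [r / z for r in raw]


def expected_cost(q, costs):
    """Average per-sample cost sum_k q_k * C_k."""
    return sum(p * c for p, c in zip(exit_probabilities(q, len(costs)), costs))


def solve_exit_rate(costs, budget, num_test, tol=1e-9):
    """Smallest q in (0, 1) with num_test * sum_k q_k * C_k <= budget (bisection).

    Relies on costs being increasing, so expected cost falls as q grows.
    """
    per_sample = budget / num_test
    if per_sample < costs[0]:
        raise ValueError("budget is below the cost of the first classifier")
    lo, hi = tol, 1.0 - tol
    if expected_cost(lo, costs) <= per_sample:
        return lo  # budget is loose even for the most expensive setting
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_cost(mid, costs) > per_sample:
            lo = mid  # still too expensive: more samples must exit early
        else:
            hi = mid
    return hi
```

Given the resulting <math>q_k</math>, a confidence threshold for each classifier could then be chosen, for instance on a validation set, so that roughly a fraction <math>q_k</math> of samples exits there; that step is not shown here.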
<br />
=== Network Reduction and Lazy Evaluation ===<br />
There are two ways to reduce the computational needs of MSDNets:<br />
<br />
# Reduce the size of the network by splitting it into <math>S</math> blocks along the depth dimension and keeping only the coarsest <math>(S-i+1)</math> scales in the <math>i^{\text{th}}</math> block. Whenever a scale is removed, a transition layer merges the concatenated features using a 1x1 convolution and feeds the fine-grained features to the coarser scales.<br />
# Remove unnecessary computations: Group the computation in "diagonal blocks"; this propagates the example along paths that are required for the evaluation of the next classifier.<br />
<br />
The strategy of minimizing unnecessary computations when the computational budget is over is known as the ''lazy evaluation''.<br />
<br />
= Experiments = <br />
<br />
When evaluating on CIFAR-10 and CIFAR-100, ensembles and multi-classifier versions of ResNets and DenseNets, as well as FractalNet, are compared with MSDNet. <br />
<br />
When evaluating on ImageNet, ensembles and individual versions of ResNets and DenseNets are compared with MSDNets.<br />
<br />
== Anytime Prediction ==<br />
<br />
In anytime prediction, MSDNets are shown to be highly accurate with very little budget, and they remain above the alternative methods as the budget increases. The authors attribute this to the fact that MSDNets are able to produce low-resolution feature maps well-suited for classification after just a few layers, in contrast to the high-resolution feature maps in the early layers of ResNets or DenseNets. Ensemble networks need to recompute similar low-level features whenever a new model is evaluated, so their accuracy does not increase as quickly when the computational budget grows. <br />
<br />
[[File:MSDNet_anytime.png | 700px|thumb|center|Accuracy of the anytime classification models.]] [[File:cifar10msdnet.png | 700px|thumb|center|CIFAR-10 results.]]<br />
<br />
== Budget Batch ==<br />
<br />
For the budgeted batch setting, three MSDNets are designed with classifiers set up for varying ranges of budget constraints. On both datasets the MSDNets exceed all alternative methods while requiring only a fraction of the budget.<br />
<br />
[[File:MSDNet_budgetbatch.png | 700px|thumb|center|Accuracy of the budget batch classification models.]]<br />
<br />
The following figure shows examples of what was deemed "easy" and "hard" examples by the network. The top row contains images of either red wine or volcanos that were easily classified, thus exiting the network early and reducing required computations. The bottom row contains examples of "hard" images that were incorrectly classified by the first classifier but were correctly classified by the last layer.<br />
<br />
[[File:MSDNet_visualizingearlyclassifying.png | 700px|thumb|center|Examples of "hard"/"easy" classification]]<br />
<br />
= Ablation study =<br />
Additional experiments were performed to shed light on the contributions of multi-scale feature maps, dense connectivity, and intermediate classifiers. The experiment started with an MSDNet with six intermediate classifiers, and each of these components was removed, one at a time. To make the comparisons fair, the computational costs of the full networks were kept similar by adapting the network width. After removing all three components, a VGG-like convolutional network is obtained. The classification accuracy of all classifiers is shown in the image below.<br />
<br />
[[File:Screenshot_from_2018-03-29_14-58-03.png]]<br />
<br />
= Critique = <br />
<br />
The problem formulation and the evaluation scenarios were well constructed, and according to independent reviews, the results were reproducible. Where the paper could improve is in explaining how to implement the threshold; it does not explain well how the validation set is used to set the threshold value.<br />
<br />
= Implementation =<br />
The following repository provides the source code for the paper, written by the authors: https://github.com/gaohuang/MSDNet<br />
<br />
= Sources =<br />
# Huang, G., Chen, D., Li, T., Wu, F., Maaten, L., & Weinberger, K. Q. (2018). Multi-Scale Dense Networks for Resource Efficient Image Classification. ICLR 2018. arXiv:1703.09844 <br />
# Huang, G. (n.d.). Gaohuang/MSDNet. Retrieved March 25, 2018, from https://github.com/gaohuang/MSDNet<br />
# LeCun, Yann, John S. Denker, and Sara A. Solla. "Optimal brain damage." Advances in neural information processing systems. 1990.<br />
# Hassibi, Babak, David G. Stork, and Gregory J. Wolff. "Optimal brain surgeon and general network pruning." Neural Networks, 1993., IEEE International Conference on. IEEE, 1993.<br />
# Li, Hao, et al. "Pruning filters for efficient convnets." arXiv preprint arXiv:1608.08710 (2016).<br />
# Hubara, Itay, et al. "Binarized neural networks." Advances in neural information processing systems. 2016.<br />
# Rastegari, Mohammad, et al. "Xnor-net: Imagenet classification using binary convolutional neural networks." European Conference on Computer Vision. Springer, Cham, 2016.<br />
# Cristian Bucilua, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In ACM SIGKDD, pp. 535–541. ACM, 2006.<br />
# Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning Workshop, 2014.<br />
# Teerapittayanon, Surat, Bradley McDanel, and H. T. Kung. "Branchynet: Fast inference via early exiting from deep neural networks." Pattern Recognition (ICPR), 2016 23rd International Conference on. IEEE, 2016.</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Multi-scale_Dense_Networks_for_Resource_Efficient_Image_Classification&diff=36372Multi-scale Dense Networks for Resource Efficient Image Classification2018-04-20T21:33:51Z<p>Ws2chen: /* Training of Early Classifiers Interferes with Later Classifiers */</p>
<hr />
<div>= Introduction = <br />
<br />
Multi-Scale Dense Networks, MSDNets, are designed to address the growing demand for efficient object recognition. The issue with existing recognition networks is that they are either efficient networks that do not do well on hard examples, or large networks that do well on all examples but require a large amount of resources. For example, the winner of the COCO 2016 competition was an [http://image-net.org/challenges/talks/2016/GRMI-COCO-slidedeck.pdf ensemble of CNNs], which is likely far too resource-heavy to be used in any resource-limited application.<br />
<br />
Note: <br />
* There are two kinds of efficiency in this context, computational efficiency and resource efficiency.<br />
* Examples can be hard for several reasons, such as a large number of classification labels, randomly occluded or zoomed images, or complicated backgrounds that make image recognition more difficult. <br />
<br />
In order to be efficient on all difficulties MSDNets propose a structure that can accurately output classifications for varying levels of computational requirements. The two cases that are used to evaluate the network are:<br />
* Anytime Prediction: What is the best prediction the network can provide when suddenly prompted?<br />
* Budget Batch Predictions: Given a maximum amount of computational resources, how well does the network do on the batch?<br />
<br />
= Related Networks =<br />
<br />
== Computationally Efficient Networks ==<br />
<br />
Much of the existing work on convolutional networks that are computationally efficient at test time focuses on reducing model size after training. Existing methods for refining an accurate network to be more efficient include weight pruning [3,4,5], quantization of weights [6,7] (during or after training), and knowledge distillation [8,9], which trains smaller student networks to reproduce the output of a much larger teacher network. The proposed work differs from these approaches in that it trains a single model which trades computational efficiency for accuracy at test time without re-training or fine-tuning.<br />
<br />
== Resource Efficient Networks == <br />
<br />
Unlike the above, resource efficient concepts consider limited resources as a part of the structure/loss.<br />
Examples of work in this area include: <br />
* Efficient variants to existing state of the art networks<br />
* Gradient boosted decision trees, which incorporate computational limitations into the training<br />
* Fractal nets<br />
* Adaptive computation time method<br />
<br />
== Related architectures ==<br />
<br />
MSDNets draw on concepts from a number of existing networks:<br />
* Neural fabrics and others are used to quickly establish a low-resolution feature map, which is integral for classification.<br />
* Deeply supervised nets introduced the incorporation of multiple classifiers throughout the network. (For example, a BranchyNet (Teerapittayanon et al., 2016) is a deeply supervised network explicitly designed for efficiency. A BranchyNet has multiple exit branches at various depths, each leading to a softmax classifier. At test time, if a classifier on an early exit branch makes a confident prediction, the rest of the network need not be evaluated. However, unlike in MSDNets, early classifiers in BranchyNets do not have access to low-resolution features.)<br />
* The feature concatenation method from DenseNets (a DenseNet is a CNN with shorter connections between layers close to the input and layers close to the output) allows the later classifiers to not be disrupted by the weight updates from earlier classifiers.<br />
<br />
= Problem Setup =<br />
The authors consider two settings that impose computational constraints at prediction time.<br />
<br />
== Anytime Prediction ==<br />
In the anytime prediction setting (Grubb & Bagnell, 2012), there is a finite computational budget <math>B > 0</math> available for each test example <math>x</math>. Once the budget is exhausted, the prediction for the class is output using early exit. The budget is nondeterministic and varies per test instance.<br />
They assume that the test example and the budget are drawn from some joint distribution <math>P(x,B)</math>. They denote the loss of a model <math>f(x)</math> that has to produce a prediction for instance <math>x</math> within a budget of <math>B</math> by <math>L(f(x),B)</math>. The goal of the anytime learner is to minimize the expected loss under the budget distribution: <math>L(f)=\mathbb{E}_{P(x,B)}[L(f(x),B)]</math>.<br />
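<br />
The anytime setting can be illustrated with a minimal sketch. It assumes a network with a cascade of exits whose cumulative costs are known, and it uses 0-1 loss as a stand-in for <math>L</math>. The names <code>anytime_predict</code>, <code>expected_anytime_loss</code>, <code>exit_predictions</code>, and <code>cumulative_costs</code> are hypothetical and do not come from the paper.<br />

```python
from typing import Optional, Sequence, Tuple


def anytime_predict(exit_predictions: Sequence[int],
                    cumulative_costs: Sequence[float],
                    budget: float) -> Optional[int]:
    """Return the prediction of the deepest exit whose cumulative cost fits the budget.

    Returns None when even the first exit is too expensive.
    """
    prediction = None
    for pred, cost in zip(exit_predictions, cumulative_costs):
        if cost > budget:
            break  # budget exhausted before this exit could be evaluated
        prediction = pred
    return prediction


def expected_anytime_loss(samples: Sequence[Tuple[Sequence[int], int, float]],
                          cumulative_costs: Sequence[float]) -> float:
    """Monte Carlo estimate of E_{P(x,B)}[L(f(x), B)] with 0-1 loss.

    Each sample is (exit_predictions, true_label, budget), i.e. one draw of (x, B).
    """
    total = 0.0
    for exit_predictions, label, budget in samples:
        pred = anytime_predict(exit_predictions, cumulative_costs, budget)
        total += 0.0 if pred == label else 1.0
    return total / len(samples)
```

Averaging over sampled (example, budget) pairs mirrors the expectation over <math>P(x,B)</math>; a real evaluation would use the network's actual exits and measured costs.<br />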
<br />
== Budgeted Batch Classification ==<br />
In the budgeted batch classification setting, the model needs to classify a set of examples <math>D_{test} = \{x_1, \ldots, x_M\}</math> within a finite computational budget <math>B > 0</math> that is known in advance. The learner aims to minimize the loss across all examples in <math>D_{test}</math>, within a cumulative cost bounded by <math>B</math>, which is denoted as <math>L(f(D_{test}),B)</math> for some suitable loss function <math>L</math>.<br />
<br />
= Multi-Scale Dense Networks =<br />
Two solutions to the problems mentioned above: <br />
<br />
* Train multiple networks of increasing capacity, and evaluate them at test time.<br />
**Anytime setting: the evaluation can be stopped at any time point and return the most recent prediction<br />
**Batch setting: the evaluation is stopped as soon as a network classifies the test sample with sufficient confidence.<br />
* Build a deep network with a cascade of classifiers operating on the features of internal layers.<br />
<br />
== Integral Contributions ==<br />
<br />
MSDNets aim to provide efficient classification under varying computational costs by creating a single network that outputs results at multiple depths. While this may seem trivial, since intermediate classifiers can be inserted into any existing network, two major problems arise.<br />
<br />
=== Coarse Level Features Needed For Classification ===<br />
<br />
[[File:paper29 fig3.png | 700px|thumb|center]]<br />
<br />
The term coarse-level feature refers to a feature map in a CNN with low spatial resolution. There are several ways to create such features, typically referred to as downsampling. Examples of layers that perform this function are max pooling, average pooling, and strided convolution. In this architecture, strided convolution is used to create coarse features. <br />
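<br />
To make the effect of striding concrete, the sketch below applies the standard output-size formula for a convolution, <math>\lfloor (n + 2p - k)/s \rfloor + 1</math>. The kernel size 3, padding 1, and stride 2 are illustrative assumptions rather than the paper's exact hyperparameters, and <code>conv_output_size</code> is a hypothetical helper.<br />

```python
def conv_output_size(n: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial size of a convolution output for an input of size n."""
    return (n + 2 * padding - kernel) // stride + 1


def scale_resolutions(input_size: int, num_scales: int) -> list:
    """Resolutions obtained by repeatedly applying a stride-2, 3x3, padding-1 convolution."""
    sizes = [input_size]
    for _ in range(num_scales - 1):
        sizes.append(conv_output_size(sizes[-1], kernel=3, stride=2, padding=1))
    return sizes
```

Under these assumptions, a 32x32 input (the CIFAR image size) gives three scales of 32, 16, and 8 pixels per side, so each strided convolution halves the resolution.<br />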
<br />
'''Concern:''' Coarse level features are needed to gain context of scene. In typical CNN based networks, the features propagate from fine to coarse. Classifiers added to the early, fine featured, layers do not output accurate predictions due to the lack of context.<br />
<br />
Figure 3 depicts the relative accuracies of the intermediate classifiers and shows that the accuracy of a classifier is highly correlated with its position in the network. It is easy to see, particularly in the case of ResNet, that the classifiers improve in a staircase pattern. All of the experiments were performed on the CIFAR-100 dataset, and the intermediate classifiers perform worse than the final classifier, which highlights the problem caused by the lack of coarse-level features early on.<br />
<br />
'''Solution:''' To address this issue, MSDNets propose an architecture that uses multi-scale feature maps. The feature maps at a particular layer and scale are computed by concatenating results from up to two convolutions: a standard convolution is first applied to same-scale features from the previous layer to pass on high-resolution information that subsequent layers can use to construct better coarse features, and, if possible, a strided convolution is also applied to the finer-scale feature map from the previous layer to produce coarser features amenable to classification. The network is quickly formed to contain a set number of scales ranging from fine to coarse. These scales are propagated throughout, so that for the length of the network there are always coarse-level features for classification and fine features for learning more difficult representations.<br />
<br />
=== Training of Early Classifiers Interferes with Later Classifiers ===<br />
<br />
'''Concern:''' When training a network containing intermediate classifiers, the training of early classifiers will cause the early layers to focus on features for that classifier. These learned features may not be as useful to the later classifiers and degrade their accuracy.<br />
<br />
'''Solution:''' MSDNets use dense connectivity to avoid this issue. By concatenating all prior layers to learn future layers, the gradient propagation is spread throughout the available features. This allows later layers to not be reliant on any single prior, providing opportunities to learn new features that priors have ignored.<br />
<br />
== Architecture ==<br />
<br />
[[File:MSDNet_arch.png | 700px|thumb|center|Left: the MSDNet architecture. Right: example calculations for each output given 3 scales and 4 layers.]]<br />
<br />
The architecture of MSDNet is a structure of convolutions with a set number of layers and a set number of scales. Layers allow the network to build on the previous information to generate more accurate predictions, while the scales allow the network to maintain coarse level features throughout.<br />
<br />
The first layer is a special mini-CNN that quickly fills all required scales with features. The following layers are generated through convolutions of the previous layers and scales.<br />
<br />
Each output at a given scale <math>s</math> is the concatenation of a convolution of all prior outputs at the same scale and a strided convolution of all prior outputs from the previous (finer) scale <math>s-1</math>; the finest scale has only the former. <br />
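<br />
The connectivity rule above can be written down as a small bookkeeping function. This is only a sketch of which earlier outputs feed a given output, not of the convolutions themselves. Layers and scales are assumed to be 1-indexed with scale 1 the finest, and <code>msd_inputs</code> is a hypothetical name.<br />

```python
def msd_inputs(layer: int, scale: int) -> list:
    """List the (layer, scale, operation) triples that feed the output at (layer, scale).

    Layer 1 is produced by the initial mini-network, so it has no dense inputs.
    """
    if layer < 2:
        raise ValueError("layer 1 is produced by the initial mini-network")
    # Regular convolution over all prior outputs at the same scale.
    inputs = [(prev, scale, "conv") for prev in range(1, layer)]
    # Strided convolution over all prior outputs at the next-finer scale, if any.
    if scale > 1:
        inputs += [(prev, scale - 1, "strided_conv") for prev in range(1, layer)]
    return inputs
```

For example, <code>msd_inputs(3, 2)</code> lists the outputs of layers 1 and 2 at scale 2 (regular convolution) followed by the outputs of layers 1 and 2 at scale 1 (strided convolution).<br />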
<br />
The classifiers consist of two convolutional layers, an average pooling layer, and a linear layer, and are run on the concatenation of all of the coarsest outputs from the preceding layers.<br />
<br />
=== Loss Function ===<br />
<br />
The loss is calculated as a weighted sum of each classifier's logistic loss: <br />
<br />
<math>\frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} \sum_{k}w_k L(f_k) </math><br />
<br />
Here <math>w_k</math> represents the weight of the <math>k</math>th classifier and <math>L(f_k)</math> represents its logistic loss. The weighted loss is averaged over the set of training samples <math>\mathcal{D}</math>. The weights can be determined from a budget of computational power, but results also show that setting all of them to 1 is acceptable.<br />
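<br />
As a minimal sketch of this formula, the code below treats the logistic loss as multi-class cross-entropy on each classifier's predicted class probabilities. That interpretation, and the names <code>cross_entropy</code> and <code>msd_loss</code>, are assumptions made for illustration.<br />

```python
import math


def cross_entropy(probs, label: int) -> float:
    """Negative log-probability assigned to the true label."""
    return -math.log(probs[label])


def msd_loss(batch, weights) -> float:
    """Average over the batch of the weighted sum of per-classifier losses.

    batch: list of (per_classifier_probs, label), where per_classifier_probs[k]
           is the class-probability vector produced by classifier k.
    weights: list of w_k, one per classifier.
    """
    total = 0.0
    for per_classifier_probs, label in batch:
        total += sum(w * cross_entropy(p, label)
                     for w, p in zip(weights, per_classifier_probs))
    return total / len(batch)
```

Setting every entry of <code>weights</code> to 1 corresponds to the uniform weighting mentioned above.<br />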
<br />
=== Computational Limit Inclusion ===<br />
<br />
When running in a budgeted batch scenario, the network attempts to provide the best overall accuracy. To do this with a set limit on computational resources, it works to use less of the budget on easy detections in order to allow more time to be spent on hard ones. <br />
In order to facilitate this, the classifiers are designed to exit when the confidence of the classification exceeds a preset threshold. The thresholds must be chosen so that <math>|D_{test}|\sum_{k}(q_k C_k) \leq B </math> holds, where <math>|D_{test}|</math> is the total number of test samples, <math>C_k</math> is the computational requirement to get an output from the <math>k</math>th classifier, and <math>q_k </math> is the probability that a sample exits at the <math>k</math>th classifier. Assuming that every classifier has the same probability <math>q</math> of letting a sample that reaches it exit, <math>q_k</math> can be expressed in terms of <math>q</math>, and the constraint can then be solved for <math>q</math> to find the thresholds.<br />
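<br />
One way to make the budget constraint concrete is sketched below. It assumes the exit probabilities follow a normalized geometric form, <math>q_k = z(1-q)^{k-1}q</math> with <math>z</math> chosen so that the <math>q_k</math> sum to 1, which is our reading of the paper, and it assumes the costs <math>C_k</math> are non-decreasing in <math>k</math> so that the expected cost falls as <math>q</math> grows. The bisection solver and all function names are illustrative choices, not the authors' implementation.<br />

```python
def exit_probabilities(q: float, num_classifiers: int) -> list:
    """q_k = z * (1 - q)**(k - 1) * q, normalized so the probabilities sum to 1."""
    raw = [(1.0 - q) ** (k - 1) * q for k in range(1, num_classifiers + 1)]
    z = 1.0 / sum(raw)
    return [z * r for r in raw]


def expected_cost(q: float, costs) -> float:
    """Expected per-sample cost sum_k q_k * C_k."""
    return sum(qk * ck for qk, ck in zip(exit_probabilities(q, len(costs)), costs))


def solve_exit_probability(costs, budget: float, num_samples: int,
                           tol: float = 1e-9) -> float:
    """Smallest q in (0, 1] with num_samples * sum_k q_k * C_k <= budget.

    Assumes costs are non-decreasing, so expected_cost decreases as q increases.
    """
    per_sample = budget / num_samples
    if expected_cost(1.0, costs) > per_sample:
        raise ValueError("budget too small even if every sample exits at the first classifier")
    lo, hi = 1e-9, 1.0
    if expected_cost(lo, costs) <= per_sample:
        return lo  # budget is loose enough for (almost) uniform exits
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_cost(mid, costs) <= per_sample:
            hi = mid
        else:
            lo = mid
    return hi
```

Given the resulting <math>q_k</math>, the confidence thresholds could then be chosen on a validation set so that roughly a fraction <math>q_k</math> of the validation samples exits at the <math>k</math>th classifier.<br />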
<br />
=== Network Reduction and Lazy Evaluation ===<br />
There are two ways to reduce the computational needs of MSDNets:<br />
<br />
# Reduce the size of the network by splitting it into <math>S</math> blocks along the depth dimension and keeping only the coarsest <math>(S-i+1)</math> scales in the <math>i^{\text{th}}</math> block. Whenever a scale is removed, a transition layer merges the concatenated features using a 1x1 convolution and feeds the fine-grained features to the coarser scales.<br />
# Remove unnecessary computations: Group the computation in "diagonal blocks"; this propagates the example along paths that are required for the evaluation of the next classifier.<br />
<br />
This strategy of skipping computations that are not needed for the next classifier, so that no work is wasted when the computational budget runs out, is known as ''lazy evaluation''.<br />
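<br />
The network-reduction rule can be sketched as a simple lookup of which scales survive in each block. Scales are assumed to be 1-indexed with scale 1 the finest, so the coarsest <math>S-i+1</math> scales in block <math>i</math> are scales <math>i, \ldots, S</math>. The name <code>scales_per_block</code> is hypothetical.<br />

```python
def scales_per_block(num_scales: int) -> dict:
    """Map block index i (1-indexed) to the scales kept in that block.

    Block i keeps the coarsest S - i + 1 scales, i.e. scales i..S.
    """
    return {i: list(range(i, num_scales + 1)) for i in range(1, num_scales + 1)}
```

With three scales, the first block keeps all of them, the second drops the finest, and the last keeps only the coarsest scale.<br />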
<br />
= Experiments = <br />
<br />
When evaluating on CIFAR-10 and CIFAR-100, ensembles and multi-classifier versions of ResNets and DenseNets, as well as FractalNet, are compared with MSDNet. <br />
<br />
When evaluating on ImageNet, ensembles and individual versions of ResNets and DenseNets are compared with MSDNets.<br />
<br />
== Anytime Prediction ==<br />
<br />
In anytime prediction, MSDNets are shown to be highly accurate with very little budget and to remain above the alternative methods as the budget increases. The authors attribute this to the fact that MSDNets are able to produce low-resolution feature maps well-suited for classification after just a few layers, in contrast to the high-resolution feature maps in early layers of ResNets or DenseNets. Ensemble networks must recompute similar low-level features every time a new model is evaluated, so their accuracy does not increase as fast as the computational budget increases. <br />
<br />
[[File:MSDNet_anytime.png | 700px|thumb|center|Accuracy of the anytime classification models.]] [[File:cifar10msdnet.png | 700px|thumb|center|CIFAR-10 results.]]<br />
<br />
== Budget Batch ==<br />
<br />
For the budgeted batch setting, three MSDNets are designed with classifiers set up for varying ranges of budget constraints. For both dataset options, the MSDNets exceed all alternative methods with a fraction of the budget required.<br />
<br />
[[File:MSDNet_budgetbatch.png | 700px|thumb|center|Accuracy of the budget batch classification models.]]<br />
<br />
The following figure shows examples of what was deemed "easy" and "hard" examples by the network. The top row contains images of either red wine or volcanos that were easily classified, thus exiting the network early and reducing required computations. The bottom row contains examples of "hard" images that were incorrectly classified by the first classifier but were correctly classified by the last layer.<br />
<br />
[[File:MSDNet_visualizingearlyclassifying.png | 700px|thumb|center|Examples of "hard"/"easy" classification]]<br />
<br />
= Ablation study =<br />
Additional experiments were performed to shed light on the contributions of multi-scale feature maps, dense connectivity, and intermediate classifiers. The experiment started with an MSDNet with six intermediate classifiers, and each of these components was removed, one at a time. To make the comparisons fair, the computational costs of the full networks were kept similar by adapting the network width. After removing all three components, a VGG-like convolutional network is obtained. The classification accuracy of all classifiers is shown in the image below.<br />
<br />
[[File:Screenshot_from_2018-03-29_14-58-03.png]]<br />
<br />
= Critique = <br />
<br />
The problem formulation and evaluation scenarios were well designed, and according to independent reviews, the results were reproducible. Where the paper could improve is in explaining how to implement the threshold; it does not explain well how the validation set is used to set the threshold value.<br />
<br />
= Implementation =<br />
The following repository provides the source code for the paper, written by the authors: https://github.com/gaohuang/MSDNet<br />
<br />
= Sources =<br />
# Huang, G., Chen, D., Li, T., Wu, F., van der Maaten, L., & Weinberger, K. Q. (2018). Multi-Scale Dense Networks for Resource Efficient Image Classification. ICLR 2018. arXiv:1703.09844 <br />
# Huang, G. (n.d.). Gaohuang/MSDNet. Retrieved March 25, 2018, from https://github.com/gaohuang/MSDNet<br />
# LeCun, Yann, John S. Denker, and Sara A. Solla. "Optimal brain damage." Advances in neural information processing systems. 1990.<br />
# Hassibi, Babak, David G. Stork, and Gregory J. Wolff. "Optimal brain surgeon and general network pruning." Neural Networks, 1993., IEEE International Conference on. IEEE, 1993.<br />
# Li, Hao, et al. "Pruning filters for efficient convnets." arXiv preprint arXiv:1608.08710 (2016).<br />
# Hubara, Itay, et al. "Binarized neural networks." Advances in neural information processing systems. 2016.<br />
# Rastegari, Mohammad, et al. "Xnor-net: Imagenet classification using binary convolutional neural networks." European Conference on Computer Vision. Springer, Cham, 2016.<br />
# Cristian Bucilua, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In ACM SIGKDD, pp. 535–541. ACM, 2006.<br />
# Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning Workshop, 2014.<br />
# Teerapittayanon, Surat, Bradley McDanel, and H. T. Kung. "Branchynet: Fast inference via early exiting from deep neural networks." Pattern Recognition (ICPR), 2016 23rd International Conference on. IEEE, 2016.</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Multi-scale_Dense_Networks_for_Resource_Efficient_Image_Classification&diff=36371Multi-scale Dense Networks for Resource Efficient Image Classification2018-04-20T21:33:24Z<p>Ws2chen: /* Training of Early Classifiers Interferes with Later Classifiers */</p>
<hr />
<div>= Introduction = <br />
<br />
Multi-Scale Dense Networks, MSDNets, are designed to address the growing demand for efficient object recognition. The issue with existing recognition networks is that they are either efficient networks that do poorly on hard examples, or large networks that do well on all examples but require a large amount of resources. For example, the winner of the COCO 2016 competition was an [http://image-net.org/challenges/talks/2016/GRMI-COCO-slidedeck.pdf ensemble of CNNs], which is likely far too resource-heavy to be used in any resource-limited application.<br />
<br />
Note: <br />
* There are two kinds of efficiency in this context, computational efficiency and resource efficiency.<br />
* Hard examples arise in several ways, such as a large number of class labels, randomly occluded or zoomed images, or complicated backgrounds that make recognition more difficult. <br />
<br />
In order to be efficient at all difficulty levels, MSDNets propose a structure that can output accurate classifications under varying computational budgets. The two settings used to evaluate the network are:<br />
* Anytime Prediction: What is the best prediction the network can provide when suddenly prompted?<br />
* Budget Batch Predictions: Given a maximum amount of computational resources, how well does the network do on the batch?<br />
<br />
= Related Networks =<br />
<br />
== Computationally Efficient Networks ==<br />
<br />
Much of the existing work on convolutional networks that are computationally efficient at test time focuses on reducing model size after training. Existing methods for making an accurate network more efficient include weight pruning [3,4,5], quantization of weights [6,7] (during or after training), and knowledge distillation [8,9], which trains smaller student networks to reproduce the output of a much larger teacher network. The proposed work differs from these approaches in that it trains a single model that trades off computation for accuracy at test time without re-training or fine-tuning.<br />
<br />
== Resource Efficient Networks == <br />
<br />
Unlike the above, resource efficient concepts consider limited resources as a part of the structure/loss.<br />
Examples of work in this area include: <br />
* Efficient variants to existing state of the art networks<br />
* Gradient boosted decision trees, which incorporate computational limitations into the training<br />
* Fractal nets<br />
* Adaptive computation time method<br />
<br />
== Related architectures ==<br />
<br />
MSDNets pull on concepts from a number of existing networks:<br />
* Neural fabrics and related architectures are used to quickly establish a low-resolution feature map, which is integral for classification.<br />
* Deeply supervised nets introduced the incorporation of multiple classifiers throughout the network. (For example, a BranchyNet (Teerapittayanon et al., 2016) is a deeply supervised network explicitly designed for efficiency. A BranchyNet has multiple exit branches at various depths, each leading to a softmax classifier. At test time, if a classifier on an early exit branch makes a confident prediction, the rest of the network need not be evaluated. However, unlike in MSDNets, the early classifiers in BranchyNets do not have access to low-resolution features.)<br />
* The feature concatenation method from DenseNets (a DenseNet is a CNN with shorter connections between layers close to the input and layers close to the output) allows the later classifiers to not be disrupted by the weight updates from earlier classifiers.<br />
<br />
= Problem Setup =<br />
The authors consider two settings that impose computational constraints at prediction time.<br />
<br />
== Anytime Prediction ==<br />
In the anytime prediction setting (Grubb & Bagnell, 2012), there is a finite computational budget <math>B > 0</math> available for each test example <math>x</math>. Once the budget is exhausted, the prediction for the class is output using early exit. The budget is nondeterministic and varies per test instance.<br />
They assume that the test example and the budget are drawn from some joint distribution <math>P(x,B)</math>. They denote the loss of a model <math>f(x)</math> that has to produce a prediction for instance <math>x</math> within a budget of <math>B</math> by <math>L(f(x),B)</math>. The goal of the anytime learner is to minimize the expected loss under the budget distribution: <math>L(f)=\mathbb{E}_{P(x,B)}[L(f(x),B)]</math>.<br />
<br />
== Budgeted Batch Classification ==<br />
In the budgeted batch classification setting, the model needs to classify a set of examples <math>D_{test} = \{x_1, \ldots, x_M\}</math> within a finite computational budget <math>B > 0</math> that is known in advance. The learner aims to minimize the loss across all examples in <math>D_{test}</math>, within a cumulative cost bounded by <math>B</math>, which is denoted as <math>L(f(D_{test}),B)</math> for some suitable loss function <math>L</math>.<br />
<br />
= Multi-Scale Dense Networks =<br />
Two solutions to the problems mentioned above: <br />
<br />
* Train multiple networks of increasing capacity, and evaluate them at test time.<br />
**Anytime setting: the evaluation can be stopped at any time point and return the most recent prediction<br />
**Batch setting: the evaluation is stopped as soon as a network classifies the test sample with sufficient confidence.<br />
* Build a deep network with a cascade of classifiers operating on the features of internal layers.<br />
<br />
== Integral Contributions ==<br />
<br />
MSDNets aim to provide efficient classification under varying computational costs by creating a single network that outputs results at multiple depths. While this may seem trivial, since intermediate classifiers can be inserted into any existing network, two major problems arise.<br />
<br />
=== Coarse Level Features Needed For Classification ===<br />
<br />
[[File:paper29 fig3.png | 700px|thumb|center]]<br />
<br />
The term coarse-level feature refers to a feature map in a CNN with low spatial resolution. There are several ways to create such features, typically referred to as downsampling. Examples of layers that perform this function are max pooling, average pooling, and strided convolution. In this architecture, strided convolution is used to create coarse features. <br />
<br />
'''Concern:''' Coarse level features are needed to gain context of scene. In typical CNN based networks, the features propagate from fine to coarse. Classifiers added to the early, fine featured, layers do not output accurate predictions due to the lack of context.<br />
<br />
Figure 3 depicts the relative accuracies of the intermediate classifiers and shows that the accuracy of a classifier is highly correlated with its position in the network. It is easy to see, particularly in the case of ResNet, that the classifiers improve in a staircase pattern. All of the experiments were performed on the CIFAR-100 dataset, and the intermediate classifiers perform worse than the final classifier, which highlights the problem caused by the lack of coarse-level features early on.<br />
<br />
'''Solution:''' To address this issue, MSDNets propose an architecture that uses multi-scale feature maps. The feature maps at a particular layer and scale are computed by concatenating results from up to two convolutions: a standard convolution is first applied to same-scale features from the previous layer to pass on high-resolution information that subsequent layers can use to construct better coarse features, and, if possible, a strided convolution is also applied to the finer-scale feature map from the previous layer to produce coarser features amenable to classification. The network is quickly formed to contain a set number of scales ranging from fine to coarse. These scales are propagated throughout, so that for the length of the network there are always coarse-level features for classification and fine features for learning more difficult representations.<br />
<br />
=== Training of Early Classifiers Interferes with Later Classifiers ===<br />
<br />
'''Concerns:''' When training a network containing intermediate classifiers, the training of early classifiers will cause the early layers to focus on features for that classifier. These learned features may not be as useful to the later classifiers and degrade their accuracy.<br />
<br />
'''Solution:''' MSDNets use dense connectivity to avoid this issue. By concatenating all prior layers to learn future layers, the gradient propagation is spread throughout the available features. This allows later layers to not be reliant on any single prior, providing opportunities to learn new features that priors have ignored.<br />
<br />
== Architecture ==<br />
<br />
[[File:MSDNet_arch.png | 700px|thumb|center|Left: the MSDNet architecture. Right: example calculations for each output given 3 scales and 4 layers.]]<br />
<br />
The architecture of MSDNet is a structure of convolutions with a set number of layers and a set number of scales. Layers allow the network to build on the previous information to generate more accurate predictions, while the scales allow the network to maintain coarse level features throughout.<br />
<br />
The first layer is a special mini-CNN that quickly fills all required scales with features. The following layers are generated through convolutions of the previous layers and scales.<br />
<br />
Each output at a given scale <math>s</math> is the concatenation of a convolution of all prior outputs at the same scale and a strided convolution of all prior outputs from the previous (finer) scale <math>s-1</math>; the finest scale has only the former. <br />
<br />
The classifiers consist of two convolutional layers, an average pooling layer, and a linear layer, and are run on the concatenation of all of the coarsest outputs from the preceding layers.<br />
<br />
=== Loss Function ===<br />
<br />
The loss is calculated as a weighted sum of each classifier's logistic loss: <br />
<br />
<math>\frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} \sum_{k}w_k L(f_k) </math><br />
<br />
Here <math>w_k</math> represents the weight of the <math>k</math>th classifier and <math>L(f_k)</math> represents its logistic loss. The weighted loss is averaged over the set of training samples <math>\mathcal{D}</math>. The weights can be determined from a budget of computational power, but results also show that setting all of them to 1 is acceptable.<br />
<br />
=== Computational Limit Inclusion ===<br />
<br />
When running in a budgeted batch scenario, the network attempts to provide the best overall accuracy. To do this with a set limit on computational resources, it works to use less of the budget on easy detections in order to allow more time to be spent on hard ones. <br />
In order to facilitate this, the classifiers are designed to exit when the confidence of the classification exceeds a preset threshold. The thresholds must be chosen so that <math>|D_{test}|\sum_{k}(q_k C_k) \leq B </math> holds, where <math>|D_{test}|</math> is the total number of test samples, <math>C_k</math> is the computational requirement to get an output from the <math>k</math>th classifier, and <math>q_k </math> is the probability that a sample exits at the <math>k</math>th classifier. Assuming that every classifier has the same probability <math>q</math> of letting a sample that reaches it exit, <math>q_k</math> can be expressed in terms of <math>q</math>, and the constraint can then be solved for <math>q</math> to find the thresholds.<br />
<br />
=== Network Reduction and Lazy Evaluation ===<br />
There are two ways to reduce the computational needs of MSDNets:<br />
<br />
# Reduce the size of the network by splitting it into <math>S</math> blocks along the depth dimension and keeping only the coarsest <math>(S-i+1)</math> scales in the <math>i^{\text{th}}</math> block. Whenever a scale is removed, a transition layer merges the concatenated features using a 1x1 convolution and feeds the fine-grained features to the coarser scales.<br />
# Remove unnecessary computations: Group the computation in "diagonal blocks"; this propagates the example along paths that are required for the evaluation of the next classifier.<br />
<br />
This strategy of skipping computations that are not needed for the next classifier, so that no work is wasted when the computational budget runs out, is known as ''lazy evaluation''.<br />
<br />
= Experiments = <br />
<br />
When evaluating on CIFAR-10 and CIFAR-100, ensembles and multi-classifier versions of ResNets and DenseNets, as well as FractalNet, are compared with MSDNet. <br />
<br />
When evaluating on ImageNet, ensembles and individual versions of ResNets and DenseNets are compared with MSDNets.<br />
<br />
== Anytime Prediction ==<br />
<br />
In anytime prediction, MSDNets are shown to be highly accurate with very little budget and to remain above the alternative methods as the budget increases. The authors attribute this to the fact that MSDNets are able to produce low-resolution feature maps well-suited for classification after just a few layers, in contrast to the high-resolution feature maps in early layers of ResNets or DenseNets. Ensemble networks must recompute similar low-level features every time a new model is evaluated, so their accuracy does not increase as fast as the computational budget increases. <br />
<br />
[[File:MSDNet_anytime.png | 700px|thumb|center|Accuracy of the anytime classification models.]] [[File:cifar10msdnet.png | 700px|thumb|center|CIFAR-10 results.]]<br />
<br />
== Budget Batch ==<br />
<br />
For the budgeted batch setting, three MSDNets are designed with classifiers set up for varying ranges of budget constraints. For both dataset options, the MSDNets exceed all alternative methods with a fraction of the budget required.<br />
<br />
[[File:MSDNet_budgetbatch.png | 700px|thumb|center|Accuracy of the budget batch classification models.]]<br />
<br />
The following figure shows examples of what was deemed "easy" and "hard" examples by the network. The top row contains images of either red wine or volcanos that were easily classified, thus exiting the network early and reducing required computations. The bottom row contains examples of "hard" images that were incorrectly classified by the first classifier but were correctly classified by the last layer.<br />
<br />
[[File:MSDNet_visualizingearlyclassifying.png | 700px|thumb|center|Examples of "hard"/"easy" classification]]<br />
<br />
= Ablation study =<br />
Additional experiments were performed to shed light on the contributions of multi-scale feature maps, dense connectivity, and intermediate classifiers. The experiment started with an MSDNet with six intermediate classifiers, and each of these components was removed, one at a time. To make the comparisons fair, the computational costs of the full networks were kept similar by adapting the network width. After removing all three components, a VGG-like convolutional network is obtained. The classification accuracy of all classifiers is shown in the image below.<br />
<br />
[[File:Screenshot_from_2018-03-29_14-58-03.png]]<br />
<br />
= Critique = <br />
<br />
The problem formulation and evaluation scenarios were well designed, and according to independent reviews, the results were reproducible. Where the paper could improve is in explaining how to implement the threshold; it does not explain well how the validation set is used to set the threshold value.<br />
<br />
= Implementation =<br />
The following repository provides the source code for the paper, written by the authors: https://github.com/gaohuang/MSDNet<br />
<br />
= Sources =<br />
# Huang, G., Chen, D., Li, T., Wu, F., van der Maaten, L., & Weinberger, K. Q. (2018). Multi-Scale Dense Networks for Resource Efficient Image Classification. ICLR 2018. arXiv:1703.09844 <br />
# Huang, G. (n.d.). Gaohuang/MSDNet. Retrieved March 25, 2018, from https://github.com/gaohuang/MSDNet<br />
# LeCun, Yann, John S. Denker, and Sara A. Solla. "Optimal brain damage." Advances in neural information processing systems. 1990.<br />
# Hassibi, Babak, David G. Stork, and Gregory J. Wolff. "Optimal brain surgeon and general network pruning." Neural Networks, 1993., IEEE International Conference on. IEEE, 1993.<br />
# Li, Hao, et al. "Pruning filters for efficient convnets." arXiv preprint arXiv:1608.08710 (2016).<br />
# Hubara, Itay, et al. "Binarized neural networks." Advances in neural information processing systems. 2016.<br />
# Rastegari, Mohammad, et al. "Xnor-net: Imagenet classification using binary convolutional neural networks." European Conference on Computer Vision. Springer, Cham, 2016.<br />
# Cristian Bucilua, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In ACM SIGKDD, pp. 535–541. ACM, 2006.<br />
# Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning Workshop, 2014.<br />
# Teerapittayanon, Surat, Bradley McDanel, and H. T. Kung. "Branchynet: Fast inference via early exiting from deep neural networks." Pattern Recognition (ICPR), 2016 23rd International Conference on. IEEE, 2016.</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Multi-scale_Dense_Networks_for_Resource_Efficient_Image_Classification&diff=36367Multi-scale Dense Networks for Resource Efficient Image Classification2018-04-20T21:25:39Z<p>Ws2chen: /* Coarse Level Features Needed For Classification */</p>
<hr />
<div>= Introduction = <br />
<br />
Multi-Scale Dense Networks (MSDNets) are designed to address the growing demand for efficient object recognition. Existing recognition networks are either small, efficient networks that perform poorly on hard examples, or large networks that perform well on all examples but require a large amount of resources. For example, the winner of the COCO 2016 competition was an [http://image-net.org/challenges/talks/2016/GRMI-COCO-slidedeck.pdf ensemble of CNNs], which is likely far too resource-heavy to be used in any resource-limited application.<br />
<br />
Note: <br />
* There are two kinds of efficiency in this context, computational efficiency and resource efficiency.<br />
* Examples can be hard for several reasons, such as a large number of classification labels, randomly blocked or zoomed images, or a complicated background that makes recognition more difficult. <br />
<br />
To remain efficient across all difficulty levels, MSDNets propose a structure that can output accurate classifications under varying computational budgets. The two settings used to evaluate the network are:<br />
* Anytime Prediction: What is the best prediction the network can provide when suddenly prompted?<br />
* Budget Batch Predictions: Given a maximum amount of computational resources, how well does the network do on the batch?<br />
<br />
= Related Networks =<br />
<br />
== Computationally Efficient Networks ==<br />
<br />
Much of the existing work on convolutional networks that are computationally efficient at test time focuses on reducing model size after training. Methods for making an accurate network more efficient include weight pruning [3,4,5], quantization of weights [6,7] (during or after training), and knowledge distillation [8,9], which trains a smaller student network to reproduce the output of a much larger teacher network. The proposed work differs from these approaches in that it trains a single model that trades computational efficiency for accuracy at test time without re-training or fine-tuning.<br />
<br />
== Resource Efficient Networks == <br />
<br />
Unlike the above, resource efficient concepts consider limited resources as a part of the structure/loss.<br />
Examples of work in this area include: <br />
* Efficient variants to existing state of the art networks<br />
* Gradient boosted decision trees, which incorporate computational limitations into the training<br />
* Fractal nets<br />
* Adaptive computation time method<br />
<br />
== Related architectures ==<br />
<br />
MSDNets pull on concepts from a number of existing networks:<br />
* Neural fabrics and related architectures are used to quickly establish a low-resolution feature map, which is integral for classification.<br />
* Deeply supervised nets introduced the incorporation of multiple classifiers throughout the network. For example, BranchyNet (Teerapittayanon et al., 2016) is a deeply supervised network explicitly designed for efficiency. It has multiple exit branches at various depths, each leading to a softmax classifier. At test time, if a classifier on an early exit branch makes a confident prediction, the rest of the network need not be evaluated. However, unlike in MSDNets, the early classifiers in BranchyNet do not have access to low-resolution features.<br />
* The feature concatenation method from DenseNets (a DenseNet is a CNN with shorter connections between layers close to the input and layers close to the output) allows the later classifiers to not be disrupted by the weight updates from earlier classifiers.<br />
<br />
= Problem Setup =<br />
The authors consider two settings that impose computational constraints at prediction time.<br />
<br />
== Anytime Prediction ==<br />
In the anytime prediction setting (Grubb & Bagnell, 2012), there is a finite computational budget <math>B > 0</math> available for each test example <math>x</math>. Once the budget is exhausted, the prediction for the class is output using early exit. The budget is nondeterministic and varies per test instance.<br />
They assume that the budget is drawn from some joint distribution <math>P(x,B)</math>. They denote the loss of a model <math>f(x)</math> that has to produce a prediction for instance <math>x</math> within a budget of <math>B</math> by <math>L(f(x),B)</math>. The goal of the anytime learner is to minimize the expected loss under the budget distribution: <math>L(f)=\mathbb{E}_{P(x,B)}[L(f(x),B)]</math>.<br />
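The anytime objective can be illustrated with a small Monte Carlo estimate. This is only a sketch under stated assumptions: the helper names (<code>anytime_predict</code>, <code>expected_anytime_loss</code>) are hypothetical, the loss is taken to be the 0/1 error, each classifier's cumulative cost is assumed to be known, and the sample and the budget are drawn independently, which is a simplification of the joint distribution <math>P(x,B)</math>.

```python
import random


def anytime_predict(costs, predictions, budget):
    """Return the prediction of the last classifier whose cumulative cost fits the budget.

    costs[k] is the (assumed known, increasing) cumulative cost of reaching classifier k.
    Returns None if even the first classifier is too expensive.
    """
    latest = None
    for cost, pred in zip(costs, predictions):
        if cost > budget:
            break
        latest = pred
    return latest


def expected_anytime_loss(samples, costs, budget_sampler, n_draws=1000, seed=0):
    """Monte Carlo estimate of the expected 0/1 loss under a budget distribution.

    samples: list of (per_classifier_predictions, true_label) pairs.
    budget_sampler: function taking a random.Random and returning a budget B.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        predictions, label = rng.choice(samples)
        budget = budget_sampler(rng)
        # A missing prediction (None) is counted as an error.
        total += float(anytime_predict(costs, predictions, budget) != label)
    return total / n_draws
```

With a larger budget, later (and typically more accurate) classifiers become reachable, so the estimated loss should not increase as the budget distribution shifts upwards.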
<br />
== Budgeted Batch Classification ==<br />
In the budgeted batch classification setting, the model needs to classify a set of examples <math>D_{test} = \{x_1, \ldots, x_M\}</math> within a finite computational budget <math>B > 0</math> that is known in advance. The learner aims to minimize the loss across all examples in <math>D_{test}</math>, within a cumulative cost bounded by <math>B</math>, which is denoted as <math>L(f(D_{test}),B)</math> for some suitable loss function <math>L</math>.<br />
<br />
= Multi-Scale Dense Networks =<br />
Two solutions to the problems mentioned above: <br />
<br />
* Train multiple networks of increasing capacity, and evaluate them at test time.<br />
**Anytime setting: the evaluation can be stopped at any point, and the most recent prediction is returned.<br />
**Batch setting: the evaluation of a sample is stopped early, without running the remaining networks, as soon as a network classifies it well enough.<br />
* Build a deep network with a cascade of classifiers operating on the features of internal layers.<br />
<br />
== Integral Contributions ==<br />
<br />
MSDNets aim to provide efficient classification under varying computational costs by creating a single network that outputs results at multiple depths. While this may seem trivial, as intermediate classifiers can be inserted into any existing network, two major problems arise.<br />
<br />
=== Coarse Level Features Needed For Classification ===<br />
<br />
[[File:paper29 fig3.png | 700px|thumb|center]]<br />
<br />
The term coarse-level feature refers to a low-resolution feature map in a CNN. There are several ways to create such features, typically referred to as downsampling. Examples of layers that perform this function are max pooling, average pooling, and strided convolution. In this architecture, strided convolution is used to create coarse features. <br />
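To make the downsampling effect concrete, the following is a minimal pure-Python sketch of a single-channel strided convolution on a square image. The function name and the default stride and padding are illustrative assumptions, not values taken from the paper. The output side length follows the usual formula <math>\lfloor (n + 2p - k)/s \rfloor + 1</math> for input size <math>n</math>, padding <math>p</math>, kernel size <math>k</math> and stride <math>s</math>.

```python
def strided_conv2d(image, kernel, stride=2, padding=1):
    """Single-channel 2D convolution (cross-correlation) with zero padding.

    image: square list of lists of floats; kernel: square list of lists of floats.
    A stride greater than 1 yields a coarser (lower-resolution) output map.
    """
    k = len(kernel)
    size = len(image) + 2 * padding
    # Zero-pad the image on all four sides.
    padded = [[0.0] * size for _ in range(size)]
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            padded[r + padding][c + padding] = value
    out_size = (size - k) // stride + 1
    output = []
    for i in range(out_size):
        out_row = []
        for j in range(out_size):
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    acc += padded[i * stride + a][j * stride + b] * kernel[a][b]
            out_row.append(acc)
        output.append(out_row)
    return output
```

For example, an 8x8 input with a 3x3 kernel, stride 2 and padding 1 produces a 4x4 map, halving the resolution in each dimension.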
<br />
'''Concern:''' Coarse level features are needed to gain context of scene. In typical CNN based networks, the features propagate from fine to coarse. Classifiers added to the early, fine featured, layers do not output accurate predictions due to the lack of context.<br />
<br />
Figure 3 depicts the relative accuracies of the intermediate classifiers and shows that the accuracy of a classifier is highly correlated with its position in the network. It is easy to see, particularly in the case of ResNet, that the classifiers improve in a staircase pattern. All of the experiments were performed on the CIFAR-100 dataset, and the intermediate classifiers perform worse than the final classifier, highlighting the lack of coarse-level features early in the network.<br />
<br />
'''Solution:''' To address this issue, MSDNets propose an architecture that uses multi-scale feature maps. The feature maps at a particular layer and scale are computed by concatenating the results of up to two convolutions: a standard convolution is applied to the same-scale features from the previous layer to pass on high-resolution information that subsequent layers can use to construct better coarse features, and, where a finer scale exists, a strided convolution is also applied to the finer-scale feature map from the previous layer to produce coarser features amenable to classification. The network quickly builds a fixed number of scales ranging from fine to coarse. These scales are propagated throughout, so that along the whole length of the network there are always coarse-level features for classification and fine features for learning more difficult representations.<br />
<br />
=== Training of Early Classifiers Interferes with Later Classifiers ===<br />
<br />
When training a network containing intermediate classifiers, the training of early classifiers will cause the early layers to focus on features for that classifier. These learned features may not be as useful to the later classifiers and degrade their accuracy.<br />
<br />
MSDNets use dense connectivity to avoid this issue. By concatenating all prior layers to learn future layers, the gradient propagation is spread throughout the available features. This allows later layers to not be reliant on any single prior, providing opportunities to learn new features that priors have ignored.<br />
<br />
== Architecture ==<br />
<br />
[[File:MSDNet_arch.png | 700px|thumb|center|Left: the MSDNet architecture. Right: example calculations for each output given 3 scales and 4 layers.]]<br />
<br />
The architecture of MSDNet is a structure of convolutions with a set number of layers and a set number of scales. Layers allow the network to build on the previous information to generate more accurate predictions, while the scales allow the network to maintain coarse level features throughout.<br />
<br />
The first layer is a special mini-CNN that quickly fills all required scales with features. The following layers are generated through convolutions of the previous layers and scales.<br />
<br />
Each output at a given scale <math>s</math> is the concatenation of a convolution of all prior outputs at the same scale and a strided convolution of all prior outputs at the previous (finer) scale. <br />
<br />
Each classifier consists of two convolutional layers, an average pooling layer and a linear layer, and is run on the concatenation of all of the coarsest outputs from the preceding layers.<br />
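The connectivity described in this section can be sketched as a small lookup that lists which earlier outputs feed the cell at a given layer and scale. The function <code>cell_inputs</code> and its return format are hypothetical; layers and scales are assumed to be indexed from 1, with scale 1 the finest, and the first layer is assumed to build each coarser scale from the next finer one.

```python
def cell_inputs(layer, scale):
    """Describe which outputs feed the cell at (layer, scale), both 1-indexed.

    Returns a dict with two lists of (layer, scale) sources:
      "conv"         -> inputs passed through a regular convolution (same scale)
      "strided_conv" -> inputs passed through a strided convolution (finer scale)
    The two results are concatenated to form the cell's output.
    """
    if layer == 1:
        # First layer: the finest scale sees the raw input; every coarser
        # scale is produced by down-sampling the next finer one.
        if scale == 1:
            return {"conv": ["input"], "strided_conv": []}
        return {"conv": [], "strided_conv": [(1, scale - 1)]}
    # Dense connectivity: all earlier layers at the same scale ...
    same_scale = [(l, scale) for l in range(1, layer)]
    # ... plus all earlier layers at the finer scale, if one exists.
    finer_scale = [(l, scale - 1) for l in range(1, layer)] if scale > 1 else []
    return {"conv": same_scale, "strided_conv": finer_scale}
```

For instance, <code>cell_inputs(3, 2)</code> reports that the cell at layer 3, scale 2 combines a regular convolution over layers 1 and 2 at scale 2 with a strided convolution over layers 1 and 2 at scale 1.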
<br />
=== Loss Function ===<br />
<br />
The loss is calculated as a weighted sum of each classifier's logistic loss: <br />
<br />
<math>\frac{1}{|\mathcal{D}|} \sum_{x,y \in \mathcal{D}} \sum_{k}w_k L(f_k) </math><br />
<br />
Here <math>w_k</math> represents the weight of the <math>k</math>th classifier and <math>L(f_k)</math> represents its logistic loss. The weighted loss is averaged over the set of training samples <math>\mathcal{D}</math>. The weights can be chosen according to the computational budget, but the results show that setting all of them to 1 also works well.<br />
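The weighted cumulative loss can be written out in a few lines. This is a sketch under assumptions: the per-classifier loss is taken to be the multi-class cross-entropy (treated here as the logistic loss named in the text), each classifier is assumed to output a probability vector, and the function names are illustrative.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])


def weighted_cumulative_loss(batch, weights):
    """Average over the batch of the weighted sum of per-classifier losses.

    batch: list of (per_classifier_probs, label), where per_classifier_probs[k]
           is the probability vector output by the k-th classifier.
    weights: list of w_k, one per classifier.
    """
    total = 0.0
    for per_classifier_probs, label in batch:
        for w, probs in zip(weights, per_classifier_probs):
            total += w * cross_entropy(probs, label)
    return total / len(batch)
```

Setting every entry of <code>weights</code> to 1 corresponds to the uniform weighting mentioned above.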
<br />
=== Computational Limit Inclusion ===<br />
<br />
When running in a budgeted batch scenario, the network attempts to provide the best overall accuracy. To do this with a set limit on computational resources, it works to use less of the budget on easy detections in order to allow more time to be spent on hard ones. <br />
In order to facilitate this, each classifier lets a sample exit as soon as the confidence of its classification exceeds a preset threshold. The thresholds must be chosen so that <math>|D_{test}|\sum_{k}(q_k C_k) \leq B </math> holds, where <math>|D_{test}|</math> is the total number of test samples, <math>C_k</math> is the computational cost of obtaining an output from the <math>k</math>th classifier, and <math>q_k </math> is the probability that a sample exits at the <math>k</math>th classifier. Assuming that a sample reaching any classifier exits there with the same base probability <math>q</math>, each <math>q_k</math> can be expressed in terms of <math>q</math>; solving the budget constraint for <math>q</math> then determines the thresholds.<br />
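One way to make the budget constraint concrete is sketched below. It assumes a geometric form for the exit probabilities, <math>q_k = z(1-q)^{k-1}q</math> with <math>z</math> a normalizing constant, which is one reading of "the same base probability"; it also assumes the costs <math>C_k</math> increase with <math>k</math>, so the expected cost decreases as <math>q</math> grows and can be solved by bisection. All function names are hypothetical, and mapping the resulting <math>q_k</math> to confidence thresholds (for example on a validation set) is not shown.

```python
def exit_distribution(q, n_classifiers):
    """Normalized probabilities q_k that a sample exits at classifier k (0-indexed)."""
    raw = [(1.0 - q) ** k * q for k in range(n_classifiers)]
    z = 1.0 / sum(raw)  # normalizing constant so the q_k sum to 1
    return [z * r for r in raw]


def expected_cost(q, costs, n_samples):
    """|D_test| * sum_k q_k * C_k for a given base exit probability q."""
    dist = exit_distribution(q, len(costs))
    return n_samples * sum(qk * c for qk, c in zip(dist, costs))


def solve_exit_probability(costs, n_samples, budget, tol=1e-9):
    """Smallest q (up to tol) whose expected cost stays within the budget."""
    lo, hi = 1e-9, 1.0 - 1e-9
    if expected_cost(hi, costs, n_samples) > budget:
        raise ValueError("budget too small even if nearly all samples exit first")
    if expected_cost(lo, costs, n_samples) <= budget:
        return lo  # budget is loose: samples may exit almost uniformly
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_cost(mid, costs, n_samples) > budget:
            lo = mid
        else:
            hi = mid
    return hi


def early_exit(outputs, thresholds):
    """Return (label, classifier index) of the first sufficiently confident classifier.

    outputs: list of (confidence, label) per classifier; the last one always exits.
    """
    for k, ((confidence, label), threshold) in enumerate(zip(outputs, thresholds)):
        if confidence >= threshold or k == len(outputs) - 1:
            return label, k
```

A smaller <math>q</math> lets more samples travel deeper into the network, so choosing the smallest feasible value spends as much of the budget as the constraint allows.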
<br />
=== Network Reduction and Lazy Evaluation ===<br />
There are two ways to reduce the computational needs of MSDNets:<br />
<br />
# Reduce the size of the network by splitting it into <math>S</math> blocks along the depth dimension and keeping only the coarsest <math>(S-i+1)</math> scales in the <math>i^{\text{th}}</math> block. Whenever a scale is removed, a transition layer merges the concatenated features using a 1x1 convolution and feeds the fine-grained features to the coarser scales.<br />
# Remove unnecessary computations: Group the computation in "diagonal blocks"; this propagates the example along paths that are required for the evaluation of the next classifier.<br />
<br />
The strategy of skipping computations that are not needed for the next classifier, so that nothing is wasted when the computational budget runs out, is known as ''lazy evaluation''.<br />
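The first reduction strategy can be illustrated with a short helper that lists which scales survive in each block. This is a sketch that assumes scales are indexed from 1 (finest) to <math>S</math> (coarsest) and that the scales dropped in later blocks are the finest ones; the function name is hypothetical.

```python
def scales_per_block(n_scales):
    """Map block index i (1-indexed) to the scales it keeps.

    Block i keeps S - i + 1 scales; under the assumption that the finest
    scales are dropped first, these are scales i, i + 1, ..., S.
    """
    return {i: list(range(i, n_scales + 1)) for i in range(1, n_scales + 1)}
```

With three scales, the first block keeps all of them, the second keeps the two coarsest, and the last keeps only the coarsest.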
<br />
= Experiments = <br />
<br />
When evaluating on CIFAR-10 and CIFAR-100, ensembles and multi-classifier versions of ResNets and DenseNets, as well as FractalNet, are compared with MSDNet. <br />
<br />
When evaluating on ImageNet, ensembles and individual versions of ResNets and DenseNets are compared with MSDNets.<br />
<br />
== Anytime Prediction ==<br />
<br />
In anytime prediction, MSDNets are shown to be highly accurate with very little budget, and they remain above the alternative methods as the budget increases. The authors attribute this to the fact that MSDNets are able to produce low-resolution feature maps well-suited for classification after just a few layers, in contrast to the high-resolution feature maps in the early layers of ResNets or DenseNets. Ensembles need to recompute similar low-level features each time a new model is evaluated, so their accuracy does not increase as quickly as the computational budget grows. <br />
<br />
[[File:MSDNet_anytime.png | 700px|thumb|center|Accuracy of the anytime classification models.]] [[File:cifar10msdnet.png | 700px|thumb|center|CIFAR-10 results.]]<br />
<br />
== Budget Batch ==<br />
<br />
For the budgeted batch setting, three MSDNets are designed, with classifiers set up for different ranges of budget constraints. On both datasets, the MSDNets exceed all alternative methods while using only a fraction of the budget.<br />
<br />
[[File:MSDNet_budgetbatch.png | 700px|thumb|center|Accuracy of the budget batch classification models.]]<br />
<br />
The following figure shows examples of what was deemed "easy" and "hard" examples by the network. The top row contains images of either red wine or volcanos that were easily classified, thus exiting the network early and reducing required computations. The bottom row contains examples of "hard" images that were incorrectly classified by the first classifier but were correctly classified by the last layer.<br />
<br />
[[File:MSDNet_visualizingearlyclassifying.png | 700px|thumb|center|Examples of "hard"/"easy" classification]]<br />
<br />
= Ablation study =<br />
Additional experiments were performed to shed light on the contributions of multi-scale feature maps, dense connectivity, and intermediate classifiers. The experiment started with an MSDNet with six intermediate classifiers, and each of these components was removed, one at a time. To make the comparisons fair, the computational costs of the full networks were kept similar by adapting the network width. After removing all three components, a VGG-like convolutional network is obtained. The classification accuracy of all classifiers is shown in the image below.<br />
<br />
[[File:Screenshot_from_2018-03-29_14-58-03.png]]<br />
<br />
= Critique = <br />
<br />
The problem formulation and the evaluation scenarios were well designed, and according to independent reviews, the results were reproducible. The paper could be improved by explaining how to implement the threshold; it does not clearly explain how the validation set is used to set the threshold value.<br />
<br />
= Implementation =<br />
The following repository provides the source code for the paper, written by the authors: https://github.com/gaohuang/MSDNet<br />
<br />
= Sources =<br />
# Huang, G., Chen, D., Li, T., Wu, F., van der Maaten, L., & Weinberger, K. Q. (2018). Multi-Scale Dense Networks for Resource Efficient Image Classification. ICLR 2018. arXiv:1703.09844 <br />
# Huang, G. (n.d.). Gaohuang/MSDNet. Retrieved March 25, 2018, from https://github.com/gaohuang/MSDNet<br />
# LeCun, Yann, John S. Denker, and Sara A. Solla. "Optimal brain damage." Advances in neural information processing systems. 1990.<br />
# Hassibi, Babak, David G. Stork, and Gregory J. Wolff. "Optimal brain surgeon and general network pruning." Neural Networks, 1993., IEEE International Conference on. IEEE, 1993.<br />
# Li, Hao, et al. "Pruning filters for efficient convnets." arXiv preprint arXiv:1608.08710 (2016).<br />
# Hubara, Itay, et al. "Binarized neural networks." Advances in neural information processing systems. 2016.<br />
# Rastegari, Mohammad, et al. "Xnor-net: Imagenet classification using binary convolutional neural networks." European Conference on Computer Vision. Springer, Cham, 2016.<br />
# Cristian Bucilua, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In ACM SIGKDD, pp. 535–541. ACM, 2006.<br />
# Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning Workshop, 2014.<br />
# Teerapittayanon, Surat, Bradley McDanel, and H. T. Kung. "Branchynet: Fast inference via early exiting from deep neural networks." Pattern Recognition (ICPR), 2016 23rd International Conference on. IEEE, 2016.</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Multi-scale_Dense_Networks_for_Resource_Efficient_Image_Classification&diff=36366Multi-scale Dense Networks for Resource Efficient Image Classification2018-04-20T21:25:08Z<p>Ws2chen: /* Coarse Level Features Needed For Classification */</p>
<hr />
<div>= Introduction = <br />
<br />
Multi-Scale Dense Networks (MSDNets) are designed to address the growing demand for efficient object recognition. Existing recognition networks are either small, efficient networks that perform poorly on hard examples, or large networks that perform well on all examples but require a large amount of resources. For example, the winner of the COCO 2016 competition was an [http://image-net.org/challenges/talks/2016/GRMI-COCO-slidedeck.pdf ensemble of CNNs], which is likely far too resource-heavy to be used in any resource-limited application.<br />
<br />
Note: <br />
* There are two kinds of efficiency in this context, computational efficiency and resource efficiency.<br />
* Examples can be hard for several reasons, such as a large number of classification labels, randomly blocked or zoomed images, or a complicated background that makes recognition more difficult. <br />
<br />
To remain efficient across all difficulty levels, MSDNets propose a structure that can output accurate classifications under varying computational budgets. The two settings used to evaluate the network are:<br />
* Anytime Prediction: What is the best prediction the network can provide when suddenly prompted?<br />
* Budget Batch Predictions: Given a maximum amount of computational resources, how well does the network do on the batch?<br />
<br />
= Related Networks =<br />
<br />
== Computationally Efficient Networks ==<br />
<br />
Much of the existing work on convolutional networks that are computationally efficient at test time focuses on reducing model size after training. Methods for making an accurate network more efficient include weight pruning [3,4,5], quantization of weights [6,7] (during or after training), and knowledge distillation [8,9], which trains a smaller student network to reproduce the output of a much larger teacher network. The proposed work differs from these approaches in that it trains a single model that trades computational efficiency for accuracy at test time without re-training or fine-tuning.<br />
<br />
== Resource Efficient Networks == <br />
<br />
Unlike the above, resource efficient concepts consider limited resources as a part of the structure/loss.<br />
Examples of work in this area include: <br />
* Efficient variants to existing state of the art networks<br />
* Gradient boosted decision trees, which incorporate computational limitations into the training<br />
* Fractal nets<br />
* Adaptive computation time method<br />
<br />
== Related architectures ==<br />
<br />
MSDNets pull on concepts from a number of existing networks:<br />
* Neural fabrics and related architectures are used to quickly establish a low-resolution feature map, which is integral for classification.<br />
* Deeply supervised nets introduced the incorporation of multiple classifiers throughout the network. For example, BranchyNet (Teerapittayanon et al., 2016) is a deeply supervised network explicitly designed for efficiency. It has multiple exit branches at various depths, each leading to a softmax classifier. At test time, if a classifier on an early exit branch makes a confident prediction, the rest of the network need not be evaluated. However, unlike in MSDNets, the early classifiers in BranchyNet do not have access to low-resolution features.<br />
* The feature concatenation method from DenseNets (a DenseNet is a CNN with shorter connections between layers close to the input and layers close to the output) allows the later classifiers to not be disrupted by the weight updates from earlier classifiers.<br />
<br />
= Problem Setup =<br />
The authors consider two settings that impose computational constraints at prediction time.<br />
<br />
== Anytime Prediction ==<br />
In the anytime prediction setting (Grubb & Bagnell, 2012), there is a finite computational budget <math>B > 0</math> available for each test example <math>x</math>. Once the budget is exhausted, the prediction for the class is output using early exit. The budget is nondeterministic and varies per test instance.<br />
They assume that the budget is drawn from some joint distribution <math>P(x,B)</math>. They denote the loss of a model <math>f(x)</math> that has to produce a prediction for instance <math>x</math> within a budget of <math>B</math> by <math>L(f(x),B)</math>. The goal of the anytime learner is to minimize the expected loss under the budget distribution: <math>L(f)=\mathbb{E}_{P(x,B)}[L(f(x),B)]</math>.<br />
<br />
== Budgeted Batch Classification ==<br />
In the budgeted batch classification setting, the model needs to classify a set of examples <math>D_{test} = \{x_1, \ldots, x_M\}</math> within a finite computational budget <math>B > 0</math> that is known in advance. The learner aims to minimize the loss across all examples in <math>D_{test}</math>, within a cumulative cost bounded by <math>B</math>, which is denoted as <math>L(f(D_{test}),B)</math> for some suitable loss function <math>L</math>.<br />
<br />
= Multi-Scale Dense Networks =<br />
Two solutions to the problems mentioned above: <br />
<br />
* Train multiple networks of increasing capacity, and evaluate them at test time.<br />
**Anytime setting: the evaluation can be stopped at any point, and the most recent prediction is returned.<br />
**Batch setting: the evaluation of a sample is stopped early, without running the remaining networks, as soon as a network classifies it well enough.<br />
* Build a deep network with a cascade of classifiers operating on the features of internal layers.<br />
<br />
== Integral Contributions ==<br />
<br />
MSDNets aim to provide efficient classification under varying computational costs by creating a single network that outputs results at multiple depths. While this may seem trivial, as intermediate classifiers can be inserted into any existing network, two major problems arise.<br />
<br />
=== Coarse Level Features Needed For Classification ===<br />
<br />
[[File:paper29 fig3.png | 700px|thumb|center]]<br />
<br />
The term coarse-level feature refers to a low-resolution feature map in a CNN. There are several ways to create such features, typically referred to as downsampling. Examples of layers that perform this function are max pooling, average pooling, and strided convolution. In this architecture, strided convolution is used to create coarse features. <br />
<br />
'''Concern:''' Coarse level features are needed to gain context of the scene. In typical CNN based networks, the features propagate from fine to coarse. Classifiers added to the early, fine featured, layers do not output accurate predictions due to the lack of context.<br />
<br />
Figure 3 depicts the relative accuracies of the intermediate classifiers and shows that the accuracy of a classifier is highly correlated with its position in the network. It is easy to see, particularly in the case of ResNet, that the classifiers improve in a staircase pattern. All of the experiments were performed on the CIFAR-100 dataset, and the intermediate classifiers perform worse than the final classifier, highlighting the lack of coarse-level features early in the network.<br />
<br />
'''Solution:''' To address this issue, MSDNets propose an architecture that uses multi-scale feature maps. The feature maps at a particular layer and scale are computed by concatenating the results of up to two convolutions: a standard convolution is applied to the same-scale features from the previous layer to pass on high-resolution information that subsequent layers can use to construct better coarse features, and, where a finer scale exists, a strided convolution is also applied to the finer-scale feature map from the previous layer to produce coarser features amenable to classification. The network quickly builds a fixed number of scales ranging from fine to coarse. These scales are propagated throughout, so that along the whole length of the network there are always coarse-level features for classification and fine features for learning more difficult representations.<br />
<br />
=== Training of Early Classifiers Interferes with Later Classifiers ===<br />
<br />
When training a network containing intermediate classifiers, the training of early classifiers will cause the early layers to focus on features for that classifier. These learned features may not be as useful to the later classifiers and degrade their accuracy.<br />
<br />
MSDNets use dense connectivity to avoid this issue. By concatenating all prior layers to learn future layers, the gradient propagation is spread throughout the available features. This allows later layers to not be reliant on any single prior, providing opportunities to learn new features that priors have ignored.<br />
<br />
== Architecture ==<br />
<br />
[[File:MSDNet_arch.png | 700px|thumb|center|Left: the MSDNet architecture. Right: example calculations for each output given 3 scales and 4 layers.]]<br />
<br />
The architecture of MSDNet is a structure of convolutions with a set number of layers and a set number of scales. Layers allow the network to build on the previous information to generate more accurate predictions, while the scales allow the network to maintain coarse level features throughout.<br />
<br />
The first layer is a special mini-CNN that quickly fills all required scales with features. The following layers are generated through convolutions of the previous layers and scales.<br />
<br />
Each output at a given scale <math>s</math> is the concatenation of a convolution of all prior outputs at the same scale and a strided convolution of all prior outputs at the previous (finer) scale. <br />
<br />
Each classifier consists of two convolutional layers, an average pooling layer and a linear layer, and is run on the concatenation of all of the coarsest outputs from the preceding layers.<br />
<br />
=== Loss Function ===<br />
<br />
The loss is calculated as a weighted sum of each classifier's logistic loss: <br />
<br />
<math>\frac{1}{|\mathcal{D}|} \sum_{x,y \in \mathcal{D}} \sum_{k}w_k L(f_k) </math><br />
<br />
Here <math>w_k</math> represents the weight of the <math>k</math>th classifier and <math>L(f_k)</math> represents its logistic loss. The weighted loss is averaged over the set of training samples <math>\mathcal{D}</math>. The weights can be chosen according to the computational budget, but the results show that setting all of them to 1 also works well.<br />
<br />
=== Computational Limit Inclusion ===<br />
<br />
When running in a budgeted batch scenario, the network attempts to provide the best overall accuracy. To do this with a set limit on computational resources, it works to use less of the budget on easy detections in order to allow more time to be spent on hard ones. <br />
In order to facilitate this, each classifier lets a sample exit as soon as the confidence of its classification exceeds a preset threshold. The thresholds must be chosen so that <math>|D_{test}|\sum_{k}(q_k C_k) \leq B </math> holds, where <math>|D_{test}|</math> is the total number of test samples, <math>C_k</math> is the computational cost of obtaining an output from the <math>k</math>th classifier, and <math>q_k </math> is the probability that a sample exits at the <math>k</math>th classifier. Assuming that a sample reaching any classifier exits there with the same base probability <math>q</math>, each <math>q_k</math> can be expressed in terms of <math>q</math>; solving the budget constraint for <math>q</math> then determines the thresholds.<br />
<br />
=== Network Reduction and Lazy Evaluation ===<br />
There are two ways to reduce the computational needs of MSDNets:<br />
<br />
# Reduce the size of the network by splitting it into <math>S</math> blocks along the depth dimension and keeping only the coarsest <math>(S-i+1)</math> scales in the <math>i^{\text{th}}</math> block. Whenever a scale is removed, a transition layer merges the concatenated features using a 1x1 convolution and feeds the fine-grained features to the coarser scales.<br />
# Remove unnecessary computations: Group the computation in "diagonal blocks"; this propagates the example along paths that are required for the evaluation of the next classifier.<br />
<br />
The strategy of skipping computations that are not needed for the next classifier, so that nothing is wasted when the computational budget runs out, is known as ''lazy evaluation''.<br />
<br />
= Experiments = <br />
<br />
When evaluating on CIFAR-10 and CIFAR-100, ensembles and multi-classifier versions of ResNets and DenseNets, as well as FractalNet, are compared with MSDNet. <br />
<br />
When evaluating on ImageNet, ensembles and individual versions of ResNets and DenseNets are compared with MSDNets.<br />
<br />
== Anytime Prediction ==<br />
<br />
In anytime prediction, MSDNets are shown to be highly accurate with very little budget, and they remain above the alternative methods as the budget increases. The authors attribute this to the fact that MSDNets are able to produce low-resolution feature maps well-suited for classification after just a few layers, in contrast to the high-resolution feature maps in the early layers of ResNets or DenseNets. Ensembles need to recompute similar low-level features each time a new model is evaluated, so their accuracy does not increase as quickly as the computational budget grows. <br />
<br />
[[File:MSDNet_anytime.png | 700px|thumb|center|Accuracy of the anytime classification models.]] [[File:cifar10msdnet.png | 700px|thumb|center|CIFAR-10 results.]]<br />
<br />
== Budget Batch ==<br />
<br />
For the budgeted batch setting, three MSDNets are designed, with classifiers set up for different ranges of budget constraints. On both datasets, the MSDNets exceed all alternative methods while using only a fraction of the budget.<br />
<br />
[[File:MSDNet_budgetbatch.png | 700px|thumb|center|Accuracy of the budget batch classification models.]]<br />
<br />
The following figure shows examples of what was deemed "easy" and "hard" examples by the network. The top row contains images of either red wine or volcanos that were easily classified, thus exiting the network early and reducing required computations. The bottom row contains examples of "hard" images that were incorrectly classified by the first classifier but were correctly classified by the last layer.<br />
<br />
[[File:MSDNet_visualizingearlyclassifying.png | 700px|thumb|center|Examples of "hard"/"easy" classification]]<br />
<br />
= Ablation study =<br />
Additional experiments were performed to shed light on the contributions of multi-scale feature maps, dense connectivity, and intermediate classifiers. The experiment started with an MSDNet with six intermediate classifiers, and each of these components was removed, one at a time. To make the comparisons fair, the computational costs of the full networks were kept similar by adapting the network width. After removing all three components, a VGG-like convolutional network is obtained. The classification accuracy of all classifiers is shown in the image below.<br />
<br />
[[File:Screenshot_from_2018-03-29_14-58-03.png]]<br />
<br />
= Critique = <br />
<br />
The problem and the evaluation scenarios were well formulated, and according to independent reviews, the results were reproducible. The paper could improve its explanation of how to implement the threshold: it does not clearly explain how the validation set is used to set the threshold value.<br />
<br />
= Implementation =<br />
The following repository provides the source code for the paper, written by the authors: https://github.com/gaohuang/MSDNet<br />
<br />
= Sources =<br />
# Huang, G., Chen, D., Li, T., Wu, F., Maaten, L., & Weinberger, K. Q. (n.d.). Multi-Scale Dense Networks for Resource Efficient Image Classification. ICLR 2018. arXiv:1703.09844 <br />
# Huang, G. (n.d.). Gaohuang/MSDNet. Retrieved March 25, 2018, from https://github.com/gaohuang/MSDNet<br />
# LeCun, Yann, John S. Denker, and Sara A. Solla. "Optimal brain damage." Advances in neural information processing systems. 1990.<br />
# Hassibi, Babak, David G. Stork, and Gregory J. Wolff. "Optimal brain surgeon and general network pruning." Neural Networks, 1993., IEEE International Conference on. IEEE, 1993.<br />
# Li, Hao, et al. "Pruning filters for efficient convnets." arXiv preprint arXiv:1608.08710 (2016).<br />
# Hubara, Itay, et al. "Binarized neural networks." Advances in neural information processing systems. 2016.<br />
# Rastegari, Mohammad, et al. "Xnor-net: Imagenet classification using binary convolutional neural networks." European Conference on Computer Vision. Springer, Cham, 2016.<br />
# Cristian Bucilua, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In ACM SIGKDD, pp. 535–541. ACM, 2006.<br />
# Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning Workshop, 2014.<br />
# Teerapittayanon, Surat, Bradley McDanel, and H. T. Kung. "Branchynet: Fast inference via early exiting from deep neural networks." Pattern Recognition (ICPR), 2016 23rd International Conference on. IEEE, 2016.</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Training_And_Inference_with_Integers_in_Deep_Neural_Networks&diff=36362Training And Inference with Integers in Deep Neural Networks2018-04-20T21:13:02Z<p>Ws2chen: /* WAGE Quantization */</p>
<hr />
<div>== Introduction ==<br />
<br />
Deep neural networks have enjoyed much success in all manner of tasks, but these networks are commonly complicated, have high memory requirements, and perform many floating-point operations (FLOPs). As a result, running many of these models is very expensive in terms of energy use, and using state-of-the-art networks in applications where energy is limited can be very difficult. To allow the use of these networks in situations with low energy availability, the energy costs must be reduced while keeping network performance as high as is practical.<br />
<br />
Most existing methods focus on compressing a model to reduce energy requirements during inference rather than training. Since training with SGD requires accumulation, it usually demands higher precision than inference. This paper proposes a framework that reduces complexity during both training and inference by using integers instead of floats. The authors address how to quantize all operations and operands, and examine the bitwidth requirements for SGD computation and accumulation. Using integers instead of floats saves energy because integer operations are more efficient than floating-point operations (see the table below). Also, dedicated deep learning hardware that uses integer operations already exists (such as the first generation of the Google TPU), so understanding the best way to use integers is well motivated. A TPU is a Tensor Processing Unit developed by Google for tensor operations. It is comparable to a GPU but achieves higher throughput for low-precision computations.<br />
{| class="wikitable"<br />
|+Rough Energy Costs in 45nm 0.9V<sup>[[#References|[1]]]</sup><br />
!<br />
! colspan="2" |Energy(pJ)<br />
! colspan="2" |Area(<math>\mu m^2</math>)<br />
|-<br />
!Operation<br />
!MUL<br />
!ADD<br />
!MUL<br />
!ADD<br />
|-<br />
|8-bit INT<br />
|0.2<br />
|0.03<br />
|282<br />
|36<br />
|-<br />
|16-bit FP<br />
|1.1<br />
|0.4<br />
|1640<br />
|1360<br />
|-<br />
|32-bit FP<br />
|3.7<br />
|0.9<br />
|7700<br />
|4184<br />
|}<br />
The authors call the framework WAGE because they consider how best to handle the '''W'''eights, '''A'''ctivations, '''G'''radients, and '''E'''rrors separately.<br />
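To make the energy table above concrete, the following sketch computes how many times more energy a 32-bit floating-point operation costs than its 8-bit integer counterpart. The numbers are copied from the table; the dictionary layout and the function name are illustrative choices and do not come from the paper.<br />

```python
# Energy per operation in picojoules, copied from the table above (45nm, 0.9V).
ENERGY_PJ = {
    "8-bit INT": {"MUL": 0.2, "ADD": 0.03},
    "16-bit FP": {"MUL": 1.1, "ADD": 0.4},
    "32-bit FP": {"MUL": 3.7, "ADD": 0.9},
}


def energy_ratio(expensive, cheap, op):
    """How many times more energy `expensive` uses than `cheap` for one `op`."""
    return ENERGY_PJ[expensive][op] / ENERGY_PJ[cheap][op]


if __name__ == "__main__":
    for op in ("MUL", "ADD"):
        ratio = energy_ratio("32-bit FP", "8-bit INT", op)
        print(f"{op}: 32-bit FP costs about {ratio:.1f}x an 8-bit INT operation")
```

Under these figures, a float32 multiply costs roughly 18.5 times as much energy as an 8-bit integer multiply, and a float32 add roughly 30 times as much.<br />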
<br />
== Related Work ==<br />
<br />
=== Weight and Activation ===<br />
Existing works to train DNNs on binary weights and activations <sup>[[#References|[2]]]</sup> add noise to weights and activations as a form of regularization. The use of high-precision accumulation is required for SGD optimization since real-valued gradients are obtained from real-valued variables. XNOR-Net <sup>[[#References|[11]]]</sup> uses bitwise operations to approximate convolutions in a highly memory-efficient manner, and applies a filter-wise scaling factor for weights to improve performance. However, these floating-point factors are calculated simultaneously during training, which aggravates the training effort. Ternary weight networks (TWN) <sup>[[#References|[3]]]</sup> and Trained ternary quantization (TTQ)<sup>[[#References|[9]]]</sup> offer more expressive ability than binary weight networks by constraining the weights to be ternary-valued {-1,0,1} using two symmetric thresholds. Tang et al.<sup>[[#References|[14]]]</sup> achieve impressive results by using a binarization scheme according to which floating-point activation vectors are approximated as linear combinations of binary vectors, where the weights in the linear combination are floating-point. Still other approaches rely on relative quantization<sup>[[#References|[13]]]</sup>; however, an efficient implementation is difficult to apply in practice due to the requirements of persisting and applying a codebook.<br />
<br />
=== Gradient Computation and Accumulation ===<br />
DoReFa-Net quantizes gradients to low-bitwidth floating-point numbers with discrete states in the backward pass. To reduce the overhead of gradient synchronization in distributed training, the TernGrad method quantizes the gradient updates to ternary values. In both works, the weights are still stored and updated in float32, and the quantization of batch normalization and its derivative is ignored.<br />
<br />
== WAGE Quantization ==<br />
The core idea of the proposed method is to constrain the following to low-bitwidth integers on each layer:<br />
* '''W:''' weight in inference<br />
* '''a:''' activation in inference<br />
* '''e:''' error in backpropagation<br />
* '''g:''' gradient in backpropagation<br />
[[File:p32fig1.PNG|center|thumb|800px|Four operators QW (·), QA(·), QG(·), QE(·) added in WAGE computation dataflow to reduce precision, bitwidth of signed integers are below or on the right of arrows, activations are included in MAC for concision.]]<br />
<br />
As can be observed from the figure, the authors extend the original definition of errors to the multi-layer setting: from the perspective of each convolution or fully-connected layer, the error e is the gradient of the activation a, while the gradient g refers specifically to the gradient accumulated for the weight W. For the i-th layer of a feed-forward network, the error and gradient are defined as:<br />
<br />
<math>e^i = \frac{\partial L}{\partial a^i}, g^i = \frac{\partial L}{\partial W^i}</math><br />
<br />
where L is the loss function.<br />
<br />
The precisions in bits of the errors, activations, gradients, and weights are <math>k_E</math>, <math>k_A</math>, <math>k_G</math>, and <math>k_W</math> respectively. As shown in the above figure, each quantity also has a quantization operator to reduce the bitwidth increase caused by multiply-accumulate (MAC) operations. Also, note that since this is a layer-by-layer approach, each layer may be followed or preceded by a layer with different precision, or even a layer using floating-point math.<br />
<br />
=== Shift-Based Linear Mapping and Stochastic Mapping ===<br />
The proposed method makes use of a linear mapping where continuous, unbounded values are discretized for each bitwidth <math>k</math> with a uniform spacing of<br />
<br />
<math>\sigma(k) = 2^{1-k}, k \in Z_+ </math><br />
With this, the full quantization function is<br />
<br />
<math>Q(x,k) = Clip\left \{ \sigma(k) \cdot round\left [ \frac{x}{\sigma(k)} \right ], -1 + \sigma(k), 1 - \sigma(k) \right \}</math>, <br />
<br />
where <math>round</math> approximates continuous values to their nearest discrete state, and <math>Clip</math> is the saturation function that clips unbounded values to <math>[-1 + \sigma, 1 - \sigma]</math>. Note that this function is only used when simulating integer operations on floating-point hardware; on native integer hardware, this is done automatically. In addition to this quantization function, a distribution scaling factor is used in some quantization operators to preserve as much variance as possible when applying the quantization function above. The scaling factor is defined below.<br />
<br />
<math>Shift(x) = 2^{round(log_2(x))}</math><br />
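The three functions above can be sketched in plain Python as follows. This is a minimal floating-point simulation, not the authors' implementation: the function names are chosen here for illustration, and ties in <math>round</math> are assumed to round up (via <code>floor(x + 0.5)</code>), which the text does not specify.<br />

```python
import math


def sigma(k):
    """Minimum step size for bitwidth k: sigma(k) = 2^(1-k)."""
    return 2.0 ** (1 - k)


def clip(x, lo, hi):
    """Saturate x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def quantize(x, k):
    """Q(x, k): snap x to the sigma(k) grid, then saturate to [-1+sigma, 1-sigma]."""
    s = sigma(k)
    snapped = s * math.floor(x / s + 0.5)  # nearest grid point (assumed: ties round up)
    return clip(snapped, -1 + s, 1 - s)


def shift(x):
    """Shift(x): the power of two nearest to x in the log2 domain (requires x > 0)."""
    return 2.0 ** math.floor(math.log2(x) + 0.5)


if __name__ == "__main__":
    print(sigma(8))            # step size for 8 bits
    print(quantize(0.3, 8))    # 0.3 snapped to a multiple of 1/128
    print(quantize(5.0, 8))    # saturates at 1 - sigma(8)
    print(shift(3.0))          # power of two nearest to 3 in log2 terms
```

Because every step size and every output of <code>shift</code> is a power of two, the divisions and multiplications in this simulation correspond to bit shifts on integer hardware.<br />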
<br />
Finally, stochastic rounding is substituted for small or real-valued updates during gradient accumulation.<br />
<br />
A visual representation of these operations is below.<br />
[[File:p32fig2.PNG|center|thumb|800px|Quantization methods used in WAGE. The notation <math>P, x, \lfloor \cdot \rfloor, \lceil \cdot \rceil</math> denotes probability, vector, floor and ceil, respectively. <math>Shift(\cdot)</math> refers to distribution shifting with a certain argument]]<br />
<br />
=== Weight Initialization ===<br />
In this work, batch normalization is simplified to a constant scaling layer in order to sidestep the problem of normalizing outputs without floating-point math, and to remove the extra memory requirement of batch normalization. As such, some care must be taken when initializing weights. The authors use a modified initialization method based on MSRA<sup>[[#References|[4]]]</sup>.<br />
<br />
<math>W \thicksim U(-L, +L),L = max \left \{ \sqrt{6/n_{in}}, L_{min} \right \}, L_{min} = \beta \sigma</math><br />
<br />
<math>n_{in}</math> is the layer fan-in number, and <math>U</math> denotes the uniform distribution. The original bound <math>\sqrt{6/n_{in}}</math> is modified by adding the condition that the distribution width should be at least <math>\beta \sigma</math>, where <math>\beta</math> is a constant greater than 1 and <math>\sigma</math> is the minimum step size defined above. This prevents weights from being initialised to all zeros when the bitwidth is low or the fan-in number is high.<br />
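A minimal sketch of this initialization rule is given below. The value <code>beta=1.5</code>, the function names, and the use of Python's <code>random</code> module are assumptions made for illustration; the text only requires that <math>\beta</math> be greater than 1.<br />

```python
import math
import random


def init_limit(n_in, k_w, beta=1.5):
    """Bound L = max(sqrt(6 / n_in), beta * sigma(k_w)) of the uniform distribution."""
    step = 2.0 ** (1 - k_w)          # sigma(k_w), the minimum step size
    l_min = beta * step              # smallest allowed bound (beta is an assumed value)
    return max(math.sqrt(6.0 / n_in), l_min)


def init_weights(n_in, n_out, k_w, beta=1.5, seed=0):
    """Draw an n_out x n_in weight matrix from U(-L, +L)."""
    rng = random.Random(seed)
    limit = init_limit(n_in, k_w, beta)
    return [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]


if __name__ == "__main__":
    # High fan-in with 2-bit weights: the MSRA bound (about 0.072) is far below
    # the step size 0.5, so L_min = 0.75 takes over.
    print(init_limit(1152, 2))
    # Low fan-in with 8-bit weights: the MSRA bound is used unchanged.
    print(init_limit(6, 8))
```

In the first case, weights drawn with the unmodified bound would all quantize to zero at 2 bits, which is the failure the extra condition is meant to prevent.<br />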
<br />
=== Quantization Details ===<br />
<br />
==== Weight <math>Q_W(\cdot)</math> ====<br />
<math>W_q = Q_W(W) = Q(W, k_W)</math><br />
<br />
The quantization operator is simply the quantization function previously introduced. <br />
<br />
==== Activation <math>Q_A(\cdot)</math> ====<br />
The authors note that, as a result of the modified initialization, the variance of the weights is scaled relative to the variance under the original initialization. To prevent this effect from blowing up the network outputs, they introduce a scaling factor <math>\alpha</math>, which is constant for each layer.<br />
<br />
<math>\alpha = max \left \{ Shift(L_{min} / L), 1 \right \}</math><br />
<br />
The quantization operator is then<br />
<br />
<math>a_q = Q_A(a) = Q(a/\alpha, k_A)</math><br />
<br />
The scaling factor approximates batch normalization.<br />
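The activation operator can be sketched as follows. One caveat: read literally, with <math>L</math> taken from the initialization formula, <math>L_{min}/L</math> never exceeds 1 and <math>\alpha</math> would always equal 1. The sketch therefore assumes the ratio is taken against the unmodified bound <math>\sqrt{6/n_{in}}</math>; this is an interpretation, not something the text confirms. The names and <code>beta=1.5</code> are likewise illustrative.<br />

```python
import math


def sigma(k):
    """Minimum step size for bitwidth k."""
    return 2.0 ** (1 - k)


def quantize(x, k):
    """Q(x, k) with ties assumed to round up, saturated to [-1+sigma, 1-sigma]."""
    s = sigma(k)
    snapped = s * math.floor(x / s + 0.5)
    return max(-1 + s, min(1 - s, snapped))


def shift(x):
    """Power of two nearest to x in the log2 domain (x > 0)."""
    return 2.0 ** math.floor(math.log2(x) + 0.5)


def activation_scale(n_in, k_w, beta=1.5):
    """alpha = max(Shift(L_min / L), 1), with L assumed to be sqrt(6 / n_in)."""
    l_min = beta * sigma(k_w)
    l_orig = math.sqrt(6.0 / n_in)   # assumption: the unmodified initialization bound
    return max(shift(l_min / l_orig), 1.0)


def quantize_activation(a, alpha, k_a=8):
    """a_q = Q(a / alpha, k_A), applied to every element of the list a."""
    return [quantize(v / alpha, k_a) for v in a]


if __name__ == "__main__":
    alpha = activation_scale(n_in=1152, k_w=2)
    print(alpha)
    print(quantize_activation([4.0, -0.5, 100.0], alpha))
```

Since <math>\alpha</math> is a power of two, dividing by it is a right shift, which is why this constant factor can stand in for batch normalization without floating-point arithmetic.<br />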
<br />
==== Error <math>Q_E(\cdot)</math> ====<br />
The authors note that the magnitude of the error can vary greatly, and that a previous approach (DoReFa-Net<sup>[[#References|[5]]]</sup>) solves the issue by using an affine transform to map the error to the range <math>[-1, 1]</math>, applying quantization, and then applying the inverse transform. However, the authors claim that this approach still requires float32, and that the magnitude of the error is unimportant: what matters is its orientation. Thus, they only scale the error distribution to the range <math>\left [ -\sqrt2, \sqrt2 \right ]</math> and quantise:<br />
<br />
<math>e_q = Q_E(e) = Q(e/Shift(max\{|e|\}), k_E)</math><br />
<br />
Here <math>max\{|e|\}</math> is the largest absolute value among all elements of the layer's error. Note that this discards any error elements smaller than the minimum step size.<br />
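The error operator can be sketched as below, treating a layer's error as a flat list of floats. The helper names are illustrative and ties in rounding are assumed to round up; neither detail comes from the paper.<br />

```python
import math


def sigma(k):
    """Minimum step size for bitwidth k."""
    return 2.0 ** (1 - k)


def quantize(x, k):
    """Q(x, k) with ties assumed to round up, saturated to [-1+sigma, 1-sigma]."""
    s = sigma(k)
    snapped = s * math.floor(x / s + 0.5)
    return max(-1 + s, min(1 - s, snapped))


def shift(x):
    """Power of two nearest to x in the log2 domain (x > 0)."""
    return 2.0 ** math.floor(math.log2(x) + 0.5)


def quantize_error(e, k_e=8):
    """e_q = Q(e / Shift(max|e|), k_E) for a list of error values e."""
    largest = max(abs(v) for v in e)
    if largest == 0.0:
        return [0.0 for _ in e]      # nothing to rescale
    scale = shift(largest)
    return [quantize(v / scale, k_e) for v in e]


if __name__ == "__main__":
    # The largest element saturates just below 1; the tiny one falls below
    # half a step and is rounded away.
    print(quantize_error([1.0, 0.001]))
    print(quantize_error([0.003, -0.0001, 0.0012]))
```

The first call illustrates the information loss mentioned above: an element much smaller than the layer maximum becomes exactly zero after quantization.<br />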
<br />
==== Gradient <math>Q_G(\cdot)</math> ====<br />
Similar to the activations and errors, the gradients are rescaled:<br />
<br />
<math>g_s = \eta \cdot g/Shift(max\{|g|\})</math><br />
<br />
<math> \eta </math> is a shift-based learning rate. It is an integer power of 2. The shifted gradients are represented in units of minimum step sizes <math> \sigma(k) </math>. When reducing the bitwidth of the gradients (remember that the gradients are coming out of a MAC operation, so the bitwidth may have increased) stochastic rounding is used as a substitute for small gradient accumulation.<br />
<br />
<math>\Delta W = Q_G(g) = \sigma(k_G) \cdot sgn(g_s) \cdot \left \{ \lfloor | g_s | \rfloor + Bernoulli(|g_s| - \lfloor | g_s | \rfloor) \right \}</math><br />
<br />
This randomly rounds the result of the MAC operation up or down to a neighbouring quantization level for the given gradient bitwidth, with the probability of rounding up equal to the fractional part. The weights are updated with the resulting discrete increments:<br />
<br />
<math>W_{t+1} = Clip \left \{ W_t - \Delta W_t, -1 + \sigma(k_G), 1 - \sigma(k_G) \right \}</math><br />
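The gradient operator and the weight update can be sketched together as follows. This is a simulation under stated assumptions: gradients and weights are flat lists of floats, the random number generator is passed in explicitly so that results are reproducible, and all function names are hypothetical.<br />

```python
import math
import random


def sigma(k):
    """Minimum step size for bitwidth k."""
    return 2.0 ** (1 - k)


def shift(x):
    """Power of two nearest to x in the log2 domain (x > 0)."""
    return 2.0 ** math.floor(math.log2(x) + 0.5)


def quantize_gradient(g, eta, k_g, rng):
    """Delta W = sigma(k_G) * sgn(g_s) * (floor(|g_s|) + Bernoulli(frac(|g_s|)))."""
    largest = max(abs(v) for v in g)
    if largest == 0.0:
        return [0.0 for _ in g]
    scale = shift(largest)
    deltas = []
    for v in g:
        g_s = eta * v / scale                    # rescaled gradient, in units of steps
        magnitude = abs(g_s)
        whole = math.floor(magnitude)
        frac = magnitude - whole
        bump = 1 if rng.random() < frac else 0   # stochastic rounding of the remainder
        sign = (g_s > 0) - (g_s < 0)
        deltas.append(sigma(k_g) * sign * (whole + bump))
    return deltas


def update_weights(w, delta, k_g):
    """W_{t+1} = Clip(W_t - Delta W_t, -1 + sigma(k_G), 1 - sigma(k_G))."""
    s = sigma(k_g)
    return [max(-1 + s, min(1 - s, wi - di)) for wi, di in zip(w, delta)]


if __name__ == "__main__":
    rng = random.Random(0)
    delta = quantize_gradient([0.02, -0.005, 0.0], eta=2.0, k_g=8, rng=rng)
    print(delta)
    print(update_weights([0.25, -0.5, 0.0], delta, k_g=8))
```

In this example the largest gradient rescales to 2.56 steps, so it becomes either 2 or 3 steps of size <math>\sigma(k_G)</math>; on average over many updates the fractional part is preserved, which is how small gradients still accumulate.<br />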
<br />
=== Miscellaneous ===<br />
To train WAGE networks, the authors used pure SGD exclusively because more complicated techniques such as Momentum or RMSProp increase memory consumption and are complicated by the rescaling that happens within each quantization operator.<br />
<br />
The quantization and stochastic rounding are a form of regularization.<br />
<br />
The authors didn't use a traditional softmax with cross-entropy loss for the experiments because there does not yet exist a softmax layer for low-bit integers. Instead, they use a sum of squared error loss. This works for tasks with a small number of categories, but does not scale well.<br />
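For illustration, a sum-of-squared-error loss over class scores can be written as below. The one-hot target encoding and the function name are assumptions of this sketch; the text only states that a sum of squared errors replaces softmax with cross-entropy.<br />

```python
def sse_loss(outputs, label):
    """Sum of squared errors between the outputs and a one-hot target (assumed encoding)."""
    return sum(
        (out - (1.0 if idx == label else 0.0)) ** 2
        for idx, out in enumerate(outputs)
    )


if __name__ == "__main__":
    # Three classes, true class 0.
    print(sse_loss([0.5, 0.0, -0.5], label=0))
```

Unlike softmax, this loss needs only subtraction, multiplication, and addition, so it involves no exponentials or divisions that would be awkward in low-bit integer arithmetic.<br />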
<br />
== Experiments ==<br />
For all experiments, the default layer bitwidth configuration is 2-8-8-8 for Weights, Activations, Gradients, and Error bits. The weight bitwidth is set to 2 because that results in ternary weights, and therefore no multiplication during inference. The authors argue that the bitwidth for activations and errors should be the same because the computation graph for each is similar and might use the same hardware. During training, the weight bitwidth is 8. For inference the weights are ternarized.<br />
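To see why a weight bitwidth of 2 yields ternary weights, the sketch below enumerates every value that the quantization function <math>Q(x, k)</math> can output for a given <math>k</math>. The function name is illustrative.<br />

```python
def quantization_levels(k):
    """All values Q(x, k) can take: multiples of sigma(k) within [-1+sigma, 1-sigma]."""
    step = 2.0 ** (1 - k)            # sigma(k)
    n = 2 ** (k - 1) - 1             # number of positive levels
    return [i * step for i in range(-n, n + 1)]


if __name__ == "__main__":
    print(quantization_levels(2))        # three levels
    print(len(quantization_levels(8)))   # 2^8 - 1 levels
```

For <math>k = 2</math> the only levels are -0.5, 0, and 0.5, so multiplying by a weight reduces to negating, zeroing, or passing through the input, followed by a fixed shift.<br />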
<br />
=== Implementation Details ===<br />
MNIST: Network is LeNet-5 variant<sup>[[#References|[6]]]</sup> with 32C5-MP2-64C5-MP2-512FC-10SSE.<br />
<br />
SVHN & CIFAR10: VGG variant<sup>[[#References|[7]]]</sup> with 2×(128C3)-MP2-2×(256C3)-MP2-2×(512C3)-MP2-1024FC-10SSE. For the CIFAR10 dataset, the data augmentation used for training follows Lee et al. (2015)<sup>[[#References|[10]]]</sup>.<br />
<br />
ImageNet: AlexNet variant<sup>[[#References|[8]]]</sup> on ILSVRC12 dataset.<br />
{| class="wikitable"<br />
|+Test or validation error rates (%) in previous works and WAGE on multiple datasets. Opt denotes gradient descent optimizer, withM means SGD with momentum, BN represents batch normalization, 32 bit refers to float32, and ImageNet top-k format: top1/top5.<br />
!Method<br />
!<math>k_W</math><br />
!<math>k_A</math><br />
!<math>k_G</math><br />
!<math>k_E</math><br />
!Opt<br />
!BN<br />
!MNIST<br />
!SVHN<br />
!CIFAR10<br />
!ImageNet<br />
|-<br />
|BC<br />
|1<br />
|32<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|1.29<br />
|2.30<br />
|9.90<br />
|<br />
|-<br />
|BNN<br />
|1<br />
|1<br />
|32<br />
|32<br />
|Adam<br />
|yes <br />
|0.96<br />
|2.53<br />
|10.15<br />
|<br />
|-<br />
|BWN<br />
|1<br />
|32<br />
|32<br />
|32<br />
|withM<br />
|yes<br />
|<br />
|<br />
|<br />
|43.2/20.6<br />
|-<br />
|XNOR<br />
|1<br />
|1<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|<br />
|55.8/30.8<br />
|-<br />
|TWN<br />
|2<br />
|32<br />
|32<br />
|32<br />
|withM<br />
|yes<br />
|0.65<br />
|<br />
|7.44<br />
|'''34.7/13.8'''<br />
|-<br />
|TTQ<br />
|2<br />
|32<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|6.44<br />
|42.5/20.3<br />
|-<br />
|DoReFa<br />
|8<br />
|8<br />
|32<br />
|8<br />
|Adam<br />
|yes<br />
|<br />
|2.30<br />
|<br />
|47.0/<br />
|-<br />
|TernGrad<br />
|32<br />
|32<br />
|2<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|14.36<br />
|42.4/19.5<br />
|-<br />
|WAGE<br />
|2<br />
|8<br />
|8<br />
|8<br />
|SGD<br />
|no<br />
|'''0.40'''<br />
|'''1.92'''<br />
|'''6.78'''<br />
|51.6/27.8<br />
|}<br />
<br />
=== Training Curves and Regularization ===<br />
The authors compare the 2-8-8-8 WAGE configuration introduced above, a 2-8-f-f (meaning float32) configuration, and a completely floating point version on CIFAR10. The test error is plotted against epoch. For training these networks, the learning rate is divided by 8 at the 200th epoch and again at the 250th epoch.<br />
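The learning rate schedule described above can be sketched as follows. The initial rate of 8.0 is a hypothetical value used only to show the arithmetic, and the sketch assumes that the division takes effect from the stated epoch onward.<br />

```python
def learning_rate(epoch, base_rate):
    """Divide base_rate by 8 at epoch 200 and again at epoch 250."""
    drops = (epoch >= 200) + (epoch >= 250)   # number of divisions applied so far
    return base_rate / (8 ** drops)


if __name__ == "__main__":
    for epoch in (0, 199, 200, 249, 250):
        print(epoch, learning_rate(epoch, base_rate=8.0))   # base_rate is hypothetical
```

Dividing by 8 keeps a power-of-two learning rate a power of two, so each drop corresponds to shifting the rate right by three bits.<br />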
[[File:p32fig3.PNG|center|thumb|800px|Training curves of WAGE variations and a vanilla CNN on CIFAR10]]<br />
The 2-8-8-8 configuration converges comparably to the vanilla CNN and outperforms the 2-8-f-f variant. The authors speculate that this is because the extra discretization acts as a regularizer.<br />
<br />
=== Bitwidth of Errors ===<br />
The CIFAR10 test accuracy is plotted against bitwidth below and the error density for a single layer is compared with the Vanilla network.<br />
[[File:p32fig4.PNG|center|thumb|520x522px|The 10 run accuracies of different <math>k_E</math>]]<br />
<br />
[[File:32_error.png|center|thumb|520x522px|Histogram of errors for the vanilla network and the WAGE network. After being quantized and shifted at each layer, the error is reshaped, and most orientation information is retained. ]]<br />
<br />
The table below shows the test error rates on CIFAR10 when the upper boundary is left-shifted by a factor γ. The table shows that large values play a critical role in backpropagation training even though they are infrequent, while the majority of values, which are small, are mostly noise.<br />
<br />
[[File:testerror_rate.png|center]]<br />
<br />
=== Bitwidth of Gradients ===<br />
<br />
The authors next investigated the choice of a proper <math>k_G</math> for gradients using the CIFAR10 dataset. <br />
<br />
{| class="wikitable"<br />
|+Test error rates (%) on CIFAR10 with different <math>k_G</math><br />
!<math>k_G</math><br />
!2<br />
!3<br />
!4<br />
!5<br />
!6<br />
!7<br />
!8<br />
!9<br />
!10<br />
!11<br />
!12<br />
|-<br />
|error<br />
|54.22<br />
|51.57<br />
|28.22<br />
|18.01<br />
|11.48<br />
|7.61<br />
|6.78<br />
|6.63<br />
|6.43<br />
|6.55<br />
|6.57<br />
|}<br />
<br />
The results show similar bitwidth requirements as the last experiment for <math>k_E</math>.<br />
<br />
The authors also examined the effect of bitwidth on the ImageNet implementation.<br />
<br />
Here, C denotes 12 bits (hexadecimal) and BN indicates that batch normalization is added. Seven models are used: 2888 from the first experiment, 288C for more accurate errors (12 bits), 28C8 for larger buffer space, 28f8 for non-quantized gradients, 28ff for errors and gradients in float32, and 28ff with BN added. The baseline vanilla model refers to the original AlexNet architecture. <br />
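The pattern names can be decoded mechanically, as in the sketch below. It assumes the four characters follow the W-A-G-E order used for the 2-8-8-8 configuration, that each character is a hexadecimal digit except <code>f</code>, which stands for float32, and that a <code>-BN</code> suffix marks batch normalization. The function name is hypothetical.<br />

```python
def decode_pattern(pattern):
    """Map a pattern such as '28C8' or '28ff-BN' to per-quantity bitwidths."""
    names = ("k_W", "k_A", "k_G", "k_E")
    config = {}
    for name, char in zip(names, pattern[:4]):
        # 'f' means float32 here, so it must be checked before hexadecimal parsing.
        config[name] = "float32" if char == "f" else int(char, 16)
    config["BN"] = pattern.endswith("-BN")
    return config


if __name__ == "__main__":
    for pattern in ("2888", "288C", "28C8", "28f8", "28ff", "28ff-BN"):
        print(pattern, decode_pattern(pattern))
```

Decoding 28C8 gives a 12-bit <math>k_G</math> with an 8-bit <math>k_E</math>, while 288C gives the reverse, which is the pair contrasted in the discussion of the table.<br />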
<br />
{| class="wikitable"<br />
|+Top-5 error rates (%) on ImageNet with different <math>k_G</math> and <math>k_E</math><br />
!Pattern<br />
!vanilla<br />
!28ff-BN<br />
!28ff<br />
!28f8<br />
!28C8<br />
!288C<br />
!2888<br />
|-<br />
|error<br />
|19.29<br />
|20.67<br />
|24.14<br />
|23.92<br />
|26.88<br />
|28.06<br />
|27.82<br />
|}<br />
<br />
The comparison between 28C8 and 288C shows that the model may perform better if it has more buffer space <math>k_G</math> for gradient accumulation than if it has high-resolution orientation <math>k_E</math>. The authors also noted that batch normalization and <math>k_G</math> are more important for ImageNet because the training set samples are highly variant.<br />
<br />
== Discussion ==<br />
The authors have a few areas they believe this approach could be improved.<br />
<br />
'''MAC Operation:''' The 2-8-8-8 configuration was chosen because the low weight bitwidth means there is no multiplication during inference. However, this does not remove the requirement for multiplication during training. A 2-2-8-8 configuration would remove it, but such a network is difficult to train and loses accuracy.<br />
<br />
'''Non-linear Quantization:''' The linear mapping used in this approach is simple, but there might be a more effective mapping. For example, a logarithmic mapping could be more effective if the weights and activations have a log-normal distribution.<br />
<br />
'''Normalization:''' Normalization layers (softmax, batch normalization) were not used in this paper. Quantized versions are an area of future work.<br />
<br />
== Conclusion ==<br />
<br />
A framework for training and inference without the use of floating-point representation is presented. By quantizing all operations and operands of a network, the authors successfully reduce the energy costs of both training and inference with deep learning architectures. Future work may further improve compression and memory requirements.<br />
<br />
== Implementation ==<br />
The following repository provides the source code for the paper: https://github.com/boluoweifenda/WAGE. The repository provides the source code as written by the authors, in Tensorflow.<br />
[[File:DAIMA.jpg|center|thumb|800px|]]<br />
== Limitation == <br />
<br />
* The paper states the advantages in energy costs, but are there any limitations or trade-offs in choosing integer instead of floating-point operations? In what situations is such an implementation a good fit? The authors should explain more on this.<br />
<br />
== References ==<br />
# Sze, Vivienne; Chen, Yu-Hsin; Yang, Tien-Ju; Emer, Joel (2017-03-27). [http://arxiv.org/abs/1703.09039 "Efficient Processing of Deep Neural Networks: A Tutorial and Survey"]. arXiv:1703.09039 [cs].<br />
# Courbariaux, Matthieu; Bengio, Yoshua; David, Jean-Pierre (2015-11-01). [http://arxiv.org/abs/1511.00363 "BinaryConnect: Training Deep Neural Networks with binary weights during propagations"]. arXiv:1511.00363 [cs].<br />
# Li, Fengfu; Zhang, Bo; Liu, Bin (2016-05-16). [http://arxiv.org/abs/1605.04711 "Ternary Weight Networks"]. arXiv:1605.04711 [cs].<br />
# He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2015-02-06). [http://arxiv.org/abs/1502.01852 "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification"]. arXiv:1502.01852 [cs].<br />
# Zhou, Shuchang; Wu, Yuxin; Ni, Zekun; Zhou, Xinyu; Wen, He; Zou, Yuheng (2016-06-20). [http://arxiv.org/abs/1606.06160 "DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients"]. arXiv:1606.06160 [cs].<br />
# Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. (November 1998). [http://ieeexplore.ieee.org/document/726791/?reload=true "Gradient-based learning applied to document recognition"]. Proceedings of the IEEE. 86 (11): 2278–2324. doi:10.1109/5.726791. ISSN 0018-9219.<br />
# Simonyan, Karen; Zisserman, Andrew (2014-09-04). [http://arxiv.org/abs/1409.1556 "Very Deep Convolutional Networks for Large-Scale Image Recognition"]. arXiv:1409.1556 [cs].<br />
# Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E (2012). Pereira, F.; Burges, C. J. C.; Bottou, L.; Weinberger, K. Q., eds. [http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf Advances in Neural Information Processing Systems 25 (PDF)]. Curran Associates, Inc. pp. 1097–1105.<br />
# Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.<br />
# Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pp. 562–570, 2015.<br />
# Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Springer, 2016.<br />
# “Boluoweifenda/WAGE.” GitHub, github.com/boluoweifenda/WAGE.<br />
# Han, S., Mao, H., & Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149.<br />
# Tang, Wei, Gang Hua, and Liang Wang. "How to train a compact binary neural network with high accuracy?." AAAI. 2017.</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Training_And_Inference_with_Integers_in_Deep_Neural_Networks&diff=36361Training And Inference with Integers in Deep Neural Networks2018-04-20T21:12:39Z<p>Ws2chen: /* WAGE Quantization */</p>
<hr />
<div>== Introduction ==<br />
<br />
Deep neural networks have enjoyed much success in all manner of tasks, but these networks are commonly complicated, have high memory requirements, and perform many floating-point operations (FLOPs). As a result, running many of these models is very expensive in terms of energy use, and using state-of-the-art networks in applications where energy is limited can be very difficult. To allow the use of these networks in situations with low energy availability, the energy costs must be reduced while keeping network performance as high as is practical.<br />
<br />
Most existing methods focus on compressing a model to reduce energy requirements during inference rather than training. Since training with SGD requires accumulation, it usually demands higher precision than inference. This paper proposes a framework that reduces complexity during both training and inference by using integers instead of floats. The authors address how to quantize all operations and operands, and examine the bitwidth requirements for SGD computation and accumulation. Using integers instead of floats saves energy because integer operations are more efficient than floating-point operations (see the table below). Also, dedicated deep learning hardware that uses integer operations already exists (such as the first generation of the Google TPU), so understanding the best way to use integers is well motivated. A TPU is a Tensor Processing Unit developed by Google for tensor operations. It is comparable to a GPU but achieves higher throughput for low-precision computations.<br />
{| class="wikitable"<br />
|+Rough Energy Costs in 45nm 0.9V<sup>[[#References|[1]]]</sup><br />
!<br />
! colspan="2" |Energy(pJ)<br />
! colspan="2" |Area(<math>\mu m^2</math>)<br />
|-<br />
!Operation<br />
!MUL<br />
!ADD<br />
!MUL<br />
!ADD<br />
|-<br />
|8-bit INT<br />
|0.2<br />
|0.03<br />
|282<br />
|36<br />
|-<br />
|16-bit FP<br />
|1.1<br />
|0.4<br />
|1640<br />
|1360<br />
|-<br />
|32-bit FP<br />
|3.7<br />
|0.9<br />
|7700<br />
|4184<br />
|}<br />
The authors call the framework WAGE because they consider how best to handle the '''W'''eights, '''A'''ctivations, '''G'''radients, and '''E'''rrors separately.<br />
<br />
== Related Work ==<br />
<br />
=== Weight and Activation ===<br />
Existing works to train DNNs on binary weights and activations <sup>[[#References|[2]]]</sup> add noise to weights and activations as a form of regularization. The use of high-precision accumulation is required for SGD optimization since real-valued gradients are obtained from real-valued variables. XNOR-Net <sup>[[#References|[11]]]</sup> uses bitwise operations to approximate convolutions in a highly memory-efficient manner, and applies a filter-wise scaling factor for weights to improve performance. However, these floating-point factors are calculated simultaneously during training, which aggravates the training effort. Ternary weight networks (TWN) <sup>[[#References|[3]]]</sup> and Trained ternary quantization (TTQ)<sup>[[#References|[9]]]</sup> offer more expressive ability than binary weight networks by constraining the weights to be ternary-valued {-1,0,1} using two symmetric thresholds. Tang et al.<sup>[[#References|[14]]]</sup> achieve impressive results by using a binarization scheme according to which floating-point activation vectors are approximated as linear combinations of binary vectors, where the weights in the linear combination are floating-point. Still other approaches rely on relative quantization<sup>[[#References|[13]]]</sup>; however, an efficient implementation is difficult to apply in practice due to the requirements of persisting and applying a codebook.<br />
<br />
=== Gradient Computation and Accumulation ===<br />
DoReFa-Net quantizes gradients to low-bitwidth floating-point numbers with discrete states in the backward pass. To reduce the overhead of gradient synchronization in distributed training, the TernGrad method quantizes the gradient updates to ternary values. In both works, the weights are still stored and updated in float32, and the quantization of batch normalization and its derivative is ignored.<br />
<br />
== WAGE Quantization ==<br />
The core idea of the proposed method is to constrain the following to low-bitwidth integers on each layer:<br />
* '''W:''' weight in inference<br />
* '''a:''' activation in inference<br />
* '''e:''' error in backpropagation<br />
* '''g:''' gradient in backpropagation<br />
[[File:p32fig1.PNG|center|thumb|800px|Four operators QW (·), QA(·), QG(·), QE(·) added in WAGE computation dataflow to reduce precision, bitwidth of signed integers are below or on the right of arrows, activations are included in MAC for concision.]]<br />
<br />
As can be observed from the figure, the authors extend the original definition of errors to the multi-layer setting: from the perspective of each convolution or fully-connected layer, the error e is the gradient of the activation a, while the gradient g refers specifically to the gradient accumulated for the weight W. Consider the i-th layer of a feed-forward network.<br />
<br />
The error and gradient are defined as:<br />
<br />
<math>e^i = \frac{\partial L}{\partial a^i}, g^i = \frac{\partial L}{\partial W^i}</math><br />
<br />
where L is the loss function.<br />
<br />
The precisions in bits of the errors, activations, gradients, and weights are <math>k_E</math>, <math>k_A</math>, <math>k_G</math>, and <math>k_W</math> respectively. As shown in the above figure, each quantity also has a quantization operator to reduce the bitwidth increase caused by multiply-accumulate (MAC) operations. Also, note that since this is a layer-by-layer approach, each layer may be followed or preceded by a layer with different precision, or even a layer using floating-point math.<br />
<br />
=== Shift-Based Linear Mapping and Stochastic Mapping ===<br />
The proposed method makes use of a linear mapping where continuous, unbounded values are discretized for each bitwidth <math>k</math> with a uniform spacing of<br />
<br />
<math>\sigma(k) = 2^{1-k}, k \in Z_+ </math><br />
With this, the full quantization function is<br />
<br />
<math>Q(x,k) = Clip\left \{ \sigma(k) \cdot round\left [ \frac{x}{\sigma(k)} \right ], -1 + \sigma(k), 1 - \sigma(k) \right \}</math>, <br />
<br />
where <math>round</math> approximates continuous values to their nearest discrete state, and <math>Clip</math> is the saturation function that clips unbounded values to <math>[-1 + \sigma, 1 - \sigma]</math>. Note that this function is only used when simulating integer operations on floating-point hardware; on native integer hardware, this is done automatically. In addition to this quantization function, a distribution scaling factor is used in some quantization operators to preserve as much variance as possible when applying the quantization function above. The scaling factor is defined below.<br />
<br />
<math>Shift(x) = 2^{round(log_2(x))}</math><br />
<br />
Finally, stochastic rounding is substituted for small or real-valued updates during gradient accumulation.<br />
<br />
A visual representation of these operations is below.<br />
[[File:p32fig2.PNG|center|thumb|800px|Quantization methods used in WAGE. The notation <math>P, x, \lfloor \cdot \rfloor, \lceil \cdot \rceil</math> denotes probability, vector, floor and ceil, respectively. <math>Shift(\cdot)</math> refers to distribution shifting with a certain argument]]<br />
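<br />
The step size, the quantization function and the shift factor above can be sketched in a few lines of plain Python. This is only an illustrative floating-point simulation; the function names <code>sigma</code>, <code>shift</code> and <code>quantize</code> are chosen here for illustration and are not taken from the authors' code.<br />

```python
import math


def sigma(k: int) -> float:
    """Minimum step size for a k-bit value: sigma(k) = 2^(1 - k)."""
    return 2.0 ** (1 - k)


def shift(x: float) -> float:
    """Round a positive scale factor to the nearest power of two."""
    return 2.0 ** round(math.log2(x))


def quantize(x: float, k: int) -> float:
    """Q(x, k): snap x to the sigma(k) grid, then saturate."""
    s = sigma(k)
    snapped = s * round(x / s)  # Python's round() sends exact ties to even
    return max(-1.0 + s, min(1.0 - s, snapped))


# With k = 2 the representable values are {-0.5, 0.0, 0.5}, i.e. ternary.
print([quantize(x, 2) for x in (-0.9, 0.1, 0.3)])
print(shift(3.0), shift(0.3))
```

Because <code>shift</code> always returns a power of two, dividing by it corresponds to a bit shift on integer hardware, which is presumably why the scaling factor is restricted to this form.<br />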
<br />
=== Weight Initialization ===<br />
In this work, batch normalization is simplified to a constant scaling layer in order to sidestep the problem of normalizing outputs without floating-point math, and to remove the extra memory requirement associated with batch normalization. As such, some care must be taken when initializing weights. The authors use a modified initialization method based on MSRA<sup>[[#References|[4]]]</sup>.<br />
<br />
<math>W \thicksim U(-L, +L),L = max \left \{ \sqrt{6/n_{in}}, L_{min} \right \}, L_{min} = \beta \sigma</math><br />
<br />
<math>n_{in}</math> is the layer fan-in number and <math>U</math> denotes the uniform distribution. The original MSRA limit <math>\sqrt{6/n_{in}}</math> is modified by adding the condition that the distribution width should be at least <math>\beta \sigma</math>, where <math>\beta</math> is a constant greater than 1 and <math>\sigma</math> is the minimum step size defined above. This prevents the weights from being initialized to all zeros when the bitwidth is low or the fan-in number is high.<br />
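<br />
A minimal sketch of this initialization rule follows. The value <code>beta=1.5</code> and the helper names are assumptions made for illustration; the text above only requires <math>\beta > 1</math>.<br />

```python
import math
import random


def init_limit(n_in: int, k_w: int, beta: float = 1.5) -> float:
    """L = max(sqrt(6 / n_in), beta * sigma(k_w)); beta = 1.5 is an assumed value."""
    l_min = beta * 2.0 ** (1 - k_w)
    return max(math.sqrt(6.0 / n_in), l_min)


def init_weights(n_in: int, n_out: int, k_w: int, beta: float = 1.5, seed: int = 0):
    """Draw an n_out x n_in weight matrix from U(-L, +L)."""
    rng = random.Random(seed)
    limit = init_limit(n_in, k_w, beta)
    return [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]


# High fan-in with 2-bit weights: the MSRA limit (about 0.07) would round
# every weight to zero on a 0.5 grid, so the L_min floor takes over.
print(init_limit(n_in=1152, k_w=2))
# Low fan-in with 8-bit weights: the MSRA limit is already wide enough.
print(init_limit(n_in=25, k_w=8))
```

The first call illustrates the failure mode the floor is meant to avoid: without it, a limit well below half the step size would quantize every initial weight to zero.<br />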
<br />
=== Quantization Details ===<br />
<br />
==== Weight <math>Q_W(\cdot)</math> ====<br />
<math>W_q = Q_W(W) = Q(W, k_W)</math><br />
<br />
The quantization operator is simply the quantization function previously introduced. <br />
<br />
==== Activation <math>Q_A(\cdot)</math> ====<br />
The authors say that the variance of the weights passed through this function will be scaled compared to the variance of the weights as initialized. To prevent this effect from blowing up the network outputs, they introduce a scaling factor <math>\alpha</math>. Notice that it is constant for each layer.<br />
<br />
<math>\alpha = max \left \{ Shift(L_{min} / L), 1 \right \}</math><br />
<br />
The quantization operator is then<br />
<br />
<math>a_q = Q_A(a) = Q(a/\alpha, k_A)</math><br />
<br />
The scaling factor approximates batch normalization.<br />
<br />
==== Error <math>Q_E(\cdot)</math> ====<br />
The magnitude of the error can vary greatly. A previous approach (DoReFa-Net<sup>[[#References|[5]]]</sup>) solves the issue by using an affine transform to map the error to the range <math>[-1, 1]</math>, applying quantization, and then applying the inverse transform. However, the authors claim that this approach still requires float32, and that the magnitude of the error is unimportant; what matters is the orientation of the error. Thus, they only scale the error distribution to the range <math>\left [ -\sqrt2, \sqrt2 \right ]</math> and quantize:<br />
<br />
<math>e_q = Q_E(e) = Q(e/Shift(max\{|e|\}), k_E)</math><br />
<br />
Here <math>max\{|e|\}</math> is the layer-wise maximum absolute value over all elements of the error e. Note that this discards any error elements less than the minimum step size.<br />
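<br />
The error operator can be sketched as below, treating the error of one layer as a flat Python list. The helper names are illustrative only. Since the largest magnitude is divided by the nearest power of two, the scaled values lie in <math>\left [ -\sqrt2, \sqrt2 \right ]</math> before clipping.<br />

```python
import math


def sigma(k):
    return 2.0 ** (1 - k)


def shift(x):
    return 2.0 ** round(math.log2(x))


def quantize(x, k):
    s = sigma(k)
    return max(-1.0 + s, min(1.0 - s, s * round(x / s)))


def quantize_error(e, k_e):
    """Q_E: rescale by Shift(max|e|) over the whole layer, then quantize."""
    max_abs = max(abs(v) for v in e)
    if max_abs == 0.0:
        return [0.0 for _ in e]  # an all-zero error has nothing to rescale
    scale = shift(max_abs)
    return [quantize(v / scale, k_e) for v in e]


# The largest element saturates, the middle one keeps its sign and rough
# relative size, and the tiny one is rounded away to zero.
print(quantize_error([3e-4, -1e-4, 2e-7], k_e=8))
```

The absolute magnitude of the input is lost in this sketch, which matches the claim above that only the orientation of the error is preserved.<br />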
<br />
==== Gradient <math>Q_G(\cdot)</math> ====<br />
Similar to the activations and errors, the gradients are rescaled:<br />
<br />
<math>g_s = \eta \cdot g/Shift(max\{|g|\})</math><br />
<br />
<math> \eta </math> is a shift-based learning rate. It is an integer power of 2. The shifted gradients are represented in units of minimum step sizes <math> \sigma(k) </math>. When reducing the bitwidth of the gradients (remember that the gradients are coming out of a MAC operation, so the bitwidth may have increased) stochastic rounding is used as a substitute for small gradient accumulation.<br />
<br />
<math>\Delta W = Q_G(g) = \sigma(k_G) \cdot sgn(g_s) \cdot \left \{ \lfloor | g_s | \rfloor + Bernoulli(|g_s|<br />
- \lfloor | g_s | \rfloor) \right \}</math><br />
<br />
This randomly rounds the result of the MAC operation up or down to the nearest quantization for the given gradient bitwidth. The weights are updated with the resulting discrete increments:<br />
<br />
<math>W_{t+1} = Clip \left \{ W_t - \Delta W_t, -1 + \sigma(k_G), 1 - \sigma(k_G) \right \}</math><br />
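<br />
The gradient operator and the weight update can be sketched as follows. This is a floating-point simulation under stated assumptions: gradients and weights are flat lists, the random source is Python's <code>random.Random</code>, and all function names are hypothetical.<br />

```python
import math
import random


def sigma(k):
    return 2.0 ** (1 - k)


def shift(x):
    return 2.0 ** round(math.log2(x))


def quantize_gradient(g, k_g, eta, rng):
    """Q_G: return the discrete increments Delta W for a list of gradients g."""
    max_abs = max(abs(v) for v in g)
    if max_abs == 0.0:
        return [0.0 for _ in g]  # nothing to rescale
    scale = shift(max_abs)
    step = sigma(k_g)
    delta = []
    for v in g:
        g_s = eta * v / scale  # shifted gradient, in units of step sizes
        magnitude = abs(g_s)
        whole = math.floor(magnitude)
        # Round up with probability equal to the fractional part.
        bump = 1 if rng.random() < magnitude - whole else 0
        delta.append(step * math.copysign(whole + bump, g_s))
    return delta


def update_weights(w, delta, k_g):
    """W_{t+1} = Clip(W_t - Delta W_t, -1 + sigma(k_G), 1 - sigma(k_G))."""
    step = sigma(k_g)
    return [max(-1.0 + step, min(1.0 - step, wi - di)) for wi, di in zip(w, delta)]


rng = random.Random(0)
delta = quantize_gradient([0.5, -0.25], k_g=8, eta=2, rng=rng)
print(delta)
print(update_weights([0.0, 0.99], delta, k_g=8))
```

Stochastic rounding is unbiased: a shifted gradient of 0.3 becomes one step with probability 0.3 and zero otherwise, so small gradients still move the weights on average instead of always being rounded away.<br />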
<br />
=== Miscellaneous ===<br />
To train WAGE networks, the authors used pure SGD exclusively because more complicated techniques such as Momentum or RMSProp increase memory consumption and are complicated by the rescaling that happens within each quantization operator.<br />
<br />
The quantization and stochastic rounding are a form of regularization.<br />
<br />
The authors didn't use a traditional softmax with cross-entropy loss for the experiments because there does not yet exist a softmax layer for low-bit integers. Instead, they use a sum of squared error loss. This works for tasks with a small number of categories, but does not scale well.<br />
<br />
== Experiments ==<br />
For all experiments, the default layer bitwidth configuration is 2-8-8-8 for Weights, Activations, Gradients, and Errors. The weight bitwidth is set to 2 because that results in ternary weights, and therefore no multiplication during inference. The authors argue that the bitwidths for activations and errors should be the same because the computation graph for each is similar and might use the same hardware. During training, the weight bitwidth is 8. For inference the weights are ternarized.<br />
<br />
=== Implementation Details ===<br />
MNIST: Network is LeNet-5 variant<sup>[[#References|[6]]]</sup> with 32C5-MP2-64C5-MP2-512FC-10SSE.<br />
<br />
SVHN & CIFAR10: VGG variant<sup>[[#References|[7]]]</sup> with 2×(128C3)-MP2-2×(256C3)-MP2-2×(512C3)-MP2-1024FC-10SSE. For the CIFAR10 dataset, the data augmentation used in training follows Lee et al. (2015)<sup>[[#References|[10]]]</sup>.<br />
<br />
ImageNet: AlexNet variant<sup>[[#References|[8]]]</sup> on ILSVRC12 dataset.<br />
{| class="wikitable"<br />
|+Test or validation error rates (%) in previous works and WAGE on multiple datasets. Opt denotes gradient descent optimizer, withM means SGD with momentum, BN represents batch normalization, 32 bit refers to float32, and ImageNet top-k format: top1/top5.<br />
!Method<br />
!<math>k_W</math><br />
!<math>k_A</math><br />
!<math>k_G</math><br />
!<math>k_E</math><br />
!Opt<br />
!BN<br />
!MNIST<br />
!SVHN<br />
!CIFAR10<br />
!ImageNet<br />
|-<br />
|BC<br />
|1<br />
|32<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|1.29<br />
|2.30<br />
|9.90<br />
|<br />
|-<br />
|BNN<br />
|1<br />
|1<br />
|32<br />
|32<br />
|Adam<br />
|yes <br />
|0.96<br />
|2.53<br />
|10.15<br />
|<br />
|-<br />
|BWN<br />
|1<br />
|32<br />
|32<br />
|32<br />
|withM<br />
|yes<br />
|<br />
|<br />
|<br />
|43.2/20.6<br />
|-<br />
|XNOR<br />
|1<br />
|1<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|<br />
|55.8/30.8<br />
|-<br />
|TWN<br />
|2<br />
|32<br />
|32<br />
|32<br />
|withM<br />
|yes<br />
|0.65<br />
|<br />
|7.44<br />
|'''34.7/13.8'''<br />
|-<br />
|TTQ<br />
|2<br />
|32<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|6.44<br />
|42.5/20.3<br />
|-<br />
|DoReFa<br />
|8<br />
|8<br />
|32<br />
|8<br />
|Adam<br />
|yes<br />
|<br />
|2.30<br />
|<br />
|47.0/<br />
|-<br />
|TernGrad<br />
|32<br />
|32<br />
|2<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|14.36<br />
|42.4/19.5<br />
|-<br />
|WAGE<br />
|2<br />
|8<br />
|8<br />
|8<br />
|SGD<br />
|no<br />
|'''0.40'''<br />
|'''1.92'''<br />
|'''6.78'''<br />
|51.6/27.8<br />
|}<br />
<br />
=== Training Curves and Regularization ===<br />
The authors compare the 2-8-8-8 WAGE configuration introduced above, a 2-8-f-f (meaning float32) configuration, and a completely floating point version on CIFAR10. The test error is plotted against epoch. For training these networks, the learning rate is divided by 8 at the 200th epoch and again at the 250th epoch.<br />
[[File:p32fig3.PNG|center|thumb|800px|Training curves of WAGE variations and a vanilla CNN on CIFAR10]]<br />
The 2-8-8-8 configuration converges comparably to the vanilla CNN and outperforms the 2-8-f-f variant. The authors speculate that this is because the extra discretization acts as a regularizer.<br />
<br />
=== Bitwidth of Errors ===<br />
The CIFAR10 test accuracy is plotted against bitwidth below and the error density for a single layer is compared with the Vanilla network.<br />
[[File:p32fig4.PNG|center|thumb|520x522px|The 10 run accuracies of different <math>k_E</math>]]<br />
<br />
[[File:32_error.png|center|thumb|520x522px|Histogram of errors for Vanilla network and Wage network. After being quantized and shifted each layer, the error is reshaped and so most orientation information is retained. ]]<br />
<br />
The table below shows the test error rates on CIFAR10 when the upper boundary is left-shifted by a factor γ. From this table we can see that large values play a critical role in backpropagation training even though they are infrequent, while the majority of values, which are small, are just noise.<br />
<br />
[[File:testerror_rate.png|center]]<br />
<br />
=== Bitwidth of Gradients ===<br />
<br />
The authors next investigated the choice of a proper <math>k_G</math> for gradients using the CIFAR10 dataset. <br />
<br />
{| class="wikitable"<br />
|+Test error rates (%) on CIFAR10 with different <math>k_G</math><br />
!<math>k_G</math><br />
!2<br />
!3<br />
!4<br />
!5<br />
!6<br />
!7<br />
!8<br />
!9<br />
!10<br />
!11<br />
!12<br />
|-<br />
|error<br />
|54.22<br />
|51.57<br />
|28.22<br />
|18.01<br />
|11.48<br />
|7.61<br />
|6.78<br />
|6.63<br />
|6.43<br />
|6.55<br />
|6.57<br />
|}<br />
<br />
The results show similar bitwidth requirements as the last experiment for <math>k_E</math>.<br />
<br />
The authors also examined the effect of bitwidth on the ImageNet implementation.<br />
<br />
Here, C denotes 12 bits (hexadecimal notation) and BN indicates that batch normalization is added. Seven models are compared: 2888 from the first experiment, 288C for more accurate errors (12 bits), 28C8 for larger buffer space, 28f8 for non-quantized gradients, 28ff for errors and gradients in float32, 28ff with BN added, and a baseline vanilla model, which refers to the original AlexNet architecture. <br />
<br />
{| class="wikitable"<br />
|+Top-5 error rates (%) on ImageNet with different <math>k_G</math> and <math>k_E</math><br />
!Pattern<br />
!vanilla<br />
!28ff-BN<br />
!28ff<br />
!28f8<br />
!28C8<br />
!288C<br />
!2888<br />
|-<br />
|error<br />
|19.29<br />
|20.67<br />
|24.14<br />
|23.92<br />
|26.88<br />
|28.06<br />
|27.82<br />
|}<br />
<br />
The comparison between 28C8 and 288C shows that the model may perform better if it has more buffer space <math>k_G</math> for gradient accumulation than if it has high-resolution orientation <math>k_E</math>. The authors also noted that batch normalization and <math>k_G</math> are more important for ImageNet because the training set samples are highly variant.<br />
<br />
== Discussion ==<br />
The authors have a few areas they believe this approach could be improved.<br />
<br />
'''MAC Operation:''' The 2-8-8-8 configuration was chosen because the low weight bitwidth means there is no multiplication during inference. However, this does not remove the requirement for multiplication during training. A 2-2-8-8 configuration satisfies this requirement, but it is difficult to train and detrimental to accuracy.<br />
<br />
'''Non-linear Quantization:''' The linear mapping used in this approach is simple, but there might be a more effective mapping. For example, a logarithmic mapping could be more effective if the weights and activations have a log-normal distribution.<br />
<br />
'''Normalization:''' Normalization layers (softmax, batch normalization) were not used in this paper. Quantized versions are an area of future work.<br />
<br />
== Conclusion ==<br />
<br />
A framework for training and inference without the use of floating-point representation is presented. By quantizing all operations and operands of a network, the authors successfully reduce the energy costs of both training and inference with deep learning architectures. Future work may further improve compression and memory requirements.<br />
<br />
== Implementation ==<br />
The following repository provides the source code for the paper, as written by the authors in TensorFlow: https://github.com/boluoweifenda/WAGE.<br />
[[File:DAIMA.jpg|center|thumb|800px|]]<br />
== Limitation == <br />
<br />
* The paper states the advantages in energy costs, but are there any limitations or trade-offs in choosing integer over floating-point operations? In what situations is such an implementation appropriate? The authors should explain more on this.<br />
<br />
== References ==<br />
# Sze, Vivienne; Chen, Yu-Hsin; Yang, Tien-Ju; Emer, Joel (2017-03-27). [http://arxiv.org/abs/1703.09039 "Efficient Processing of Deep Neural Networks: A Tutorial and Survey"]. arXiv:1703.09039 [cs].<br />
# Courbariaux, Matthieu; Bengio, Yoshua; David, Jean-Pierre (2015-11-01). [http://arxiv.org/abs/1511.00363 "BinaryConnect: Training Deep Neural Networks with binary weights during propagations"]. arXiv:1511.00363 [cs].<br />
# Li, Fengfu; Zhang, Bo; Liu, Bin (2016-05-16). [http://arxiv.org/abs/1605.04711 "Ternary Weight Networks"]. arXiv:1605.04711 [cs].<br />
# He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2015-02-06). [http://arxiv.org/abs/1502.01852 "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification"]. arXiv:1502.01852 [cs].<br />
# Zhou, Shuchang; Wu, Yuxin; Ni, Zekun; Zhou, Xinyu; Wen, He; Zou, Yuheng (2016-06-20). [http://arxiv.org/abs/1606.06160 "DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients"]. arXiv:1606.06160 [cs].<br />
# Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. (November 1998). [http://ieeexplore.ieee.org/document/726791/?reload=true "Gradient-based learning applied to document recognition"]. Proceedings of the IEEE. 86 (11): 2278–2324. doi:10.1109/5.726791. ISSN 0018-9219.<br />
# Simonyan, Karen; Zisserman, Andrew (2014-09-04). [http://arxiv.org/abs/1409.1556 "Very Deep Convolutional Networks for Large-Scale Image Recognition"]. arXiv:1409.1556 [cs].<br />
# Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E (2012). Pereira, F.; Burges, C. J. C.; Bottou, L.; Weinberger, K. Q., eds. [http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf Advances in Neural Information Processing Systems 25 (PDF)]. Curran Associates, Inc. pp. 1097–1105.<br />
# Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.<br />
# Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pp. 562–570, 2015.<br />
# Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Springer, 2016.<br />
# “Boluoweifenda/WAGE.” GitHub, github.com/boluoweifenda/WAGE.<br />
# Han, S., Mao, H., & Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149.<br />
# Tang, Wei, Gang Hua, and Liang Wang. "How to train a compact binary neural network with high accuracy?." AAAI. 2017.</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Training_And_Inference_with_Integers_in_Deep_Neural_Networks&diff=36360Training And Inference with Integers in Deep Neural Networks2018-04-20T21:10:28Z<p>Ws2chen: /* WAGE Quantization */</p>
<hr />
<div>== Introduction ==<br />
<br />
Deep neural networks have enjoyed much success in all manner of tasks, but it is common for these networks to be complicated and to have high memory requirements while performing many floating-point operations (FLOPs). As a result, running many of these models is very expensive in terms of energy use, and using state-of-the-art networks in applications where energy is limited can be very difficult. In order to overcome this and allow use of these networks in situations with low energy availability, the energy costs must be reduced while maintaining network performance that is as high as possible and/or practical.<br />
<br />
Most existing methods focus on compressing a model and reducing its energy requirements for inference rather than for training. Since training with SGD requires accumulation, training usually has a higher precision demand than inference. This paper proposes a framework to reduce complexity during both training and inference through the use of integers instead of floats. The authors address how to quantize all operations and operands, and examine the bitwidth requirement for SGD computation and accumulation. Using integers instead of floats saves energy because integer operations are more efficient than floating-point operations (see the table below). Also, dedicated hardware for deep learning that uses integer operations already exists (such as the 1st generation of Google TPU), so understanding the best way to use integers is well-motivated. A TPU is a Tensor Processing Unit developed by Google for tensor operations. A TPU is comparable to a GPU but produces higher IO per second for low-precision computations.<br />
{| class="wikitable"<br />
|+Rough Energy Costs in 45nm 0.9V<sup>[[#References|[1]]]</sup><br />
!<br />
! colspan="2" |Energy(pJ)<br />
! colspan="2" |Area(<math>\mu m^2</math>)<br />
|-<br />
!Operation<br />
!MUL<br />
!ADD<br />
!MUL<br />
!ADD<br />
|-<br />
|8-bit INT<br />
|0.2<br />
|0.03<br />
|282<br />
|36<br />
|-<br />
|16-bit FP<br />
|1.1<br />
|0.4<br />
|1640<br />
|1360<br />
|-<br />
|32-bit FP<br />
|3.7<br />
|0.9<br />
|7700<br />
|4184<br />
|}<br />
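As a rough worked example of what the table implies, the sketch below adds one multiply and one add per format to estimate the energy of a single MAC. Treating a MAC as exactly one MUL plus one ADD at the same width is a simplifying assumption made here, not a figure from the cited survey.<br />

```python
# Energy per operation in pJ, copied from the table above (45nm, 0.9V).
energy_pj = {
    "int8": {"mul": 0.2, "add": 0.03},
    "fp16": {"mul": 1.1, "add": 0.4},
    "fp32": {"mul": 3.7, "add": 0.9},
}


def mac_energy(fmt: str) -> float:
    """Energy of one multiply plus one add, in pJ (simplified MAC model)."""
    ops = energy_pj[fmt]
    return ops["mul"] + ops["add"]


for fmt in energy_pj:
    ratio = mac_energy(fmt) / mac_energy("int8")
    print(f"{fmt}: {mac_energy(fmt):.2f} pJ per MAC, {ratio:.1f}x int8")
```

Under this simplified model a float32 MAC costs roughly 20 times as much energy as an 8-bit integer MAC.<br />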
The authors call the framework WAGE because they consider how best to handle the '''W'''eights, '''A'''ctivations, '''G'''radients, and '''E'''rrors separately.<br />
<br />
== Related Work ==<br />
<br />
=== Weight and Activation ===<br />
Existing works to train DNNs on binary weights and activations <sup>[[#References|[2]]]</sup> add noise to weights and activations as a form of regularization. The use of high-precision accumulation is required for SGD optimization since real-valued gradients are obtained from real-valued variables. XNOR-Net <sup>[[#References|[11]]]</sup> uses bitwise operations to approximate convolutions in a highly memory-efficient manner, and applies a filter-wise scaling factor for weights to improve performance. However, these floating-point factors are calculated simultaneously during training, which aggravates the training effort. Ternary weight networks (TWN) <sup>[[#References|[3]]]</sup> and Trained ternary quantization (TTQ)<sup>[[#References|[9]]]</sup> offer more expressive ability than binary weight networks by constraining the weights to be ternary-valued {-1,0,1} using two symmetric thresholds. Tang et al.<sup>[[#References|[14]]]</sup> achieve impressive results by using a binarization scheme according to which floating-point activation vectors are approximated as linear combinations of binary vectors, where the weights in the linear combination are floating-point. Still other approaches rely on relative quantization<sup>[[#References|[13]]]</sup>; however, an efficient implementation is difficult to apply in practice due to the requirements of persisting and applying a codebook.<br />
<br />
=== Gradient Computation and Accumulation ===<br />
DoReFa-Net quantizes gradients to low-bitwidth floating-point numbers with discrete states in the backward pass. In order to reduce the overhead of gradient synchronization in distributed training, the TernGrad method quantizes the gradient updates to ternary values. In both works the weights are still stored and updated with float32, and the quantization of batch normalization and its derivative is ignored.<br />
<br />
== WAGE Quantization ==<br />
The core idea of the proposed method is to constrain the following to low-bitwidth integers on each layer:<br />
* '''W:''' weight in inference<br />
* '''a:''' activation in inference<br />
* '''e:''' error in backpropagation<br />
* '''g:''' gradient in backpropagation<br />
[[File:p32fig1.PNG|center|thumb|800px|Four operators QW (·), QA(·), QG(·), QE(·) added in WAGE computation dataflow to reduce precision, bitwidth of signed integers are below or on the right of arrows, activations are included in MAC for concision.]]<br />
<br />
The authors extend the original definition of errors to the multi-layer case: the error e is the gradient of the activation a from the perspective of each convolutional or fully-connected layer, while the gradient g refers specifically to the gradient accumulation of the weight W. Consider the i-th layer of a feed-forward network.<br />
<br />
The error and gradient are defined as:<br />
<br />
<math>e^i = \frac{\partial L}{\partial a^i}, g^i = \frac{\partial L}{\partial W^i}</math><br />
<br />
where L is the loss function.<br />
<br />
The precisions in bits of the errors, activations, gradients, and weights are <math>k_E</math>, <math>k_A</math>, <math>k_G</math>, and <math>k_W</math> respectively. As shown in the above figure, each quantity also has a quantization operator that reduces the bitwidth increase caused by multiply-accumulate (MAC) operations. Also, note that since this is a layer-by-layer approach, each layer may be followed or preceded by a layer with a different precision, or even a layer using floating-point math.<br />
<br />
=== Shift-Based Linear Mapping and Stochastic Mapping ===<br />
The proposed method makes use of a linear mapping where continuous, unbounded values are discretized for each bitwidth <math>k</math> with a uniform spacing of<br />
<br />
<math>\sigma(k) = 2^{1-k}, k \in Z_+ </math><br />
With this, the full quantization function is<br />
<br />
<math>Q(x,k) = Clip\left \{ \sigma(k) \cdot round\left [ \frac{x}{\sigma(k)} \right ], -1 + \sigma(k), 1 - \sigma(k) \right \}</math>, <br />
<br />
where <math>round</math> approximates continuous values to their nearest discrete state, and <math>Clip</math> is the saturation function that clips unbounded values to <math>[-1 + \sigma, 1 - \sigma]</math>. Note that this function is only used when simulating integer operations on floating-point hardware; on native integer hardware, this is done automatically. In addition to this quantization function, a distribution scaling factor is used in some quantization operators to preserve as much variance as possible when applying the quantization function above. The scaling factor is defined below.<br />
<br />
<math>Shift(x) = 2^{round(log_2(x))}</math><br />
<br />
Finally, stochastic rounding is substituted for small or real-valued updates during gradient accumulation.<br />
<br />
A visual representation of these operations is below.<br />
[[File:p32fig2.PNG|center|thumb|800px|Quantization methods used in WAGE. The notation <math>P, x, \lfloor \cdot \rfloor, \lceil \cdot \rceil</math> denotes probability, vector, floor and ceil, respectively. <math>Shift(\cdot)</math> refers to distribution shifting with a certain argument]]<br />
<br />
=== Weight Initialization ===<br />
In this work, batch normalization is simplified to a constant scaling layer in order to sidestep the problem of normalizing outputs without floating-point math, and to remove the extra memory requirement associated with batch normalization. As such, some care must be taken when initializing weights. The authors use a modified initialization method based on MSRA<sup>[[#References|[4]]]</sup>.<br />
<br />
<math>W \thicksim U(-L, +L),L = max \left \{ \sqrt{6/n_{in}}, L_{min} \right \}, L_{min} = \beta \sigma</math><br />
<br />
<math>n_{in}</math> is the layer fan-in number and <math>U</math> denotes the uniform distribution. The original MSRA limit <math>\sqrt{6/n_{in}}</math> is modified by adding the condition that the distribution width should be at least <math>\beta \sigma</math>, where <math>\beta</math> is a constant greater than 1 and <math>\sigma</math> is the minimum step size defined above. This prevents the weights from being initialized to all zeros when the bitwidth is low or the fan-in number is high.<br />
<br />
=== Quantization Details ===<br />
<br />
==== Weight <math>Q_W(\cdot)</math> ====<br />
<math>W_q = Q_W(W) = Q(W, k_W)</math><br />
<br />
The quantization operator is simply the quantization function previously introduced. <br />
<br />
==== Activation <math>Q_A(\cdot)</math> ====<br />
The authors say that the variance of the weights passed through this function will be scaled compared to the variance of the weights as initialized. To prevent this effect from blowing up the network outputs, they introduce a scaling factor <math>\alpha</math>. Notice that it is constant for each layer.<br />
<br />
<math>\alpha = max \left \{ Shift(L_{min} / L), 1 \right \}</math><br />
<br />
The quantization operator is then<br />
<br />
<math>a_q = Q_A(a) = Q(a/\alpha, k_A)</math><br />
<br />
The scaling factor approximates batch normalization.<br />
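<br />
The activation operator can be sketched as below. One reading is assumed here and should be checked against the paper: the <math>L</math> in the ratio <math>L_{min} / L</math> is taken to be the unclipped MSRA limit <math>\sqrt{6/n_{in}}</math>, because with the clipped <math>L = max\{\sqrt{6/n_{in}}, L_{min}\}</math> the ratio never exceeds 1 and <math>\alpha</math> would always equal 1. The value <code>beta=1.5</code> and all function names are likewise illustrative.<br />

```python
import math


def sigma(k):
    return 2.0 ** (1 - k)


def shift(x):
    return 2.0 ** round(math.log2(x))


def quantize(x, k):
    s = sigma(k)
    return max(-1.0 + s, min(1.0 - s, s * round(x / s)))


def activation_scale(n_in, k_w, beta=1.5):
    """alpha = max(Shift(L_min / L), 1), with L read as sqrt(6 / n_in)."""
    l_min = beta * sigma(k_w)
    l_msra = math.sqrt(6.0 / n_in)  # assumed: the limit before the L_min floor
    return max(shift(l_min / l_msra), 1.0)


def quantize_activation(a, alpha, k_a):
    """Q_A(a) = Q(a / alpha, k_A)."""
    return quantize(a / alpha, k_a)


alpha = activation_scale(n_in=1152, k_w=2)
print(alpha)
print(quantize_activation(2.0, alpha, k_a=8))
```

Under this reading, <math>\alpha</math> compensates for weights that were initialized wider than the MSRA limit, and because it is a power of two the division is a fixed bit shift per layer.<br />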
<br />
==== Error <math>Q_E(\cdot)</math> ====<br />
The magnitude of the error can vary greatly. A previous approach (DoReFa-Net<sup>[[#References|[5]]]</sup>) solves the issue by using an affine transform to map the error to the range <math>[-1, 1]</math>, applying quantization, and then applying the inverse transform. However, the authors claim that this approach still requires float32, and that the magnitude of the error is unimportant; what matters is the orientation of the error. Thus, they only scale the error distribution to the range <math>\left [ -\sqrt2, \sqrt2 \right ]</math> and quantize:<br />
<br />
<math>e_q = Q_E(e) = Q(e/Shift(max\{|e|\}), k_E)</math><br />
<br />
Here <math>max\{|e|\}</math> is the layer-wise maximum absolute value over all elements of the error e. Note that this discards any error elements less than the minimum step size.<br />
<br />
==== Gradient <math>Q_G(\cdot)</math> ====<br />
Similar to the activations and errors, the gradients are rescaled:<br />
<br />
<math>g_s = \eta \cdot g/Shift(max\{|g|\})</math><br />
<br />
<math> \eta </math> is a shift-based learning rate. It is an integer power of 2. The shifted gradients are represented in units of minimum step sizes <math> \sigma(k) </math>. When reducing the bitwidth of the gradients (remember that the gradients are coming out of a MAC operation, so the bitwidth may have increased) stochastic rounding is used as a substitute for small gradient accumulation.<br />
<br />
<math>\Delta W = Q_G(g) = \sigma(k_G) \cdot sgn(g_s) \cdot \left \{ \lfloor | g_s | \rfloor + Bernoulli(|g_s|<br />
- \lfloor | g_s | \rfloor) \right \}</math><br />
<br />
This randomly rounds the result of the MAC operation up or down to the nearest quantization for the given gradient bitwidth. The weights are updated with the resulting discrete increments:<br />
<br />
<math>W_{t+1} = Clip \left \{ W_t - \Delta W_t, -1 + \sigma(k_G), 1 - \sigma(k_G) \right \}</math><br />
<br />
=== Miscellaneous ===<br />
To train WAGE networks, the authors used pure SGD exclusively because more complicated techniques such as Momentum or RMSProp increase memory consumption and are complicated by the rescaling that happens within each quantization operator.<br />
<br />
The quantization and stochastic rounding are a form of regularization.<br />
<br />
The authors didn't use a traditional softmax with cross-entropy loss for the experiments because there does not yet exist a softmax layer for low-bit integers. Instead, they use a sum of squared error loss. This works for tasks with a small number of categories, but does not scale well.<br />
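<br />
A minimal sketch of a sum-of-squared-error loss against a one-hot target is given below. Whether the authors include a factor of 1/2 or any other scaling is not stated above, so the plain sum is an assumption, and the function names are hypothetical.<br />

```python
def sse_loss(outputs, label):
    """Sum of squared errors between the outputs and a one-hot target."""
    return sum((o - (1.0 if i == label else 0.0)) ** 2 for i, o in enumerate(outputs))


def sse_error(outputs, label):
    """Gradient of the loss with respect to each output: 2 * (o - t)."""
    return [2.0 * (o - (1.0 if i == label else 0.0)) for i, o in enumerate(outputs)]


outputs = [0.5, 0.0, 0.25]
print(sse_loss(outputs, label=0))
print(sse_error(outputs, label=0))
```

Both the loss and its gradient need only subtraction and multiplication, with no exponentials or normalizing division, which is what makes this loss straightforward to use with low-bit integers.<br />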
<br />
== Experiments ==<br />
For all experiments, the default layer bitwidth configuration is 2-8-8-8 for Weights, Activations, Gradients, and Errors. The weight bitwidth is set to 2 because that results in ternary weights, and therefore no multiplication during inference. The authors argue that the bitwidths for activations and errors should be the same because the computation graph for each is similar and might use the same hardware. During training, the weight bitwidth is 8. For inference the weights are ternarized.<br />
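<br />
To illustrate why ternary weights remove multiplication from inference, the sketch below computes a dot product with weights stored as signs in {-1, 0, +1}. Storing the signs separately from the common step size <math>\sigma(2) = 0.5</math> is an assumption made here for illustration; that factor could be applied afterwards as a single shift.<br />

```python
def ternary_dot(activations, signs):
    """Dot product with weights in {-1, 0, +1} using only add and subtract."""
    acc = 0
    for a, s in zip(activations, signs):
        if s > 0:
            acc += a
        elif s < 0:
            acc -= a
        # s == 0: the activation is skipped entirely
    return acc


activations = [3, -2, 5, 7]  # integer activations
signs = [1, 0, -1, 1]        # ternary weights
print(ternary_dot(activations, signs))
```

The result equals the ordinary dot product of the two lists, but each term costs at most one addition or subtraction.<br />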
<br />
=== Implementation Details ===<br />
MNIST: Network is LeNet-5 variant<sup>[[#References|[6]]]</sup> with 32C5-MP2-64C5-MP2-512FC-10SSE.<br />
<br />
SVHN & CIFAR10: VGG variant<sup>[[#References|[7]]]</sup> with 2×(128C3)-MP2-2×(256C3)-MP2-2×(512C3)-MP2-1024FC-10SSE. For the CIFAR10 dataset, the data augmentation used in training follows Lee et al. (2015)<sup>[[#References|[10]]]</sup>.<br />
<br />
ImageNet: AlexNet variant<sup>[[#References|[8]]]</sup> on ILSVRC12 dataset.<br />
{| class="wikitable"<br />
|+Test or validation error rates (%) in previous works and WAGE on multiple datasets. Opt denotes gradient descent optimizer, withM means SGD with momentum, BN represents batch normalization, 32 bit refers to float32, and ImageNet top-k format: top1/top5.<br />
!Method<br />
!<math>k_W</math><br />
!<math>k_A</math><br />
!<math>k_G</math><br />
!<math>k_E</math><br />
!Opt<br />
!BN<br />
!MNIST<br />
!SVHN<br />
!CIFAR10<br />
!ImageNet<br />
|-<br />
|BC<br />
|1<br />
|32<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|1.29<br />
|2.30<br />
|9.90<br />
|<br />
|-<br />
|BNN<br />
|1<br />
|1<br />
|32<br />
|32<br />
|Adam<br />
|yes <br />
|0.96<br />
|2.53<br />
|10.15<br />
|<br />
|-<br />
|BWN<br />
|1<br />
|32<br />
|32<br />
|32<br />
|withM<br />
|yes<br />
|<br />
|<br />
|<br />
|43.2/20.6<br />
|-<br />
|XNOR<br />
|1<br />
|1<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|<br />
|55.8/30.8<br />
|-<br />
|TWN<br />
|2<br />
|32<br />
|32<br />
|32<br />
|withM<br />
|yes<br />
|0.65<br />
|<br />
|7.44<br />
|'''34.7/13.8'''<br />
|-<br />
|TTQ<br />
|2<br />
|32<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|6.44<br />
|42.5/20.3<br />
|-<br />
|DoReFa<br />
|8<br />
|8<br />
|32<br />
|8<br />
|Adam<br />
|yes<br />
|<br />
|2.30<br />
|<br />
|47.0/<br />
|-<br />
|TernGrad<br />
|32<br />
|32<br />
|2<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|14.36<br />
|42.4/19.5<br />
|-<br />
|WAGE<br />
|2<br />
|8<br />
|8<br />
|8<br />
|SGD<br />
|no<br />
|'''0.40'''<br />
|'''1.92'''<br />
|'''6.78'''<br />
|51.6/27.8<br />
|}<br />
<br />
=== Training Curves and Regularization ===<br />
The authors compare the 2-8-8-8 WAGE configuration introduced above, a 2-8-f-f (meaning float32) configuration, and a completely floating point version on CIFAR10. The test error is plotted against epoch. For training these networks, the learning rate is divided by 8 at the 200th epoch and again at the 250th epoch.<br />
[[File:p32fig3.PNG|center|thumb|800px|Training curves of WAGE variations and a vanilla CNN on CIFAR10]]<br />
The 2-8-8-8 configuration converges comparably to the vanilla CNN and outperforms the 2-8-f-f variant. The authors speculate that this is because the extra discretization acts as a regularizer.<br />
<br />
=== Bitwidth of Errors ===<br />
The CIFAR10 test accuracy is plotted against bitwidth below and the error density for a single layer is compared with the Vanilla network.<br />
[[File:p32fig4.PNG|center|thumb|520x522px|The 10 run accuracies of different <math>k_E</math>]]<br />
<br />
[[File:32_error.png|center|thumb|520x522px|Histogram of errors for the vanilla network and the WAGE network. After the errors are quantized and shifted at each layer, their distribution is reshaped, but most orientation information is retained.]]<br />
<br />
The table below shows the test error rates on CIFAR10 when the upper boundary is left-shifted by a factor γ. It shows that large error values, although infrequent, play a critical role in backpropagation training, while the majority of small values act mostly as noise.<br />
<br />
[[File:testerror_rate.png|center]]<br />
<br />
=== Bitwidth of Gradients ===<br />
<br />
The authors next investigated the choice of a proper <math>k_G</math> for gradients using the CIFAR10 dataset. <br />
<br />
{| class="wikitable"<br />
|+Test error rates (%) on CIFAR10 with different <math>k_G</math><br />
!<math>k_G</math><br />
!2<br />
!3<br />
!4<br />
!5<br />
!6<br />
!7<br />
!8<br />
!9<br />
!10<br />
!11<br />
!12<br />
|-<br />
|error<br />
|54.22<br />
|51.57<br />
|28.22<br />
|18.01<br />
|11.48<br />
|7.61<br />
|6.78<br />
|6.63<br />
|6.43<br />
|6.55<br />
|6.57<br />
|}<br />
<br />
The results show similar bitwidth requirements as the last experiment for <math>k_E</math>.<br />
<br />
The authors also examined the effect of bitwidth on the ImageNet implementation.<br />
<br />
Here, C denotes 12 bits (the hexadecimal digit C) and BN indicates that batch normalization is added. 7 models are used: 2888 from the first experiment, 288C for more accurate errors (12 bits), 28C8 for larger buffer space, 28f8 for non-quantization of gradients, 28ff for errors and gradients in float32, and 28ff with BN added. The baseline vanilla model refers to the original AlexNet architecture. <br />
<br />
{| class="wikitable"<br />
|+Top-5 error rates (%) on ImageNet with different <math>k_G</math> and <math>k_E</math><br />
!Pattern<br />
!vanilla<br />
!28ff-BN<br />
!28ff<br />
!28f8<br />
!28C8<br />
!288C<br />
!2888<br />
|-<br />
|error<br />
|19.29<br />
|20.67<br />
|24.14<br />
|23.92<br />
|26.88<br />
|28.06<br />
|27.82<br />
|}<br />
<br />
The comparison between 28C8 and 288C shows that the model may perform better if it has more buffer space <math>k_G</math> for gradient accumulation than if it has high-resolution orientation <math>k_E</math>. The authors also noted that batch normalization and <math>k_G</math> are more important for ImageNet because the training set samples are highly variant.<br />
<br />
== Discussion ==<br />
The authors identify a few areas in which this approach could be improved.<br />
<br />
'''MAC Operation:''' The 2-8-8-8 configuration was chosen because the low weight bitwidth means that no multiplications are needed during inference. However, this does not remove the need for multiplication during training. A 2-2-8-8 configuration would remove it, but it is difficult to train and detrimental to accuracy.<br />
<br />
'''Non-linear Quantization:''' The linear mapping used in this approach is simple, but there might be a more effective mapping. For example, a logarithmic mapping could be more effective if the weights and activations have a log-normal distribution.<br />
<br />
'''Normalization:''' Normalization layers (softmax, batch normalization) were not used in this paper. Quantized versions of these layers are an area of future work.<br />
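<br />
The difference between a linear and a logarithmic mapping can be sketched as follows. This is only an illustration: the linear quantizer assumes the uniform step <math>\sigma(k) = 2^{1-k}</math> with clipping to <math>[-1+\sigma, 1-\sigma]</math>, and the logarithmic quantizer (rounding to the nearest power of two) is a hypothetical alternative, not something evaluated in the paper. The function names are ours.<br />

```python
import math

def quantize_linear(x, k):
    """Uniform k-bit quantizer: round to a grid of step sigma, then clip.

    Assumes sigma(k) = 2^(1-k); note that Python's round() rounds ties to even.
    """
    sigma = 2.0 ** (1 - k)
    q = sigma * round(x / sigma)
    return min(1.0 - sigma, max(-1.0 + sigma, q))

def quantize_log(x):
    """Hypothetical logarithmic quantizer: round |x| to the nearest power of two."""
    if x == 0:
        return 0.0
    return math.copysign(2.0 ** round(math.log2(abs(x))), x)

# The linear grid has equal spacing everywhere, while the logarithmic grid is
# finer near zero and coarser for large magnitudes.
linear_8bit = quantize_linear(0.3, 8)
linear_2bit = quantize_linear(0.9, 2)
log_value = quantize_log(0.3)
```

A logarithmic grid spends more levels on small magnitudes, which is why it could suit values that are roughly log-normally distributed.<br />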
<br />
== Conclusion ==<br />
<br />
A framework for training and inference without the use of floating-point representation is presented. By quantizing all operations and operands of a network, the authors successfully reduce the energy costs of both training and inference with deep learning architectures. Future work may further improve compression and memory requirements.<br />
<br />
== Implementation ==<br />
The following repository provides the source code for the paper, as written by the authors in TensorFlow: https://github.com/boluoweifenda/WAGE.<br />
[[File:DAIMA.jpg|center|thumb|800px|]]<br />
== Limitation == <br />
<br />
* The paper states the advantages in energy costs, but are there limitations or trade-offs in choosing integer over floating-point operations? In which situations is such an implementation a good fit? The authors should explain more on this.<br />
<br />
== References ==<br />
# Sze, Vivienne; Chen, Yu-Hsin; Yang, Tien-Ju; Emer, Joel (2017-03-27). [http://arxiv.org/abs/1703.09039 "Efficient Processing of Deep Neural Networks: A Tutorial and Survey"]. arXiv:1703.09039 [cs].<br />
# Courbariaux, Matthieu; Bengio, Yoshua; David, Jean-Pierre (2015-11-01). [http://arxiv.org/abs/1511.00363 "BinaryConnect: Training Deep Neural Networks with binary weights during propagations"]. arXiv:1511.00363 [cs].<br />
# Li, Fengfu; Zhang, Bo; Liu, Bin (2016-05-16). [http://arxiv.org/abs/1605.04711 "Ternary Weight Networks"]. arXiv:1605.04711 [cs].<br />
# He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2015-02-06). [http://arxiv.org/abs/1502.01852 "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification"]. arXiv:1502.01852 [cs].<br />
# Zhou, Shuchang; Wu, Yuxin; Ni, Zekun; Zhou, Xinyu; Wen, He; Zou, Yuheng (2016-06-20). [http://arxiv.org/abs/1606.06160 "DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients"]. arXiv:1606.06160 [cs].<br />
# Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. (November 1998). [http://ieeexplore.ieee.org/document/726791/?reload=true "Gradient-based learning applied to document recognition"]. Proceedings of the IEEE. 86 (11): 2278–2324. doi:10.1109/5.726791. ISSN 0018-9219.<br />
# Simonyan, Karen; Zisserman, Andrew (2014-09-04). [http://arxiv.org/abs/1409.1556 "Very Deep Convolutional Networks for Large-Scale Image Recognition"]. arXiv:1409.1556 [cs].<br />
# Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E (2012). Pereira, F.; Burges, C. J. C.; Bottou, L.; Weinberger, K. Q., eds. [http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf Advances in Neural Information Processing Systems 25 (PDF)]. Curran Associates, Inc. pp. 1097–1105.<br />
# Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.<br />
# Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pp. 562–570, 2015.<br />
# Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Springer, 2016.<br />
# “Boluoweifenda/WAGE.” GitHub, github.com/boluoweifenda/WAGE.<br />
# Han, S., Mao, H., & Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149.<br />
# Tang, Wei, Gang Hua, and Liang Wang. "How to train a compact binary neural network with high accuracy?." AAAI. 2017.</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=On_The_Convergence_Of_ADAM_And_Beyond&diff=36358On The Convergence Of ADAM And Beyond2018-04-20T20:54:39Z<p>Ws2chen: /* Extension: ADAMNC Algorithm */</p>
<hr />
<div>= Introduction =<br />
Stochastic gradient descent (SGD) is currently the dominant method of training deep networks. Variants of SGD that scale the gradients using information from past gradients have been very successful, since the learning rate is adjusted on a per-feature basis, with ADAGRAD being one example. However, ADAGRAD's performance deteriorates when loss functions are nonconvex and gradients are dense. Several variants of ADAGRAD, such as RMSProp, ADAM, ADADELTA, and NADAM, have been proposed; they address the issue by using exponential moving averages of squared past gradients, thereby limiting the update to rely only on the past few gradients. To fix notation, let <math>g_{t,i}</math> denote the gradient of the objective <math>J</math> with respect to parameter <math>\theta_i</math> at step <math>t</math>:<br />
<math><br />
g_{t, i} = \nabla_\theta J( \theta_{t, i} ).<br />
</math><br />
<br />
The per-parameter SGD update is then:<br />
<math><br />
\theta_{t+1, i} = \theta_{t, i} - \eta \cdot g_{t, i}.<br />
</math><br />
<br />
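ADAGRAD instead rescales the step for each parameter. In vectorized form, with <math>G_t</math> the diagonal matrix whose entries are the sums of squared past gradients for each parameter and <math>\odot</math> the element-wise product, the update is:<br />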
<math><br />
\theta_{t+1} = \theta_{t} - \dfrac{\eta}{\sqrt{G_{t} + \epsilon}} \odot g_{t}.<br />
</math><br />
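<br />
The two updates above can be compared with a minimal sketch in plain Python. The function names, the toy gradient, and the value of <code>eps</code> are our own assumptions for illustration; the sketch only mirrors the formulas above.<br />

```python
import math

def sgd_step(theta, grad, eta):
    """Plain SGD: every coordinate uses the same step size eta."""
    return [th - eta * g for th, g in zip(theta, grad)]

def adagrad_step(theta, grad, accum, eta, eps=1e-8):
    """ADAGRAD: coordinate i is scaled by 1 / sqrt(G_ii + eps), where G_ii
    is the running sum of that coordinate's squared gradients."""
    new_accum = [a + g * g for a, g in zip(accum, grad)]
    new_theta = [th - eta / math.sqrt(a + eps) * g
                 for th, g, a in zip(theta, grad, new_accum)]
    return new_theta, new_accum

theta = [1.0, 1.0]
grad = [10.0, 0.1]   # one steep and one shallow coordinate
sgd_theta = sgd_step(theta, grad, eta=0.1)
ada_theta, accum = adagrad_step(theta, grad, [0.0, 0.0], eta=0.1)
```

With SGD the steep coordinate moves a hundred times further than the shallow one, whereas the first ADAGRAD step moves both coordinates by roughly the same amount <math>\eta</math>.<br />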
<br />
This paper focuses strictly on the pitfalls in convergence of the ADAM optimizer from a theoretical standpoint and proposes a novel improvement to ADAM called AMSGrad. The paper introduces the idea that ADAM can get "stuck" in its weighted average history, preventing it from converging to an optimal solution. For example, in an experiment there may be a large spike in the gradient during some mini-batches, but since ADAM scales the current update by the exponential moving average of squared past gradients, the effect of the large spike is quickly lost. The authors' analysis suggests that this can be prevented through novel but simple adjustments to the ADAM optimization algorithm, which can improve convergence. The paper was published at ICLR 2018.<br />
<br />
== Notation ==<br />
The paper presents the following framework as a generalization to all training algorithms, allowing us to fully define any specific variant such as AMSGrad or SGD entirely within it:<br />
<br />
[[File:training_algo_framework.png|700px|center]]<br />
<br />
Here <math> x_t </math> denotes the network parameters, which lie in a feasible set <math> \mathcal{F} </math>, and <math> \prod_{\mathcal{F}} (y) </math> is the projection of <math> y </math> onto <math> \mathcal{F} </math>. The expression <math> \prod_{\mathcal{F}, \sqrt{V_t}}</math> is a projection weighted by <math>\sqrt{V_t}</math>: for a positive definite matrix <math>A</math>, <math>\prod_{\mathcal{F}, A}(y) = \arg\min_{x \in \mathcal{F}} \| A^{1/2} (x - y) \|</math>.<br />
<math> \phi_t </math> and <math> \psi_t </math> are functions we will specify later. The former maps the history of gradients to <math> \mathbb{R}^d </math> (giving <math>m_t</math>) and the latter maps the history of gradients to positive semidefinite matrices (giving <math>V_t</math>). Finally, <math> f_t </math> is the loss function at time <math> t </math>. Choosing different <math> \phi_t </math> and <math> \psi_t </math> recovers many different training algorithms under this one roof.<br />
<br />
=== SGD As An Example ===<br />
To recover SGD using this framework we simply select <math> \phi_t (g_1, \dotsc, g_t) = g_t</math>, <math> \psi_t (g_1, \dotsc, g_t) = I </math> and <math>\alpha_t = \alpha / \sqrt{t}</math>. It is easy to see that no transformations are ultimately applied to any of the parameters based on any gradient history other than the most recent from <math> \phi_t </math> and that <math> \psi_t </math> in no way transforms any of the parameters by any specific amount as <math> V_t = I </math> has no impact later on.<br />
<br />
=== ADAGRAD As Another Example ===<br />
<br />
To recover ADAGRAD, we select <math> \phi_t (g_1, \dotsc, g_t) = g_t</math>, <math> \psi_t (g_1, \dotsc, g_t) = \frac{\text{diag}(\sum_{i=1}^{t} g_i^2)}{t} </math>, and <math>\alpha_t = \alpha / \sqrt{t}</math>. Therefore, compared to SGD, ADAGRAD uses a different step size for each parameter, based on the past gradients for that parameter; the effective step size becomes <math> \alpha / \sqrt{\sum_i g_{i,j}^2} </math> for each parameter <math> j </math>. The authors note that this scheme is quite efficient when the gradients are sparse.<br />
<br />
=== ADAM As Another Example ===<br />
Once you are convinced that the recovery of SGD from the generalized framework is correct, you should understand the framework well enough to see why the following setup recovers ADAM. ADAM defines a "learning rate" for every parameter based on the magnitude of that parameter's past gradients (its second-moment estimate), in addition to averaging the gradients themselves (momentum), which is intended to help the learning process.<br />
<br />
In order to do this, we choose <math> \phi_t (g_1, \dotsc, g_t) = (1 - \beta_1) \sum_{i=1}^{t} {\beta_1}^{t - i} g_i </math> and <math> \psi_t (g_1, \dotsc, g_t) = (1 - \beta_2)</math>diag<math>( \sum_{i=1}^{t} {\beta_2}^{t - i} {g_i}^2) </math>, and keep <math>\alpha_t = \alpha / \sqrt{t}</math>. The effective step size for coordinate <math>j \in [d]</math> is then <math>\alpha_t / \sqrt{v_{t,j}}</math>, where <math>v_{t,j}</math> is the <math>j</math>-th diagonal entry of <math>V_t</math>.<br />
<br />
From this, we can see that <math>m_t </math> is the exponentially weighted average of the gradients seen so far (the momentum term). As we update, we scale each parameter's step by dividing by <math> \sqrt{V_t} </math> (for a diagonal matrix this is just one over the square root of the diagonal entry), where <math> V_t </math> holds the exponentially weighted average of each parameter's squared gradients (<math> {g_t}^2 </math>). Thus each parameter has its own scaling by its second-moment estimate. Intuitively, from a physical perspective, if each parameter is a ball rolling around in the optimization landscape, then instead of moving at a fixed velocity the ball can now speed up or slow down depending on whether it is on a steep hill or in a flat trough of the landscape.<br />
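<br />
The three examples can be put side by side in a one-dimensional sketch of the framework, where <math>\phi_t</math> produces <math>m_t</math> and <math>\psi_t</math> produces a scalar <math>v_t</math> in place of the matrix <math>V_t</math>. The function names, the toy loss <math>f(x) = (x - 0.5)^2</math>, and the hyperparameter values are our own assumptions; bias correction is omitted, as in the framework above.<br />

```python
import math

def ema(values, beta):
    """(1 - beta) * sum_{i=1..t} beta^(t - i) * values[i]."""
    t = len(values)
    return (1.0 - beta) * sum(beta ** (t - i) * v
                              for i, v in enumerate(values, start=1))

# Each algorithm is a (phi, psi) pair acting on the gradient history gs.
def sgd_phi(gs): return gs[-1]
def sgd_psi(gs): return 1.0
def adagrad_psi(gs): return sum(g * g for g in gs) / len(gs)
def adam_phi(gs): return ema(gs, 0.9)
def adam_psi(gs): return ema([g * g for g in gs], 0.999)

def run_framework(grad_fn, phi, psi, x0, alpha, steps, lo=-1.0, hi=1.0):
    """Generic adaptive method in one dimension with alpha_t = alpha / sqrt(t)."""
    x, history = x0, []
    for t in range(1, steps + 1):
        history.append(grad_fn(x))
        m_t, v_t = phi(history), psi(history)
        x = x - (alpha / math.sqrt(t)) * m_t / math.sqrt(v_t)
        x = min(hi, max(lo, x))          # projection onto F = [lo, hi]
    return x

def toy_grad(x):
    """Gradient of the toy loss f(x) = (x - 0.5)^2."""
    return 2.0 * (x - 0.5)

x_sgd = run_framework(toy_grad, sgd_phi, sgd_psi, -1.0, 0.5, 200)
x_adagrad = run_framework(toy_grad, sgd_phi, adagrad_psi, -1.0, 0.5, 200)
x_adam = run_framework(toy_grad, adam_phi, adam_psi, -1.0, 0.5, 200)
```

Only the two small functions change between algorithms; the update loop itself is shared, which is the point of the framework.<br />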
<br />
= <math> \Gamma_t </math>, an Interesting Quantity =<br />
Now that we have an idea of what ADAM looks like in this framework, let us now investigate the following:<br />
<br />
<center><math> \Gamma_{t + 1} = \frac{\sqrt{V_{t+1}}}{\alpha_{t+1}} - \frac{\sqrt{V_t}}{\alpha_t} </math></center><br />
<br />
This quantity essentially measures the change of the inverse of the learning rate over time (since the <math>\alpha_t</math> are step sizes). A key observation is that for SGD and ADAGRAD, <math>\Gamma_t \succeq 0</math> for all <math>t \in [T]</math>, which follows directly from their update rules. Looking back at our SGD example, it is not hard to see that this quantity is positive semidefinite, which leads to "non-increasing" learning rates, a desirable property. However, this is not the case for ADAM, which can pose a problem in both theoretical and applied settings: <math> \Gamma_t </math> can be indefinite for <math>t \in [T]</math>, which the original proof assumed it could not be. The proof is very long, so instead we use an example to showcase why this could be an issue.<br />
<br />
Consider the loss function <math> f_t(x) = \begin{cases} <br />
Cx & \text{for } t \text{ mod 3} = 1 \\<br />
-x & \text{otherwise}<br />
\end{cases} </math><br />
<br />
Here <math> C > 2 </math> and <math> \mathcal{F} </math> is <math> [-1,1] </math>. Additionally we choose <math> \beta_1 = 0 </math> and <math> \beta_2 = 1/(1+C^2) </math> and plug these into the framework from before. The gradient sequence is periodic: a gradient of <math>C</math> once, followed by a gradient of -1 twice, in every period of three steps. The optimal solution is <math> x = -1 </math> (from a regret standpoint), but ADAM eventually converges to <math> x = 1 </math>, since <math> \psi_t </math> scales the gradient <math> C </math> down by a factor of almost <math> C </math>, so that it is unable to "overpower" the two -1's.<br />
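<br />
This behaviour can be checked numerically with a small simulation. The sketch below is not from the paper: the values <math>C = 3</math>, <math>\alpha = 0.5</math>, the starting point, and the number of steps are our own choices, and the variable names are hypothetical. It runs ADAM with <math>\beta_1 = 0</math> on the loss above and also tracks the scalar version of <math>\Gamma_t</math>.<br />

```python
import math

def grad(t, C):
    """Gradient of f_t: C when t mod 3 == 1, otherwise -1."""
    return C if t % 3 == 1 else -1.0

def run_adam(C=3.0, alpha=0.5, steps=3000, x0=0.0):
    beta2 = 1.0 / (1.0 + C * C)       # beta_1 = 0, so m_t = g_t
    x, v = x0, 0.0
    prev_inv_lr = 0.0                  # sqrt(v_{t-1}) / alpha_{t-1}
    min_gamma = float("inf")
    total_loss = 0.0
    for t in range(1, steps + 1):
        g = grad(t, C)
        total_loss += g * x            # f_t(x_t) = g_t * x_t (linear losses)
        v = beta2 * v + (1.0 - beta2) * g * g
        alpha_t = alpha / math.sqrt(t)
        inv_lr = math.sqrt(v) / alpha_t
        if t > 1:
            min_gamma = min(min_gamma, inv_lr - prev_inv_lr)
        prev_inv_lr = inv_lr
        x = min(1.0, max(-1.0, x - alpha_t * g / math.sqrt(v)))
    # Loss of the best fixed point x = -1 over the same rounds.
    best_loss = sum(grad(t, C) * (-1.0) for t in range(1, steps + 1))
    return x, min_gamma, (total_loss - best_loss) / steps

x_final, min_gamma, avg_regret = run_adam()
```

In this run the iterate ends at the wrong end of the interval, the smallest observed <math>\Gamma_t</math> is negative, and the average regret stays bounded away from zero, in line with the intuition above.<br />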
<br />
We formalize this intuition in the results below.<br />
<br />
'''Theorem 1.''' There is an online convex optimization problem where ADAM has non-zero average regret. i.e. <math>R_T/T\nrightarrow 0 </math> as <math>T\rightarrow \infty</math>.<br />
<br />
One might think that adding a small constant to the denominator of the update can help avoid this issue, modifying the ADAM update as follows:<br />
<center><math> \hat x_{t+1} = x_t - \alpha_t m_t/\sqrt{V_t + \epsilon \mathbb{I}} </math></center><br />
<br />
The selection of <math>\epsilon</math> appears to be crucial for the performance of the algorithm in practice. However, this work shows that for any constant <math>\epsilon > 0</math>, there exists an online optimization setting where ADAM has non-zero average regret asymptotically.<br />
<br />
'''Theorem 2.''' For any constant <math>\beta_1,\beta_2 \in [0,1)</math> such that <math>\beta_1 < \sqrt{\beta_2}</math>, there is an online convex optimization problem where ADAM has non-zero average regret, i.e. <math>R_T/T\nrightarrow 0 </math> as <math>T\rightarrow \infty</math>.<br />
<br />
The theorem shows that the convergence of the algorithm to the optimal solution will not be improved by momentum or regularization via <math> \varepsilon </math> with constant <math> \beta_1 </math> and <math> \beta_2</math>.<br />
<br />
<br />
'''Theorem 3.''' For any constant <math>\beta_1,\beta_2 \in [0,1)</math> such that <math>\beta_1 < \sqrt{\beta_2}</math>, there is a stochastic convex optimization problem for which ADAM does not converge to the optimal solution. <br />
<br />
Kingma & Ba (2015) mentioned that the analysis of ADAM relies on decreasing <math> \beta_1 </math> over time. Since <math> \beta_2 </math> is the critical parameter, the examples can easily be extended to the case where <math> \beta_1 </math> decreases over time. The paper focuses only on proving non-convergence of ADAM when <math> \beta_1 </math> is constant.<br />
<br />
= AMSGrad as an improvement to ADAM =<br />
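There is a very simple, intuitive fix to ADAM that handles this problem: we normalize by the running maximum of the second-moment estimates seen so far, <math>\hat{v}_t = \max(\hat{v}_{t-1}, v_t)</math>, instead of <math>v_t</math> itself, so that the effective learning rate never increases and <math>\Gamma_t</math> cannot become negative. This is a one-line change to ADAM that gives AMSGrad:<br />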
[[File:AMSGrad_algo.png|700px|center]]<br />
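<br />
The effect of the running maximum can be sketched on the periodic loss from the previous section (gradient <math>C</math> once, then -1 twice). This is a hedged illustration with <math>\beta_1 = 0</math>; the values <math>C = 3</math>, <math>\alpha = 0.5</math>, and the function name <code>run</code> are our own assumptions, not the authors' experimental setup.<br />

```python
import math

def run(use_amsgrad, C=3.0, alpha=0.5, steps=3000, x0=0.0):
    """ADAM (beta_1 = 0) or AMSGrad on the periodic linear losses, F = [-1, 1]."""
    beta2 = 1.0 / (1.0 + C * C)
    x, v, v_hat = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = C if t % 3 == 1 else -1.0
        v = beta2 * v + (1.0 - beta2) * g * g
        # The only difference: AMSGrad normalizes by the largest v seen so far.
        v_hat = max(v_hat, v) if use_amsgrad else v
        x = x - (alpha / math.sqrt(t)) * g / math.sqrt(v_hat)
        x = min(1.0, max(-1.0, x))       # projection onto [-1, 1]
    return x

x_adam = run(use_amsgrad=False)
x_amsgrad = run(use_amsgrad=True)
```

In this sketch ADAM ends near <math>x = 1</math>, while AMSGrad ends near the regret-optimal point <math>x = -1</math>, because the large gradient <math>C</math> is no longer forgotten between its occurrences.<br />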
<br />
Below are some simple plots comparing ADAM and AMSGrad. The first set is from the paper and the second is from an independent attempt to reproduce the experiments. The two somewhat disagree with one another, so take this heuristic improvement with a grain of salt.<br />
<br />
[[File:AMSGrad_vs_adam.png|900px|center]]<br />
<br />
Here is another example of a one-dimensional convex optimization problem where ADAM fails to converge:<br />
<br />
[[File:AMSGrad_vs_adam3.png|900px|center]]<br />
<br />
[[File:AMSGrad_vs_adam2.png|700px|center]]<br />
<br />
= Extension: ADAMNC Algorithm =<br />
<br />
An alternative approach is to use an increasing schedule of <math> \beta_2 </math> in ADAM. Unlike Algorithm 2, this approach does not require changing the structure of ADAM; it instead uses non-constant <math> \beta_1 </math> and <math> \beta_2 </math>. The pseudocode for the algorithm, ADAMNC, is provided in Algorithm 3. With an appropriate selection of <math> \beta_1^t </math> and <math> \beta_2^t </math>, good convergence rates can be achieved.<br />
<br />
[[File:ADAMNC_METHOD.png|500px|center]]<br />
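<br />
To see what an increasing <math>\beta_2</math> schedule does, consider the schedule <math>\beta_2^t = 1 - 1/t</math>. To the best of our reading this is one of the choices discussed in the paper, but treat it here as an assumption. With it, the second-moment recursion becomes a plain running average of the squared gradients, so nothing is forgotten. The function name and the toy gradient sequence are our own.<br />

```python
def adamnc_second_moment(grads):
    """v_t = beta2_t * v_{t-1} + (1 - beta2_t) * g_t^2 with beta2_t = 1 - 1/t."""
    v, out = 0.0, []
    for t, g in enumerate(grads, start=1):
        beta2_t = 1.0 - 1.0 / t
        v = beta2_t * v + (1.0 - beta2_t) * g * g
        out.append(v)
    return out

grads = [3.0, -1.0, -1.0, 3.0, -1.0, -1.0]
v_seq = adamnc_second_moment(grads)
# Running mean of the squared gradients, for comparison.
running_mean = [sum(g * g for g in grads[:t]) / t
                for t in range(1, len(grads) + 1)]
```

The two sequences agree up to floating-point error, which is the ADAGRAD-style averaging from the framework above rather than an exponentially decaying window.<br />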
<br />
= Conclusion =<br />
The authors introduced a framework within which several different training algorithms can be viewed. They used it to recover SGD as well as ADAM. In their recovery of ADAM, the authors investigated the change of the inverse of the learning rate over time and discovered that in certain cases there are convergence issues. They proposed a new heuristic, AMSGrad, to deal with this problem and presented some empirical results suggesting that it may improve on ADAM slightly.<br />
<br />
== Critique ==<br />
The contrived example which serves as the intuition to illustrate the failure of ADAM is not convincing, since we can construct similar failure examples for SGD as well. <br />
Consider the loss function <br />
<br />
<math> f_t(x) = \begin{cases} <br />
-x & \text{for } t \text{ mod 2} = 1 \\<br />
-\frac{1}{2} x^2 & \text{otherwise}<br />
\end{cases} <br />
</math><br />
<br />
where <math> x \in \mathcal{F} = [-a, 1], a \in [1, \sqrt{2}) </math>. The optimal solution is <math>x=1</math>, but starting from an initial point <math>x_{t=0} \le -1</math>, SGD will converge to <math>x = -a</math>.<br />
<br />
The authors also fail to explain why ADAM is popular in experiments and why it works better than other optimizers in certain situations.<br />
<br />
==Implementation == <br />
Keras implementation of AMSGrad : https://gist.github.com/kashif/3eddc3c90e23d84975451f43f6e917da<br />
<br />
= Source =<br />
1. Sashank J. Reddi and Satyen Kale and Sanjiv Kumar. "On the Convergence of Adam and Beyond." International Conference on Learning Representations. 2018</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=On_The_Convergence_Of_ADAM_And_Beyond&diff=36357On The Convergence Of ADAM And Beyond2018-04-20T20:54:27Z<p>Ws2chen: /* Extension: ADAMNC Algorithm */</p>
<hr />
<div>= Introduction =<br />
Stochastic gradient descent (SGD) is currently the dominant method of training deep networks. Variants of SGD that scale the gradients using information from past gradients have been very successful, since the learning rate is adjusted on a per-feature basis, with ADAGRAD being one example. However, ADAGRAD's performance deteriorates when loss functions are nonconvex and gradients are dense. Several variants of ADAGRAD, such as RMSProp, ADAM, ADADELTA, and NADAM, have been proposed; they address the issue by using exponential moving averages of squared past gradients, thereby limiting the update to rely only on the past few gradients. To fix notation, let <math>g_{t,i}</math> denote the gradient of the objective <math>J</math> with respect to parameter <math>\theta_i</math> at step <math>t</math>:<br />
<math><br />
g_{t, i} = \nabla_\theta J( \theta_{t, i} ).<br />
</math><br />
<br />
The per-parameter SGD update is then:<br />
<math><br />
\theta_{t+1, i} = \theta_{t, i} - \eta \cdot g_{t, i}.<br />
</math><br />
<br />
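ADAGRAD instead rescales the step for each parameter. In vectorized form, with <math>G_t</math> the diagonal matrix whose entries are the sums of squared past gradients for each parameter and <math>\odot</math> the element-wise product, the update is:<br />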
<math><br />
\theta_{t+1} = \theta_{t} - \dfrac{\eta}{\sqrt{G_{t} + \epsilon}} \odot g_{t}.<br />
</math><br />
<br />
This paper focuses strictly on the pitfalls in convergence of the ADAM optimizer from a theoretical standpoint and proposes a novel improvement to ADAM called AMSGrad. The paper introduces the idea that ADAM can get "stuck" in its weighted average history, preventing it from converging to an optimal solution. For example, in an experiment there may be a large spike in the gradient during some mini-batches, but since ADAM scales the current update by the exponential moving average of squared past gradients, the effect of the large spike is quickly lost. The authors' analysis suggests that this can be prevented through novel but simple adjustments to the ADAM optimization algorithm, which can improve convergence. The paper was published at ICLR 2018.<br />
<br />
== Notation ==<br />
The paper presents the following framework as a generalization to all training algorithms, allowing us to fully define any specific variant such as AMSGrad or SGD entirely within it:<br />
<br />
[[File:training_algo_framework.png|700px|center]]<br />
<br />
Here <math> x_t </math> denotes the network parameters, which lie in a feasible set <math> \mathcal{F} </math>, and <math> \prod_{\mathcal{F}} (y) </math> is the projection of <math> y </math> onto <math> \mathcal{F} </math>. The expression <math> \prod_{\mathcal{F}, \sqrt{V_t}}</math> is a projection weighted by <math>\sqrt{V_t}</math>: for a positive definite matrix <math>A</math>, <math>\prod_{\mathcal{F}, A}(y) = \arg\min_{x \in \mathcal{F}} \| A^{1/2} (x - y) \|</math>.<br />
<math> \phi_t </math> and <math> \psi_t </math> are functions we will specify later. The former maps the history of gradients to <math> \mathbb{R}^d </math> (giving <math>m_t</math>) and the latter maps the history of gradients to positive semidefinite matrices (giving <math>V_t</math>). Finally, <math> f_t </math> is the loss function at time <math> t </math>. Choosing different <math> \phi_t </math> and <math> \psi_t </math> recovers many different training algorithms under this one roof.<br />
<br />
=== SGD As An Example ===<br />
To recover SGD using this framework we simply select <math> \phi_t (g_1, \dotsc, g_t) = g_t</math>, <math> \psi_t (g_1, \dotsc, g_t) = I </math> and <math>\alpha_t = \alpha / \sqrt{t}</math>. It is easy to see that no transformations are ultimately applied to any of the parameters based on any gradient history other than the most recent from <math> \phi_t </math> and that <math> \psi_t </math> in no way transforms any of the parameters by any specific amount as <math> V_t = I </math> has no impact later on.<br />
<br />
=== ADAGRAD As Another Example ===<br />
<br />
To recover ADAGRAD, we select <math> \phi_t (g_1, \dotsc, g_t) = g_t</math>, <math> \psi_t (g_1, \dotsc, g_t) = \frac{\text{diag}(\sum_{i=1}^{t} g_i^2)}{t} </math>, and <math>\alpha_t = \alpha / \sqrt{t}</math>. Therefore, compared to SGD, ADAGRAD uses a different step size for each parameter, based on the past gradients for that parameter; the effective step size becomes <math> \alpha / \sqrt{\sum_i g_{i,j}^2} </math> for each parameter <math> j </math>. The authors note that this scheme is quite efficient when the gradients are sparse.<br />
<br />
=== ADAM As Another Example ===<br />
Once you are convinced that the recovery of SGD from the generalized framework is correct, you should understand the framework well enough to see why the following setup recovers ADAM. ADAM defines a "learning rate" for every parameter based on the magnitude of that parameter's past gradients (its second-moment estimate), in addition to averaging the gradients themselves (momentum), which is intended to help the learning process.<br />
<br />
In order to do this, we choose <math> \phi_t (g_1, \dotsc, g_t) = (1 - \beta_1) \sum_{i=1}^{t} {\beta_1}^{t - i} g_i </math> and <math> \psi_t (g_1, \dotsc, g_t) = (1 - \beta_2)</math>diag<math>( \sum_{i=1}^{t} {\beta_2}^{t - i} {g_i}^2) </math>, and keep <math>\alpha_t = \alpha / \sqrt{t}</math>. The effective step size for coordinate <math>j \in [d]</math> is then <math>\alpha_t / \sqrt{v_{t,j}}</math>, where <math>v_{t,j}</math> is the <math>j</math>-th diagonal entry of <math>V_t</math>.<br />
<br />
From this, we can see that <math>m_t </math> is the exponentially weighted average of the gradients seen so far (the momentum term). As we update, we scale each parameter's step by dividing by <math> \sqrt{V_t} </math> (for a diagonal matrix this is just one over the square root of the diagonal entry), where <math> V_t </math> holds the exponentially weighted average of each parameter's squared gradients (<math> {g_t}^2 </math>). Thus each parameter has its own scaling by its second-moment estimate. Intuitively, from a physical perspective, if each parameter is a ball rolling around in the optimization landscape, then instead of moving at a fixed velocity the ball can now speed up or slow down depending on whether it is on a steep hill or in a flat trough of the landscape.<br />
<br />
= <math> \Gamma_t </math>, an Interesting Quantity =<br />
Now that we have an idea of what ADAM looks like in this framework, let us now investigate the following:<br />
<br />
<center><math> \Gamma_{t + 1} = \frac{\sqrt{V_{t+1}}}{\alpha_{t+1}} - \frac{\sqrt{V_t}}{\alpha_t} </math></center><br />
<br />
This quantity essentially measures the change of the inverse of the learning rate over time (since the <math>\alpha_t</math> are step sizes). A key observation is that for SGD and ADAGRAD, <math>\Gamma_t \succeq 0</math> for all <math>t \in [T]</math>, which follows directly from their update rules. Looking back at our SGD example, it is not hard to see that this quantity is positive semidefinite, which leads to "non-increasing" learning rates, a desirable property. However, this is not the case for ADAM, which can pose a problem in both theoretical and applied settings: <math> \Gamma_t </math> can be indefinite for <math>t \in [T]</math>, which the original proof assumed it could not be. The proof is very long, so instead we use an example to showcase why this could be an issue.<br />
<br />
Consider the loss function <math> f_t(x) = \begin{cases} <br />
Cx & \text{for } t \text{ mod 3} = 1 \\<br />
-x & \text{otherwise}<br />
\end{cases} </math><br />
<br />
Here <math> C > 2 </math> and <math> \mathcal{F} </math> is <math> [-1,1] </math>. Additionally we choose <math> \beta_1 = 0 </math> and <math> \beta_2 = 1/(1+C^2) </math> and plug these into the framework from before. The gradient sequence is periodic: a gradient of <math>C</math> once, followed by a gradient of -1 twice, in every period of three steps. The optimal solution is <math> x = -1 </math> (from a regret standpoint), but ADAM eventually converges to <math> x = 1 </math>, since <math> \psi_t </math> scales the gradient <math> C </math> down by a factor of almost <math> C </math>, so that it is unable to "overpower" the two -1's.<br />
<br />
We formalize this intuition in the results below.<br />
<br />
'''Theorem 1.''' There is an online convex optimization problem where ADAM has non-zero average regret. i.e. <math>R_T/T\nrightarrow 0 </math> as <math>T\rightarrow \infty</math>.<br />
<br />
One might think that adding a small constant to the denominator of the update can help avoid this issue, modifying the ADAM update as follows:<br />
<center><math> \hat x_{t+1} = x_t - \alpha_t m_t/\sqrt{V_t + \epsilon \mathbb{I}} </math></center><br />
<br />
The selection of <math>\epsilon</math> appears to be crucial for the performance of the algorithm in practice. However, this work shows that for any constant <math>\epsilon > 0</math>, there exists an online optimization setting where ADAM has non-zero average regret asymptotically.<br />
<br />
'''Theorem 2.''' For any constant <math>\beta_1,\beta_2 \in [0,1)</math> such that <math>\beta_1 < \sqrt{\beta_2}</math>, there is an online convex optimization problem where ADAM has non-zero average regret, i.e. <math>R_T/T\nrightarrow 0 </math> as <math>T\rightarrow \infty</math>.<br />
<br />
The theorem shows that the convergence of the algorithm to the optimal solution will not be improved by momentum or regularization via <math> \varepsilon </math> with constant <math> \beta_1 </math> and <math> \beta_2</math>.<br />
<br />
<br />
'''Theorem 3.''' For any constant <math>\beta_1,\beta_2 \in [0,1)</math> such that <math>\beta_1 < \sqrt{\beta_2}</math>, there is a stochastic convex optimization problem for which ADAM does not converge to the optimal solution. <br />
<br />
Kingma & Ba (2015) mentioned that the analysis of ADAM relies on decreasing <math> \beta_1 </math> over time. Since <math> \beta_2 </math> is the critical parameter, the examples can easily be extended to the case where <math> \beta_1 </math> decreases over time. The paper focuses only on proving non-convergence of ADAM when <math> \beta_1 </math> is constant.<br />
<br />
= AMSGrad as an improvement to ADAM =<br />
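There is a very simple, intuitive fix to ADAM that handles this problem: we normalize by the running maximum of the second-moment estimates seen so far, <math>\hat{v}_t = \max(\hat{v}_{t-1}, v_t)</math>, instead of <math>v_t</math> itself, so that the effective learning rate never increases and <math>\Gamma_t</math> cannot become negative. This is a one-line change to ADAM that gives AMSGrad:<br />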
[[File:AMSGrad_algo.png|700px|center]]<br />
<br />
Below are some simple plots comparing ADAM and AMSGrad: the first are from the paper, and the second are from another individual who attempted to recreate the experiments. The two sets of plots somewhat disagree with one another, so take this heuristic improvement with a grain of salt.<br />
<br />
[[File:AMSGrad_vs_adam.png|900px|center]]<br />
<br />
Here is another example of a one-dimensional convex optimization problem where ADAM fails to converge:<br />
<br />
[[File:AMSGrad_vs_adam3.png|900px|center]]<br />
<br />
[[File:AMSGrad_vs_adam2.png|700px|center]]<br />
<br />
= Extension: ADAMNC Algorithm =<br />
<br />
An alternative approach is to use an increasing schedule of <math> \beta_2 </math> in ADAM. This approach, unlike Algorithm 2, does not require changing the structure of ADAM but rather uses a non-constant <math> \beta_1 </math> and <math> \beta_2 </math>. The pseudocode for the algorithm, ADAMNC, is provided in Algorithm 3. The paper shows that by appropriate selection of <math> \beta_1^t </math> and <math> \beta_2^t </math>, good convergence rates can be achieved.<br />
<br />
[[File:ADAMNC_METHOD.png|700px|center]]<br />
<br />
= Conclusion =<br />
The authors have introduced a framework in which several different training algorithms can be viewed. From there they used it to recover SGD as well as ADAM. In their recovery of ADAM, the authors investigated the change of the inverse of the learning rate over time and discovered that in certain cases there are convergence issues. They proposed a new heuristic, AMSGrad, to help deal with this problem and presented some empirical results showing that it may improve on ADAM slightly. Thanks for your time.<br />
<br />
== Critique ==<br />
The contrived example which serves as the intuition to illustrate the failure of ADAM is not convincing, since we can construct similar failure examples for SGD as well. <br />
Consider the loss function <br />
<br />
<math> f_t(x) = \begin{cases} <br />
-x & \text{for } t \text{ mod 2} = 1 \\<br />
-\frac{1}{2} x^2 & \text{otherwise}<br />
\end{cases} <br />
</math><br />
<br />
where <math> x \in \mathcal{F} = [-a, 1], a \in [1, \sqrt{2}) </math>. The optimal solution is <math>x=1</math>, but starting from an initial point <math>x_{t=0} \le -1</math>, SGD will converge to <math>x = -a</math>.<br />
<br />
The authors also fail to explain why ADAM is popular in experiments and why it works better than other optimizers in certain situations.<br />
<br />
==Implementation == <br />
Keras implementation of AMSGrad : https://gist.github.com/kashif/3eddc3c90e23d84975451f43f6e917da<br />
<br />
= Source =<br />
1. Sashank J. Reddi and Satyen Kale and Sanjiv Kumar. "On the Convergence of Adam and Beyond." International Conference on Learning Representations. 2018</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:ADAMNC_METHOD.png&diff=36356File:ADAMNC METHOD.png2018-04-20T20:53:57Z<p>Ws2chen: </p>
<hr />
<div></div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=On_The_Convergence_Of_ADAM_And_Beyond&diff=36354On The Convergence Of ADAM And Beyond2018-04-20T20:52:35Z<p>Ws2chen: /* Extension: ADAMNC Algorithm */</p>
<hr />
<div>= Introduction =<br />
Stochastic gradient descent (SGD) is currently the dominant method of training deep networks. Variants of SGD that scale the gradients using information from past gradients have been very successful, since the learning rate is adjusted on a per-feature basis, with ADAGRAD being one example. However, ADAGRAD performance deteriorates when loss functions are nonconvex and gradients are dense. Several variants of ADAGRAD, such as RMSProp, ADAM, ADADELTA, and NADAM, have been proposed, which address the issue by using exponential moving averages of squared past gradients, thereby limiting the update to only rely on the past few gradients. The following formula shows the per-parameter gradient, which is then vectorized:<br />
<math><br />
g_{t, i} = \nabla_\theta J( \theta_{t, i} ).<br />
</math><br />
<br />
The per-parameter update using SGD then becomes:<br />
<math><br />
\theta_{t+1, i} = \theta_{t, i} - \eta \cdot g_{t, i}.<br />
</math><br />
<br />
For ADAGRAD, the vectorized update for the next step uses an element-wise product, where <math>G_t</math> accumulates the squared past gradients of each parameter:<br />
<math><br />
\theta_{t+1} = \theta_{t} - \dfrac{\eta}{\sqrt{G_{t} + \epsilon}} \odot g_{t}.<br />
</math><br />
<br />
This paper focuses strictly on the pitfalls in convergence of the ADAM optimizer from a theoretical standpoint and proposes a novel improvement to ADAM called AMSGrad. The paper introduces the idea that it is possible for ADAM to get "stuck" in its weighted average history, preventing it from converging to an optimal solution. For example, in an experiment there may be a large spike in the gradient during some mini-batches. But since ADAM weighs the current update by the exponential moving averages of squared past gradients, the effect of the large spike in gradient is lost. To tackle these issues, several variants of ADAGRAD have been proposed. The authors' analysis suggests that this can be prevented through novel but simple adjustments to the ADAM optimization algorithm, which can improve convergence. This paper was published at ICLR 2018.<br />
<br />
== Notation ==<br />
The paper presents the following framework as a generalization to all training algorithms, allowing us to fully define any specific variant such as AMSGrad or SGD entirely within it:<br />
<br />
[[File:training_algo_framework.png|700px|center]]<br />
<br />
Where we have <math> x_t </math> as our network parameters defined within a vector space <math> \mathcal{F} </math>. <math> \prod_{\mathcal{F}} (y) = </math> the projection of <math> y </math> on to the set <math> \mathcal{F} </math>. It should be noted that the <math>\sqrt{V_t}</math> in the expression <math> \prod_{\mathcal{F}, \sqrt{V_t}}</math> has been included as a typo.<br />
<math> \phi_t </math> and <math> \psi_t </math> correspond to arbitrary functions we will provide later. The former maps from the history of gradients to <math> \mathbb{R}^d </math> and the latter maps from the history of the gradients to positive semi-definite matrices. Finally, <math> f_t </math> is our loss function at some time <math> t </math>; the rest should be self-explanatory. Using this framework and defining different <math> \phi_t </math> and <math> \psi_t </math> will allow us to recover many different training algorithms under this one roof.<br />
<br />
=== SGD As An Example ===<br />
To recover SGD using this framework we simply select <math> \phi_t (g_1, \dotsc, g_t) = g_t</math>, <math> \psi_t (g_1, \dotsc, g_t) = I </math> and <math>\alpha_t = \alpha / \sqrt{t}</math>. It is easy to see that no transformations are ultimately applied to any of the parameters based on any gradient history other than the most recent from <math> \phi_t </math> and that <math> \psi_t </math> in no way transforms any of the parameters by any specific amount as <math> V_t = I </math> has no impact later on.<br />
<br />
=== ADAGRAD As Another Example ===<br />
<br />
To recover ADAGRAD, we select <math> \phi_t (g_1, \dotsc, g_t) = g_t</math>, <math> \psi_t (g_1, \dotsc, g_t) = \frac{\sum_{i=1}^{t} g_i^2}{t} </math>, and <math>\alpha_t = \alpha / \sqrt{t}</math>. Therefore, compared to SGD, ADAGRAD uses a different step size for each parameter, based on the past gradients for that parameter; the learning rate becomes <math> \alpha_t = \alpha / \sqrt{\sum_i g_{i,j}^2} </math> for each parameter <math> j </math>. The authors note that this scheme is quite efficient when the gradients are sparse.<br />
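The two recoveries above can be sketched in a few lines of Python. This is a minimal scalar illustration, not the paper's pseudocode: the function name <code>generic_adaptive</code>, the test loss <math>f(x) = (x - 0.5)^2</math>, the feasible set <math>[-1, 1]</math> and the value <math>\alpha = 0.1</math> are all assumptions chosen for the example. In one dimension the projection reduces to clipping.

```python
import math

def generic_adaptive(grad_fn, phi, psi, x0, alpha, steps, lo=-1.0, hi=1.0):
    """Scalar sketch of the generic framework:
    x_{t+1} = clip(x_t - alpha_t * m_t / sqrt(V_t)), with alpha_t = alpha / sqrt(t)."""
    x, history = x0, []
    for t in range(1, steps + 1):
        history.append(grad_fn(x, t))      # g_t
        m_t = phi(history)                 # averaging function phi_t
        v_t = psi(history)                 # scaling function psi_t
        x = x - (alpha / math.sqrt(t)) * m_t / math.sqrt(v_t)
        x = min(hi, max(lo, x))            # projection onto [lo, hi]
    return x

# SGD: phi_t = g_t, psi_t = I (here the scalar 1)
sgd_phi = lambda gs: gs[-1]
sgd_psi = lambda gs: 1.0

# ADAGRAD: phi_t = g_t, psi_t = (sum of squared gradients) / t
adagrad_phi = lambda gs: gs[-1]
adagrad_psi = lambda gs: sum(g * g for g in gs) / len(gs)

# Hypothetical test problem: f(x) = (x - 0.5)^2, minimised at x = 0.5
quad_grad = lambda x, t: 2.0 * (x - 0.5)

x_sgd = generic_adaptive(quad_grad, sgd_phi, sgd_psi, x0=-1.0, alpha=0.1, steps=2000)
x_ada = generic_adaptive(quad_grad, adagrad_phi, adagrad_psi, x0=-1.0, alpha=0.1, steps=2000)
```

Only <code>phi</code> and <code>psi</code> change between the two optimizers; the update loop itself is shared, which is the point of the framework.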
<br />
=== ADAM As Another Example ===<br />
Once you can convince yourself that the recovery of SGD from the generalized framework is correct, you should understand the framework enough to see why the following setup for ADAM will allow us to recover the behaviour we want. ADAM has the ability to define a "learning rate" for every parameter based on how much that parameter moves over time (a.k.a its momentum) supposedly to help with the learning process.<br />
<br />
In order to do this, we will choose <math> \phi_t (g_1, \dotsc, g_t) = (1 - \beta_1) \sum_{i=1}^{t} {\beta_1}^{t - i} g_i </math>, <math> \psi_t (g_1, \dotsc, g_t) = (1 - \beta_2)</math>diag<math>( \sum_{i=1}^{t} {\beta_2}^{t - i} {g_i}^2) </math>, and keep <math>\alpha_t = \alpha / \sqrt{t}</math>. This setup gives each coordinate <math>j \in [d]</math> an effective step size of <math>\alpha_t / \sqrt{v_{t,j}}</math>, where <math>v_{t,j}</math> is the <math>j</math>-th diagonal entry of <math>V_t</math>.<br />
<br />
From this, we can now see that <math>m_t </math> gets filled up with the exponentially weighted average of the history of our gradients that we have come across so far in the algorithm. And that as we proceed to update we scale each one of our parameters by dividing out <math> V_t </math> (in the case of diagonal it is just one over the diagonal entry) which contains the exponentially weighted average of each parameter's momentum (<math> {g_t}^2 </math>) across our training so far in the algorithm. Thus each parameter has its own unique scaling by its second moment or momentum. Intuitively, from a physical perspective, if each parameter is a ball rolling around in the optimization landscape what we are now doing is instead of having the ball change positions on the landscape at a fixed velocity (i.e. momentum of 0) the ball now has the ability to accelerate and speed up or slow down if it is on a steep hill or flat trough in the landscape (i.e. a momentum that can change with time).<br />
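As a small check of the description above, the following sketch compares the usual recursive form of ADAM's moving averages with the explicit weighted sums (summing over <math>i = 1, \dotsc, t</math> with <math>g_i</math> inside the sum). The function names and the sample gradients are made up for illustration, and bias correction is omitted, as it is in the framework above.

```python
def adam_moments_recursive(grads, beta1=0.9, beta2=0.999):
    """m_t and v_t via the running updates used in practice (no bias correction)."""
    m, v = 0.0, 0.0
    for g in grads:
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
    return m, v

def adam_moments_explicit(grads, beta1=0.9, beta2=0.999):
    """The same quantities written as exponentially weighted sums over the history."""
    t = len(grads)
    m = (1.0 - beta1) * sum(beta1 ** (t - i) * g for i, g in enumerate(grads, start=1))
    v = (1.0 - beta2) * sum(beta2 ** (t - i) * g * g for i, g in enumerate(grads, start=1))
    return m, v

sample_grads = [0.5, -1.2, 3.0, 0.1, -0.7]   # arbitrary illustrative gradients
m_rec, v_rec = adam_moments_recursive(sample_grads)
m_exp, v_exp = adam_moments_explicit(sample_grads)
```

Both forms agree up to floating-point error; the recursive form is the one used in practice because it needs only constant memory.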
<br />
= <math> \Gamma_t </math>, an Interesting Quantity =<br />
Now that we have an idea of what ADAM looks like in this framework, let us now investigate the following:<br />
<br />
<center><math> \Gamma_{t + 1} = \frac{\sqrt{V_{t+1}}}{\alpha_{t+1}} - \frac{\sqrt{V_t}}{\alpha_t} </math></center><br />
<br />
This quantity essentially measures the change of the "inverse of the learning rate" across time (since we are using the alphas as step sizes). A key observation is that for SGD and ADAGRAD, <math>\Gamma_t \succeq 0</math> for all <math>t \in [T]</math>, which simply follows from the update rules of SGD and ADAGRAD. Looking back at our example of SGD, it is not hard to see that this quantity is positive semidefinite, which leads to "non-increasing" learning rates, a desired property. However, that is not the case with ADAM, which can pose a problem in both theoretical and applied settings. The problem ADAM can face is that <math> \Gamma_t </math> can potentially be indefinite for <math>t \in [T]</math>, which the original proof assumed it could not be. The math for this proof is very long, so instead we will opt for an example to showcase why this could be an issue.<br />
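To make the sign of <math>\Gamma_t</math> concrete, here is a scalar sketch that evaluates it on an illustrative gradient sequence. The sequence (one gradient of 3 followed by two gradients of -1, repeated), the value <math>\beta_2 = 0.1</math> and <math>\alpha = 0.1</math> are assumptions made for this example only.

```python
import math

def gammas(grads, psi, alpha=0.1):
    """Gamma_{t+1} = sqrt(V_{t+1})/alpha_{t+1} - sqrt(V_t)/alpha_t, with alpha_t = alpha/sqrt(t)."""
    inv_lr = []
    for t in range(1, len(grads) + 1):
        v_t = psi(grads[:t])
        inv_lr.append(math.sqrt(v_t) * math.sqrt(t) / alpha)
    return [b - a for a, b in zip(inv_lr, inv_lr[1:])]

def adagrad_psi(gs):
    # average of squared gradients, as in the ADAGRAD recovery
    return sum(g * g for g in gs) / len(gs)

def adam_psi(gs, beta2=0.1):
    # exponentially weighted sum of squared gradients, as in the ADAM recovery
    t = len(gs)
    return (1.0 - beta2) * sum(beta2 ** (t - i) * g * g for i, g in enumerate(gs, start=1))

grads = [3.0, -1.0, -1.0] * 10               # illustrative periodic gradients
gamma_adagrad = gammas(grads, adagrad_psi)
gamma_adam = gammas(grads, adam_psi)
```

For ADAGRAD every entry is non-negative, because <math>\sqrt{V_t}/\alpha_t</math> is proportional to the square root of the running sum of squared gradients. For ADAM the entry right after each large gradient is negative: the moving average forgets the spike quickly, so the effective learning rate increases again.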
<br />
Consider the loss function <math> f_t(x) = \begin{cases} <br />
Cx & \text{for } t \text{ mod 3} = 1 \\<br />
-x & \text{otherwise}<br />
\end{cases} </math><br />
<br />
Where we have <math> C > 2 </math> and <math> \mathcal{F} </math> is <math> [-1,1] </math>. Additionally we choose <math> \beta_1 = 0 </math> and <math> \beta_2 = 1/(1+C^2) </math>. We then proceed to plug this into our framework from before. This function is periodic and it's easy to see that it has the gradient of C once and then a gradient of -1 twice every period. It has an optimal solution of <math> x = -1 </math> (from a regret standpoint), but using ADAM we would eventually converge at <math> x = 1 </math>, since <math> \psi_t </math> would scale down the <math> C </math> by a factor of almost <math> C </math> so that it's unable to "overpower" the multiple -1's.<br />
<br />
We formalize this intuition in the results below.<br />
<br />
'''Theorem 1.''' There is an online convex optimization problem where ADAM has non-zero average regret. i.e. <math>R_T/T\nrightarrow 0 </math> as <math>T\rightarrow \infty</math>.<br />
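The example above can be simulated directly. The sketch below is only an illustration under assumed values (<math>C = 3</math>, <math>\alpha = 0.1</math>, starting point <math>x = 0</math>, 30000 steps) and uses ADAM without bias correction, with <math>\beta_1 = 0</math> so that <math>m_t = g_t</math>.

```python
import math

def adam_average_regret(C=3.0, steps=30000, alpha=0.1):
    """Run ADAM (beta_1 = 0, beta_2 = 1/(1+C^2)) on f_t(x) = g_t * x over [-1, 1]."""
    beta2 = 1.0 / (1.0 + C * C)
    x, v = 0.0, 0.0
    total_loss, grad_sum = 0.0, 0.0
    for t in range(1, steps + 1):
        g = C if t % 3 == 1 else -1.0          # gradient of the linear loss f_t
        total_loss += g * x                    # loss suffered at the current iterate
        grad_sum += g
        v = beta2 * v + (1.0 - beta2) * g * g  # second-moment estimate
        x -= (alpha / math.sqrt(t)) * g / math.sqrt(v)
        x = min(1.0, max(-1.0, x))             # projection onto [-1, 1]
    best_fixed = -abs(grad_sum)                # min over x in [-1, 1] of grad_sum * x
    return (total_loss - best_fixed) / steps, x

avg_regret, x_final = adam_average_regret()
```

In this run the iterate settles near <math>x = 1</math> while the best fixed point is <math>x = -1</math>. Each period of three steps then costs about <math>2(C - 2)</math> more than the best fixed point, so the average regret stays bounded away from zero instead of vanishing.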
<br />
One might think that adding a small constant in the denominator of the update function can help avoid this issue by modifying the update for ADAM as follows:<br />
\begin{align}<br />
\hat x_{t+1} = x_t - \alpha_t m_t/\sqrt{V_t + \epsilon \mathbb{I}}<br />
\end{align}<br />
<br />
The selection of <math>\epsilon</math> appears to be crucial for the performance of the algorithm in practice. However, this work shows that for any constant <math>\epsilon > 0</math>, there exists an online optimization setting where ADAM has non-zero average regret asymptotically.<br />
<br />
'''Theorem 2.''' For any constant <math>\beta_1,\beta_2 \in [0,1)</math> such that <math>\beta_1 < \sqrt{\beta_2}</math>, there is an online convex optimization problem where ADAM has non-zero average regret, i.e. <math>R_T/T\nrightarrow 0 </math> as <math>T\rightarrow \infty</math>.<br />
<br />
The theorem shows that the convergence of the algorithm to the optimal solution will not be improved by momentum or regularization via <math> \varepsilon </math> with constant <math> \beta_1 </math> and <math> \beta_2</math>.<br />
<br />
<br />
'''Theorem 3.''' For any constant <math>\beta_1,\beta_2 \in [0,1)</math> such that <math>\beta_1 < \sqrt{\beta_2}</math>, there is a stochastic convex optimization problem for which ADAM does not converge to the optimal solution. <br />
<br />
Kingma & Ba (2015) mentioned that the analysis of ADAM relies on decreasing <math> \beta_1 </math> over time. As <math> \beta_2 </math> is the critical parameter, the examples could be easily extended to the case where <math> \beta_1 </math> is decreasing over time. The paper only focuses on proving non-convergence of ADAM when <math> \beta_1 </math> is constant.<br />
<br />
= AMSGrad as an improvement to ADAM =<br />
There is a simple, intuitive fix to ADAM for this problem: normalize by the maximum of the historical weighted averages of squared gradients seen so far, which avoids the negative sign problem. This is a one-line adaptation of ADAM that yields AMSGrad:<br />
[[File:AMSGrad_algo.png|700px|center]]<br />
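The one-line change can be sketched as follows, reusing the periodic linear losses <math>f_t(x) = Cx</math> (when <math>t \bmod 3 = 1</math>) and <math>f_t(x) = -x</math> (otherwise) on <math>[-1, 1]</math>. The concrete values (<math>C = 3</math>, <math>\alpha = 0.1</math>, start at <math>x = 0</math>, 30000 steps) and the function name are assumptions for illustration, and <math>\beta_1 = 0</math> is used so that only the second-moment handling differs.

```python
import math

def run_optimizer(amsgrad, C=3.0, steps=30000, alpha=0.1):
    """ADAM or AMSGrad (beta_1 = 0) on the periodic linear losses over [-1, 1]."""
    beta2 = 1.0 / (1.0 + C * C)
    x, v, v_hat = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = C if t % 3 == 1 else -1.0
        v = beta2 * v + (1.0 - beta2) * g * g
        # The only difference: AMSGrad keeps the running maximum of v.
        v_hat = max(v_hat, v) if amsgrad else v
        x -= (alpha / math.sqrt(t)) * g / math.sqrt(v_hat)
        x = min(1.0, max(-1.0, x))             # projection onto [-1, 1]
    return x

x_adam = run_optimizer(amsgrad=False)
x_amsgrad = run_optimizer(amsgrad=True)
```

Because the maximum never decreases, the large gradient keeps its influence on the normalization, so the two small gradients that follow can no longer outweigh it. In this run AMSGrad ends near the optimum <math>x = -1</math>, while ADAM ends near <math>x = 1</math>.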
<br />
Below are some simple plots comparing ADAM and AMSGrad: the first are from the paper, and the second are from another individual who attempted to recreate the experiments. The two sets of plots somewhat disagree with one another, so take this heuristic improvement with a grain of salt.<br />
<br />
[[File:AMSGrad_vs_adam.png|900px|center]]<br />
<br />
Here is another example of a one-dimensional convex optimization problem where ADAM fails to converge:<br />
<br />
[[File:AMSGrad_vs_adam3.png|900px|center]]<br />
<br />
[[File:AMSGrad_vs_adam2.png|700px|center]]<br />
<br />
= Extension: ADAMNC Algorithm =<br />
<br />
An alternative approach is to use an increasing schedule of <math> \beta_2 </math> in ADAM. This approach, unlike Algorithm 2, does not require changing the structure of ADAM but rather uses a non-constant <math> \beta_1 </math> and <math> \beta_2 </math>. The pseudocode for the algorithm, ADAMNC, is provided in Algorithm 3. The paper shows that by appropriate selection of <math> \beta_1^t </math> and <math> \beta_2^t </math>, good convergence rates can be achieved.<br />
<br />
[[File:ADAMNC METHOD.png|700px|center]]<br />
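As one illustration of an increasing schedule, the sketch below uses <math>\beta_2^t = 1 - 1/t</math>. This particular schedule is an assumption chosen for the example; with it, the second-moment estimate becomes the plain running mean of the squared gradients, which is the ADAGRAD-style scaling from the framework section.

```python
def adamnc_second_moment(grads):
    """v_t = beta2_t * v_{t-1} + (1 - beta2_t) * g_t^2 with beta2_t = 1 - 1/t."""
    v, out = 0.0, []
    for t, g in enumerate(grads, start=1):
        beta2_t = 1.0 - 1.0 / t
        v = beta2_t * v + (1.0 - beta2_t) * g * g
        out.append(v)
    return out

grads = [3.0, -1.0, -1.0, 3.0, -1.0, -1.0]   # arbitrary illustrative gradients
v_seq = adamnc_second_moment(grads)
running_mean = [sum(g * g for g in grads[:t]) / t for t in range(1, len(grads) + 1)]
```

With <math>\alpha_t = \alpha/\sqrt{t}</math>, the quantity <math>\sqrt{v_t}/\alpha_t</math> is then proportional to the square root of the running sum of squared gradients, which never decreases, so the effective learning rate is non-increasing.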
<br />
= Conclusion =<br />
The authors have introduced a framework in which several different training algorithms can be viewed. From there they used it to recover SGD as well as ADAM. In their recovery of ADAM, the authors investigated the change of the inverse of the learning rate over time and discovered that in certain cases there are convergence issues. They proposed a new heuristic, AMSGrad, to help deal with this problem and presented some empirical results showing that it may improve on ADAM slightly. Thanks for your time.<br />
<br />
== Critique ==<br />
The contrived example which serves as the intuition to illustrate the failure of ADAM is not convincing, since we can construct similar failure examples for SGD as well. <br />
Consider the loss function <br />
<br />
<math> f_t(x) = \begin{cases} <br />
-x & \text{for } t \text{ mod 2} = 1 \\<br />
-\frac{1}{2} x^2 & \text{otherwise}<br />
\end{cases} <br />
</math><br />
<br />
where <math> x \in \mathcal{F} = [-a, 1], a \in [1, \sqrt{2}) </math>. The optimal solution is <math>x=1</math>, but starting from an initial point <math>x_{t=0} \le -1</math>, SGD will converge to <math>x = -a</math>.<br />
<br />
The authors also fail to explain why ADAM is popular in experiments and why it works better than other optimizers in certain situations.<br />
<br />
==Implementation == <br />
Keras implementation of AMSGrad : https://gist.github.com/kashif/3eddc3c90e23d84975451f43f6e917da<br />
<br />
= Source =<br />
1. Sashank J. Reddi and Satyen Kale and Sanjiv Kumar. "On the Convergence of Adam and Beyond." International Conference on Learning Representations. 2018</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=MarrNet:_3D_Shape_Reconstruction_via_2.5D_Sketches&diff=36345MarrNet: 3D Shape Reconstruction via 2.5D Sketches2018-04-20T20:39:55Z<p>Ws2chen: /* Method */</p>
<hr />
<div>= Introduction =<br />
Humans are able to quickly recognize 3D shapes from images, even in spite of drastic differences in object texture, material, lighting, and background.<br />
<br />
[[File:marrnet_intro_image.png|700px|thumb|center|Objects in real images. The appearance of the same shaped object varies based on colour, texture, lighting, background, etc. However, the 2.5D sketches (e.g. depth or normal maps) of the object remain constant, and can be seen as an abstraction of the object which is used to reconstruct the 3D shape.]]<br />
<br />
In this work, the authors propose a novel end-to-end trainable model that sequentially estimates 2.5D sketches and 3D object shape from images and also enforces re-projection consistency between the 3D shape and the estimated sketch. 2.5D is the construction of a 3D environment using 2D retina projection along with depth perception obtained from the image. The two-step approach makes the network more robust to differences in object texture, material, lighting and background. Based on the idea from [Marr, 1982] that human 3D perception relies on recovering 2.5D sketches, which include depth maps (which contain information related to the distance of surfaces from a viewpoint) and surface normal maps (a technique for adding the illusion of depth details to surfaces using an image's RGB information), the authors design an end-to-end trainable pipeline which they call MarrNet. MarrNet first estimates depth, normal maps, and silhouette, followed by a 3D shape. MarrNet uses an encoder-decoder structure for the sub-components of the framework. <br />
<br />
The authors claim several unique advantages to their method. Single image 3D reconstruction is a highly under-constrained problem, requiring strong prior knowledge of object shapes. As well, accurate 3D object annotations using real images are not common, and many previous approaches rely on purely synthetic data. However, most of these methods suffer from domain adaptation due to imperfect rendering.<br />
<br />
Using 2.5D sketches can alleviate the challenges of domain transfer. It is straightforward to generate perfect object surface normals and depths using a graphics engine. Since 2.5D sketches contain only depth, surface normal, and silhouette information, the second step of recovering 3D shape can be trained purely from synthetic data. As well, the introduction of differentiable constraints between 2.5D sketches and 3D shape makes it possible to fine-tune the system, even without any annotations.<br />
<br />
The framework is evaluated on both synthetic objects from ShapeNet, and real images from PASCAL 3D+, showing good qualitative and quantitative performance in 3D shape reconstruction.<br />
<br />
= Related Work =<br />
<br />
== 2.5D Sketch Recovery ==<br />
Researchers have explored recovering 2.5D information from shading, texture, and colour images in the past. More recently, the development of depth sensors has led to the creation of large RGB-D datasets, and papers on estimating depth, surface normals, and other intrinsic images using deep networks. While this method employs 2.5D estimation, the final output is a full 3D shape of an object.<br />
<br />
[[File:2-5d_example.PNG|700px|thumb|center|Results from the paper: Learning Non-Lambertian Object Intrinsics across ShapeNet Categories. The results show that neural networks can be trained to recover 2.5D information from an image. The top row predicts the albedo and the bottom row predicts the shading. It can be observed that the results are still blurry and the fine details are not fully recovered.]]<br />
<br />
=== Notes: 2.5D === <br />
<br />
Two and a half dimensional (shortened to 2.5D, known alternatively as three-quarter perspective and pseudo-3D) is a term used to describe either 2D graphical projections and similar techniques used to cause images to simulate the appearance of being three-dimensional (3D) when in fact they are not, or gameplay in an otherwise three-dimensional video game that is restricted to a two-dimensional plane or has a virtual camera with fixed angle.<br />
<br />
== Single Image 3D Reconstruction ==<br />
The development of large-scale shape repositories like ShapeNet has allowed for the development of models encoding shape priors for single image 3D reconstruction. These methods normally regress voxelized 3D shapes, relying on synthetic data or 2D masks for training. A voxel is an abbreviation for volume element, the three-dimensional version of a pixel. The formulation in the paper tackles domain adaptation better, since the network can be fine-tuned on images without any annotations.<br />
<br />
== 2D-3D Consistency ==<br />
Intuitively, the 3D shape can be constrained to be consistent with 2D observations. This idea has been explored for decades, and has been widely used in 3D shape completion with the use of depths and silhouettes. A few recent papers [5,6,7,8] discussed enforcing differentiable 2D-3D constraints between shape and silhouettes to enable joint training of deep networks for the task of 3D reconstruction. In this work, this idea is exploited to develop differentiable constraints for consistency between the 2.5D sketches and 3D shape.<br />
<br />
= Approach =<br />
The 3D structure is recovered from a single RGB view using three steps, shown in the figure below. The first step estimates 2.5D sketches, including depth, surface normal, and silhouette of the object. The second step estimates a 3D voxel representation of the object. The third step uses a reprojection consistency function to enforce the 2.5D sketch and 3D structure alignment.<br />
<br />
[[File:marrnet_model_components.png|700px|thumb|center|MarrNet architecture. 2.5D sketches of normals, depths, and silhouette are first estimated. The sketches are then used to estimate the 3D shape. Finally, re-projection consistency is used to ensure consistency between the sketch and 3D output.]]<br />
<br />
== 2.5D Sketch Estimation ==<br />
The first step takes a 2D RGB image and predicts the 2.5D sketch with surface normal, depth, and silhouette of the object. The goal is to estimate intrinsic object properties from the image, while discarding non-essential information such as texture and lighting. An encoder-decoder architecture is used. The encoder is a ResNet-18 network, which takes a 256 x 256 RGB image and produces 512 feature maps of size 8 x 8. The decoder is four sets of 5 x 5 fully convolutional and ReLU layers, followed by four sets of 1 x 1 convolutional and ReLU layers. The output is 256 x 256 resolution depth, surface normal, and silhouette images.<br />
<br />
== 3D Shape Estimation ==<br />
The second step estimates a voxelized 3D shape using the 2.5D sketches from the first step. The focus here is for the network to learn the shape prior that can explain the input well, and can be trained on synthetic data without suffering from the domain adaptation problem since it only takes in surface normal and depth images as input. The network architecture is inspired by the TL[10] network, and 3D-VAE-GAN, with an encoder-decoder structure. The normal and depth image, masked by the estimated silhouette, are passed into 5 sets of convolutional, ReLU, and pooling layers, followed by two fully connected layers, with a final output width of 200. The 200-dimensional vector is passed into a decoder of 5 fully convolutional and ReLU layers, outputting a 128 x 128 x 128 voxelized estimate of the input.<br />
<br />
== Re-projection Consistency ==<br />
The third step consists of a depth re-projection loss and surface normal re-projection loss. Here, <math>v_{x, y, z}</math> represents the value at position <math>(x, y, z)</math> in a 3D voxel grid, with <math>v_{x, y, z} \in [0, 1] ∀ x, y, z</math>. <math>d_{x, y}</math> denotes the estimated depth at position <math>(x, y)</math>, <math>n_{x, y} = (n_a, n_b, n_c)</math> denotes the estimated surface normal. Orthographic projection is used.<br />
<br />
[[File:marrnet_reprojection_consistency.png|700px|thumb|center|Reprojection consistency for voxels. Left and middle: criteria for depth and silhouettes. Right: criterion for surface normals]]<br />
<br />
=== Depths ===<br />
The voxel at the estimated depth, <math>v_{x, y, d_{x, y}}</math>, should be 1, while all voxels in front of it should be 0. This ensures the estimated 3D shape matches the estimated depth values. The projected depth loss and its gradient are defined as follows:<br />
<br />
<math><br />
L_{depth}(x, y, z)=<br />
\left\{<br />
\begin{array}{ll}<br />
v^2_{x, y, z}, & z < d_{x, y} \\<br />
(1 - v_{x, y, z})^2, & z = d_{x, y} \\<br />
0, & z > d_{x, y} \\<br />
\end{array}<br />
\right.<br />
</math><br />
<br />
<math><br />
\frac{∂L_{depth}(x, y, z)}{∂v_{x, y, z}} =<br />
\left\{<br />
\begin{array}{ll}<br />
2v_{x, y, z}, & z < d_{x, y} \\<br />
2(v_{x, y, z} - 1), & z = d_{x, y} \\<br />
0, & z > d_{x, y} \\<br />
\end{array}<br />
\right.<br />
</math><br />
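The depth criterion and its gradient can be sketched for a single line of sight as follows. The function name, the list-based voxel layout and the sample values are assumptions made for illustration; the paper's implementation operates on full voxel grids.

```python
import math

def depth_loss_and_grad(ray, d):
    """ray[z] is the voxel value v_{x,y,z} along one line of sight (z = 0 is nearest);
    d is the estimated depth index d_{x,y}, or math.inf if the line misses the object."""
    loss, grad = 0.0, [0.0] * len(ray)
    for z, v in enumerate(ray):
        if z < d:                  # voxels in front of the surface should be empty
            loss += v * v
            grad[z] = 2.0 * v
        elif z == d:               # the voxel at the estimated depth should be filled
            loss += (1.0 - v) ** 2
            grad[z] = 2.0 * (v - 1.0)
        # voxels behind the surface (z > d) are unconstrained
    return loss, grad

ray = [0.5, 0.0, 0.25, 1.0]
loss_hit, grad_hit = depth_loss_and_grad(ray, d=2)
loss_miss, grad_miss = depth_loss_and_grad(ray, d=math.inf)
```

With <code>d = math.inf</code> every voxel on the line counts as being in front of the surface, so the same function also covers the case where the line does not intersect the shape.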
<br />
When there is no intersection between the line and the shape, <math>d_{x, y} = \infty</math> and all voxels along the line should be 0; this is referred to as the silhouette criterion.<br />
<br />
=== Surface Normals ===<br />
Since vectors <math>n_{x} = (0, −n_{c}, n_{b})</math> and <math>n_{y} = (−n_{c}, 0, n_{a})</math> are orthogonal to the normal vector <math>n_{x, y} = (n_{a}, n_{b}, n_{c})</math>, they can be normalized to obtain <math>n’_{x} = (0, −1, n_{b}/n_{c})</math> and <math>n’_{y} = (−1, 0, n_{a}/n_{c})</math> on the estimated surface plane at <math>(x, y, z)</math>. The projected surface normal loss tries to guarantee that the voxels at <math>(x, y, z) ± n’_{x}</math> and <math>(x, y, z) ± n’_{y}</math> are 1 to match the estimated normal. The constraints are only applied when the target voxels are inside the estimated silhouette.<br />
<br />
The projected surface normal loss is defined as follows, with <math>z = d_{x, y}</math>:<br />
<br />
<math><br />
L_{normal}(x, y, z) =<br />
(1 - v_{x, y-1, z+\frac{n_b}{n_c}})^2 + (1 - v_{x, y+1, z-\frac{n_b}{n_c}})^2 + <br />
(1 - v_{x-1, y, z+\frac{n_a}{n_c}})^2 + (1 - v_{x+1, y, z-\frac{n_a}{n_c}})^2<br />
</math><br />
<br />
Gradients along x are:<br />
<br />
<math><br />
\frac{dL_{normal}(x, y, z)}{dv_{x-1, y, z+\frac{n_a}{n_c}}} = 2(v_{x-1, y, z+\frac{n_a}{n_c}}-1)<br />
</math><br />
and<br />
<math><br />
\frac{dL_{normal}(x, y, z)}{dv_{x+1, y, z-\frac{n_a}{n_c}}} = 2(v_{x+1, y, z-\frac{n_a}{n_c}}-1)<br />
</math><br />
<br />
Gradients along y are:<br />
<br />
<math><br />
\frac{dL_{normal}(x, y, z)}{dv_{x, y-1, z+\frac{n_b}{n_c}}} = 2(v_{x, y-1, z+\frac{n_b}{n_c}}-1)<br />
</math><br />
and<br />
<math><br />
\frac{dL_{normal}(x, y, z)}{dv_{x, y+1, z-\frac{n_b}{n_c}}} = 2(v_{x, y+1, z-\frac{n_b}{n_c}}-1)<br />
</math><br />
<br />
= Training =<br />
The 2.5D and 3D estimation components are first pre-trained separately on synthetic data from ShapeNet, and then fine-tuned on real images.<br />
<br />
For pre-training, the 2.5D sketch estimator is trained on synthetic ShapeNet depth, surface normal, and silhouette ground truth, using an L2 loss. The 3D estimator is trained with ground truth voxels using a cross-entropy loss.<br />
<br />
Reprojection consistency loss is used to fine-tune the 3D estimation using real images, using the predicted depth, normals, and silhouette. A straightforward implementation leads to shapes that explain the 2.5D sketches well but have unrealistic 3D appearance due to overfitting.<br />
<br />
Instead, the decoder of the 3D estimator is fixed, and only the encoder is fine-tuned. The model is fine-tuned separately on each image for 40 iterations, which takes up to 10 seconds on the GPU. Without fine-tuning, testing time takes around 100 milliseconds. SGD is used for optimization with batch size of 4, learning rate of 0.001, and momentum of 0.9.<br />
<br />
= Evaluation =<br />
Qualitative and quantitative results are provided using different variants of the framework. The framework is evaluated on both synthetic and real images on three datasets: ShapeNet, PASCAL 3D+, and IKEA. Intersection-over-Union (IoU) is the main measurement of comparison between the models. However, the authors note that models which focus on the IoU metric fail to capture the details of the object they are trying to model, disregarding details to focus on the overall shape. To counter this drawback, they poll people on which reconstruction is preferred. IoU is also computationally inefficient since it has to check over all possible scales.<br />
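For reference, IoU between a predicted and a ground-truth voxel grid can be sketched as below. The flattened-list representation, the binarization threshold of 0.5 and the sample values are assumptions for illustration; the search over scales mentioned above is not included.

```python
def voxel_iou(pred, truth, threshold=0.5):
    """Intersection-over-Union of two voxel grids given as flat lists of occupancies."""
    inter = union = 0
    for p, t in zip(pred, truth):
        a, b = p >= threshold, t >= threshold   # binarise both grids
        inter += a and b                        # voxel occupied in both
        union += a or b                         # voxel occupied in at least one
    return inter / union if union else 1.0      # two empty grids count as identical

pred = [0.9, 0.8, 0.1, 0.6]     # predicted occupancy probabilities
truth = [1.0, 0.0, 0.0, 1.0]    # ground-truth occupancy
iou = voxel_iou(pred, truth)
```

Here two voxels are occupied in both grids and three in at least one, so the IoU is 2/3. Because the score only counts overlapping volume, a reconstruction that gets the bulk of the shape right scores well even if fine details are missing.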
<br />
== ShapeNet ==<br />
The data consists of synthesized images of ShapeNet chairs [Chang et al., 2015]: 6,778 chairs are each rendered from 20 random viewpoints. The chairs are placed in front of random backgrounds from the SUN database [Xiao et al., 2010], and the corresponding RGB, depth, surface normal, and silhouette images are rendered with Mitsuba, the physics-based renderer by Jakob, for more realistic images.<br />
<br />
=== Method ===<br />
MarrNet is trained following the training paradigm defined previously but without the final fine-tuning stage, since 3D shapes are available. A baseline is created that directly predicts the 3D shape using the same 3D shape estimator architecture with no 2.5D sketch estimation. Specifically, the 2.5D sketch estimator is trained using ground truth depth, normal and silhouette images and a L2 reconstruction loss. The 3D shape estimation module takes in the masked ground truth depth and normal images as input, and predicts 3D voxels of size 128×128×128 with a binary cross entropy loss.<br />
<br />
=== Results ===<br />
The baseline output is compared to the full framework, and the figure below shows that MarrNet provides model outputs with more details and smoother surfaces than the baseline. The estimated normal and depth images are able to extract intrinsic information about object shape while leaving behind non-essential information such as textures from the original images. Quantitatively, the full model also achieves an intersection-over-union (IoU) score of 0.57 (which measures the overlap of the predicted model and ground truth), which is higher than the direct prediction baseline.<br />
<br />
[[File:marrnet_shapenet_results.png|700px|thumb|center|ShapeNet results.]]<br />
<br />
== PASCAL 3D+ ==<br />
Rough 3D models are provided from real-life images.<br />
<br />
=== Method ===<br />
Each module is again trained separately on the ShapeNet dataset, following the paradigm described above, and the model is then fine-tuned on the PASCAL 3D+ dataset. Three variants of the model are tested. The first is trained using ShapeNet data only with no fine-tuning. The second is fine-tuned without fixing the decoder. The third is fine-tuned with a fixed decoder.<br />
<br />
=== Results ===<br />
The figure below shows the results of the ablation study. The model trained only on synthetic data provides reasonable estimates. However, fine-tuning without fixing the decoder leads to impossible shapes from certain views. The third model keeps the shape prior, providing more details in the final shape.<br />
<br />
[[File:marrnet_pascal_3d_ablation.png|600px|thumb|center|Ablation studies using the PASCAL 3D+ dataset.]]<br />
<br />
Additional comparisons are made with the state-of-the-art (DRC) on the provided ground truth shapes. MarrNet achieves 0.39 IoU, while DRC achieves 0.34. Since PASCAL 3D+ only has rough annotations, with only 10 CAD chair models for all images, computing IoU with these shapes is not very informative. Instead, human studies are conducted and MarrNet reconstructions are preferred 74% of the time over DRC, and 42% of the time to ground truth. This shows how MarrNet produces nice shapes and also highlights the fact that ground truth shapes are not very good.<br />
<br />
[[File:human_studies.png|400px|thumb|center|Human preferences on chairs in PASCAL 3D+ (Xiang et al. 2014). The numbers show the percentage of how often humans preferred the 3D shape from DRC (state-of-the-art), MarrNet, or GT.]]<br />
<br />
<br />
[[File:marrnet_pascal_3d_drc_comparison.png|600px|thumb|center|Comparison between DRC and MarrNet results.]]<br />
<br />
Several failure cases are shown in the figure below. Specifically, the framework does not seem to work well on thin structures.<br />
<br />
[[File:marrnet_pascal_3d_failure_cases.png|500px|thumb|center|Failure cases on PASCAL 3D+. The algorithm cannot recover thin structures.]]<br />
<br />
== IKEA ==<br />
This dataset contains images of IKEA furniture, with accurate 3D shape and pose annotations. Objects are often heavily occluded or truncated.<br />
<br />
=== Results ===<br />
Qualitative results are shown in the figure below. The model is shown to deal with mild occlusions in real life scenarios. Human studies show that MarrNet reconstructions are preferred 61% of the time to 3D-VAE-GAN.<br />
<br />
[[File:marrnet_ikea_results.png|700px|thumb|center|Results on chairs in the IKEA dataset, and comparison with 3D-VAE-GAN.]]<br />
<br />
== Other Data ==<br />
MarrNet is also applied on cars and airplanes. Shown below, smaller details such as the horizontal stabilizer and rear-view mirrors are recovered.<br />
<br />
[[File:marrnet_airplanes_and_cars.png|700px|thumb|center|Results on airplanes and cars from the PASCAL 3D+ dataset, and comparison with DRC.]]<br />
<br />
MarrNet is also jointly trained on three object categories, and successfully recovers the shapes of different categories. Results are shown in the figure below.<br />
<br />
[[File:marrnet_multiple_categories.png|700px|thumb|center|Results when trained jointly on all three object categories (cars, airplanes, and chairs).]]<br />
<br />
= Commentary =<br />
Qualitatively, the results look quite impressive. The 2.5D sketch estimation seems to distill the useful information for more realistic looking 3D shape estimation. The disentanglement of 2.5D and 3D estimation steps also allows for easier training and domain adaptation from synthetic data.<br />
<br />
As the authors mention, the IoU metric is not very descriptive, and most of the comparisons in this paper are only qualitative, mainly being human preference studies. A better quantitative evaluation metric would greatly help in making an unbiased comparison between different results.<br />
<br />
As seen in several of the results, the network does not deal well with objects that have thin structures, which is particularly noticeable with many of the chair arm rests. As well, looking more carefully at some results, it seems that fine-tuning only the 3D encoder does not seem to transfer well to unseen objects, since shape priors have already been learned by the decoder. Therefore, future work should address more "difficult" shapes and forms; it should be more difficult to generalize shapes that are more complex than furniture.<br />
<br />
Also there is ambiguity in terms of how the aforementioned self-supervision can work as the authors claim that the model can be fine-tuned using a single image itself. If the parameters are constrained to a single image, then it means it will not generalize well. It is not clearly explained as to what can be fine-tuned.<br />
<br />
The paper does not propose or implement a baseline model to which MarrNet should be compared.<br />
<br />
The model uses information from a single image. 3D shape estimation in biological agents incorporates information from multiple images or even video. A logical next step for improving this model would be to include images of the object from multiple angles.<br />
<br />
= Conclusion =<br />
The proposed MarrNet employs a novel model to estimate 2.5D sketches for 3D shape reconstruction. The sketches are shown to improve the model’s performance, and make it easy to adapt to images across different domains and categories. Differentiable loss functions are created such that the model can be fine-tuned end-to-end on images without ground truth. The experiments show that the model performs well, and human studies show that the results are preferred over other methods.<br />
<br />
= Implementation =<br />
The source code for the paper, as written by the authors, is available in the following repository: https://github.com/jiajunwu/marrnet<br />
<br />
= References =<br />
# Jiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, William T. Freeman, Joshua B. Tenenbaum. MarrNet: 3D Shape Reconstruction via 2.5D Sketches, 2017<br />
# David Marr. Vision: A computational investigation into the human representation and processing of visual information. W. H. Freeman and Company, 1982.<br />
# Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.<br />
# Jiajun Wu, Chengkai Zhang, Tianfan Xue, William T Freeman, and Joshua B Tenenbaum. Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling. In NIPS, 2016b.<br />
# Wu, J. (n.d.). Jiajunwu/marrnet. Retrieved March 25, 2018, from https://github.com/jiajunwu/marrnet<br />
# Jiajun Wu, Tianfan Xue, Joseph J Lim, Yuandong Tian, Joshua B Tenenbaum, Antonio Torralba, and William T Freeman. Single image 3d interpreter network. In ECCV, 2016a.<br />
# Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In NIPS, 2016.<br />
# Danilo Jimenez Rezende, SM Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess. Unsupervised learning of 3d structure from images. In NIPS, 2016.<br />
# Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.<br />
# Rohit Girdhar, David F. Fouhey, Mikel Rodriguez and Abhinav Gupta, Learning a Predictable and Generative Vector Representation for Objects, in ECCV 2016<br />
#Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv:1512.03012, 2015. <br />
#Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010. <br />
#Wenzel Jakob. Mitsuba renderer, 2010. http://www.mitsuba-renderer.org.</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=MarrNet:_3D_Shape_Reconstruction_via_2.5D_Sketches&diff=36344MarrNet: 3D Shape Reconstruction via 2.5D Sketches2018-04-20T20:33:10Z<p>Ws2chen: /* Method */</p>
<hr />
<div>= Introduction =<br />
Humans are able to quickly recognize 3D shapes from images, even in spite of drastic differences in object texture, material, lighting, and background.<br />
<br />
[[File:marrnet_intro_image.png|700px|thumb|center|Objects in real images. The appearance of the same shaped object varies based on colour, texture, lighting, background, etc. However, the 2.5D sketches (e.g. depth or normal maps) of the object remain constant, and can be seen as an abstraction of the object which is used to reconstruct the 3D shape.]]<br />
<br />
In this work, the authors propose a novel end-to-end trainable model that sequentially estimates 2.5D sketches and 3D object shape from images and also enforces re-projection consistency between the 3D shape and the estimated sketch. 2.5D is the construction of a 3D environment using 2D retina projection along with depth perception obtained from the image. The two-step approach makes the network more robust to differences in object texture, material, lighting and background. Based on the idea from [Marr, 1982] that human 3D perception relies on recovering 2.5D sketches, which include depth maps (containing information related to the distance of surfaces from a viewpoint) and surface normal maps (a technique for adding the illusion of depth details to surfaces using an image's RGB information), the authors design an end-to-end trainable pipeline which they call MarrNet. MarrNet first estimates depth, normal maps, and silhouette, followed by a 3D shape. MarrNet uses an encoder-decoder structure for the sub-components of the framework. <br />
<br />
The authors claim several unique advantages to their method. Single image 3D reconstruction is a highly under-constrained problem, requiring strong prior knowledge of object shapes. As well, accurate 3D object annotations using real images are not common, and many previous approaches rely on purely synthetic data. However, most of these methods suffer from domain adaptation problems due to imperfect rendering.<br />
<br />
Using 2.5D sketches can alleviate the challenges of domain transfer. It is straightforward to generate perfect object surface normals and depths using a graphics engine. Since 2.5D sketches contain only depth, surface normal, and silhouette information, the second step of recovering 3D shape can be trained purely from synthetic data. As well, the introduction of differentiable constraints between 2.5D sketches and 3D shape makes it possible to fine-tune the system, even without any annotations.<br />
<br />
The framework is evaluated on both synthetic objects from ShapeNet, and real images from PASCAL 3D+, showing good qualitative and quantitative performance in 3D shape reconstruction.<br />
<br />
= Related Work =<br />
<br />
== 2.5D Sketch Recovery ==<br />
Researchers have explored recovering 2.5D information from shading, texture, and colour images in the past. More recently, the development of depth sensors has led to the creation of large RGB-D datasets, and papers on estimating depth, surface normals, and other intrinsic images using deep networks. While this method employs 2.5D estimation, the final output is a full 3D shape of an object.<br />
<br />
[[File:2-5d_example.PNG|700px|thumb|center|Results from the paper: Learning Non-Lambertian Object Intrinsics across ShapeNet Categories. The results show that neural networks can be trained to recover 2.5D information from an image. The top row predicts the albedo and the bottom row predicts the shading. It can be observed that the results are still blurry and the fine details are not fully recovered.]]<br />
<br />
=== Notes: 2.5D === <br />
<br />
Two and a half dimensional (shortened to 2.5D, known alternatively as three-quarter perspective and pseudo-3D) is a term used to describe either 2D graphical projections and similar techniques used to cause images to simulate the appearance of being three-dimensional (3D) when in fact they are not, or gameplay in an otherwise three-dimensional video game that is restricted to a two-dimensional plane or has a virtual camera with fixed angle.<br />
<br />
== Single Image 3D Reconstruction ==<br />
The development of large-scale shape repositories like ShapeNet has allowed for the development of models encoding shape priors for single image 3D reconstruction. These methods normally regress voxelized 3D shapes, relying on synthetic data or 2D masks for training. A voxel is an abbreviation for volume element, the three-dimensional version of a pixel. The formulation in the paper tackles domain adaptation better, since the network can be fine-tuned on images without any annotations.<br />
<br />
== 2D-3D Consistency ==<br />
Intuitively, the 3D shape can be constrained to be consistent with 2D observations. This idea has been explored for decades, and has been widely used in 3D shape completion with the use of depths and silhouettes. A few recent papers [5,6,7,8] discussed enforcing differentiable 2D-3D constraints between shape and silhouettes to enable joint training of deep networks for the task of 3D reconstruction. In this work, this idea is exploited to develop differentiable constraints for consistency between the 2.5D sketches and 3D shape.<br />
<br />
= Approach =<br />
The 3D structure is recovered from a single RGB view using three steps, shown in the figure below. The first step estimates 2.5D sketches, including depth, surface normal, and silhouette of the object. The second step estimates a 3D voxel representation of the object. The third step uses a reprojection consistency function to enforce the 2.5D sketch and 3D structure alignment.<br />
<br />
[[File:marrnet_model_components.png|700px|thumb|center|MarrNet architecture. 2.5D sketches of normals, depths, and silhouette are first estimated. The sketches are then used to estimate the 3D shape. Finally, re-projection consistency is used to ensure consistency between the sketch and 3D output.]]<br />
<br />
== 2.5D Sketch Estimation ==<br />
The first step takes a 2D RGB image and predicts the 2.5D sketch with surface normal, depth, and silhouette of the object. The goal is to estimate intrinsic object properties from the image, while discarding non-essential information such as texture and lighting. An encoder-decoder architecture is used. The encoder is a ResNet-18 network, which takes a 256 x 256 RGB image and produces 512 feature maps of size 8 x 8. The decoder is four sets of 5 x 5 fully convolutional and ReLU layers, followed by four sets of 1 x 1 convolutional and ReLU layers. The output is 256 x 256 resolution depth, surface normal, and silhouette images.<br />
<br />
== 3D Shape Estimation ==<br />
The second step estimates a voxelized 3D shape using the 2.5D sketches from the first step. The focus here is for the network to learn the shape prior that can explain the input well, and can be trained on synthetic data without suffering from the domain adaptation problem since it only takes in surface normal and depth images as input. The network architecture is inspired by the TL[10] network, and 3D-VAE-GAN, with an encoder-decoder structure. The normal and depth image, masked by the estimated silhouette, are passed into 5 sets of convolutional, ReLU, and pooling layers, followed by two fully connected layers, with a final output width of 200. The 200-dimensional vector is passed into a decoder of 5 fully convolutional and ReLU layers, outputting a 128 x 128 x 128 voxelized estimate of the input.<br />
<br />
== Re-projection Consistency ==<br />
The third step consists of a depth re-projection loss and surface normal re-projection loss. Here, <math>v_{x, y, z}</math> represents the value at position <math>(x, y, z)</math> in a 3D voxel grid, with <math>v_{x, y, z} \in [0, 1] \; \forall x, y, z</math>. <math>d_{x, y}</math> denotes the estimated depth at position <math>(x, y)</math>, and <math>n_{x, y} = (n_a, n_b, n_c)</math> denotes the estimated surface normal. Orthographic projection is used.<br />
<br />
[[File:marrnet_reprojection_consistency.png|700px|thumb|center|Reprojection consistency for voxels. Left and middle: criteria for depth and silhouettes. Right: criterion for surface normals]]<br />
<br />
=== Depths ===<br />
The voxel at the estimated depth, <math>v_{x, y, d_{x, y}}</math>, should be 1, while all voxels in front of it should be 0. This ensures the estimated 3D shape matches the estimated depth values. The projected depth loss and its gradient are defined as follows:<br />
<br />
<math><br />
L_{depth}(x, y, z)=<br />
\left\{<br />
\begin{array}{ll}<br />
v^2_{x, y, z}, & z < d_{x, y} \\<br />
(1 - v_{x, y, z})^2, & z = d_{x, y} \\<br />
0, & z > d_{x, y} \\<br />
\end{array}<br />
\right.<br />
</math><br />
<br />
<math><br />
\frac{\partial L_{depth}(x, y, z)}{\partial v_{x, y, z}} =<br />
\left\{<br />
\begin{array}{ll}<br />
2v_{x, y, z}, & z < d_{x, y} \\<br />
2(v_{x, y, z} - 1), & z = d_{x, y} \\<br />
0, & z > d_{x, y} \\<br />
\end{array}<br />
\right.<br />
</math><br />
<br />
When <math>d_{x, y} = \infty</math>, i.e. when there is no intersection between the line and the shape, all voxels along the line should be 0. This special case is referred to as the silhouette criterion.<br />
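<br />
The piecewise definitions above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name <code>depth_loss_along_ray</code> is hypothetical, the depth is assumed to be an integer voxel index, and summing the per-voxel terms along one ray is an assumption made here for readability.<br />

```python
import math


def depth_loss_along_ray(voxels, depth):
    """Depth re-projection loss and gradient for a single (x, y) ray.

    voxels: list of occupancy values v[x, y, z] for z = 0, 1, ..., each in [0, 1]
    depth:  estimated depth d[x, y] as an integer voxel index (assumption),
            or math.inf when the ray does not hit the object
    """
    loss = 0.0
    grad = [0.0] * len(voxels)
    for z, v in enumerate(voxels):
        if z < depth:
            # Voxels in front of the surface should be empty.
            loss += v ** 2
            grad[z] = 2.0 * v
        elif z == depth:
            # The voxel at the estimated depth should be occupied.
            loss += (1.0 - v) ** 2
            grad[z] = 2.0 * (v - 1.0)
        # Voxels behind the surface (z > depth) are unconstrained.
    return loss, grad


# Silhouette criterion: with depth = math.inf every voxel falls in the
# first branch, so the whole ray is pushed towards 0.
ray_loss, ray_grad = depth_loss_along_ray([0.5, 0.0, 0.5, 1.0], math.inf)
```

Passing <code>math.inf</code> as the depth reproduces the silhouette criterion without any special-casing, because every index then satisfies <code>z < depth</code>.<br />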
<br />
=== Surface Normals ===<br />
Since vectors <math>n_{x} = (0, -n_{c}, n_{b})</math> and <math>n_{y} = (-n_{c}, 0, n_{a})</math> are orthogonal to the normal vector <math>n_{x, y} = (n_{a}, n_{b}, n_{c})</math>, they can be normalized to obtain <math>n'_{x} = (0, -1, n_{b}/n_{c})</math> and <math>n'_{y} = (-1, 0, n_{a}/n_{c})</math> on the estimated surface plane at <math>(x, y, z)</math>. The projected surface normal loss tries to guarantee that the voxels at <math>(x, y, z) \pm n'_{x}</math> and <math>(x, y, z) \pm n'_{y}</math> are 1 to match the estimated normal. The constraints are only applied when the target voxels are inside the estimated silhouette.<br />
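<br />
As a quick check of the orthogonality claim, the dot products of the two vectors with the normal vanish term by term:<br />
<br />
<math>
n_{x} \cdot n_{x, y} = 0 \cdot n_{a} - n_{c} n_{b} + n_{b} n_{c} = 0, \qquad
n_{y} \cdot n_{x, y} = -n_{c} n_{a} + 0 \cdot n_{b} + n_{a} n_{c} = 0
</math><br />
<br />
Dividing each vector by <math>n_{c}</math> (which assumes <math>n_{c} \neq 0</math>) gives the rescaled vectors with a lateral component of <math>-1</math>, so that stepping by one voxel in <math>x</math> or <math>y</math> changes the depth by <math>n_{a}/n_{c}</math> or <math>n_{b}/n_{c}</math> respectively.<br />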
<br />
The projected surface normal loss is defined as follows, with <math>z = d_{x, y}</math>:<br />
<br />
<math><br />
L_{normal}(x, y, z) =<br />
(1 - v_{x, y-1, z+\frac{n_b}{n_c}})^2 + (1 - v_{x, y+1, z-\frac{n_b}{n_c}})^2 + <br />
(1 - v_{x-1, y, z+\frac{n_a}{n_c}})^2 + (1 - v_{x+1, y, z-\frac{n_a}{n_c}})^2<br />
</math><br />
<br />
Gradients along x are:<br />
<br />
<math><br />
\frac{dL_{normal}(x, y, z)}{dv_{x-1, y, z+\frac{n_a}{n_c}}} = 2(v_{x-1, y, z+\frac{n_a}{n_c}}-1)<br />
</math><br />
and<br />
<math><br />
\frac{dL_{normal}(x, y, z)}{dv_{x+1, y, z-\frac{n_a}{n_c}}} = 2(v_{x+1, y, z-\frac{n_a}{n_c}}-1)<br />
</math><br />
<br />
Gradients along y are:<br />
<br />
<math><br />
\frac{dL_{normal}(x, y, z)}{dv_{x, y-1, z+\frac{n_b}{n_c}}} = 2(v_{x, y-1, z+\frac{n_b}{n_c}}-1)<br />
</math><br />
and<br />
<math><br />
\frac{dL_{normal}(x, y, z)}{dv_{x, y+1, z-\frac{n_b}{n_c}}} = 2(v_{x, y+1, z-\frac{n_b}{n_c}}-1)<br />
</math><br />
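<br />
The surface normal loss and its gradients can be sketched as follows. This is a hedged illustration rather than the authors' code: the function name <code>normal_loss_at_pixel</code> is hypothetical, the voxel grid is a plain nested list indexed as <code>grid[x][y][z]</code>, the fractional depth offsets are rounded to the nearest voxel index (an assumption, since the summary does not say how non-integer indices are handled), and targets outside the grid are simply skipped in place of the silhouette test. The sketch also assumes <math>n_c \neq 0</math>.<br />

```python
def normal_loss_at_pixel(grid, x, y, z, normal):
    """Surface normal re-projection loss at voxel (x, y, z), with z = d[x, y].

    grid:   nested list of occupancies, indexed grid[x][y][z], values in [0, 1]
    normal: estimated surface normal (n_a, n_b, n_c), assumed to have n_c != 0
    Returns the loss and a dict mapping each target voxel to its gradient.
    """
    n_a, n_b, n_c = normal
    dz_x = n_a / n_c  # depth change for one voxel step along x
    dz_y = n_b / n_c  # depth change for one voxel step along y
    targets = [
        (x, y - 1, z + dz_y),
        (x, y + 1, z - dz_y),
        (x - 1, y, z + dz_x),
        (x + 1, y, z - dz_x),
    ]
    loss = 0.0
    grads = {}
    for tx, ty, tz in targets:
        tz = round(tz)  # assumption: snap to the nearest voxel index
        inside = (
            0 <= tx < len(grid)
            and 0 <= ty < len(grid[0])
            and 0 <= tz < len(grid[0][0])
        )
        if not inside:
            continue  # stand-in for "outside the estimated silhouette"
        value = grid[tx][ty][tz]
        loss += (1.0 - value) ** 2
        grads[(tx, ty, tz)] = 2.0 * (value - 1.0)
    return loss, grads
```

Each of the four neighbouring voxels on the estimated surface plane contributes one squared term, and its gradient matches the expressions given above.<br />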
<br />
= Training =<br />
The 2.5D and 3D estimation components are first pre-trained separately on synthetic data from ShapeNet, and then fine-tuned on real images.<br />
<br />
For pre-training, the 2.5D sketch estimator is trained on synthetic ShapeNet depth, surface normal, and silhouette ground truth, using an L2 loss. The 3D estimator is trained with ground truth voxels using a cross-entropy loss.<br />
<br />
Reprojection consistency loss is used to fine-tune the 3D estimation on real images, using the predicted depth, normals, and silhouette. A straightforward implementation leads to shapes that explain the 2.5D sketches well, but that have an unrealistic 3D appearance due to overfitting.<br />
<br />
Instead, the decoder of the 3D estimator is fixed, and only the encoder is fine-tuned. The model is fine-tuned separately on each image for 40 iterations, which takes up to 10 seconds on the GPU. Without fine-tuning, testing time takes around 100 milliseconds. SGD is used for optimization with batch size of 4, learning rate of 0.001, and momentum of 0.9.<br />
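<br />
The optimizer settings quoted above can be illustrated with a minimal sketch of SGD with momentum. The names <code>sgd_momentum_step</code> and <code>fine_tune_encoder</code> are hypothetical, the classical momentum form (velocity = momentum × velocity − learning rate × gradient) is an assumption about which variant is meant, and the gradient function stands in for back-propagating the re-projection loss. Only the encoder parameters are passed to the update, mirroring the fixed decoder.<br />

```python
def sgd_momentum_step(params, grads, velocity, lr=0.001, momentum=0.9):
    """One classical-momentum SGD update on a flat list of parameters."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


def fine_tune_encoder(encoder_params, grad_fn, iterations=40):
    """Fine-tune only the encoder parameters for a fixed number of iterations.

    grad_fn is a placeholder for the gradient of the re-projection
    consistency loss with respect to the encoder parameters; the decoder
    is never passed in, so it stays fixed.
    """
    velocity = [0.0] * len(encoder_params)
    for _ in range(iterations):
        grads = grad_fn(encoder_params)
        encoder_params, velocity = sgd_momentum_step(
            encoder_params, grads, velocity
        )
    return encoder_params


# Toy usage: a quadratic loss sum(p^2) stands in for the real loss.
tuned = fine_tune_encoder([1.0, -2.0], lambda ps: [2.0 * p for p in ps])
```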
<br />
= Evaluation =<br />
Qualitative and quantitative results are provided using different variants of the framework. The framework is evaluated on both synthetic and real images on three datasets: ShapeNet, PASCAL 3D+, and IKEA. Intersection-over-Union (IoU) is the main measurement of comparison between the models. However, the authors note that models which focus on the IoU metric fail to capture the details of the object they are trying to model, disregarding details to focus on the overall shape. To counter this drawback, they poll people on which reconstruction is preferred. IoU is also computationally inefficient since it has to check over all possible scales.<br />
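<br />
For reference, voxel IoU can be computed as in the sketch below. The function name <code>voxel_iou</code> and the 0.5 occupancy threshold are assumptions made for illustration; the summary does not state which threshold the authors use, and the scale search mentioned above is not included.<br />

```python
def voxel_iou(predicted, truth, threshold=0.5):
    """Intersection-over-union between two flattened voxel grids.

    predicted: iterable of occupancy probabilities in [0, 1]
    truth:     iterable of ground-truth occupancies (0 or 1)
    threshold: cut-off for calling a predicted voxel occupied (assumption)
    """
    intersection = 0
    union = 0
    for p, t in zip(predicted, truth):
        p_occupied = p >= threshold
        t_occupied = t >= 0.5
        if p_occupied and t_occupied:
            intersection += 1
        if p_occupied or t_occupied:
            union += 1
    # Two empty grids overlap perfectly by convention.
    return intersection / union if union else 1.0
```

Because every occupied voxel counts equally, a reconstruction that gets the bulk of the shape right but misses thin parts can still score well, which is consistent with the drawback noted above.<br />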
<br />
== ShapeNet ==<br />
The data is based on synthesized images of ShapeNet chairs [Chang et al., 2015]. Images of 6,778 chairs are rendered from 20 random viewpoints. The chairs are placed in front of random backgrounds from the SUN database [Xiao et al., 2010], and the RGB, depth, surface normal, and silhouette images are rendered using the physics-based renderer Mitsuba [Jakob, 2010] for more realistic images.<br />
<br />
=== Method ===<br />
MarrNet is trained following the training paradigm defined previously but without the final fine-tuning stage, since 3D shapes are available. A baseline is created that directly predicts the 3D shape using the same 3D shape estimator architecture with no 2.5D sketch estimation. Specifically, the 2.5D sketch estimator is trained using ground truth depth, normal and silhouette images and an L2 reconstruction loss. The 3D shape estimation module takes in the masked ground truth depth and normal images as input, and predicts 3D voxels of size 128×128×128 with a binary cross entropy loss.<br />
<br />
=== Results ===<br />
The baseline output is compared to the full framework, and the figure below shows that MarrNet provides model outputs with more details and smoother surfaces than the baseline. The estimated normal and depth images are able to extract intrinsic information about object shape while leaving behind non-essential information such as textures from the original images. Quantitatively, the full model also achieves an intersection-over-union (IoU) score of 0.57 (which measures the overlap between the predicted model and the ground truth), which is higher than the direct prediction baseline.<br />
<br />
[[File:marrnet_shapenet_results.png|700px|thumb|center|ShapeNet results.]]<br />
<br />
== PASCAL 3D+ ==<br />
Rough 3D models are provided from real-life images.<br />
<br />
=== Method ===<br />
Each module is pre-trained on the ShapeNet dataset, and then fine-tuned on the PASCAL 3D+ dataset. Three variants of the model are tested. The first is trained using ShapeNet data only with no fine-tuning. The second is fine-tuned without fixing the decoder. The third is fine-tuned with a fixed decoder.<br />
<br />
=== Results ===<br />
The figure below shows the results of the ablation study. The model trained only on synthetic data provides reasonable estimates. However, fine-tuning without fixing the decoder leads to impossible shapes from certain views. The third model keeps the shape prior, providing more details in the final shape.<br />
<br />
[[File:marrnet_pascal_3d_ablation.png|600px|thumb|center|Ablation studies using the PASCAL 3D+ dataset.]]<br />
<br />
Additional comparisons are made with the state-of-the-art (DRC) on the provided ground truth shapes. MarrNet achieves 0.39 IoU, while DRC achieves 0.34. Since PASCAL 3D+ only has rough annotations, with only 10 CAD chair models for all images, computing IoU with these shapes is not very informative. Instead, human studies are conducted and MarrNet reconstructions are preferred 74% of the time over DRC, and 42% of the time to ground truth. This shows how MarrNet produces nice shapes and also highlights the fact that ground truth shapes are not very good.<br />
<br />
[[File:human_studies.png|400px|thumb|center|Human preferences on chairs in PASCAL 3D+ (Xiang et al. 2014). The numbers show the percentage of how often humans preferred the 3D shape from DRC (state-of-the-art), MarrNet, or GT.]]<br />
<br />
<br />
[[File:marrnet_pascal_3d_drc_comparison.png|600px|thumb|center|Comparison between DRC and MarrNet results.]]<br />
<br />
Several failure cases are shown in the figure below. Specifically, the framework does not seem to work well on thin structures.<br />
<br />
[[File:marrnet_pascal_3d_failure_cases.png|500px|thumb|center|Failure cases on PASCAL 3D+. The algorithm cannot recover thin structures.]]<br />
<br />
== IKEA ==<br />
This dataset contains images of IKEA furniture, with accurate 3D shape and pose annotations. Objects are often heavily occluded or truncated.<br />
<br />
=== Results ===<br />
Qualitative results are shown in the figure below. The model is shown to deal with mild occlusions in real life scenarios. Human studies show that MarrNet reconstructions are preferred 61% of the time to 3D-VAE-GAN.<br />
<br />
[[File:marrnet_ikea_results.png|700px|thumb|center|Results on chairs in the IKEA dataset, and comparison with 3D-VAE-GAN.]]<br />
<br />
== Other Data ==<br />
MarrNet is also applied on cars and airplanes. Shown below, smaller details such as the horizontal stabilizer and rear-view mirrors are recovered.<br />
<br />
[[File:marrnet_airplanes_and_cars.png|700px|thumb|center|Results on airplanes and cars from the PASCAL 3D+ dataset, and comparison with DRC.]]<br />
<br />
MarrNet is also jointly trained on three object categories, and successfully recovers the shapes of different categories. Results are shown in the figure below.<br />
<br />
[[File:marrnet_multiple_categories.png|700px|thumb|center|Results when trained jointly on all three object categories (cars, airplanes, and chairs).]]<br />
<br />
= Commentary =<br />
Qualitatively, the results look quite impressive. The 2.5D sketch estimation seems to distill the useful information for more realistic looking 3D shape estimation. The disentanglement of 2.5D and 3D estimation steps also allows for easier training and domain adaptation from synthetic data.<br />
<br />
As the authors mention, the IoU metric is not very descriptive, and most of the comparisons in this paper are only qualitative, mainly being human preference studies. A better quantitative evaluation metric would greatly help in making an unbiased comparison between different results.<br />
<br />
As seen in several of the results, the network does not deal well with objects that have thin structures, which is particularly noticeable with many of the chair armrests. As well, on closer inspection of some results, fine-tuning only the 3D encoder does not seem to transfer well to unseen objects, since shape priors have already been learned by the decoder. Therefore, future work should address more "difficult" shapes and forms; generalizing to shapes that are more complex than furniture is likely to be harder.<br />
<br />
There is also ambiguity about how the aforementioned self-supervision works, since the authors claim that the model can be fine-tuned on a single image. If the parameters are fitted to a single image, the fine-tuned model will presumably not generalize well. It is not clearly explained what can be fine-tuned.<br />
<br />
The paper does not propose or implement a baseline model to which MarrNet should be compared.<br />
<br />
The model uses information from a single image. 3D shape estimation in biological agents incorporates information from multiple images or even video. A logical next step for improving this model would be to include images of the object from multiple angles.<br />
<br />
= Conclusion =<br />
The proposed MarrNet employs a novel model to estimate 2.5D sketches for 3D shape reconstruction. The sketches are shown to improve the model’s performance, and make it easy to adapt to images across different domains and categories. Differentiable loss functions are created such that the model can be fine-tuned end-to-end on images without ground truth. The experiments show that the model performs well, and human studies show that the results are preferred over other methods.<br />
<br />
= Implementation =<br />
The source code for the paper, as written by the authors, is available in the following repository: https://github.com/jiajunwu/marrnet<br />
<br />
= References =<br />
# Jiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, William T. Freeman, Joshua B. Tenenbaum. MarrNet: 3D Shape Reconstruction via 2.5D Sketches, 2017<br />
# David Marr. Vision: A computational investigation into the human representation and processing of visual information. W. H. Freeman and Company, 1982.<br />
# Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.<br />
# Jiajun Wu, Chengkai Zhang, Tianfan Xue, William T Freeman, and Joshua B Tenenbaum. Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling. In NIPS, 2016b.<br />
# Wu, J. (n.d.). Jiajunwu/marrnet. Retrieved March 25, 2018, from https://github.com/jiajunwu/marrnet<br />
# Jiajun Wu, Tianfan Xue, Joseph J Lim, Yuandong Tian, Joshua B Tenenbaum, Antonio Torralba, and William T Freeman. Single image 3d interpreter network. In ECCV, 2016a.<br />
# Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In NIPS, 2016.<br />
# Danilo Jimenez Rezende, SM Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess. Unsupervised learning of 3d structure from images. In NIPS, 2016.<br />
# Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.<br />
# Rohit Girdhar, David F. Fouhey, Mikel Rodriguez and Abhinav Gupta, Learning a Predictable and Generative Vector Representation for Objects, in ECCV 2016<br />
#Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv:1512.03012, 2015. <br />
#Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010. <br />
#Wenzel Jakob. Mitsuba renderer, 2010. http://www.mitsuba-renderer.org.</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Understanding_Image_Motion_with_Group_Representations&diff=36337Understanding Image Motion with Group Representations2018-04-20T20:09:06Z<p>Ws2chen: /* Related Work */</p>
<hr />
<div>== Introduction ==<br />
Motion perception is a key component of computer vision. It is critical to problems such as optical flow and visual odometry, where a sequence of images is used to calculate either the pixel level (local) motion or the motion of the entire scene (global). The smooth image transformations caused by camera motion form a subspace of all possible image transformations. Here, we are interested in realistic transformations caused by motion; unrealistic motion caused by, say, face swapping is therefore not considered. <br />
<br />
To be useful for understanding and acting on scene motion, a representation should capture the motion of the observer and all relevant scene content. Supervised training of such a representation is challenging: explicit motion labels are difficult to obtain, especially for nonrigid scenes where it can be unclear how the structure and motion of the scene should be decomposed. The proposed learning method does not need labeled data. Instead, the method applies constraints to learning by using the properties of motion space. The paper presents a general model of visual motion, and how the motion space properties of associativity and invertibility can be used to constrain the learning of a deep neural network. The results show evidence that the learned model captures motion in both 2D and 3D settings. This method can be used to extract useful information for vehicle localization, tracking, and odometry.<br />
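<br />
The two group properties named above can be made concrete with 2D rigid motions written as 3 x 3 homogeneous matrices. This sketch only illustrates associativity and invertibility on a known motion group; it is not the paper's learned representation, and the helper names <code>rigid_motion</code>, <code>compose</code>, <code>inverse</code> and <code>is_close</code> are hypothetical.<br />

```python
import math


def rigid_motion(theta, tx, ty):
    """2D rotation by theta followed by translation (tx, ty), as a 3x3 matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]]


def compose(a, b):
    """Matrix product a @ b: apply motion b first, then motion a."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


def inverse(m):
    """Inverse of a rigid motion: rotation R^T and translation -R^T t."""
    r00, r01, tx = m[0]
    r10, r11, ty = m[1]
    return [
        [r00, r10, -(r00 * tx + r10 * ty)],
        [r01, r11, -(r01 * tx + r11 * ty)],
        [0.0, 0.0, 1.0],
    ]


def is_close(a, b, tol=1e-9):
    """Element-wise comparison of two 3x3 matrices up to a tolerance."""
    return all(abs(a[i][j] - b[i][j]) < tol for i in range(3) for j in range(3))


identity = rigid_motion(0.0, 0.0, 0.0)
m1 = rigid_motion(0.3, 1.0, 2.0)
m2 = rigid_motion(-1.1, 0.5, -0.7)
m3 = rigid_motion(2.0, -3.0, 0.2)

# Associativity: the grouping of consecutive motions does not matter.
associative = is_close(compose(compose(m1, m2), m3), compose(m1, compose(m2, m3)))
# Invertibility: a motion followed by its inverse is the identity.
invertible = is_close(compose(m1, inverse(m1)), identity)
```

A learned motion representation can be constrained in the same spirit: composing the representations of consecutive image pairs should agree regardless of grouping, and a sequence played forward and then backward should map to the identity.<br />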
<br />
[[File:paper13_fig1.png|650px|center|]]<br />
<br />
== Related Work ==<br />
The most common global representations of motion come from structure from motion (SfM) and simultaneous localization and mapping (SLAM), which represent a sequence of motions as poses in the special Euclidean group <math> SE(3) </math>. However, these cannot be used to represent non-rigid or independent motions. The most widely used local representation is optical flow, which estimates per-pixel motion over a 2-D image. Scene flow generalizes optical flow by estimating point trajectories from 3-D motions. The limitation of optical flow is that it only captures motion locally, which makes capturing the overall motion impossible. <br />
<br />
Another approach to representing motion is spatiotemporal features (STFs), which are flexible enough to represent non-rigid motions since there is usually a dimensionality reduction process involved. However, these approaches are restricted to fixed windows of representation. <br />
<br />
There are also works that use CNNs to learn optical flow using brightness constancy assumptions and/or photometric local constraints. Learning-based work on stereo depth estimation has also shown results. Regarding image sequences, there are works on shuffling the order of images to learn representations of their contents, as well as on learning representations equivariant to the egomotion of the camera.<br />
<br />
Recent works have also used knowledge of the geometric or spatial structure of images or scenes to train representations. For example, one can train a CNN to classify the correct configuration of image patches in order to learn the relationship between an image’s patches and its semantic content. The resulting representation can be fine-tuned for image classification. Other works that learn from sequences typically focus on static image content rather than motion.<br />
<br />
== Approach ==<br />
The proposed method is based on the observation that 3D motions, equipped with composition, form a group. By learning the underlying mapping that captures the motion transformations, we are approximating latent motion of the scene. The method is designed to capture group associativity and invertibility.<br />
<br />
Consider a latent structure space <math>S</math> whose elements generate images via a projection <math>\pi:S\rightarrow I</math>, and a latent motion space <math>M</math>, which is some closed subgroup of the set of homeomorphisms on <math>S</math>. For <math>s \in S</math>, a continuous motion sequence <math> \{m_t \in M | t \geq 0\} </math> generates a continuous image sequence <math> \{i_t \in I | t \geq 0\} </math> where <math> i_t=\pi(m_t(s)) </math>. Writing this as a hidden Markov model gives <math> i_t=\pi(m_{\Delta t}(s_{t-1})) </math>, where the current state is based on the change from the previous one. Since <math> M </math> is a closed group on <math> S </math>, it is associative, has inverses, and contains the identity. <math> SE(3) </math> is an example of this. To be more specific, the latent structure of a scene from rigid image motion could be modelled by a point cloud with a motion space <math>M=SE(3)</math>, where rigid image motion can be produced by a camera translating and rotating through a rigid scene in 3D. When a scene has N rigid bodies, the motion space can be represented as <math>M=[SE(3)]^N</math>.<br />
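<br />
The group properties listed above can be checked numerically on a small example. The sketch below uses <math>SE(2)</math> (planar rotation plus translation) as a smaller analogue of <math>SE(3)</math>; the helper names <code>se2</code>, <code>compose</code>, <code>inverse</code>, and <code>close</code> are illustrative assumptions and do not come from the paper.<br />

```python
import math

def se2(theta, tx, ty):
    """Homogeneous 3x3 matrix for a planar rotation by theta plus a translation (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, tx],
            [s,  c, ty],
            [0.0, 0.0, 1.0]]

def compose(a, b):
    """Group operation: the matrix product a @ b (apply b first, then a)."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def inverse(m):
    """Closed-form inverse of an SE(2) element: rotation R^T, translation -R^T t."""
    c, s = m[0][0], m[1][0]
    tx, ty = m[0][2], m[1][2]
    return [[c,  s, -(c * tx + s * ty)],
            [-s, c, -(-s * tx + c * ty)],
            [0.0, 0.0, 1.0]]

def close(a, b, tol=1e-9):
    """Entry-wise comparison of two 3x3 matrices up to a tolerance."""
    return all(abs(a[i][j] - b[i][j]) < tol for i in range(3) for j in range(3))

IDENTITY = se2(0.0, 0.0, 0.0)
# Three arbitrary motions chosen only for illustration.
m1, m2, m3 = se2(0.3, 1.0, -2.0), se2(-1.1, 0.5, 0.25), se2(2.0, -3.0, 4.0)

associative = close(compose(compose(m1, m2), m3), compose(m1, compose(m2, m3)))
has_identity = close(compose(m1, IDENTITY), m1) and close(compose(IDENTITY, m1), m1)
has_inverse = close(compose(m1, inverse(m1)), IDENTITY)
```

The same checks carry over to <math>SE(3)</math> with 4x4 homogeneous matrices; the planar case is used here only to keep the example short.<br />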
<br />
=== Learning Motion by Group Properties ===<br />
The goal is to learn a function <math> \Phi : I \times I \rightarrow \overline{M} </math> that maps image pairs to <math> \overline{M} </math>, a representation of <math> M </math>, as well as the composition operator <math> \diamond : \overline{M} \times \overline{M} \rightarrow \overline{M} </math> that emulates the composition of elements in <math> M </math>. For every sequence and all times <math> t_0 < t_1 < t_2 ... </math>, the sequence representation should have the following properties: <br />
# Associativity: <math> \Phi(I_{t_0}, I_{t_2}) \diamond \Phi(I_{t_2}, I_{t_3}) = (\Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_2})) \diamond \Phi(I_{t_2}, I_{t_3}) = \Phi(I_{t_0}, I_{t_1}) \diamond (\Phi(I_{t_1}, I_{t_2}) \diamond \Phi(I_{t_2}, I_{t_3})) = \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_3})</math>, which means that the motions of differently composed subsequences of a sequence are equivalent<br />
# Has Identity: <math> \Phi(I_{t_0}, I_{t_1}) \diamond e = \Phi(I_{t_0}, I_{t_1}) = e \diamond \Phi(I_{t_0}, I_{t_1}) </math> and <math> e=\Phi(I_{t}, I_{t}) \forall t </math>, where <math>e</math> is the null image motion and the unique identity in the latent space<br />
# Invertibility: <math> \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_0}) = e </math>, so the inverse of the motion of an image sequence is the motion of that image sequence reversed<br />
<br />
Also note that a notion of transitivity is assumed, specifically <math>\Phi(I_{t_0}, I_{t_2}) = \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_2})</math>.<br />
<br />
An embedding loss is used to approximately enforce associativity and invertibility among subsequences sampled from an image sequence. Associativity is encouraged by pushing sequences with the same final motion but different transitions to the same representation. Invertibility is encouraged by pushing sequences corresponding to the same motion but in opposite directions away from each other, as well as pushing all loops to the same representation. Uniqueness of the identity is encouraged by pushing loops away from non-identity representations. Loops from different sequences are also pushed to the same representation (the identity).<br />
<br />
These constraints hold for any type of transformation resulting from image motion. This puts little restriction on the learning problem and allows all features relevant to the motion structure to be captured. Optical flow, on the other hand, assumes unchanging brightness between frames of the same projected scene, and its motion estimates degrade when that assumption does not hold.<br />
<br />
Also, with this method it is possible that multiple representations <math> \overline{M} </math> can be learned from a single <math> M </math>, so the learned representation is not necessarily unique. In addition, the scenes are not expected to have rapidly changing content, scene cuts, or long-term occlusions.<br />
<br />
=== Sequence Learning with Neural Networks ===<br />
The functions <math> \Phi </math> and <math> \diamond </math> are approximated by a CNN and an RNN, respectively. An LSTM is used as the RNN. The input to the network is a sequence of images <math> I_t = \{I_1,...,I_t\} </math>. The CNN processes pairs of images and generates intermediate representations, and the LSTM operates over the sequence of CNN outputs to produce an embedding sequence <math> R_t = \{R_{1,2},...,R_{t-1,t}\} </math>. Only the embedding at the final time step is used for the loss. The network is trained to minimize a hinge loss on the embeddings of pairs of sequences, as defined below:<br />
<br />
<center><math>L(R^1,R^2) = \begin{cases} d(R^1,R^2), & \text{if positive pair} \\ \max(0, m - d(R^1,R^2)), & \text{if negative pair} \end{cases}</math></center><br />
<center><math> d_{cosine}(R^1,R^2)=1-\frac{\langle R^1,R^2 \rangle}{\lVert R^1 \rVert \lVert R^2 \rVert} </math></center><br />
<br />
where <math>d(R^1,R^2)</math> measures the distance between the embeddings of two training sequences and is chosen to be the cosine distance, and <math> m </math> is a fixed scalar margin set to 0.5. Positive pairs are training examples where two sequences have the same final motion; negative pairs are training examples where two sequences have the exact opposite final motion. Using L2 distances yields results similar to those obtained with cosine distances.<br />
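<br />
As a minimal sketch of the loss above, the following pure-Python functions compute the cosine distance and the pairwise hinge loss for two embedding vectors. The function names and the use of plain lists for embeddings are assumptions made for illustration; the margin default of 0.5 is the value stated above.<br />

```python
import math

def cosine_distance(r1, r2):
    """d_cosine(R^1, R^2) = 1 - <R^1, R^2> / (||R^1|| ||R^2||)."""
    dot = sum(a * b for a, b in zip(r1, r2))
    norm1 = math.sqrt(sum(a * a for a in r1))
    norm2 = math.sqrt(sum(b * b for b in r2))
    return 1.0 - dot / (norm1 * norm2)

def embedding_loss(r1, r2, positive, margin=0.5):
    """Pull positive pairs together; push negative pairs at least `margin` apart."""
    d = cosine_distance(r1, r2)
    return d if positive else max(0.0, margin - d)
```

A positive pair with identical embeddings incurs zero loss, while a negative pair is only penalized when its cosine distance is smaller than the margin.<br />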
<br />
Each training sequence is recomposed into 6 subsequences: two forward, two backward, and two identity. To prevent the network from only looking at static differences, subsequence pairs are sampled such that they have the same start and end frames but different motions in between. Sequences of varying lengths are also used to generalize motion on different temporal scales. Training the network with only one input image per time step was also tried, but consistently yielded worse results than image pairs.<br />
<br />
[[File:paper13_fig2.png|650px|center|]]<br />
<br />
Overall, training with image pairs resulted in lower error than training with just single images. This is demonstrated in the table below.<br />
<br />
<br />
[[File:table.png|700px|center|]]<br />
<br />
== Experimentation ==<br />
The network was trained on a rotated and translated MNIST dataset as well as the KITTI dataset. <br />
* Implemented in Torch<br />
* Adam used for optimization, with a decay schedule of 30 epochs and a learning rate chosen by random search<br />
* Batch size of 50-60 for MNIST and 25-30 for KITTI<br />
* Dilated convolutions with ReLU and batch normalization<br />
* Two LSTM cells per layer, with 256 hidden units each<br />
* Sequence length of 3-5 images<br />
* MNIST networks with up to 12 images <br />
<br />
=== Rigid Motion in 2D ===<br />
* MNIST data rotated <math>[0, 360)</math> degrees and translated <math>[-10, 10] </math> pixels, i.e. <math>SE(2)</math> transformations<br />
* Visualized the representation using t-SNE<br />
** Clear clustering by translation and rotation but not object classes<br />
** Suggests the representation captures the motion properties in the dataset, but is independent of image contents<br />
* Visualized the image-conditioned saliency maps<br />
** In Figure 3, red represents the positive gradients of the activation function with respect to the input image, and blue represents the negative gradients.<br />
** If we consider a saliency map as a first-order Taylor expansion, then the map shows the relationship between each pixel and the representation.<br />
** Take the derivative of the network output with respect to the input image<br />
** The area with the highest gradient contributes the most to the output<br />
** The resulting saliency map strongly resembles the spatiotemporal energy filters of classical motion processing<br />
** Suggests the network is learning the right motion structure<br />
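<br />
The first-order Taylor view of a saliency map mentioned above can be written out explicitly, following the formulation of Simonyan et al. (2013). The symbols <math>R_k</math> (one component of the network output), <math>I_0</math> (the image at which the map is computed), and <math>w</math> are notation introduced here for illustration:<br />
<center><math> R_k(I) \approx R_k(I_0) + w^T (I - I_0), \qquad w = \left.\frac{\partial R_k}{\partial I}\right|_{I_0} </math></center><br />
Each entry of <math>w</math> is the gradient at one pixel, so its sign gives the red or blue colouring and its magnitude indicates how strongly a small change in that pixel would change the output.<br />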
<br />
[[File:paper13_fig3.png|700px|center|]]<br />
<br />
=== Real World Motion in 3D ===<br />
* Uses KITTI dataset collected on a car driving through roads in Germany<br />
* On a separate dataset with ground truth camera pose, linearly regress the representation to the ground truth<br />
** The result is compared against the self-supervised flow algorithm of Yu et al. (2016), whose output is downsampled, fed through PCA, and then regressed against the camera motion<br />
** The data shows that it does not perform as well as the flow-based baseline, but is consistently better than chance (guessing the mean value)<br />
** Largest improvements are shown in X and Z translation, which also have the most variance in the data<br />
** Shows the method is able to capture dominant motion structure<br />
* Test performance on interpolation task<br />
** Check <math>R([I_1,I_T])</math> against <math>R([I_1, I_m, I_T])</math>, <math>R([I_1, I_{IN}, I_T])</math>, and <math>R([I_1, I_{OUT}, I_T])</math><br />
** Test how sensitive the network is to deviations from natural motion<br />
** High errors <math>\gg 1</math> mean the network can distinguish between realistic and unrealistic motion<br />
** In order to do this, the distance between the embedding of the first and last frames, <math>R([I_1,I_T])</math>, and the embedding of the first, middle, and last frames, <math>R([I_1, I_m, I_T])</math>, is computed. This distance is compared with the distance obtained when the middle frame of the second embedding is replaced by a frame that is visually similar (inside sequence), <math>R([I_1, I_{IN}, I_T])</math>, and by one that is visually dissimilar (outside sequence), <math>R([I_1, I_{OUT}, I_T])</math>. The results are shown in Table 3. The embedding distance is compared to the Euclidean distance, which is defined as the mean pixel distance between the test frame and <math>\{I_1,I_T\}</math>, whichever is smaller. The results show that the embedding distance of the true frame is significantly lower than that of the other frames. This means that the embedding distance used in the network is more sensitive to atypical motions in the scene. <br />
* Visualized saliency maps<br />
** Highlights objects moving in the background, and motion of the car in the foreground<br />
** Suggests the method can be used for tracking as well<br />
<br />
[[File:paper13_tab2.png|700px|center|]]<br />
<br />
[[File:paper13_fig4.png|700px|center|]]<br />
<br />
[[File:paper13_fig5.png|700px|center|]]<br />
<br />
[[File:table3_motion.PNG|700px|center|]]<br />
<br />
* Figure 7 displays graphs comparing the mean squared error of the method presented in this paper to the baseline chance method and the supervised Flow PCA method.<br />
<br />
[[File:paper13_fig6.PNG|700px|center|]]<br />
<br />
== Conclusion ==<br />
The authors presented a new model of motion and a method for learning motion representations. It is shown that by enforcing group properties we can learn motion representations that are able to generalize between scenes with disparate content. The results can be useful for navigation, prediction, and other behavioral tasks relying on motion. Due to the fact that this method does not require labelled data, it can be applied to a large variety of tasks.<br />
<br />
== Criticism ==<br />
Although this method does not require any labelled data, it is still learning by supervision through defined constraints. The idea of training using unlabelled data is interesting and it does have meaningful practical application. Unfortunately, the authors did not provide convincing experimental results. Results from motion estimation problems are typically compared against ground truth data for their accuracy. The authors performed experiments on transformed MNIST data and KITTI data. The MNIST data is transformed by the authors, so the ground truth is readily available. However, the authors only claimed the validity of the results through the indirect means of t-SNE and saliency map visualization. For the KITTI dataset, the authors regressed the representations against ground truth to obtain a mapping from the network output to a physical motion representation. Again, the results were compared only indirectly against ground truth, and they are poor when compared with the Flow+PCA baseline, especially for X and Z translations as well as Y rotation, which are the main elements of motion present in the KITTI dataset. Such experimentation makes the method hardly convincing or applicable to real-world applications. In addition, the network does not output motion representations with physical meanings, which limits the usefulness of the proposed method for real-world applications.<br />
<br />
One of the motivations the authors use for this approach is that traditional SLAM formulations represent motion as a sequence of poses in <math> SE(3) </math>, and that they are unable to represent non-rigid or independent motions. There exist SLAM formulations that represent motion as [http://ieeexplore.ieee.org/document/7353368/ Gaussian processes], as well as [http://journals.sagepub.com/doi/abs/10.1177/0278364915585860 temporal basis functions], and it is quite [https://openslam.org/robotvision.html common] for inertial, monocular-camera SLAM problems to use a motion representation on <math> SIM(3) </math>, which is the group of similarity transformations (rotation, translation, and uniform scaling). A <math> SIM(3) </math> transformation is not, in general, rigid, so it is not true to say that modern SLAM is unable to represent non-rigid motions. Additionally, the saliency images from the KITTI experiment, which display network gradients on independently moving objects in the scene, do not necessarily mean that the motion representation is capturing independent motion; they only mean that the network representation depends on those pixels. As the authors did not provide an error comparison between images containing independent motions and those without, it is possible that these network gradients only contribute to error (in terms of the camera pose) instead of capturing independent motions.<br />
<br />
Another criticism is that the group-properties constraint the authors impose is too weak. Any set consisting of functions, their inverses, and the identity forms a group. While physical motions are one example of such a group, there are many valid groups that do not represent any coherent physical motions. That is, it's unclear whether group representations adequately describe the underlying mechanisms of the paper.<br />
<br />
Since the network has to learn both the group elements, <math>\overline{M}</math>, and the composition function, <math>\diamond</math>, associated with the group, it is difficult to tell how each of them is performing. It would not be possible to perform a layer-by-layer ablation study to determine the individual contributions of the functions associated with each group.<br />
<br />
Finally, the method requires domain knowledge of the motion space and feature engineering for encoding it, which reduces the ease with which the method can be generalized to various tasks.<br />
<br />
== References ==<br />
Jaegle, A. (2018). Understanding image motion with group representations. ICLR. Retrieved from https://openreview.net/pdf?id=SJLlmG-AZ.<br />
<br />
Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ''International Conference on Learning Representations (ICLR) Workshop'', 2013.<br />
<br />
Jason J Yu, Adam W Harley, and Konstantinos G Derpanis. Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. ''European Conference on Computer Vision (ECCV) Workshops'', 2016.</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Label-Free_Supervision_of_Neural_Networks_with_Physics_and_Domain_Knowledge&diff=36336Label-Free Supervision of Neural Networks with Physics and Domain Knowledge2018-04-20T18:13:49Z<p>Ws2chen: /* Experiments */</p>
<hr />
<div>== Introduction ==<br />
The requirement of large amounts of labeled training data limits the applications of machine learning. Neural networks, in particular, require large amounts of labeled data to work (LeCun, Bengio, and Hinton 2015[1]). Humans are often able to instead learn from high level instructions for how a task should be performed, or what the final result should look like. This work explores whether a similar principle can be applied to teaching machines: can we supervise networks without individual examples by instead describing only the structure of desired outputs?<br />
<br />
[[File:c433li-1.png|300px|center]]<br />
<br />
Unsupervised learning methods such as autoencoders also aim to uncover hidden structure in the data without having access to any labels. Such systems succeed in producing highly compressed, yet informative representations of the inputs (Kingma and Welling 2013; Le 2013). However, these representations differ from ours as they are not explicitly constrained to have a particular meaning or semantics. This paper attempts to explicitly provide the semantics of the hidden variables we hope to discover, but still train without labels by learning from constraints that are known to hold according to prior domain knowledge. By training without direct examples of the values our hidden (output) variables take, several advantages are gained over traditional supervised learning, including:<br />
* a reduction in the amount of work spent labeling, <br />
* an increase in generality, as a single set of constraints can be applied to multiple data sets without relabeling.<br />
<br />
The primary contribution in the paper is to demonstrate how constraint learning can be used to train neural networks, and to explore how to learn useful feature representations from raw data while avoiding trivial, low entropy solutions.<br />
<br />
== Problem Setup ==<br />
In a traditional supervised learning setting, we are given a training set <math>D=\{(x_1, y_1), \cdots, (x_n, y_n)\}</math> of <math>n</math> training examples. Each example is a pair <math>(x_i,y_i)</math> formed by an instance <math>x_i \in X</math> and the corresponding output (label) <math>y_i \in Y</math>. The goal is to learn a function <math>f: X \rightarrow Y</math> mapping inputs to outputs. To quantify performance, a loss function <math>\ell:Y \times Y \rightarrow \mathbb{R}</math> is provided, and a mapping is found via <br />
<br />
<center><math> f^* = \text{argmin}_{f \in \mathcal{F}} \sum_{i=1}^n \ell(f(x_i),y_i) </math></center><br />
<br />
where the optimization is over a pre-defined class of functions <math>\mathcal{F}</math> (hypothesis class). In our case, <math>\mathcal{F}</math> will be (convolutional) neural networks parameterized by their weights. The loss could be, for example, <math>\ell(f(x_i),y_i) = 1[f(x_i) \neq y_i]</math>. By restricting the space of possible functions through the choice of hypothesis class <math>\mathcal{F}</math>, we are leveraging prior knowledge about the specific problem we are trying to solve. Informally, the so-called No Free Lunch Theorems state that every machine learning algorithm must make such assumptions in order to work. Another common way in which a modeler incorporates prior knowledge is by specifying an a-priori preference for certain functions in <math>\mathcal{F}</math>, incorporating a regularization term <math>R:\mathcal{F} \rightarrow \mathbb{R}</math>, and solving for <math> f^* = \text{argmin}_{f \in \mathcal{F}} \sum_{i=1}^n \ell(f(x_i),y_i) + R(f)</math>. Typically, the regularization term <math>R</math> specifies a preference for "simpler" functions (Occam's razor) to prevent overfitting the model on the training data.<br />
<br />
The focus is on problems in which the environment is complex and the output space has a complex representation, for example mapping an input image to the height of an object, rather than simple binary classification problems.<br />
<br />
In this paper, prior knowledge on the structure of the outputs is modeled by providing a weighted constraint function <math>g:X \times Y \rightarrow \mathbb{R}</math>, used to penalize “structures” that are not consistent with our prior knowledge. The paper explores whether this weak form of supervision is sufficient to learn interesting functions. While one clearly needs labels <math>y</math> to evaluate <math>f^*</math>, labels may not be necessary to discover <math>f^*</math>. If prior knowledge informs us that outputs of <math>f^*</math> have other unique properties among functions in <math>\mathcal{F}</math>, we may use these properties for training rather than direct examples <math>y</math>. <br />
<br />
Specifically, the paper considers an unsupervised approach in which the labels <math>y_i</math> are not provided and a necessary property of the output, encoded by <math>g</math>, is optimized instead.<br />
<center><math>\hat{f}^* = \text{argmin}_{f \in \mathcal{F}} \sum_{i=1}^n g(x_i,f(x_i))+ R(f) </math></center><br />
<br />
If optimizing the above equation is sufficient to find <math>f^*</math>, we can use it in place of labels. If it is not sufficient, additional regularization terms are added. The idea is illustrated with three examples, as described in the next section.<br />
<br />
== Experiments ==<br />
<br />
In this paper, the authors introduce three contexts in which to map from inputs to outputs without providing direct examples of those outputs. In the first two experiments (tracking an object in free fall and tracking the position of a walking man), the authors construct mappings from an image to the location of an object it contains. Learning is made possible by exploiting structure that holds in images over time. In the third experiment (detecting objects with causal relationships), the authors map an image to two boolean variables describing whether or not the image contains two special objects. Here, learning exploits the unique causal semantics existing between these objects. <br />
<br />
=== Tracking an object in free fall ===<br />
In the first experiment, they record videos of an object being thrown across the field of view, and aim to learn the object's height in each frame. The dataset released by the authors can be found at [3]. The goal is to obtain a regression network mapping from <math>\mathbb{R}^{\text{height} \times \text{width} \times 3} \rightarrow \mathbb{R}</math>, where <math>\text{height}</math> and <math>\text{width}</math> are the number of vertical and horizontal pixels per frame, and each pixel has 3 color channels. This network is trained as a structured prediction problem operating on a sequence of <math>N</math> images to produce a sequence of <math>N</math> heights, <math>\left(\mathbb{R}^{\text{height} \times \text{width} \times 3} \right)^N \rightarrow \mathbb{R}^N</math>, and each piece of data <math>x_i</math> will be a vector of images, <math>\mathbf{x}</math>.<br />
Rather than supervising the network with direct labels, <math>\mathbf{y} \in \mathbb{R}^N</math>, the network is instead supervised to find an object obeying the elementary physics of free falling objects. An object acting under gravity will have a fixed acceleration of <math>a = -9.8 m / s^2</math>, and the plot of the object's height over time will form a parabola:<br />
<center><math>\mathbf{y}_i = y_0 + v_0(i\Delta t) + \frac{1}{2} a(i\Delta t)^2</math></center><br />
<br />
The idea is, given any trajectory of <math>N</math> height predictions, <math>f(\mathbf{x})</math>, we fit a parabola with fixed curvature to those predictions, and minimize the resulting residual. Formally, if we specify <math>\mathbf{a} = [\frac{1}{2} a\Delta t^2, \frac{1}{2} a(2 \Delta t)^2, \ldots, \frac{1}{2} a(N \Delta t)^2]</math>, the free parameters <math>v_0</math> and <math>y_0</math> of the parabola are fitted by least squares:<br />
<center><math> \text{argmin}_{v_0, y_0}\sum_i\left(f(\mathbf{x})_i-y_0-v_0(i\Delta t)-\frac{1}{2}a(i\Delta t)^2\right)^2 </math></center><br />
By the ordinary least squares solution, the prediction produced by the fitted parabola is: <br />
<center><math> \mathbf{\hat{y}} = \mathbf{a} + \mathbf{A} (\mathbf{A}^T\mathbf{A})^{-1} \mathbf{A}^T (f(\mathbf{x}) - \mathbf{a}) </math></center><br />
<br />
where<br />
<center><br />
<math><br />
\mathbf{A} = <br />
\left[ {\begin{array}{*{20}c}<br />
\Delta t & 1 \\<br />
2\Delta t & 1 \\<br />
3\Delta t & 1 \\<br />
\vdots & \vdots \\<br />
N\Delta t & 1 \\<br />
\end{array} } \right]<br />
</math><br />
</center><br />
<br />
The constraint loss is then defined as<br />
<center><math>g(\mathbf{x},f(\mathbf{x})) = g(f(\mathbf{x})) = \sum_{i=1}^{N} |\mathbf{\hat{y}}_i - f(\mathbf{x})_i|</math></center><br />
<br />
Note that <math>\mathbf{\hat{y}}</math> is not a vector of ground truth labels. Because <math>g</math> is differentiable almost everywhere, it can be optimized with SGD. They find that when combined with existing regularization methods for neural networks, this optimization is sufficient to recover <math>f^*</math> up to an additive constant <math>C</math> (specifying what object height corresponds to 0).<br />
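<br />
The constraint loss above can be sketched in a few lines of pure Python. Because <math>\mathbf{A}</math> has only two columns (time and a constant), the projection <math>\mathbf{A} (\mathbf{A}^T\mathbf{A})^{-1} \mathbf{A}^T</math> reduces to a simple straight-line fit, which is what the code uses. The function name and the assumption that heights are expressed in metres are illustrative and not taken from the paper.<br />

```python
def free_fall_constraint_loss(heights, dt=0.1, accel=-9.8):
    """L1 distance between predicted heights and the best-fit parabola of fixed curvature.

    heights: N >= 2 predicted heights f(x)_1..f(x)_N (assumed here to be in metres).
    """
    n = len(heights)
    times = [(i + 1) * dt for i in range(n)]             # i * dt for i = 1..N
    gravity = [0.5 * accel * t * t for t in times]       # the vector a
    resid = [h - g for h, g in zip(heights, gravity)]    # f(x) - a

    # Ordinary least squares for resid ~ v0 * t + y0 (the columns of A are t and 1).
    t_mean = sum(times) / n
    r_mean = sum(resid) / n
    s_tt = sum((t - t_mean) ** 2 for t in times)
    s_tr = sum((t - t_mean) * (r - r_mean) for t, r in zip(times, resid))
    v0 = s_tr / s_tt
    y0 = r_mean - v0 * t_mean

    fitted = [g + v0 * t + y0 for g, t in zip(gravity, times)]   # y_hat
    return sum(abs(y - h) for y, h in zip(fitted, heights))
```

Any trajectory that already follows a parabola with acceleration <math>a</math> gives a loss of (numerically) zero, and adding a constant offset to it does not change the loss, which matches the statement that <math>f^*</math> is only recovered up to an additive constant.<br />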
<br />
[[File:c433li-2.png|650px|center]]<br />
<br />
The data set is collected on a laptop webcam running at 10 frames per second (<math>\Delta t = 0.1s</math>). The camera position is fixed, and 65 diverse trajectories of the object in flight, totalling 602 images, are recorded. For each trajectory, the network is trained on randomly selected intervals of <math>N=5</math> contiguous frames. Images are resized to <math>56 \times 56</math> pixels before going into a small, randomly initialized neural network with no pretraining. The network consists of 3 Conv/ReLU/MaxPool blocks followed by 2 Fully Connected/ReLU layers with probability 0.5 dropout and a single regression output.<br />
<br />
Since scaling <math>y_0</math> and <math>v_0</math> results in the same constraint loss <math>g</math>, the authors evaluate the result by the correlation of predicted heights with ground truth pixel measurements. This method was used because the distance from the object to the camera could not be accurately recorded, and this distance is required to calculate the height in meters. This is not a bulletproof evaluation, and it is discussed in further detail in the critique section. The results are compared to a supervised network trained with the labels to directly predict the height of the object in pixels. The supervised learning task is viewed as a substantially easier task. With this in mind, the table below shows that, under their evaluation criteria, the approach performs well.<br />
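<br />
Correlation is a convenient metric here because it is unchanged by a positive rescaling or a constant shift of the predictions, so heights in unknown units can still be compared with pixel measurements. The sketch below illustrates this with a hand-written Pearson correlation; the numbers are made-up illustrative values, not data from the paper.<br />

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical ground-truth heights in pixels for five frames.
true_pixels = [120.0, 95.0, 80.0, 75.0, 80.0]
# The same trajectory expressed in different units and shifted by a constant.
predicted = [0.02 * p - 1.5 for p in true_pixels]

correlation = pearson(true_pixels, predicted)
```

In this example the correlation is 1 up to floating-point error, even though the two sequences differ in both scale and offset.<br />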
<br />
==== Evaluation ====<br />
{| class="wikitable"<br />
|-<br />
! scope="col" | Method !! scope="col" | Random Uniform Output !! scope="col" | Supervised with Labels !! scope="col" | Approach in this Paper<br />
|-<br />
! scope="row" | Correlation <br />
| 12.1% || 94.5% || 90.1%<br />
|}<br />
<br />
=== Tracking the position of a walking man ===<br />
In the second experiment, they aim to detect the horizontal position of a person walking across a frame without providing direct labels <math>y \in \mathbb{R}</math> by exploiting the assumption that the person will be walking at a constant velocity over short periods of time. This is formulated as a structured prediction problem <math>f: \left(\mathbb{R}^{\text{height} \times \text{width} \times 3} \right)^N \rightarrow \mathbb{R}^N</math>, and each training instance <math>x_i</math> is a vector of images, <math>\mathbf{x}</math>, being mapped to a sequence of predictions, <math>\mathbf{y}</math>. Given the similarities to the first experiment with free falling objects, we might hope to simply remove the gravity term from the equation and retrain. However, in this case, that is not possible, as the constraint provides a necessary, but not sufficient, condition for convergence.<br />
<br />
Given any sequence of correct outputs, <math>(\mathbf{y}_1, \ldots, \mathbf{y}_N)</math>, the modified sequence, <math>(\lambda * \mathbf{y}_1 + C, \ldots, \lambda * \mathbf{y}_N + C)</math> (<math>\lambda, C \in \mathbb{R}</math>) will also satisfy the constant velocity constraint. In the worst case, when <math>\lambda = 0</math>, <math>f \equiv C</math>, and the network can satisfy the constraint while having no dependence on the image. The trivial output is avoided by adding two additional loss terms.<br />
<br />
<center><math>h_1(\mathbf{x}) = -\text{std}(f(\mathbf{x}))</math></center><br />
which seeks to maximize the standard deviation of the output, and<br />
<br />
<center><br />
<math>\begin{split}<br />
h_2(\mathbf{x}) = \hphantom{'} & \text{max}(\text{ReLU}(f(\mathbf{x}) - 10)) \hphantom{\text{ }}+ \\<br />
& \text{max}(\text{ReLU}(0 - f(\mathbf{x})))<br />
\end{split}<br />
</math><br />
</center><br />
which limits the output to a fixed range <math>[0, 10]</math>. The final loss is thus:<br />
<br />
<center><br />
<math><br />
\begin{split}<br />
g(\mathbf{x}) = \hphantom{'} & ||(\mathbf{A} (\mathbf{A}^T\mathbf{A})^{-1} \mathbf{A}^T - \mathbf{I}) * f(\mathbf{x})||_1 \hphantom{\text{ }}+ \\<br />
& \gamma_1 * h_1(\mathbf{x}) <br />
\hphantom{\text{ }}+ \\<br />
& \gamma_2 * h_2(\mathbf{x})<br />
\end{split}<br />
</math><br />
</center><br />
<br />
[[File:c433li-3.png|650px|center]]<br />
<br />
The data set contains 11 trajectories across 6 distinct scenes, totalling 507 images resized to <math>56 \times 56</math>. The network is trained to output linearly consistent positions on 5 strided frames from the first half of each trajectory, and is evaluated on the second half. The boundary violation penalty is set to <math>\gamma_2 = 0.8</math> and the standard deviation bonus is set to <math>\gamma_1 = 0.6</math>.<br />
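<br />
As a rough illustration of how this loss behaves, the following Python sketch evaluates <math>g</math> on a single sequence of predicted positions. It is not the authors' implementation: the function names are hypothetical, frames are assumed to be equally spaced (so that <math>\mathbf{A}</math> has the columns <math>(0, 1, \ldots, N-1)</math> and <math>(1, \ldots, 1)</math>), and the population standard deviation is assumed for <math>h_1</math>.<br />

```python
import statistics


def line_fit_residuals(y):
    """Residuals of the least-squares line through (0, y[0]), ..., (N-1, y[N-1]).

    Equivalent to (A (A^T A)^{-1} A^T - I) y with A = [t, 1] and t = 0..N-1.
    """
    n = len(y)
    t_mean = (n - 1) / 2.0
    y_mean = sum(y) / n
    s_tt = sum((t - t_mean) ** 2 for t in range(n))
    s_ty = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    slope = s_ty / s_tt
    intercept = y_mean - slope * t_mean
    return [slope * t + intercept - v for t, v in enumerate(y)]


def walking_loss(y, gamma1=0.6, gamma2=0.8, low=0.0, high=10.0):
    """Constraint loss plus the two terms that rule out trivial outputs."""
    constraint = sum(abs(r) for r in line_fit_residuals(y))  # L1 norm of the residual
    h1 = -statistics.pstdev(y)                               # rewards spread-out outputs
    h2 = max(max(v - high, 0.0) for v in y) + max(max(low - v, 0.0) for v in y)
    return constraint + gamma1 * h1 + gamma2 * h2


constant = [5.0] * 5                  # trivial output that ignores the images
linear = [1.0, 3.0, 5.0, 7.0, 9.0]    # constant velocity, inside [0, 10]
print(walking_loss(constant), walking_loss(linear))
```

Under these assumptions a constant prediction has a loss of exactly zero, while an evenly spaced sequence inside <math>[0, 10]</math> has a negative loss, which is how the extra terms steer the network away from the trivial output.<br />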
<br />
As in the previous experiment, the result is evaluated by its correlation with the ground truth. The results are as follows:<br />
==== Evaluation ====<br />
{| class="wikitable"<br />
|-<br />
! scope="col" | Method !! scope="col" | Random Uniform Output !! scope="col" | Supervised with Labels !! scope="col" | Approach in this Paper<br />
|-<br />
! scope="row" | Correlation <br />
| 45.9% || 80.5% || 95.4%<br />
|}<br />
Surprisingly, the approach in this paper beats the same network trained with direct labeled supervision on the test set. This can be attributed to the supervised network overfitting the small amount of training data available (its correlation on the training data reached 99.8%).<br />
<br />
=== Detecting objects with causal relationships ===<br />
In the previous experiments, the authors explored options for incorporating constraints pertaining to dynamics equations in real-world phenomena, i.e., prior knowledge derived from elementary physics. In this experiment, the authors explore the possibilities of learning from logical constraints imposed on single images. More specifically, they ask whether it is possible to learn from causal phenomena.<br />
<br />
[[File:paper18_Experiment_3.png|400px|center]]<br />
<br />
Here, the authors provide images containing a stochastic collection of up to four characters: Peach, Mario, Yoshi, and Bowser, with each character having small appearance changes across frames due to rotation and reflection. Example images can be seen in Fig. (4). While the existence of objects in each frame is non-deterministic, the generating distribution encodes the underlying phenomenon that Mario will always appear whenever Peach appears. The aim is to create a pair of neural networks <math>f_1, f_2</math> for identifying Peach and Mario, respectively. The networks, <math>f_k : \mathbb{R}^{\text{height} \times \text{width} \times 3} \rightarrow \{0, 1\}</math>, map the image to the discrete boolean variables, <math>y_1</math> and <math>y_2</math>. Rather than supervising with direct labels, the authors train the networks by constraining their outputs to have the logical relationship <math>y_1 \Rightarrow y_2</math>. This problem is challenging because the networks must simultaneously learn to recognize the characters and select them according to logical relationships. To avoid the trivial solution <math>y_1 \equiv 1, y_2 \equiv 1</math> on every image, three additional loss terms need to be added:<br />
<br />
<center><math> h_1(\mathbf{x}, k) = \frac{1}{M}\sum_i^M |Pr[f_k(\mathbf{x}) = 1] - Pr[f_k(\rho(\mathbf{x})) = 1]|, </math></center><br />
<br />
which forces rotational independence of the outputs in order to encourage the network to learn the existence, rather than location of objects, <br />
<br />
<center><math> h_2(\mathbf{x}, k) = -\text{std}_{i \in [1 \dots M]}(Pr[f_k(\mathbf{x}_i) = 1]), </math></center><br />
<br />
which seeks high variance outputs, and<br />
<br />
<center><br />
<math> h_3(\mathbf{x}, v) = \frac{1}{M}\sum_i^{M} (Pr[f(\mathbf{x}_i) = v] - \frac{1}{3} + (\frac{1}{3} - \mu_v))^2 \\<br />
\mu_{v} = \frac{1}{M}\sum_i^{M} \mathbb{1}\{v = \text{argmax}_{v' \in \{0, 1\}^2} Pr[f(\mathbf{x}) = v']\}. </math><br />
</center><br />
<br />
which seeks high entropy outputs. The final loss function then becomes: <br />
<br />
<center><br />
<math> \begin{split}<br />
g(\mathbf{x}) & = \mathbb{1}\{f_1(\mathbf{x}) \nRightarrow f_2(\mathbf{x})\} \hphantom{\text{ }} + \\<br />
& \sum_{k \in \{1, 2\}} \gamma_1 h_1(\mathbf{x}, k) + \gamma_2 h_2(\mathbf{x}, k) + <br />
\hspace{-0.7em} \sum_{v \neq \{1,0\}} \hspace{-0.7em} \gamma_3 * h_3(\mathbf{x}, v)<br />
\end{split}<br />
</math><br />
</center><br />
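<br />
The interplay between the implication term and the variance term can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the probabilities are thresholded at 0.5 to obtain boolean outputs, the implication indicator is averaged over the batch, and the terms <math>h_1</math> and <math>h_3</math> are left out for brevity.<br />

```python
import statistics


def implication_violation(y1, y2):
    """Indicator 1{y1 does not imply y2}: only (y1, y2) = (1, 0) breaks y1 => y2."""
    return 1 if (y1 == 1 and y2 == 0) else 0


def batch_loss(p1, p2, gamma2=0.65):
    """p1[i], p2[i]: predicted probability that Peach / Mario is in image i."""
    y1 = [int(p > 0.5) for p in p1]   # assumption: threshold at 0.5
    y2 = [int(p > 0.5) for p in p2]
    violations = sum(implication_violation(a, b) for a, b in zip(y1, y2)) / len(y1)
    h2 = -statistics.pstdev(p1) - statistics.pstdev(p2)  # summed over k = 1, 2
    return violations + gamma2 * h2


trivial = batch_loss([1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0])
consistent = batch_loss([1.0, 0.0, 0.0, 1.0], [1.0, 1.0, 0.0, 1.0])
violating = batch_loss([1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0])
print(trivial, consistent, violating)
```

In this toy batch the trivial "always present" output satisfies the implication but earns no variance bonus, so a varied output that still respects <math>y_1 \Rightarrow y_2</math> obtains the lowest loss of the three.<br />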
<br />
====Evaluation====<br />
<br />
The input images, shown in Figure 4, are 56 × 56 pixels. The authors used <math>\gamma_1 = 0.65, \gamma_2 = 0.65, \gamma_3 = 0.95</math>, and trained for 4,000 iterations. This experiment demonstrates that networks can learn from constraints that operate over discrete sets with potentially complex logical rules. Removing constraints will cause learning to fail. Thus, the experiment also shows that sophisticated sufficiency conditions can be key to success when learning from constraints.<br />
<br />
== Conclusion and Critique ==<br />
This paper has introduced a method for using physics and other domain constraints to supervise neural networks. However, the approach described in this paper is not entirely new. Similar ideas are already widely used in Q-learning, where the true Q values are not available and the network is instead supervised by a constraint, as in Deep Q-Learning (DQN) (Mnih, Riedmiller et al. 2013[2]). DQN also uses a deep neural network that is trained with a constraint, here the Bellman equation, much as this paper proposes:<br />
<center><math>Q(s,a) = R(s,a) + \gamma \, \mathbb{E}_{s' \sim P_{sa}}\left[\max_{a'}Q(s',a')\right]</math></center><br />
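<br />
To make the analogy concrete, the sketch below shows how a single observed transition turns this constraint into a regression target, as is commonly done in Q-learning. The function names and the default discount factor are illustrative assumptions; this is not taken from the DQN code.<br />

```python
def bellman_target(reward, next_q_values, gamma=0.99, terminal=False):
    """Sampled right-hand side of the Bellman equation for one transition."""
    if terminal:
        return reward                      # no future value after a terminal state
    return reward + gamma * max(next_q_values)


def td_error_squared(q_sa, reward, next_q_values, gamma=0.99, terminal=False):
    """Squared violation of the Bellman constraint by the current estimate q_sa."""
    target = bellman_target(reward, next_q_values, gamma, terminal)
    return (q_sa - target) ** 2


# One transition with reward 1.0 and two possible next actions.
print(bellman_target(1.0, [0.5, 2.0], gamma=0.9))
print(td_error_squared(2.0, 1.0, [0.5, 2.0], gamma=0.9))
```

As with the constraint losses in the paper, no ground-truth value of <math>Q</math> is needed: the network is penalized only for being inconsistent with the constraint.<br />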
<br />
<br />
Also, the paper has a mistake where they quote the free fall equation as<br />
<center><math>\mathbf{y}_i = y_0 + v_0(i\Delta t) + a(i\Delta t)^2</math></center><br />
which should be<br />
<center><math>\mathbf{y}_i = y_0 + v_0(i\Delta t) + \frac{1}{2} a(i\Delta t)^2</math></center><br />
Although in this case it doesn't affect the result.<br />
<br />
<br />
For the evaluation of the experiments, correlation with ground truth was used as the metric, because the output can be scaled without affecting the constraint loss. This is fine if the network gives outputs on the same scale for all inputs. However, it is possible that the network gives outputs of varying scale for different inputs, in which case we have no confidence that the network has learnt correctly, even though the learnt outcome may be strongly correlated with the ground truth. In fact, to solve the scaling issue, a better approach is to combine the constraints introduced in this paper with some labeled training data. It is not clear why the authors did not experiment with a combination of these two losses.<br />
<br />
With regard to the free fall experiment in particular, the authors apply a fixed acceleration model to create the constraint loss, aiming to have the network predict height. However, since they did not measure the true height of the object to create test labels, they evaluate using height in pixel space. They do not mention the accuracy of their camera calibration, nor what camera model was used to remove lens distortion. Since lens distortion tends to be worse at the extreme edges of the image, and they tossed the pillow throughout the entire frame, it is likely that the ground truth labels were corrupted by distortion. If that is the case, it is possible the supervised network is actually performing worse because it is learning how to predict distorted (beyond a constant scaling factor) heights instead of the true height.<br />
<br />
These methods essentially boil down to generating approximate labels for training data using some knowledge of the dynamic that the labels should follow.<br />
<br />
Finally, this paper only picks examples where the constraints are easy to design, while in some more common tasks such as image classification, what kind of constraints are needed is not straightforward at all.<br />
<br />
== References ==<br />
[1] LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature 521(7553):436–444.<br />
<br />
[2] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with Deep Reinforcement Learning. arxiv 1312.5602.<br />
<br />
[3] “Russell91/Labelfree.” GitHub, github.com/russell91/labelfree.</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Word_translation_without_parallel_data&diff=36325Word translation without parallel data2018-04-18T21:24:58Z<p>Ws2chen: /* Results */</p>
<hr />
<div>[[File:Toy_example.png]]<br />
<br />
= Presented by =<br />
<br />
Xia Fan<br />
<br />
= Introduction =<br />
<br />
Many successful methods for learning relationships between languages stem from the hypothesis that there is a relationship between the context of words and their meanings. This means that if an adequate representation of a language is found in a high-dimensional space (this is called an embedding), then words similar to a given word are close to one another in this space (e.g. some norm can be minimized to find a word with similar context). Historically, another significant hypothesis is that these embedding spaces show similar structures across different languages. That is to say, given an embedding space for English and one for Spanish, a mapping could be found that aligns the two spaces, and such a mapping could be used as a tool for translation. Many papers exploit these hypotheses, but use large parallel datasets for training. Recently, to remove the need for supervised training, methods have been implemented that utilize identical character strings (e.g. letters or digits) in order to try to align the embeddings. The downside of this approach is that the two languages need to be similar to begin with, as they need to have some shared basic building block. The method proposed in this paper uses an adversarial method to find this mapping between the embedding spaces of two languages without the use of large parallel datasets.<br />
<br />
The contributions of this paper can be listed as follows: <br />
<br />
1. This paper introduces a model that is either on par with or outperforms supervised state-of-the-art methods, without employing any cross-lingual annotated data such as bilingual dictionaries or parallel corpora (large and structured sets of texts). This method uses an idea similar to GANs: it leverages adversarial training to learn a linear mapping from the source space to the target space. A discriminator is trained to distinguish between the mapped source embeddings and the target embeddings, while the mapping is jointly trained to fool the discriminator. <br />
<br />
2. Second, this paper extracts a synthetic dictionary from the resulting shared embedding space and fine-tunes the mapping with the closed-form Procrustes solution from Schonemann (1966). <br />
<br />
3. Third, this paper also introduces an unsupervised selection metric that is highly correlated with the mapping quality and that the authors use both as a stopping criterion and to select the best hyper-parameters. <br />
<br />
4. Fourth, they introduce a cross-domain similarity adaptation to mitigate the so-called hubness problem (points tending to be nearest neighbors of many points in high-dimensional spaces).<br />
<br />
5. They demonstrate the effectiveness of their method on a low-resource language pair for which parallel corpora are not available (English-Esperanto), a setting to which the method is particularly suited.<br />
<br />
This paper is published in ICLR 2018.<br />
<br />
= Related Work =<br />
<br />
'''Bilingual Lexicon Induction'''<br />
<br />
Many papers have addressed this subject by using discrete word representations. However, these methods usually need to be initialized with prior knowledge, such as the edit distance between the input and output ground truth. Unfortunately, this only works for closely related languages.<br />
<br />
= Model =<br />
<br />
<br />
=== Estimation of Word Representations in Vector Space ===<br />
<br />
This model focuses on learning a mapping between the two sets of embeddings such that translations are close in the shared space. Before describing the model itself, it is useful to introduce an earlier approach that exploits the similarities of monolingual embedding spaces. Mikolov et al. (2013) use a known dictionary of n=5000 pairs of words <math> \{x_i,y_i\}_{i\in\{1,\ldots,n\}} </math> and learn a linear mapping W between the source and the target space such that <br />
<br />
\begin{align}<br />
W^*=argmin_{W{\in}M_d(R)}||WX-Y||_F \hspace{1cm} (1)<br />
\end{align}<br />
<br />
where d is the dimension of the embeddings, <math> M_d(R) </math> is the space of <math> d \times d </math> matrices of real numbers, and X and Y are two aligned matrices of size <math> d \times n </math> containing the embeddings of the words in the parallel vocabulary. Here <math>||\cdot||_F</math> is the Frobenius matrix norm, which is the square root of the sum of the squared components.<br />
<br />
Xing et al. (2015) showed that these results are improved by enforcing an orthogonality constraint on W, i.e. by restricting W to the set <math> O_d(R) </math> of <math> d \times d </math> orthogonal matrices. In that case, equation (1) boils down to the Procrustes problem, a matrix approximation problem in which the goal is to find the orthogonal matrix that most closely maps one given matrix onto another, as measured by the Frobenius norm. It advantageously offers a closed-form solution obtained from the singular value decomposition (SVD) of <math> YX^T </math>:<br />
<br />
\begin{align}<br />
W^*=argmin_{W{\in}O_d(R)}||WX-Y||_F=UV^T\textrm{, with }U\Sigma V^T=SVD(YX^T). \hspace{1cm} (2)<br />
\end{align}<br />
<br />
<br />
This can be proven as follows. First note that <br />
\begin{align}<br />
&||WX-Y||_F^2\\<br />
&= \langle WX-Y, WX-Y\rangle_F\\ <br />
&= \langle WX, WX \rangle_F -2 \langle W X, Y \rangle_F + \langle Y, Y \rangle_F \\<br />
&= ||X||_F^2 -2 \langle W X, Y \rangle_F + || Y||_F^2, <br />
\end{align}<br />
<br />
where <math display="inline"> \langle \cdot, \cdot \rangle_F </math> denotes the Frobenius inner-product and we have used the orthogonality of <math display="inline"> W </math>. It follows that we need only maximize the inner-product above. Let <math display="inline"> u_1, \ldots, u_d </math> denote the columns of <math display="inline"> U </math>. Let <math display="inline"> v_1, \ldots , v_d </math> denote the columns of <math display="inline"> V </math>. Let <math display="inline"> \sigma_1, \ldots, \sigma_d </math> denote the diagonal entries of <math display="inline"> \Sigma </math>. We have<br />
\begin{align}<br />
&\langle W X, Y \rangle_F \\<br />
&= \text{Tr} (W^T Y X^T)\\<br />
& =\text{Tr}(W^T \sum_i \sigma_i u_i v_i^T)\\<br />
&=\sum_i \sigma_i \text{Tr}(W^T u_i v_i^T)\\<br />
&=\sum_i \sigma_i ((Wv_i)^T u_i ) \quad \text{(invariance of trace under cyclic permutations)}\\<br />
&\le \sum_i \sigma_i ||Wv_i|| \, ||u_i|| \quad \text{(Cauchy-Schwarz inequality)}\\<br />
&= \sum_i \sigma_i<br />
\end{align}<br />
where we have used the invariance of trace under cyclic permutations, Cauchy-Schwarz, and the orthogonality of the columns of U and V. Note that choosing <br />
\begin{align}<br />
W=UV^T<br />
\end{align}<br />
achieves the bound. This completes the proof.<br />
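<br />
The closed-form solution is short enough to check numerically. The sketch below assumes NumPy is available and uses synthetic data in place of real word embeddings; the function name <code>procrustes</code> is a hypothetical choice for this illustration.<br />

```python
import numpy as np


def procrustes(X, Y):
    """Orthogonal W minimizing ||W X - Y||_F for X, Y of shape (d, n)."""
    U, _, Vt = np.linalg.svd(Y @ X.T)   # Y X^T = U diag(sigma) V^T
    return U @ Vt


rng = np.random.default_rng(0)
d, n = 5, 50
X = rng.standard_normal((d, n))                          # synthetic "source" embeddings
W_true, _ = np.linalg.qr(rng.standard_normal((d, d)))    # a random orthogonal matrix
Y = W_true @ X                                           # "target" = rotated source

W = procrustes(X, Y)
print(np.allclose(W, W_true), np.allclose(W.T @ W, np.eye(d)))
```

Because the synthetic target is an exact orthogonal transformation of the source, the recovered matrix should coincide with the one used to generate the data; with real embeddings the fit is only approximate.<br />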
<br />
=== Domain-adversarial setting ===<br />
<br />
This paper shows how to learn this mapping W without cross-lingual supervision. An illustration of the approach is given in Fig. 1. First, the model learns an initial proxy of W by using an adversarial criterion. Then, it uses the words that match the best as anchor points for Procrustes. Finally, it improves performance on less frequent words by changing the metric of the space, which spreads out points that lie in dense regions. <br />
<br />
[[File:Toy_example.png |frame|none|alt=Alt text|Figure 1: Toy illustration of the method. (A) There are two distributions of word embeddings, English words in red denoted by X and Italian words in blue denoted by Y , which we want to align/translate. Each dot represents a word in that space. The size of the dot is proportional to the frequency of the words in the training corpus of that language. (B) Using adversarial learning, we learn a rotation matrix W which roughly aligns the two distributions. The green stars are randomly selected words that are fed to the discriminator to determine whether the two word embeddings come from the same distribution. (C) The mapping W is further refined via Procrustes. This method uses frequent words aligned by the previous step as anchor points, and minimizes an energy function that corresponds to a spring system between anchor points. The refined mapping is then used to map all words in the dictionary. (D) Finally, we translate by using the mapping W and a distance metric, dubbed CSLS, that expands the space where there is high density of points (like the area around the word “cat”), so that “hubs” (like the word “cat”) become less close to other word vectors than they would otherwise (compare to the same region in panel (A)).]]<br />
<br />
Let <math> X=\{x_1,...,x_n\} </math> and <math> Y=\{y_1,...,y_m\} </math> be two sets of n and m word embeddings coming from a source and a target language respectively. A model is trained to discriminate between elements randomly sampled from <math> WX=\{Wx_1,...,Wx_n\} </math> and Y; this model is called the discriminator. W is trained to prevent the discriminator from making accurate predictions. As a result, this is a two-player adversarial game, where the discriminator aims at maximizing its ability to identify the origin of an embedding, and W aims at preventing the discriminator from doing so by making WX and Y as similar as possible. This approach is in line with the work of Ganin et al. (2016), who proposed to learn latent representations invariant to the input domain, where in this case, a domain is represented by a language (source or target).<br />
<br />
1. Discriminator objective<br />
<br />
Refer to the discriminator parameters as <math> \theta_D </math>. Consider the probability <math> P_{\theta_D}(source = 1|z) </math> that a vector z is the mapping of a source embedding (as opposed to a target embedding) according to the discriminator. The discriminator loss can be written as:<br />
<br />
\begin{align}<br />
L_D(\theta_D|W)=-\frac{1}{n} \sum_{i=1}^n log P_{\theta_D}(source=1|Wx_i)-\frac{1}{m} \sum_{i=1}^m log P_{\theta_D}(source=0|y_i)<br />
\end{align}<br />
<br />
2. Mapping objective <br />
<br />
In the unsupervised setting, W is now trained so that the discriminator is unable to accurately predict the embedding origins: <br />
<br />
\begin{align}<br />
L_W(W|\theta_D)=-\frac{1}{n} \sum_{i=1}^n log P_{\theta_D}(source=0|Wx_i)-\frac{1}{m} \sum_{i=1}^m log P_{\theta_D}(source=1|y_i)<br />
\end{align}<br />
<br />
3. Learning algorithm <br />
To train the model, the authors follow the standard training procedure of deep adversarial networks of Goodfellow et al. (2014). For every input sample, the discriminator and the mapping matrix W are trained successively with stochastic gradient updates to minimize <math> L_D </math> and <math> L_W </math>, respectively.<br />
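<br />
The two objectives differ only in which label each embedding is scored against. The following sketch computes both losses from the discriminator's output probabilities; the function names and the toy probabilities are assumptions made for illustration, and the discriminator network itself is not modeled.<br />

```python
import math


def discriminator_loss(p_mapped, p_target):
    """L_D: p_mapped[i] = P(source=1 | W x_i), p_target[j] = P(source=1 | y_j)."""
    n, m = len(p_mapped), len(p_target)
    return (-sum(math.log(p) for p in p_mapped) / n
            - sum(math.log(1.0 - p) for p in p_target) / m)


def mapping_loss(p_mapped, p_target):
    """L_W: the same probabilities scored against the flipped labels."""
    n, m = len(p_mapped), len(p_target)
    return (-sum(math.log(1.0 - p) for p in p_mapped) / n
            - sum(math.log(p) for p in p_target) / m)


# A discriminator that separates the two languages well ...
sharp = ([0.9, 0.8], [0.1, 0.2])
# ... and one that has been fooled completely.
fooled = ([0.5, 0.5], [0.5, 0.5])
print(discriminator_loss(*sharp), mapping_loss(*sharp))
print(discriminator_loss(*fooled), mapping_loss(*fooled))
```

When the discriminator is accurate, <math> L_D </math> is small and <math> L_W </math> is large; when it outputs 0.5 everywhere, both losses equal <math> 2\log 2 </math>, which is the balance point of the game.<br />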
<br />
=== Refinement procedure ===<br />
<br />
The matrix W obtained with adversarial training gives good performance (see Table 1), but the results are still not on par with the supervised approach. In fact, the adversarial approach tries to align all words irrespective of their frequencies. However, rare words have embeddings that are less updated and are more likely to appear in different contexts in each corpus, which makes them harder to align. Under the assumption that the mapping is linear, it is then better to infer the global mapping using only the most frequent words as anchors. Besides, the accuracy on the most frequent word pairs is high after adversarial training.<br />
To refine the mapping, the authors build a synthetic parallel vocabulary using the W just learned with adversarial training. Specifically, they consider the most frequent words and retain only mutual nearest neighbors to ensure a high-quality dictionary. Subsequently, they apply the Procrustes solution in (2) to this generated dictionary. Using the improved solution generated with the Procrustes algorithm, it is possible to generate a more accurate dictionary and apply this method iteratively, similarly to Artetxe et al. (2017). However, given that the synthetic dictionary obtained using adversarial training is already strong, the authors observe only small improvements when doing more than one iteration, i.e., the improvements on the word translation task are usually below 1%.<br />
<br />
=== Cross-Domain Similarity Local Scaling (CSLS) ===<br />
<br />
This paper considers a bi-partite neighborhood graph, in which each word of a given dictionary is connected to its K nearest neighbors in the other language. A bipartite graph (or bigraph) is a graph whose vertices can be divided into two disjoint and independent sets U and V such that every edge connects a vertex in U to one in V.<br />
<br />
<math> N_T(Wx_s) </math> is used to denote the neighborhood, on this bi-partite graph, associated with a mapped source word embedding <math> Wx_s </math>. All K elements of <math> N_T(Wx_s) </math> are words from the target language. Similarly we denote by <math> N_S(y_t) </math> the neighborhood associated with a word t of the target language. Consider the mean similarity of a source embedding <math> x_s </math> to its target neighborhood as<br />
<br />
\begin{align}<br />
r_T(Wx_s)=\frac{1}{K}\sum_{y_t\in N_T(Wx_s)}cos(Wx_s,y_t)<br />
\end{align}<br />
<br />
where cos(.,.) is the cosine similarity, i.e. the cosine of the angle between two vectors. Likewise, the mean similarity of a target word <math> y_t </math> to its neighborhood is denoted as <math> r_S(y_t) </math>. These quantities are used to define a similarity measure CSLS(.,.) between mapped source words and target words as <br />
<br />
\begin{align}<br />
CSLS(Wx_s,y_t)=2cos(Wx_s,y_t)-r_T(Wx_s)-r_S(y_t)<br />
\end{align}<br />
<br />
This process increases the similarity associated with isolated word vectors, but decreases the similarity of vectors lying in dense areas. <br />
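<br />
A small worked example may help to see the effect. The sketch below computes CSLS from a precomputed matrix of cosine similarities; the function names and the toy similarity values are assumptions chosen so that target word 0 acts as a hub, and this is not the implementation from the paper's code base.<br />

```python
def top_k_mean(values, k):
    """Mean of the k largest entries, i.e. mean similarity to the K nearest neighbors."""
    return sum(sorted(values, reverse=True)[:k]) / k


def csls(sim, k):
    """sim[s][t] = cos(W x_s, y_t). Returns the matrix of CSLS scores."""
    r_T = [top_k_mean(row, k) for row in sim]          # one value per source word
    r_S = [top_k_mean(col, k) for col in zip(*sim)]    # one value per target word
    return [[2 * sim[s][t] - r_T[s] - r_S[t] for t in range(len(r_S))]
            for s in range(len(r_T))]


def argmax_rows(matrix):
    """Index of the best-scoring target word for every source word."""
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]


# Toy similarities: target word 0 is close to every source word (a hub).
sim = [[0.9, 0.5, 0.4],
       [0.8, 0.75, 0.3],
       [0.7, 0.2, 0.1]]
print(argmax_rows(sim))              # plain nearest neighbors: [0, 0, 0]
print(argmax_rows(csls(sim, k=2)))   # CSLS: [0, 1, 0]
```

With plain cosine similarity every source word is matched to the hub, whereas under CSLS the second source word is matched to target word 1, because the hub's score is discounted by its high mean similarity to its own neighborhood.<br />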
<br />
CSLS represents an improved measure for producing reliable matching words between two languages (i.e. neighbors of a word in one language should ideally correspond to the same words in the second language). The nearest neighbors algorithm is asymmetric, and in high-dimensional spaces, it suffers from the problem of hubness, in which some points are nearest neighbors to exceptionally many points, while others are not nearest neighbors to any points. Existing approaches for combating the effect of hubness on word translation retrieval involve performing similarity updates one language at a time without consideration for the other language in the pair (Dinu et al., 2015, Smith et al., 2017). Consequently, they yielded less accurate results when compared to CSLS in experiments conducted in this paper (Table 1).<br />
<br />
= Training and architectural choices =<br />
=== Architecture ===<br />
<br />
The authors use unsupervised word vectors that were trained using fastText. These correspond to monolingual embeddings of dimension 300 trained on Wikipedia corpora; therefore, the mapping W has size 300 × 300. Words are lower-cased, and those that appear fewer than 5 times are discarded for training. As a post-processing step, only the 200k most frequent words were selected in the experiments.<br />
For the discriminator, the authors use a multilayer perceptron with two hidden layers of size 2048 and Leaky-ReLU activation functions. The input to the discriminator is corrupted with dropout noise with a rate of 0.1. As suggested by Goodfellow (2016), a smoothing coefficient s = 0.2 is included in the discriminator predictions. Training uses stochastic gradient descent with a batch size of 32, a learning rate of 0.1 and a decay of 0.95, both for the discriminator and W. <br />
<br />
=== Discriminator inputs ===<br />
The embedding quality of rare words is generally not as good as that of frequent words (Luong et al., 2013), and it was observed that feeding the discriminator with rare words had a small but non-negligible negative impact. As a result, the authors only feed the discriminator with the 50,000 most frequent words. At each training step, the word embeddings given to the discriminator are sampled uniformly. Sampling them according to the word frequency did not have any noticeable impact on the results.<br />
<br />
=== Orthogonality===<br />
In this work, the authors propose to use a simple update step to ensure that the matrix W stays close to an orthogonal matrix during training (Cisse et al. (2017)). Specifically, the following update rule on the matrix W is used:<br />
<br />
\begin{align}<br />
W \leftarrow (1+\beta)W-\beta(WW^T)W<br />
\end{align}<br />
<br />
where β = 0.01 is usually found to perform well. This method ensures that the matrix stays close to the manifold of orthogonal matrices after each update.<br />
<br />
This update rule can be justified as follows. Consider the function <br />
\begin{align}<br />
g: \mathbb{R}^{d\times d} \to \mathbb{R}^{d \times d}<br />
\end{align}<br />
defined by<br />
\begin{align}<br />
g(W)= W^T W -I.<br />
\end{align}<br />
<br />
The derivative of g at W is the linear map<br />
\begin{align}<br />
Dg[W]: \mathbb{R}^{d \times d} \to \mathbb{R}^{d \times d}<br />
\end{align}<br />
defined by<br />
\begin{align}<br />
Dg[W](H)= H^T W + W^T H.<br />
\end{align}<br />
<br />
The adjoint of this linear map is<br />
<br />
\begin{align}<br />
D^\ast g[W](H)= WH^T +WH.<br />
\end{align}<br />
<br />
Now consider the function f<br />
\begin{align}<br />
f: \mathbb{R}^{d \times d} \to \mathbb{R}<br />
\end{align}<br />
<br />
defined by<br />
<br />
\begin{align}<br />
f(W)=||g(W) ||_F^2=||W^TW -I ||_F^2.<br />
\end{align}<br />
<br />
f has gradient:<br />
\begin{align}<br />
\nabla f (W) = 2D^\ast g[W] (g(W ) ) =2W(W^TW-I) +2W(W^TW-I)=4W W^TW-4W.<br />
\end{align}<br />
Equivalently, using the standard rules for differentiating a trace,<br />
\begin{align}<br />
\nabla f (W) = \nabla||W^TW-I||_F^2 = \nabla\text{Tr}\left((W^TW-I)^T(W^TW-I)\right)=4W(W^TW-I).<br />
\end{align}<br />
<br />
Thus the update<br />
\begin{align}<br />
W \leftarrow (1+\beta)W-\beta(WW^T)W<br />
\end{align}<br />
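amounts to a step in the direction opposite the gradient of f, with step size <math>\beta/4</math>, since <math>W - \frac{\beta}{4}\nabla f(W) = (1+\beta)W - \beta WW^TW</math>. That is, it is a step toward the set of orthogonal matrices.<br />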
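<br />
The effect of the update can be observed on a small matrix. The sketch below applies the rule repeatedly to a 2 × 2 matrix that is close to, but not exactly, orthogonal; the helper names and the starting matrix are illustrative assumptions, and in actual training this step would be interleaved with the adversarial updates of W.<br />

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def orth_error(W):
    """f(W) = ||W^T W - I||_F^2, which is zero exactly when W is orthogonal."""
    G = matmul(transpose(W), W)
    d = len(W)
    return sum((G[i][j] - (1.0 if i == j else 0.0)) ** 2
               for i in range(d) for j in range(d))


def orthogonalize_step(W, beta=0.01):
    """One update W <- (1 + beta) W - beta (W W^T) W."""
    WWtW = matmul(matmul(W, transpose(W)), W)
    return [[(1 + beta) * w - beta * c for w, c in zip(row_w, row_c)]
            for row_w, row_c in zip(W, WWtW)]


W0 = [[1.1, 0.2], [-0.1, 0.9]]   # roughly orthogonal starting point
W = W0
for _ in range(1000):
    W = orthogonalize_step(W)
print(orth_error(W0), orth_error(orthogonalize_step(W0)), orth_error(W))
```

In this example the error shrinks with every step, which is consistent with the gradient-descent interpretation above.<br />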
<br />
=== Dictionary generation ===<br />
The refinement step requires the generation of a new dictionary at each iteration. In order for the Procrustes solution to work well, it is best to apply it on correct word pairs. As a result, the CSLS method is used to select more accurate translation pairs in the dictionary. To further increase the quality of the dictionary, and ensure that W is learned from correct translation pairs, only mutual nearest neighbors were considered, i.e. pairs of words that are mutually nearest neighbors of each other according to CSLS. This significantly decreases the size of the generated dictionary, but improves its accuracy, as well as the overall performance.<br />
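<br />
The mutual nearest neighbor filter can be sketched as follows, given a matrix of similarity scores between source and target words. The function name and the toy scores are assumptions for illustration; any similarity measure, such as CSLS, could supply the scores.<br />

```python
def mutual_nearest_neighbors(scores):
    """scores[s][t]: similarity between source word s and target word t.

    Returns the (s, t) pairs in which t is the best target for s
    and s is the best source for t.
    """
    n_src, n_tgt = len(scores), len(scores[0])
    best_tgt = [max(range(n_tgt), key=row.__getitem__) for row in scores]
    best_src = [max(range(n_src), key=lambda s: scores[s][t]) for t in range(n_tgt)]
    return [(s, t) for s, t in enumerate(best_tgt) if best_src[t] == s]


scores = [[0.25, -0.325, -0.25],
          [-0.025, 0.1, -0.525],
          [0.1, -0.675, -0.6]]
print(mutual_nearest_neighbors(scores))   # [(0, 0), (1, 1)]
```

Source word 2 prefers target word 0, but target word 0 prefers source word 0, so that pair is dropped. This is the sense in which the filter trades dictionary size for accuracy.<br />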
<br />
=== Validation criterion for unsupervised model selection ===<br />
<br />
The authors consider the 10k most frequent source words, use CSLS to generate a translation for each of them, then compute the average cosine similarity between these source words and their deemed translations, and use this average as a validation metric. The choice of the 10 thousand most frequent source words requires more justification, since we would expect those to be the best-trained words, and they may not accurately represent the entire data set. Perhaps a k-fold cross validation approach should be used instead. Figure 2 below shows the correlation between the evaluation score and this unsupervised criterion (without stabilization by learning rate shrinkage).<br />
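<br />
A minimal sketch of this criterion, assuming that the cosine similarities and the CSLS scores between the selected source words and all target words have already been computed, could look as follows. The function name and the toy matrices are hypothetical.<br />

```python
def validation_criterion(cos_sim, csls_scores):
    """Mean cosine similarity between each source word and its CSLS translation."""
    picks = [max(range(len(row)), key=row.__getitem__) for row in csls_scores]
    return sum(cos_sim[s][t] for s, t in enumerate(picks)) / len(picks)


cos_sim = [[0.9, 0.5, 0.4],
           [0.8, 0.75, 0.3],
           [0.7, 0.2, 0.1]]
csls_scores = [[0.25, -0.325, -0.25],
               [-0.025, 0.1, -0.525],
               [0.1, -0.675, -0.6]]
print(validation_criterion(cos_sim, csls_scores))
```

The value needs no bilingual dictionary, so it can be tracked during training and used for early stopping or for hyper-parameter selection.<br />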
<br />
<br />
<br />
[[File:fig2_fan.png |frame|none|alt=Alt text|Figure 2: Unsupervised model selection.<br />
Correlation between the unsupervised validation criterion (black line) and actual word translation accuracy (blue line). In this particular experiment, the selected model is at epoch 10. Observe how the criterion is well correlated with translation accuracy.]]<br />
<br />
= Performance Analysis =<br />
<br />
== Experiments ==<br />
<br />
To illustrate the effectiveness of the methodology, the authors demonstrate the unsupervised approach on several benchmarks and compare it with state-of-the-art supervised methods, to see whether the unsupervised model can learn equally well or better. They first present the cross-lingual evaluation tasks used to evaluate the quality of the cross-lingual word embeddings, and then the baseline model. Lastly, after comparing the unsupervised approach to the baseline and to previous methods, they conclude with a complementary analysis of the alignment of several sets of English embeddings trained with different methods and corpora.<br />
<br />
== Results ==<br />
<br />
The results on word translation retrieval using the bilingual dictionaries are presented in Table 1, and a comparison to previous work is shown in Table 2, where the unsupervised model significantly outperforms previous approaches. The results on the sentence translation retrieval task are presented in Table 3, and the cross-lingual word similarity task in Table 4. Finally, the results on word-by-word translation for English-Esperanto are presented in Table 5. The bilingual dictionary used here does not account for words with multiple meanings.<br />
<br />
[[File:table1_fan.png |frame|none|alt=Alt text|Table 1: Word translation retrieval P@1 for the released vocabularies in various language pairs. The authors consider 1,500 source test queries, and 200k target words for each language pair. The authors use fastText embeddings trained on Wikipedia. NN: nearest neighbors. ISF: inverted softmax. (’en’ is English, ’fr’ is French, ’de’ is German, ’ru’ is Russian, ’zh’ is classical Chinese and ’eo’ is Esperanto)]]<br />
<br />
<br />
[[File:table2_fan.png |frame|none|alt=Alt text|Table 2: English-Italian word translation average precisions (@1, @5, @10) from 1.5k source word queries using 200k target words. Results marked with the symbol † are from Smith et al. (2017). Wiki means the embeddings were trained on Wikipedia using fastText. Note that the method used by Artetxe et al. (2017) does not use the same supervision as other supervised methods, as they only use numbers in their initial parallel dictionary.]]<br />
<br />
[[File:table3_fan.png |frame|none|alt=Alt text|Table 3: English-Italian sentence translation retrieval. The authors report the average P@k from 2,000 source queries using 200,000 target sentences. The authors use the same embeddings as in Smith et al. (2017). Their results are marked with the symbol †.]]<br />
<br />
[[File:table4_fan.png |frame|none|alt=Alt text|Table 4: Cross-lingual wordsim task. NASARI<br />
(Camacho-Collados et al. (2016)) refers to the official SemEval2017 baseline. The authors report Pearson correlation.]]<br />
<br />
[[File:table5_fan.png |frame|none|alt=Alt text|Table 5: BLEU score on English-Esperanto.<br />
Although being a naive approach, word-by-word translation is enough to get a rough idea of the input sentence. The quality of the generated dictionary has a significant impact on the BLEU score.]]<br />
<br />
[[File:paper9_fig3.png |frame|none|alt=Alt text|Figure 3: The paper also investigated the impact of monolingual embeddings. It was found that the model from this paper can align embeddings obtained through different methods, but not embeddings obtained from different corpora, which explains the large performance increase in Table 2 due to the corpus change from WaCky to Wiki using CBOW embedding. This is conveyed in this figure, which displays English-to-English word alignment accuracies with regard to word frequency. (a) Perfect alignment is achieved using the same model and corpora. (b) Alignment is also good using different models on the same corpora, although CSLS consistently has better results. (c) Results are worse when different corpora are used. (d) Results are even worse when both the embedding model and the corpora are different.]]<br />
<br />
= Conclusion =<br />
It is clear that one major drawback of this method, when it actually comes to translation, is the restriction that the two languages must have similar intrinsic structures for the embeddings to align. However, given this assumption, this paper shows for the first time that one can align word embedding spaces without any cross-lingual supervision, i.e., solely based on unaligned datasets of each language, while reaching or outperforming the quality of previous supervised approaches in several cases. Using adversarial training, the model is able to initialize a linear mapping between a source and a target space, which is also used to produce a synthetic parallel dictionary. It is then possible to apply the same techniques proposed for supervised approaches, namely Procrustean optimization.<br />
<br />
= Open source code =<br />
The source code for the paper is provided at the following GitHub link: https://github.com/facebookresearch/MUSE. The repository contains the authors' own implementation, written in PyTorch.<br />
<br />
= Source =<br />
Dinu, Georgiana; Lazaridou, Angeliki; Baroni, Marco<br />
| Improving zero-shot learning by mitigating the hubness problem<br />
| arXiv:1412.6568<br />
<br />
Lample, Guillaume; Denoyer, Ludovic; Ranzato, Marc'Aurelio <br />
| Unsupervised Machine Translation Using Monolingual Corpora Only<br />
| arXiv: 1701.04087<br />
<br />
Smith, Samuel L; Turban, David HP; Hamblin, Steven; Hammerla, Nils Y<br />
| Offline bilingual word vectors, orthogonal transformations and the inverted softmax<br />
| arXiv:1702.03859<br />
<br />
Lample, G. (n.d.). Facebookresearch/MUSE. Retrieved March 25, 2018, from https://github.com/facebookresearch/MUSE<br />
<br />
Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean<br />
| Efficient Estimation of Word Representations in Vector Space, 2013<br />
| arXiv:1301.3781<br />
<br />
<br />
Xing, C., Wang, D., Liu, C., & Lin, Y. (2015). Normalized Word Embedding and Orthogonal Transform for Bilingual Word Translation. HLT-NAACL.</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Word_translation_without_parallel_data&diff=36324Word translation without parallel data2018-04-18T21:24:06Z<p>Ws2chen: /* Experiments */</p>
<hr />
<div>[[File:Toy_example.png]]<br />
<br />
= Presented by =<br />
<br />
Xia Fan<br />
<br />
= Introduction =<br />
<br />
Many successful methods for learning relationships between languages stem from the hypothesis that there is a relationship between the context of words and their meanings. This means that if an adequate representation of a language is found in a high-dimensional space (called an embedding), then words similar to a given word are close to one another in this space (e.g., a nearest-neighbor search under some norm finds words with similar contexts). Historically, another significant hypothesis is that these embedding spaces show similar structures across different languages. That is, given an embedding space for English and one for Spanish, a mapping could be found that aligns the two spaces, and such a mapping could be used as a tool for translation. Many papers exploit these hypotheses but rely on large parallel datasets for training. More recently, to remove the need for supervised training, methods have been proposed that use identical character strings (e.g., letters or digits) to align the embeddings. The downside of this approach is that the two languages need to be similar to begin with, since they must share some basic building blocks. The method proposed in this paper uses an adversarial approach to find the mapping between the embedding spaces of two languages without the use of large parallel datasets.<br />
<br />
The contributions of this paper can be listed as follows: <br />
<br />
1. This paper introduces a model that is on par with, or outperforms, supervised state-of-the-art methods without employing any cross-lingual annotated data such as bilingual dictionaries or parallel corpora (large and structured sets of texts). The method uses an idea similar to GANs: it leverages adversarial training to learn a linear mapping from a source space to a target space. A discriminator is trained to distinguish between the mapped source embeddings and the target embeddings, while the mapping is jointly trained to fool the discriminator. <br />
<br />
2. Second, this paper extracts a synthetic dictionary from the resulting shared embedding space and fine-tunes the mapping with the closed-form Procrustes solution from Schonemann (1966). <br />
<br />
3. Third, this paper also introduces an unsupervised selection metric that is highly correlated with the mapping quality and that the authors use both as a stopping criterion and to select the best hyper-parameters. <br />
<br />
4. Fourth, they introduce a cross-domain similarity adaptation to mitigate the so-called hubness problem (points tending to be nearest neighbors of many points in high-dimensional spaces).<br />
<br />
5. They demonstrate the effectiveness of their method on a low-resource language pair for which parallel corpora are not available (English-Esperanto) and for which the method is particularly suited.<br />
<br />
This paper was published at ICLR 2018.<br />
<br />
= Related Work =<br />
<br />
'''Bilingual Lexicon Induction'''<br />
<br />
Many papers have addressed this subject by using discrete word representations. However, these methods usually need to be initialized with prior knowledge, such as the edit distance between the input and output ground truth. Unfortunately, this only works for closely related languages.<br />
<br />
= Model =<br />
<br />
<br />
=== Estimation of Word Representations in Vector Space ===<br />
<br />
This model focuses on learning a mapping between the two sets such that translations are close in the shared space. Before describing the model, it is useful to introduce an earlier model that exploits the similarities of monolingual embedding spaces. Mikolov et al. (2013) use a known dictionary of n=5000 pairs of words <math> \{x_i,y_i\}_{i\in\{1,\ldots,n\}} </math> and learn a linear mapping W between the source and the target space such that <br />
<br />
\begin{align}<br />
W^*=argmin_{W{\in}M_d(R)}||WX-Y||_F \hspace{1cm} (1)<br />
\end{align}<br />
<br />
where d is the dimension of the embeddings, <math> M_d(R) </math> is the space of <math> d \times d </math> matrices of real numbers, and X and Y are two aligned matrices of size <math> d \times n </math> containing the embeddings of the words in the parallel vocabulary. Here <math>||\cdot||_F</math> is the Frobenius matrix norm, which is the square root of the sum of the squared components.<br />
<br />
Xing et al. (2015) showed that these results are improved by enforcing an orthogonality constraint on W. In that case, equation (1) boils down to the Procrustes problem, a matrix approximation problem in which the goal is to find the orthogonal matrix that most closely maps one given matrix onto another, as measured by the Frobenius norm. It advantageously offers a closed-form solution obtained from the singular value decomposition (SVD) of <math> YX^T </math>, where <math> O_d(R) </math> denotes the set of <math> d \times d </math> orthogonal matrices:<br />
<br />
\begin{align}<br />
W^*=argmin_{W{\in}O_d(R)}||WX-Y||_F=UV^T\textrm{, with }U\Sigma V^T=SVD(YX^T). \hspace{1cm} (2)<br />
\end{align}<br />
<br />
<br />
This can be proven as follows. First note that <br />
\begin{align}<br />
&||WX-Y||_F^2\\<br />
&= \langle WX-Y, WX-Y\rangle_F\\ <br />
&= \langle WX, WX \rangle_F -2 \langle W X, Y \rangle_F + \langle Y, Y \rangle_F \\<br />
&= ||X||_F^2 -2 \langle W X, Y \rangle_F + || Y||_F^2, <br />
\end{align}<br />
<br />
where <math display="inline"> \langle \cdot, \cdot \rangle_F </math> denotes the Frobenius inner-product and we have used the orthogonality of <math display="inline"> W </math>. It follows that we need only maximize the inner-product above. Let <math display="inline"> u_1, \ldots, u_d </math> denote the columns of <math display="inline"> U </math>. Let <math display="inline"> v_1, \ldots , v_d </math> denote the columns of <math display="inline"> V </math>. Let <math display="inline"> \sigma_1, \ldots, \sigma_d </math> denote the diagonal entries of <math display="inline"> \Sigma </math>. We have<br />
\begin{align}<br />
&\langle W X, Y \rangle_F \\<br />
&= \text{Tr} (W^T Y X^T)\\<br />
& =\text{Tr}(W^T \sum_i \sigma_i u_i v_i^T)\\<br />
&=\sum_i \sigma_i \text{Tr}(W^T u_i v_i^T)\\<br />
&=\sum_i \sigma_i ((Wv_i)^T u_i )\quad\text{(invariance of trace under cyclic permutations)}\\<br />
&\le \sum_i \sigma_i ||Wv_i|| ||u_i||\quad\text{(Cauchy-Schwarz inequality)}\\<br />
&= \sum_i \sigma_i<br />
\end{align}<br />
where we have used the invariance of trace under cyclic permutations, Cauchy-Schwarz, the orthogonality of W (so that <math display="inline"> ||Wv_i||=||v_i|| </math>), and the orthonormality of the columns of U and V. Note that choosing <br />
\begin{align}<br />
W=UV^T<br />
\end{align}<br />
achieves the bound. This completes the proof.<br />
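<br />
The closed-form solution can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the function name <code>procrustes</code>, the toy dimensions, and the random test data are all assumptions made for illustration. It builds Y from X with a known orthogonal matrix Q and verifies that the SVD recipe recovers it.<br />

```python
import numpy as np

def procrustes(X, Y):
    """Return the orthogonal W minimizing ||W X - Y||_F.

    X, Y: arrays of shape (d, n) whose columns are aligned embeddings.
    """
    # SVD of Y X^T yields U, the singular values, and V^T.
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt

# Toy data (illustrative only): d-dimensional embeddings for n word pairs.
rng = np.random.default_rng(0)
d, n = 5, 50
X = rng.standard_normal((d, n))

# A random orthogonal matrix, obtained from a QR factorization,
# plays the role of the unknown "true" mapping.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
Y = Q @ X

W = procrustes(X, Y)
```

In this noise-free setting W should coincide with Q up to floating-point error; with real embeddings the fit is only approximate.<br />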
<br />
=== Domain-adversarial setting ===<br />
<br />
This paper shows how to learn this mapping W without cross-lingual supervision. An illustration of the approach is given in Fig. 1. First, the model learns an initial proxy of W by using an adversarial criterion. Then, it uses the words that match best as anchor points for Procrustes. Finally, it improves performance on less frequent words by changing the metric of the space, which spreads out the points that lie in dense regions. <br />
<br />
[[File:Toy_example.png |frame|none|alt=Alt text|Figure 1: Toy illustration of the method. (A) There are two distributions of word embeddings, English words in red denoted by X and Italian words in blue denoted by Y , which we want to align/translate. Each dot represents a word in that space. The size of the dot is proportional to the frequency of the words in the training corpus of that language. (B) Using adversarial learning, we learn a rotation matrix W which roughly aligns the two distributions. The green stars are randomly selected words that are fed to the discriminator to determine whether the two word embeddings come from the same distribution. (C) The mapping W is further refined via Procrustes. This method uses frequent words aligned by the previous step as anchor points, and minimizes an energy function that corresponds to a spring system between anchor points. The refined mapping is then used to map all words in the dictionary. (D) Finally, we translate by using the mapping W and a distance metric, dubbed CSLS, that expands the space where there is high density of points (like the area around the word “cat”), so that “hubs” (like the word “cat”) become less close to other word vectors than they would otherwise (compare to the same region in panel (A)).]]<br />
<br />
Let <math> X=\{x_1,...,x_n\} </math> and <math> Y=\{y_1,...,y_m\} </math> be two sets of n and m word embeddings coming from a source and a target language respectively. A model, called the discriminator, is trained to discriminate between elements randomly sampled from <math> WX=\{Wx_1,...,Wx_n\} </math> and Y. W is trained to prevent the discriminator from making accurate predictions. As a result, this is a two-player adversarial game, where the discriminator aims at maximizing its ability to identify the origin of an embedding, and W aims at preventing the discriminator from doing so by making WX and Y as similar as possible. This approach is in line with the work of Ganin et al. (2016), who proposed to learn latent representations invariant to the input domain, where in this case a domain is represented by a language (source or target).<br />
<br />
1. Discriminator objective<br />
<br />
Refer to the discriminator parameters as <math> \theta_D </math>. Consider the probability <math> P_{\theta_D}(source = 1|z) </math> that a vector z is the mapping of a source embedding (as opposed to a target embedding) according to the discriminator. The discriminator loss can be written as:<br />
<br />
\begin{align}<br />
L_D(\theta_D|W)=-\frac{1}{n} \sum_{i=1}^n log P_{\theta_D}(source=1|Wx_i)-\frac{1}{m} \sum_{i=1}^m log P_{\theta_D}(source=0|y_i)<br />
\end{align}<br />
<br />
2. Mapping objective <br />
<br />
In the unsupervised setting, W is now trained so that the discriminator is unable to accurately predict the embedding origins: <br />
<br />
\begin{align}<br />
L_W(W|\theta_D)=-\frac{1}{n} \sum_{i=1}^n log P_{\theta_D}(source=0|Wx_i)-\frac{1}{m} \sum_{i=1}^m log P_{\theta_D}(source=1|y_i)<br />
\end{align}<br />
<br />
3. Learning algorithm <br />
To train the model, the authors follow the standard training procedure of deep adversarial networks of Goodfellow et al. (2014). For every input sample, the discriminator and the mapping matrix W are trained successively with stochastic gradient updates to minimize <math> L_D </math> and <math> L_W </math>, respectively.<br />
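<br />
The two objectives above differ only in which label each term is scored against. As a minimal sketch (the function names and the toy probabilities are assumptions for illustration, and the discriminator itself is not modelled), both losses can be computed from the discriminator's predicted probabilities <math> P_{\theta_D}(source=1|\cdot) </math>:<br />

```python
import numpy as np

def discriminator_loss(p_mapped, p_target):
    """L_D: reward calling mapped source vectors 'source' and targets 'target'.

    p_mapped[i] = P(source=1 | W x_i), p_target[j] = P(source=1 | y_j).
    """
    return -np.mean(np.log(p_mapped)) - np.mean(np.log(1.0 - p_target))

def mapping_loss(p_mapped, p_target):
    """L_W: the same terms with the labels flipped, so W is rewarded
    when the discriminator is fooled."""
    return -np.mean(np.log(1.0 - p_mapped)) - np.mean(np.log(p_target))

# Hypothetical discriminator outputs for illustration.
confident_mapped = np.array([0.9, 0.8, 0.95])   # mapped source vectors
confident_target = np.array([0.1, 0.2, 0.05])   # target vectors
undecided = np.full(3, 0.5)

ld_good = discriminator_loss(confident_mapped, confident_target)
lw_good = mapping_loss(confident_mapped, confident_target)
ld_half = discriminator_loss(undecided, undecided)
lw_half = mapping_loss(undecided, undecided)
```

When the discriminator is accurate, its own loss is small and the mapping loss is large; when it outputs 0.5 everywhere, both losses equal 2 log 2.<br />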
<br />
=== Refinement procedure ===<br />
<br />
The matrix W obtained with adversarial training gives good performance (see Table 1), but the results are still not on par with the supervised approach. The adversarial approach tries to align all words irrespective of their frequencies. However, rare words have embeddings that are updated less often and are more likely to appear in different contexts in each corpus, which makes them harder to align. Under the assumption that the mapping is linear, it is therefore better to infer the global mapping using only the most frequent words as anchors. Moreover, the accuracy on the most frequent word pairs is already high after adversarial training.<br />
To refine the mapping, the authors build a synthetic parallel vocabulary using the W just learned with adversarial training. Specifically, they consider the most frequent words and retain only mutual nearest neighbors to ensure a high-quality dictionary. Subsequently, they apply the Procrustes solution in (2) to this generated dictionary. With the improved solution produced by the Procrustes algorithm, it is possible to generate a more accurate dictionary and apply this method iteratively, similarly to Artetxe et al. (2017). However, given that the synthetic dictionary obtained using adversarial training is already strong, the authors observe only small improvements when doing more than one iteration, i.e., the improvements on the word translation task are usually below 1%.<br />
<br />
=== Cross-Domain Similarity Local Scaling (CSLS) ===<br />
<br />
This paper considers a bipartite neighborhood graph, in which each word of a given dictionary is connected to its K nearest neighbors in the other language. A bipartite graph (or bigraph) is a graph whose vertices can be divided into two disjoint and independent sets U and V such that every edge connects a vertex in U to one in V.<br />
<br />
<math> N_T(Wx_s) </math> is used to denote the neighborhood, on this bipartite graph, associated with a mapped source word embedding <math> Wx_s </math>. All K elements of <math> N_T(Wx_s) </math> are words from the target language. Similarly, <math> N_S(y_t) </math> denotes the neighborhood associated with a word t of the target language. The mean similarity of a mapped source embedding <math> Wx_s </math> to its target neighborhood is defined as<br />
<br />
\begin{align}<br />
r_T(Wx_s)=\frac{1}{K}\sum_{y_t\in N_T(Wx_s)}cos(Wx_s,y_t)<br />
\end{align}<br />
<br />
where cos(.,.) is the cosine similarity, i.e. the cosine of the angle between two vectors. Likewise, the mean similarity of a target word <math> y_t </math> to its neighborhood is denoted as <math> r_S(y_t) </math>. These quantities are used to define a similarity measure CSLS(.,.) between mapped source words and target words as <br />
<br />
\begin{align}<br />
CSLS(Wx_s,y_t)=2cos(Wx_s,y_t)-r_T(Wx_s)-r_S(y_t)<br />
\end{align}<br />
<br />
This process increases the similarity associated with isolated word vectors, but decreases the similarity of vectors lying in dense areas. <br />
<br />
CSLS represents an improved measure for producing reliable matching words between two languages (i.e. neighbors of a word in one language should ideally correspond to the same words in the second language). The nearest neighbors algorithm is asymmetric, and in high-dimensional spaces, it suffers from the problem of hubness, in which some points are nearest neighbors to exceptionally many points, while others are not nearest neighbors to any points. Existing approaches for combating the effect of hubness on word translation retrieval involve performing similarity updates one language at a time without consideration for the other language in the pair (Dinu et al., 2015, Smith et al., 2017). Consequently, they yielded less accurate results when compared to CSLS in experiments conducted in this paper (Table 1).<br />
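<br />
The CSLS definition above can be sketched in a few lines of NumPy. This is an illustrative implementation rather than the authors' code: the function name <code>csls</code>, the choice of k, and the random test data are assumptions, and embeddings are stored as rows here (one word per row), unlike the column convention used in the Procrustes section.<br />

```python
import numpy as np

def csls(WX, Y, k=10):
    """CSLS scores between mapped source words and target words.

    WX: (n, d) mapped source embeddings, one row per word.
    Y:  (m, d) target embeddings, one row per word.
    Returns an (n, m) matrix with entry [s, t] = CSLS(W x_s, y_t).
    """
    # Normalize rows so that dot products are cosine similarities.
    WX = WX / np.linalg.norm(WX, axis=1, keepdims=True)
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    cos = WX @ Y.T

    # r_T: mean cosine of each mapped source word to its k nearest targets.
    r_T = np.sort(cos, axis=1)[:, -k:].mean(axis=1)
    # r_S: mean cosine of each target word to its k nearest mapped sources.
    r_S = np.sort(cos, axis=0)[-k:, :].mean(axis=0)

    return 2 * cos - r_T[:, None] - r_S[None, :]

# Toy data for illustration only.
rng = np.random.default_rng(0)
WX = rng.standard_normal((6, 4))
Y = rng.standard_normal((8, 4))
scores = csls(WX, Y, k=3)
```

A full sort is used for clarity; for vocabularies of 200k words a partial selection or an approximate nearest-neighbor index would be a more practical choice.<br />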
<br />
= Training and architectural choices =<br />
=== Architecture ===<br />
<br />
The authors use unsupervised word vectors trained with fastText. These are monolingual embeddings of dimension 300 trained on Wikipedia corpora; therefore, the mapping W has size 300 × 300. Words are lower-cased, and those that appear fewer than 5 times are discarded for training. As a post-processing step, only the 200k most frequent words were kept in the experiments.<br />
For the discriminator, the authors use a multilayer perceptron with two hidden layers of size 2048 and Leaky-ReLU activation functions. The input to the discriminator is corrupted with dropout noise at a rate of 0.1. As suggested by Goodfellow (2016), a smoothing coefficient s = 0.2 is included in the discriminator predictions. Training uses stochastic gradient descent with a batch size of 32, a learning rate of 0.1 and a decay of 0.95 for both the discriminator and W. <br />
<br />
=== Discriminator inputs ===<br />
The embedding quality of rare words is generally not as good as that of frequent words (Luong et al., 2013), and feeding the discriminator with rare words was observed to have a small but not negligible negative impact. As a result, the authors feed the discriminator only with the 50,000 most frequent words. At each training step, the word embeddings given to the discriminator are sampled uniformly. Sampling them according to word frequency did not have any noticeable impact on the results.<br />
<br />
=== Orthogonality===<br />
In this work, the authors propose to use a simple update step to ensure that the matrix W stays close to an orthogonal matrix during training (Cisse et al. (2017)). Specifically, the following update rule on the matrix W is used:<br />
<br />
\begin{align}<br />
W \leftarrow (1+\beta)W-\beta(WW^T)W<br />
\end{align}<br />
<br />
where β = 0.01 is usually found to perform well. This method ensures that the matrix stays close to the manifold of orthogonal matrices after each update.<br />
<br />
This update rule can be justified as follows. Consider the function <br />
\begin{align}<br />
g: \mathbb{R}^{d\times d} \to \mathbb{R}^{d \times d}<br />
\end{align}<br />
defined by<br />
\begin{align}<br />
g(W)= W^T W -I.<br />
\end{align}<br />
<br />
The derivative of g at W is the linear map<br />
\begin{align}<br />
Dg[W]: \mathbb{R}^{d \times d} \to \mathbb{R}^{d \times d}<br />
\end{align}<br />
defined by<br />
\begin{align}<br />
Dg[W](H)= H^T W + W^T H.<br />
\end{align}<br />
<br />
The adjoint of this linear map is<br />
<br />
\begin{align}<br />
D^\ast g[W](H)= WH^T +WH.<br />
\end{align}<br />
<br />
Now consider the function f<br />
\begin{align}<br />
f: \mathbb{R}^{d \times d} \to \mathbb{R}<br />
\end{align}<br />
<br />
defined by<br />
<br />
\begin{align}<br />
f(W)=||g(W) ||_F^2=||W^TW -I ||_F^2.<br />
\end{align}<br />
<br />
f has gradient:<br />
\begin{align}<br />
\nabla f (W) = 2D^\ast g[W] (g(W ) ) =2W(W^TW-I) +2W(W^TW-I)=4W W^TW-4W.<br />
\end{align}<br />
Equivalently, expanding the trace directly,<br />
\begin{align}<br />
\nabla f (W) = \nabla||W^TW-I||_F^2 = \nabla\text{Tr}\left((W^TW-I)^T(W^TW-I)\right)=\nabla\left(\text{Tr}(W^TWW^TW)-2\text{Tr}(W^TW)+d\right)=4WW^TW-4W.<br />
\end{align}<br />
<br />
Thus the update<br />
\begin{align}<br />
W \leftarrow (1+\beta)W-\beta(WW^T)W<br />
\end{align}<br />
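amounts to a step in the direction opposite the gradient of f, since <math> (1+\beta)W-\beta(WW^T)W = W-\frac{\beta}{4}\nabla f(W) </math>. That is, it is a gradient-descent step with step size <math> \beta/4 </math> toward the set of orthogonal matrices.<br />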
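<br />
The effect of this update can be observed numerically. The sketch below is illustrative only: the helper names, the 4 × 4 size, the perturbation scale, and the number of iterations are assumptions. It applies the rule repeatedly to a slightly perturbed orthogonal matrix and tracks <math> ||W^TW-I||_F </math>.<br />

```python
import numpy as np

def orthogonalize_step(W, beta=0.01):
    """One application of W <- (1 + beta) W - beta (W W^T) W."""
    return (1 + beta) * W - beta * (W @ W.T) @ W

def ortho_error(W):
    """Frobenius distance of W^T W from the identity."""
    return np.linalg.norm(W.T @ W - np.eye(W.shape[1]))

rng = np.random.default_rng(0)
# Start from an orthogonal matrix and perturb it slightly.
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
W0 = Q + 0.05 * rng.standard_normal((4, 4))

W1 = W0
for _ in range(200):
    W1 = orthogonalize_step(W1)

error_before = ortho_error(W0)
error_after = ortho_error(W1)
```

Each singular value s of W is mapped to s(1 + β(1 - s²)), so singular values close to 1 are pulled toward 1 and the error shrinks. In training, this step is interleaved with the gradient updates of W rather than run to convergence.<br />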
<br />
=== Dictionary generation ===<br />
The refinement step requires the generation of a new dictionary at each iteration. In order for the Procrustes solution to work well, it is best to apply it on correct word pairs. As a result, the CSLS method is used to select more accurate translation pairs in the dictionary. To further increase the quality of the dictionary, and ensure that W is learned from correct translation pairs, only mutual nearest neighbors were considered, i.e. pairs of words that are mutually nearest neighbors of each other according to CSLS. This significantly decreases the size of the generated dictionary, but improves its accuracy, as well as the overall performance.<br />
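<br />
The mutual-nearest-neighbor filter can be sketched as follows, assuming a precomputed score matrix (for example CSLS scores). The function name and the small example matrix are hypothetical and chosen only to show one pair being rejected.<br />

```python
import numpy as np

def mutual_nearest_neighbors(scores):
    """Return (source, target) index pairs that are each other's best match.

    scores[i, j] is the similarity between source word i and target word j.
    """
    best_target = scores.argmax(axis=1)  # best target for each source word
    best_source = scores.argmax(axis=0)  # best source for each target word
    return [(i, int(j)) for i, j in enumerate(best_target)
            if best_source[j] == i]

# Sources 0 and 1 both prefer target 0, but target 0 prefers source 0,
# so the pair (1, 0) is dropped from the dictionary.
scores = np.array([[0.9, 0.1, 0.0],
                   [0.8, 0.2, 0.1],
                   [0.0, 0.1, 0.7]])
pairs = mutual_nearest_neighbors(scores)
```

The retained pairs would then form the columns of X and Y passed to the Procrustes step.<br />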
<br />
=== Validation criterion for unsupervised model selection ===<br />
<br />
The authors consider the 10k most frequent source words, use CSLS to generate a translation for each of them, compute the average cosine similarity between these words and their deemed translations, and use this average as a validation metric. The choice of the 10k most frequent source words requires more justification, since these are expected to be the best-trained words and may not accurately represent the entire data set. Perhaps a k-fold cross-validation approach should be used instead. Figure 2 below shows the correlation between the evaluation score and this unsupervised criterion (without stabilization by learning rate shrinkage).<br />
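<br />
A minimal sketch of this criterion is given below. It assumes the caller has already restricted the rows of WX to the most frequent source words and supplies a score matrix used to pick translations; the function name and the toy data are hypothetical.<br />

```python
import numpy as np

def validation_score(WX, Y, scores):
    """Mean cosine similarity between each mapped source word and the
    target word selected for it by `scores` (e.g. CSLS).

    WX: (n, d) mapped source embeddings, Y: (m, d) target embeddings,
    scores: (n, m) similarity matrix used to choose translations.
    """
    best = scores.argmax(axis=1)
    WXn = WX / np.linalg.norm(WX, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return float(np.mean(np.sum(WXn * Yn[best], axis=1)))

# Sanity check on toy data: if the mapped source space equals the target
# space, every word is matched with itself and the score is 1.
rng = np.random.default_rng(0)
Y = rng.standard_normal((5, 3))
WX = Y.copy()
Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
score = validation_score(WX, Y, Yn @ Yn.T)
```

Because no bilingual dictionary enters the computation, the score can be tracked during training and used for early stopping or hyper-parameter selection.<br />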
<br />
<br />
<br />
[[File:fig2_fan.png |frame|none|alt=Alt text|Figure 2: Unsupervised model selection.<br />
Correlation between the unsupervised validation criterion (black line) and actual word translation accuracy (blue line). In this particular experiment, the selected model is at epoch 10. Observe how the criterion is well correlated with translation accuracy.]]<br />
<br />
= Results =<br />
<br />
== Experiments ==<br />
<br />
To illustrate the effectiveness of the methodology, the authors evaluate the unsupervised approach on several benchmarks and compare it with state-of-the-art supervised methods to see whether the unsupervised model can learn equally well or better. They first present the cross-lingual evaluation tasks used to assess the quality of the cross-lingual word embeddings, then the baseline model. Finally, they compare the unsupervised approach to the baseline and to previous methods, and conclude with a complementary analysis of the alignment of several sets of English embeddings trained with different methods and corpora.<br />
<br />
== Results ==<br />
<br />
The results on word translation retrieval using the bilingual dictionaries are presented in Table 1, and a comparison to previous work is shown in Table 2, where the unsupervised model significantly outperforms previous approaches. The results on the sentence translation retrieval task are presented in Table 3, and those on the cross-lingual word similarity task in Table 4. Finally, the results on word-by-word translation for English-Esperanto are presented in Table 5. The bilingual dictionary used here does not account for words with multiple meanings.<br />
<br />
[[File:table1_fan.png |frame|none|alt=Alt text|Table 1: Word translation retrieval P@1 for the released vocabularies in various language pairs. The authors consider 1,500 source test queries, and 200k target words for each language pair. The authors use fastText embeddings trained on Wikipedia. NN: nearest neighbors. ISF: inverted softmax. (’en’ is English, ’fr’ is French, ’de’ is German, ’ru’ is Russian, ’zh’ is classical Chinese and ’eo’ is Esperanto)]]<br />
<br />
<br />
[[File:table2_fan.png |frame|none|alt=Alt text|Table 2: English-Italian word translation average precisions (@1, @5, @10) from 1.5k source word queries using 200k target words. Results marked with the symbol † are from Smith et al. (2017). Wiki means the embeddings were trained on Wikipedia using fastText. Note that the method used by Artetxe et al. (2017) does not use the same supervision as other supervised methods, as they only use numbers in their initial parallel dictionary.]]<br />
<br />
[[File:table3_fan.png |frame|none|alt=Alt text|Table 3: English-Italian sentence translation retrieval. The authors report the average P@k from 2,000 source queries using 200,000 target sentences. The authors use the same embeddings as in Smith et al. (2017). Their results are marked with the symbol †.]]<br />
<br />
[[File:table4_fan.png |frame|none|alt=Alt text|Table 4: Cross-lingual wordsim task. NASARI<br />
(Camacho-Collados et al. (2016)) refers to the official SemEval2017 baseline. The authors report Pearson correlation.]]<br />
<br />
[[File:table5_fan.png |frame|none|alt=Alt text|Table 5: BLEU score on English-Esperanto.<br />
Although it is a naive approach, word-by-word translation is enough to get a rough idea of the input sentence. The quality of the generated dictionary has a significant impact on the BLEU score.]]<br />
<br />
[[File:paper9_fig3.png |frame|none|alt=Alt text|Figure 3: The paper also investigated the impact of monolingual embeddings. It was found that the model from this paper can align embeddings obtained through different methods, but not embeddings obtained from different corpora, which explains the large performance increase in Table 2 due to the corpus change from WaCky to Wiki using CBOW embeddings. This is conveyed in this figure, which displays English-to-English word alignment accuracies with regard to word frequency. Perfect alignment is achieved using the same model and corpora (a). Alignment is also good using different models on the same corpora, although CSLS consistently has better results (b). Results are worse when different corpora are used (c), and even worse when both the embedding model and the corpora are different (d).]]<br />
<br />
= Conclusion =<br />
One major limitation of this method, when it comes to actual translation, is the requirement that the two languages have similar intrinsic structures so that the embeddings can be aligned. Given this assumption, however, the paper shows for the first time that one can align word embedding spaces without any cross-lingual supervision, i.e., solely based on unaligned datasets of each language, while reaching or outperforming the quality of previous supervised approaches in several cases. Using adversarial training, the model is able to initialize a linear mapping between a source and a target space, which is also used to produce a synthetic parallel dictionary. It is then possible to apply the same techniques proposed for supervised approaches, namely a Procrustean optimization.<br />
<br />
= Open source code =<br />
The source code for the paper is provided at the following GitHub link: https://github.com/facebookresearch/MUSE. The repository contains the authors' own implementation, written in PyTorch.<br />
<br />
= Source =<br />
Dinu, Georgiana; Lazaridou, Angeliki; Baroni, Marco<br />
| Improving zero-shot learning by mitigating the hubness problem<br />
| arXiv:1412.6568<br />
<br />
Lample, Guillaume; Denoyer, Ludovic; Ranzato, Marc'Aurelio <br />
| Unsupervised Machine Translation Using Monolingual Corpora Only<br />
| arXiv: 1701.04087<br />
<br />
Smith, Samuel L; Turban, David HP; Hamblin, Steven; Hammerla, Nils Y<br />
| Offline bilingual word vectors, orthogonal transformations and the inverted softmax<br />
| arXiv:1702.03859<br />
<br />
Lample, G. (n.d.). Facebookresearch/MUSE. Retrieved March 25, 2018, from https://github.com/facebookresearch/MUSE<br />
<br />
Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean<br />
| Efficient Estimation of Word Representations in Vector Space, 2013<br />
| arXiv:1301.3781<br />
<br />
<br />
Xing, C., Wang, D., Liu, C., & Lin, Y. (2015). Normalized Word Embedding and Orthogonal Transform for Bilingual Word Translation. HLT-NAACL.</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Word_translation_without_parallel_data&diff=36323Word translation without parallel data2018-04-18T21:22:44Z<p>Ws2chen: /* Results */</p>
<hr />
<div>[[File:Toy_example.png]]<br />
<br />
= Presented by =<br />
<br />
Xia Fan<br />
<br />
= Introduction =<br />
<br />
Many successful methods for learning relationships between languages stem from the hypothesis that there is a relationship between the context of words and their meanings. This means that if an adequate representation of a language is found in a high dimensional space (this is called an embedding), then words similar to a given word are close to one another in this space (ex. some norm can be minimized to find a word with similar context). Historically, another significant hypothesis is that these embedding spaces show similar structures over different languages. That is to say that given an embedding space for English and one for Spanish, a mapping could be found that aligns the two spaces and such a mapping could be used as a tool for translation. Many papers exploit these hypotheses, but use large parallel datasets for training. Recently, to remove the need for supervised training, methods have been implemented that utilize identical character strings (ex. letters or digits) in order to try to align the embeddings. The downside of this approach is that the two languages need to be similar to begin with as they need to have some shared basic building block. The method proposed in this paper uses an adversarial method to find this mapping between the embedding spaces of two languages without the use of large parallel datasets.<br />
<br />
The contributions of this paper can be listed as follows: <br />
<br />
1. This paper introduces a model that is on par with, or outperforms, supervised state-of-the-art methods without employing any cross-lingual annotated data such as bilingual dictionaries or parallel corpora (large and structured sets of texts). The method uses an idea similar to GANs: it leverages adversarial training to learn a linear mapping from a source space to a target space. A discriminator is trained to distinguish between the mapped source embeddings and the target embeddings, while the mapping is jointly trained to fool the discriminator. <br />
<br />
2. Second, this paper extracts a synthetic dictionary from the resulting shared embedding space and fine-tunes the mapping with the closed-form Procrustes solution from Schonemann (1966). <br />
<br />
3. Third, this paper also introduces an unsupervised selection metric that is highly correlated with the mapping quality and that the authors use both as a stopping criterion and to select the best hyper-parameters. <br />
<br />
4. Fourth, they introduce a cross-domain similarity adaptation to mitigate the so-called hubness problem (points tending to be nearest neighbors of many points in high-dimensional spaces).<br />
<br />
5. They demonstrate the effectiveness of their method on a low-resource language pair for which parallel corpora are not available (English-Esperanto) and for which the method is particularly suited.<br />
<br />
This paper is published in ICLR 2018.<br />
<br />
= Related Work =<br />
<br />
'''Bilingual Lexicon Induction'''<br />
<br />
Many papers have addressed this subject by using discrete word representations. However, these methods usually need to be initialized with prior knowledge, such as the edit distance between the input and output ground truth. Unfortunately, this only works for closely related languages.<br />
<br />
= Model =<br />
<br />
<br />
=== Estimation of Word Representations in Vector Space ===<br />
<br />
This model focuses on learning a mapping between the two sets such that translations are close in the shared space. Before describing the model, it is useful to introduce an earlier model that exploits the similarities of monolingual embedding spaces. Mikolov et al. (2013) use a known dictionary of n=5000 pairs of words <math> \{x_i,y_i\}_{i\in\{1,\ldots,n\}} </math> and learn a linear mapping W between the source and the target space such that <br />
<br />
\begin{align}<br />
W^*=argmin_{W{\in}M_d(R)}||WX-Y||_F \hspace{1cm} (1)<br />
\end{align}<br />
<br />
where d is the dimension of the embeddings, <math> M_d(R) </math> is the space of <math> d \times d </math> matrices of real numbers, and X and Y are two aligned matrices of size <math> d \times n </math> containing the embeddings of the words in the parallel vocabulary. Here <math>||\cdot||_F</math> is the Frobenius matrix norm, which is the square root of the sum of the squared components.<br />
<br />
Xing et al. (2015) showed that these results are improved by enforcing an orthogonality constraint on W. In that case, equation (1) boils down to the Procrustes problem, a matrix approximation problem in which the goal is to find the orthogonal matrix that most closely maps one given matrix onto another, as measured by the Frobenius norm. It advantageously offers a closed-form solution obtained from the singular value decomposition (SVD) of <math> YX^T </math>, where <math> O_d(R) </math> denotes the set of <math> d \times d </math> orthogonal matrices:<br />
<br />
\begin{align}<br />
W^*=argmin_{W{\in}O_d(R)}||WX-Y||_F=UV^T\textrm{, with }U\Sigma V^T=SVD(YX^T). \hspace{1cm} (2)<br />
\end{align}<br />
<br />
<br />
This can be proven as follows. First note that <br />
\begin{align}<br />
&||WX-Y||_F^2\\<br />
&= \langle WX-Y, WX-Y\rangle_F\\ <br />
&= \langle WX, WX \rangle_F -2 \langle W X, Y \rangle_F + \langle Y, Y \rangle_F \\<br />
&= ||X||_F^2 -2 \langle W X, Y \rangle_F + || Y||_F^2, <br />
\end{align}<br />
<br />
where <math display="inline"> \langle \cdot, \cdot \rangle_F </math> denotes the Frobenius inner-product and we have used the orthogonality of <math display="inline"> W </math>. It follows that we need only maximize the inner-product above. Let <math display="inline"> u_1, \ldots, u_d </math> denote the columns of <math display="inline"> U </math>. Let <math display="inline"> v_1, \ldots , v_d </math> denote the columns of <math display="inline"> V </math>. Let <math display="inline"> \sigma_1, \ldots, \sigma_d </math> denote the diagonal entries of <math display="inline"> \Sigma </math>. We have<br />
\begin{align}<br />
&\langle W X, Y \rangle_F \\<br />
&= \text{Tr} (W^T Y X^T)\\<br />
& =\text{Tr}(W^T \sum_i \sigma_i u_i v_i^T)\\<br />
&=\sum_i \sigma_i \text{Tr}(W^T u_i v_i^T)\\<br />
&=\sum_i \sigma_i ((Wv_i)^T u_i ) \quad \text{(invariance of trace under cyclic permutations)}\\<br />
&\le \sum_i \sigma_i ||Wv_i|| ||u_i|| \quad \text{(Cauchy-Schwarz inequality)}\\<br />
&= \sum_i \sigma_i<br />
\end{align}<br />
where we have used the invariance of trace under cyclic permutations, Cauchy-Schwarz, and the orthogonality of the columns of U and V. Note that choosing <br />
\begin{align}<br />
W=UV^T<br />
\end{align}<br />
achieves the bound. This completes the proof.<br />
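<br />
The closed-form solution can also be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the function name <code>procrustes</code> and the synthetic data are assumptions made for illustration. It builds Y as an exactly rotated copy of X and recovers the rotation from the SVD of <math display="inline"> YX^T </math>.<br />
<br />
```python
import numpy as np

def procrustes(X, Y):
    """Return the orthogonal W minimizing ||WX - Y||_F.

    X and Y are d x n matrices whose columns are aligned embeddings.
    """
    # SVD of the d x d matrix Y X^T; numpy returns U, singular values, V^T.
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt

# Synthetic check (hypothetical data): Y is an exactly rotated copy of X.
rng = np.random.default_rng(0)
d, n = 5, 200
X = rng.standard_normal((d, n))
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
Y = Q @ X

W = procrustes(X, Y)
residual = np.linalg.norm(W @ X - Y)  # should be numerically zero
```
<br />
In this noiseless setting the recovered W coincides with the rotation Q, and the inner product <math display="inline"> \langle WX, Y \rangle_F </math> equals the sum of the singular values, as in the bound above.<br />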
<br />
=== Domain-adversarial setting ===<br />
<br />
This paper shows how to learn this mapping W without cross-lingual supervision. An illustration of the approach is given in Fig. 1. First, this model learns an initial proxy of W by using an adversarial criterion. Then, it uses the words that match the best as anchor points for Procrustes. Finally, it improves performance on less frequent words by changing the metric of the space, which spreads out points lying in dense regions. <br />
<br />
[[File:Toy_example.png |frame|none|alt=Alt text|Figure 1: Toy illustration of the method. (A) There are two distributions of word embeddings, English words in red denoted by X and Italian words in blue denoted by Y , which we want to align/translate. Each dot represents a word in that space. The size of the dot is proportional to the frequency of the words in the training corpus of that language. (B) Using adversarial learning, we learn a rotation matrix W which roughly aligns the two distributions. The green stars are randomly selected words that are fed to the discriminator to determine whether the two word embeddings come from the same distribution. (C) The mapping W is further refined via Procrustes. This method uses frequent words aligned by the previous step as anchor points, and minimizes an energy function that corresponds to a spring system between anchor points. The refined mapping is then used to map all words in the dictionary. (D) Finally, we translate by using the mapping W and a distance metric, dubbed CSLS, that expands the space where there is high density of points (like the area around the word “cat”), so that “hubs” (like the word “cat”) become less close to other word vectors than they would otherwise (compare to the same region in panel (A)).]]<br />
<br />
Let <math> X=\{x_1,...,x_n\} </math> and <math> Y=\{y_1,...,y_m\} </math> be two sets of n and m word embeddings coming from a source and a target language respectively. A model is trained to discriminate between elements randomly sampled from <math> WX=\{Wx_1,...,Wx_n\} </math> and Y. We call this model the discriminator. W is trained to prevent the discriminator from making accurate predictions. As a result, this is a two-player adversarial game, where the discriminator aims at maximizing its ability to identify the origin of an embedding, and W aims at preventing the discriminator from doing so by making WX and Y as similar as possible. This approach is in line with the work of Ganin et al. (2016), who proposed to learn latent representations invariant to the input domain, where in this case a domain is represented by a language (source or target).<br />
<br />
1. Discriminator objective<br />
<br />
Refer to the discriminator parameters as <math> \theta_D </math>. Consider the probability <math> P_{\theta_D}(source = 1|z) </math> that a vector z is the mapping of a source embedding (as opposed to a target embedding) according to the discriminator. The discriminator loss can be written as:<br />
<br />
\begin{align}<br />
L_D(\theta_D|W)=-\frac{1}{n} \sum_{i=1}^n \log P_{\theta_D}(source=1|Wx_i)-\frac{1}{m} \sum_{i=1}^m \log P_{\theta_D}(source=0|y_i)<br />
\end{align}<br />
<br />
2. Mapping objective <br />
<br />
In the unsupervised setting, W is now trained so that the discriminator is unable to accurately predict the embedding origins: <br />
<br />
\begin{align}<br />
L_W(W|\theta_D)=-\frac{1}{n} \sum_{i=1}^n \log P_{\theta_D}(source=0|Wx_i)-\frac{1}{m} \sum_{i=1}^m \log P_{\theta_D}(source=1|y_i)<br />
\end{align}<br />
<br />
3. Learning algorithm <br />
To train the model, the authors follow the standard training procedure of deep adversarial networks of Goodfellow et al. (2014). For every input sample, the discriminator and the mapping matrix W are trained successively with stochastic gradient updates to minimize <math> L_D </math> and <math> L_W </math> respectively.<br />
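<br />
To make the two objectives concrete, the following sketch evaluates <math> L_D </math> and <math> L_W </math> from discriminator outputs. It is an illustration under stated assumptions rather than the authors' code: the function name <code>adversarial_losses</code>, the small constant <code>eps</code> and the probability values are hypothetical, and the discriminator network itself is omitted.<br />
<br />
```python
import numpy as np

def adversarial_losses(p_src_mapped, p_src_target):
    """Compute (L_D, L_W) from discriminator outputs.

    p_src_mapped[i] = P(source=1 | W x_i) for mapped source embeddings.
    p_src_target[j] = P(source=1 | y_j)   for target embeddings.
    """
    eps = 1e-12  # guards against log(0); not part of the definitions
    # Discriminator: label mapped source words 1 and target words 0.
    loss_d = (-np.mean(np.log(p_src_mapped + eps))
              - np.mean(np.log(1.0 - p_src_target + eps)))
    # Mapping: same terms with the labels flipped.
    loss_w = (-np.mean(np.log(1.0 - p_src_mapped + eps))
              - np.mean(np.log(p_src_target + eps)))
    return loss_d, loss_w

# A discriminator that cannot tell the two sets apart outputs 0.5 everywhere.
confused = adversarial_losses(np.full(4, 0.5), np.full(6, 0.5))
# A discriminator that separates them well has a low L_D and a high L_W.
sharp = adversarial_losses(np.full(4, 0.99), np.full(6, 0.01))
```
<br />
When the discriminator is fully confused, both losses equal <math display="inline"> 2\log 2 </math>; training W pushes the discriminator toward this regime.<br />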
<br />
=== Refinement procedure ===<br />
<br />
The matrix W obtained with adversarial training gives good performance (see Table 1), but the results are still not on par with the supervised approach. In fact, the adversarial approach tries to align all words irrespective of their frequencies. However, rare words have embeddings that are less updated and are more likely to appear in different contexts in each corpus, which makes them harder to align. Under the assumption that the mapping is linear, it is then better to infer the global mapping using only the most frequent words as anchors. Besides, the accuracy on the most frequent word pairs is high after adversarial training.<br />
To refine the mapping, this paper builds a synthetic parallel vocabulary using the W just learned with adversarial training. Specifically, it considers the most frequent words and retains only mutual nearest neighbors to ensure a high-quality dictionary. Subsequently, it applies the Procrustes solution in (2) to this generated dictionary. Using the improved solution generated by the Procrustes algorithm, it is possible to generate a more accurate dictionary and apply this method iteratively, similarly to Artetxe et al. (2017). However, given that the synthetic dictionary obtained using adversarial training is already strong, the authors observe only small improvements when doing more than one iteration, i.e., the improvements on the word translation task are usually below 1%.<br />
<br />
=== Cross-Domain Similarity Local Scaling (CSLS) ===<br />
<br />
This paper considers a bipartite neighborhood graph, in which each word of a given dictionary is connected to its K nearest neighbors in the other language. A bipartite graph (or bigraph) is a graph whose vertices can be divided into two disjoint and independent sets U and V such that every edge connects a vertex in U to one in V.<br />
<br />
<math> N_T(Wx_s) </math> is used to denote the neighborhood, on this bi-partite graph, associated with a mapped source word embedding <math> Wx_s </math>. All K elements of <math> N_T(Wx_s) </math> are words from the target language. Similarly we denote by <math> N_S(y_t) </math> the neighborhood associated with a word t of the target language. Consider the mean similarity of a source embedding <math> x_s </math> to its target neighborhood as<br />
<br />
\begin{align}<br />
r_T(Wx_s)=\frac{1}{K}\sum_{y_t\in N_T(Wx_s)}\cos(Wx_s,y_t)<br />
\end{align}<br />
<br />
where cos(.,.) is the cosine similarity, that is, the cosine of the angle between two vectors. Likewise, the mean similarity of a target word <math> y_t </math> to its neighborhood is denoted as <math> r_S(y_t) </math>. This is used to define a similarity measure CSLS(.,.) between mapped source words and target words as <br />
<br />
\begin{align}<br />
CSLS(Wx_s,y_t)=2\cos(Wx_s,y_t)-r_T(Wx_s)-r_S(y_t)<br />
\end{align}<br />
<br />
This process increases the similarity associated with isolated word vectors, but decreases the similarity of vectors lying in dense areas. <br />
<br />
CSLS represents an improved measure for producing reliable matching words between two languages (i.e. neighbors of a word in one language should ideally correspond to the same words in the second language). The nearest neighbors algorithm is asymmetric, and in high-dimensional spaces, it suffers from the problem of hubness, in which some points are nearest neighbors to exceptionally many points, while others are not nearest neighbors to any points. Existing approaches for combating the effect of hubness on word translation retrieval involve performing similarity updates one language at a time without consideration for the other language in the pair (Dinu et al., 2015, Smith et al., 2017). Consequently, they yielded less accurate results when compared to CSLS in experiments conducted in this paper (Table 1).<br />
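<br />
The CSLS definition above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the implementation in the authors' repository: the function name <code>csls_scores</code>, the row-wise data layout and the random toy embeddings are assumptions.<br />
<br />
```python
import numpy as np

def csls_scores(mapped_src, tgt, k=10):
    """CSLS similarity matrix between mapped source rows and target rows.

    mapped_src: (n, d) array whose rows are W x_s; tgt: (m, d) array of y_t.
    """
    # Normalize rows so that dot products are cosine similarities.
    a = mapped_src / np.linalg.norm(mapped_src, axis=1, keepdims=True)
    b = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    cos = a @ b.T                                      # shape (n, m)
    # r_T: mean cosine of each source word to its k nearest target words.
    r_t = np.sort(cos, axis=1)[:, -k:].mean(axis=1)    # shape (n,)
    # r_S: mean cosine of each target word to its k nearest source words.
    r_s = np.sort(cos, axis=0)[-k:, :].mean(axis=0)    # shape (m,)
    return 2.0 * cos - r_t[:, None] - r_s[None, :]

# Hypothetical toy embeddings: 6 mapped source words, 8 target words.
rng = np.random.default_rng(0)
src = rng.standard_normal((6, 4))
tgt = rng.standard_normal((8, 4))
scores = csls_scores(src, tgt, k=3)
translations = scores.argmax(axis=1)  # best target index for each source word
```
<br />
Because both neighborhood terms are subtracted, a target word that is close to many source words (a hub) is penalized in every row, which is the symmetric correction that the one-sided methods lack.<br />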
<br />
= Training and architectural choices =<br />
=== Architecture ===<br />
<br />
This paper uses unsupervised word vectors that were trained using fastText. These correspond to monolingual embeddings of dimension 300 trained on Wikipedia corpora; therefore, the mapping W has size 300 × 300. Words are lower-cased, and those that appear fewer than 5 times are discarded for training. As a post-processing step, only the 200k most frequent words were selected in the experiments.<br />
For the discriminator, it uses a multilayer perceptron with two hidden layers of size 2048 and Leaky-ReLU activation functions. The input to the discriminator is corrupted with dropout noise with a rate of 0.1. As suggested by Goodfellow (2016), a smoothing coefficient s = 0.2 is included in the discriminator predictions. The model is trained with stochastic gradient descent with a batch size of 32, a learning rate of 0.1 and a decay of 0.95, both for the discriminator and W. <br />
<br />
=== Discriminator inputs ===<br />
The embedding quality of rare words is generally not as good as that of frequent words (Luong et al., 2013), and it is observed that feeding the discriminator with rare words had a small, but not negligible, negative impact. As a result, this paper only feeds the discriminator with the 50,000 most frequent words. At each training step, the word embeddings given to the discriminator are sampled uniformly. Sampling them according to the word frequency did not have any noticeable impact on the results.<br />
<br />
=== Orthogonality===<br />
In this work, the authors propose to use a simple update step to ensure that the matrix W stays close to an orthogonal matrix during training (Cisse et al. (2017)). Specifically, the following update rule on the matrix W is used:<br />
<br />
\begin{align}<br />
W \leftarrow (1+\beta)W-\beta(WW^T)W<br />
\end{align}<br />
<br />
where β = 0.01 is usually found to perform well. This method ensures that the matrix stays close to the manifold of orthogonal matrices after each update.<br />
<br />
This update rule can be justified as follows. Consider the function <br />
\begin{align}<br />
g: \mathbb{R}^{d\times d} \to \mathbb{R}^{d \times d}<br />
\end{align}<br />
defined by<br />
\begin{align}<br />
g(W)= W^T W -I.<br />
\end{align}<br />
<br />
The derivative of g at W is the linear map<br />
\begin{align}<br />
Dg[W]: \mathbb{R}^{d \times d} \to \mathbb{R}^{d \times d}<br />
\end{align}<br />
defined by<br />
\begin{align}<br />
Dg[W](H)= H^T W + W^T H.<br />
\end{align}<br />
<br />
The adjoint of this linear map is<br />
<br />
\begin{align}<br />
D^\ast g[W](H)= WH^T +WH.<br />
\end{align}<br />
<br />
Now consider the function f<br />
\begin{align}<br />
f: \mathbb{R}^{d \times d} \to \mathbb{R}<br />
\end{align}<br />
<br />
defined by<br />
<br />
\begin{align}<br />
f(W)=||g(W) ||_F^2=||W^TW -I ||_F^2.<br />
\end{align}<br />
<br />
f has gradient:<br />
\begin{align}<br />
\nabla f (W) = 2D^\ast g[W] (g(W ) ) =2W(W^TW-I) +2W(W^TW-I)=4W W^TW-4W.<br />
\end{align}<br />
or<br />
\begin{align}<br />
\nabla f (W) = \nabla||W^TW-I||_F^2 = \nabla\text{Tr}\left((W^TW-I)(W^TW-I)\right)=4(\nabla(W^TW-I))(W^TW-I)=4W(W^TW-I) \quad \text{(using the derivative of the trace function)}<br />
\end{align}<br />
<br />
Thus the update<br />
\begin{align}<br />
W \leftarrow (1+\beta)W-\beta(WW^T)W<br />
\end{align}<br />
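amounts to a step in the direction opposite the gradient of f with step size <math display="inline"> \beta/4 </math>, since <math display="inline"> W-\frac{\beta}{4}\nabla f(W)=(1+\beta)W-\beta WW^TW </math>. That is, a step toward the set of orthogonal matrices.<br />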
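<br />
The effect of the update can be observed numerically. The sketch below is a hypothetical illustration (the helper names, the 4 × 4 size and the perturbation scale are assumptions): it applies the rule repeatedly to a slightly perturbed orthogonal matrix and measures <math display="inline"> ||W^TW-I||_F </math> before and after.<br />
<br />
```python
import numpy as np

def orthogonalize_step(W, beta=0.01):
    """One update W <- (1 + beta) W - beta (W W^T) W."""
    return (1.0 + beta) * W - beta * (W @ W.T) @ W

def orthogonality_gap(W):
    """Frobenius norm of W^T W - I; zero exactly when W is orthogonal."""
    return np.linalg.norm(W.T @ W - np.eye(W.shape[0]))

# Start from a slightly perturbed orthogonal matrix (hypothetical data).
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
W = Q + 0.05 * rng.standard_normal((4, 4))

gap_before = orthogonality_gap(W)
for _ in range(500):
    W = orthogonalize_step(W)
gap_after = orthogonality_gap(W)  # much smaller than gap_before
```
<br />
In the paper the rule is interleaved with the adversarial gradient updates, so it acts as a gentle pull toward orthogonality rather than an exact projection.<br />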
<br />
=== Dictionary generation ===<br />
The refinement step requires the generation of a new dictionary at each iteration. In order for the Procrustes solution to work well, it is best to apply it on correct word pairs. As a result, the CSLS method is used to select more accurate translation pairs in the dictionary. To further increase the quality of the dictionary, and ensure that W is learned from correct translation pairs, only mutual nearest neighbors were considered, i.e. pairs of words that are mutually nearest neighbors of each other according to CSLS. This significantly decreases the size of the generated dictionary, but improves its accuracy, as well as the overall performance.<br />
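<br />
The mutual nearest neighbor filter can be written directly on top of a similarity matrix. The following is a minimal sketch with hypothetical names and a hand-made 3 × 3 score matrix; in the paper the scores would be CSLS values between mapped source words and target words.<br />
<br />
```python
import numpy as np

def mutual_nearest_neighbors(scores):
    """Return (source, target) index pairs that are each other's best match.

    scores[i, j] is the similarity (for example CSLS) between source word i
    and target word j.
    """
    best_tgt = scores.argmax(axis=1)   # best target for every source word
    best_src = scores.argmax(axis=0)   # best source for every target word
    return [(i, int(j)) for i, j in enumerate(best_tgt) if best_src[j] == i]

# Hypothetical 3 x 3 similarity matrix.
scores = np.array([
    [0.9, 0.2, 0.1],
    [0.8, 0.3, 0.2],   # source 1 prefers target 0, but target 0 prefers source 0
    [0.1, 0.2, 0.7],
])
pairs = mutual_nearest_neighbors(scores)  # source 1 is dropped
```
<br />
Only the pairs (0, 0) and (2, 2) survive, which illustrates how the filter shrinks the dictionary while removing one-sided matches.<br />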
<br />
=== Validation criterion for unsupervised model selection ===<br />
<br />
This paper considers the 10k most frequent source words, uses CSLS to generate a translation for each of them, then computes the average cosine similarity between these deemed translations, and uses this average as a validation metric. The choice of the 10 thousand most frequent source words requires more justification, since we would expect those to be the best trained words, and they may not accurately represent the entire data set. Perhaps a k-fold cross validation approach should be used instead. Figure 2 below shows the correlation between the evaluation score and this unsupervised criterion (without stabilization by learning rate shrinkage).<br />
<br />
<br />
<br />
[[File:fig2_fan.png |frame|none|alt=Alt text|Figure 2: Unsupervised model selection.<br />
Correlation between the unsupervised validation criterion (black line) and actual word translation accuracy (blue line). In this particular experiment, the selected model is at epoch 10. Observe how the criterion is well correlated with translation accuracy.]]<br />
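<br />
The unsupervised criterion itself is simple to compute once translations have been chosen. The sketch below assumes the translation indices are already available (for example from a CSLS argmax); the function name and the toy vectors are hypothetical.<br />
<br />
```python
import numpy as np

def validation_criterion(mapped_src, tgt, translation_idx):
    """Mean cosine similarity between mapped source words and chosen translations."""
    a = mapped_src / np.linalg.norm(mapped_src, axis=1, keepdims=True)
    b = tgt[translation_idx]
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))

# Hypothetical toy data: two mapped source words, three target words.
mapped_src = np.array([[1.0, 0.0], [0.0, 2.0]])
tgt = np.array([[2.0, 0.0], [1.0, 1.0], [0.0, -1.0]])
good = validation_criterion(mapped_src, tgt, np.array([0, 1]))  # well aligned
bad = validation_criterion(mapped_src, tgt, np.array([2, 2]))   # poorly aligned
```
<br />
A better mapping yields translations that lie closer to the mapped source words, so the criterion increases without requiring any labeled pairs.<br />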
<br />
= Results =<br />
<br />
== Experiments ==<br />
<br />
To illustrate the effectiveness of the methodology, the authors evaluate the unsupervised approach on several benchmarks and compare it with state-of-the-art supervised methods to see whether the unsupervised model can do a better job. The authors first present the cross-lingual evaluation tasks used to assess the quality of the cross-lingual word embeddings. They then present the baseline model. Lastly, after comparing the unsupervised approach to the baseline and to previous methods, they conclude with a complementary analysis of the alignment of several sets of English embeddings trained with different methods and corpora.<br />
<br />
The results on word translation retrieval using the bilingual dictionaries are presented in Table 1, and a comparison to previous work is shown in Table 2, where the unsupervised model significantly outperforms previous approaches. The results on the sentence translation retrieval task are presented in Table 3, and the cross-lingual word similarity task in Table 4. Finally, the results on word-by-word translation for English-Esperanto are presented in Table 5. The bilingual dictionary used here does not account for words with multiple meanings.<br />
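<br />
The retrieval tables report precision at k (P@1, P@5, P@10). The sketch below shows one way to compute it, assuming that a query counts as correct when any gold translation appears among its top k candidates; the function name and the three example queries are hypothetical and are not taken from the paper's dictionaries.<br />
<br />
```python
def precision_at_k(ranked_candidates, gold, k):
    """Fraction of queries whose top-k candidates contain a correct translation.

    ranked_candidates: dict mapping each source word to a ranked list of target words.
    gold: dict mapping each source word to the set of acceptable translations.
    """
    hits = sum(
        1 for word, candidates in ranked_candidates.items()
        if gold[word] & set(candidates[:k])
    )
    return hits / len(ranked_candidates)

# Hypothetical retrieval output for three English queries.
ranked = {
    "cat": ["gatto", "cane", "felino"],
    "dog": ["gatto", "cane", "lupo"],
    "house": ["albergo", "strada", "casa"],
}
gold = {"cat": {"gatto"}, "dog": {"cane"}, "house": {"casa"}}
p1 = precision_at_k(ranked, gold, 1)
p2 = precision_at_k(ranked, gold, 2)
p3 = precision_at_k(ranked, gold, 3)
```
<br />
Here one, two and three of the three queries are answered correctly within the top 1, 2 and 3 candidates respectively, so the metric can only grow with k.<br />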
<br />
[[File:table1_fan.png |frame|none|alt=Alt text|Table 1: Word translation retrieval P@1 for the released vocabularies in various language pairs. The authors consider 1,500 source test queries, and 200k target words for each language pair. The authors use fastText embeddings trained on Wikipedia. NN: nearest neighbors. ISF: inverted softmax. (’en’ is English, ’fr’ is French, ’de’ is German, ’ru’ is Russian, ’zh’ is classical Chinese and ’eo’ is Esperanto)]]<br />
<br />
<br />
[[File:table2_fan.png |frame|none|alt=Alt text|Table 2: English-Italian word translation average precisions (@1, @5, @10) from 1.5k source word queries using 200k target words. Results marked with the symbol † are from Smith et al. (2017). Wiki means the embeddings were trained on Wikipedia using fastText. Note that the method used by Artetxe et al. (2017) does not use the same supervision as other supervised methods, as they only use numbers in their initial parallel dictionary.]]<br />
<br />
[[File:table3_fan.png |frame|none|alt=Alt text|Table 3: English-Italian sentence translation retrieval. The authors report the average P@k from 2,000 source queries using 200,000 target sentences. The authors use the same embeddings as in Smith et al. (2017). Their results are marked with the symbol †.]]<br />
<br />
[[File:table4_fan.png |frame|none|alt=Alt text|Table 4: Cross-lingual wordsim task. NASARI<br />
(Camacho-Collados et al. (2016)) refers to the official SemEval2017 baseline. The authors report Pearson correlation.]]<br />
<br />
[[File:table5_fan.png |frame|none|alt=Alt text|Table 5: BLEU score on English-Esperanto.<br />
Although it is a naive approach, word-by-word translation is enough to get a rough idea of the input sentence. The quality of the generated dictionary has a significant impact on the BLEU score.]]<br />
<br />
[[File:paper9_fig3.png |frame|none|alt=Alt text|Figure 3: The paper also investigated the impact of monolingual embeddings. It was found that the model from this paper can align embeddings obtained through different methods, but not embeddings obtained from different corpora, which explains the large performance increase in Table 2 due to the corpus change from WaCky to Wiki using CBOW embedding. This is conveyed in this figure, which displays English to English word alignment accuracies with regard to word frequency. Perfect alignment is achieved using the same model and corpora (a). Alignment is also good when different embedding models are used, although CSLS consistently has better results (b). Results are worse when different corpora are used (c), and even worse when both the embedding model and the corpora are different (d).]]<br />
<br />
= Conclusion =<br />
One major limitation of this method when it comes to translation is the requirement that the two languages have similar intrinsic structures so that the embeddings can be aligned. However, given this assumption, this paper shows for the first time that one can align word embedding spaces without any cross-lingual supervision, i.e., solely based on unaligned datasets of each language, while reaching or outperforming the quality of previous supervised approaches in several cases. Using adversarial training, the model is able to initialize a linear mapping between a source and a target space, which is also used to produce a synthetic parallel dictionary. It is then possible to apply the same techniques proposed for supervised methods, namely a Procrustean optimization.<br />
<br />
= Open source code =<br />
The source code for the paper is provided at the following Github link: https://github.com/facebookresearch/MUSE. The repository provides the source code as written in PyTorch by the authors of this paper.<br />
<br />
= Source =<br />
Dinu, Georgiana; Lazaridou, Angeliki; Baroni, Marco<br />
| Improving zero-shot learning by mitigating the hubness problem<br />
| arXiv:1412.6568<br />
<br />
Lample, Guillaume; Denoyer, Ludovic; Ranzato, Marc'Aurelio <br />
| Unsupervised Machine Translation Using Monolingual Corpora Only<br />
| arXiv: 1701.04087<br />
<br />
Smith, Samuel L; Turban, David HP; Hamblin, Steven; Hammerla, Nils Y<br />
| Offline bilingual word vectors, orthogonal transformations and the inverted softmax<br />
| arXiv:1702.03859<br />
<br />
Lample, G. (n.d.). Facebookresearch/MUSE. Retrieved March 25, 2018, from https://github.com/facebookresearch/MUSE<br />
<br />
Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean<br />
| Efficient Estimation of Word Representations in Vector Space, 2013<br />
| arXiv:1301.3781<br />
<br />
<br />
Xing, C., Wang, D., Liu, C., & Lin, Y. (2015). Normalized Word Embedding and Orthogonal Transform for Bilingual Word Translation. HLT-NAACL.</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/AmbientGAN:_Generative_Models_from_Lossy_Measurements&diff=36318stat946w18/AmbientGAN: Generative Models from Lossy Measurements2018-04-18T16:39:45Z<p>Ws2chen: /* Theoretical Contribution */</p>
<hr />
<div>= Introduction =<br />
Generative Adversarial Networks operate by simulating complex distributions but training them requires access to large amounts of high quality data. Often, we only have access to noisy or partial observations, which will, from here on, be referred to as measurements of the true data. If we know the measurement function and would like to train a generative model for the true data, there are several ways to continue which have varying degrees of success. We will use noisy MNIST data as an illustrative example, and show the results of 1. ignoring the problem, 2. trying to recover the lost information, and 3. using AmbientGAN as a way to recover the true data distribution. Suppose we only see MNIST data that has been run through a Gaussian kernel (blurred) with some noise from a <math>N(0, 0.5^2)</math> distribution added to each pixel:<br />
<br />
<gallery mode="packed"><br />
File:mnist.png| True Data (Unobserved)<br />
File:mnistmeasured.png| Measured Data (Observed)<br />
</gallery><br />
<br />
<br />
=== Ignore the problem ===<br />
[[File:GANignore.png|500px]] [[File:mnistignore.png|300px]]<br />
<br />
Train a generative model directly on the measured data. Such a model learns the distribution of the measurements, so it cannot generate samples from the true distribution as it was before measurement. <br />
<br />
<br />
=== Try to recover the information lost ===<br />
[[File:GANrecovery.png|420px]] [[File:mnistrecover.png|300px]]<br />
<br />
This approach tries to invert the measurement function before training. It works better than ignoring the problem but depends on how easily the measurement function can be inverted.<br />
<br />
=== AmbientGAN ===<br />
[[File:GANambient.png|500px]] [[File:mnistambient.png|300px]]<br />
<br />
Ashish Bora, Eric Price and Alexandros G. Dimakis propose AmbientGAN as a way to recover the true underlying distribution from measurements of the true data. AmbientGAN works by training a generator which attempts to have the measurements of the output it generates fool the discriminator. The discriminator must distinguish between real and generated measurements. This paper is published in ICLR 2018.<br />
<br />
== Contributions ==<br />
The paper makes the following contributions: <br />
<br />
=== Theoretical Contribution ===<br />
The authors show that the distribution of measured images uniquely determines the distribution of original images. This implies that a pure Nash equilibrium for the GAN game must find a generative model that matches the true distribution. They show similar results for a dropout measurement model, where each pixel is set to zero with some probability p, and a random projection measurement model, where they observe the inner product of the image with a random Gaussian vector.<br />
<br />
The authors also state theorems showing that the assumption required for recovery, namely that the measurement distribution uniquely determines the underlying distribution, is satisfied under the Gaussian-Projection, Convolve+Noise and Block-Pixels measurement models. For example, the theorem for the Gaussian-Projection model guarantees the uniqueness of the underlying distribution. It follows that the true underlying distribution can be recovered with the AmbientGAN framework under these measurement models.<br />
<br />
=== Empirical Contribution ===<br />
The authors consider the CelebA and MNIST datasets, for which the measurement model is unknown, and show that AmbientGAN recovers a lot of the underlying structure.<br />
<br />
= Related Work = <br />
Currently there exist two distinct approaches for constructing neural network based generative models: autoregressive [4,5] and adversarial [6] methods. The adversarial model has been shown to be very successful in modeling complex data distributions such as images, 3D models, state action distributions and many more. This paper is related to the work in [7] where the authors create 3D object shapes from a dataset of 2D projections. This paper states that the work in [7] is a special case of the AmbientGAN framework where the measurement process creates 2D projections using weighted sums of voxel occupancies.<br />
<br />
= Datasets and Model Architectures=<br />
The authors used three datasets for their experiments: MNIST, CelebA and CIFAR-10. The generative models used for the experiments are briefly described here. For the MNIST dataset, two GAN models are used. The first model is a conditional DCGAN, while the second model is an unconditional Wasserstein GAN with gradient penalty (WGANGP). For the CelebA dataset, an unconditional DCGAN is used. For the CIFAR-10 dataset, an Auxiliary Classifier Wasserstein GAN with gradient penalty (ACWGANGP) is used. For measurements with 2D outputs, i.e. Block-Pixels, Block-Patch, Keep-Patch, Extract-Patch, and Convolve+Noise, the same discriminator architectures as in the original work are used. For 1D projections, i.e. Pad-Rotate-Project, Pad-Rotate-Project-θ, fully connected discriminators are used. The architecture of the fully connected discriminator used for the MNIST dataset was 25-25-1 and for the CelebA dataset was 100-100-1.<br />
<br />
= Model =<br />
For the following variables, superscript <math>r</math> denotes the true distributions while superscript <math>g</math> denotes the generated distributions. Let <math>x</math> denote a point in the underlying space and <math>y</math> a measurement.<br />
<br />
Thus, <math>p_x^r</math> is the real underlying distribution over <math>\mathbb{R}^n</math> that we are interested in. However, if we assume that our (known) measurement functions <math>f_\theta: \mathbb{R}^n \to \mathbb{R}^m</math> are parameterized by <math>\Theta \sim p_\theta</math>, then what we observe is <math>Y = f_\Theta(X) \sim p_y^r</math>, where <math>X \sim p_x^r</math> and <math>p_y^r</math> is a distribution over the measurements <math>y</math>.<br />
<br />
Mirroring the standard GAN setup we let <math>Z \in \mathbb{R}^k, Z \sim p_z</math> and <math>\Theta \sim p_\theta</math> be random variables coming from a distribution that is easy to sample. <br />
<br />
If we have a generator <math>G: \mathbb{R}^k \to \mathbb{R}^n</math> then we can generate <math>X^g = G(Z)</math>, which has distribution <math>p_x^g</math>, and a measurement <math>Y^g = f_\Theta(G(Z))</math>, which has distribution <math>p_y^g</math>. <br />
<br />
Unfortunately, we do not observe any <math>X^r \sim p_x^r</math>, so we cannot use the discriminator directly on <math>G(Z)</math> to train the generator. Instead we will use the discriminator to distinguish between <math>Y^g = f_\Theta(G(Z))</math> and <math>Y^r</math>. That is, we train the discriminator, <math>D: \mathbb{R}^m \to \mathbb{R}</math>, to detect whether a measurement came from <math>p_y^r</math> or <math>p_y^g</math>.<br />
<br />
AmbientGAN has the objective function:<br />
<br />
\begin{align}<br />
\min_G \max_D \mathbb{E}_{Y^r \sim p_y^r}[q(D(Y^r))] + \mathbb{E}_{Z \sim p_z, \Theta \sim p_\theta}[q(1 - D(f_\Theta(G(Z))))]<br />
\end{align}<br />
<br />
where <math>q(.)</math> is the quality function; for the standard GAN <math>q(x) = \log(x)</math> and for Wasserstein GAN <math>q(x) = x</math>.<br />
<br />
As a technical limitation we require <math>f_\theta</math> to be differentiable with respect to each input for all values of <math>\theta</math>.<br />
<br />
With this set up we sample <math>Z \sim p_z</math>, <math>\Theta \sim p_\theta</math>, and <math>Y^r \sim U\{y_1, \cdots, y_s\}</math> each iteration and use them to compute the stochastic gradients of the objective function. We alternate between updating <math>G</math> and updating <math>D</math>.<br />
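<br />
The sampling scheme above leads to a simple Monte Carlo estimate of the objective. The sketch below only evaluates the objective for fixed functions and performs no gradient updates; the one-dimensional setting, the additive-noise measurement and the toy generator and discriminators are all hypothetical choices made for illustration.<br />
<br />
```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def ambient_objective(D, G, f, y_real, z, theta, q=np.log):
    """Monte Carlo estimate of E[q(D(Y^r))] + E[q(1 - D(f_Theta(G(Z))))]."""
    y_fake = f(G(z), theta)  # measurements of generated samples
    return np.mean(q(D(y_real))) + np.mean(q(1.0 - D(y_fake)))

def f(x, theta):
    """Hypothetical measurement: additive noise."""
    return x + theta

def G_shifted(z):
    """Toy generator whose samples are shifted away from the true data."""
    return 3.0 + z

def D_blind(y):
    """Discriminator that outputs 0.5 for every measurement."""
    return np.full_like(y, 0.5)

def D_sharp(y):
    """Discriminator that rates small measurements as real."""
    return sigmoid(2.0 * (1.5 - y))

rng = np.random.default_rng(0)
s = 1000
# True data N(0, 1); the noise Theta is N(0, 0.5^2).
y_real = f(rng.standard_normal(s), 0.5 * rng.standard_normal(s))
z = rng.standard_normal(s)
theta = 0.5 * rng.standard_normal(s)

v_blind = ambient_objective(D_blind, G_shifted, f, y_real, z, theta)
v_sharp = ambient_objective(D_sharp, G_shifted, f, y_real, z, theta)
```
<br />
The blind discriminator attains <math display="inline"> -2\log 2 </math>, while the sharper one attains a larger value because the generated measurements are easy to tell apart; training G aims to remove that advantage.<br />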
<br />
= Empirical Results =<br />
<br />
The paper goes on to present results of AmbientGAN under various measurement functions and compares them to baseline models. We have already seen one example in the introduction: a comparison of AmbientGAN in the Convolve + Noise measurement case with the ignore-baseline and the unmeasure-baseline. <br />
<br />
=== Convolve + Noise ===<br />
The following are additional results for the Convolve + Noise case with the celebA dataset. AmbientGAN is compared to a baseline that uses Wiener deconvolution, and the AmbientGAN samples are visibly of higher quality. The measurement is created using a Gaussian kernel and IID Gaussian noise, with <math>f_{\Theta}(x) = k*x + \Theta</math>, where <math>*</math> is the convolution operation, <math>k</math> is the convolution kernel, and <math>\Theta \sim p_{\theta}</math> is the noise.<br />
<br />
[[File:paper7_fig3.png]]<br />
<br />
Images that have undergone Convolve + Noise transformations (left). Results with Wiener deconvolution (middle). Results with AmbientGAN (right).<br />
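<br />
The Convolve + Noise measurement <math>f_{\Theta}(x) = k*x + \Theta</math> can be sketched with NumPy alone. The kernel width, the zero padding at the borders and the single-pixel test image are assumptions made for this illustration and are not taken from the paper.<br />
<br />
```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Normalized 2-D Gaussian kernel of odd width `size`."""
    ax = np.arange(size) - size // 2
    g = np.exp(-ax**2 / (2.0 * sigma**2))
    k = np.outer(g, g)
    return k / k.sum()

def convolve_noise(x, kernel, noise_std, rng):
    """f_Theta(x) = k * x + Theta with zero padding and i.i.d. Gaussian noise."""
    h, w = x.shape
    r = kernel.shape[0] // 2
    padded = np.pad(x, r)  # zero padding on every side
    blurred = np.zeros((h, w))
    # The kernel is symmetric, so correlation and convolution coincide.
    for di in range(kernel.shape[0]):
        for dj in range(kernel.shape[1]):
            blurred += kernel[di, dj] * padded[di:di + h, dj:dj + w]
    return blurred + noise_std * rng.standard_normal((h, w))

rng = np.random.default_rng(0)
x = np.zeros((28, 28))
x[14, 14] = 1.0                       # a single bright pixel (hypothetical image)
k = gaussian_kernel(5, 1.0)
y_clean = convolve_noise(x, k, 0.0, rng)   # blur only
y_noisy = convolve_noise(x, k, 0.5, rng)   # blur plus N(0, 0.5^2) noise
```
<br />
Blurring a single bright pixel reproduces the kernel itself around that pixel, which is a convenient sanity check for the convolution.<br />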
<br />
=== Block-Pixels ===<br />
With the block-pixels measurement function each pixel is independently set to 0 with probability <math>p</math>.<br />
<br />
[[File:block-pixels.png]]<br />
<br />
Measurements from the celebA dataset with <math>p=0.95</math> (left). Images generated from GAN trained on unmeasured (via blurring) data (middle). Results generated from AmbientGAN (right).<br />
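<br />
The Block-Pixels measurement is a per-pixel Bernoulli mask. A minimal sketch follows; the function name and the all-ones test image are hypothetical.<br />
<br />
```python
import numpy as np

def block_pixels(x, p, rng):
    """Set each pixel of x to 0 independently with probability p."""
    keep = rng.random(x.shape) >= p   # True with probability 1 - p
    return x * keep

rng = np.random.default_rng(0)
x = np.ones((200, 200))               # hypothetical all-ones image
y = block_pixels(x, 0.95, rng)
kept_fraction = y.mean()              # close to 1 - p = 0.05
```
<br />
With <math>p=0.95</math> only about one pixel in twenty survives in each measurement, which indicates how little information a single observation carries.<br />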
<br />
=== Block-Patch ===<br />
<br />
[[File:block-patch.png]]<br />
<br />
A random 14x14 patch is set to zero (left). Unmeasured using Navier-Stokes inpainting (middle). AmbientGAN (right). <br />
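<br />
The Block-Patch measurement can be sketched in the same style. The uniform choice of the patch position and the 28 × 28 test image are assumptions made for illustration.<br />
<br />
```python
import numpy as np

def block_patch(x, size, rng):
    """Zero out one randomly located size x size patch of a 2-D image."""
    h, w = x.shape
    top = int(rng.integers(0, h - size + 1))    # upper bound is exclusive
    left = int(rng.integers(0, w - size + 1))
    y = x.copy()                                # leave the input untouched
    y[top:top + size, left:left + size] = 0.0
    return y

rng = np.random.default_rng(0)
x = np.ones((28, 28))                 # hypothetical all-ones image
y = block_patch(x, 14, rng)
```
<br />
Exactly 14 × 14 = 196 pixels are removed from each measurement, and their location changes from sample to sample.<br />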
<br />
=== Pad-Rotate-Project-<math>\theta</math> ===<br />
<br />
[[File:pad-rotate-project-theta.png]]<br />
<br />
Results generated by AmbientGAN where the measurement function zero-pads the image, rotates it by <math>\theta</math>, and projects it onto the x axis. For each measurement the value of <math>\theta</math> is known. <br />
<br />
The generated images only have the basic features of a face, and this is referred to as a failure case in the paper. However, AmbientGAN performs relatively well given how lossy the measurement function is. <br />
<br />
For the Keep-Patch measurement model, no pixels outside a box are known and thus inpainting methods are not suitable. For the Pad-Rotate-Project-θ measurements, a conventional technique is to sample many angles and use techniques for inverting the Radon transform. However, since only a few projections are observed at a time, these methods are not readily applicable, and hence it is unclear how to obtain an approximate inverse function. The figure below shows the corresponding results. <br />
<br />
[[File:keep-patch.png]]<br />
<br />
=== Explanation of Inception Score ===<br />
To evaluate GAN performance, the authors make use of the inception score, a metric introduced by Salimans et al. (2016). To evaluate the inception score, a pre-trained inception classification model (Szegedy et al. 2016) is applied to each generated datapoint, and the KL divergence between its label distribution conditional on the datapoint and its marginal label distribution is computed. The inception score is the exponential of this KL divergence averaged over the generated datapoints. The idea is that meaningful images should be recognized by the inception model as belonging to some class, and so the conditional distribution should have low entropy, while the model should produce a variety of images, so the marginal should have high entropy. Thus an effective GAN should have a high inception score.<br />
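<br />
Given the classifier's label distributions, the score is a short computation. The sketch below assumes the standard definition from Salimans et al. (2016), which exponentiates the average KL divergence; the function name, the constant <code>eps</code> and the two probability tables are hypothetical, and the inception network itself is not run.<br />
<br />
```python
import numpy as np

def inception_score(probs):
    """exp of the mean KL(p(y|x) || p(y)) over the rows of `probs`.

    probs[i] is the classifier's label distribution for generated sample i.
    """
    eps = 1e-12  # avoids log(0) for zero probabilities
    marginal = probs.mean(axis=0, keepdims=True)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Confident and diverse predictions over 10 classes give the maximum score, 10.
sharp_diverse = inception_score(np.eye(10))
# Uniform predictions carry no information, so the score is 1.
uninformative = inception_score(np.full((10, 10), 0.1))
```
<br />
For a 10-class classifier the score therefore ranges between 1 and 10.<br />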
<br />
=== MNIST Inception ===<br />
<br />
[[File:MNIST-inception.png]]<br />
<br />
AmbientGAN was compared with baselines by training several models with different probabilities <math>p</math> of blocking pixels. The plot on the left shows how the inception score changes with the block probability <math>p</math>. All four models perform similarly when no pixels are blocked <math>(p=0)</math>. As the blocking probability increases, the AmbientGAN models remain relatively stable and outperform the baseline models. AmbientGAN is therefore more robust than the baselines.<br />
<br />
The plot on the right shows how the inception score changes as the standard deviation of the additive Gaussian noise increases. The baselines perform better when the noise is small. As the variance increases, the AmbientGAN models perform much better than the baseline models. Furthermore, AmbientGAN retains high inception scores as the measurements become increasingly lossy.<br />
<br />
For 1D projections, the Pad-Rotate-Project model achieved an inception score of 4.18. The Pad-Rotate-Project-θ model achieved an inception score of 8.12, which is close to the vanilla GAN score of 8.99.<br />
<br />
=== CIFAR-10 Inception ===<br />
<br />
[[File:CIFAR-inception.png]]<br />
<br />
AmbientGAN is faster to train and more robust even on more complex distributions such as CIFAR-10. Similar trends were observed on the CIFAR-10 data, where AmbientGAN maintains a relatively stable inception score as the block probability is increased.<br />
<br />
=== Robustness To Measurement Model ===<br />
<br />
In order to empirically gauge robustness to measurement modelling error, the authors used the block-pixels measurement model: the image dataset was computed with <math> p^* = 0.5 </math>, and several versions of the model were trained, each using different values of blocking probability <math> p </math>. The inception scores were calculated and plotted as a function of <math> p </math>. This is shown on the left below:<br />
<br />
[[File:robustnessambientgan.png | 800px]]<br />
<br />
The authors observe that the inception score peaks when the model uses the correct probability, but decreases smoothly as the probability moves away, demonstrating some robustness.<br />
<br />
=== Compressed Sensing ===<br />
<br />
As described in Bora et al. (2017), generative models were found to outperform sparsity-based approaches in compressed sensing. Using this knowledge, the generator from AmbientGAN can be tested against Lasso to determine the number of measurements required to minimize the reconstruction error. As shown on the right of Figure 16, AmbientGAN outperforms Lasso with a fraction of the number of measurements.<br />
<br />
= Theoretical Results =<br />
<br />
The theoretical results in the paper prove that the true underlying distribution <math>p_x^r</math> can be recovered when the data comes from the Gaussian-Projection, Convolve+Noise, or Block-Pixels measurement models. They do this by showing that the distribution of the measurements <math>p_y^r</math> corresponds to a unique distribution <math>p_x^r</math>. Thus, even when the measurement itself is non-invertible, the effect of the measurement on the distribution <math>p_x^r</math> is invertible. Lemma 5.1 ensures this is sufficient to provide the AmbientGAN training process with a consistency guarantee. For full proofs of the results, please see Appendix A of the paper. <br />
<br />
=== Lemma 5.1 === <br />
Let <math>p_x^r</math> be the true data distribution, and <math>p_\theta</math> be the distribution over the parameters of the measurement function. Let <math>p_y^r</math> be the induced measurement distribution. <br />
<br />
Assume for <math>p_\theta</math> there is a unique probability distribution <math>p_x^r</math> that induces <math>p_y^r</math>. <br />
<br />
Then for the standard GAN model if the discriminator <math>D</math> is optimal such that <math>D(\cdot) = \frac{p_y^r(\cdot)}{p_y^r(\cdot) + p_y^g(\cdot)}</math>, then a generator <math>G</math> is optimal if and only if <math>p_x^g = p_x^r</math>. <br />
<br />
=== Theorem 5.2 ===<br />
For the Gaussian-Projection measurement model, there is a unique underlying distribution <math>p_x^{r} </math> that can induce the observed measurement distribution <math>p_y^{r} </math>.<br />
<br />
=== Theorem 5.3 ===<br />
Let <math> \mathcal{F} (\cdot) </math> denote the Fourier transform and let <math>\operatorname{supp} (\cdot) </math> be the support of a function. Consider the Convolve+Noise measurement model with the convolution kernel <math> k </math> and additive noise distribution <math>p_\theta </math>. If <math> \operatorname{supp}( \mathcal{F} (k))^{c}=\emptyset </math> and <math> \operatorname{supp}( \mathcal{F} (p_\theta))^{c}=\emptyset </math>, then there is a unique distribution <math>p_x^{r} </math> that can induce the measurement distribution <math>p_y^{r} </math>.<br />
<br />
=== Theorem 5.4 ===<br />
Assume that each image pixel takes values in a finite set <math>P</math>. Thus <math>x \in P^n \subset \mathbb{R}^{n} </math>. Assume <math>0 \in P </math>, and consider the Block-Pixels measurement model with <math>p </math> being the probability of blocking a pixel. If <math>p <1</math>, then there is a unique distribution <math>p_x^{r} </math> that can induce the measurement distribution <math>p_y^{r} </math>. Further, for any <math> \epsilon > 0, \delta \in (0, 1] </math>, given a dataset of<br />
\begin{equation}<br />
s=\Omega \left( \frac{|P|^{2n}}{(1-p)^{2n} \epsilon^{2}} \log \left( \frac{|P|^{n}}{\delta} \right) \right)<br />
\end{equation}<br />
IID measurement samples from <math>p_y^r</math>, if the discriminator <math>D</math> is optimal, then with probability <math> \geq 1 - \delta </math> over the dataset, any optimal generator <math>G</math> must satisfy <math> d_{TV} \left( p^g_x , p^r_x \right) \leq \epsilon </math>, where <math> d_{TV} \left( \cdot, \cdot \right) </math> is the total variation distance.<br />
<br />
= Conclusion =<br />
Generative models are powerful tools, but constructing a generative model requires a large, high quality dataset of the distribution of interest. The authors show how to relax this requirement, by learning a distribution from a dataset that only contains incomplete, noisy measurements of the distribution. This allows for the construction of new generative models of distributions for which no high quality dataset exists.<br />
<br />
= Future Research =<br />
<br />
One critical weakness of AmbientGAN is the assumption that the measurement model is known and that <math>f_\theta</math> is differentiable. It would be valuable to be able to train an AmbientGAN model when the measurement model is unknown but a small sample of unmeasured data is available, or at the very least to remove the differentiability restriction on <math>f_\theta</math>.<br />
<br />
A related piece of work is [https://arxiv.org/abs/1802.01284 here]. In particular, Algorithm 2 in the paper excluding the discriminator is similar to AmbientGAN.<br />
<br />
=Open Source Code=<br />
An implementation of Ambient GAN can be found here: https://github.com/AshishBora/ambient-gan.<br />
<br />
= References =<br />
# https://openreview.net/forum?id=Hy7fDog0b<br />
# Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.<br />
# Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.<br />
# Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv:1312.6114, 2013.<br />
# Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016a.<br />
# Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.<br />
# Matheus Gadelha, Subhransu Maji, and Rui Wang. 3d shape induction from 2d views of multiple objects. arXiv preprint arXiv:1612.05872, 2016.<br />
# Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G Dimakis. Compressed sensing using generative models. arXiv preprint arXiv:1703.03208, 2017.</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/AmbientGAN:_Generative_Models_from_Lossy_Measurements&diff=36317stat946w18/AmbientGAN: Generative Models from Lossy Measurements2018-04-18T16:14:09Z<p>Ws2chen: /* Contributions */</p>
<hr />
<div>= Introduction =<br />
Generative Adversarial Networks operate by simulating complex distributions but training them requires access to large amounts of high quality data. Often, we only have access to noisy or partial observations, which will, from here on, be referred to as measurements of the true data. If we know the measurement function and would like to train a generative model for the true data, there are several ways to continue which have varying degrees of success. We will use noisy MNIST data as an illustrative example, and show the results of 1. ignoring the problem, 2. trying to recover the lost information, and 3. using AmbientGAN as a way to recover the true data distribution. Suppose we only see MNIST data that has been run through a Gaussian kernel (blurred) with some noise from a <math>N(0, 0.5^2)</math> distribution added to each pixel:<br />
<br />
<gallery mode="packed"><br />
File:mnist.png| True Data (Unobserved)<br />
File:mnistmeasured.png| Measured Data (Observed)<br />
</gallery><br />
<br />
<br />
=== Ignore the problem ===<br />
[[File:GANignore.png|500px]] [[File:mnistignore.png|300px]]<br />
<br />
Train a generative model directly on the measured data. Such a model learns the distribution of the measurements, so it cannot generate samples from the true, unmeasured distribution. <br />
<br />
<br />
=== Try to recover the information lost ===<br />
[[File:GANrecovery.png|420px]] [[File:mnistrecover.png|300px]]<br />
<br />
Works better than ignoring the problem but depends on how easily the measurement function can be inverted.<br />
<br />
=== AmbientGAN ===<br />
[[File:GANambient.png|500px]] [[File:mnistambient.png|300px]]<br />
<br />
Ashish Bora, Eric Price and Alexandros G. Dimakis propose AmbientGAN as a way to recover the true underlying distribution from measurements of the true data. AmbientGAN trains a generator whose outputs, after being passed through the measurement function, attempt to fool the discriminator. The discriminator must distinguish between real and generated measurements. The paper was published at ICLR 2018.<br />
<br />
== Contributions ==<br />
The paper makes the following contributions: <br />
<br />
=== Theoretical Contribution ===<br />
The authors show that the distribution of measured images uniquely determines the distribution of original images. This implies that a pure Nash equilibrium for the GAN game must find a generative model that matches the true distribution. They show similar results for a dropout measurement model, where each pixel is set to zero with some probability p, and a random projection measurement model, where they observe the inner product of the image with a random Gaussian vector.<br />
<br />
The authors also state theorems showing that the required uniqueness assumption is satisfied under the Gaussian-Projection, Convolve+Noise and Block-Pixels measurement models, and hence that the true underlying distribution can be recovered with the AmbientGAN framework. <br />
<br />
=== Empirical Contribution ===<br />
The authors consider the CelebA and MNIST datasets under several measurement models and show that AmbientGAN recovers much of the underlying structure.<br />
<br />
= Related Work = <br />
Currently there exist two distinct approaches for constructing neural network based generative models: autoregressive [4,5] and adversarial [6] methods. The adversarial approach has been shown to be very successful in modeling complex data distributions such as images, 3D models, state-action distributions and many more. This paper is related to the work in [7], where the authors create 3D object shapes from a dataset of 2D projections. This paper states that the work in [7] is a special case of the AmbientGAN framework in which the measurement process creates 2D projections using weighted sums of voxel occupancies.<br />
<br />
= Datasets and Model Architectures=<br />
We used three datasets for our experiments: MNIST, CelebA and CIFAR-10. We briefly describe the generative models used for the experiments. For the MNIST dataset, we use two GAN models. The first model is a conditional DCGAN, while the second model is an unconditional Wasserstein GAN with gradient penalty (WGANGP). For the CelebA dataset, we use an unconditional DCGAN. For the CIFAR-10 dataset, we use an Auxiliary Classifier Wasserstein GAN with gradient penalty (ACWGANGP). For measurements with 2D outputs, i.e. Block-Pixels, Block-Patch, Keep-Patch, Extract-Patch, and Convolve+Noise, we use the same discriminator architectures as in the original work. For 1D projections, i.e. Pad-Rotate-Project and Pad-Rotate-Project-θ, we use fully connected discriminators. The architecture of the fully connected discriminator used for the MNIST dataset was 25-25-1 and for the CelebA dataset was 100-100-1.<br />
<br />
= Model =<br />
In the following, superscript <math>r</math> denotes the true (real) distributions while superscript <math>g</math> denotes the generated distributions. Let <math>x</math> denote a point in the underlying space and <math>y</math> a measurement.<br />
<br />
Thus, <math>p_x^r</math> is the real underlying distribution over <math>\mathbb{R}^n</math> that we are interested in. However, we assume that our (known) measurement functions <math>f_\theta: \mathbb{R}^n \to \mathbb{R}^m</math> are parameterized by <math>\Theta \sim p_\theta</math>, so we only observe <math>Y = f_\Theta(X) \sim p_y^r</math>, where <math>X \sim p_x^r</math> and <math>p_y^r</math> is a distribution over the measurements <math>y</math>.<br />
<br />
Mirroring the standard GAN setup we let <math>Z \in \mathbb{R}^k, Z \sim p_z</math> and <math>\Theta \sim p_\theta</math> be random variables coming from a distribution that is easy to sample. <br />
<br />
If we have a generator <math>G: \mathbb{R}^k \to \mathbb{R}^n</math> then we can generate <math>X^g = G(Z)</math>, which has distribution <math>p_x^g</math>, and a measurement <math>Y^g = f_\Theta(G(Z))</math>, which has distribution <math>p_y^g</math>. <br />
<br />
Unfortunately, we do not observe any <math>X^r \sim p_x^r</math>, so we cannot use the discriminator directly on <math>G(Z)</math> to train the generator. Instead we use the discriminator to distinguish between <math>Y^g = f_\Theta(G(Z))</math> and <math>Y^r</math>. That is, we train the discriminator <math>D: \mathbb{R}^m \to \mathbb{R}</math> to detect whether a measurement came from <math>p_y^r</math> or <math>p_y^g</math>.<br />
<br />
AmbientGAN has the objective function:<br />
<br />
\begin{align}<br />
\min_G \max_D \mathbb{E}_{Y^r \sim p_y^r}[q(D(Y^r))] + \mathbb{E}_{Z \sim p_z, \Theta \sim p_\theta}[q(1 - D(f_\Theta(G(Z))))]<br />
\end{align}<br />
<br />
where <math>q(\cdot)</math> is the quality function; for the standard GAN <math>q(x) = \log(x)</math> and for the Wasserstein GAN <math>q(x) = x</math>.<br />
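<br />
The objective can be estimated from a batch of discriminator outputs. The following is a minimal sketch, not the authors' implementation: the function name <code>ambient_gan_objective</code> and the example discriminator outputs are assumptions made for illustration, and the discriminator is assumed to have already been applied to real measurements and to measured generator samples.<br />

```python
import math


def ambient_gan_objective(d_real, d_fake, q=math.log):
    """Monte Carlo estimate of E[q(D(Y^r))] + E[q(1 - D(f_Theta(G(Z))))].

    d_real: discriminator outputs on real measurements Y^r
    d_fake: discriminator outputs on measured generator samples f_Theta(G(Z))
    q:      quality function (math.log for the standard GAN, identity for WGAN)
    """
    real_term = sum(q(d) for d in d_real) / len(d_real)
    fake_term = sum(q(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


# Hypothetical discriminator outputs, chosen only for illustration.
d_real = [0.9, 0.8, 0.95]
d_fake = [0.1, 0.3, 0.2]

standard = ambient_gan_objective(d_real, d_fake)                 # q(x) = log(x)
wasserstein = ambient_gan_objective(d_real, d_fake, q=lambda x: x)  # q(x) = x
```

The discriminator ascends this value while the generator descends it. With <math>q(x) = \log(x)</math> the discriminator outputs must lie in <math>(0, 1)</math>; with <math>q(x) = x</math> no such restriction applies.<br />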
<br />
As a technical limitation we require <math>f_\theta</math> to be differentiable with respect to each input for all values of <math>\theta</math>.<br />
<br />
With this setup we sample <math>Z \sim p_z</math>, <math>\Theta \sim p_\theta</math>, and <math>Y^r \sim U\{y_1, \cdots, y_s\}</math> in each iteration and use them to compute the stochastic gradients of the objective function. We alternate between updating <math>G</math> and updating <math>D</math>.<br />
<br />
= Empirical Results =<br />
<br />
The paper continues to present results of AmbientGAN under various measurement functions when compared to baseline models. We have already seen one example in the introduction: a comparison of AmbientGAN in the Convolve + Noise Measurement case compared to the ignore-baseline, and the unmeasure-baseline. <br />
<br />
=== Convolve + Noise ===<br />
Additional results are shown for the Convolve + Noise case on the CelebA dataset. AmbientGAN is compared to a baseline that uses Wiener deconvolution, and it produces clearly better samples in this case. The measurement is created using a Gaussian kernel and IID Gaussian noise, with <math>f_{\Theta}(x) = k*x + \Theta</math>, where <math>*</math> is the convolution operation, <math>k</math> is the convolution kernel, and <math>\Theta \sim p_{\theta}</math> is the additive noise.<br />
<br />
[[File:paper7_fig3.png]]<br />
<br />
Images that have undergone the Convolve + Noise transformation (left). Results with Wiener deconvolution (middle). Results with AmbientGAN (right).<br />
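<br />
The Convolve + Noise measurement <math>f_{\Theta}(x) = k*x + \Theta</math> can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: the helper names, the 3x3 kernel size, the zero padding at the borders and the noise level are choices made here, not values taken from the paper.<br />

```python
import math
import random


def gaussian_kernel(size=3, sigma=1.0):
    """Normalized size x size Gaussian kernel (size is assumed to be odd)."""
    half = size // 2
    k = [[math.exp(-(i * i + j * j) / (2 * sigma ** 2))
          for j in range(-half, half + 1)]
         for i in range(-half, half + 1)]
    total = sum(map(sum, k))
    return [[v / total for v in row] for row in k]


def convolve_noise(x, kernel, noise_std, rng):
    """Blur image x with the kernel, then add IID Gaussian noise per pixel.

    The kernel is symmetric, so correlation and convolution coincide.
    Pixels outside the image are treated as zero (an assumption).
    """
    h, w = len(x), len(x[0])
    half = len(kernel) // 2
    y = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i, krow in enumerate(kernel):
                for j, kv in enumerate(krow):
                    rr, cc = r + i - half, c + j - half
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kv * x[rr][cc]
            y[r][c] = acc + rng.gauss(0.0, noise_std)
    return y


rng = random.Random(0)
image = [[1.0] * 5 for _ in range(5)]          # toy 5x5 "image"
measured = convolve_noise(image, gaussian_kernel(), noise_std=0.5, rng=rng)
```

Because every step is a sum or product of the input pixels, the measurement is differentiable with respect to <math>x</math>, which is what AmbientGAN needs in order to backpropagate through it.<br />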
<br />
=== Block-Pixels ===<br />
With the block-pixels measurement function each pixel is independently set to 0 with probability <math>p</math>.<br />
<br />
[[File:block-pixels.png]]<br />
<br />
Measurements from the CelebA dataset with <math>p=0.95</math> (left). Images generated from a GAN trained on data unmeasured via blurring (middle). Results generated from AmbientGAN (right).<br />
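<br />
The Block-Pixels measurement is simple enough to write in a few lines. The sketch below assumes an image stored as a list of rows; the function name <code>block_pixels</code> is chosen here for illustration.<br />

```python
import random


def block_pixels(x, p, rng):
    """Set each pixel of image x to 0 independently with probability p."""
    return [[0.0 if rng.random() < p else v for v in row] for row in x]


rng = random.Random(0)
image = [[0.2, 0.7, 1.0], [0.5, 0.9, 0.3]]   # toy image, values are arbitrary
measured = block_pixels(image, p=0.95, rng=rng)
```

Each surviving pixel keeps its original value, so the measurement is a pixelwise product of the image with a random binary mask.<br />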
<br />
=== Block-Patch ===<br />
<br />
[[File:block-patch.png]]<br />
<br />
A random 14x14 patch is set to zero (left). Unmeasured using Navier-Stokes inpainting (middle). AmbientGAN (right). <br />
<br />
=== Pad-Rotate-Project-<math>\theta</math> ===<br />
<br />
[[File:pad-rotate-project-theta.png]]<br />
<br />
Results generated by AmbientGAN where the measurement function zero-pads the image, rotates it by <math>\theta</math>, and projects it onto the x-axis. For each measurement the value of <math>\theta</math> is known. <br />
<br />
The generated images have only the basic features of a face, and the paper refers to this as a failure case. However, AmbientGAN performs relatively well given how lossy the measurement function is. <br />
<br />
For the Keep-Patch measurement model, no pixels outside a box are known, and thus inpainting methods are not suitable. For the Pad-Rotate-Project-θ measurements, a conventional technique is to sample many angles and use techniques for inverting the Radon transform. However, since only a few projections are observed at a time, these methods are not readily applicable, so it is unclear how to obtain an approximate inverse function. Results for the Keep-Patch measurement model are shown below. <br />
<br />
[[File:keep-patch.png]]<br />
<br />
=== Explanation of Inception Score ===<br />
To evaluate GAN performance, the authors use the inception score, a metric introduced by Salimans et al. (2016). A pre-trained Inception classification model (Szegedy et al. 2016) is applied to each generated datapoint, and the KL divergence between the label distribution conditional on that datapoint and the marginal label distribution is computed. The inception score is the exponential of this KL divergence averaged over the generated datapoints. The idea is that meaningful images should be recognized by the Inception model as belonging to some class, so the conditional distribution should have low entropy, while the model should produce a variety of images, so the marginal should have high entropy. Thus an effective GAN should have a high inception score.<br />
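<br />
Following Salimans et al. (2016), the score over a set of generated images is reported as the exponential of the average KL divergence between each conditional label distribution and the marginal. The sketch below assumes the classifier outputs are already available as lists of class probabilities; it does not run an Inception network, and the function name is chosen here for illustration.<br />

```python
import math


def inception_score(cond_probs):
    """exp of the mean KL(p(y|x) || p(y)) over the given datapoints.

    cond_probs: one label distribution p(y|x) per generated datapoint.
    """
    n = len(cond_probs)
    k = len(cond_probs[0])
    # Marginal label distribution p(y), averaged over the datapoints.
    marginal = [sum(p[c] for p in cond_probs) / n for c in range(k)]
    mean_kl = 0.0
    for p in cond_probs:
        # Terms with p(y|x) = 0 contribute nothing to the KL divergence.
        mean_kl += sum(pc * math.log(pc / marginal[c])
                       for c, pc in enumerate(p) if pc > 0) / n
    return math.exp(mean_kl)


confident_and_diverse = inception_score([[1.0, 0.0], [0.0, 1.0]])
uninformative = inception_score([[0.5, 0.5], [0.5, 0.5]])
```

With two classes, confident and evenly spread predictions reach the maximum score of 2, while uniform predictions give the minimum score of 1.<br />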
<br />
=== MNIST Inception ===<br />
<br />
[[File:MNIST-inception.png]]<br />
<br />
AmbientGAN was compared with baselines by training several models with different probabilities <math>p</math> of blocking pixels. The plot on the left shows how the inception score changes with the block probability <math>p</math>. All four models perform similarly when no pixels are blocked <math>(p=0)</math>. As the blocking probability increases, the AmbientGAN models remain relatively stable and outperform the baseline models. AmbientGAN is therefore more robust than the baselines.<br />
<br />
The plot on the right shows how the inception score changes as the standard deviation of the additive Gaussian noise increases. The baselines perform better when the noise is small. As the variance increases, the AmbientGAN models perform much better than the baseline models. Furthermore, AmbientGAN retains high inception scores as the measurements become increasingly lossy.<br />
<br />
For 1D projections, the Pad-Rotate-Project model achieved an inception score of 4.18. The Pad-Rotate-Project-θ model achieved an inception score of 8.12, which is close to the vanilla GAN score of 8.99.<br />
<br />
=== CIFAR-10 Inception ===<br />
<br />
[[File:CIFAR-inception.png]]<br />
<br />
AmbientGAN is faster to train and more robust even on more complex distributions such as CIFAR-10. Similar trends were observed on the CIFAR-10 data, where AmbientGAN maintains a relatively stable inception score as the block probability is increased.<br />
<br />
=== Robustness To Measurement Model ===<br />
<br />
In order to empirically gauge robustness to measurement modelling error, the authors used the block-pixels measurement model: the image dataset was computed with <math> p^* = 0.5 </math>, and several versions of the model were trained, each using different values of blocking probability <math> p </math>. The inception scores were calculated and plotted as a function of <math> p </math>. This is shown on the left below:<br />
<br />
[[File:robustnessambientgan.png | 800px]]<br />
<br />
The authors observe that the inception score peaks when the model uses the correct probability, but decreases smoothly as the probability moves away, demonstrating some robustness.<br />
<br />
=== Compressed Sensing ===<br />
<br />
As described in Bora et al. (2017), generative models were found to outperform sparsity-based approaches in compressed sensing. Using this knowledge, the generator from AmbientGAN can be tested against Lasso to determine the number of measurements required to minimize the reconstruction error. As shown on the right of Figure 16, AmbientGAN outperforms Lasso with a fraction of the number of measurements.<br />
<br />
= Theoretical Results =<br />
<br />
The theoretical results in the paper prove that the true underlying distribution <math>p_x^r</math> can be recovered when the data comes from the Gaussian-Projection, Convolve+Noise, or Block-Pixels measurement models. They do this by showing that the distribution of the measurements <math>p_y^r</math> corresponds to a unique distribution <math>p_x^r</math>. Thus, even when the measurement itself is non-invertible, the effect of the measurement on the distribution <math>p_x^r</math> is invertible. Lemma 5.1 ensures this is sufficient to provide the AmbientGAN training process with a consistency guarantee. For full proofs of the results, please see Appendix A of the paper. <br />
<br />
=== Lemma 5.1 === <br />
Let <math>p_x^r</math> be the true data distribution, and <math>p_\theta</math> be the distribution over the parameters of the measurement function. Let <math>p_y^r</math> be the induced measurement distribution. <br />
<br />
Assume for <math>p_\theta</math> there is a unique probability distribution <math>p_x^r</math> that induces <math>p_y^r</math>. <br />
<br />
Then for the standard GAN model if the discriminator <math>D</math> is optimal such that <math>D(\cdot) = \frac{p_y^r(\cdot)}{p_y^r(\cdot) + p_y^g(\cdot)}</math>, then a generator <math>G</math> is optimal if and only if <math>p_x^g = p_x^r</math>. <br />
<br />
=== Theorem 5.2 ===<br />
For the Gaussian-Projection measurement model, there is a unique underlying distribution <math>p_x^{r} </math> that can induce the observed measurement distribution <math>p_y^{r} </math>.<br />
<br />
=== Theorem 5.3 ===<br />
Let <math> \mathcal{F} (\cdot) </math> denote the Fourier transform and let <math>\operatorname{supp} (\cdot) </math> be the support of a function. Consider the Convolve+Noise measurement model with the convolution kernel <math> k </math> and additive noise distribution <math>p_\theta </math>. If <math> \operatorname{supp}( \mathcal{F} (k))^{c}=\emptyset </math> and <math> \operatorname{supp}( \mathcal{F} (p_\theta))^{c}=\emptyset </math>, then there is a unique distribution <math>p_x^{r} </math> that can induce the measurement distribution <math>p_y^{r} </math>.<br />
<br />
=== Theorem 5.4 ===<br />
Assume that each image pixel takes values in a finite set <math>P</math>. Thus <math>x \in P^n \subset \mathbb{R}^{n} </math>. Assume <math>0 \in P </math>, and consider the Block-Pixels measurement model with <math>p </math> being the probability of blocking a pixel. If <math>p <1</math>, then there is a unique distribution <math>p_x^{r} </math> that can induce the measurement distribution <math>p_y^{r} </math>. Further, for any <math> \epsilon > 0, \delta \in (0, 1] </math>, given a dataset of<br />
\begin{equation}<br />
s=\Omega \left( \frac{|P|^{2n}}{(1-p)^{2n} \epsilon^{2}} \log \left( \frac{|P|^{n}}{\delta} \right) \right)<br />
\end{equation}<br />
IID measurement samples from <math>p_y^r</math>, if the discriminator <math>D</math> is optimal, then with probability <math> \geq 1 - \delta </math> over the dataset, any optimal generator <math>G</math> must satisfy <math> d_{TV} \left( p^g_x , p^r_x \right) \leq \epsilon </math>, where <math> d_{TV} \left( \cdot, \cdot \right) </math> is the total variation distance.<br />
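<br />
The uniqueness claim for Block-Pixels can be illustrated in the simplest possible setting, which is an assumption made here and not an example from the paper: a single binary pixel, <math>n = 1</math> and <math>P = \{0, 1\}</math>. A pixel is observed as 1 only if it was 1 and was not blocked, so <math>P(y = 1) = (1 - p) P(x = 1)</math>, and the true rate is recovered by dividing by <math>1 - p</math>. The simulation below checks this numerically; all names and parameter values are illustrative.<br />

```python
import random


def recover_pixel_rate(measurements, p):
    """Estimate P(x = 1) from Block-Pixels measurements of a binary pixel."""
    observed_rate = sum(measurements) / len(measurements)
    return observed_rate / (1.0 - p)


rng = random.Random(0)
true_rate, p, s = 0.3, 0.5, 200_000            # illustrative values

# Draw true pixels, then block each one independently with probability p.
xs = [1 if rng.random() < true_rate else 0 for _ in range(s)]
ys = [0 if rng.random() < p else x for x in xs]

estimate = recover_pixel_rate(ys, p)           # close to true_rate
```

The division by <math>1 - p</math> also hints at why the sample bound grows as <math>p</math> approaches 1: fewer unblocked pixels are observed, so more measurements are needed for the same accuracy.<br />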
<br />
= Conclusion =<br />
Generative models are powerful tools, but constructing a generative model requires a large, high quality dataset of the distribution of interest. The authors show how to relax this requirement, by learning a distribution from a dataset that only contains incomplete, noisy measurements of the distribution. This allows for the construction of new generative models of distributions for which no high quality dataset exists.<br />
<br />
= Future Research =<br />
<br />
One critical weakness of AmbientGAN is the assumption that the measurement model is known and that <math>f_\theta</math> is differentiable. It would be valuable to be able to train an AmbientGAN model when the measurement model is unknown but a small sample of unmeasured data is available, or at the very least to remove the differentiability restriction on <math>f_\theta</math>.<br />
<br />
A related piece of work is [https://arxiv.org/abs/1802.01284 here]. In particular, Algorithm 2 in the paper excluding the discriminator is similar to AmbientGAN.<br />
<br />
=Open Source Code=<br />
An implementation of Ambient GAN can be found here: https://github.com/AshishBora/ambient-gan.<br />
<br />
= References =<br />
# https://openreview.net/forum?id=Hy7fDog0b<br />
# Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.<br />
# Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.<br />
# Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv:1312.6114, 2013.<br />
# Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016a.<br />
# Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.<br />
# Matheus Gadelha, Subhransu Maji, and Rui Wang. 3d shape induction from 2d views of multiple objects. arXiv preprint arXiv:1612.05872, 2016.<br />
# Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G Dimakis. Compressed sensing using generative models. arXiv preprint arXiv:1703.03208, 2017.</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/AmbientGAN:_Generative_Models_from_Lossy_Measurements&diff=36316stat946w18/AmbientGAN: Generative Models from Lossy Measurements2018-04-18T15:39:22Z<p>Ws2chen: /* Theoretical Contribution */</p>
<hr />
<div>= Introduction =<br />
Generative Adversarial Networks operate by simulating complex distributions but training them requires access to large amounts of high quality data. Often, we only have access to noisy or partial observations, which will, from here on, be referred to as measurements of the true data. If we know the measurement function and would like to train a generative model for the true data, there are several ways to continue which have varying degrees of success. We will use noisy MNIST data as an illustrative example, and show the results of 1. ignoring the problem, 2. trying to recover the lost information, and 3. using AmbientGAN as a way to recover the true data distribution. Suppose we only see MNIST data that has been run through a Gaussian kernel (blurred) with some noise from a <math>N(0, 0.5^2)</math> distribution added to each pixel:<br />
<br />
<gallery mode="packed"><br />
File:mnist.png| True Data (Unobserved)<br />
File:mnistmeasured.png| Measured Data (Observed)<br />
</gallery><br />
<br />
<br />
=== Ignore the problem ===<br />
[[File:GANignore.png|500px]] [[File:mnistignore.png|300px]]<br />
<br />
Train a generative model directly on the measured data. Such a model learns the distribution of the measurements, so it cannot generate samples from the true, unmeasured distribution. <br />
<br />
<br />
=== Try to recover the information lost ===<br />
[[File:GANrecovery.png|420px]] [[File:mnistrecover.png|300px]]<br />
<br />
Works better than ignoring the problem but depends on how easily the measurement function can be inverted.<br />
<br />
=== AmbientGAN ===<br />
[[File:GANambient.png|500px]] [[File:mnistambient.png|300px]]<br />
<br />
Ashish Bora, Eric Price and Alexandros G. Dimakis propose AmbientGAN as a way to recover the true underlying distribution from measurements of the true data. AmbientGAN trains a generator whose outputs, after being passed through the measurement function, attempt to fool the discriminator. The discriminator must distinguish between real and generated measurements. The paper was published at ICLR 2018.<br />
<br />
== Contributions ==<br />
The paper makes the following contributions: <br />
<br />
=== Theoretical Contribution ===<br />
The authors show that the distribution of measured images uniquely determines the distribution of original images. This implies that a pure Nash equilibrium for the GAN game must find a generative model that matches the true distribution. They show similar results for a dropout measurement model, where each pixel is set to zero with some probability p, and a random projection measurement model, where they observe the inner product of the image with a random Gaussian vector.<br />
<br />
The authors also state theorems showing that the required uniqueness assumption is satisfied under the Gaussian-Projection, Convolve+Noise and Block-Pixels measurement models, and hence that the true underlying distribution can be recovered with the AmbientGAN framework.<br />
<br />
=== Empirical Contribution ===<br />
The authors consider the CelebA and MNIST datasets under several measurement models and show that AmbientGAN recovers much of the underlying structure.<br />
<br />
= Related Work = <br />
Currently there exist two distinct approaches for constructing neural network based generative models: autoregressive [4,5] and adversarial [6] methods. The adversarial approach has been shown to be very successful in modeling complex data distributions such as images, 3D models, state-action distributions and many more. This paper is related to the work in [7], where the authors create 3D object shapes from a dataset of 2D projections. This paper states that the work in [7] is a special case of the AmbientGAN framework in which the measurement process creates 2D projections using weighted sums of voxel occupancies.<br />
<br />
= Datasets and Model Architectures=<br />
We used three datasets for our experiments: MNIST, CelebA and CIFAR-10. We briefly describe the generative models used for the experiments. For the MNIST dataset, we use two GAN models. The first model is a conditional DCGAN, while the second model is an unconditional Wasserstein GAN with gradient penalty (WGANGP). For the CelebA dataset, we use an unconditional DCGAN. For the CIFAR-10 dataset, we use an Auxiliary Classifier Wasserstein GAN with gradient penalty (ACWGANGP). For measurements with 2D outputs, i.e. Block-Pixels, Block-Patch, Keep-Patch, Extract-Patch, and Convolve+Noise, we use the same discriminator architectures as in the original work. For 1D projections, i.e. Pad-Rotate-Project and Pad-Rotate-Project-θ, we use fully connected discriminators. The architecture of the fully connected discriminator used for the MNIST dataset was 25-25-1 and for the CelebA dataset was 100-100-1.<br />
<br />
= Model =<br />
For the following variables, superscript <math>r</math> represents the true distributions while superscript <math>g</math> represents the generated distributions. Let <math>x</math> denote a point in the underlying space and <math>y</math> a measurement.<br />
<br />
Thus, <math>p_x^r</math> is the real underlying distribution over <math>\mathbb{R}^n</math> that we are interested in. However if we assume that our (known) measurement functions, <math>f_\theta: \mathbb{R}^n \to \mathbb{R}^m</math> are parameterized by <math>\Theta \sim p_\theta</math>, we can then observe <math>Y = f_\theta(x) \sim p_y^r</math> where <math>p_y^r</math> is a distribution over the measurements <math>y</math>.<br />
<br />
Mirroring the standard GAN setup we let <math>Z \in \mathbb{R}^k, Z \sim p_z</math> and <math>\Theta \sim p_\theta</math> be random variables coming from a distribution that is easy to sample. <br />
<br />
If we have a generator <math>G: \mathbb{R}^k \to \mathbb{R}^n</math> then we can generate <math>X^g = G(Z)</math>, which has distribution <math>p_x^g</math>, and a measurement <math>Y^g = f_\Theta(G(Z))</math>, which has distribution <math>p_y^g</math>. <br />
<br />
Unfortunately, we do not observe any samples <math>X^r \sim p_x^r</math>, so we cannot use the discriminator directly on <math>G(Z)</math> to train the generator. Instead we use the discriminator to distinguish between <math>Y^g = f_\Theta(G(Z))</math> and <math>Y^r</math>. That is, we train the discriminator <math>D: \mathbb{R}^m \to \mathbb{R}</math> to detect whether a measurement came from <math>p_y^r</math> or <math>p_y^g</math>.<br />
<br />
AmbientGAN has the objective function:<br />
<br />
\begin{align}<br />
\min_G \max_D \mathbb{E}_{Y^r \sim p_y^r}[q(D(Y^r))] + \mathbb{E}_{Z \sim p_z, \Theta \sim p_\theta}[q(1 - D(f_\Theta(G(Z))))]<br />
\end{align}<br />
<br />
where <math>q(\cdot)</math> is the quality function; for the standard GAN <math>q(x) = \log(x)</math> and for Wasserstein GAN <math>q(x) = x</math>.<br />
<br />
As a technical limitation we require <math>f_\theta</math> to be differentiable with respect to each input for all values of <math>\theta</math>.<br />
<br />
With this set up we sample <math>Z \sim p_z</math>, <math>\Theta \sim p_\theta</math>, and <math>Y^r \sim U\{y_1, \cdots, y_s\}</math> each iteration and use them to compute the stochastic gradients of the objective function. We alternate between updating <math>G</math> and updating <math>D</math>.<br />
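<br />
The sampling step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: <code>toy_generator</code> and <code>toy_discriminator</code> are hypothetical stand-ins for the neural networks, the measurement is Block-Pixels, and the quality function is that of the standard GAN. The function only evaluates a Monte Carlo estimate of the objective; the gradient updates of <math>G</math> and <math>D</math> are not shown.<br />

```python
import math
import random


def block_pixels(x, p, rng):
    """Measurement f_Theta: zero each entry of x independently with probability p."""
    return [0.0 if rng.random() < p else v for v in x]


def toy_generator(z):
    """Hypothetical generator G (an elementwise sigmoid), standing in for a network."""
    return [1.0 / (1.0 + math.exp(-zi)) for zi in z]


def toy_discriminator(y, w, b):
    """Hypothetical discriminator D: a single logistic unit with outputs in (0, 1)."""
    s = sum(wi * yi for wi, yi in zip(w, y)) + b
    return 1.0 / (1.0 + math.exp(-s))


def objective_estimate(real_measurements, batch_size, n, p, w, b, rng):
    """Monte Carlo estimate of E[log D(Y^r)] + E[log(1 - D(f_Theta(G(Z))))]."""
    total = 0.0
    for _ in range(batch_size):
        y_real = rng.choice(real_measurements)            # Y^r ~ U{y_1, ..., y_s}
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]       # Z ~ p_z
        y_fake = block_pixels(toy_generator(z), p, rng)   # Theta is drawn inside f
        total += math.log(toy_discriminator(y_real, w, b))
        total += math.log(1.0 - toy_discriminator(y_fake, w, b))
    return total / batch_size
```

In an actual implementation this estimate would be differentiated with respect to the parameters of <math>D</math> (ascent) and <math>G</math> (descent) in alternation, which is why <math>f_\theta</math> has to be differentiable with respect to its input.<br />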
<br />
= Empirical Results =<br />
<br />
The paper continues to present results of AmbientGAN under various measurement functions when compared to baseline models. We have already seen one example in the introduction: a comparison of AmbientGAN in the Convolve + Noise Measurement case compared to the ignore-baseline, and the unmeasure-baseline. <br />
<br />
=== Convolve + Noise ===<br />
This section shows additional results for the Convolve + Noise case on the CelebA dataset. AmbientGAN is compared to a baseline that uses Wiener deconvolution, and its samples are visibly better in this case. The measurement is created using a Gaussian kernel and IID Gaussian noise, with <math>f_{\Theta}(x) = k*x + \Theta</math>, where <math>*</math> is the convolution operation, <math>k</math> is the convolution kernel, and <math>\Theta \sim p_{\theta}</math> is the additive noise.<br />
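<br />
As a rough illustration of the measurement <math>f_{\Theta}(x) = k*x + \Theta</math>, the sketch below convolves a small grayscale image with a normalized Gaussian kernel and adds IID Gaussian noise. The kernel size, the zero padding at the borders, and the function names are assumptions made for this example and are not taken from the paper.<br />

```python
import math
import random


def gaussian_kernel(size, sigma):
    """Return a normalized size x size Gaussian kernel (size is assumed odd)."""
    c = size // 2
    k = [[math.exp(-((i - c) ** 2 + (j - c) ** 2) / (2.0 * sigma ** 2))
          for j in range(size)] for i in range(size)]
    total = sum(sum(row) for row in k)
    return [[v / total for v in row] for row in k]


def convolve_noise(x, kernel, noise_std, rng):
    """Compute k * x + Theta with zero padding and IID Gaussian noise Theta."""
    h, w = len(x), len(x[0])
    c = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(-c, c + 1):
                for dj in range(-c, c + 1):
                    ii, jj = i - di, j - dj
                    if 0 <= ii < h and 0 <= jj < w:   # pixels outside count as 0
                        acc += kernel[di + c][dj + c] * x[ii][jj]
            out[i][j] = acc + rng.gauss(0.0, noise_std)
    return out
```

Every operation here is differentiable with respect to the image, which is the property AmbientGAN requires of a measurement function.<br />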
<br />
[[File:paper7_fig3.png]]<br />
<br />
Images that have undergone Convolve + Noise transformations (left). Results with Wiener deconvolution (middle). Results with AmbientGAN (right).<br />
<br />
=== Block-Pixels ===<br />
With the block-pixels measurement function each pixel is independently set to 0 with probability <math>p</math>.<br />
<br />
[[File:block-pixels.png]]<br />
<br />
Measurements from the celebA dataset with <math>p=0.95</math> (left). Images generated from GAN trained on unmeasured (via blurring) data (middle). Results generated from AmbientGAN (right).<br />
<br />
=== Block-Patch ===<br />
<br />
[[File:block-patch.png]]<br />
<br />
A random 14x14 patch is set to zero (left). Unmeasured using Navier-Stokes inpainting (middle). AmbientGAN (right). <br />
<br />
=== Pad-Rotate-Project-<math>\theta</math> ===<br />
<br />
[[File:pad-rotate-project-theta.png]]<br />
<br />
Results generated by AmbientGAN where the measurement function zero-pads each image, rotates it by <math>\theta</math>, and projects it onto the x-axis. For each measurement the value of <math>\theta</math> is known. <br />
<br />
The generated images only have the basic features of a face, and this is referred to as a failure case in the paper. However, the model performs relatively well given how lossy the measurement function is. <br />
<br />
For the Keep-Patch measurement model, no pixels outside a box are known and thus inpainting methods are not suitable. For the Pad-Rotate-Project-θ measurements, a conventional technique is to sample many angles and use techniques for inverting the Radon transform. However, since only a few projections are observed at a time, these methods aren’t readily applicable, so it is unclear how to obtain an approximate inverse function. The results are shown below. <br />
<br />
[[File:keep-patch.png]]<br />
<br />
=== Explanation of Inception Score ===<br />
To evaluate GAN performance, the authors make use of the inception score, a metric introduced by Salimans et al. (2016). To evaluate the inception score, a pre-trained Inception classification model (Szegedy et al. 2016) is applied to each generated datapoint, and the KL divergence between its label distribution conditional on the datapoint and its marginal label distribution is computed. The inception score is the exponential of this KL divergence averaged over the datapoints. The idea is that meaningful images should be recognized by the Inception model as belonging to some class, and so the conditional distribution should have low entropy, while the model should produce a variety of images, so the marginal should have high entropy. Thus an effective GAN should have a high inception score.<br />
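<br />
A minimal sketch of the computation, following the definition of Salimans et al. (the exponential of the average KL divergence between the conditional and marginal label distributions). It assumes the conditional label distributions have already been obtained from a pre-trained classifier; here they are simply passed in as lists of probabilities, and the function name is chosen for this example.<br />

```python
import math


def inception_score(cond_probs):
    """Return exp of the mean KL(p(y|x) || p(y)) over the given datapoints.

    cond_probs: one row per datapoint, each row a label distribution p(y|x).
    """
    n = len(cond_probs)
    num_labels = len(cond_probs[0])
    # Marginal label distribution p(y), averaged over the datapoints.
    marginal = [sum(row[k] for row in cond_probs) / n for k in range(num_labels)]
    mean_kl = 0.0
    for row in cond_probs:
        mean_kl += sum(p * math.log(p / marginal[k])
                       for k, p in enumerate(row) if p > 0.0)
    return math.exp(mean_kl / n)
```

With this definition the score is 1 when every datapoint gets the same label distribution, and it approaches the number of classes when predictions are confident and spread evenly over the classes.<br />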
<br />
=== MNIST Inception ===<br />
<br />
[[File:MNIST-inception.png]]<br />
<br />
AmbientGAN was compared with baselines by training several models with different probabilities <math>p</math> of blocking pixels. The plot on the left shows how the inception scores change as the block probability <math>p</math> changes. All four models are similar when no pixels are blocked <math>(p=0)</math>. As the blocking probability increases, the AmbientGAN models maintain a relatively stable performance and perform better than the baseline models. AmbientGAN is therefore more robust than the baseline models.<br />
<br />
The plot on the right shows the change in inception scores as the standard deviation of the additive Gaussian noise increases. The baselines perform better when the noise is small. As the variance increases, the AmbientGAN models perform much better than the baseline models. Further, AmbientGAN retains high inception scores as measurements become more and more lossy.<br />
<br />
For 1D projections, the Pad-Rotate-Project model achieved an inception score of 4.18. The Pad-Rotate-Project-θ model achieved an inception score of 8.12, which is close to the vanilla GAN score of 8.99.<br />
<br />
=== CIFAR-10 Inception ===<br />
<br />
[[File:CIFAR-inception.png]]<br />
<br />
AmbientGAN is faster to train and more robust even on more complex distributions such as CIFAR-10. Similar trends were observed on the CIFAR-10 data, and AmbientGAN maintains a relatively stable inception score as the block probability is increased.<br />
<br />
=== Robustness To Measurement Model ===<br />
<br />
In order to empirically gauge robustness to measurement modelling error, the authors used the block-pixels measurement model: the image dataset was computed with <math> p^* = 0.5 </math>, and several versions of the model were trained, each using different values of blocking probability <math> p </math>. The inception scores were calculated and plotted as a function of <math> p </math>. This is shown on the left below:<br />
<br />
[[File:robustnessambientgan.png | 800px]]<br />
<br />
The authors observe that the inception score peaks when the model uses the correct probability, but decreases smoothly as the probability moves away, demonstrating some robustness.<br />
<br />
=== Compressed Sensing ===<br />
<br />
As described in Bora et al. (2017), generative models were found to outperform sparsity-based approaches in compressed sensing. Using this knowledge, the generator from AmbientGAN can be tested against Lasso to determine the number of measurements required to minimize the reconstruction error. As shown on the right of Figure 16, AmbientGAN outperforms Lasso using a fraction of the number of measurements.<br />
<br />
= Theoretical Results =<br />
<br />
The theoretical results in the paper prove that the true underlying distribution <math>p_x^r</math> can be recovered when the data comes from the Gaussian-Projection, Convolve+Noise, or Block-Pixels measurement models. They do this by showing that the distribution of the measurements <math>p_y^r</math> corresponds to a unique distribution <math>p_x^r</math>. Thus, even when the measurement itself is non-invertible, the effect of the measurement on the distribution <math>p_x^r</math> is invertible. Lemma 5.1 ensures this is sufficient to provide the AmbientGAN training process with a consistency guarantee. For full proofs of the results, please see Appendix A of the paper. <br />
<br />
=== Lemma 5.1 === <br />
Let <math>p_x^r</math> be the true data distribution, and <math>p_\theta</math> be the distributions over the parameters of the measurement function. Let <math>p_y^r</math> be the induced measurement distribution. <br />
<br />
Assume for <math>p_\theta</math> there is a unique probability distribution <math>p_x^r</math> that induces <math>p_y^r</math>. <br />
<br />
Then for the standard GAN model if the discriminator <math>D</math> is optimal such that <math>D(\cdot) = \frac{p_y^r(\cdot)}{p_y^r(\cdot) + p_y^g(\cdot)}</math>, then a generator <math>G</math> is optimal if and only if <math>p_x^g = p_x^r</math>. <br />
<br />
=== Theorem 5.2 ===<br />
For the Gaussian-Projection measurement model, there is a unique underlying distribution <math>p_x^{r} </math> that can induce the observed measurement distribution <math>p_y^{r} </math>.<br />
<br />
=== Theorem 5.3 ===<br />
Let <math> \mathcal{F} (\cdot) </math> denote the Fourier transform and let <math>supp (\cdot) </math> be the support of a function. Consider the Convolve+Noise measurement model with the convolution kernel <math> k </math> and additive noise distribution <math>p_\theta </math>. If <math> supp( \mathcal{F} (k))^{c}=\emptyset </math> and <math> supp( \mathcal{F} (p_\theta))^{c}=\emptyset </math>, then there is a unique distribution <math>p_x^{r} </math> that can induce the measurement distribution <math>p_y^{r} </math>.<br />
<br />
=== Theorem 5.4 ===<br />
Assume that each image pixel takes values in a finite set <math>P</math>. Thus <math>x \in P^n \subset \mathbb{R}^{n} </math>. Assume <math>0 \in P </math>, and consider the Block-Pixels measurement model with <math>p </math> being the probability of blocking a pixel. If <math>p <1</math>, then there is a unique distribution <math>p_x^{r} </math> that can induce the measurement distribution <math>p_y^{r} </math>. Further, for any <math> \epsilon > 0, \delta \in (0, 1] </math>, given a dataset of<br />
\begin{equation}<br />
s=\Omega \left( \frac{|P|^{2n}}{(1-p)^{2n} \epsilon^{2}} \log \left( \frac{|P|^{n}}{\delta} \right) \right)<br />
\end{equation}<br />
IID measurement samples from <math>p_y^r</math>, if the discriminator <math>D</math> is optimal, then with probability <math> \geq 1 - \delta </math> over the dataset, any optimal generator <math>G</math> must satisfy <math> d_{TV} \left( p^g_x , p^r_x \right) \leq \epsilon </math>, where <math> d_{TV} \left( \cdot, \cdot \right) </math> is the total variation distance.<br />
<br />
= Conclusion =<br />
Generative models are powerful tools, but constructing a generative model requires a large, high quality dataset of the distribution of interest. The authors show how to relax this requirement, by learning a distribution from a dataset that only contains incomplete, noisy measurements of the distribution. This allows for the construction of new generative models of distributions for which no high quality dataset exists.<br />
<br />
= Future Research =<br />
<br />
One critical weakness of AmbientGAN is the assumption that the measurement model is known and that <math>f_\theta</math> is differentiable. It would be nice to be able to train an AmbientGAN model when we have an unknown measurement model but also a small sample of unmeasured data, or at the very least to remove the differentiability restriction from <math>f_\theta</math>.<br />
<br />
A related piece of work is [https://arxiv.org/abs/1802.01284 here]. In particular, Algorithm 2 in the paper excluding the discriminator is similar to AmbientGAN.<br />
<br />
=Open Source Code=<br />
An implementation of Ambient GAN can be found here: https://github.com/AshishBora/ambient-gan.<br />
<br />
= References =<br />
# https://openreview.net/forum?id=Hy7fDog0b<br />
# Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.<br />
# Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.<br />
# Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv:1312.6114, 2013.<br />
# Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016a.<br />
# Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.<br />
# Matheus Gadelha, Subhransu Maji, and Rui Wang. 3d shape induction from 2d views of multiple objects. arXiv preprint arXiv:1612.05872, 2016.<br />
# Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G Dimakis. Compressed sensing using generative models. arXiv preprint arXiv:1703.03208, 2017.</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=One-Shot_Imitation_Learning&diff=36315One-Shot Imitation Learning2018-04-18T14:48:08Z<p>Ws2chen: /* Temporal Dropout */</p>
<hr />
<div>= Introduction =<br />
Robotic systems can be used for many applications, but to truly be useful for complex applications, they need to overcome two challenges: having the intent of the task at hand communicated to them, and being able to perform the manipulations necessary to complete this task. It is preferable to use demonstration to teach the robotic systems rather than natural language, as natural language may often fail to convey the details and intricacies required for the task. However, current work on learning from demonstrations is only successful with large amounts of feature engineering or a large number of demonstrations. The proposed model aims to achieve 'one-shot' imitation learning, i.e. learning to complete a new task from just a single demonstration of it without any other supervision. As input, the proposed model takes the observation of the current instance of a task, and a demonstration of successfully solving a different instance of the same task. Strong generalization was achieved by using a soft attention mechanism on both the sequence of actions and states that the demonstration consists of, as well as on the vector of element locations within the environment. The success of this proposed model at completing a series of block stacking tasks can be viewed at http://bit.ly/nips2017-oneshot.<br />
<br />
= Related Work =<br />
While one-shot imitation learning is a novel combination of ideas, each of the components has previously been studied.<br />
* Imitation Learning: <br />
** Behavioural cloning uses supervised learning to map from observations to actions (e.g. [https://papers.nips.cc/paper/95-alvinn-an-autonomous-land-vehicle-in-a-neural-network.pdf (Pomerleau 1988)], [https://arxiv.org/pdf/1011.0686.pdf (Ross et al. 2011)])<br />
** Inverse reinforcement learning estimates a reward function that considers demonstrations as optimal behavior (e.g. [http://ai.stanford.edu/~ang/papers/icml00-irl.pdf (Ng et al. 2000)])<br />
* One-Shot Learning: an object categorization problem in computer vision. Whereas most machine learning based object categorization algorithms require training on hundreds or thousands of images and very large datasets, one-shot learning aims to learn information about object categories from one, or only a few, training images.<br />
** Typically a form of meta-learning<br />
** Previously used for variety of tasks but all domain-specific<br />
** [https://arxiv.org/abs/1703.03400 (Finn et al. 2017)] proposed a generic solution but excluded imitation learning<br />
* Reinforcement Learning:<br />
** Demonstrated to work on variety of tasks and environments, in particular on games and robotic control<br />
** Requires large amount of trials and a user-specified reward function<br />
* Multi-task/Transfer Learning:<br />
** Shown to be particularly effective at computer vision tasks<br />
** Not meant for one-shot learning<br />
* Attention Modelling:<br />
** The proposed model makes use of the attention model from [https://arxiv.org/abs/1409.0473 (Bahdanau et al. 2016)]<br />
** The attention modelling over demonstration is similar in nature to the seq2seq models from the well known [https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf (Sutskever et al. 2014)]<br />
<br />
= One-Shot Imitation Learning =<br />
<br />
[[File:oneshot1.jpg|1000px]]<br />
<br />
The figure above shows the differences between traditional and one-shot imitation learning. In a), the traditional method may require training a separate policy for each of several tasks that are similar in nature, for example stacking blocks to a height of 2 and to a height of 3. In b), one-shot imitation learning allows the same policy to be used for these tasks given a single demonstration, achieving good performance without any additional system interactions. In c), the policy is trained on a set of different training tasks, with enough examples that the learned behaviour generalizes to other similar tasks. Each task has a set of successful demonstrations. Each training iteration uses two demonstrations from a task: the policy is conditioned on the first demonstration and on an observation from the second, and is trained to produce the action taken in the second demonstration.<br />
<br />
== Problem Formalization ==<br />
The problem is briefly formalized with the authors describing a distribution of tasks, an individual task, a distribution of demonstrations for this task, and a single demonstration respectively as \[T, \: t\sim T, \: D(t), \: d\sim D(t)\]<br />
In addition, an action, an observation, parameters, and a policy are respectively defined as \[a, o, \theta, \pi_\theta(a|o,d)\]<br />
In particular, a demonstration is a sequence of observation and action pairs \[d = [(o_1, a_1),(o_2, a_2), . . . ,(o_H , a_H )]\]<br />
Assuming that <math>H </math>, the length or horizon of a demonstration, and some evaluation function $$R_t(d): R^H \rightarrow R$$ are given, and that successful demonstrations are available for each task, the objective is to maximize the expected policy performance over \[t\sim T, d\sim D(t)\].<br />
<br />
== Block Stacking Tasks ==<br />
The task that the authors focus on is block stacking. A user specifies in what final configuration cubic blocks should be stacked, and the goal is to use a 7-DOF Fetch robotic arm to arrange the blocks in this configuration. The number of blocks and their desired configuration (i.e. number of towers, the height of each tower, and order of blocks within each tower) can be varied and encoded as a string. For example, 'abc def' would signify 2 towers of height 3, with block A on block B on block C in one tower, and block D on block E on block F in a second tower. To add complexity, the initial configuration of the blocks can vary and is encoded as a set of 3-dimensional vectors describing the position of each block relative to the robotic arm.<br />
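<br />
To make the string encoding concrete, the sketch below parses a layout such as 'abc def' into towers and lists the 'stack source on target' subtasks it implies. The function names and the bottom-up ordering of the subtasks are assumptions made for this illustration; the summary only fixes the meaning of the string itself.<br />

```python
def parse_layout(layout):
    """Parse a layout string such as 'abc def' into towers listed top to bottom."""
    return [list(tower.upper()) for tower in layout.split()]


def stacking_subtasks(layout):
    """List (source, target) pairs, building each tower from the bottom up."""
    subtasks = []
    for tower in parse_layout(layout):
        bottom_up = tower[::-1]
        for target, source in zip(bottom_up, bottom_up[1:]):
            subtasks.append((source, target))   # stack source on top of target
    return subtasks
```

For 'abc def' this yields two towers, and each tower of height 3 needs two stacking subtasks (B on C, then A on B for the first tower).<br />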
<br />
== Algorithm ==<br />
To avoid needing to specify a reward function, the authors use behavioral cloning and DAGGER, 2 imitation learning methods that require only demonstrations, for training. In each training step, a list of tasks is sampled, and for each, a demonstration with injected noise along with some observation-action pairs are sampled. Given the current observation and demonstration as input, the policy is trained against the sampled actions by minimizing L2 norm for continuous actions, and cross-entropy for discrete ones. Adamax is used as the optimizer with a learning rate of 0.001.<br />
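<br />
The two training losses mentioned above can be sketched as follows. This is a minimal illustration, assuming a squared L2 distance for continuous actions and a softmax cross-entropy over logits for discrete ones; the function names are hypothetical and the policy network that would produce the predictions is not shown.<br />

```python
import math


def l2_loss(predicted, target):
    """Squared L2 distance between predicted and demonstrated continuous actions."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


def cross_entropy(logits, target_index):
    """Softmax cross-entropy of the logits against the demonstrated discrete action."""
    m = max(logits)   # subtract the maximum for numerical stability
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_norm - logits[target_index]
```

In training, these losses would be averaged over the sampled observation-action pairs and minimized with the optimizer described above.<br />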
<br />
= Architecture =<br />
The authors propose a novel architecture for imitation learning, consisting of 3 networks.<br />
<br />
While, in principle, a generic neural network could learn the mapping from demonstration and current observation to the appropriate action, the authors propose the following architecture, which they claim as one of the main contributions of the paper and believe will be useful for complex tasks in the future.<br />
The proposed architecture consists of three modules: the demonstration network, the context network, and the manipulation network.<br />
<br />
[[File:oneshot2.jpg|1000px|center]]<br />
<br />
== Demonstration Network ==<br />
This network takes a demonstration as input and produces an embedding with size linearly proportional to the number of blocks and the size of the demonstration.<br />
=== Temporal Dropout ===<br />
For block stacking, demonstrations can span hundreds to thousands of time steps, and training on such long sequences is demanding in both time and memory. The authors therefore randomly discard a proportion <math>p</math> of the time steps during training (here <math>p = 95\%</math>), an operation they call 'temporal dropout'. The reduced size of the demonstrations also allows multiple downsampled trajectories to be evaluated during testing and combined into an ensemble estimate. Dilated temporal convolutions and neighborhood attention are then repeatedly applied to the downsampled demonstrations.<br />
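<br />
A minimal sketch of temporal dropout, assuming each time step is dropped independently with probability p and that one step is kept as a fallback if everything happens to be dropped. Both choices, and the function name, are assumptions for this example; the summary only states that a random subset of time steps is discarded.<br />

```python
import random


def temporal_dropout(demonstration, p, rng):
    """Keep each time step independently with probability 1 - p, preserving order."""
    kept = [step for step in demonstration if rng.random() >= p]
    # Assumption: fall back to a single random step so the result is never empty.
    return kept if kept else [rng.choice(demonstration)]
```

Calling the function several times on the same demonstration gives different downsampled trajectories, which is what makes an ensemble estimate at test time possible.<br />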
<br />
=== Neighborhood Attention ===<br />
Since demonstration sizes can vary, a mechanism is needed that is not restricted to fixed-length inputs. While soft attention is one such mechanism, the problem with it is that there may be increasingly large amounts of information lost if soft attention is used to map longer demonstrations to the same fixed length as shorter demonstrations. As a solution, the authors propose having the same number of outputs as inputs, but with attention performed on other inputs relative to the current input.<br />
<br />
A query <math>q</math>, a list of context vectors <math>\{c_j\}</math>, and a list of memory vectors <math>\{m_j\}</math> are given as input to soft attention. Each attention weight is given by the product of a learned weight vector and a nonlinearity applied to the sum of the query and corresponding context vector. Softmaxed weights applied to the corresponding memory vector form the output of the soft attention.<br />
<br />
\[Inputs: q, \{c_j\}, \{m_j\}\]<br />
\[Weights: w_i \leftarrow v^Ttanh(q+c_i)\]<br />
\[Output: \sum_i{m_i\frac{\exp(w_i)}{\sum_j{\exp(w_j)}}}\]<br />
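<br />
The three equations above translate fairly directly into code. The sketch below uses plain Python lists for vectors; the learned weight vector is passed in as an argument named <code>v</code>, and subtracting the maximum weight before exponentiating is a standard numerical-stability step added for this example.<br />

```python
import math


def soft_attention(q, contexts, memories, v):
    """Return sum_i m_i * softmax(w)_i, where w_i = v^T tanh(q + c_i)."""
    weights = [sum(vk * math.tanh(qk + ck) for vk, qk, ck in zip(v, q, c))
               for c in contexts]
    top = max(weights)
    exps = [math.exp(w - top) for w in weights]   # stable softmax numerators
    norm = sum(exps)
    dim = len(memories[0])
    return [sum(e / norm * m[d] for e, m in zip(exps, memories))
            for d in range(dim)]
```

The output has the dimension of a single memory vector regardless of how many context and memory vectors are supplied, which is why plain soft attention maps demonstrations of any length to a fixed-length result.<br />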
<br />
A list of same-length embeddings, coming from a previous neighbourhood attention layer or a projection from the list of block coordinates, is given as input to neighborhood attention. For each block, two separate linear layers produce a query vector and a context vector, while a memory vector is a list of tuples that describe the position of each block joined with the input embedding for that block. Soft attention is then performed on this query, context vector, and memory vector. The authors claim that the intuition behind this process is to allow each block to provide information about itself relative to the other blocks in the environment. Finally, for each block, a linear transformation is performed on the vector composed by concatenating the input embedding, the result of the soft attention for that block, and the robot's state.<br />
<br />
For an environment with B blocks:<br />
\[State: s\]<br />
\[Block_i: b_i \leftarrow (x_i, y_i, z_i)\]<br />
\[Embeddings: h_1^{in}, ..., h_B^{in}\] <br />
\[Query_i: q_i \leftarrow Linear(h_i^{in})\]<br />
\[Context_i: c_i \leftarrow Linear(h_i^{in})\]<br />
\[Memory_i: m_i \leftarrow (b_i, h_i^{in}) \]<br />
\[Result_i: result_i \leftarrow SoftAttn(q_i, \{c_j\}_{j=1}^B, \{m_k\}_{k=1}^B)\]<br />
\[Output_i: output_i \leftarrow Linear(concat(h_i^{in}, result_i, b_i, s))\]<br />
<br />
== Context network ==<br />
This network takes the current state and the embedding produced by the demonstration network as inputs and outputs a fixed-length "context embedding" which captures only the information relevant for the manipulation network at this particular step.<br />
=== Attention over demonstration ===<br />
The current state is used to compute a query vector which is then used for attending over all the steps of the embedding. Since at each time step there are multiple blocks, the weights for each are summed together to produce a scalar for each time step. Neighbourhood attention is then applied several times, using an LSTM with untied weights, since the information at each time step needs to be propagated to each block's embedding. <br />
<br />
Performing attention over the demonstration yields a vector whose size is independent of the demonstration size; however, it is still dependent on the number of blocks in the environment, so it is natural to now attend over the state in order to get a fixed-length vector.<br />
=== Attention over current state ===<br />
The authors propose that in general, within each subtask, only a limited number of blocks are relevant for performing the subtask. If the subtask is to stack A on B, then intuitively, one would suppose that only block A and B are relevant, and perhaps any blocks that may be blocking access to either A or B. This is not enforced during training, but once soft attention is applied to the current state to produce a fixed-length context embedding, the authors believe that the model does indeed learn in this way.<br />
<br />
== Manipulation network ==<br />
Given the context embedding as input, this simple feedforward network (an MLP) decides on the particular action needed to complete the subtask of stacking one particular 'source' block on top of another 'target' block. Since the network in the paper can only take into account the source and target blocks, it may take suboptimal paths. For example, changing [ABC, D] to [C, ABD] could be done in one motion if it were possible to manipulate two blocks at once. The manipulation network is the simplest part of the architecture and leaves room for expansion in future work.<br />
<br />
= Experiments = <br />
The proposed model was tested on the block stacking tasks. The experiments were designed to answer the following questions:<br />
* How does training with behavioral cloning compare with DAGGER?<br />
* How does conditioning on the entire demonstration compare to conditioning on the final state?<br />
* How does conditioning on the entire demonstration compare to conditioning on a “snapshot” of the trajectory?<br />
* Can the authors' framework generalize to tasks that it has never seen during training?<br />
For the experiments, 140 training tasks and 43 testing tasks were collected, each with between 2 and 10 blocks and a different desired final layout. Over 1000 demonstrations for each task were collected using a hard-coded policy rather than a human user. The authors compare 4 different architectures in these experiments:<br />
* Behavioural cloning used to train the proposed model<br />
* DAGGER used to train the proposed model<br />
* The proposed model, trained with DAGGER, but conditioned on the desired final state rather than an entire demonstration<br />
* The proposed model, trained with DAGGER, but conditioned on a 'snapshot' of the environment at the end of each subtask (ie. every time a block is stacked on another block)<br />
<br />
== Performance Evaluation ==<br />
[[File:oneshot3.jpg|1000px]]<br />
<br />
The most confident action at each timestep is chosen in 100 different task configurations, and results are averaged over tasks that had the same number of blocks. The results suggest that the performance of each of the architectures is comparable to that of the hard-coded policy which they aim to imitate. Performance degrades similarly across all architectures and the hard-coded policy as the number of blocks increases. On the harder tasks, conditioning on the entire demonstration led to better performance than conditioning on snapshots or on the final state. The authors believe that this may be due to the lack of information when conditioning only on the final state as well as due to regularization caused by temporal dropout which leads to data augmentation when conditioning on the full demonstration but is omitted when conditioning only on the snapshots or final state. Both DAGGER and behavioral cloning performed comparably well. As mentioned above, noise injection was used in training to improve performance; in practice, additional noise can still be injected but some may already come from other sources.<br />
<br />
== Visualization ==<br />
The authors visualize the attention mechanisms underlying the main policy architecture to have a better understanding about how it operates. There are two kinds of attention that the authors are mainly interested in, one where the policy attends to different time steps in the demonstration, and the other where the policy attends to different blocks in the current state. The figures below show some of the policy attention heatmaps over time.<br />
<br />
[[File:paper6_Visualization.png|800px]]<br />
<br />
= Conclusions =<br />
The proposed model successfully learns to complete new instances of a new task from just a single demonstration. The model was demonstrated to work on a series of block stacking tasks. The authors propose several extensions including enabling few-shot learning when one demonstration is insufficient, using image data as the demonstrations, and attempting many other tasks aside from block stacking.<br />
<br />
= Criticisms =<br />
While the paper shows an incredibly impressive result: the ability to learn a new task from just a single demonstration, there are a few points that need clearing up.<br />
Firstly, the authors use a hard-coded policy in their experiments rather than a human. It is clear that the performance of this policy begins to degrade quickly as the complexity of the task increases. It would be useful to know what this hard-coded policy actually was, and whether the proposed model could still achieve comparable performance if a more successful demonstration, perhaps one by a human user, were provided. Given the current popularity of adversarial examples, it would also be interesting to see the performance when conditioned on an "adversarial" demonstration that achieves the correct final state but intentionally performs complex or obfuscated steps to get there.<br />
Second, it would be useful to see the model's performance on a more complex family of tasks than block stacking. Although each block stacking task is slightly different, the differences may turn out to be insignificant compared to other tasks that this model should handle if it is to be a general imitation learning architecture; intuitively, the space of all possible moves and configurations is not large for this task. The "one-shot" framing is also a bit misleading: many demonstrations appear to be needed first to train a generic policy that can generalize, and only then is a single demonstration of a new task used, with the expectation that the policy generalizes to it. In other words, some form of pre-training is involved. Regardless, this work is a big step forward for imitation learning, allowing a wider range of tasks for which there is little training data and no reward function available to still be solved successfully.<br />
<br />
= Illustrative Example: Particle Reaching =<br />
<br />
[[File:f1.png]]<br />
<br />
Figure 1: [Left] Agent, [Middle] Orange square is target, [Right] Green triangle is target [2].<br />
<br />
Another simple yet insightful example of the One-Shot Imitation Learning is the particle reaching problem which provides a relatively simple suite of tasks from which the network needs to solve an arbitrary one. The problem is formulated such that for each task: there is an agent which can move based on a 2D force vector, and n landmarks at varying 2D locations (n varies from task to task) with the goal of moving the agent to the specific landmark reached in the demonstration. This is illustrated in Figure 1. <br />
<br />
[[File:f2.png|450px]]<br />
<br />
Figure 2: Experimental results [2].<br />
<br />
Some insight comes from the use of different network architectures to solve this problem. The three architectures to compare (described below) are plain LSTM, LSTM with attention, and final state with attention. The key insight is that the architectures go from generic to specific, with the best generalization performance achieved with the most specific architecture, final state with attention, as seen in Figure 2. It is important to note that this conclusion does not carry forward to more complicated tasks such as the block stacking task.<br />
*Plain LSTM: 512 hidden units, with the input being the demonstration trajectory (the position of the agent changes over time and approaches one of the targets). Output of the LSTM with the current state (from the task needed to be solved) is the input for a multi-layer perceptron (MLP) for finding the solution.<br />
*LSTM with attention: The output of the LSTM is now a set of weights for the different targets during training. These weights and the test state are used in the test task. The output, which is now 2D, is the input to an MLP as before.<br />
*Final state with attention: Looks only at the final state of the demonstration since it can sufficiently provide the needed detail of which target to reach (trajectory is not required). Similar to previous architecture, produces weights used by MLP.<br />
<br />
= Source =<br />
# Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).<br />
# Duan, Yan, Marcin Andrychowicz, Bradly Stadie, OpenAI Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. "One-shot imitation learning." In Advances in neural information processing systems, pp. 1087-1098. 2017.<br />
# Y. Duan, M. Andrychowicz, B. Stadie, J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba. One-shot imitation learning. arXiv preprint arXiv:1703.07326, 2017. (Newer revision)<br />
# Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." arXiv preprint arXiv:1703.03400 (2017).<br />
# Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Spectral_normalization_for_generative_adversial_network&diff=36314stat946w18/Spectral normalization for generative adversial network2018-04-18T14:31:18Z<p>Ws2chen: /* Model */</p>
<hr />
<div>= Presented by =<br />
<br />
1. liu, wenqing<br />
<br />
= Introduction =<br />
Generative adversarial networks (GANs) (Goodfellow et al., 2014) have been enjoying considerable success as a framework of generative models in recent years. The concept is to alternately train the model distribution and the discriminator, with the goal of reducing the difference between the model distribution and the target distribution as measured by the best discriminator possible at each step of the training.<br />
<br />
A persisting challenge in the training of GANs is the performance control of the discriminator. When the support of the model distribution and the support of the target distribution are disjoint, there exists a discriminator that can perfectly distinguish the model distribution from the target (Arjovsky & Bottou, 2017). Once such a discriminator is produced, the training of the generator comes to a complete stop, because the derivative of the so-produced discriminator with respect to the input turns out to be 0. This motivates us to introduce some form of restriction on the choice of discriminator.<br />
<br />
In this paper, the authors propose a novel weight normalization method called ''spectral normalization'' that can stabilize the training of discriminator networks. The normalization enjoys the following favorable properties:<br />
<br />
* The only hyper-parameter that needs to be tuned is the Lipschitz constant, and the algorithm is not too sensitive to this constant's value<br />
* The additional computational cost needed to implement spectral normalization is small <br />
<br />
In this study, they provide explanations of the effectiveness of spectral normalization against other regularization or normalization techniques.<br />
<br />
= Model =<br />
<br />
Let us consider a simple discriminator made of a neural network of the following form, with the input x:<br />
<br />
\[f(x,\theta) = W^{L+1}a_L(W^L(a_{L-1}(W^{L-1}(\cdots a_1(W^1x)\cdots))))\]<br />
<br />
where <math> \theta:=\{W^1,\cdots,W^L, W^{L+1}\} </math> is the set of learnable parameters, <math>W^l\in R^{d_l\times d_{l-1}}, W^{L+1}\in R^{1\times d_L} </math>, and <math>a_l </math> is an element-wise non-linear activation function. The final output of the discriminator is given by <math>D(x,\theta) = A(f(x,\theta)) </math>, where <math>A</math> is an activation function corresponding to the divergence or distance measure chosen by the user. The standard formulation of GANs is given by <math>\min_{G}\max_{D}V(G,D)</math>, where the min and max are taken over the sets of generator and discriminator functions, respectively. <br />
<br />
The conventional form of <math>V(G,D) </math> is given by:<br />
<br />
\[E_{x\sim q_{data}}[\log D(x)] + E_{x'\sim p_G}[\log(1-D(x'))]\]<br />
<br />
where <math>q_{data}</math> is the data distribution and <math>p_G(x)</math> is the model generator distribution to be learned through the adversarial min-max optimization. It is known that, for a fixed generator G, the optimal discriminator for this form of <math>V(G,D) </math> is given by <br />
<br />
\[ D_G^{*}(x):=q_{data}(x)/(q_{data}(x)+p_G(x)) \]<br />
<br />
Also, the machine learning community has pointed out recently that the function space from which the discriminators are selected crucially affects the performance of GANs. A number of works advocate the importance of Lipschitz continuity in assuring the boundedness of statistics. For example, the optimal discriminator of the standard formulation takes the form: <br />
<br />
\[ D_G^{*}(x):=\frac{q_{data}(x)}{q_{data}(x)+p_G(x)} = \text{sigmoid}(f^{*}(x)), \quad \text{where } f^{*}(x) = \log q_{data}(x) - \log p_{G}(x) \] <br />
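<br />
The identity above can be checked numerically. The following is a minimal sketch, not taken from the paper; the density values and the function names <code>optimal_discriminator</code> and <code>f_star</code> are hypothetical and chosen only for illustration.<br />

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def optimal_discriminator(q, p):
    """D*_G(x) = q_data(x) / (q_data(x) + p_G(x)) for density values q, p > 0."""
    return q / (q + p)

def f_star(q, p):
    """f*(x) = log q_data(x) - log p_G(x)."""
    return math.log(q) - math.log(p)

# Hypothetical density values (q_data(x), p_G(x)) at three points x.
pairs = [(0.5, 0.5), (0.9, 0.1), (1e-3, 0.2)]
for q, p in pairs:
    # Both columns agree up to floating-point error.
    print(q, p, optimal_discriminator(q, p), sigmoid(f_star(q, p)))
```

Note that <code>f_star</code> grows without bound as either density value approaches zero, which is the source of the unbounded gradient discussed next.<br />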
<br />
Thus we will be having:<br />
<br />
\[\nabla_{x} f^{*}(x) = \frac{1}{q_{data}(x)} \nabla_{x}q_{data}(x) - \frac{1}{p_{G}(x)} \nabla_{x}p_{G}(x)\]<br />
<br />
which can be unbounded or even incomputable. This motivates us to introduce some regularity condition on the derivative of f(x).<br />
<br />
Returning to the discriminator: we search for the discriminator D from the set of K-Lipschitz continuous functions, that is, <br />
<br />
\[ \arg\max_{||f||_{Lip}\le K}V(G,D)\]<br />
<br />
where <math> ||f||_{Lip}</math> denotes the smallest value M such that <math> ||f(x)-f(x')||/||x-x'||\le M </math> for any x,x', with the norm being the <math> l_2 </math> norm.<br />
<br />
Our spectral normalization controls the Lipschitz constant of the discriminator function <math> f </math> by literally constraining the spectral norm of each layer <math> g: h_{in}\rightarrow h_{out}</math>. By definition, Lipschitz norm <math> ||g||_{Lip} </math> is equal to <math> \sup_h\sigma(\nabla g(h)) </math>, where <math> \sigma(A) </math> is the spectral norm of the matrix A, which is equivalent to the largest singular value of A. Therefore, for a linear layer <math> g(h)=Wh </math>, the norm is given by <math> ||g||_{Lip}=\sigma(W) </math>. Observing the following bound:<br />
<br />
\[ ||f||_{Lip}\le ||(h_L\rightarrow W^{L+1}h_{L})||_{Lip}*||a_{L}||_{Lip}*||(h_{L-1}\rightarrow W^{L}h_{L-1})||_{Lip}\cdots ||a_1||_{Lip}*||(h_0\rightarrow W^1h_0)||_{Lip}=\prod_{l=1}^{L+1}\sigma(W^l) *\prod_{l=1}^{L} ||a_l||_{Lip} \]<br />
<br />
Our spectral normalization normalizes the spectral norm of the weight matrix W so that it satisfies the Lipschitz constraint <math> \sigma(W)=1 </math>:<br />
<br />
\[ \bar{W_{SN}}:= W/\sigma(W) \]<br />
<br />
In summary, just as in weight normalization, we reparameterize the weight matrix <math> \bar{W_{SN}} </math> as <math> W/\sigma(W) </math> to fix its largest singular value at one. Now we can calculate the gradient with respect to the new parameter W by the chain rule:<br />
<br />
\[ \frac{\partial V(G,D)}{\partial W} = \frac{\partial V(G,D)}{\partial \bar{W_{SN}}}*\frac{\partial \bar{W_{SN}}}{\partial W} \]<br />
<br />
\[ \frac{\partial \bar{W_{SN}}}{\partial W_{ij}} = \frac{1}{\sigma(W)}E_{ij}-\frac{1}{\sigma(W)^2}*\frac{\partial \sigma(W)}{\partial(W_{ij})}W=\frac{1}{\sigma(W)}E_{ij}-\frac{[u_1v_1^T]_{ij}}{\sigma(W)^2}W=\frac{1}{\sigma(W)}(E_{ij}-[u_1v_1^T]_{ij}\bar{W_{SN}}) \]<br />
<br />
where <math> E_{ij} </math> is the matrix whose (i,j)-th entry is 1 and zero everywhere else, and <math> u_1, v_1</math> are respectively the first left and right singular vectors of W.<br />
<br />
To understand the above computation in more detail, note that <br />
\begin{align}<br />
\sigma(W)= \sup_{||u||=1, ||v||=1} \langle Wv, u \rangle = \sup_{||u||=1, ||v||=1} \text{trace} ( (uv^T)^T W).<br />
\end{align}<br />
By Theorem 4.4.2 in Lemaréchal and Hiriart-Urruty (1996), the sub-differential of a convex function defined as the maximum of a set of differentiable convex functions over a compact index set is the convex hull of the gradients of the maximizing functions. Thus we have the sub-differential:<br />
<br />
\begin{align}<br />
\partial \sigma = \text{convex hull} \{ u v^T: u,v \text{ are left/right singular vectors associated with } \sigma(W) \}.<br />
\end{align}<br />
<br />
However, the authors assume that the maximum singular value of W has only one left and one right normalized singular vector. Thus <math> \sigma </math> is differentiable and <br />
\begin{align}<br />
\nabla_W \sigma(W) =u_1v_1^T,<br />
\end{align}<br />
which explains the above computation.<br />
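<br />
As a sanity check on <math> \nabla_W \sigma(W) =u_1v_1^T </math>, the gradient can be compared with central finite differences. The sketch below is illustrative only: the matrix <code>W</code>, the helper names, and the iteration count are assumptions, and the leading singular triple is estimated by power iteration rather than a full SVD.<br />

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def normalize(x):
    n = math.sqrt(sum(xi * xi for xi in x))
    return [xi / n for xi in x]

def top_singular(W, iters=500):
    """Power iteration estimate of (sigma, u1, v1) for the largest singular value."""
    Wt = transpose(W)
    u = normalize([1.0] * len(W))
    for _ in range(iters):
        v = normalize(matvec(Wt, u))
        u = normalize(matvec(W, v))
    sigma = sum(ui * wi for ui, wi in zip(u, matvec(W, v)))
    return sigma, u, v

# Hypothetical 2x3 weight matrix with two distinct singular values.
W = [[2.0, 0.5, -1.0],
     [0.3, 1.0, 0.7]]
sigma, u1, v1 = top_singular(W)

eps = 1e-6
max_err = 0.0
for i in range(len(W)):
    for j in range(len(W[0])):
        W_plus = [row[:] for row in W]
        W_minus = [row[:] for row in W]
        W_plus[i][j] += eps
        W_minus[i][j] -= eps
        # Central difference approximation of d sigma / d W_ij.
        finite_diff = (top_singular(W_plus)[0] - top_singular(W_minus)[0]) / (2 * eps)
        max_err = max(max_err, abs(finite_diff - u1[i] * v1[j]))

print(max_err)  # small when the largest singular value is simple
```

The check relies on the same assumption the authors make: the largest singular value has a single pair of singular vectors, so <math> \sigma </math> is differentiable at <math> W </math>.<br />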
<br />
= Spectral Normalization VS Other Regularization Techniques =<br />
<br />
The weight normalization introduced by Salimans & Kingma (2016) is a method that normalizes the <math> l_2 </math> norm of each row vector in the weight matrix. Mathematically, this is equivalent to requiring the weight-normalized matrix <math> \bar{W_{WN}} </math> to satisfy:<br />
<br />
<math> \sigma_1(\bar{W_{WN}})^2+\cdots+\sigma_T(\bar{W_{WN}})^2=d_0, \text{ where } T=\min(d_i,d_0) </math> and <math> \sigma_t(A) </math> is the t-th singular value of matrix A. <br />
<br />
Note that if <math> \bar{W_{WN}} </math> is the weight-normalized matrix of dimension <math> d_i\times d_0 </math>, the norm <math> ||\bar{W_{WN}}h||_2 </math> for a fixed unit vector <math> h </math> is maximized at <math> ||\bar{W_{WN}}h||_2 = \sqrt{d_0} \text{ when } \sigma_1(\bar{W_{WN}})=\sqrt{d_0} \text{ and } \sigma_t(\bar{W_{WN}})=0, t=2, \cdots, T </math>, which means that <math> \bar{W_{WN}} </math> is of rank one. In order to retain as much of the norm of the input as possible and hence make the discriminator more sensitive, one would hope to make the norm of <math> \bar{W_{WN}}h </math> large. For weight normalization, however, this comes at the cost of reducing the rank and hence the number of features available to the discriminator. Thus, there is a conflict of interest between weight normalization and our desire to use as many features as possible to distinguish the generator distribution from the target distribution. The former interest often prevails, inadvertently diminishing the number of features used by the discriminator. Consequently, the algorithm would produce a rather arbitrary model distribution that matches the target distribution only at a select few features. <br />
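<br />
The constraint and its rank-one extreme can be illustrated with a small sketch. It assumes that rows of the matrix correspond to output units, so the number of rows plays the role of <math> d_0 </math>, and it uses the standard fact that the squared Frobenius norm equals the sum of the squared singular values. The matrices and function names are hypothetical.<br />

```python
import math

def l2_norm(x):
    return math.sqrt(sum(xi * xi for xi in x))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def weight_normalize(W):
    """Rescale every row of W to unit l2 norm."""
    return [[w / l2_norm(row) for w in row] for row in W]

def frobenius_sq(W):
    """Squared Frobenius norm, equal to the sum of squared singular values."""
    return sum(w * w for row in W for w in row)

# Hypothetical weight matrix: 3 output units, 2 inputs.
W = [[3.0, 4.0], [1.0, -2.0], [0.5, 0.5]]
W_wn = weight_normalize(W)
fro = frobenius_sq(W_wn)  # one per row, so equal to the number of rows
print(fro)

# Rank-one extreme: every row is the same unit vector h.
h = [0.6, 0.8]
W_rank1 = [h[:] for _ in range(3)]
out_norm = l2_norm(matvec(W_rank1, h))  # attains sqrt(number of rows)
print(out_norm, math.sqrt(3))
```

The second matrix satisfies the weight normalization constraint yet responds to a single input direction only, which is the loss of features described above.<br />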
<br />
Brock et al. (2016) introduced orthonormal regularization on each weight to stabilize the training of GANs. In their work, Brock et al. (2016) augmented the adversarial objective function by adding the following term:<br />
<br />
<math> ||W^TW-I||^2_F </math><br />
<br />
While this seems to serve the same purpose as spectral normalization, orthonormal regularization is mathematically quite different. It requires the weight matrix to be orthogonal, which coerces all singular values to be one, and therefore destroys the information about the spectrum. Spectral normalization, on the other hand, only scales the spectrum so that its maximum is one. <br />
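<br />
To make the contrast concrete, the sketch below evaluates the penalty <math> ||W^TW-I||^2_F </math> on two hypothetical matrices: an orthogonal one, and one whose spectral norm is already one but whose second singular value is smaller. The function name <code>orthonormal_penalty</code> is an assumption made for illustration.<br />

```python
def transpose(W):
    return [list(col) for col in zip(*W)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def orthonormal_penalty(W):
    """Squared Frobenius norm of W^T W - I."""
    G = matmul(transpose(W), W)
    n = len(G)
    return sum((G[i][j] - (1.0 if i == j else 0.0)) ** 2
               for i in range(n) for j in range(n))

rotation = [[0.6, -0.8], [0.8, 0.6]]  # orthogonal: all singular values are 1
spread = [[1.0, 0.0], [0.0, 0.25]]    # singular values 1 and 0.25

print(orthonormal_penalty(rotation))  # approximately 0
print(orthonormal_penalty(spread))    # (0.25**2 - 1)**2 = 0.87890625
```

The second matrix is left unchanged by spectral normalization, since its largest singular value is already one, but the orthonormal penalty still pushes its smaller singular value toward one.<br />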
<br />
Gulrajani et al. (2017) used the gradient penalty method in combination with WGAN. In their work, they placed a K-Lipschitz constraint on the discriminator by augmenting the objective function with a regularizer that rewards the function for having a local 1-Lipschitz constant (i.e. <math> ||\nabla_{\hat{x}} f ||_2 = 1 </math>) at discrete sets of points of the form <math> \hat{x}:=\epsilon \tilde{x} + (1-\epsilon)x </math> generated by interpolating a sample <math> \tilde{x} </math> from the generative distribution and a sample <math> x </math> from the data distribution. This approach has an obvious weakness of being heavily dependent on the support of the current generative distribution. Moreover, WGAN-GP requires more computational cost than spectral normalization with single-step power iteration, because the computation of <math> ||\nabla_{\hat{x}} f ||_2 </math> requires one whole round of forward and backward propagation.<br />
<br />
= Experimental settings and results = <br />
== Objective function ==<br />
For all methods other than WGAN-GP, we use <br />
<math> V(G,D) := E_{x\sim q_{data}(x)}[\log D(x)] + E_{z\sim p(z)}[\log (1-D(G(z)))]</math><br />
to update D; for the updates of G, we use <math> -E_{z\sim p(z)}[\log(D(G(z)))] </math>. Alternatively, the performance of the algorithm is also tested with the so-called hinge loss, which is given by <br />
<math> V_D(\hat{G},D)= E_{x\sim q_{data}(x)}[\min(0,-1+D(x))] + E_{z\sim p(z)}[\min(0,-1-D(\hat{G}(z)))] </math>, <math> V_G(G,\hat{D})=-E_{z\sim p(z)}[\hat{D}(G(z))] </math><br />
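<br />
The hinge objectives can be estimated from mini-batches of discriminator outputs. The sketch below is a plain Monte Carlo version under the assumption that <code>d_real</code> and <code>d_fake</code> are lists of raw discriminator scores on data and generated samples; the function names and the sample scores are hypothetical.<br />

```python
def discriminator_hinge_objective(d_real, d_fake):
    """Estimate V_D: mean of min(0, -1 + D(x)) plus mean of min(0, -1 - D(G(z)))."""
    real_term = sum(min(0.0, -1.0 + d) for d in d_real) / len(d_real)
    fake_term = sum(min(0.0, -1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_hinge_objective(d_fake):
    """Estimate V_G: minus the mean discriminator score on generated samples."""
    return -sum(d_fake) / len(d_fake)

# Hypothetical discriminator scores for two real and two generated samples.
d_real = [2.0, 0.5]
d_fake = [-3.0, 0.0]
print(discriminator_hinge_objective(d_real, d_fake))  # -0.75
print(generator_hinge_objective(d_fake))              # 1.5
```

Real samples scored above 1 and generated samples scored below -1 contribute nothing to <code>discriminator_hinge_objective</code>, so only samples inside the margin affect the discriminator update.<br />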
<br />
For WGAN-GP, we choose <br />
<math> V(G,D):=E_{x\sim q_{data}}[D(x)]-E_{z\sim p(z)}[D(G(z))]- \lambda E_{\hat{x}\sim p(\hat{x})}[(||\nabla_{\hat{x}}D(\hat{x})||_2-1)^2]</math><br />
<br />
== Optimization ==<br />
The Adam optimizer is used with 6 hyper-parameter settings in total, which vary: <br />
* <math> n_{dis} </math>, the number of updates of the discriminator per one update of the generator. <br />
* learning rate <math> \alpha </math><br />
* the first and second momentum parameters <math> \beta_1, \beta_2 </math> of Adam<br />
<br />
[[File:inception score.png]]<br />
<br />
[[File:FID score.png]]<br />
<br />
The above figures show the inception score and FID score for settings A-F, and the table shows the inception scores of the different methods with optimal settings on the CIFAR-10 and STL-10 datasets.<br />
<br />
== Singular values analysis on the weights of the discriminator D ==<br />
[[File:singular value.png]]<br />
<br />
The above figure shows the squared singular values of the weight matrices in the final discriminator D produced by each method, using the parameter that yielded the best inception score. As predicted, the singular values of the first to fifth layers trained with weight clipping and weight normalization concentrate on a few components. On the other hand, the singular values of the weight matrices in those layers trained with spectral normalization are more broadly distributed.<br />
<br />
== Training time ==<br />
On CIFAR-10, SN-GANs are slightly slower than weight normalization, at 31 seconds for 100 generations compared to 29 seconds for weight normalization, as seen in figure 10. However, SN-GANs are significantly faster than WGAN-GP, which takes 40 seconds for 100 generations. As mentioned in section 3, WGAN-GP is slower than other methods because it needs to calculate the gradient of the gradient norm. For STL-10, the computational time of SN-GANs is almost the same as vanilla GANs at approximately 61 seconds for 100 generations.<br />
<br />
[[File:trainingTime.png|center]]<br />
<br />
== Comparison between SN-GANs and orthonormal regularization ==<br />
[[File:comparison.png]]<br />
As explained in Section 3, orthonormal regularization is different from our method in that it destroys the spectral information and puts equal emphasis on all feature dimensions, including the ones that should be weeded out in the training process. To see the extent of its possibly detrimental effect, we experimented by increasing the dimension of the feature space, especially at the final layer, for which training with spectral normalization prefers a relatively small feature space. The above figure shows the result of these experiments. As predicted, the performance of orthonormal regularization deteriorates as the dimension of the feature maps at the final layer increases. SN-GANs, on the other hand, do not falter with this modification of the architecture.<br />
<br />
We also applied our method to the training of class conditional GANs on the ILSVRC2012 dataset with 1000 classes, each consisting of approximately 1300 images, which we compressed to <math>128\times 128</math> pixels. GAN without normalization and GAN with layer normalization collapsed in the beginning of training and failed to produce any meaningful images. The above picture shows that the inception score of orthonormal regularization plateaued around 20k iterations, while SN kept improving even afterward.<br />
<br />
[[File:sngan.jpg]]<br />
<br />
Samples generated by various networks trained on CIFAR10. Momentum and training rates are increasing to the right. We can see that for high learning rates and momentum the Wasserstein-GAN does not generate good images, while weight and spectral normalization generate good samples.<br />
<br />
= Algorithm of spectral normalization =<br />
To calculate the largest singular value of matrix <math> W </math> to implement spectral normalization, we appeal to power iterations. Algorithm is executed as follows:<br />
<br />
* Initialize <math>\tilde{u}_{l}\in R^{d_l} \text{ for } l=1,\cdots,L </math> with a random vector (sampled from an isotropic distribution) <br />
* For each update and each layer l:<br />
** Apply the power iteration method to the unnormalized weight <math> W^l </math>:<br />
<br />
\begin{align}<br />
\tilde{v_l}\leftarrow (W^l)^T\tilde{u_l}/||(W^l)^T\tilde{u_l}||_2<br />
\end{align}<br />
<br />
\begin{align}<br />
\tilde{u_l}\leftarrow W^l\tilde{v_l}/||W^l\tilde{v_l}||_2<br />
\end{align}<br />
<br />
* Calculate <math> \bar{W_{SN}} </math> with the spectral norm :<br />
<br />
\begin{align}<br />
\bar{W_{SN}}(W^l)=W^l/\sigma(W^l)<br />
\end{align}<br />
<br />
where<br />
<br />
\begin{align}<br />
\sigma(W^l)=\tilde{u_l}^TW^l\tilde{v_l}<br />
\end{align}<br />
<br />
* Update <math>W^l </math> with SGD on mini-batch dataset <math> D_M </math> with a learning rate <math> \alpha </math><br />
<br />
<br />
\begin{align}<br />
W^l\leftarrow W^l-\alpha\nabla_{W^l}\ell(\bar{W_{SN}^l}(W^l),D_M)<br />
\end{align}<br />
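<br />
The per-layer steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names, the example matrix, and the loop that stands in for training steps are all hypothetical, and the SGD update of <math> W^l </math> is omitted so that the convergence of the estimate is easy to see.<br />

```python
import math
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def normalize(x):
    n = math.sqrt(sum(xi * xi for xi in x))
    return [xi / n for xi in x]

def spectral_normalize(W, u, n_power_iterations=1):
    """Refine (u, v) by power iteration (at least one step), then return W / sigma."""
    Wt = transpose(W)
    for _ in range(n_power_iterations):
        v = normalize(matvec(Wt, u))  # v <- W^T u / ||W^T u||
        u = normalize(matvec(W, v))   # u <- W v / ||W v||
    sigma = sum(ui * wi for ui, wi in zip(u, matvec(W, v)))  # u^T W v
    W_sn = [[w / sigma for w in row] for row in W]
    return W_sn, u, sigma

random.seed(0)
# Hypothetical 3x2 weight matrix with singular values 3 and 1.
W = [[3.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
u = normalize([random.gauss(0.0, 1.0) for _ in range(len(W))])

# Stand-in for training steps: u is carried over between calls, so one power
# iteration per step is enough for the estimate of sigma to converge.
for step in range(50):
    W_sn, u, sigma = spectral_normalize(W, u)

print(sigma)  # close to 3
print(W_sn)   # largest singular value rescaled to about 1
```

Reusing <code>u</code> from the previous step is what keeps the extra cost small: when the weights change slowly between updates, the previous estimate is already a good starting point.<br />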
<br />
<br />
== Conclusions ==<br />
This paper proposes spectral normalization as a stabilizer for the training of GANs. When spectral normalization is applied to GANs on image generation tasks, the generated examples are more diverse than when using conventional weight normalization and achieve better or comparable inception scores relative to previous studies. The method imposes global regularization on the discriminator as opposed to the local regularization introduced by WGAN-GP. In future work, the authors would like to investigate how their method compares analytically to other methods, while also further comparing it empirically by conducting experiments with their algorithm on larger and more complex datasets.<br />
<br />
== Open Source Code ==<br />
The open source code for this paper can be found at https://github.com/pfnet-research/sngan_projection.<br />
<br />
== References ==<br />
# Lemaréchal, Claude, and J. B. Hiriart-Urruty. "Convex analysis and minimization algorithms I." Grundlehren der mathematischen Wissenschaften 305 (1996).</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Spectral_normalization_for_generative_adversial_network&diff=36313stat946w18/Spectral normalization for generative adversial network2018-04-18T14:29:43Z<p>Ws2chen: /* Model */</p>
<hr />
<div>= Presented by =<br />
<br />
1. liu, wenqing<br />
<br />
= Introduction =<br />
Generative adversarial networks (GANs) (Goodfellow et al., 2014) have been enjoying considerable success as a framework of generative models in recent years. The concept is to alternately train the model distribution and the discriminator, with the goal of reducing the difference between the model distribution and the target distribution as measured by the best discriminator possible at each step of the training.<br />
<br />
A persisting challenge in the training of GANs is the performance control of the discriminator. When the support of the model distribution and the support of the target distribution are disjoint, there exists a discriminator that can perfectly distinguish the model distribution from the target (Arjovsky & Bottou, 2017). Once such a discriminator is produced, the training of the generator comes to a complete stop, because the derivative of the so-produced discriminator with respect to the input turns out to be 0. This motivates us to introduce some form of restriction on the choice of discriminator.<br />
<br />
In this paper, the authors propose a novel weight normalization method called ''spectral normalization'' that can stabilize the training of discriminator networks. The normalization enjoys the following favorable properties:<br />
<br />
* The only hyper-parameter that needs to be tuned is the Lipschitz constant, and the algorithm is not too sensitive to this constant's value<br />
* The additional computational cost needed to implement spectral normalization is small <br />
<br />
In this study, they provide explanations of the effectiveness of spectral normalization against other regularization or normalization techniques.<br />
<br />
= Model =<br />
<br />
Let us consider a simple discriminator made of a neural network of the following form, with the input x:<br />
<br />
\[f(x,\theta) = W^{L+1}a_L(W^L(a_{L-1}(W^{L-1}(\cdots a_1(W^1x)\cdots))))\]<br />
<br />
where <math> \theta:=\{W^1,\cdots,W^L, W^{L+1}\} </math> is the set of learnable parameters, <math>W^l\in R^{d_l\times d_{l-1}}, W^{L+1}\in R^{1\times d_L} </math>, and <math>a_l </math> is an element-wise non-linear activation function. The final output of the discriminator is given by <math>D(x,\theta) = A(f(x,\theta)) </math>, where <math>A</math> is an activation function corresponding to the divergence or distance measure chosen by the user. The standard formulation of GANs is given by <math>\min_{G}\max_{D}V(G,D)</math>, where the min and max are taken over the sets of generator and discriminator functions, respectively. <br />
<br />
The conventional form of <math>V(G,D) </math> is given by:<br />
<br />
\[E_{x\sim q_{data}}[\log D(x)] + E_{x'\sim p_G}[\log(1-D(x'))]\]<br />
<br />
where <math>q_{data}</math> is the data distribution and <math>p_G(x)</math> is the model generator distribution to be learned through the adversarial min-max optimization. It is known that, for a fixed generator G, the optimal discriminator for this form of <math>V(G,D) </math> is given by <br />
<br />
\[ D_G^{*}(x):=q_{data}(x)/(q_{data}(x)+p_G(x)) \]<br />
<br />
Also, the machine learning community has pointed out recently that the function space from which the discriminators are selected crucially affects the performance of GANs. A number of works advocate the importance of Lipschitz continuity in assuring the boundedness of statistics. For example, the optimal discriminator of the standard formulation takes the form: <br />
<br />
\[ D_G^{*}(x):=\frac{q_{data}(x)}{q_{data}(x)+p_G(x)} = \text{sigmoid}(f^{*}(x)), \quad \text{where } f^{*}(x) = \log q_{data}(x) - \log p_{G}(x) \] <br />
<br />
Thus we will be having:<br />
<br />
\[\nabla_{x} f^{*}(x) = \frac{1}{q_{data}(x)} \nabla_{x}q_{data}(x) - \frac{1}{p_{G}(x)} \nabla_{x}p_{G}(x)\]<br />
<br />
We search for the discriminator D from the set of K-Lipschitz continuous functions, that is, <br />
<br />
\[ \arg\max_{||f||_{Lip}\le K}V(G,D)\]<br />
<br />
where <math> ||f||_{Lip}</math> denotes the smallest value M such that <math> ||f(x)-f(x')||/||x-x'||\le M </math> for any x,x', with the norm being the <math> l_2 </math> norm.<br />
<br />
Our spectral normalization controls the Lipschitz constant of the discriminator function <math> f </math> by literally constraining the spectral norm of each layer <math> g: h_{in}\rightarrow h_{out}</math>. By definition, Lipschitz norm <math> ||g||_{Lip} </math> is equal to <math> \sup_h\sigma(\nabla g(h)) </math>, where <math> \sigma(A) </math> is the spectral norm of the matrix A, which is equivalent to the largest singular value of A. Therefore, for a linear layer <math> g(h)=Wh </math>, the norm is given by <math> ||g||_{Lip}=\sigma(W) </math>. Observing the following bound:<br />
<br />
\[ ||f||_{Lip}\le ||(h_L\rightarrow W^{L+1}h_{L})||_{Lip}*||a_{L}||_{Lip}*||(h_{L-1}\rightarrow W^{L}h_{L-1})||_{Lip}\cdots ||a_1||_{Lip}*||(h_0\rightarrow W^1h_0)||_{Lip}=\prod_{l=1}^{L+1}\sigma(W^l) *\prod_{l=1}^{L} ||a_l||_{Lip} \]<br />
<br />
Our spectral normalization normalizes the spectral norm of the weight matrix W so that it satisfies the Lipschitz constraint <math> \sigma(W)=1 </math>:<br />
<br />
\[ \bar{W_{SN}}:= W/\sigma(W) \]<br />
<br />
In summary, just as in weight normalization, we reparameterize the weight matrix <math> \bar{W_{SN}} </math> as <math> W/\sigma(W) </math> to fix its largest singular value at one. Now we can calculate the gradient with respect to the new parameter W by the chain rule:<br />
<br />
\[ \frac{\partial V(G,D)}{\partial W} = \frac{\partial V(G,D)}{\partial \bar{W_{SN}}}*\frac{\partial \bar{W_{SN}}}{\partial W} \]<br />
<br />
\[ \frac{\partial \bar{W_{SN}}}{\partial W_{ij}} = \frac{1}{\sigma(W)}E_{ij}-\frac{1}{\sigma(W)^2}*\frac{\partial \sigma(W)}{\partial(W_{ij})}W=\frac{1}{\sigma(W)}E_{ij}-\frac{[u_1v_1^T]_{ij}}{\sigma(W)^2}W=\frac{1}{\sigma(W)}(E_{ij}-[u_1v_1^T]_{ij}\bar{W_{SN}}) \]<br />
<br />
where <math> E_{ij} </math> is the matrix whose (i,j)-th entry is 1 and zero everywhere else, and <math> u_1, v_1</math> are respectively the first left and right singular vectors of W.<br />
<br />
To understand the above computation in more detail, note that <br />
\begin{align}<br />
\sigma(W)= \sup_{||u||=1, ||v||=1} \langle Wv, u \rangle = \sup_{||u||=1, ||v||=1} \text{trace} ( (uv^T)^T W).<br />
\end{align}<br />
By Theorem 4.4.2 in Lemaréchal and Hiriart-Urruty (1996), the sub-differential of a convex function defined as the maximum of a set of differentiable convex functions over a compact index set is the convex hull of the gradients of the maximizing functions. Thus we have the sub-differential:<br />
<br />
\begin{align}<br />
\partial \sigma = \text{convex hull} \{ u v^T: u,v \text{ are left/right singular vectors associated with } \sigma(W) \}.<br />
\end{align}<br />
<br />
However, the authors assume that the maximum singular value of W has only one left and one right normalized singular vector. Thus <math> \sigma </math> is differentiable and <br />
\begin{align}<br />
\nabla_W \sigma(W) =u_1v_1^T,<br />
\end{align}<br />
which explains the above computation.<br />
<br />
= Spectral Normalization VS Other Regularization Techniques =<br />
<br />
The weight normalization introduced by Salimans & Kingma (2016) is a method that normalizes the <math> l_2 </math> norm of each row vector in the weight matrix. Mathematically, this is equivalent to requiring the weight-normalized matrix <math> \bar{W_{WN}} </math> to satisfy:<br />
<br />
<math> \sigma_1(\bar{W_{WN}})^2+\cdots+\sigma_T(\bar{W_{WN}})^2=d_0, \text{ where } T=\min(d_i,d_0) </math> and <math> \sigma_t(A) </math> is the t-th singular value of matrix A. <br />
<br />
Note that if <math> \bar{W_{WN}} </math> is the weight-normalized matrix of dimension <math> d_i\times d_0 </math>, the norm <math> ||\bar{W_{WN}}h||_2 </math> for a fixed unit vector <math> h </math> is maximized at <math> ||\bar{W_{WN}}h||_2 = \sqrt{d_0} \text{ when } \sigma_1(\bar{W_{WN}})=\sqrt{d_0} \text{ and } \sigma_t(\bar{W_{WN}})=0, t=2, \cdots, T </math>, which means that <math> \bar{W_{WN}} </math> is of rank one. In order to retain as much of the norm of the input as possible and hence make the discriminator more sensitive, one would hope to make the norm of <math> \bar{W_{WN}}h </math> large. For weight normalization, however, this comes at the cost of reducing the rank and hence the number of features available to the discriminator. Thus, there is a conflict of interest between weight normalization and our desire to use as many features as possible to distinguish the generator distribution from the target distribution. The former interest often prevails, inadvertently diminishing the number of features used by the discriminator. Consequently, the algorithm would produce a rather arbitrary model distribution that matches the target distribution only at a select few features. <br />
<br />
Brock et al. (2016) introduced orthonormal regularization on each weight to stabilize the training of GANs. In their work, Brock et al. (2016) augmented the adversarial objective function by adding the following term:<br />
<br />
<math> ||W^TW-I||^2_F </math><br />
<br />
While this seems to serve the same purpose as spectral normalization, orthonormal regularization is mathematically quite different. It requires the weight matrix to be orthogonal, which coerces all singular values to be one, and therefore destroys the information about the spectrum. Spectral normalization, on the other hand, only scales the spectrum so that its maximum is one. <br />
<br />
Gulrajani et al. (2017) used a gradient penalty method in combination with WGAN. They imposed a K-Lipschitz constraint on the discriminator by augmenting the objective function with a regularizer that rewards the function for having a local 1-Lipschitz constant (i.e., <math> ||\nabla_{\hat{x}} f ||_2 = 1 </math>) at discrete sets of points of the form <math> \hat{x}:=\epsilon \tilde{x} + (1-\epsilon)x </math>, generated by interpolating a sample <math> \tilde{x} </math> from the generative distribution and a sample <math> x </math> from the data distribution. This approach has the obvious weakness of depending heavily on the support of the current generative distribution. Moreover, WGAN-GP has a higher computational cost than spectral normalization with single-step power iteration, because computing <math> ||\nabla_{\hat{x}} f ||_2 </math> requires one whole round of forward and backward propagation.<br />
<br />
= Experimental settings and results = <br />
== Objective function ==<br />
For all methods other than WGAN-GP, we use <br />
<math> V(G,D) := E_{x\sim q_{data}(x)}[\log D(x)] + E_{z\sim p(z)}[\log (1-D(G(z)))]</math><br />
to update D; for the updates of G, we use <math> -E_{z\sim p(z)}[\log(D(G(z)))] </math>. Alternatively, we test the performance of the algorithm with the so-called hinge loss, which is given by <br />
<math> V_D(\hat{G},D)= E_{x\sim q_{data}(x)}[\min(0,-1+D(x))] + E_{z\sim p(z)}[\min(0,-1-D(\hat{G}(z)))] </math>, <math> V_G(G,\hat{D})=-E_{z\sim p(z)}[\hat{D}(G(z))] </math><br />
<br />
For WGAN-GP, we choose <br />
<math> V(G,D):=E_{x\sim q_{data}}[D(x)]-E_{z\sim p(z)}[D(G(z))]- \lambda E_{\hat{x}\sim p(\hat{x})}[(||\nabla_{\hat{x}}D(\hat{x})||_2-1)^2]</math><br />
<br />
== Optimization ==<br />
Adam optimizer: 6 settings in total, related to <br />
* <math> n_{dis} </math>, the number of updates of the discriminator per one update of the generator. <br />
* learning rate <math> \alpha </math><br />
* the first and second momentum parameters <math> \beta_1, \beta_2 </math> of Adam<br />
<br />
[[File:inception score.png]]<br />
<br />
[[File:FID score.png]]<br />
<br />
The above images show the inception score and FID of each method under settings A-F, and the table shows the inception scores of the different methods with optimal settings on the CIFAR-10 and STL-10 datasets.<br />
<br />
== Singular values analysis on the weights of the discriminator D ==<br />
[[File:singular value.png]]<br />
<br />
In the above figure, we show the squared singular values of the weight matrices in the final discriminator D produced by each method, using the parameters that yielded the best inception score. As predicted, the singular values of the first to fifth layers trained with weight clipping and weight normalization concentrate on a few components. On the other hand, the singular values of the weight matrices in those layers trained with spectral normalization are more broadly distributed.<br />
<br />
== Training time ==<br />
On CIFAR-10, SN-GANs are slightly slower than weight normalization, at 31 seconds for 100 generations compared with 29 seconds for weight normalization, as seen in Figure 10. However, SN-GANs are significantly faster than WGAN-GP, which takes 40 seconds for 100 generations. As mentioned in Section 3, WGAN-GP is slower than the other methods because it needs to calculate the gradient of the gradient norm. For STL-10, the computational time of SN-GANs is almost the same as that of vanilla GANs, at approximately 61 seconds for 100 generations.<br />
<br />
[[File:trainingTime.png|center]]<br />
<br />
== Comparison between SN-GANs and orthonormal regularization ==<br />
[[File:comparison.png]]<br />
As explained in Section 3, orthonormal regularization differs from our method in that it destroys the spectral information and puts equal emphasis on all feature dimensions, including the ones that should be weeded out in the training process. To see the extent of its possibly detrimental effect, we experimented by increasing the dimension of the feature space, especially at the final layer, for which training with our spectral normalization prefers a relatively small feature space. The above figure shows the result of our experiments. As predicted, the performance of orthonormal regularization deteriorates as we increase the dimension of the feature maps at the final layer. SN-GANs, on the other hand, do not falter with this modification of the architecture.<br />
<br />
We also applied our method to the training of class-conditional GANs on the ILSVRC2012 dataset with 1000 classes, each consisting of approximately 1300 images, which we compressed to 128*128 pixels. GAN without normalization and GAN with layer normalization collapsed at the beginning of training and failed to produce any meaningful images. The above figure shows that the inception score of orthonormal regularization plateaued around 20k iterations, while SN kept improving even afterward.<br />
<br />
[[File:sngan.jpg]]<br />
<br />
Samples generated by various networks trained on CIFAR10. Momentum and training rates are increasing to the right. We can see that for high learning rates and momentum the Wasserstein-GAN does not generate good images, while weight and spectral normalization generate good samples.<br />
<br />
= Algorithm of spectral normalization =<br />
To calculate the largest singular value of matrix <math> W </math> to implement spectral normalization, we appeal to power iterations. Algorithm is executed as follows:<br />
<br />
* Initialize <math>\tilde{u}_{l}\in R^{d_l} \text{for} l=1,\cdots,L </math> with a random vector (sampled from isotropic distribution) <br />
* For each update and each layer l:<br />
** Apply power iteration method to a unnormalized weight <math> W^l </math>:<br />
<br />
\begin{align}<br />
\tilde{v_l}\leftarrow (W^l)^T\tilde{u_l}/||(W^l)^T\tilde{u_l}||_2<br />
\end{align}<br />
<br />
\begin{align}<br />
\tilde{u_l}\leftarrow W^l\tilde{v_l}/||W^l\tilde{v_l}||_2<br />
\end{align}<br />
<br />
* Calculate <math> \bar{W_{SN}} </math> with the spectral norm :<br />
<br />
\begin{align}<br />
\bar{W_{SN}}(W^l)=W^l/\sigma(W^l)<br />
\end{align}<br />
<br />
where<br />
<br />
\begin{align}<br />
\sigma(W^l)=\tilde{u_l}^TW^l\tilde{v_l}<br />
\end{align}<br />
<br />
* Update <math>W^l </math> with SGD on mini-batch dataset <math> D_M </math> with a learning rate <math> \alpha </math><br />
<br />
<br />
\begin{align}<br />
W^l\leftarrow W^l-\alpha\nabla_{W^l}l(\bar{W_{SN}^l}(W^l),D_M)<br />
\end{align}<br />
<br />
<br />
== Conclusions ==<br />
This paper proposes spectral normalization as a stabilizer for the training of GANs. When spectral normalization is applied to GANs on image generation tasks, the generated examples are more diverse than those obtained with conventional weight normalization, and they achieve better or comparable inception scores relative to previous studies. The method imposes global regularization on the discriminator, as opposed to the local regularization introduced by WGAN-GP. In future work, the authors would like to investigate how their method compares analytically to other methods, while also comparing it further empirically by conducting experiments on larger and more complex datasets.<br />
<br />
== Open Source Code ==<br />
The open source code for this paper can be found at https://github.com/pfnet-research/sngan_projection.<br />
<br />
== References ==<br />
# Lemaréchal, Claude, and J. B. Hiriart-Urruty. "Convex analysis and minimization algorithms I." Grundlehren der mathematischen Wissenschaften 305 (1996).</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Spectral_normalization_for_generative_adversial_network&diff=36312stat946w18/Spectral normalization for generative adversial network2018-04-18T14:23:06Z<p>Ws2chen: /* Model */</p>
<hr />
<div>= Presented by =<br />
<br />
1. liu, wenqing<br />
<br />
= Introduction =<br />
Generative adversarial networks (GANs) (Goodfellow et al., 2014) have been enjoying considerable success as a framework of generative models in recent years. The concept is to consecutively train the model distribution and the discriminator in turn, with the goal of reducing the difference between the model distribution and the target distribution measured by the best discriminator possible at each step of the training.<br />
<br />
A persisting challenge in the training of GANs is the performance control of the discriminator. When the support of the model distribution and the support of the target distribution are disjoint, there exists a discriminator that can perfectly distinguish the model distribution from the target (Arjovsky & Bottou, 2017). Once such a discriminator is produced, the training of the generator comes to a complete stop, because the derivative of the so-produced discriminator with respect to the input turns out to be 0. This motivates introducing some form of restriction on the choice of discriminator.<br />
<br />
In this paper, the authors propose a novel weight normalization method called ''spectral normalization'' that can stabilize the training of discriminator networks. The normalization enjoys following favorable properties:<br />
<br />
* The only hyper-parameter that needs to be tuned is the Lipschitz constant, and the algorithm is not too sensitive to this constant's value<br />
* The additional computational cost needed to implement spectral normalization is small. <br />
<br />
In this study, they provide explanations of the effectiveness of spectral normalization against other regularization or normalization techniques.<br />
<br />
= Model =<br />
<br />
Let us consider a simple discriminator made of a neural network of the following form, with the input x:<br />
<br />
\[f(x,\theta) = W^{L+1}a_L(W^L(a_{L-1}(W^{L-1}(\cdots a_1(W^1x)\cdots))))\]<br />
<br />
where <math> \theta:=\{W^1,\cdots,W^L, W^{L+1}\} </math> is the set of learnable parameters, <math>W^l\in R^{d_l*d_{l-1}}, W^{L+1}\in R^{1*d_L} </math>, and <math>a_l </math> is an element-wise non-linear activation function. The final output of the discriminator is given by <math>D(x,\theta) = A(f(x,\theta)) </math>, where <math>A</math> is an activation function chosen according to the divergence or distance measure in use. The standard formulation of GANs is given by <math>\min_{G}\max_{D}V(G,D)</math>, where the min and max are taken over the sets of generator and discriminator functions, respectively. <br />
<br />
The conventional form of <math>V(G,D) </math> is given by:<br />
<br />
\[E_{x\sim q_{data}}[\log D(x)] + E_{x'\sim p_G}[\log(1-D(x'))]\]<br />
<br />
where <math>q_{data}</math> is the data distribution and <math>p_G(x)</math> is the model generator distribution to be learned through the adversarial min-max optimization. It is known that, for a fixed generator G, the optimal discriminator for this form of <math>V(G,D) </math> is given by <br />
<br />
\[ D_G^{*}(x):=q_{data}(x)/(q_{data}(x)+p_G(x)) \]<br />
<br />
Also, the machine learning community has recently pointed out that the function space from which the discriminators are selected crucially affects the performance of GANs. A number of works advocate the importance of Lipschitz continuity in assuring the boundedness of statistics. For example, the optimal discriminator above can be written as: <br />
<br />
\[ D_G^{*}(x):=q_{data}(x)/(q_{data}(x)+p_G(x)) = \text{sigmoid}(f^{*} (x)), \text{ where } f^{*} (x) = \log q_{data} (x) - \log p_{G} (x). \] The derivative of <math> f^{*} </math> can be unbounded, which motivates imposing a regularity condition on the discriminator. <br />
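The identity between the optimal discriminator and the sigmoid of the log-density ratio can be checked numerically. The following is only an illustrative sketch: the two one-dimensional Gaussians standing in for <math>q_{data}</math> and <math>p_G</math>, and the helper name <code>gauss_pdf</code>, are assumptions made for the example and do not come from the paper.

```python
import numpy as np

def gauss_pdf(x, mu, sd):
    """Density of N(mu, sd^2); a stand-in for q_data and p_G (assumption)."""
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

x = np.linspace(-3.0, 3.0, 7)
q = gauss_pdf(x, 0.0, 1.0)   # hypothetical data density
p = gauss_pdf(x, 1.0, 1.0)   # hypothetical generator density

d_star = q / (q + p)                       # optimal discriminator D*_G(x)
f_star = np.log(q) - np.log(p)             # f*(x) = log q_data(x) - log p_G(x)
d_from_f = 1.0 / (1.0 + np.exp(-f_star))   # sigmoid(f*(x))
```

Both arrays agree up to floating-point error, since q/(q+p) = 1/(1 + p/q) = 1/(1 + exp(-f*)).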
<br />
We search for the discriminator D from the set of K-Lipschitz continuous functions, that is, <br />
<br />
\[ \arg\max_{||f||_{Lip}\le K}V(G,D)\]<br />
<br />
where <math> ||f||_{Lip}</math> denotes the smallest value M such that <math> ||f(x)-f(x')||/||x-x'||\le M </math> for any x, x', with the norm being the <math> l_2 </math> norm.<br />
<br />
Our spectral normalization controls the Lipschitz constant of the discriminator function <math> f </math> by literally constraining the spectral norm of each layer <math> g: h_{in}\rightarrow h_{out}</math>. By definition, Lipschitz norm <math> ||g||_{Lip} </math> is equal to <math> \sup_h\sigma(\nabla g(h)) </math>, where <math> \sigma(A) </math> is the spectral norm of the matrix A, which is equivalent to the largest singular value of A. Therefore, for a linear layer <math> g(h)=Wh </math>, the norm is given by <math> ||g||_{Lip}=\sigma(W) </math>. Observing the following bound:<br />
<br />
\[ ||f||_{Lip}\le ||(h_L\rightarrow W^{L+1}h_{L})||_{Lip}*||a_{L}||_{Lip}*||(h_{L-1}\rightarrow W^{L}h_{L-1})||_{Lip}\cdots ||a_1||_{Lip}*||(h_0\rightarrow W^1h_0)||_{Lip}=\prod_{l=1}^{L+1}\sigma(W^l) *\prod_{l=1}^{L} ||a_l||_{Lip} \]<br />
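As a rough sanity check of this bound, the sketch below builds a small ReLU network (ReLU is 1-Lipschitz, so the activation factors equal one) and compares empirical difference quotients with the product of the layers' spectral norms. The layer sizes, random weights, and sampling scheme are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical three-layer network: W1, W2 are hidden layers, W3 is the output layer.
W1 = rng.standard_normal((16, 8))
W2 = rng.standard_normal((16, 16))
W3 = rng.standard_normal((1, 16))

def relu(t):
    return np.maximum(t, 0.0)

def f(x):
    return W3 @ relu(W2 @ relu(W1 @ x))

# Product of spectral norms; np.linalg.norm(W, 2) is the largest singular value.
bound = np.prod([np.linalg.norm(W, 2) for W in (W1, W2, W3)])

# Empirical difference quotients ||f(x) - f(y)|| / ||x - y|| on random pairs.
ratios = []
for _ in range(1000):
    x, y = rng.standard_normal(8), rng.standard_normal(8)
    ratios.append(np.linalg.norm(f(x) - f(y)) / np.linalg.norm(x - y))
max_ratio = max(ratios)
```

The observed maximum ratio never exceeds the bound, and is usually far below it, which is a reminder that the product of spectral norms is an upper bound on the Lipschitz constant rather than its exact value.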
<br />
Our spectral normalization normalizes the spectral norm of the weight matrix W so that it satisfies the Lipschitz constraint <math> \sigma(W)=1 </math>:<br />
<br />
\[ \bar{W_{SN}}:= W/\sigma(W) \]<br />
<br />
In summary, just as weight normalization does, we reparameterize the weight matrix <math> \bar{W_{SN}} </math> as <math> W/\sigma(W) </math> to fix its largest singular value at one. We can now calculate the gradient with respect to the new parameter W by the chain rule:<br />
<br />
\[ \frac{\partial V(G,D)}{\partial W} = \frac{\partial V(G,D)}{\partial \bar{W_{SN}}}*\frac{\partial \bar{W_{SN}}}{\partial W} \]<br />
<br />
\[ \frac{\partial \bar{W_{SN}}}{\partial W_{ij}} = \frac{1}{\sigma(W)}E_{ij}-\frac{1}{\sigma(W)^2}*\frac{\partial \sigma(W)}{\partial(W_{ij})}W=\frac{1}{\sigma(W)}E_{ij}-\frac{[u_1v_1^T]_{ij}}{\sigma(W)^2}W=\frac{1}{\sigma(W)}(E_{ij}-[u_1v_1^T]_{ij}\bar{W_{SN}}) \]<br />
<br />
where <math> E_{ij} </math> is the matrix whose (i,j)-th entry is 1 and zero everywhere else, and <math> u_1, v_1</math> are respectively the first left and right singular vectors of W.<br />
<br />
To understand the above computation in more detail, note that <br />
\begin{align}<br />
\sigma(W)= \sup_{||u||=1, ||v||=1} \langle Wv, u \rangle = \sup_{||u||=1, ||v||=1} \text{trace} ( (uv^T)^T W).<br />
\end{align}<br />
By Theorem 4.4.2 in Lemaréchal and Hiriart-Urruty (1996), the sub-differential of a convex function defined as the maximum of a set of differentiable convex functions over a compact index set is the convex hull of the gradients of the maximizing functions. Thus we have the sub-differential:<br />
<br />
\begin{align}<br />
\partial \sigma = \text{convex hull} \{ u v^T: u,v \text{ are left/right singular vectors associated with } \sigma(W) \}.<br />
\end{align}<br />
<br />
However, the authors assume that the maximum singular value of W has only one left and one right normalized singular vector. Thus <math> \sigma </math> is differentiable and <br />
\begin{align}<br />
\nabla_W \sigma(W) =u_1v_1^T,<br />
\end{align}<br />
which explains the above computation.<br />
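The formula for the gradient of the spectral norm can be verified with finite differences. This is a minimal sketch assuming a random matrix whose largest singular value is simple (which holds almost surely for Gaussian entries); the matrix size and step size <code>h</code> are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 4))

# Analytic gradient: outer product of the first left and right singular vectors.
U, S, Vt = np.linalg.svd(W)
analytic = np.outer(U[:, 0], Vt[0])

def sigma(M):
    """Largest singular value of M."""
    return np.linalg.svd(M, compute_uv=False)[0]

# Central finite differences, one entry W_ij at a time.
h = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        E = np.zeros_like(W)
        E[i, j] = 1.0
        numeric[i, j] = (sigma(W + h * E) - sigma(W - h * E)) / (2.0 * h)

max_err = np.max(np.abs(numeric - analytic))
```

The entrywise discrepancy is at the level of finite-difference error, consistent with <math> \nabla_W \sigma(W) = u_1 v_1^T </math> when the top singular value is not repeated.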
<br />
= Spectral Normalization VS Other Regularization Techniques =<br />
<br />
The weight normalization introduced by Salimans & Kingma (2016) is a method that normalizes the <math> l_2 </math> norm of each row vector in the weight matrix. Mathematically, this is equivalent to requiring that the weight-normalized matrix <math> \bar{W_{WN}} </math> satisfy:<br />
<br />
<math> \sigma_1(\bar{W_{WN}})^2+\cdots+\sigma_T(\bar{W_{WN}})^2=d_0, \text{where } T=\min(d_i,d_0) </math> where <math> \sigma_t(A) </math> is a t-th singular value of matrix A. <br />
<br />
Note that if <math> \bar{W_{WN}} </math> is the weight-normalized matrix of dimension <math> d_i*d_0 </math>, the norm <math> ||\bar{W_{WN}}h||_2 </math> for a fixed unit vector <math> h </math> is maximized at <math> ||\bar{W_{WN}}h||_2 = \sqrt{d_0} \text{ when } \sigma_1(\bar{W_{WN}})=\sqrt{d_0} \text{ and } \sigma_t(\bar{W_{WN}})=0, t=2, \cdots, T </math>, which means that <math> \bar{W_{WN}} </math> has rank one. To retain as much of the norm of the input as possible, and hence to make the discriminator more sensitive, one would hope to make the norm of <math> \bar{W_{WN}}h </math> large. For weight normalization, however, this comes at the cost of reducing the rank and hence the number of features available to the discriminator. Thus, there is a conflict between weight normalization and the desire to use as many features as possible to distinguish the generator distribution from the target distribution. The former often prevails, inadvertently reducing the number of features used by the discriminator. Consequently, the algorithm would produce a rather arbitrary model distribution that matches the target distribution on only a select few features. <br />
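The constraint on the singular values and the rank-one extreme case can be illustrated with a short numerical sketch. It assumes the weight matrix is stored with <math>d_0</math> rows (one per output unit) and that weight normalization rescales each row to unit <math>l_2</math> norm with no learned gain; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_o, d_i = 6, 10                      # hypothetical output and input dimensions
W = rng.standard_normal((d_o, d_i))

# Row-wise weight normalization (assumption: gain fixed to one).
W_wn = W / np.linalg.norm(W, axis=1, keepdims=True)
s = np.linalg.svd(W_wn, compute_uv=False)
sum_sq = np.sum(s ** 2)               # squared Frobenius norm = number of rows

# Rank-one extreme: every row is the same unit vector h.
h = rng.standard_normal(d_i)
h /= np.linalg.norm(h)
W_rank1 = np.tile(h, (d_o, 1))
s_rank1 = np.linalg.svd(W_rank1, compute_uv=False)
out_norm = np.linalg.norm(W_rank1 @ h)
```

In the rank-one case the whole budget <math>d_0</math> is spent on a single singular value, so the output norm for input <math>h</math> reaches <math>\sqrt{d_0}</math> while all other singular values vanish.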
<br />
Brock et al. (2016) introduced orthonormal regularization on each weight to stabilize the training of GANs. In their work, Brock et al. (2016) augmented the adversarial objective function by adding the following term:<br />
<br />
<math> ||W^TW-I||^2_F </math><br />
<br />
While this seems to serve the same purpose as spectral normalization, orthonormal regularization is mathematically quite different: it requires the weight matrix to be orthogonal, which coerces all singular values to be one and thereby destroys the information about the spectrum. Spectral normalization, on the other hand, only scales the spectrum so that its maximum is one. <br />
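The difference between the two approaches can be made concrete with a small sketch. It assumes a square random weight matrix, for which the orthonormal penalty equals the sum of <math>(\sigma_i^2-1)^2</math> over all singular values; the helper name <code>ortho_penalty</code> is introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))

def ortho_penalty(M):
    """Orthonormal regularization term ||M^T M - I||_F^2."""
    return np.linalg.norm(M.T @ M - np.eye(M.shape[1]), "fro") ** 2

s = np.linalg.svd(W, compute_uv=False)

# The penalty depends on every singular value and is zero only if all equal one.
penalty_from_spectrum = np.sum((s ** 2 - 1.0) ** 2)

# Spectral normalization rescales the spectrum by its maximum and nothing else.
W_sn = W / s[0]
s_sn = np.linalg.svd(W_sn, compute_uv=False)
```

After spectral normalization the largest singular value is one and the relative sizes of the remaining singular values are unchanged, whereas driving the penalty to zero would force all of them to one.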
<br />
Gulrajani et al. (2017) used a gradient penalty method in combination with WGAN. They imposed a K-Lipschitz constraint on the discriminator by augmenting the objective function with a regularizer that rewards the function for having a local 1-Lipschitz constant (i.e., <math> ||\nabla_{\hat{x}} f ||_2 = 1 </math>) at discrete sets of points of the form <math> \hat{x}:=\epsilon \tilde{x} + (1-\epsilon)x </math>, generated by interpolating a sample <math> \tilde{x} </math> from the generative distribution and a sample <math> x </math> from the data distribution. This approach has the obvious weakness of depending heavily on the support of the current generative distribution. Moreover, WGAN-GP has a higher computational cost than spectral normalization with single-step power iteration, because computing <math> ||\nabla_{\hat{x}} f ||_2 </math> requires one whole round of forward and backward propagation.<br />
<br />
= Experimental settings and results = <br />
== Objective function ==<br />
For all methods other than WGAN-GP, we use <br />
<math> V(G,D) := E_{x\sim q_{data}(x)}[\log D(x)] + E_{z\sim p(z)}[\log (1-D(G(z)))]</math><br />
to update D; for the updates of G, we use <math> -E_{z\sim p(z)}[\log(D(G(z)))] </math>. Alternatively, we test the performance of the algorithm with the so-called hinge loss, which is given by <br />
<math> V_D(\hat{G},D)= E_{x\sim q_{data}(x)}[\min(0,-1+D(x))] + E_{z\sim p(z)}[\min(0,-1-D(\hat{G}(z)))] </math>, <math> V_G(G,\hat{D})=-E_{z\sim p(z)}[\hat{D}(G(z))] </math><br />
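The hinge objectives can be evaluated on a mini-batch as in the sketch below. It assumes the expectations are replaced by batch means, and it reads <math>V_D</math> as the quantity the discriminator maximizes and <math>V_G</math> as the quantity the generator minimizes; the function names and the toy discriminator outputs are made up for illustration.

```python
import numpy as np

def hinge_d_objective(d_real, d_fake):
    """V_D: batch estimate of E[min(0, -1 + D(x))] + E[min(0, -1 - D(G(z)))]."""
    return (np.mean(np.minimum(0.0, -1.0 + d_real))
            + np.mean(np.minimum(0.0, -1.0 - d_fake)))

def hinge_g_objective(d_fake):
    """V_G: batch estimate of -E[D(G(z))]."""
    return -np.mean(d_fake)

# Hypothetical raw discriminator outputs on real and generated samples.
d_real = np.array([2.0, 0.5, -1.0])
d_fake = np.array([-2.0, 0.0, 1.5])

v_d = hinge_d_objective(d_real, d_fake)   # -2.0 for these values
v_g = hinge_g_objective(d_fake)           # 1/6 for these values
```

Real samples scored above 1 and generated samples scored below -1 contribute nothing to <math>V_D</math>, so only samples inside or on the wrong side of the margin affect the discriminator update.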
<br />
For WGAN-GP, we choose <br />
<math> V(G,D):=E_{x\sim q_{data}}[D(x)]-E_{z\sim p(z)}[D(G(z))]- \lambda E_{\hat{x}\sim p(\hat{x})}[(||\nabla_{\hat{x}}D(\hat{x})||_2-1)^2]</math><br />
<br />
== Optimization ==<br />
Adam optimizer: 6 settings in total, related to <br />
* <math> n_{dis} </math>, the number of updates of the discriminator per one update of the generator. <br />
* learning rate <math> \alpha </math><br />
* the first and second momentum parameters <math> \beta_1, \beta_2 </math> of Adam<br />
<br />
[[File:inception score.png]]<br />
<br />
[[File:FID score.png]]<br />
<br />
The above images show the inception score and FID of each method under settings A-F, and the table shows the inception scores of the different methods with optimal settings on the CIFAR-10 and STL-10 datasets.<br />
<br />
== Singular values analysis on the weights of the discriminator D ==<br />
[[File:singular value.png]]<br />
<br />
In the above figure, we show the squared singular values of the weight matrices in the final discriminator D produced by each method, using the parameters that yielded the best inception score. As predicted, the singular values of the first to fifth layers trained with weight clipping and weight normalization concentrate on a few components. On the other hand, the singular values of the weight matrices in those layers trained with spectral normalization are more broadly distributed.<br />
<br />
== Training time ==<br />
On CIFAR-10, SN-GANs are slightly slower than weight normalization, at 31 seconds for 100 generations compared with 29 seconds for weight normalization, as seen in Figure 10. However, SN-GANs are significantly faster than WGAN-GP, which takes 40 seconds for 100 generations. As mentioned in Section 3, WGAN-GP is slower than the other methods because it needs to calculate the gradient of the gradient norm. For STL-10, the computational time of SN-GANs is almost the same as that of vanilla GANs, at approximately 61 seconds for 100 generations.<br />
<br />
[[File:trainingTime.png|center]]<br />
<br />
== Comparison between SN-GANs and orthonormal regularization ==<br />
[[File:comparison.png]]<br />
As explained in Section 3, orthonormal regularization differs from our method in that it destroys the spectral information and puts equal emphasis on all feature dimensions, including the ones that should be weeded out in the training process. To see the extent of its possibly detrimental effect, we experimented by increasing the dimension of the feature space, especially at the final layer, for which training with our spectral normalization prefers a relatively small feature space. The above figure shows the result of our experiments. As predicted, the performance of orthonormal regularization deteriorates as we increase the dimension of the feature maps at the final layer. SN-GANs, on the other hand, do not falter with this modification of the architecture.<br />
<br />
We also applied our method to the training of class-conditional GANs on the ILSVRC2012 dataset with 1000 classes, each consisting of approximately 1300 images, which we compressed to 128*128 pixels. GAN without normalization and GAN with layer normalization collapsed at the beginning of training and failed to produce any meaningful images. The above figure shows that the inception score of orthonormal regularization plateaued around 20k iterations, while SN kept improving even afterward.<br />
<br />
[[File:sngan.jpg]]<br />
<br />
Samples generated by various networks trained on CIFAR10. Momentum and training rates are increasing to the right. We can see that for high learning rates and momentum the Wasserstein-GAN does not generate good images, while weight and spectral normalization generate good samples.<br />
<br />
= Algorithm of spectral normalization =<br />
To calculate the largest singular value of matrix <math> W </math> to implement spectral normalization, we appeal to power iterations. Algorithm is executed as follows:<br />
<br />
* Initialize <math>\tilde{u}_{l}\in R^{d_l} \text{for} l=1,\cdots,L </math> with a random vector (sampled from isotropic distribution) <br />
* For each update and each layer l:<br />
** Apply power iteration method to a unnormalized weight <math> W^l </math>:<br />
<br />
\begin{align}<br />
\tilde{v_l}\leftarrow (W^l)^T\tilde{u_l}/||(W^l)^T\tilde{u_l}||_2<br />
\end{align}<br />
<br />
\begin{align}<br />
\tilde{u_l}\leftarrow W^l\tilde{v_l}/||W^l\tilde{v_l}||_2<br />
\end{align}<br />
<br />
* Calculate <math> \bar{W_{SN}} </math> with the spectral norm :<br />
<br />
\begin{align}<br />
\bar{W_{SN}}(W^l)=W^l/\sigma(W^l)<br />
\end{align}<br />
<br />
where<br />
<br />
\begin{align}<br />
\sigma(W^l)=\tilde{u_l}^TW^l\tilde{v_l}<br />
\end{align}<br />
<br />
* Update <math>W^l </math> with SGD on mini-batch dataset <math> D_M </math> with a learning rate <math> \alpha </math><br />
<br />
<br />
\begin{align}<br />
W^l\leftarrow W^l-\alpha\nabla_{W^l}l(\bar{W_{SN}^l}(W^l),D_M)<br />
\end{align}<br />
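The power-iteration step and the normalization above can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation: the function name <code>spectral_normalize</code>, the <code>eps</code> safeguard, and the layer shape are assumptions, and the SGD update of the underlying weight is omitted. The second update is written as <math>W^l\tilde{v_l}</math>, which is what the dimensions require.

```python
import numpy as np

def spectral_normalize(W, u, n_iters=1, eps=1e-12):
    """Run n_iters power-iteration steps on W, then divide W by the estimated sigma.

    W has shape (d_l, d_{l-1}); u has shape (d_l,) and is carried between calls.
    """
    for _ in range(n_iters):
        v = W.T @ u
        v = v / (np.linalg.norm(v) + eps)   # v <- W^T u / ||W^T u||
        u = W @ v
        u = u / (np.linalg.norm(u) + eps)   # u <- W v / ||W v||
    sigma = u @ W @ v                        # sigma(W) ~ u^T W v
    return W / sigma, u, sigma

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))            # hypothetical layer weight
u = rng.standard_normal(64)                  # random initial vector

# Many iterations on a fixed W approach the exact largest singular value.
W_sn, u, sigma = spectral_normalize(W, u, n_iters=1000)
exact = np.linalg.svd(W, compute_uv=False)[0]
```

In training, a single iteration per update is used and <code>u</code> is reused from the previous step; since the weights change slowly between updates, the carried-over vector stays close to the leading singular vector.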
<br />
<br />
== Conclusions ==<br />
This paper proposes spectral normalization as a stabilizer for the training of GANs. When spectral normalization is applied to GANs on image generation tasks, the generated examples are more diverse than those obtained with conventional weight normalization, and they achieve better or comparable inception scores relative to previous studies. The method imposes global regularization on the discriminator, as opposed to the local regularization introduced by WGAN-GP. In future work, the authors would like to investigate how their method compares analytically to other methods, while also comparing it further empirically by conducting experiments on larger and more complex datasets.<br />
<br />
== Open Source Code ==<br />
The open source code for this paper can be found at https://github.com/pfnet-research/sngan_projection.<br />
<br />
== References ==<br />
# Lemaréchal, Claude, and J. B. Hiriart-Urruty. "Convex analysis and minimization algorithms I." Grundlehren der mathematischen Wissenschaften 305 (1996).</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Spectral_normalization_for_generative_adversial_network&diff=36311stat946w18/Spectral normalization for generative adversial network2018-04-18T14:22:45Z<p>Ws2chen: /* Model */</p>
<hr />
<div>= Presented by =<br />
<br />
1. liu, wenqing<br />
<br />
= Introduction =<br />
Generative adversarial networks (GANs) (Goodfellow et al., 2014) have been enjoying considerable success as a framework of generative models in recent years. The concept is to consecutively train the model distribution and the discriminator in turn, with the goal of reducing the difference between the model distribution and the target distribution measured by the best discriminator possible at each step of the training.<br />
<br />
A persisting challenge in the training of GANs is the performance control of the discriminator. When the support of the model distribution and the support of the target distribution are disjoint, there exists a discriminator that can perfectly distinguish the model distribution from the target (Arjovsky & Bottou, 2017). Once such a discriminator is produced, the training of the generator comes to a complete stop, because the derivative of the so-produced discriminator with respect to the input turns out to be 0. This motivates introducing some form of restriction on the choice of discriminator.<br />
<br />
In this paper, the authors propose a novel weight normalization method called ''spectral normalization'' that can stabilize the training of discriminator networks. The normalization enjoys following favorable properties:<br />
<br />
* The only hyper-parameter that needs to be tuned is the Lipschitz constant, and the algorithm is not too sensitive to this constant's value<br />
* The additional computational cost needed to implement spectral normalization is small. <br />
<br />
In this study, they provide explanations of the effectiveness of spectral normalization against other regularization or normalization techniques.<br />
<br />
= Model =<br />
<br />
Let us consider a simple discriminator made of a neural network of the following form, with the input x:<br />
<br />
\[f(x,\theta) = W^{L+1}a_L(W^L(a_{L-1}(W^{L-1}(\cdots a_1(W^1x)\cdots))))\]<br />
<br />
where <math> \theta:=\{W^1,\cdots,W^L, W^{L+1}\} </math> is the set of learnable parameters, <math>W^l\in R^{d_l*d_{l-1}}, W^{L+1}\in R^{1*d_L} </math>, and <math>a_l </math> is an element-wise non-linear activation function. The final output of the discriminator is given by <math>D(x,\theta) = A(f(x,\theta)) </math>, where <math>A</math> is an activation function chosen according to the divergence or distance measure in use. The standard formulation of GANs is given by <math>\min_{G}\max_{D}V(G,D)</math>, where the min and max are taken over the sets of generator and discriminator functions, respectively. <br />
<br />
The conventional form of <math>V(G,D) </math> is given by:<br />
<br />
\[E_{x\sim q_{data}}[\log D(x)] + E_{x'\sim p_G}[\log(1-D(x'))]\]<br />
<br />
where <math>q_{data}</math> is the data distribution and <math>p_G(x)</math> is the model generator distribution to be learned through the adversarial min-max optimization. It is known that, for a fixed generator G, the optimal discriminator for this form of <math>V(G,D) </math> is given by <br />
<br />
\[ D_G^{*}(x):=q_{data}(x)/(q_{data}(x)+p_G(x)) \]<br />
<br />
Also, the machine learning community has recently pointed out that the function space from which the discriminators are selected crucially affects the performance of GANs. A number of works advocate the importance of Lipschitz continuity in assuring the boundedness of statistics. For example, the optimal discriminator above can be written as: <br />
<br />
\[ D_G^{*}(x):=q_{data}(x)/(q_{data}(x)+p_G(x)) = \text{sigmoid}(f^{*} (x)), \text{ where } f^{*} (x) = \log q_{data} (x) - \log p_{G} (x). \] The derivative of <math> f^{*} </math> can be unbounded, which motivates imposing a regularity condition on the discriminator. <br />
<br />
We search for the discriminator D from the set of K-Lipschitz continuous functions, that is, <br />
<br />
\[ \arg\max_{||f||_{Lip}\le K}V(G,D)\]<br />
<br />
where <math> ||f||_{Lip}</math> denotes the smallest value M such that <math> ||f(x)-f(x')||/||x-x'||\le M </math> for any x, x', with the norm being the <math> l_2 </math> norm.<br />
<br />
Our spectral normalization controls the Lipschitz constant of the discriminator function <math> f </math> by literally constraining the spectral norm of each layer <math> g: h_{in}\rightarrow h_{out}</math>. By definition, Lipschitz norm <math> ||g||_{Lip} </math> is equal to <math> \sup_h\sigma(\nabla g(h)) </math>, where <math> \sigma(A) </math> is the spectral norm of the matrix A, which is equivalent to the largest singular value of A. Therefore, for a linear layer <math> g(h)=Wh </math>, the norm is given by <math> ||g||_{Lip}=\sigma(W) </math>. Observing the following bound:<br />
<br />
\[ ||f||_{Lip}\le ||(h_L\rightarrow W^{L+1}h_{L})||_{Lip}*||a_{L}||_{Lip}*||(h_{L-1}\rightarrow W^{L}h_{L-1})||_{Lip}\cdots ||a_1||_{Lip}*||(h_0\rightarrow W^1h_0)||_{Lip}=\prod_{l=1}^{L+1}\sigma(W^l) *\prod_{l=1}^{L} ||a_l||_{Lip} \]<br />
<br />
Our spectral normalization normalizes the spectral norm of the weight matrix W so that it satisfies the Lipschitz constraint <math> \sigma(W)=1 </math>:<br />
<br />
\[ \bar{W_{SN}}:= W/\sigma(W) \]<br />
<br />
In summary, just as weight normalization does, we reparameterize the weight matrix <math> \bar{W_{SN}} </math> as <math> W/\sigma(W) </math> to fix its largest singular value at one. We can now calculate the gradient with respect to the new parameter W by the chain rule:<br />
<br />
\[ \frac{\partial V(G,D)}{\partial W} = \frac{\partial V(G,D)}{\partial \bar{W_{SN}}}*\frac{\partial \bar{W_{SN}}}{\partial W} \]<br />
<br />
\[ \frac{\partial \bar{W_{SN}}}{\partial W_{ij}} = \frac{1}{\sigma(W)}E_{ij}-\frac{1}{\sigma(W)^2}*\frac{\partial \sigma(W)}{\partial(W_{ij})}W=\frac{1}{\sigma(W)}E_{ij}-\frac{[u_1v_1^T]_{ij}}{\sigma(W)^2}W=\frac{1}{\sigma(W)}(E_{ij}-[u_1v_1^T]_{ij}\bar{W_{SN}}) \]<br />
<br />
where <math> E_{ij} </math> is the matrix whose (i,j)-th entry is 1 and zero everywhere else, and <math> u_1, v_1</math> are respectively the first left and right singular vectors of W.<br />
<br />
To understand the above computation in more detail, note that <br />
\begin{align}<br />
\sigma(W)= \sup_{||u||=1, ||v||=1} \langle Wv, u \rangle = \sup_{||u||=1, ||v||=1} \text{trace} ( (uv^T)^T W).<br />
\end{align}<br />
By Theorem 4.4.2 in Lemaréchal and Hiriart-Urruty (1996), the sub-differential of a convex function defined as the maximum of a set of differentiable convex functions over a compact index set is the convex hull of the gradients of the maximizing functions. Thus we have the sub-differential:<br />
<br />
\begin{align}<br />
\partial \sigma = \text{convex hull} \{ u v^T: u,v \text{ are left/right singular vectors associated with } \sigma(W) \}.<br />
\end{align}<br />
<br />
However, the authors assume that the maximum singular value of W has only one left and one right normalized singular vector. Thus <math> \sigma </math> is differentiable and <br />
\begin{align}<br />
\nabla_W \sigma(W) =u_1v_1^T,<br />
\end{align}<br />
which explains the above computation.<br />
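The identity <math> \nabla_W \sigma(W) = u_1v_1^T </math> can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper names <code>spectral_norm_grad</code> and <code>finite_difference_grad</code> are hypothetical and not taken from the paper. It compares the analytic gradient from the SVD against a central finite-difference estimate on a random matrix, whose largest singular value is simple with probability one.<br />

```python
import numpy as np

def spectral_norm_grad(W):
    """Return sigma(W) and the analytic gradient u1 v1^T from the SVD of W."""
    U, S, Vt = np.linalg.svd(W)
    return S[0], np.outer(U[:, 0], Vt[0, :])

def finite_difference_grad(W, eps=1e-6):
    """Central finite-difference estimate of d sigma / d W_ij."""
    G = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            E = np.zeros_like(W)
            E[i, j] = eps  # perturb only the (i, j) entry
            G[i, j] = (np.linalg.norm(W + E, 2) - np.linalg.norm(W - E, 2)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
sigma, analytic = spectral_norm_grad(W)
numeric = finite_difference_grad(W)
# Largest entrywise gap between the two gradients; expected to be tiny.
max_abs_err = float(np.max(np.abs(analytic - numeric)))
```

The product <math> u_1v_1^T </math> does not depend on the sign convention of the SVD routine, because flipping the sign of <math> u_1 </math> also flips the sign of <math> v_1 </math>.<br />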
<br />
= Spectral Normalization VS Other Regularization Techniques =<br />
<br />
The weight normalization introduced by Salimans & Kingma (2016) is a method that normalizes the <math> l_2 </math> norm of each row vector in the weight matrix. Mathematically, this is equivalent to requiring that the weight-normalized matrix <math> \bar{W_{WN}} </math> satisfy:<br />
<br />
<math> \sigma_1(\bar{W_{WN}})^2+\cdots+\sigma_T(\bar{W_{WN}})^2=d_0, \text{ where } T=\min(d_i,d_0) </math> and <math> \sigma_t(A) </math> is the t-th singular value of matrix A. <br />
<br />
Note that if <math> \bar{W_{WN}} </math> is the weight-normalized matrix of dimension <math> d_i*d_0 </math>, the norm <math> ||\bar{W_{WN}}h||_2 </math> for a fixed unit vector <math> h </math> is maximized at <math> ||\bar{W_{WN}}h||_2=\sqrt{d_0} \text{ when } \sigma_1(\bar{W_{WN}})=\sqrt{d_0} \text{ and } \sigma_t(\bar{W_{WN}})=0, t=2, \cdots, T </math>, which means that <math> \bar{W_{WN}} </math> is of rank one. To retain as much of the norm of the input as possible, and hence to make the discriminator more sensitive, one would hope to make the norm of <math> \bar{W_{WN}}h </math> large. For weight normalization, however, this comes at the cost of reducing the rank and hence the number of features available to the discriminator. Thus, there is a conflict of interest between weight normalization and our desire to use as many features as possible to distinguish the generator distribution from the target distribution. The former often prevails, inadvertently diminishing the number of features used by the discriminator. Consequently, the algorithm would produce a rather arbitrary model distribution that matches the target distribution only on a select few features. <br />
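The constraint and its rank-one extreme can be illustrated with a small numerical sketch. It assumes NumPy and uses a hypothetical helper <code>normalize_rows</code>; the matrix sizes are arbitrary choices for illustration. With every row scaled to unit <math> l_2 </math> norm, the squared singular values sum to the number of rows, and when all rows equal the same unit vector <math> h </math> the whole budget sits in a single singular value.<br />

```python
import numpy as np

def normalize_rows(W):
    """Rescale every row of W to unit l2 norm (the weight-normalization constraint)."""
    return W / np.linalg.norm(W, axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_rows, n_cols = 5, 8  # arbitrary illustrative sizes
W_wn = normalize_rows(rng.standard_normal((n_rows, n_cols)))
# Sum of squared singular values equals the squared Frobenius norm, i.e. n_rows.
sum_sq = float(np.sum(np.linalg.svd(W_wn, compute_uv=False) ** 2))

# Extreme case: every row equals the same unit vector h, so the matrix has rank one.
h = rng.standard_normal(n_cols)
h /= np.linalg.norm(h)
W_rank1 = np.tile(h, (n_rows, 1))
s_rank1 = np.linalg.svd(W_rank1, compute_uv=False)
# The output norm for input h reaches the maximum sqrt(n_rows).
out_norm = float(np.linalg.norm(W_rank1 @ h))
```

In this sketch the number of rows plays the role of <math> d_0 </math> in the text above.<br />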
<br />
Brock et al. (2016) introduced orthonormal regularization on each weight to stabilize the training of GANs. In their work, Brock et al. (2016) augmented the adversarial objective function by adding the following term:<br />
<br />
<math> ||W^TW-I||^2_F </math><br />
<br />
While this seems to serve the same purpose as spectral normalization, orthonormal regularization is mathematically quite different. It requires the weight matrix to be orthogonal, which forces all singular values to be one and therefore destroys the information about the spectrum. Spectral normalization, on the other hand, only rescales the spectrum so that its maximum is one. <br />
<br />
Gulrajani et al. (2017) used a gradient penalty method in combination with WGAN. They imposed a K-Lipschitz constraint on the discriminator by augmenting the objective function with a regularizer that rewards the function for having a local 1-Lipschitz constant (i.e., <math> ||\nabla_{\hat{x}} f ||_2 = 1 </math>) at discrete sets of points of the form <math> \hat{x}:=\epsilon \tilde{x} + (1-\epsilon)x </math>, generated by interpolating a sample <math> \tilde{x} </math> from the generative distribution and a sample <math> x </math> from the data distribution. This approach has the obvious weakness of depending heavily on the support of the current generative distribution. Moreover, WGAN-GP has a higher computational cost than our spectral normalization with single-step power iteration, because computing <math> ||\nabla_{\hat{x}} f ||_2 </math> requires one whole round of forward and backward propagation.<br />
<br />
= Experimental settings and results = <br />
== Objective function ==<br />
For all methods other than WGAN-GP, we use <br />
<math> V(G,D) := E_{x\sim q_{data}(x)}[\log D(x)] + E_{z\sim p(z)}[\log (1-D(G(z)))]</math><br />
to update D, and for the updates of G we use <math> -E_{z\sim p(z)}[\log(D(G(z)))] </math>. Alternatively, we test the performance of the algorithm with the so-called hinge loss, which is given by <br />
<math> V_D(\hat{G},D)= E_{x\sim q_{data}(x)}[\min(0,-1+D(x))] + E_{z\sim p(z)}[\min(0,-1-D(\hat{G}(z)))] </math>, <math> V_G(G,\hat{D})=-E_{z\sim p(z)}[\hat{D}(G(z))] </math><br />
<br />
For WGAN-GP, we choose <br />
<math> V(G,D):=E_{x\sim q_{data}}[D(x)]-E_{z\sim p(z)}[D(G(z))]- \lambda E_{\hat{x}\sim p(\hat{x})}[(||\nabla_{\hat{x}}D(\hat{x})||_2-1)^2]</math><br />
<br />
== Optimization ==<br />
Adam optimizer: 6 settings in total, which differ in <br />
* <math> n_{dis} </math>, the number of updates of the discriminator per one update of the generator. <br />
* learning rate <math> \alpha </math><br />
* the first and second momentum parameters <math> \beta_1, \beta_2 </math> of Adam<br />
<br />
[[File:inception score.png]]<br />
<br />
[[File:FID score.png]]<br />
<br />
The figures above show the inception score and FID of each method under settings A-F, and the table shows the inception scores of the different methods with their optimal settings on the CIFAR-10 and STL-10 datasets.<br />
<br />
== Singular values analysis on the weights of the discriminator D ==<br />
[[File:singular value.png]]<br />
<br />
In the above figure, we show the squared singular values of the weight matrices in the final discriminator D produced by each method, using the parameters that yielded the best inception score. As predicted, the singular values of the first to fifth layers trained with weight clipping and weight normalization concentrate on a few components. In contrast, the singular values of the weight matrices in those layers trained with spectral normalization are more broadly distributed.<br />
<br />
== Training time ==<br />
On CIFAR-10, SN-GANs are slightly slower than weight normalization, taking 31 seconds for 100 generations compared to 29 seconds for weight normalization, as seen in Figure 10. However, SN-GANs are significantly faster than WGAN-GP, which takes 40 seconds for 100 generations. As mentioned in Section 3, WGAN-GP is slower than the other methods because it needs to calculate the gradient of the gradient norm. For STL-10, the computational time of SN-GANs is almost the same as that of vanilla GANs, at approximately 61 seconds for 100 generations.<br />
<br />
[[File:trainingTime.png|center]]<br />
<br />
== Comparison between SN-GANs and orthonormal regularization ==<br />
[[File:comparison.png]]<br />
As we explained in Section 3, orthonormal regularization differs from our method in that it destroys the spectral information and puts equal emphasis on all feature dimensions, including the ones that should be weeded out during training. To see the extent of its possibly detrimental effect, we experimented by increasing the dimension of the feature space, especially at the final layer, for which training with spectral normalization prefers a relatively small feature space. The figure above shows the result of our experiments. As predicted, the performance of orthonormal regularization deteriorates as we increase the dimension of the feature maps at the final layer. SN-GANs, on the other hand, do not falter with this modification of the architecture.<br />
<br />
We also applied our method to the training of class-conditional GANs on the ILSVRC2012 dataset with 1000 classes, each consisting of approximately 1300 images, which we compressed to 128*128 pixels. GANs without normalization and GANs with layer normalization collapsed at the beginning of training and failed to produce any meaningful images. The figure above shows that the inception score of orthonormal regularization plateaued around 20k iterations, while SN kept improving even afterward.<br />
<br />
[[File:sngan.jpg]]<br />
<br />
Samples generated by various networks trained on CIFAR10. Momentum and learning rates increase to the right. For high learning rates and momentum, the Wasserstein GAN does not generate good images, while weight normalization and spectral normalization generate good samples.<br />
<br />
= Algorithm of spectral normalization =<br />
To calculate the largest singular value of the matrix <math> W </math> needed for spectral normalization, we appeal to power iteration. The algorithm proceeds as follows:<br />
<br />
* Initialize <math>\tilde{u}_{l}\in R^{d_l} \text{ for } l=1,\cdots,L </math> with a random vector (sampled from an isotropic distribution) <br />
* For each update and each layer l:<br />
** Apply the power iteration method to the unnormalized weight <math> W^l </math>:<br />
<br />
\begin{align}<br />
\tilde{v_l}\leftarrow (W^l)^T\tilde{u_l}/||(W^l)^T\tilde{u_l}||_2<br />
\end{align}<br />
<br />
\begin{align}<br />
\tilde{u_l}\leftarrow W^l\tilde{v_l}/||W^l\tilde{v_l}||_2<br />
\end{align}<br />
<br />
* Calculate <math> \bar{W_{SN}} </math> with the spectral norm :<br />
<br />
\begin{align}<br />
\bar{W_{SN}}(W^l)=W^l/\sigma(W^l)<br />
\end{align}<br />
<br />
where<br />
<br />
\begin{align}<br />
\sigma(W^l)=\tilde{u_l}^TW^l\tilde{v_l}<br />
\end{align}<br />
<br />
* Update <math>W^l </math> with SGD on mini-batch dataset <math> D_M </math> with a learning rate <math> \alpha </math><br />
<br />
<br />
\begin{align}<br />
W^l\leftarrow W^l-\alpha\nabla_{W^l}l(\bar{W_{SN}^l}(W^l),D_M)<br />
\end{align}<br />
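The power iteration steps above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function name <code>power_iteration_step</code> is hypothetical, the matrix is random, and the step is repeated many times only so that the standalone estimate can be compared with the exact largest singular value. During training, a single step per update is applied and <math> \tilde{u}_l </math> is carried over between updates.<br />

```python
import numpy as np

def power_iteration_step(W, u):
    """One power-iteration step; returns updated (u, v) and the estimate u^T W v."""
    v = W.T @ u
    v /= np.linalg.norm(v)  # approximate first right singular vector
    u = W @ v
    u /= np.linalg.norm(u)  # approximate first left singular vector
    sigma = float(u @ W @ v)
    return u, v, sigma

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))
u = rng.standard_normal(6)  # random initial vector
for _ in range(500):  # many steps here only to reach convergence in this demo
    u, v, sigma = power_iteration_step(W, u)

# Spectrally normalized weight: its largest singular value is approximately one.
W_sn = W / sigma
```

Because <math> \tilde{u}_l </math> is reused across updates and the weights change slowly, one step per update is usually enough to track <math> \sigma(W^l) </math> in practice.<br />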
<br />
<br />
== Conclusions ==<br />
This paper proposes spectral normalization as a stabilizer for the training of GANs. When spectral normalization is applied to GANs on image generation tasks, the generated examples are more diverse than when using conventional weight normalization and achieve better or comparative inception scores relative to previous studies. The method imposes global regularization on the discriminator as opposed to local regularization introduced by WGAN-GP. In future work, the authors would like to investigate how their method compares analytically to other methods, while also further comparing it empirically by conducting experiments with their algorithm on larger and more complex datasets.<br />
<br />
== Open Source Code ==<br />
The open source code for this paper can be found at https://github.com/pfnet-research/sngan_projection.<br />
<br />
== References ==<br />
# Lemaréchal, Claude, and J. B. Hiriart-Urruty. "Convex analysis and minimization algorithms I." Grundlehren der mathematischen Wissenschaften 305 (1996).</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Spectral_normalization_for_generative_adversial_network&diff=36310stat946w18/Spectral normalization for generative adversial network2018-04-18T14:22:08Z<p>Ws2chen: /* Model */</p>
<hr />
<div>= Presented by =<br />
<br />
1. liu, wenqing<br />
<br />
= Introduction =<br />
Generative adversarial networks (GANs) (Goodfellow et al., 2014) have enjoyed considerable success as a framework for generative models in recent years. The concept is to train the model distribution and the discriminator alternately, with the goal of reducing the difference between the model distribution and the target distribution, as measured by the best possible discriminator at each step of training.<br />
<br />
A persisting challenge in the training of GANs is controlling the performance of the discriminator. When the support of the model distribution and the support of the target distribution are disjoint, there exists a discriminator that can perfectly distinguish the model distribution from the target (Arjovsky & Bottou, 2017). Once such a discriminator is produced, the training of the generator comes to a complete stop, because the derivative of the so-produced discriminator with respect to the input turns out to be 0. This motivates us to introduce some form of restriction on the choice of discriminator.<br />
<br />
In this paper, the authors propose a novel weight normalization method called ''spectral normalization'' that can stabilize the training of discriminator networks. The normalization enjoys the following favorable properties:<br />
<br />
* The only hyper-parameter that needs to be tuned is the Lipschitz constant, and the algorithm is not too sensitive to this constant's value<br />
* The additional computational cost needed to implement spectral normalization is small <br />
<br />
In this study, they provide explanations of the effectiveness of spectral normalization against other regularization or normalization techniques.<br />
<br />
= Model =<br />
<br />
Let us consider a simple discriminator made of a neural network of the following form, with the input x:<br />
<br />
\[f(x,\theta) = W^{L+1}a_L(W^L(a_{L-1}(W^{L-1}(\cdots a_1(W^1x)\cdots))))\]<br />
<br />
where <math> \theta:=\{W^1,\cdots,W^L, W^{L+1}\} </math> is the set of learning parameters, <math>W^l\in R^{d_l*d_{l-1}}, W^{L+1}\in R^{1*d_L} </math>, and <math>a_l </math> is an element-wise non-linear activation function. The final output of the discriminator is given by <math>D(x,\theta) = A(f(x,\theta)) </math>, where A is an activation function corresponding to the divergence or distance measure of the user's choice. The standard formulation of GANs is given by <math>\min_{G}\max_{D}V(G,D)</math>, where the min and max are taken over the sets of generator and discriminator functions, respectively. <br />
<br />
The conventional form of <math>V(G,D) </math> is given by:<br />
<br />
\[E_{x\sim q_{data}}[\log D(x)] + E_{x'\sim p_G}[\log(1-D(x'))]\]<br />
<br />
where <math>q_{data}</math> is the data distribution and <math>p_G(x)</math> is the model generator distribution to be learned through the adversarial min-max optimization. It is known that, for a fixed generator G, the optimal discriminator for this form of <math>V(G,D) </math> is given by <br />
<br />
\[ D_G^{*}(x):=q_{data}(x)/(q_{data}(x)+p_G(x)) \]<br />
<br />
The machine learning community has also pointed out recently that the function space from which the discriminators are selected crucially affects the performance of GANs. A number of works advocate the importance of Lipschitz continuity in assuring the boundedness of statistics. For example, the optimal discriminator above takes the form: <br />
<br />
\[ D_G^{*}(x):=q_{data}(x)/(q_{data}(x)+p_G(x)) = \text{sigmoid}(f^{*}(x)), \text{ where } f^{*}(x) = \log q_{data}(x) - \log p_{G}(x). \] <br />
<br />
The derivative of <math> f^{*} </math> can be unbounded or even incomputable, which motivates a regularity condition on the discriminator. We therefore search for the discriminator D from the set of K-Lipschitz continuous functions, that is, <br />
<br />
\[ \arg\max_{||f||_{Lip}\le K}V(G,D)\]<br />
<br />
where <math> ||f||_{Lip}</math> denotes the smallest value M such that <math> ||f(x)-f(x')||/||x-x'||\le M </math> for any x,x', with the norm being the <math> l_2 </math> norm.<br />
<br />
Our spectral normalization controls the Lipschitz constant of the discriminator function <math> f </math> by literally constraining the spectral norm of each layer <math> g: h_{in}\rightarrow h_{out}</math>. By definition, the Lipschitz norm <math> ||g||_{Lip} </math> is equal to <math> \sup_h\sigma(\nabla g(h)) </math>, where <math> \sigma(A) </math> is the spectral norm of the matrix A, which is equivalent to the largest singular value of A. Therefore, for a linear layer <math> g(h)=Wh </math>, the norm is given by <math> ||g||_{Lip}=\sigma(W) </math>. Using the inequality <math> ||g_1\circ g_2||_{Lip}\le ||g_1||_{Lip}\cdot||g_2||_{Lip} </math>, we observe the following bound:<br />
<br />
\[ ||f||_{Lip}\le ||(h_L\rightarrow W^{L+1}h_{L})||_{Lip}*||a_{L}||_{Lip}*||(h_{L-1}\rightarrow W^{L}h_{L-1})||_{Lip}\cdots ||a_1||_{Lip}*||(h_0\rightarrow W^1h_0)||_{Lip}=\prod_{l=1}^{L+1}\sigma(W^l) *\prod_{l=1}^{L} ||a_l||_{Lip} \]<br />
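The bound can be sanity-checked empirically. The sketch below assumes NumPy, ReLU activations (which are 1-Lipschitz, so the activation factors equal one), and arbitrary layer sizes; the function <code>mlp</code> is a hypothetical stand-in for <math> f </math>. It compares the product of the layers' spectral norms with the largest observed ratio <math> ||f(x)-f(x')||/||x-x'|| </math> over random input pairs.<br />

```python
import numpy as np

def mlp(x, weights):
    """f(x) = W^{L+1} a_L(... a_1(W^1 x)) with ReLU activations (1-Lipschitz)."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)
    return weights[-1] @ h

rng = np.random.default_rng(0)
dims = [10, 16, 16, 1]  # arbitrary illustrative layer sizes
weights = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]

# Upper bound on the Lipschitz norm: product of the layers' spectral norms.
bound = float(np.prod([np.linalg.norm(W, 2) for W in weights]))

# Empirical ratios over random pairs; none can exceed the bound.
ratios = []
for _ in range(1000):
    x, y = rng.standard_normal(10), rng.standard_normal(10)
    ratios.append(np.linalg.norm(mlp(x, weights) - mlp(y, weights)) / np.linalg.norm(x - y))
max_ratio = float(max(ratios))
```

The observed ratio is typically far below the bound, which is a reminder that the product of spectral norms is an upper bound and not the exact Lipschitz constant.<br />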
<br />
Our spectral normalization normalizes the spectral norm of the weight matrix W so that it satisfies the Lipschitz constraint <math> \sigma(W)=1 </math>:<br />
<br />
\[ \bar{W_{SN}}:= W/\sigma(W) \]<br />
<br />
In summary, as in weight normalization, we reparameterize the weight matrix <math> \bar{W_{SN}} </math> as <math> W/\sigma(W) </math>, which fixes its largest singular value at one. We can now calculate the gradient with respect to the new parameter W by the chain rule:<br />
<br />
\[ \frac{\partial V(G,D)}{\partial W} = \frac{\partial V(G,D)}{\partial \bar{W_{SN}}}*\frac{\partial \bar{W_{SN}}}{\partial W} \]<br />
<br />
\[ \frac{\partial \bar{W_{SN}}}{\partial W_{ij}} = \frac{1}{\sigma(W)}E_{ij}-\frac{1}{\sigma(W)^2}*\frac{\partial \sigma(W)}{\partial(W_{ij})}W=\frac{1}{\sigma(W)}E_{ij}-\frac{[u_1v_1^T]_{ij}}{\sigma(W)^2}W=\frac{1}{\sigma(W)}(E_{ij}-[u_1v_1^T]_{ij}\bar{W_{SN}}) \]<br />
<br />
where <math> E_{ij} </math> is the matrix whose (i,j)-th entry is 1 and zero everywhere else, and <math> u_1, v_1</math> are respectively the first left and right singular vectors of W.<br />
<br />
To understand the above computation in more detail, note that <br />
\begin{align}<br />
\sigma(W)= \sup_{||u||=1, ||v||=1} \langle Wv, u \rangle = \sup_{||u||=1, ||v||=1} \text{trace} ( (uv^T)^T W).<br />
\end{align}<br />
By Theorem 4.4.2 in Lemaréchal and Hiriart-Urruty (1996), the sub-differential of a convex function defined as the maximum of a set of differentiable convex functions over a compact index set is the convex hull of the gradients of the maximizing functions. Thus we have the sub-differential:<br />
<br />
\begin{align}<br />
\partial \sigma = \text{convex hull} \{ u v^T: u,v \text{ are left/right singular vectors associated with } \sigma(W) \}.<br />
\end{align}<br />
<br />
However, the authors assume that the maximum singular value of W has only one left and one right normalized singular vector. Thus <math> \sigma </math> is differentiable and <br />
\begin{align}<br />
\nabla_W \sigma(W) =u_1v_1^T,<br />
\end{align}<br />
which explains the above computation.<br />
<br />
= Spectral Normalization VS Other Regularization Techniques =<br />
<br />
The weight normalization introduced by Salimans & Kingma (2016) is a method that normalizes the <math> l_2 </math> norm of each row vector in the weight matrix. Mathematically, this is equivalent to requiring that the weight-normalized matrix <math> \bar{W_{WN}} </math> satisfy:<br />
<br />
<math> \sigma_1(\bar{W_{WN}})^2+\cdots+\sigma_T(\bar{W_{WN}})^2=d_0, \text{ where } T=\min(d_i,d_0) </math> and <math> \sigma_t(A) </math> is the t-th singular value of matrix A. <br />
<br />
Note that if <math> \bar{W_{WN}} </math> is the weight-normalized matrix of dimension <math> d_i*d_0 </math>, the norm <math> ||\bar{W_{WN}}h||_2 </math> for a fixed unit vector <math> h </math> is maximized at <math> ||\bar{W_{WN}}h||_2=\sqrt{d_0} \text{ when } \sigma_1(\bar{W_{WN}})=\sqrt{d_0} \text{ and } \sigma_t(\bar{W_{WN}})=0, t=2, \cdots, T </math>, which means that <math> \bar{W_{WN}} </math> is of rank one. To retain as much of the norm of the input as possible, and hence to make the discriminator more sensitive, one would hope to make the norm of <math> \bar{W_{WN}}h </math> large. For weight normalization, however, this comes at the cost of reducing the rank and hence the number of features available to the discriminator. Thus, there is a conflict of interest between weight normalization and our desire to use as many features as possible to distinguish the generator distribution from the target distribution. The former often prevails, inadvertently diminishing the number of features used by the discriminator. Consequently, the algorithm would produce a rather arbitrary model distribution that matches the target distribution only on a select few features. <br />
<br />
Brock et al. (2016) introduced orthonormal regularization on each weight to stabilize the training of GANs. In their work, Brock et al. (2016) augmented the adversarial objective function by adding the following term:<br />
<br />
<math> ||W^TW-I||^2_F </math><br />
<br />
While this seems to serve the same purpose as spectral normalization, orthonormal regularization is mathematically quite different. It requires the weight matrix to be orthogonal, which forces all singular values to be one and therefore destroys the information about the spectrum. Spectral normalization, on the other hand, only rescales the spectrum so that its maximum is one. <br />
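The difference between the two effects on the spectrum can be seen in a small numerical sketch. It assumes NumPy and a random 4*4 matrix; using the orthogonal factor of a QR decomposition as an example of a matrix with zero orthonormal penalty is an illustrative choice, not something taken from the paper.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
s = np.linalg.svd(W, compute_uv=False)  # singular values, largest first

# Spectral normalization: divide by the largest singular value.
# The whole spectrum is rescaled, so the ratios between singular values survive.
s_sn = np.linalg.svd(W / s[0], compute_uv=False)

# A matrix with zero orthonormal penalty ||Q^T Q - I||_F^2:
# the orthogonal factor Q of a QR decomposition of W.
Q, _ = np.linalg.qr(W)
penalty = float(np.linalg.norm(Q.T @ Q - np.eye(4), "fro") ** 2)
# All singular values of Q are one, so the spectrum of W is lost.
s_orth = np.linalg.svd(Q, compute_uv=False)
```

After spectral normalization the singular values are <code>s / s[0]</code>, whereas the orthogonal matrix has a flat spectrum regardless of the matrix it was derived from.<br />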
<br />
Gulrajani et al. (2017) used a gradient penalty method in combination with WGAN. They imposed a K-Lipschitz constraint on the discriminator by augmenting the objective function with a regularizer that rewards the function for having a local 1-Lipschitz constant (i.e., <math> ||\nabla_{\hat{x}} f ||_2 = 1 </math>) at discrete sets of points of the form <math> \hat{x}:=\epsilon \tilde{x} + (1-\epsilon)x </math>, generated by interpolating a sample <math> \tilde{x} </math> from the generative distribution and a sample <math> x </math> from the data distribution. This approach has the obvious weakness of depending heavily on the support of the current generative distribution. Moreover, WGAN-GP has a higher computational cost than our spectral normalization with single-step power iteration, because computing <math> ||\nabla_{\hat{x}} f ||_2 </math> requires one whole round of forward and backward propagation.<br />
<br />
= Experimental settings and results = <br />
== Objective function ==<br />
For all methods other than WGAN-GP, we use <br />
<math> V(G,D) := E_{x\sim q_{data}(x)}[\log D(x)] + E_{z\sim p(z)}[\log (1-D(G(z)))]</math><br />
to update D, and for the updates of G we use <math> -E_{z\sim p(z)}[\log(D(G(z)))] </math>. Alternatively, we test the performance of the algorithm with the so-called hinge loss, which is given by <br />
<math> V_D(\hat{G},D)= E_{x\sim q_{data}(x)}[\min(0,-1+D(x))] + E_{z\sim p(z)}[\min(0,-1-D(\hat{G}(z)))] </math>, <math> V_G(G,\hat{D})=-E_{z\sim p(z)}[\hat{D}(G(z))] </math><br />
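As a small illustration of the hinge objectives above, the following sketch evaluates <math> V_D </math> and <math> V_G </math> on hand-picked discriminator outputs, with the expectations replaced by sample means. It assumes NumPy; the function names <code>hinge_v_d</code> and <code>hinge_v_g</code> and the example values are hypothetical.<br />

```python
import numpy as np

def hinge_v_d(d_real, d_fake):
    """Sample estimate of V_D = E[min(0, -1 + D(x))] + E[min(0, -1 - D(G(z)))]."""
    real_term = np.mean(np.minimum(0.0, -1.0 + d_real))
    fake_term = np.mean(np.minimum(0.0, -1.0 - d_fake))
    return float(real_term + fake_term)

def hinge_v_g(d_fake):
    """Sample estimate of V_G = -E[D(G(z))]."""
    return float(-np.mean(d_fake))

# Hand-picked discriminator outputs (illustrative values only).
d_real = np.array([2.0, 0.5, -1.0])   # D(x) on real samples
d_fake = np.array([-3.0, 0.0, 1.0])   # D(G(z)) on generated samples

v_d = hinge_v_d(d_real, d_fake)  # -11/6
v_g = hinge_v_g(d_fake)          # 2/3
```

Real samples with <math> D(x)\ge 1 </math> and generated samples with <math> D(G(z))\le -1 </math> contribute zero to <math> V_D </math>, so only samples inside the margin affect the discriminator objective.<br />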
<br />
For WGAN-GP, we choose <br />
<math> V(G,D):=E_{x\sim q_{data}}[D(x)]-E_{z\sim p(z)}[D(G(z))]- \lambda E_{\hat{x}\sim p(\hat{x})}[(||\nabla_{\hat{x}}D(\hat{x})||_2-1)^2]</math><br />
<br />
== Optimization ==<br />
Adam optimizer: 6 settings in total, which differ in <br />
* <math> n_{dis} </math>, the number of updates of the discriminator per one update of the generator. <br />
* learning rate <math> \alpha </math><br />
* the first and second momentum parameters <math> \beta_1, \beta_2 </math> of Adam<br />
<br />
[[File:inception score.png]]<br />
<br />
[[File:FID score.png]]<br />
<br />
The figures above show the inception score and FID of each method under settings A-F, and the table shows the inception scores of the different methods with their optimal settings on the CIFAR-10 and STL-10 datasets.<br />
<br />
== Singular values analysis on the weights of the discriminator D ==<br />
[[File:singular value.png]]<br />
<br />
In the above figure, we show the squared singular values of the weight matrices in the final discriminator D produced by each method, using the parameters that yielded the best inception score. As predicted, the singular values of the first to fifth layers trained with weight clipping and weight normalization concentrate on a few components. In contrast, the singular values of the weight matrices in those layers trained with spectral normalization are more broadly distributed.<br />
<br />
== Training time ==<br />
On CIFAR-10, SN-GANs are slightly slower than weight normalization, taking 31 seconds for 100 generations compared to 29 seconds for weight normalization, as seen in Figure 10. However, SN-GANs are significantly faster than WGAN-GP, which takes 40 seconds for 100 generations. As mentioned in Section 3, WGAN-GP is slower than the other methods because it needs to calculate the gradient of the gradient norm. For STL-10, the computational time of SN-GANs is almost the same as that of vanilla GANs, at approximately 61 seconds for 100 generations.<br />
<br />
[[File:trainingTime.png|center]]<br />
<br />
== Comparison between SN-GANs and orthonormal regularization ==<br />
[[File:comparison.png]]<br />
As we explained in Section 3, orthonormal regularization differs from our method in that it destroys the spectral information and puts equal emphasis on all feature dimensions, including the ones that should be weeded out during training. To see the extent of its possibly detrimental effect, we experimented by increasing the dimension of the feature space, especially at the final layer, for which training with spectral normalization prefers a relatively small feature space. The figure above shows the result of our experiments. As predicted, the performance of orthonormal regularization deteriorates as we increase the dimension of the feature maps at the final layer. SN-GANs, on the other hand, do not falter with this modification of the architecture.<br />
<br />
We also applied our method to the training of class-conditional GANs on the ILSVRC2012 dataset with 1000 classes, each consisting of approximately 1300 images, which we compressed to 128*128 pixels. GANs without normalization and GANs with layer normalization collapsed at the beginning of training and failed to produce any meaningful images. The figure above shows that the inception score of orthonormal regularization plateaued around 20k iterations, while SN kept improving even afterward.<br />
<br />
[[File:sngan.jpg]]<br />
<br />
Samples generated by various networks trained on CIFAR10. Momentum and learning rates increase to the right. For high learning rates and momentum, the Wasserstein GAN does not generate good images, while weight normalization and spectral normalization generate good samples.<br />
<br />
= Algorithm of spectral normalization =<br />
To calculate the largest singular value of the matrix <math> W </math> needed for spectral normalization, we appeal to power iteration. The algorithm proceeds as follows:<br />
<br />
* Initialize <math>\tilde{u}_{l}\in R^{d_l} \text{ for } l=1,\cdots,L </math> with a random vector (sampled from an isotropic distribution) <br />
* For each update and each layer l:<br />
** Apply the power iteration method to the unnormalized weight <math> W^l </math>:<br />
<br />
\begin{align}<br />
\tilde{v_l}\leftarrow (W^l)^T\tilde{u_l}/||(W^l)^T\tilde{u_l}||_2<br />
\end{align}<br />
<br />
\begin{align}<br />
\tilde{u_l}\leftarrow W^l\tilde{v_l}/||W^l\tilde{v_l}||_2<br />
\end{align}<br />
<br />
* Calculate <math> \bar{W_{SN}} </math> with the spectral norm :<br />
<br />
\begin{align}<br />
\bar{W_{SN}}(W^l)=W^l/\sigma(W^l)<br />
\end{align}<br />
<br />
where<br />
<br />
\begin{align}<br />
\sigma(W^l)=\tilde{u_l}^TW^l\tilde{v_l}<br />
\end{align}<br />
<br />
* Update <math>W^l </math> with SGD on mini-batch dataset <math> D_M </math> with a learning rate <math> \alpha </math><br />
<br />
<br />
\begin{align}<br />
W^l\leftarrow W^l-\alpha\nabla_{W^l}l(\bar{W_{SN}^l}(W^l),D_M)<br />
\end{align}<br />
<br />
<br />
== Conclusions ==<br />
This paper proposes spectral normalization as a stabilizer for the training of GANs. When spectral normalization is applied to GANs on image generation tasks, the generated examples are more diverse than when using conventional weight normalization and achieve better or comparative inception scores relative to previous studies. The method imposes global regularization on the discriminator as opposed to local regularization introduced by WGAN-GP. In future work, the authors would like to investigate how their method compares analytically to other methods, while also further comparing it empirically by conducting experiments with their algorithm on larger and more complex datasets.<br />
<br />
== Open Source Code ==<br />
The open source code for this paper can be found at https://github.com/pfnet-research/sngan_projection.<br />
<br />
== References ==<br />
# Lemaréchal, Claude, and J. B. Hiriart-Urruty. "Convex analysis and minimization algorithms I." Grundlehren der mathematischen Wissenschaften 305 (1996).</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Spectral_normalization_for_generative_adversial_network&diff=36309stat946w18/Spectral normalization for generative adversial network2018-04-18T14:21:48Z<p>Ws2chen: /* Model */</p>
<hr />
<div>= Presented by =<br />
<br />
1. liu, wenqing<br />
<br />
= Introduction =<br />
Generative adversarial networks (GANs) (Goodfellow et al., 2014) have enjoyed considerable success as a framework for generative models in recent years. The concept is to train the model distribution and the discriminator alternately, with the goal of reducing the difference between the model distribution and the target distribution, as measured by the best possible discriminator at each step of training.<br />
<br />
A persisting challenge in the training of GANs is controlling the performance of the discriminator. When the support of the model distribution and the support of the target distribution are disjoint, there exists a discriminator that can perfectly distinguish the model distribution from the target (Arjovsky & Bottou, 2017). Once such a discriminator is produced, the training of the generator comes to a complete stop, because the derivative of the so-produced discriminator with respect to the input turns out to be 0. This motivates us to introduce some form of restriction on the choice of discriminator.<br />
<br />
In this paper, the authors propose a novel weight normalization method called ''spectral normalization'' that can stabilize the training of discriminator networks. The normalization enjoys the following favorable properties:<br />
<br />
* The only hyper-parameter that needs to be tuned is the Lipschitz constant, and the algorithm is not too sensitive to this constant's value<br />
* The additional computational cost needed to implement spectral normalization is small <br />
<br />
In this study, they provide explanations of the effectiveness of spectral normalization against other regularization or normalization techniques.<br />
<br />
= Model =<br />
<br />
Let us consider a simple discriminator made of a neural network of the following form, with the input x:<br />
<br />
\[f(x,\theta) = W^{L+1}a_L(W^L(a_{L-1}(W^{L-1}(\cdots a_1(W^1x)\cdots))))\]<br />
<br />
where <math> \theta:=\{W^1,\cdots,W^L, W^{L+1}\} </math> is the set of learning parameters, <math>W^l\in R^{d_l*d_{l-1}}, W^{L+1}\in R^{1*d_L} </math>, and <math>a_l </math> is an element-wise non-linear activation function. The final output of the discriminator is given by <math>D(x,\theta) = A(f(x,\theta)) </math>, where A is an activation function corresponding to the divergence or distance measure of the user's choice. The standard formulation of GANs is given by <math>\min_{G}\max_{D}V(G,D)</math>, where the min and max are taken over the sets of generator and discriminator functions, respectively. <br />
<br />
The conventional form of <math>V(G,D) </math> is given by:<br />
<br />
\[E_{x\sim q_{data}}[\log D(x)] + E_{x'\sim p_G}[\log(1-D(x'))]\]<br />
<br />
where <math>q_{data}</math> is the data distribution and <math>p_G(x)</math> is the model generator distribution to be learned through the adversarial min-max optimization. It is known that, for a fixed generator G, the optimal discriminator for this form of <math>V(G,D) </math> is given by <br />
<br />
\[ D_G^{*}(x):=q_{data}(x)/(q_{data}(x)+p_G(x)) \]<br />
<br />
The machine learning community has also pointed out recently that the function space from which the discriminators are selected crucially affects the performance of GANs. A number of works advocate the importance of Lipschitz continuity in assuring the boundedness of statistics. For example, the optimal discriminator above takes the form: <br />
<br />
\[ D_G^{*}(x):=q_{data}(x)/(q_{data}(x)+p_G(x)) = \text{sigmoid}(f^{*}(x)), \text{ where } f^{*}(x) = \log q_{data}(x) - \log p_{G}(x). \] <br />
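<br />
As a brief worked step (added here for clarity, and assuming <math> q_{data}(x)>0 </math> and <math> p_G(x)>0 </math>), the sigmoid form follows by dividing the numerator and denominator by <math> q_{data}(x) </math>:<br />
<br />
\[ \frac{q_{data}(x)}{q_{data}(x)+p_G(x)} = \frac{1}{1+p_G(x)/q_{data}(x)} = \frac{1}{1+\exp(-(\log q_{data}(x)-\log p_G(x)))} = \frac{1}{1+\exp(-f^{*}(x))} \]<br />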
<br />
The derivative of <math> f^{*} </math> can be unbounded or even incomputable, which motivates a regularity condition on the discriminator. We therefore search for the discriminator D from the set of K-Lipschitz continuous functions, that is, <br />
<br />
\[ \arg\max_{||f||_{Lip}\le K}V(G,D)\]<br />
<br />
where <math> ||f||_{Lip}</math> denotes the smallest value M such that <math> ||f(x)-f(x')||/||x-x'||\le M </math> for any x,x', with the norm being the <math> l_2 </math> norm.<br />
<br />
Our spectral normalization controls the Lipschitz constant of the discriminator function <math> f </math> by literally constraining the spectral norm of each layer <math> g: h_{in}\rightarrow h_{out}</math>. By definition, the Lipschitz norm <math> ||g||_{Lip} </math> is equal to <math> \sup_h\sigma(\nabla g(h)) </math>, where <math> \sigma(A) </math> is the spectral norm of the matrix A, which is equivalent to the largest singular value of A. Therefore, for a linear layer <math> g(h)=Wh </math>, the norm is given by <math> ||g||_{Lip}=\sigma(W) </math>. Using the inequality <math> ||g_1\circ g_2||_{Lip}\le ||g_1||_{Lip}\cdot||g_2||_{Lip} </math>, we observe the following bound:<br />
<br />
\[ ||f||_{Lip}\le ||(h_L\rightarrow W^{L+1}h_{L})||_{Lip}*||a_{L}||_{Lip}*||(h_{L-1}\rightarrow W^{L}h_{L-1})||_{Lip}\cdots ||a_1||_{Lip}*||(h_0\rightarrow W^1h_0)||_{Lip}=\prod_{l=1}^{L+1}\sigma(W^l) *\prod_{l=1}^{L} ||a_l||_{Lip} \]<br />
<br />
Our spectral normalization normalizes the spectral norm of the weight matrix W so that it satisfies the Lipschitz constraint <math> \sigma(W)=1 </math>:<br />
<br />
\[ \bar{W_{SN}}:= W/\sigma(W) \]<br />
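<br />
As a minimal sketch (not the authors' implementation), the normalization above can be reproduced in NumPy. The function name <code>spectral_normalize</code> and the random test matrix are illustrative assumptions, and the spectral norm is computed exactly here rather than by power iteration.<br />

```python
import numpy as np

def spectral_normalize(W):
    """Return W / sigma(W), where sigma(W) is the largest singular value of W."""
    sigma = np.linalg.norm(W, ord=2)  # for a 2-D array, ord=2 is the spectral norm
    return W / sigma

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))          # hypothetical layer weight
W_sn = spectral_normalize(W)

# Singular values are returned in descending order.
sv_original = np.linalg.svd(W, compute_uv=False)
sv_normalized = np.linalg.svd(W_sn, compute_uv=False)
```

After normalization the largest singular value is one, while the ratios between singular values are left unchanged.<br />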
<br />
In summary, just as weight normalization does, we reparameterize the weight matrix as <math> \bar{W_{SN}} = W/\sigma(W) </math> to fix its largest singular value at one. We can now calculate the gradient with respect to the underlying parameter W by the chain rule:<br />
<br />
\[ \frac{\partial V(G,D)}{\partial W} = \frac{\partial V(G,D)}{\partial \bar{W_{SN}}}*\frac{\partial \bar{W_{SN}}}{\partial W} \]<br />
<br />
\[ \frac{\partial \bar{W_{SN}}}{\partial W_{ij}} = \frac{1}{\sigma(W)}E_{ij}-\frac{1}{\sigma(W)^2}*\frac{\partial \sigma(W)}{\partial(W_{ij})}W=\frac{1}{\sigma(W)}E_{ij}-\frac{[u_1v_1^T]_{ij}}{\sigma(W)^2}W=\frac{1}{\sigma(W)}(E_{ij}-[u_1v_1^T]_{ij}\bar{W_{SN}}) \]<br />
<br />
where <math> E_{ij} </math> is the matrix whose (i,j)-th entry is 1 and zero everywhere else, and <math> u_1, v_1</math> are respectively the first left and right singular vectors of W.<br />
<br />
To understand the above computation in more detail, note that <br />
\begin{align}<br />
\sigma(W)= \sup_{||u||=1, ||v||=1} \langle Wv, u \rangle = \sup_{||u||=1, ||v||=1} \text{trace} ( (uv^T)^T W).<br />
\end{align}<br />
By Theorem 4.4.2 in Lemaréchal and Hiriart-Urruty (1996), the sub-differential of a convex function defined as the maximum of a set of differentiable convex functions over a compact index set is the convex hull of the gradients of the maximizing functions. Thus we have the sub-differential:<br />
<br />
\begin{align}<br />
\partial \sigma = \text{convex hull} \{ u v^T: u,v \text{ are left/right singular vectors associated with } \sigma(W) \}.<br />
\end{align}<br />
<br />
However, the authors assume that the maximum singular value of W has only one left and one right normalized singular vector. Thus <math> \sigma </math> is differentiable and <br />
\begin{align}<br />
\nabla_W \sigma(W) =u_1v_1^T,<br />
\end{align}<br />
which explains the above computation.<br />
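<br />
The gradient formula can be sanity-checked numerically. The following sketch, which assumes a random Gaussian matrix whose largest singular value is simple, compares <math> u_1v_1^T </math> with a central finite-difference approximation of <math> \partial\sigma(W)/\partial W_{ij} </math>; all variable names are illustrative.<br />

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 5))

# Analytic gradient: outer product of the first left and right singular vectors.
U, S, Vt = np.linalg.svd(W)
analytic_grad = np.outer(U[:, 0], Vt[0, :])

# Central finite differences, perturbing one entry W_ij at a time.
eps = 1e-6
numeric_grad = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        E_ij = np.zeros_like(W)
        E_ij[i, j] = 1.0
        plus = np.linalg.norm(W + eps * E_ij, ord=2)
        minus = np.linalg.norm(W - eps * E_ij, ord=2)
        numeric_grad[i, j] = (plus - minus) / (2 * eps)

max_abs_error = np.max(np.abs(analytic_grad - numeric_grad))
```

The product <math> u_1v_1^T </math> does not depend on the sign convention of the SVD routine, since flipping the sign of both vectors leaves it unchanged.<br />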
<br />
= Spectral Normalization VS Other Regularization Techniques =<br />
<br />
The weight normalization introduced by Salimans & Kingma (2016) is a method that normalizes the <math> l_2 </math> norm of each row vector in the weight matrix. Mathematically, this is equivalent to requiring that the weight-normalized matrix <math> \bar{W_{WN}} </math> satisfy:<br />
<br />
<math> \sigma_1(\bar{W_{WN}})^2+\cdots+\sigma_T(\bar{W_{WN}})^2=d_0, \text{ where } T=\min(d_i,d_0) </math> and <math> \sigma_t(A) </math> is the t-th singular value of matrix A. <br />
<br />
Note that if <math> \bar{W_{WN}} </math> is the weight-normalized matrix of dimension <math> d_i\times d_0 </math>, the norm <math> ||\bar{W_{WN}}h||_2 </math> for a fixed unit vector <math> h </math> is maximized at <math> ||\bar{W_{WN}}h||_2=\sqrt{d_0} \text{ when } \sigma_1(\bar{W_{WN}})=\sqrt{d_0} \text{ and } \sigma_t(\bar{W_{WN}})=0 \text{ for } t=2, \cdots, T </math>, which means that <math> \bar{W_{WN}} </math> is of rank one. In order to retain as much of the norm of the input as possible and hence make the discriminator more sensitive, one would hope to make the norm of <math> \bar{W_{WN}}h </math> large. For weight normalization, however, this comes at the cost of reducing the rank and hence the number of features available to the discriminator. Thus, there is a conflict of interest between weight normalization and our desire to use as many features as possible to distinguish the generator distribution from the target distribution. The former interest often prevails over the latter, inadvertently diminishing the number of features used by the discriminator. Consequently, the algorithm would produce a rather arbitrary model distribution that matches the target distribution on only a select few features. <br />
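<br />
A small numerical illustration of this trade-off, under the assumption that weight normalization gives every row unit <math> l_2 </math> norm: the squared singular values then sum to the number of rows (the quantity written <math> d_0 </math> in the identity above), and the output norm for a fixed unit vector <math> h </math> is largest when every row equals <math> h </math>, which forces rank one. The matrix sizes and names are hypothetical.<br />

```python
import numpy as np

rng = np.random.default_rng(2)
n_rows, n_cols = 4, 6

# Weight-normalized matrix: every row is rescaled to unit l2 norm.
W = rng.normal(size=(n_rows, n_cols))
W_wn = W / np.linalg.norm(W, axis=1, keepdims=True)
sum_sq = np.sum(np.linalg.svd(W_wn, compute_uv=False) ** 2)

# Extreme case: every row equals the same unit vector h (rows still have unit norm).
h = rng.normal(size=n_cols)
h /= np.linalg.norm(h)
W_rank1 = np.tile(h, (n_rows, 1))
s_rank1 = np.linalg.svd(W_rank1, compute_uv=False)
out_norm = np.linalg.norm(W_rank1 @ h)
```

In the extreme case the whole spectrum collapses onto a single singular value, which is the loss of features the paragraph above describes.<br />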
<br />
Brock et al. (2016) introduced orthonormal regularization on each weight to stabilize the training of GANs. In their work, Brock et al. (2016) augmented the adversarial objective function by adding the following term:<br />
<br />
<math> ||W^TW-I||^2_F </math><br />
<br />
While this seems to serve the same purpose as spectral normalization, orthonormal regularization is mathematically quite different from our spectral normalization: it requires the weight matrix to be orthogonal, which coerces all singular values to be one and therefore destroys the information about the spectrum. Spectral normalization, on the other hand, only scales the spectrum so that its maximum is one. <br />
<br />
Gulrajani et al. (2017) used the gradient penalty method in combination with WGAN. In their work, they imposed a K-Lipschitz constraint on the discriminator by augmenting the objective function with a regularizer that rewards the function for having a local 1-Lipschitz constant (i.e., <math> ||\nabla_{\hat{x}} f ||_2 = 1 </math>) at discrete sets of points of the form <math> \hat{x}:=\epsilon \tilde{x} + (1-\epsilon)x </math>, generated by interpolating a sample <math> \tilde{x} </math> from the generative distribution and a sample <math> x </math> from the data distribution. This approach has the obvious weakness of being heavily dependent on the support of the current generative distribution. Moreover, WGAN-GP has a higher computational cost than our spectral normalization with single-step power iteration, because the computation of <math> ||\nabla_{\hat{x}} f ||_2 </math> requires one whole round of forward and backward propagation.<br />
<br />
= Experimental settings and results = <br />
== Objective function ==<br />
For all methods other than WGAN-GP, we use <br />
<math> V(G,D) := E_{x\sim q_{data}(x)}[\log D(x)] + E_{z\sim p(z)}[\log (1-D(G(z)))]</math><br />
to update D; for the updates of G, we use <math> -E_{z\sim p(z)}[\log(D(G(z)))] </math>. Alternatively, we test the performance of the algorithm with the so-called hinge loss, which is given by <br />
<math> V_D(\hat{G},D)= E_{x\sim q_{data}(x)}[\min(0,-1+D(x))] + E_{z\sim p(z)}[\min(0,-1-D(\hat{G}(z)))] </math>, <math> V_G(G,\hat{D})=-E_{z\sim p(z)}[\hat{D}(G(z))] </math><br />
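<br />
The hinge objectives can be written as a short sketch, with the expectations replaced by batch means. The function names and the example discriminator outputs below are hypothetical.<br />

```python
import numpy as np

def hinge_v_d(d_real, d_fake):
    """V_D from the formula above; it is never positive."""
    real_term = np.mean(np.minimum(0.0, -1.0 + d_real))
    fake_term = np.mean(np.minimum(0.0, -1.0 - d_fake))
    return real_term + fake_term

def hinge_v_g(d_fake):
    """V_G from the formula above: minus the mean discriminator output on generated samples."""
    return -np.mean(d_fake)

# Hypothetical raw discriminator outputs on a batch of three real and three generated samples.
d_real = np.array([2.0, 0.5, -1.0])
d_fake = np.array([-3.0, 0.0, 1.0])

v_d = hinge_v_d(d_real, d_fake)
v_g = hinge_v_g(d_fake)
```

Real samples with <math> D(x)\ge 1 </math> and generated samples with <math> D(G(z))\le -1 </math> contribute nothing to <math> V_D </math>, so only samples inside the margin affect it.<br />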
<br />
For WGAN-GP, we choose <br />
<math> V(G,D):=E_{x\sim q_{data}}[D(x)]-E_{z\sim p(z)}[D(G(z))]- \lambda E_{\hat{x}\sim p(\hat{x})}[(||\nabla_{\hat{x}}D(\hat{x})||_2-1)^2]</math><br />
<br />
== Optimization ==<br />
We use the Adam optimizer and test 6 settings in total (A-F), which differ in: <br />
* <math> n_{dis} </math>, the number of updates of the discriminator per one update of the generator <br />
* learning rate <math> \alpha </math><br />
* the first and second momentum parameters <math> \beta_1, \beta_2 </math> of Adam<br />
<br />
[[File:inception score.png]]<br />
<br />
[[File:FID score.png]]<br />
<br />
The images above show the inception score and FID of the different methods under settings A-F, and the table shows the inception scores of the different methods with their optimal settings on the CIFAR-10 and STL-10 datasets.<br />
<br />
== Singular values analysis on the weights of the discriminator D ==<br />
[[File:singular value.png]]<br />
<br />
In the figure above, we show the squared singular values of the weight matrices in the final discriminator D produced by each method using the parameter that yielded the best inception score. As we predicted before, the singular values of the first to fifth layers trained with weight clipping and weight normalization concentrate on a few components. On the other hand, the singular values of the weight matrices in those layers trained with spectral normalization are more broadly distributed.<br />
<br />
== Training time ==<br />
On CIFAR-10, SN-GANs are slightly slower than weight normalization, at 31 seconds per 100 generations compared to 29 seconds per 100 generations, as seen in figure 10. However, SN-GANs are significantly faster than WGAN-GP, which takes 40 seconds per 100 generations. As we mentioned in section 3, WGAN-GP is slower than the other methods because it needs to calculate the gradient of the gradient norm. For STL-10, the computational time of SN-GANs is almost the same as that of vanilla GANs, at approximately 61 seconds per 100 generations.<br />
<br />
[[File:trainingTime.png|center]]<br />
<br />
== Comparison between SN-GANs and orthonormal regularization ==<br />
[[File:comparison.png]]
As we explained in Section 3, orthonormal regularization is different from our method in that it destroys the spectral information and puts equal emphasis on all feature dimensions, including the ones that should be weeded out in the training process. To see the extent of its possibly detrimental effect, we experimented by increasing the dimension of the feature space, especially at the final layer, for which training with our spectral normalization prefers a relatively small feature space. The figure above shows the result of our experiments. As we predicted, the performance of orthonormal regularization deteriorates as we increase the dimension of the feature maps at the final layer. SN-GANs, on the other hand, do not falter with this modification of the architecture.<br />
<br />
We also applied our method to the training of class-conditional GANs on the ILSVRC2012 dataset with 1000 classes, each consisting of approximately 1300 images, which we compressed to 128×128 pixels. GAN without normalization and GAN with layer normalization collapsed at the beginning of training and failed to produce any meaningful images. The figure above shows that the inception score of orthonormal regularization plateaued around 20k iterations, while SN kept improving even afterward.<br />
<br />
[[File:sngan.jpg]]<br />
<br />
Samples generated by various networks trained on CIFAR10. Momentum and training rates are increasing to the right. We can see that for high learning rates and momentum the Wasserstein-GAN does not generate good images, while weight and spectral normalization generate good samples.<br />
<br />
= Algorithm of spectral normalization =<br />
To calculate the largest singular value of the matrix <math> W </math> needed for spectral normalization, we appeal to power iteration. The algorithm is executed as follows:<br />
<br />
* Initialize <math>\tilde{u}_{l}\in R^{d_l} \text{ for } l=1,\cdots,L </math> with a random vector (sampled from an isotropic distribution) <br />
* For each update and each layer l:<br />
** Apply the power iteration method to the unnormalized weight <math> W^l </math>:<br />
<br />
\begin{align}<br />
\tilde{v_l}\leftarrow (W^l)^T\tilde{u_l}/||(W^l)^T\tilde{u_l}||_2<br />
\end{align}<br />
<br />
\begin{align}<br />
\tilde{u_l}\leftarrow W^l\tilde{v_l}/||W^l\tilde{v_l}||_2<br />
\end{align}<br />
<br />
** Calculate <math> \bar{W_{SN}} </math> with the spectral norm:<br />
<br />
\begin{align}<br />
\bar{W_{SN}}(W^l)=W^l/\sigma(W^l)<br />
\end{align}<br />
<br />
where<br />
<br />
\begin{align}<br />
\sigma(W^l)=\tilde{u_l}^TW^l\tilde{v_l}<br />
\end{align}<br />
<br />
** Update <math>W^l </math> with SGD on the mini-batch dataset <math> D_M </math> with a learning rate <math> \alpha </math>:<br />
<br />
<br />
\begin{align}<br />
W^l\leftarrow W^l-\alpha\nabla_{W^l}\ell(\bar{W_{SN}^l}(W^l),D_M)<br />
\end{align}<br />
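<br />
The power-iteration part of the algorithm can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' code: the test matrix is built with a known spectrum so the estimate can be checked, the step is repeated many times instead of once per training update, and the SGD update is omitted. The <math> \tilde{u} </math> update uses <math> W\tilde{v} </math> so that the dimensions match. All names are hypothetical.<br />

```python
import numpy as np

def power_iteration_step(W, u):
    """One power-iteration step: returns updated u, v and the estimate of sigma(W)."""
    v = W.T @ u
    v = v / np.linalg.norm(v)
    u = W @ v
    u = u / np.linalg.norm(u)
    sigma = u @ W @ v
    return u, v, sigma

rng = np.random.default_rng(4)

# Build an 8x5 matrix whose singular values are known: 3, 2, 1, 0.5, 0.1.
Q1, _ = np.linalg.qr(rng.normal(size=(8, 5)))
Q2, _ = np.linalg.qr(rng.normal(size=(5, 5)))
W = Q1 @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1]) @ Q2.T

u = rng.normal(size=W.shape[0])  # random initial vector for this layer
for _ in range(50):
    u, v, sigma = power_iteration_step(W, u)

W_sn = W / sigma                 # spectrally normalized weight
```

In training, a single step per update is used and <math> \tilde{u} </math> is carried over between updates, which works because the weights change only slightly from one update to the next.<br />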
<br />
<br />
== Conclusions ==<br />
This paper proposes spectral normalization as a stabilizer for the training of GANs. When spectral normalization is applied to GANs on image generation tasks, the generated examples are more diverse than when using conventional weight normalization and achieve better or comparable inception scores relative to previous studies. The method imposes global regularization on the discriminator, as opposed to the local regularization introduced by WGAN-GP. In future work, the authors would like to investigate how their method compares analytically to other methods, while also comparing it further empirically by conducting experiments with their algorithm on larger and more complex datasets.<br />
<br />
== Open Source Code ==<br />
The open source code for this paper can be found at https://github.com/pfnet-research/sngan_projection.<br />
<br />
== References ==<br />
# Lemaréchal, Claude, and J. B. Hiriart-Urruty. "Convex analysis and minimization algorithms I." Grundlehren der mathematischen Wissenschaften 305 (1996).</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Spectral_normalization_for_generative_adversial_network&diff=36308stat946w18/Spectral normalization for generative adversial network2018-04-18T14:21:19Z<p>Ws2chen: /* Model */</p>
<hr />
<div>= Presented by =<br />
<br />
1. liu, wenqing<br />
<br />
= Introduction =<br />
Generative adversarial networks (GANs) (Goodfellow et al., 2014) have been enjoying considerable success as a framework of generative models in recent years. The concept is to alternately train the model distribution and the discriminator, with the goal of reducing the difference between the model distribution and the target distribution as measured by the best possible discriminator at each step of training.<br />
<br />
A persisting challenge in the training of GANs is the performance control of the discriminator. When the support of the model distribution and the support of the target distribution are disjoint, there exists a discriminator that can perfectly distinguish the model distribution from the target (Arjovsky & Bottou, 2017). Once such a discriminator is produced, the training of the generator comes to a complete stop, because the derivative of the so-produced discriminator with respect to the input turns out to be 0. This motivates us to introduce some form of restriction on the choice of discriminator.<br />
<br />
In this paper, the authors propose a novel weight normalization method called ''spectral normalization'' that can stabilize the training of discriminator networks. The normalization enjoys the following favorable properties:<br />
<br />
* The only hyper-parameter that needs to be tuned is the Lipschitz constant, and the algorithm is not too sensitive to this constant's value<br />
* The additional computational cost needed to implement spectral normalization is small <br />
<br />
In this study, they provide explanations of the effectiveness of spectral normalization against other regularization or normalization techniques.<br />
<br />
= Model =<br />
<br />
Let us consider a simple discriminator made of a neural network of the following form, with the input x:<br />
<br />
\[f(x,\theta) = W^{L+1}a_L(W^L(a_{L-1}(W^{L-1}(\cdots a_1(W^1x)\cdots))))\]<br />
<br />
where <math> \theta:=\{W^1,\cdots,W^L, W^{L+1}\} </math> is the set of learnable parameters, <math>W^l\in R^{d_l\times d_{l-1}}, W^{L+1}\in R^{1\times d_L} </math>, and <math>a_l </math> is an element-wise non-linear activation function. The final output of the discriminator is given by <math>D(x,\theta) = A(f(x,\theta)) </math>, where <math>A</math> is an activation function chosen to match the divergence or distance measure in use. The standard formulation of GANs is given by <math>\min_{G}\max_{D}V(G,D)</math>, where the min and max are taken over the sets of generator and discriminator functions, respectively. <br />
<br />
The conventional form of <math>V(G,D) </math> is given by:<br />
<br />
\[E_{x\sim q_{data}}[\log D(x)] + E_{x'\sim p_G}[\log(1-D(x'))]\]<br />
<br />
where <math>q_{data}</math> is the data distribution and <math>p_G(x)</math> is the model generator distribution to be learned through the adversarial min-max optimization. It is known that, for a fixed generator G, the optimal discriminator for this form of <math>V(G,D) </math> is given by <br />
<br />
\[ D_G^{*}(x):=q_{data}(x)/(q_{data}(x)+p_G(x)) \]<br />
<br />
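The machine learning community has also pointed out recently that the function space from which the discriminators are selected crucially affects the performance of GANs. A number of works advocate the importance of Lipschitz continuity in assuring the boundedness of statistics. For example, the optimal discriminator of the standard formulation above takes the form <br />
<br />
\[ D_G^{*}(x):=q_{data}(x)/(q_{data}(x)+p_G(x)) = \text{sigmoid}(f^{*}(x)), \text{ where } f^{*}(x) = \log q_{data}(x) - \log p_{G}(x), \] <br />
<br />
and the derivative <math> \nabla_x f^{*}(x) </math> can be unbounded, which motivates imposing a regularity condition on <math> f </math>. <br />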
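<br />
The equality between the ratio form and the sigmoid form of the optimal discriminator can be checked with a few lines of plain Python. The density values below are arbitrary positive numbers chosen for illustration.<br />

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def optimal_discriminator(q, p):
    """Ratio form: q_data(x) / (q_data(x) + p_G(x))."""
    return q / (q + p)

# Hypothetical (q_data(x), p_G(x)) pairs at three points x.
pairs = [(0.3, 0.1), (0.05, 0.5), (0.2, 0.2)]
gaps = [
    abs(optimal_discriminator(q, p) - sigmoid(math.log(q) - math.log(p)))
    for q, p in pairs
]
```

Where the two densities agree, <math> f^{*}(x)=0 </math> and the optimal discriminator outputs one half.<br />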
<br />
We search for the discriminator D from the set of K-Lipschitz continuous functions, that is, <br />
<br />
\[ \arg\max_{||f||_{Lip}\le K}V(G,D)\]<br />
<br />
where <math> ||f||_{Lip}</math> denotes the smallest value M such that <math> ||f(x)-f(x')||/||x-x'||\le M </math> for any x, x', with the norm being the <math> l_2 </math> norm.<br />
<br />
Our spectral normalization controls the Lipschitz constant of the discriminator function <math> f </math> by literally constraining the spectral norm of each layer <math> g: h_{in}\rightarrow h_{out}</math>. By definition, the Lipschitz norm <math> ||g||_{Lip} </math> is equal to <math> \sup_h\sigma(\nabla g(h)) </math>, where <math> \sigma(A) </math> is the spectral norm of the matrix A, which is equivalent to the largest singular value of A. Therefore, for a linear layer <math> g(h)=Wh </math>, the norm is given by <math> ||g||_{Lip}=\sigma(W) </math>. Using the inequality <math> ||g_1\circ g_2||_{Lip}\le ||g_1||_{Lip}\cdot||g_2||_{Lip} </math>, we obtain the following bound on <math> ||f||_{Lip} </math>:<br />
<br />
\[ ||f||_{Lip}\le ||(h_L\rightarrow W^{L+1}h_{L})||_{Lip}*||a_{L}||_{Lip}*||(h_{L-1}\rightarrow W^{L}h_{L-1})||_{Lip}\cdots ||a_1||_{Lip}*||(h_0\rightarrow W^1h_0)||_{Lip}=\prod_{l=1}^{L+1}\sigma(W^l) *\prod_{l=1}^{L} ||a_l||_{Lip} \]<br />
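<br />
The bound can be illustrated empirically on a small network. The sketch below assumes ReLU activations, for which <math> ||a_l||_{Lip}=1 </math>, so the bound reduces to the product of the spectral norms; the layer sizes and random weights are hypothetical.<br />

```python
import numpy as np

rng = np.random.default_rng(5)
# Two hidden layers followed by a linear output layer W^{L+1}.
weights = [rng.normal(size=(6, 4)), rng.normal(size=(6, 6)), rng.normal(size=(1, 6))]

def f(x):
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)  # ReLU activation, Lipschitz norm 1
    return weights[-1] @ h

# Upper bound on the Lipschitz norm: product of the layers' spectral norms.
bound = np.prod([np.linalg.norm(W, ord=2) for W in weights])

# Empirical difference quotients on random input pairs.
ratios = []
for _ in range(1000):
    x, y = rng.normal(size=4), rng.normal(size=4)
    ratios.append(np.linalg.norm(f(x) - f(y)) / np.linalg.norm(x - y))
max_ratio = max(ratios)
```

The observed ratios never exceed the bound, although the bound is usually loose because the layers' leading singular directions do not line up.<br />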
<br />
Our spectral normalization normalizes the spectral norm of the weight matrix W so that it satisfies the Lipschitz constraint <math> \sigma(W)=1 </math>:<br />
<br />
\[ \bar{W_{SN}}:= W/\sigma(W) \]<br />
<br />
In summary, just as weight normalization does, we reparameterize the weight matrix as <math> \bar{W_{SN}} = W/\sigma(W) </math> to fix its largest singular value at one. We can now calculate the gradient with respect to the underlying parameter W by the chain rule:<br />
<br />
\[ \frac{\partial V(G,D)}{\partial W} = \frac{\partial V(G,D)}{\partial \bar{W_{SN}}}*\frac{\partial \bar{W_{SN}}}{\partial W} \]<br />
<br />
\[ \frac{\partial \bar{W_{SN}}}{\partial W_{ij}} = \frac{1}{\sigma(W)}E_{ij}-\frac{1}{\sigma(W)^2}*\frac{\partial \sigma(W)}{\partial(W_{ij})}W=\frac{1}{\sigma(W)}E_{ij}-\frac{[u_1v_1^T]_{ij}}{\sigma(W)^2}W=\frac{1}{\sigma(W)}(E_{ij}-[u_1v_1^T]_{ij}\bar{W_{SN}}) \]<br />
<br />
where <math> E_{ij} </math> is the matrix whose (i,j)-th entry is 1 and zero everywhere else, and <math> u_1, v_1</math> are respectively the first left and right singular vectors of W.<br />
<br />
To understand the above computation in more detail, note that <br />
\begin{align}<br />
\sigma(W)= \sup_{||u||=1, ||v||=1} \langle Wv, u \rangle = \sup_{||u||=1, ||v||=1} \text{trace} ( (uv^T)^T W).<br />
\end{align}<br />
By Theorem 4.4.2 in Lemaréchal and Hiriart-Urruty (1996), the sub-differential of a convex function defined as the maximum of a set of differentiable convex functions over a compact index set is the convex hull of the gradients of the maximizing functions. Thus we have the sub-differential:<br />
<br />
\begin{align}<br />
\partial \sigma = \text{convex hull} \{ u v^T: u,v \text{ are left/right singular vectors associated with } \sigma(W) \}.<br />
\end{align}<br />
<br />
However, the authors assume that the maximum singular value of W has only one left and one right normalized singular vector. Thus <math> \sigma </math> is differentiable and <br />
\begin{align}<br />
\nabla_W \sigma(W) =u_1v_1^T,<br />
\end{align}<br />
which explains the above computation.<br />
<br />
= Spectral Normalization VS Other Regularization Techniques =<br />
<br />
The weight normalization introduced by Salimans & Kingma (2016) is a method that normalizes the <math> l_2 </math> norm of each row vector in the weight matrix. Mathematically, this is equivalent to requiring that the weight-normalized matrix <math> \bar{W_{WN}} </math> satisfy:<br />
<br />
<math> \sigma_1(\bar{W_{WN}})^2+\cdots+\sigma_T(\bar{W_{WN}})^2=d_0, \text{ where } T=\min(d_i,d_0) </math> and <math> \sigma_t(A) </math> is the t-th singular value of matrix A. <br />
<br />
Note that if <math> \bar{W_{WN}} </math> is the weight-normalized matrix of dimension <math> d_i\times d_0 </math>, the norm <math> ||\bar{W_{WN}}h||_2 </math> for a fixed unit vector <math> h </math> is maximized at <math> ||\bar{W_{WN}}h||_2=\sqrt{d_0} \text{ when } \sigma_1(\bar{W_{WN}})=\sqrt{d_0} \text{ and } \sigma_t(\bar{W_{WN}})=0 \text{ for } t=2, \cdots, T </math>, which means that <math> \bar{W_{WN}} </math> is of rank one. In order to retain as much of the norm of the input as possible and hence make the discriminator more sensitive, one would hope to make the norm of <math> \bar{W_{WN}}h </math> large. For weight normalization, however, this comes at the cost of reducing the rank and hence the number of features available to the discriminator. Thus, there is a conflict of interest between weight normalization and our desire to use as many features as possible to distinguish the generator distribution from the target distribution. The former interest often prevails over the latter, inadvertently diminishing the number of features used by the discriminator. Consequently, the algorithm would produce a rather arbitrary model distribution that matches the target distribution on only a select few features. <br />
<br />
Brock et al. (2016) introduced orthonormal regularization on each weight to stabilize the training of GANs. In their work, Brock et al. (2016) augmented the adversarial objective function by adding the following term:<br />
<br />
<math> ||W^TW-I||^2_F </math><br />
<br />
While this seems to serve the same purpose as spectral normalization, orthonormal regularization is mathematically quite different from our spectral normalization: it requires the weight matrix to be orthogonal, which coerces all singular values to be one and therefore destroys the information about the spectrum. Spectral normalization, on the other hand, only scales the spectrum so that its maximum is one. <br />
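<br />
The contrast can be made concrete with a short NumPy sketch. It assumes a random square weight matrix and uses a QR factor as an example of a matrix with zero orthonormal penalty; the function name <code>orthonormal_penalty</code> is illustrative.<br />

```python
import numpy as np

def orthonormal_penalty(W):
    """Squared Frobenius norm of W^T W - I."""
    d = W.shape[1]
    return np.linalg.norm(W.T @ W - np.eye(d), ord="fro") ** 2

rng = np.random.default_rng(3)
W = rng.normal(size=(5, 5))

# A minimizer of the penalty: an orthogonal matrix, whose singular values are all one.
Q, _ = np.linalg.qr(W)
penalty_Q = orthonormal_penalty(Q)
sv_Q = np.linalg.svd(Q, compute_uv=False)

# Spectral normalization only rescales W, so the shape of the spectrum survives.
sv_W = np.linalg.svd(W, compute_uv=False)
sv_sn = np.linalg.svd(W / np.linalg.norm(W, ord=2), compute_uv=False)
```

The orthogonal matrix has a flat spectrum, whereas the spectrally normalized matrix keeps the relative sizes of its singular values.<br />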
<br />
Gulrajani et al. (2017) used the gradient penalty method in combination with WGAN. In their work, they imposed a K-Lipschitz constraint on the discriminator by augmenting the objective function with a regularizer that rewards the function for having a local 1-Lipschitz constant (i.e., <math> ||\nabla_{\hat{x}} f ||_2 = 1 </math>) at discrete sets of points of the form <math> \hat{x}:=\epsilon \tilde{x} + (1-\epsilon)x </math>, generated by interpolating a sample <math> \tilde{x} </math> from the generative distribution and a sample <math> x </math> from the data distribution. This approach has the obvious weakness of being heavily dependent on the support of the current generative distribution. Moreover, WGAN-GP has a higher computational cost than our spectral normalization with single-step power iteration, because the computation of <math> ||\nabla_{\hat{x}} f ||_2 </math> requires one whole round of forward and backward propagation.<br />
<br />
= Experimental settings and results = <br />
== Objective function ==<br />
For all methods other than WGAN-GP, we use <br />
<math> V(G,D) := E_{x\sim q_{data}(x)}[\log D(x)] + E_{z\sim p(z)}[\log (1-D(G(z)))]</math><br />
to update D; for the updates of G, we use <math> -E_{z\sim p(z)}[\log(D(G(z)))] </math>. Alternatively, we test the performance of the algorithm with the so-called hinge loss, which is given by <br />
<math> V_D(\hat{G},D)= E_{x\sim q_{data}(x)}[\min(0,-1+D(x))] + E_{z\sim p(z)}[\min(0,-1-D(\hat{G}(z)))] </math>, <math> V_G(G,\hat{D})=-E_{z\sim p(z)}[\hat{D}(G(z))] </math><br />
<br />
For WGAN-GP, we choose <br />
<math> V(G,D):=E_{x\sim q_{data}}[D(x)]-E_{z\sim p(z)}[D(G(z))]- \lambda E_{\hat{x}\sim p(\hat{x})}[(||\nabla_{\hat{x}}D(\hat{x})||_2-1)^2]</math><br />
<br />
== Optimization ==<br />
We use the Adam optimizer and test 6 settings in total (A-F), which differ in: <br />
* <math> n_{dis} </math>, the number of updates of the discriminator per one update of the generator <br />
* learning rate <math> \alpha </math><br />
* the first and second momentum parameters <math> \beta_1, \beta_2 </math> of Adam<br />
<br />
[[File:inception score.png]]<br />
<br />
[[File:FID score.png]]<br />
<br />
The images above show the inception score and FID of the different methods under settings A-F, and the table shows the inception scores of the different methods with their optimal settings on the CIFAR-10 and STL-10 datasets.<br />
<br />
== Singular values analysis on the weights of the discriminator D ==<br />
[[File:singular value.png]]<br />
<br />
In the figure above, we show the squared singular values of the weight matrices in the final discriminator D produced by each method using the parameter that yielded the best inception score. As we predicted before, the singular values of the first to fifth layers trained with weight clipping and weight normalization concentrate on a few components. On the other hand, the singular values of the weight matrices in those layers trained with spectral normalization are more broadly distributed.<br />
<br />
== Training time ==<br />
On CIFAR-10, SN-GANs are slightly slower than weight normalization, at 31 seconds per 100 generations compared to 29 seconds per 100 generations, as seen in figure 10. However, SN-GANs are significantly faster than WGAN-GP, which takes 40 seconds per 100 generations. As we mentioned in section 3, WGAN-GP is slower than the other methods because it needs to calculate the gradient of the gradient norm. For STL-10, the computational time of SN-GANs is almost the same as that of vanilla GANs, at approximately 61 seconds per 100 generations.<br />
<br />
[[File:trainingTime.png|center]]<br />
<br />
== Comparison between SN-GANs and orthonormal regularization ==<br />
[[File:comparison.png]]
As we explained in Section 3, orthonormal regularization is different from our method in that it destroys the spectral information and puts equal emphasis on all feature dimensions, including the ones that should be weeded out in the training process. To see the extent of its possibly detrimental effect, we experimented by increasing the dimension of the feature space, especially at the final layer, for which training with our spectral normalization prefers a relatively small feature space. The figure above shows the result of our experiments. As we predicted, the performance of orthonormal regularization deteriorates as we increase the dimension of the feature maps at the final layer. SN-GANs, on the other hand, do not falter with this modification of the architecture.<br />
<br />
We also applied our method to the training of class-conditional GANs on the ILSVRC2012 dataset with 1000 classes, each consisting of approximately 1300 images, which we compressed to 128×128 pixels. GAN without normalization and GAN with layer normalization collapsed at the beginning of training and failed to produce any meaningful images. The figure above shows that the inception score of orthonormal regularization plateaued around 20k iterations, while SN kept improving even afterward.<br />
<br />
[[File:sngan.jpg]]<br />
<br />
Samples generated by various networks trained on CIFAR10. Momentum and training rates are increasing to the right. We can see that for high learning rates and momentum the Wasserstein-GAN does not generate good images, while weight and spectral normalization generate good samples.<br />
<br />
= Algorithm of spectral normalization =<br />
To calculate the largest singular value of the matrix <math> W </math> needed for spectral normalization, we appeal to power iteration. The algorithm is executed as follows:<br />
<br />
* Initialize <math>\tilde{u}_{l}\in R^{d_l} \text{ for } l=1,\cdots,L </math> with a random vector (sampled from an isotropic distribution) <br />
* For each update and each layer l:<br />
** Apply the power iteration method to the unnormalized weight <math> W^l </math>:<br />
<br />
\begin{align}<br />
\tilde{v_l}\leftarrow (W^l)^T\tilde{u_l}/||(W^l)^T\tilde{u_l}||_2<br />
\end{align}<br />
<br />
\begin{align}<br />
\tilde{u_l}\leftarrow W^l\tilde{v_l}/||W^l\tilde{v_l}||_2<br />
\end{align}<br />
<br />
** Calculate <math> \bar{W_{SN}} </math> with the spectral norm:<br />
<br />
\begin{align}<br />
\bar{W_{SN}}(W^l)=W^l/\sigma(W^l)<br />
\end{align}<br />
<br />
where<br />
<br />
\begin{align}<br />
\sigma(W^l)=\tilde{u_l}^TW^l\tilde{v_l}<br />
\end{align}<br />
<br />
** Update <math>W^l </math> with SGD on the mini-batch dataset <math> D_M </math> with a learning rate <math> \alpha </math>:<br />
<br />
<br />
\begin{align}<br />
W^l\leftarrow W^l-\alpha\nabla_{W^l}\ell(\bar{W_{SN}^l}(W^l),D_M)<br />
\end{align}<br />
<br />
<br />
== Conclusions ==<br />
This paper proposes spectral normalization as a stabilizer for the training of GANs. When spectral normalization is applied to GANs on image generation tasks, the generated examples are more diverse than when using conventional weight normalization and achieve better or comparable inception scores relative to previous studies. The method imposes global regularization on the discriminator, as opposed to the local regularization introduced by WGAN-GP. In future work, the authors would like to investigate how their method compares analytically to other methods, while also comparing it further empirically by conducting experiments with their algorithm on larger and more complex datasets.<br />
<br />
== Open Source Code ==<br />
The open source code for this paper can be found at https://github.com/pfnet-research/sngan_projection.<br />
<br />
== References ==<br />
# Lemaréchal, Claude, and J. B. Hiriart-Urruty. "Convex analysis and minimization algorithms I." Grundlehren der mathematischen Wissenschaften 305 (1996).</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Spectral_normalization_for_generative_adversial_network&diff=36307stat946w18/Spectral normalization for generative adversial network2018-04-18T14:20:18Z<p>Ws2chen: /* Model */</p>
<hr />
<div>= Presented by =<br />
<br />
1. liu, wenqing<br />
<br />
= Introduction =<br />
Generative adversarial networks (GANs) (Goodfellow et al., 2014) have been enjoying considerable success as a framework of generative models in recent years. The concept is to alternately train the model distribution and the discriminator, with the goal of reducing the difference between the model distribution and the target distribution as measured by the best possible discriminator at each step of training.<br />
<br />
A persisting challenge in the training of GANs is the performance control of the discriminator. When the support of the model distribution and the support of the target distribution are disjoint, there exists a discriminator that can perfectly distinguish the model distribution from the target (Arjovsky & Bottou, 2017). Once such a discriminator is produced, the training of the generator comes to a complete stop, because the derivative of the so-produced discriminator with respect to the input turns out to be 0. This motivates us to introduce some form of restriction on the choice of discriminator.<br />
<br />
In this paper, the authors propose a novel weight normalization method called ''spectral normalization'' that can stabilize the training of discriminator networks. The normalization enjoys the following favorable properties:<br />
<br />
* The only hyper-parameter that needs to be tuned is the Lipschitz constant, and the algorithm is not too sensitive to this constant's value<br />
* The additional computational cost needed to implement spectral normalization is small <br />
<br />
In this study, they provide explanations of the effectiveness of spectral normalization against other regularization or normalization techniques.<br />
<br />
= Model =<br />
<br />
Let us consider a simple discriminator made of a neural network of the following form, with the input x:<br />
<br />
\[f(x,\theta) = W^{L+1}a_L(W^L(a_{L-1}(W^{L-1}(\cdots a_1(W^1x)\cdots))))\]<br />
<br />
where <math> \theta:=\{W^1,\cdots,W^L, W^{L+1}\} </math> is the set of learnable parameters, <math>W^l\in R^{d_l\times d_{l-1}}, W^{L+1}\in R^{1\times d_L} </math>, and <math>a_l </math> is an element-wise non-linear activation function. The final output of the discriminator is given by <math>D(x,\theta) = A(f(x,\theta)) </math>, where A is an activation function chosen according to the divergence or distance measure in use. The standard formulation of GANs is given by <math>\min_{G}\max_{D}V(G,D)</math> where min and max of G and D are taken over the set of generator and discriminator functions, respectively. <br />
<br />
The conventional form of <math>V(G,D) </math> is given by:<br />
<br />
\[E_{x\sim q_{data}}[\log D(x)] + E_{x'\sim p_G}[\log(1-D(x'))]\]<br />
<br />
where <math>q_{data}</math> is the data distribution and <math>p_G(x)</math> is the model generator distribution to be learned through the adversarial min-max optimization. It is known that, for a fixed generator G, the optimal discriminator for this form of <math>V(G,D) </math> is given by <br />
<br />
\[ D_G^{*}(x):=q_{data}(x)/(q_{data}(x)+p_G(x)) \]<br />
<br />
The machine learning community has also recently pointed out that the function space from which the discriminators are selected crucially affects the performance of GANs. A number of works advocate the importance of Lipschitz continuity in assuring the boundedness of statistics. For example, the optimal discriminator above takes the form: <br />
<br />
\[ D_G^{*}(x):=\frac{q_{data}(x)}{q_{data}(x)+p_G(x)} = \text{sigmoid}(f^{*}(x)), \text{ where } f^{*}(x) = \log q_{data}(x) - \log p_{G}(x). \] <br />
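<br />
As a quick sanity check of the identity above, the following minimal sketch compares the two expressions for the optimal discriminator at a few points. The density values and the function names are hypothetical, chosen only for illustration; they are not taken from the paper.<br />

```python
import math


def sigmoid(t):
    """Logistic sigmoid 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def optimal_discriminator(q, p):
    """D*(x) = q(x) / (q(x) + p(x)) for density values q, p > 0."""
    return q / (q + p)


def optimal_logit(q, p):
    """f*(x) = log q(x) - log p(x)."""
    return math.log(q) - math.log(p)


# Hypothetical (q_data(x), p_G(x)) density values at three points x.
density_pairs = [(0.5, 0.5), (0.9, 0.1), (0.01, 0.2)]

for q, p in density_pairs:
    direct = optimal_discriminator(q, p)
    via_sigmoid = sigmoid(optimal_logit(q, p))
    # Both routes should agree up to floating-point rounding.
    assert abs(direct - via_sigmoid) < 1e-12
```

Note that the logit grows without bound as either density approaches zero, which is one way to see why some regularity condition on the discriminator is desirable.<br />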
<br />
The derivative of <math> f^{*} </math> can be unbounded, which motivates a regularity condition: we search for the discriminator D from the set of K-Lipschitz continuous functions, that is, <br />
<br />
\[ \arg\max_{||f||_{Lip}\le K}V(G,D)\]<br />
<br />
where we mean by <math> ||f||_{Lip}</math> the smallest value M such that <math> ||f(x)-f(x')||/||x-x'||\le M </math> for any x,x', with the norm being the <math> l_2 </math> norm.<br />
<br />
Our spectral normalization controls the Lipschitz constant of the discriminator function <math> f </math> by literally constraining the spectral norm of each layer <math> g: h_{in}\rightarrow h_{out}</math>. By definition, the Lipschitz norm <math> ||g||_{Lip} </math> is equal to <math> \sup_h\sigma(\nabla g(h)) </math>, where <math> \sigma(A) </math> is the spectral norm of the matrix A, which is equivalent to the largest singular value of A. Therefore, for a linear layer <math> g(h)=Wh </math>, the norm is given by <math> ||g||_{Lip}=\sigma(W) </math>. Using the inequality <math> ||g_1\circ g_2||_{Lip}\le ||g_1||_{Lip}\cdot||g_2||_{Lip} </math>, we obtain the following bound:<br />
<br />
\[ ||f||_{Lip}\le ||(h_L\rightarrow W^{L+1}h_{L})||_{Lip}*||a_{L}||_{Lip}*||(h_{L-1}\rightarrow W^{L}h_{L-1})||_{Lip}\cdots ||a_1||_{Lip}*||(h_0\rightarrow W^1h_0)||_{Lip}=\prod_{l=1}^{L+1}\sigma(W^l) *\prod_{l=1}^{L} ||a_l||_{Lip} \]<br />
<br />
Our spectral normalization normalizes the spectral norm of the weight matrix W so that it satisfies the Lipschitz constraint <math> \sigma(W)=1 </math>:<br />
<br />
\[ \bar{W_{SN}}:= W/\sigma(W) \]<br />
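<br />
To make the normalization concrete, here is a minimal sketch for the 2x2 case, where the largest singular value has a closed form (the square root of the largest eigenvalue of <math> W^TW </math>). The helper names and the example matrix are hypothetical, and the closed form is used only for illustration; it is not how the paper computes <math> \sigma(W) </math>.<br />

```python
import math


def spectral_norm_2x2(W):
    """Largest singular value of a 2x2 matrix via the eigenvalues of W^T W."""
    (a, b), (c, d) = W
    # Entries of the symmetric matrix W^T W = [[p, r], [r, s]].
    p = a * a + c * c
    s = b * b + d * d
    r = a * b + c * d
    # Largest eigenvalue of a symmetric 2x2 matrix.
    lam_max = 0.5 * (p + s) + math.sqrt(0.25 * (p - s) ** 2 + r * r)
    return math.sqrt(lam_max)


def spectral_normalize_2x2(W):
    """Return W / sigma(W), whose largest singular value is 1."""
    sigma = spectral_norm_2x2(W)
    return [[w / sigma for w in row] for row in W]


W = [[3.0, 0.0], [4.0, 5.0]]          # hypothetical weight matrix
W_sn = spectral_normalize_2x2(W)

# After normalization the spectral norm is 1 up to rounding.
assert abs(spectral_norm_2x2(W_sn) - 1.0) < 1e-12
```

Only the largest singular value is fixed to one; the remaining singular values are rescaled by the same factor, so their relative sizes are preserved.<br />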
<br />
In summary, just as weight normalization does, we reparameterize the weight matrix <math> \bar{W_{SN}} </math> as <math> W/\sigma(W) </math> to fix the largest singular value of the weight matrix. Now we can calculate the gradient with respect to the new parameter W by the chain rule:<br />
<br />
\[ \frac{\partial V(G,D)}{\partial W} = \frac{\partial V(G,D)}{\partial \bar{W_{SN}}}*\frac{\partial \bar{W_{SN}}}{\partial W} \]<br />
<br />
\[ \frac{\partial \bar{W_{SN}}}{\partial W_{ij}} = \frac{1}{\sigma(W)}E_{ij}-\frac{1}{\sigma(W)^2}*\frac{\partial \sigma(W)}{\partial(W_{ij})}W=\frac{1}{\sigma(W)}E_{ij}-\frac{[u_1v_1^T]_{ij}}{\sigma(W)^2}W=\frac{1}{\sigma(W)}(E_{ij}-[u_1v_1^T]_{ij}\bar{W_{SN}}) \]<br />
<br />
where <math> E_{ij} </math> is the matrix whose (i,j)-th entry is 1 and zero everywhere else, and <math> u_1, v_1</math> are respectively the first left and right singular vectors of W.<br />
<br />
To understand the above computation in more detail, note that <br />
\begin{align}<br />
\sigma(W)= \sup_{||u||=1, ||v||=1} \langle Wv, u \rangle = \sup_{||u||=1, ||v||=1} \text{trace} ( (uv^T)^T W).<br />
\end{align}<br />
By Theorem 4.4.2 in Lemaréchal and Hiriart-Urruty (1996), the sub-differential of a convex function defined as the maximum of a set of differentiable convex functions over a compact index set is the convex hull of the gradients of the maximizing functions. Thus we have the sub-differential:<br />
<br />
\begin{align}<br />
\partial \sigma = \text{convex hull} \{ u v^T: u,v \text{ are left/right singular vectors associated with } \sigma(W) \}.<br />
\end{align}<br />
<br />
However, the authors assume that the maximum singular value of W has only one left and one right normalized singular vector. Thus <math> \sigma </math> is differentiable and <br />
\begin{align}<br />
\nabla_W \sigma(W) =u_1v_1^T,<br />
\end{align}<br />
which explains the above computation.<br />
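<br />
The gradient formula can be checked numerically in a simple case. The sketch below uses a hypothetical diagonal 2x2 matrix, for which the first singular vectors are the first standard basis vector, so <math> u_1v_1^T </math> has a single non-zero entry. The finite-difference helper and the closed-form 2x2 spectral norm are illustrative assumptions, not part of the paper.<br />

```python
import math


def spectral_norm_2x2(W):
    """Largest singular value of a 2x2 matrix via the eigenvalues of W^T W."""
    (a, b), (c, d) = W
    p = a * a + c * c
    s = b * b + d * d
    r = a * b + c * d
    lam_max = 0.5 * (p + s) + math.sqrt(0.25 * (p - s) ** 2 + r * r)
    return math.sqrt(lam_max)


def numerical_gradient(W, eps=1e-6):
    """Central finite-difference estimate of d sigma(W) / d W_ij."""
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        for j in range(2):
            plus = [row[:] for row in W]
            minus = [row[:] for row in W]
            plus[i][j] += eps
            minus[i][j] -= eps
            grad[i][j] = (spectral_norm_2x2(plus) - spectral_norm_2x2(minus)) / (2 * eps)
    return grad


# For W = diag(3, 1) the top singular vectors are u_1 = v_1 = (1, 0),
# so u_1 v_1^T = [[1, 0], [0, 0]].
W = [[3.0, 0.0], [0.0, 1.0]]
grad = numerical_gradient(W)
expected = [[1.0, 0.0], [0.0, 0.0]]

for i in range(2):
    for j in range(2):
        assert abs(grad[i][j] - expected[i][j]) < 1e-6
```

The check relies on the largest singular value being simple (3 versus 1 here), which is the same uniqueness assumption made in the text.<br />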
<br />
= Spectral Normalization VS Other Regularization Techniques =<br />
<br />
The weight normalization introduced by Salimans & Kingma (2016) is a method that normalizes the <math> l_2 </math> norm of each row vector in the weight matrix. Mathematically, this is equivalent to requiring the weight-normalized matrix <math> \bar{W_{WN}} </math> to satisfy:<br />
<br />
<math> \sigma_1(\bar{W_{WN}})^2+\cdots+\sigma_T(\bar{W_{WN}})^2=d_o, \text{where } T=\min(d_i,d_o) </math>, where <math> \sigma_t(A) </math> is the t-th singular value of matrix A, and <math> d_i, d_o </math> are the input and output dimensions of the layer. <br />
<br />
Note that if <math> \bar{W_{WN}} </math> is the weight normalized matrix of dimension <math> d_i\times d_o </math>, the norm <math> ||\bar{W_{WN}}h||_2 </math> for a fixed unit vector <math> h </math> is maximized at <math> ||\bar{W_{WN}}h||_2=\sqrt{d_o} \text{ when } \sigma_1(\bar{W_{WN}})=\sqrt{d_o} \text{ and } \sigma_t(\bar{W_{WN}})=0, t=2, \cdots, T </math>, which means that <math> \bar{W_{WN}} </math> is of rank one. In order to retain as much norm of the input as possible and hence to make the discriminator more sensitive, one would hope to make the norm of <math> \bar{W_{WN}}h </math> large. For weight normalization, however, this comes at the cost of reducing the rank and hence the number of features to be used for the discriminator. Thus, there is a conflict of interests between weight normalization and our desire to use as many features as possible to distinguish the generator distribution from the target distribution. The former interest often reigns over the other in many cases, inadvertently diminishing the number of features to be used by the discriminators. Consequently, the algorithm would produce a rather arbitrary model distribution that matches the target distribution only at a select few features. <br />
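<br />
The trade-off described above can be illustrated with a small sketch. It compares two hypothetical row-normalized matrices with three rows: one of rank one, whose rows all point along the input <math> h </math>, and one of full column rank. The matrices and helper names are assumptions made for illustration only.<br />

```python
import math


def row_normalize(W):
    """Scale each row of W to unit l2 norm (the weight-normalization constraint)."""
    return [[w / math.sqrt(sum(x * x for x in row)) for w in row] for row in W]


def frobenius_sq(W):
    """Squared Frobenius norm, equal to the sum of squared singular values."""
    return sum(w * w for row in W for w in row)


def output_norm(W, h):
    """l2 norm of the matrix-vector product W h."""
    return math.sqrt(sum(sum(w * x for w, x in zip(row, h)) ** 2 for row in W))


h = [0.6, 0.8]  # fixed unit input vector

# Rank one: every row is a multiple of h before normalization.
rank_one = row_normalize([[3.0, 4.0], [6.0, 8.0], [0.3, 0.4]])
# Full column rank: rows point along different axes.
full_rank = row_normalize([[1.0, 0.0], [0.0, 2.0], [5.0, 0.0]])

# Both satisfy the same constraint: squared singular values sum to the number of rows.
assert abs(frobenius_sq(rank_one) - 3.0) < 1e-12
assert abs(frobenius_sq(full_rank) - 3.0) < 1e-12

# The rank-one matrix attains the maximal output norm sqrt(3); the other does not.
assert abs(output_norm(rank_one, h) - math.sqrt(3.0)) < 1e-12
assert output_norm(full_rank, h) < output_norm(rank_one, h)
```

Under the fixed sum-of-squares budget, pushing the output norm to its maximum leaves no room for a second non-zero singular value, which is the loss of features the text refers to.<br />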
<br />
Brock et al. (2016) introduced orthonormal regularization on each weight to stabilize the training of GANs. In their work, Brock et al. (2016) augmented the adversarial objective function by adding the following term:<br />
<br />
<math> ||W^TW-I||^2_F </math><br />
<br />
While this seems to serve the same purpose as spectral normalization, orthonormal regularization is mathematically quite different from our spectral normalization. Orthonormal regularization requires the weight matrix to be orthogonal, which coerces all singular values to be one and therefore destroys the information about the spectrum. Spectral normalization, on the other hand, only scales the spectrum so that its maximum is one. <br />
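<br />
A minimal sketch of the penalty term makes the difference visible. The helper name and the two example matrices are hypothetical; the second matrix already has largest singular value one, so spectral normalization would leave it unchanged, yet the orthonormal penalty is still positive.<br />

```python
def orthonormal_penalty(W):
    """Squared Frobenius norm of W^T W - I for a matrix given as a list of rows."""
    n_cols = len(W[0])
    total = 0.0
    for i in range(n_cols):
        for j in range(n_cols):
            gram_ij = sum(row[i] * row[j] for row in W)  # (W^T W)_ij
            target = 1.0 if i == j else 0.0              # I_ij
            total += (gram_ij - target) ** 2
    return total


identity = [[1.0, 0.0], [0.0, 1.0]]   # singular values (1, 1)
scaled = [[1.0, 0.0], [0.0, 0.5]]     # singular values (1, 0.5)

assert orthonormal_penalty(identity) == 0.0
# The second singular value 0.5 is penalized: (0.25 - 1)^2 = 0.5625.
assert abs(orthonormal_penalty(scaled) - 0.5625) < 1e-12
```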
<br />
Gulrajani et al. (2017) used the gradient penalty method in combination with WGAN. In their work, they placed a K-Lipschitz constraint on the discriminator by augmenting the objective function with a regularizer that rewards the function for having a local 1-Lipschitz constant (i.e., <math> ||\nabla_{\hat{x}} f ||_2 = 1 </math>) at discrete sets of points of the form <math> \hat{x}:=\epsilon \tilde{x} + (1-\epsilon)x </math> generated by interpolating a sample <math> \tilde{x} </math> from the generative distribution and a sample <math> x </math> from the data distribution. This approach has the obvious weakness of being heavily dependent on the support of the current generative distribution. Moreover, WGAN-GP requires more computation than our spectral normalization with single-step power iteration, because the computation of <math> ||\nabla_{\hat{x}} f ||_2 </math> requires one whole round of forward and backward propagation.<br />
<br />
= Experimental settings and results = <br />
== Objective function ==<br />
For all methods other than WGAN-GP, we use <br />
<math> V(G,D) := E_{x\sim q_{data}(x)}[\log D(x)] + E_{z\sim p(z)}[\log (1-D(G(z)))]</math><br />
to update D, and <math> -E_{z\sim p(z)}[\log(D(G(z)))] </math> for the updates of G. Alternatively, we test the performance of the algorithm with the so-called hinge loss, which is given by <br />
<math> V_D(\hat{G},D)= E_{x\sim q_{data}(x)}[\min(0,-1+D(x))] + E_{z\sim p(z)}[\min(0,-1-D(\hat{G}(z)))] </math>, <math> V_G(G,\hat{D})=-E_{z\sim p(z)}[\hat{D}(G(z))] </math><br />
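<br />
The hinge objectives above can be estimated from mini-batches of discriminator outputs. The following is a minimal sketch with hypothetical function names and made-up output values; it only evaluates the two expressions and is not the authors' training code.<br />

```python
def hinge_d_objective(d_real, d_fake):
    """Monte Carlo estimate of V_D: mean min(0, -1 + D(x)) + mean min(0, -1 - D(G(z)))."""
    real_term = sum(min(0.0, -1.0 + d) for d in d_real) / len(d_real)
    fake_term = sum(min(0.0, -1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


def hinge_g_objective(d_fake):
    """Monte Carlo estimate of V_G: -mean D(G(z))."""
    return -sum(d_fake) / len(d_fake)


d_real = [2.0, 0.5]    # hypothetical discriminator outputs on real samples
d_fake = [-3.0, 0.0]   # hypothetical discriminator outputs on generated samples

v_d = hinge_d_objective(d_real, d_fake)
v_g = hinge_g_objective(d_fake)
```

Samples that the discriminator already scores beyond the margin (at least 1 for real data, at most -1 for generated data) contribute zero to <math> V_D </math>, so only samples inside the margin influence the discriminator update.<br />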
<br />
For WGAN-GP, we choose <br />
<math> V(G,D):=E_{x\sim q_{data}}[D(x)]-E_{z\sim p(z)}[D(G(z))]- \lambda E_{\hat{x}\sim p(\hat{x})}[(||\nabla_{\hat{x}}D(\hat{x})||_2-1)^2]</math><br />
<br />
== Optimization ==<br />
The Adam optimizer is used with 6 settings in total, which vary the following: <br />
* <math> n_{dis} </math>, the number of updates of the discriminator per one update of the generator. <br />
* learning rate <math> \alpha </math><br />
* the first and second momentum parameters <math> \beta_1, \beta_2 </math> of Adam<br />
<br />
[[File:inception score.png]]<br />
<br />
[[File:FID score.png]]<br />
<br />
The above images show the inception score and FID score with settings A-F, and the table shows the inception scores of the different methods with optimal settings on the CIFAR-10 and STL-10 datasets.<br />
<br />
== Singular values analysis on the weights of the discriminator D ==<br />
[[File:singular value.png]]<br />
<br />
In the above figure, we show the squared singular values of the weight matrices in the final discriminator D produced by each method using the parameter that yielded the best inception score. As predicted, the singular values of the first to fifth layers trained with weight clipping and weight normalization concentrate on a few components. On the other hand, the singular values of the weight matrices in those layers trained with spectral normalization are more broadly distributed.<br />
<br />
== Training time ==<br />
On CIFAR-10, SN-GANs are slightly slower than weight normalization (31 seconds versus 29 seconds for 100 generations), as seen in figure 10. However, SN-GANs are significantly faster than WGAN-GP, which takes 40 seconds for 100 generations. As mentioned in section 3, WGAN-GP is slower than other methods because it needs to calculate the gradient of the gradient norm. For STL-10, the computational time of SN-GANs is almost the same as that of vanilla GANs, at approximately 61 seconds for 100 generations.<br />
<br />
[[File:trainingTime.png|center]]<br />
<br />
== Comparison between SN-GANs and orthonormal regularization ==<br />
[[File:comparison.png]]<br />
As explained in Section 3, orthonormal regularization is different from our method in that it destroys the spectral information and puts equal emphasis on all feature dimensions, including the ones that should be weeded out in the training process. To see the extent of its possibly detrimental effect, we experimented by increasing the dimension of the feature space, especially at the final layer, for which the training with our spectral normalization prefers a relatively small feature space. The figure above shows the result of our experiments. As predicted, the performance of the orthonormal regularization deteriorates as we increase the dimension of the feature maps at the final layer. SN-GANs, on the other hand, do not falter with this modification of the architecture.<br />
<br />
We also applied our method to the training of class conditional GANs on the ILSVRC2012 dataset with 1000 classes, each consisting of approximately 1300 images, which we compressed to 128x128 pixels. GAN without normalization and GAN with layer normalization collapsed at the beginning of training and failed to produce any meaningful images. The figure above shows that the inception score of orthonormal regularization plateaued around 20k iterations, while SN kept improving even afterward.<br />
<br />
[[File:sngan.jpg]]<br />
<br />
Samples generated by various networks trained on CIFAR10. Momentum and training rates are increasing to the right. We can see that for high learning rates and momentum the Wasserstein-GAN does not generate good images, while weight and spectral normalization generate good samples.<br />
<br />
= Algorithm of spectral normalization =<br />
To calculate the largest singular value of the matrix <math> W </math> needed for spectral normalization, we appeal to power iteration. The algorithm is executed as follows:<br />
<br />
* Initialize <math>\tilde{u}_{l}\in R^{d_l} \text{ for } l=1,\cdots,L </math> with a random vector (sampled from an isotropic distribution) <br />
* For each update and each layer l:<br />
** Apply the power iteration method to the unnormalized weight <math> W^l </math>:<br />
<br />
\begin{align}<br />
\tilde{v_l}\leftarrow (W^l)^T\tilde{u_l}/||(W^l)^T\tilde{u_l}||_2<br />
\end{align}<br />
<br />
\begin{align}<br />
\tilde{u_l}\leftarrow W^l\tilde{v_l}/||W^l\tilde{v_l}||_2<br />
\end{align}<br />
<br />
* Calculate <math> \bar{W_{SN}} </math> with the spectral norm :<br />
<br />
\begin{align}<br />
\bar{W_{SN}}(W^l)=W^l/\sigma(W^l)<br />
\end{align}<br />
<br />
where<br />
<br />
\begin{align}<br />
\sigma(W^l)=\tilde{u_l}^TW^l\tilde{v_l}<br />
\end{align}<br />
<br />
* Update <math>W^l </math> with SGD on mini-batch dataset <math> D_M </math> with a learning rate <math> \alpha </math><br />
<br />
<br />
\begin{align}<br />
W^l\leftarrow W^l-\alpha\nabla_{W^l}l(\bar{W_{SN}^l}(W^l),D_M)<br />
\end{align}<br />
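<br />
The power iteration steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the helper names, the example matrix, and the use of many iterations on a fixed matrix are choices made here for clarity, whereas the algorithm in the text performs the iteration once per training update and carries <math> \tilde{u}_l </math> over between updates. The SGD step on the loss is omitted.<br />

```python
import math
import random


def matvec(W, v):
    """Compute W v for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def matvec_t(W, u):
    """Compute W^T u."""
    return [sum(W[i][j] * u[i] for i in range(len(W))) for j in range(len(W[0]))]


def normalize(x):
    """Scale a vector to unit l2 norm."""
    norm = math.sqrt(sum(a * a for a in x))
    return [a / norm for a in x]


def power_iteration_step(W, u):
    """One round: v <- W^T u / ||W^T u||, u <- W v / ||W v||, sigma = u^T W v."""
    v = normalize(matvec_t(W, u))
    u = normalize(matvec(W, v))
    sigma = sum(a * b for a, b in zip(u, matvec(W, v)))
    return u, v, sigma


def spectral_normalize(W, u, n_steps=1):
    """Return (W / sigma, updated u, sigma) after n_steps >= 1 power iterations."""
    for _ in range(n_steps):
        u, v, sigma = power_iteration_step(W, u)
    return [[w / sigma for w in row] for row in W], u, sigma


random.seed(0)
W = [[3.0, 0.0], [4.0, 5.0]]                      # hypothetical weight matrix
u0 = normalize([random.gauss(0.0, 1.0) for _ in W])  # random initial vector

W_sn, u, sigma = spectral_normalize(W, u0, n_steps=50)

# The exact largest singular value of this matrix is sqrt(45).
assert abs(sigma - math.sqrt(45.0)) < 1e-9
```

Reusing the previous <math> \tilde{u}_l </math> as the starting point is what keeps a single step per update adequate in practice, since the weights change only slightly between consecutive updates.<br />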
<br />
<br />
== Conclusions ==<br />
This paper proposes spectral normalization as a stabilizer for the training of GANs. When spectral normalization is applied to GANs on image generation tasks, the generated examples are more diverse than when using conventional weight normalization and achieve better or comparable inception scores relative to previous studies. The method imposes global regularization on the discriminator as opposed to the local regularization introduced by WGAN-GP. In future work, the authors would like to investigate how their method compares analytically to other methods, while also further comparing it empirically by conducting experiments with their algorithm on larger and more complex datasets.<br />
<br />
== Open Source Code ==<br />
The open source code for this paper can be found at https://github.com/pfnet-research/sngan_projection.<br />
<br />
== References ==<br />
# Lemaréchal, Claude, and J. B. Hiriart-Urruty. "Convex analysis and minimization algorithms I." Grundlehren der mathematischen Wissenschaften 305 (1996).</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Spectral_normalization_for_generative_adversial_network&diff=36306stat946w18/Spectral normalization for generative adversial network2018-04-18T14:19:21Z<p>Ws2chen: /* Model */</p>
<hr />
<div>= Presented by =<br />
<br />
1. liu, wenqing<br />
<br />
= Introduction =<br />
Generative adversarial networks (GANs) (Goodfellow et al., 2014) have been enjoying considerable success as a framework of generative models in recent years. The concept is to train the model distribution and the discriminator alternately, with the goal of reducing the difference between the model distribution and the target distribution as measured by the best discriminator possible at each step of the training.<br />
<br />
A persisting challenge in the training of GANs is the performance control of the discriminator. When the support of the model distribution and the support of the target distribution are disjoint, there exists a discriminator that can perfectly distinguish the model distribution from the target (Arjovsky & Bottou, 2017). Once such a discriminator is produced, the training of the generator comes to a complete stop, because the derivative of the so-produced discriminator with respect to the input turns out to be 0. This motivates us to introduce some form of restriction to the choice of discriminator.<br />
<br />
In this paper, the authors propose a novel weight normalization method called ''spectral normalization'' that can stabilize the training of discriminator networks. The normalization enjoys the following favorable properties:<br />
<br />
* The only hyper-parameter that needs to be tuned is the Lipschitz constant, and the algorithm is not too sensitive to this constant's value<br />
* The additional computation needed to implement spectral normalization is small <br />
<br />
In this study, they provide explanations of the effectiveness of spectral normalization against other regularization or normalization techniques.<br />
<br />
= Model =<br />
<br />
Let us consider a simple discriminator made of a neural network of the following form, with the input x:<br />
<br />
\[f(x,\theta) = W^{L+1}a_L(W^L(a_{L-1}(W^{L-1}(\cdots a_1(W^1x)\cdots))))\]<br />
<br />
where <math> \theta:=\{W^1,\cdots,W^L, W^{L+1}\} </math> is the set of learnable parameters, <math>W^l\in R^{d_l\times d_{l-1}}, W^{L+1}\in R^{1\times d_L} </math>, and <math>a_l </math> is an element-wise non-linear activation function. The final output of the discriminator is given by <math>D(x,\theta) = A(f(x,\theta)) </math>, where A is an activation function chosen according to the divergence or distance measure in use. The standard formulation of GANs is given by <math>\min_{G}\max_{D}V(G,D)</math> where min and max of G and D are taken over the set of generator and discriminator functions, respectively. <br />
<br />
The conventional form of <math>V(G,D) </math> is given by:<br />
<br />
\[E_{x\sim q_{data}}[\log D(x)] + E_{x'\sim p_G}[\log(1-D(x'))]\]<br />
<br />
where <math>q_{data}</math> is the data distribution and <math>p_G(x)</math> is the model generator distribution to be learned through the adversarial min-max optimization. It is known that, for a fixed generator G, the optimal discriminator for this form of <math>V(G,D) </math> is given by <br />
<br />
\[ D_G^{*}(x):=q_{data}(x)/(q_{data}(x)+p_G(x)) \]<br />
<br />
The machine learning community has also recently pointed out that the function space from which the discriminators are selected crucially affects the performance of GANs. A number of works advocate the importance of Lipschitz continuity in assuring the boundedness of statistics. For example, the optimal discriminator above takes the form: <br />
<br />
\[ D_G^{*}(x):=\frac{q_{data}(x)}{q_{data}(x)+p_G(x)} = \text{sigmoid}(f^{*}(x)), \text{ where } f^{*}(x) = \log q_{data}(x) - \log p_{G}(x). \] <br />
<br />
The derivative of <math> f^{*} </math> can be unbounded, which motivates a regularity condition: we search for the discriminator D from the set of K-Lipschitz continuous functions, that is, <br />
<br />
\[ \arg\max_{||f||_{Lip}\le K}V(G,D)\]<br />
<br />
where we mean by <math> ||f||_{Lip}</math> the smallest value M such that <math> ||f(x)-f(x')||/||x-x'||\le M </math> for any x,x', with the norm being the <math> l_2 </math> norm.<br />
<br />
Our spectral normalization controls the Lipschitz constant of the discriminator function <math> f </math> by literally constraining the spectral norm of each layer <math> g: h_{in}\rightarrow h_{out}</math>. By definition, the Lipschitz norm <math> ||g||_{Lip} </math> is equal to <math> \sup_h\sigma(\nabla g(h)) </math>, where <math> \sigma(A) </math> is the spectral norm of the matrix A, which is equivalent to the largest singular value of A. Therefore, for a linear layer <math> g(h)=Wh </math>, the norm is given by <math> ||g||_{Lip}=\sigma(W) </math>. Using the inequality <math> ||g_1\circ g_2||_{Lip}\le ||g_1||_{Lip}\cdot||g_2||_{Lip} </math>, we obtain the following bound:<br />
<br />
\[ ||f||_{Lip}\le ||(h_L\rightarrow W^{L+1}h_{L})||_{Lip}*||a_{L}||_{Lip}*||(h_{L-1}\rightarrow W^{L}h_{L-1})||_{Lip}\cdots ||a_1||_{Lip}*||(h_0\rightarrow W^1h_0)||_{Lip}=\prod_{l=1}^{L+1}\sigma(W^l) *\prod_{l=1}^{L} ||a_l||_{Lip} \]<br />
<br />
Our spectral normalization normalizes the spectral norm of the weight matrix W so that it satisfies the Lipschitz constraint <math> \sigma(W)=1 </math>:<br />
<br />
\[ \bar{W_{SN}}:= W/\sigma(W) \]<br />
<br />
In summary, just as weight normalization does, we reparameterize the weight matrix <math> \bar{W_{SN}} </math> as <math> W/\sigma(W) </math> to fix the largest singular value of the weight matrix. Now we can calculate the gradient with respect to the new parameter W by the chain rule:<br />
<br />
\[ \frac{\partial V(G,D)}{\partial W} = \frac{\partial V(G,D)}{\partial \bar{W_{SN}}}*\frac{\partial \bar{W_{SN}}}{\partial W} \]<br />
<br />
\[ \frac{\partial \bar{W_{SN}}}{\partial W_{ij}} = \frac{1}{\sigma(W)}E_{ij}-\frac{1}{\sigma(W)^2}*\frac{\partial \sigma(W)}{\partial(W_{ij})}W=\frac{1}{\sigma(W)}E_{ij}-\frac{[u_1v_1^T]_{ij}}{\sigma(W)^2}W=\frac{1}{\sigma(W)}(E_{ij}-[u_1v_1^T]_{ij}\bar{W_{SN}}) \]<br />
<br />
where <math> E_{ij} </math> is the matrix whose (i,j)-th entry is 1 and zero everywhere else, and <math> u_1, v_1</math> are respectively the first left and right singular vectors of W.<br />
<br />
To understand the above computation in more detail, note that <br />
\begin{align}<br />
\sigma(W)= \sup_{||u||=1, ||v||=1} \langle Wv, u \rangle = \sup_{||u||=1, ||v||=1} \text{trace} ( (uv^T)^T W).<br />
\end{align}<br />
By Theorem 4.4.2 in Lemaréchal and Hiriart-Urruty (1996), the sub-differential of a convex function defined as the maximum of a set of differentiable convex functions over a compact index set is the convex hull of the gradients of the maximizing functions. Thus we have the sub-differential:<br />
<br />
\begin{align}<br />
\partial \sigma = \text{convex hull} \{ u v^T: u,v \text{ are left/right singular vectors associated with } \sigma(W) \}.<br />
\end{align}<br />
<br />
However, the authors assume that the maximum singular value of W has only one left and one right normalized singular vector. Thus <math> \sigma </math> is differentiable and <br />
\begin{align}<br />
\nabla_W \sigma(W) =u_1v_1^T,<br />
\end{align}<br />
which explains the above computation.<br />
<br />
= Spectral Normalization VS Other Regularization Techniques =<br />
<br />
The weight normalization introduced by Salimans & Kingma (2016) is a method that normalizes the <math> l_2 </math> norm of each row vector in the weight matrix. Mathematically, this is equivalent to requiring the weight-normalized matrix <math> \bar{W_{WN}} </math> to satisfy:<br />
<br />
<math> \sigma_1(\bar{W_{WN}})^2+\cdots+\sigma_T(\bar{W_{WN}})^2=d_o, \text{where } T=\min(d_i,d_o) </math>, where <math> \sigma_t(A) </math> is the t-th singular value of matrix A, and <math> d_i, d_o </math> are the input and output dimensions of the layer. <br />
<br />
Note that if <math> \bar{W_{WN}} </math> is the weight normalized matrix of dimension <math> d_i\times d_o </math>, the norm <math> ||\bar{W_{WN}}h||_2 </math> for a fixed unit vector <math> h </math> is maximized at <math> ||\bar{W_{WN}}h||_2=\sqrt{d_o} \text{ when } \sigma_1(\bar{W_{WN}})=\sqrt{d_o} \text{ and } \sigma_t(\bar{W_{WN}})=0, t=2, \cdots, T </math>, which means that <math> \bar{W_{WN}} </math> is of rank one. In order to retain as much norm of the input as possible and hence to make the discriminator more sensitive, one would hope to make the norm of <math> \bar{W_{WN}}h </math> large. For weight normalization, however, this comes at the cost of reducing the rank and hence the number of features to be used for the discriminator. Thus, there is a conflict of interests between weight normalization and our desire to use as many features as possible to distinguish the generator distribution from the target distribution. The former interest often reigns over the other in many cases, inadvertently diminishing the number of features to be used by the discriminators. Consequently, the algorithm would produce a rather arbitrary model distribution that matches the target distribution only at a select few features. <br />
<br />
Brock et al. (2016) introduced orthonormal regularization on each weight to stabilize the training of GANs. In their work, Brock et al. (2016) augmented the adversarial objective function by adding the following term:<br />
<br />
<math> ||W^TW-I||^2_F </math><br />
<br />
While this seems to serve the same purpose as spectral normalization, orthonormal regularization is mathematically quite different from our spectral normalization. Orthonormal regularization requires the weight matrix to be orthogonal, which coerces all singular values to be one and therefore destroys the information about the spectrum. Spectral normalization, on the other hand, only scales the spectrum so that its maximum is one. <br />
<br />
Gulrajani et al. (2017) used the gradient penalty method in combination with WGAN. In their work, they placed a K-Lipschitz constraint on the discriminator by augmenting the objective function with a regularizer that rewards the function for having a local 1-Lipschitz constant (i.e., <math> ||\nabla_{\hat{x}} f ||_2 = 1 </math>) at discrete sets of points of the form <math> \hat{x}:=\epsilon \tilde{x} + (1-\epsilon)x </math> generated by interpolating a sample <math> \tilde{x} </math> from the generative distribution and a sample <math> x </math> from the data distribution. This approach has the obvious weakness of being heavily dependent on the support of the current generative distribution. Moreover, WGAN-GP requires more computation than our spectral normalization with single-step power iteration, because the computation of <math> ||\nabla_{\hat{x}} f ||_2 </math> requires one whole round of forward and backward propagation.<br />
<br />
= Experimental settings and results = <br />
== Objective function ==<br />
For all methods other than WGAN-GP, we use <br />
<math> V(G,D) := E_{x\sim q_{data}(x)}[\log D(x)] + E_{z\sim p(z)}[\log (1-D(G(z)))]</math><br />
to update D, and <math> -E_{z\sim p(z)}[\log(D(G(z)))] </math> for the updates of G. Alternatively, we test the performance of the algorithm with the so-called hinge loss, which is given by <br />
<math> V_D(\hat{G},D)= E_{x\sim q_{data}(x)}[\min(0,-1+D(x))] + E_{z\sim p(z)}[\min(0,-1-D(\hat{G}(z)))] </math>, <math> V_G(G,\hat{D})=-E_{z\sim p(z)}[\hat{D}(G(z))] </math><br />
<br />
For WGAN-GP, we choose <br />
<math> V(G,D):=E_{x\sim q_{data}}[D(x)]-E_{z\sim p(z)}[D(G(z))]- \lambda E_{\hat{x}\sim p(\hat{x})}[(||\nabla_{\hat{x}}D(\hat{x})||_2-1)^2]</math><br />
<br />
== Optimization ==<br />
The Adam optimizer is used with 6 settings in total, which vary the following: <br />
* <math> n_{dis} </math>, the number of updates of the discriminator per one update of the generator. <br />
* learning rate <math> \alpha </math><br />
* the first and second momentum parameters <math> \beta_1, \beta_2 </math> of Adam<br />
<br />
[[File:inception score.png]]<br />
<br />
[[File:FID score.png]]<br />
<br />
The above images show the inception score and FID score with settings A-F, and the table shows the inception scores of the different methods with optimal settings on the CIFAR-10 and STL-10 datasets.<br />
<br />
== Singular values analysis on the weights of the discriminator D ==<br />
[[File:singular value.png]]<br />
<br />
In the above figure, we show the squared singular values of the weight matrices in the final discriminator D produced by each method using the parameter that yielded the best inception score. As predicted, the singular values of the first to fifth layers trained with weight clipping and weight normalization concentrate on a few components. On the other hand, the singular values of the weight matrices in those layers trained with spectral normalization are more broadly distributed.<br />
<br />
== Training time ==<br />
On CIFAR-10, SN-GANs are slightly slower than weight normalization (31 seconds versus 29 seconds for 100 generations), as seen in figure 10. However, SN-GANs are significantly faster than WGAN-GP, which takes 40 seconds for 100 generations. As mentioned in section 3, WGAN-GP is slower than other methods because it needs to calculate the gradient of the gradient norm. For STL-10, the computational time of SN-GANs is almost the same as that of vanilla GANs, at approximately 61 seconds for 100 generations.<br />
<br />
[[File:trainingTime.png|center]]<br />
<br />
== Comparison between SN-GANs and orthonormal regularization ==<br />
[[File:comparison.png]]<br />
As explained in Section 3, orthonormal regularization is different from our method in that it destroys the spectral information and puts equal emphasis on all feature dimensions, including the ones that should be weeded out in the training process. To see the extent of its possibly detrimental effect, we experimented by increasing the dimension of the feature space, especially at the final layer, for which the training with our spectral normalization prefers a relatively small feature space. The figure above shows the result of our experiments. As predicted, the performance of the orthonormal regularization deteriorates as we increase the dimension of the feature maps at the final layer. SN-GANs, on the other hand, do not falter with this modification of the architecture.<br />
<br />
We also applied our method to the training of class conditional GANs on the ILSVRC2012 dataset with 1000 classes, each consisting of approximately 1300 images, which we compressed to 128x128 pixels. GAN without normalization and GAN with layer normalization collapsed at the beginning of training and failed to produce any meaningful images. The figure above shows that the inception score of orthonormal regularization plateaued around 20k iterations, while SN kept improving even afterward.<br />
<br />
[[File:sngan.jpg]]<br />
<br />
Samples generated by various networks trained on CIFAR10. Momentum and training rates are increasing to the right. We can see that for high learning rates and momentum the Wasserstein-GAN does not generate good images, while weight and spectral normalization generate good samples.<br />
<br />
= Algorithm of spectral normalization =<br />
To calculate the largest singular value of the matrix <math> W </math> needed for spectral normalization, we appeal to power iteration. The algorithm is executed as follows:<br />
<br />
* Initialize <math>\tilde{u}_{l}\in R^{d_l} \text{ for } l=1,\cdots,L </math> with a random vector (sampled from an isotropic distribution) <br />
* For each update and each layer l:<br />
** Apply the power iteration method to the unnormalized weight <math> W^l </math>:<br />
<br />
\begin{align}<br />
\tilde{v_l}\leftarrow (W^l)^T\tilde{u_l}/||(W^l)^T\tilde{u_l}||_2<br />
\end{align}<br />
<br />
\begin{align}<br />
\tilde{u_l}\leftarrow W^l\tilde{v_l}/||W^l\tilde{v_l}||_2<br />
\end{align}<br />
<br />
* Calculate <math> \bar{W_{SN}} </math> with the spectral norm :<br />
<br />
\begin{align}<br />
\bar{W_{SN}}(W^l)=W^l/\sigma(W^l)<br />
\end{align}<br />
<br />
where<br />
<br />
\begin{align}<br />
\sigma(W^l)=\tilde{u_l}^TW^l\tilde{v_l}<br />
\end{align}<br />
<br />
* Update <math>W^l </math> with SGD on mini-batch dataset <math> D_M </math> with a learning rate <math> \alpha </math><br />
<br />
<br />
\begin{align}<br />
W^l\leftarrow W^l-\alpha\nabla_{W^l}l(\bar{W_{SN}^l}(W^l),D_M)<br />
\end{align}<br />
<br />
<br />
== Conclusions ==<br />
This paper proposes spectral normalization as a stabilizer for the training of GANs. When spectral normalization is applied to GANs on image generation tasks, the generated examples are more diverse than when using conventional weight normalization and achieve better or comparable inception scores relative to previous studies. The method imposes global regularization on the discriminator as opposed to the local regularization introduced by WGAN-GP. In future work, the authors would like to investigate how their method compares analytically to other methods, while also further comparing it empirically by conducting experiments with their algorithm on larger and more complex datasets.<br />
<br />
== Open Source Code ==<br />
The open source code for this paper can be found at https://github.com/pfnet-research/sngan_projection.<br />
<br />
== References ==<br />
# Lemaréchal, Claude, and J. B. Hiriart-Urruty. "Convex analysis and minimization algorithms I." Grundlehren der mathematischen Wissenschaften 305 (1996).</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Unsupervised_Machine_Translation_Using_Monolingual_Corpora_Only&diff=36305stat946w18/Unsupervised Machine Translation Using Monolingual Corpora Only2018-04-18T13:41:48Z<p>Ws2chen: /* De-noising Auto-encoder Loss */</p>
<hr />
<div><br />
[[File:MC_Translation_Example.png]]<br />
== Introduction ==<br />
Neural machine translation systems are usually trained on large corpora consisting of pairs of pre-translated sentences. The paper ''Unsupervised Machine Translation Using Monolingual Corpora Only'' by Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato proposes an unsupervised neural machine translation system, which can be trained without such parallel data.<br />
<br />
==Motivation==<br />
The authors offer two motivations for their work:<br />
# To translate between languages for which large parallel corpora do not exist<br />
# To provide a strong lower bound on the performance that any semi-supervised machine translation system should be expected to achieve<br />
<br />
<br />
=== Note: What is a corpus (plural corpora)? ===<br />
<br />
In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed). Corpora are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory. A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus).<br />
<br />
== Overview of unsupervised translation system ==<br />
The unsupervised translation scheme has the following outline:<br />
* The word-vector embeddings of the source and target languages are aligned in an unsupervised manner.<br />
* Sentences from the source and target language are mapped to a common latent vector space by an encoder, and then mapped to probability distributions over sentences in the target or source language by a decoder.<br />
* A de-noising auto-encoder loss encourages the latent-space representations to be insensitive to noise.<br />
* An adversarial loss encourages the latent-space representations of source and target sentences to be indistinguishable from each other. It is intended that the latent-space representation of a sentence should reflect its meaning, and not the particular language in which it is expressed.<br />
* A reconstruction loss encourages the model to improve on the translation model of the previous epoch.<br />
<br />
This paper investigates whether it is possible to train a general machine translation system without any form of supervision whatsoever, based only on the assumption that a monolingual corpus (explained earlier) exists for each language. This setup is interesting for two reasons. <br />
<br />
* First, this is applicable whenever we encounter a new language pair for which we have no annotation. <br />
<br />
* Second, it provides a strong lower bound performance on what any good semi-supervised approach is expected to yield.<br />
<br />
[[File:paper4_fig1.png|frame|none|alt=Alt text|A toy illustration of the training process, which guides the design of the objective function. The key idea here is to build a common latent space between languages. On the left, the model is trained to reconstruct a sentence from a noisy version of it in the same language. x is the target, C(x) is the noisy input, <math> \hat{x} </math> is the reconstruction. On the right, the model is trained to reconstruct a sentence given the same sentence but in another language.]]<br />
<br />
==Notation==<br />
Let <math>S</math> denote the set of words in the source language, and let <math>T</math> denote the set of words in the target language. Let <math>H \subset \mathbb{R}^{n_H}</math> denote the latent vector space. Moreover, let <math>S'</math> and <math>T'</math> denote the sets of finite sequences of words in the source and target language, and let <math>H'</math> denote the set of finite sequences of vectors in the latent space. For any set X, elide measure-theoretic details and let <math>\mathcal{P}(X)</math> denote the set of probability distributions over X.<br />
<br />
==Word vector alignment ==<br />
<br />
Conneau et al. (2017) describe an unsupervised method for aligning word vectors across languages. By "alignment", I mean that their method maps words with related meanings to nearby vectors, regardless of the language of the words. Moreover, if two words are one another's literal translations, their word vectors tend to be mutual nearest neighbors. <br />
<br />
The underlying idea of the alignment scheme can be summarized as follows: methods like word2vec or GloVe generate vectors for which there is a correspondence between semantics and geometry. If <math display="inline">f</math> maps English words to their corresponding vectors, we have the approximate equation<br />
\begin{align}<br />
f(\text{king}) -f(\text{man}) +f(\text{woman})\approx f(\text{queen}).<br />
\end{align}<br />
Furthermore, if <math display="inline">g</math> maps French words to their corresponding vectors, then <br />
\begin{align}<br />
g(\text{roi}) -g(\text{homme}) +g(\text{femme})\approx g(\text{reine}).<br />
\end{align}<br />
<br />
Thus if <math display="inline">W</math> maps the word vectors of English words to the word vectors of their French translations, we should expect <math display="inline">W</math> to be linear. As was observed by Mikolov et al. (2013), the problem of word-vector alignment then becomes a problem of learning the linear transformation that best aligns two point clouds, one from the source language and one from the target language. For more on the history of the word-vector alignment problem, see my CS698 project ([https://uwaterloo.ca/scholar/sites/ca.scholar/files/pa2forsy/files/project_dec_3_0.pdf link]).<br />
<br />
Conneau et al. (2017)'s word vector alignment scheme is unique in that it requires no parallel data, and uses only the shapes of the two word-vector point clouds to be aligned. I will not go into detail, but the heart of the method is a special GAN, in which only the discriminator is a neural network, and the generator is the map corresponding to an orthogonal matrix.<br />
<br />
This unsupervised alignment method is crucial to the translation scheme of the current paper. From now on we denote by <math display="inline">A: S' \cup T' \to \mathcal{Z}'</math> the function that maps a source- or target-language word sequence to the corresponding aligned word vector sequence, where <math display="inline">\mathcal{Z}'</math> is the set of finite sequences of aligned word vectors.<br />
<br />
==Encoder ==<br />
The encoder <math display="inline">E </math> reads a sequence of word vectors <math display="inline">(z_1,\ldots, z_m) \in \mathcal{Z}'</math> and outputs a sequence of hidden states <math display="inline">(h_1,\ldots, h_m) \in H'</math> in the latent space. Crucially, because the word vectors of the two languages have been aligned, the same encoder can be applied to both. That is, to map a source sentence <math display="inline">x=(x_1,\ldots, x_M)\in S'</math> to the latent space, we compute <math display="inline">E(A(x))</math>, and to map a target sentence <math display="inline">y=(y_1,\ldots, y_K)\in T'</math> to the latent space, we compute <math display="inline">E(A(y))</math>.<br />
<br />
The encoder consists of two LSTMs, one of which reads the word-vector sequence in the forward direction, and one of which reads it in the backward direction. The hidden state sequence is generated by concatenating the hidden states produced by the forward and backward LSTMs at each word vector.<br />
<br />
==Decoder==<br />
<br />
The decoder is a mono-directional LSTM that accepts a sequence of hidden states <math display="inline">h=(h_1,\ldots, h_m) \in H'</math> from the latent space and a language <math display="inline">L \in \{S,T \}</math> and outputs a probability distribution over sentences in that language. We have<br />
<br />
\begin{align}<br />
D: H' \times \{S,T \} \to \mathcal{P}(S') \cup \mathcal{P}(T').<br />
\end{align}<br />
<br />
The decoder makes use of the attention mechanism of Bahdanau et al. (2014). To compute the probability of a given sentence <math display="inline">y=(y_1,\ldots,y_K)</math>, the LSTM processes the sentence one word at a time, accepting at step <math display="inline">k</math> the aligned word vector of the previous word in the sentence <math display="inline">A(y_{k-1})</math> and a context vector <math display="inline">c_k\in H</math> computed from the hidden sequence <math display="inline">h\in H'</math>, and outputting a probability distribution over possible next words. The LSTM is initialized with a special, language-specific start-of-sequence token. Otherwise, the decoder does not depend on the language of the sentence it is producing. The context vector is computed as described by Bahdanau et al. (2014), where we let <math display="inline">l_{k}</math> denote the hidden state of the LSTM at step <math display="inline">k</math>, <math display="inline">M</math> denotes the length of the hidden sequence <math display="inline">h</math>, <math display="inline">U,W</math> are learnable weight matrices, and <math display="inline">v</math> is a learnable weight vector:<br />
\begin{align}<br />
c_k&= \sum_{m=1}^M \alpha_{k,m} h_m\\<br />
\alpha_{k,m}&= \frac{\exp(e_{k,m})}{\sum_{m'=1}^M\exp(e_{k,m'}) },\\<br />
e_{k,m} &= v^T \tanh (Wl_{k-1} + U h_m ).<br />
\end{align}<br />
<br />
<br />
By learning <math display="inline">U,W</math> and <math display="inline">v</math>, the decoder can learn to decide which vectors in the sequence <math display="inline">h</math> are relevant to computing which words in the output sentence.<br />
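<br />
The three attention equations above can be traced on toy inputs with the following sketch. It is an illustration under stated assumptions rather than the paper's code: the function name <code>additive_attention</code> is hypothetical, vectors are plain Python lists, and the matrices <code>W</code>, <code>U</code> are lists of rows.<br />

```python
import math


def additive_attention(l_prev, h, W, U, v):
    """Return (context c_k, weights alpha_k) for one decoder step.

    l_prev: previous decoder state l_{k-1}
    h:      list of encoder hidden states h_1..h_M
    W, U:   weight matrices as lists of rows; v: weight vector
    """
    def matvec(A, x):
        return [sum(a * b for a, b in zip(row, x)) for row in A]

    Wl = matvec(W, l_prev)
    scores = []
    for h_m in h:
        Uh = matvec(U, h_m)
        # e_{k,m} = v^T tanh(W l_{k-1} + U h_m)
        scores.append(sum(v_i * math.tanh(a + b) for v_i, a, b in zip(v, Wl, Uh)))

    # alpha_{k,m}: softmax over m, shifted by the max score for stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]

    # c_k = sum_m alpha_{k,m} h_m
    dim = len(h[0])
    context = [sum(a * h_m[i] for a, h_m in zip(alpha, h)) for i in range(dim)]
    return context, alpha
```

The weights <code>alpha</code> always sum to one, so the context vector is a convex combination of the encoder states.<br />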
<br />
At step <math display="inline">k</math>, after receiving the context vector <math display="inline">c_k\in H</math> and the aligned word vector of the previous word in the sequence, <math display="inline">A(y_{k-1})</math>, the LSTM outputs a probability distribution over words, which should be interpreted as the distribution of the next word according to the decoder. The probability the decoder assigns to a sentence is then the product of the probabilities computed for each word in this manner.<br />
<br />
[[File:paper4_fig2.png|700px|]]<br />
<br />
==Overview of objective ==<br />
The objective function is the sum of:<br />
# The de-noising auto-encoder loss,<br />
# The translation loss,<br />
# The adversarial loss.<br />
I shall describe these in the following sections.<br />
<br />
==De-noising Auto-encoder Loss == <br />
A de-noising auto-encoder is a function optimized to map a corrupted sample from some dataset to the original un-corrupted sample. De-noising auto-encoders were introduced by Vincent et al. (2008), who provided numerous justifications, one of which is particularly illuminating. If we think of the dataset of interest as a thin manifold in a high-dimensional space, the corruption process is likely to perturb a datapoint off the manifold. To learn to restore the corrupted datapoint, the de-noising auto-encoder must learn the shape of the manifold.<br />
<br />
De-noising is needed because, when a sequence-to-sequence auto-encoder of sentences is equipped with an attention mechanism and trained without any constraint, it tends to merely copy every input word one by one. Such a model can perfectly copy even sequences of random words, which suggests that it has not learned any useful structure in the data.<br />
<br />
Hill et al. (2016) used a de-noising auto-encoder to learn vectors representing sentences. They corrupted input sentences by randomly dropping and swapping words, and then trained a neural network to map the corrupted sentence to a vector, and then map the vector to the un-corrupted sentence. Interestingly, they found that sentence vectors learned this way were particularly effective when applied to tasks that involved generating paraphrases. This makes some sense: for a vector to be useful in restoring a corrupted sentence, it must capture something of the sentence's underlying meaning.<br />
<br />
The present paper uses the principle of de-noising auto-encoders to compute one of the terms in its loss function. In each iteration, a sentence is sampled from the source or target language, and a corruption process <math display="inline"> C</math> is applied to it. <math display="inline"> C</math> works by deleting each word in the sentence with probability <math display="inline">p_C</math> and applying to the sentence a permutation randomly selected from those that do not move words more than <math display="inline">k_C</math> spots from their original positions. The authors select <math display="inline">p_C=0.1</math> and <math display="inline">k_C=3</math>. The corrupted sentence is then mapped to the latent space using <math display="inline">E\circ A</math>. The loss is then the negative log probability of the original un-corrupted sentence according to the decoder <math display="inline">D</math> applied to the latent-space sequence.<br />
<br />
The explanation of Vincent et al. (2008) can help us understand this loss-function term: the de-noising auto-encoder loss forces the translation system to learn the shapes of the manifolds of the source and target languages.<br />
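<br />
The corruption process <math display="inline">C</math> described in this section can be sketched as follows. The function names <code>local_shuffle</code> and <code>corrupt</code> are hypothetical, and the permutation is produced by a noise-and-sort construction that is assumed here for illustration: it guarantees that no word moves more than <math display="inline">k_C</math> positions, but it does not necessarily sample uniformly from all such permutations.<br />

```python
import random


def local_shuffle(words, k, rng):
    """Permute words so that no word moves more than k positions."""
    # Add uniform noise in [0, k+1) to each index, then sort by the noisy keys.
    # Python's sort is stable, so ties keep their original order.
    keys = [i + rng.random() * (k + 1) for i in range(len(words))]
    order = sorted(range(len(words)), key=lambda i: keys[i])
    return [words[i] for i in order]


def corrupt(words, p_drop=0.1, k=3, rng=None):
    """Drop each word with probability p_drop, then shuffle locally."""
    rng = rng or random.Random()
    kept = [w for w in words if rng.random() >= p_drop]
    return local_shuffle(kept, k, rng)
```

For example, <code>corrupt("the cat sat on the mat".split())</code> returns a sentence with a few words possibly missing and the remainder slightly reordered; the defaults mirror the values <math display="inline">p_C=0.1</math> and <math display="inline">k_C=3</math> quoted above.<br />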
<br />
==Translation Loss==<br />
To compute the translation loss, we sample a sentence from one of the languages, translate it with the encoder and decoder of the previous epoch, and then corrupt its output with <math display="inline">C</math>. We then use the current encoder <math display="inline">E</math> to map the corrupted translation to a sequence <math display="inline">h \in H'</math> and the decoder <math display="inline">D</math> to map <math display="inline">h</math> to a probability distribution over sentences. The translation loss is the negative log probability the decoder assigns to the original uncorrupted sentence. <br />
<br />
It is interesting and useful to consider why this translation loss, which depends on the translation model of the previous iteration, should promote an improved translation model in the current iteration. One loose way to understand this is to think of the translator as a de-noising translator. We are given a sentence perturbed from the manifold of possible sentences from a given language both by the corruption process and by the poor quality of the translation. The model must learn to both project and translate. The technique employed here resembles that used by Sennrich et al. (2015), who trained a neural machine translation system using both parallel and monolingual data. To make use of the monolingual target-language data, they used an auxiliary model to translate it to the source language, then trained their model to reconstruct the original target-language data from the source-language translation. Sennrich et al. argued that training the model to reconstruct true data from synthetic data was more robust than the opposite approach. The authors of the present paper use similar reasoning.<br />
<br />
==Adversarial Loss ==<br />
The intuition underlying the latent space is that it should encode the meaning of a sentence in a language-independent way. Accordingly, the authors introduce an adversarial loss, to encourage latent-space vectors mapped from the source and target languages to be indistinguishable. Central to this adversarial loss is the discriminator <math display="inline">R:H' \to [0,1]</math>, which makes use of <math display="inline">r: H\to [0,1]</math> a three-layer fully-connected neural network with 1024 hidden units per layer. Given a sequence of latent-space vectors <math display="inline">h=(h_1,\ldots,h_m)\in H'</math> the discriminator assigns probability <math display="inline">R(h)=\prod_{i=1}^m r(h_i)</math> that they originated in the target space. Each iteration, the discriminator is trained to maximize the objective function<br />
<br />
\begin{align}<br />
I_T(q) \log (R(E(q))) +(1-I_T(q) )\log(1-R(E(q)))<br />
\end{align}<br />
<br />
where <math display="inline">q</math> is a randomly selected sentence, and <math display="inline">I_T(q)</math> is 1 when <math display="inline">q</math> is from the target language and 0 when <math display="inline">q</math> is from the source language.<br />
<br />
The same term is added to the primary objective function, which the encoder and decoder are trained to minimize. The result is that the encoder and decoder learn to fool the discriminator by mapping sentences from the source and target language to similar sequences of latent-space vectors.<br />
<br />
<br />
The authors note that they make use of label smoothing, a technique recommended by Goodfellow (2016) for regularizing GANs, in which the objective described above is replaced by <br />
<br />
\begin{align}<br />
I_T(q)( (1-\alpha)\log (R(E(q))) +\alpha\log(1-R(E(q))) )+(1-I_T(q) ) ( (1-\beta) \log(1-R(E(q))) +\beta\log (R(E(q)) ))<br />
\end{align}<br />
for some small nonnegative values of <math display="inline">\alpha, \beta</math>, the idea being to prevent the discriminator from making extreme predictions. While one-sided label smoothing (<math display="inline">\beta = 0</math>) is generally recommended, the present model differs from a standard GAN in that it is symmetric, and hence two-sided label smoothing would appear more reasonable.<br />
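<br />
The smoothed discriminator objective above can be evaluated for a single sentence with the short sketch below. The function name <code>discriminator_objective</code> and its argument layout are assumptions made for illustration; <code>r_values</code> stands for the per-position outputs <math display="inline">r(h_i)</math>, whose product gives <math display="inline">R(h)</math>.<br />

```python
import math


def discriminator_objective(r_values, is_target, alpha=0.0, beta=0.0):
    """Smoothed log-likelihood the discriminator maximizes for one sentence.

    r_values:  per-position outputs r(h_i), each strictly between 0 and 1
    is_target: True if the sentence comes from the target language
    alpha/beta: label-smoothing amounts for target/source sentences
    """
    R = math.prod(r_values)  # R(h) = prod_i r(h_i)
    if is_target:
        return (1 - alpha) * math.log(R) + alpha * math.log(1 - R)
    return (1 - beta) * math.log(1 - R) + beta * math.log(R)
```

With <code>alpha = beta = 0</code> this reduces to the unsmoothed objective; with a positive smoothing value, pushing <math display="inline">R</math> all the way to 0 or 1 is penalized, which is the intended effect of discouraging extreme predictions.<br />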
<br />
<br />
It is interesting to observe that while the intuition justifying the use of the latent space suggests that the latent space representation of a sentence should be language-independent, this is not actually true: if two sentences are translations of one another, but have different lengths, their latent-space representations will necessarily be different, since a sentence's latent space representation has the same length as the sentence itself.<br />
<br />
==Objective Function==<br />
<br />
Combining the above-described terms, we can write the overall objective function. Let <math display="inline">Q_S</math> denote the monolingual dataset for the source language, and let <math display="inline">Q_T</math> denote the monolingual dataset for the target language. Let <math display="inline">D_S:= D(\cdot, S)</math> and <math display="inline">D_T:= D(\cdot, T)</math> be the decoder restricted to the source and target language, respectively. Let <math display="inline"> M_S </math> and <math display="inline"> M_T </math> denote the target-to-source and source-to-target translation models of the previous epoch. Then our objective function is<br />
<br />
\begin{align}<br />
\mathcal{L}(D,E,R)=\text{T Translation Loss}+\text{T De-noising Loss} +\text{T Adversarial Loss} +\text{S Translation Loss} +\text{S De-noising Loss} +\text{S Adversarial Loss}\\<br />
\end{align}<br />
\begin{align}<br />
=\sum_{q\in Q_T}\left( -\log D_T \circ E \circ C \circ M _S(q) (q) -\log D_T \circ E \circ C (q) (q)+(1-\alpha)\log (R\circ E(q)) +\alpha\log(1-R\circ E(q)) \right)+\sum_{q\in Q_S}\left( -\log D_S \circ E \circ C \circ M_T (q) (q) -\log D_S \circ E \circ C (q) (q)+(1-\beta) \log(1-R \circ E(q)) +\beta\log (R\circ E(q)) \right).<br />
\end{align}<br />
<br />
The authors alternate between iterations minimizing <math display="inline">\mathcal{L} </math> with respect to <math display="inline">E, D</math> and iterations maximizing with respect to <math display="inline">R</math>. Adam is used for minimization, while RMSProp is used for maximization. After each epoch, <math display="inline"> M </math> is updated so that <math display="inline">M_S=D_S \circ E</math> and <math display="inline">M_T=D_T \circ E</math>, after which <math display="inline"> M </math> is frozen until the next epoch.<br />
<br />
==Validation==<br />
The authors' aim is for their method to be completely unsupervised, so they do not use parallel corpora even for the selection of hyper-parameters. Instead, they validate by translating sentences to the other language and back, and comparing the resulting sentence with the original according to BLEU, a similarity metric frequently used in translation (Papineni et al. 2002).<br />
<br />
As justification, they show empirically that the score generated by applying BLEU on back-and-forth translation is correlated with applying BLEU using parallel corpora.<br />
[[File:paper4fig3.png]]<br />
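<br />
The round-trip validation idea can be written down as a small sketch. Everything here is hypothetical scaffolding: <code>src_to_tgt</code> and <code>tgt_to_src</code> stand for the two translation directions of the model, and <code>similarity</code> stands for a sentence-level similarity score such as BLEU, which is passed in rather than implemented.<br />

```python
def round_trip_score(sentences, src_to_tgt, tgt_to_src, similarity):
    """Average similarity between each sentence and its back-and-forth translation.

    sentences:  monolingual sentences in the source language
    src_to_tgt: callable translating source -> target (hypothetical)
    tgt_to_src: callable translating target -> source (hypothetical)
    similarity: callable(reference, hypothesis) -> float, e.g. a BLEU scorer
    """
    scores = []
    for sentence in sentences:
        reconstructed = tgt_to_src(src_to_tgt(sentence))
        scores.append(similarity(sentence, reconstructed))
    return sum(scores) / len(scores)
```

Because only monolingual sentences are needed, a score of this kind can be used for model selection without any parallel data.<br />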
<br />
==Experimental Procedure and Results==<br />
<br />
The authors test their method on four data sets. The first is from the English-French translation task of the Workshop on Machine Translation 2014 (WMT14). This data set consists of parallel data. The authors generate a monolingual English corpus by randomly sampling 15 million sentence pairs, and choosing only the English sentences. They then generate a French corpus by selecting the French sentences from those pairs that were not previously chosen. Importantly, this means that the monolingual data sets have no parallel sentences. The second data set is generated from the English-German translation task from WMT14 using the same procedure.<br />
<br />
The third and fourth data sets are generated from the Multi30k data set, which consists of multilingual captions of various images. The images are discarded and the English, French, and German captions are used to generate monolingual data sets in the manner described above. These monolingual corpora are much smaller, consisting of 14500 sentences each.<br />
<br />
The unsupervised translation scheme performs well, though not as well as a supervised translation scheme. It converges after a small number of epochs. Besides supervised translation, the authors compare their method with three other baselines: "Word-by-Word" uses only the previously-discussed word-alignment scheme; "Word-Reordering" uses a simple LSTM-based language model and a greedy algorithm to select a reordering of the words produced by "Word-by-Word"; and "Oracle Word Reordering" means the optimal reordering of the words produced by "Word-by-Word".<br />
<br />
The discriminator is an MLP with 3 hidden layers of size 1024, Leaky-ReLU activation functions and an output logistic unit. The encoder and the decoder are trained using Adam with a learning rate of 0.0003 and a mini-batch size of 32. The discriminator is trained using RMSProp with a learning rate of 0.0005.<br />
<br />
==Result Figures==<br />
[[File:MC_Translation Results.png]]<br />
[[File:MC_Translation_Convergence.png]]<br />
<br />
==Commentary==<br />
This paper's results are impressive: that it is even possible to translate between languages without parallel data suggests that languages are more similar than we might initially suspect, and that the method the authors present has, at least in part, discovered some common deep structure. As the authors point out, using no parallel data at all, their method is able to produce results comparable to those produced by neural machine translation methods trained on hundreds of thousands of parallel sentences on the WMT dataset. On the other hand, the results they offer come with a few significant caveats.<br />
<br />
The first caveat is that the workhorse of the method is the unsupervised word-vector alignment scheme presented in Conneau et al. (2017) (that paper shares 3 authors with this one). As the ablation study reveals, without word-vector alignment, this method performs extremely poorly. Moreover, word-by-word translation using word-vector alignment alone performs well, albeit not as well as this method. This suggests that the method of this paper mainly learns to perform (sometimes significant) corrections to word-by-word translations by reordering and occasional word substitution. Presumably, it does this by learning something of the natural structure of sentences in each of the two languages, so that it can correct the errors made by word-by-word translation.<br />
<br />
The second caveat is that the best results are attained translating between English and French, two very closely related languages, and the quality of translation between English and German, a slightly less related pair, is significantly worse (according to the ''Shorter Oxford English Dictionary'', 28.3 percent of the English vocabulary is French-derived, 28.2 percent is Latin-derived, and 25 percent is derived from Germanic languages; this probably understates the degree of correspondence between the French and English vocabularies, since French likely derives from Latin many of the same words English does). The authors do not report results with more distantly-related pairs, but it is reasonable to expect that performance would degrade significantly, for two reasons. Firstly, Conneau et al. (2017) show that the word-alignment scheme performs much worse on more distant language pairs. This may be because there are more one-to-one correspondences between the words of closely related languages than there are between more distant languages. Secondly, because the same encoder is used to read sentences of both languages, the encoder cannot adapt to the unique word-order properties of either language. This would become a problem for language pairs with very different grammar. The authors suggest that their scheme could be a useful tool for translating between language pairs for which there are few parallel corpora. However, language pairs lacking parallel corpora are often (though not always) distantly related, and it is for such pairs that the performance of the present method likely suffers.<br />
<br />
<br />
<br />
<br />
The proposed method always beats Oracle Word Reordering on the Multi30k data set, but sometimes does not on the WMT data set. This may be because the WMT sentences are much more syntactically complex than the simple image captions of the Multi30k data set.<br />
<br />
The ablation study also reveals the importance of the corruption process <math display="inline">C</math>: the absence of <math display="inline">C</math> significantly degrades translation quality, though not as much as the absence of word-vector alignment. We can understand this in two related ways. First of all, if we view the model as learning to correct structural errors in word-by-word translations, then the corruption process introduces more errors of this kind, and so provides additional data upon which the model can train. Second, as Vincent et al. (2008) point out, de-noising auto-encoder training encourages a model to learn the structure of the manifold from which the data is drawn. By learning the structure of the source and target languages, the model can better correct the errors of word-by-word translation.<br />
<br />
[[File:MC_Alignment_Results.png|frame|none|alt=Alt text|From Conneau et al. (2017). The final row shows the performance of alignment method used in the present paper. Note the degradation in performance for more distant languages.]]<br />
<br />
[[File:MC_Translation_Ablation.png|frame|none|alt=Alt text|From the present paper. Results of an ablation study. Of note are the first, third, and fourth rows, which demonstrate that while the translation component of the loss is relatively unimportant, the word vector alignment scheme and de-noising auto-encoder matter a great deal.]]<br />
<br />
==Future Work==<br />
The principle of performing unsupervised translation by starting with a rough but reasonable guess, and then improving it using knowledge of the structure of the target language, seems promising. Word-by-word translation using word-vector alignment works well for closely related languages like English and French, but is unlikely to work as well for more distant languages. For those languages, a better method for getting an initial guess is required.<br />
<br />
==References==<br />
#Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).<br />
#Conneau, Alexis, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, Hervé Jégou. "Word Translation without Parallel Data". arXiv:1710.04087, (2017)<br />
# Dictionary, Shorter Oxford English. "Shorter Oxford english dictionary." (2007).<br />
#Goodfellow, Ian. "NIPS 2016 tutorial: Generative adversarial networks." arXiv preprint arXiv:1701.00160 (2016).<br />
# Hill, Felix, Kyunghyun Cho, and Anna Korhonen. "Learning distributed representations of sentences from unlabelled data." arXiv preprint arXiv:1602.03483 (2016).<br />
# Lample, Guillaume, Ludovic Denoyer, and Marc'Aurelio Ranzato. "Unsupervised Machine Translation Using Monolingual Corpora Only." arXiv preprint arXiv:1711.00043 (2017).<br />
#Papineni, Kishore, et al. "BLEU: a method for automatic evaluation of machine translation." Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002.<br />
# Mikolov, Tomas, Quoc V Le, and Ilya Sutskever. "Exploiting similarities among languages for machine translation." arXiv preprint arXiv:1309.4168. (2013).<br />
#Sennrich, Rico, Barry Haddow, and Alexandra Birch. "Improving neural machine translation models with monolingual data." arXiv preprint arXiv:1511.06709 (2015).<br />
# Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.<br />
# Vincent, Pascal, et al. "Extracting and composing robust features with denoising autoencoders." Proceedings of the 25th international conference on Machine learning. ACM, 2008.</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Unsupervised_Machine_Translation_Using_Monolingual_Corpora_Only&diff=36304stat946w18/Unsupervised Machine Translation Using Monolingual Corpora Only2018-04-18T13:24:54Z<p>Ws2chen: /* Overview of unsupervised translation system */</p>
<hr />
<div><br />
[[File:MC_Translation_Example.png]]<br />
== Introduction ==<br />
Neural machine translation systems are usually trained on large corpora consisting of pairs of pre-translated sentences. The paper ''Unsupervised Machine Translation Using Monolingual Corpora Only'' by Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato proposes an unsupervised neural machine translation system, which can be trained without such parallel data.<br />
<br />
==Motivation==<br />
The authors offer two motivations for their work:<br />
# To translate between languages for which large parallel corpora do not exist<br />
# To provide a strong lower bound on the performance that any semi-supervised machine translation system should be expected to achieve<br />
<br />
<br />
=== Note: What is a corpus (plural corpora)? ===<br />
<br />
In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed). Corpora are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory. A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus).<br />
<br />
== Overview of unsupervised translation system ==<br />
The unsupervised translation scheme has the following outline:<br />
* The word-vector embeddings of the source and target languages are aligned in an unsupervised manner.<br />
* Sentences from the source and target language are mapped to a common latent vector space by an encoder, and then mapped to probability distributions over sentences in the target or source language by a decoder.<br />
* A de-noising auto-encoder loss encourages the latent-space representations to be insensitive to noise.<br />
* An adversarial loss encourages the latent-space representations of source and target sentences to be indistinguishable from each other. It is intended that the latent-space representation of a sentence should reflect its meaning, and not the particular language in which it is expressed.<br />
* A reconstruction loss encourages the model to improve on the translation model of the previous epoch.<br />
<br />
This paper investigates whether it is possible to train a general machine translation system without any form of supervision whatsoever, assuming only that a monolingual corpus (explained earlier) exists for each language. This setup is interesting for two reasons. <br />
<br />
* First, this is applicable whenever we encounter a new language pair for which we have no annotation. <br />
<br />
* Second, it provides a strong lower bound performance on what any good semi-supervised approach is expected to yield.<br />
<br />
[[File:paper4_fig1.png|frame|none|alt=Alt text|A toy example illustrating the training process that guides the design of the objective function. The key idea here is to build a common latent space between languages. On the left, the model is trained to reconstruct a sentence from a noisy version of it in the same language. x is the target, C(x) is the noisy input, <math> \hat{x} </math> is the reconstruction. On the right, the model is trained to reconstruct a sentence given the same sentence but in another language.]]<br />
<br />
==Notation==<br />
Let <math>S</math> denote the set of words in the source language, and let <math>T</math> denote the set of words in the target language. Let <math>H \subset \mathbb{R}^{n_H}</math> denote the latent vector space. Moreover, let <math>S'</math> and <math>T'</math> denote the sets of finite sequences of words in the source and target language, and let <math>H'</math> denote the set of finite sequences of vectors in the latent space. For any set X, elide measure-theoretic details and let <math>\mathcal{P}(X)</math> denote the set of probability distributions over X.<br />
<br />
==Word vector alignment ==<br />
<br />
Conneau et al. (2017) describe an unsupervised method for aligning word vectors across languages. By "alignment", I mean that their method maps words with related meanings to nearby vectors, regardless of the language of the words. Moreover, if two words are one another's literal translations, their word vectors tend to be mutual nearest neighbors. <br />
<br />
The underlying idea of the alignment scheme can be summarized as follows: methods like word2vec or GloVe generate vectors for which there is a correspondence between semantics and geometry. If <math display="inline">f</math> maps English words to their corresponding vectors, we have the approximate equation<br />
\begin{align}<br />
f(\text{king}) -f(\text{man}) +f(\text{woman})\approx f(\text{queen}).<br />
\end{align}<br />
Furthermore, if <math display="inline">g</math> maps French words to their corresponding vectors, then <br />
\begin{align}<br />
g(\text{roi}) -g(\text{homme}) +g(\text{femme})\approx g(\text{reine}).<br />
\end{align}<br />
<br />
Thus if <math display="inline">W</math> maps the word vectors of English words to the word vectors of their French translations, we should expect <math display="inline">W</math> to be linear. As was observed by Mikolov et al. (2013), the problem of word-vector alignment then becomes a problem of learning the linear transformation that best aligns two point clouds, one from the source language and one from the target language. For more on the history of the word-vector alignment problem, see my CS698 project ([https://uwaterloo.ca/scholar/sites/ca.scholar/files/pa2forsy/files/project_dec_3_0.pdf link]).<br />
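<br />
To make the linearity argument concrete, here is a minimal sketch in plain Python. The two-dimensional vectors are invented for illustration (they are not taken from any trained model), and the "French" cloud is simply the "English" cloud rotated by an arbitrary angle, so that the map between them is linear by construction.<br />
<br />
```python
import math

def rotate(v, theta):
    """Apply a 2-D rotation (an orthogonal linear map) to the vector v."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def add(u, v):
    return (u[0] + v[0], u[1] + v[1])

def sub(u, v):
    return (u[0] - v[0], u[1] - v[1])

# Hypothetical English embeddings f(.), chosen so that the analogy holds exactly.
f = {"king": (3.0, 2.0), "man": (1.0, 0.0), "woman": (1.0, 1.0), "queen": (3.0, 3.0)}
translation = {"king": "roi", "man": "homme", "woman": "femme", "queen": "reine"}

# Hypothetical French embeddings g(.): the English cloud rotated by 40 degrees.
theta = math.radians(40)
g = {translation[w]: rotate(v, theta) for w, v in f.items()}

# Because the map W is linear, W(a - b + c) = W a - W b + W c,
# so the analogy that holds in the English space also holds in the French space.
lhs = add(sub(g["roi"], g["homme"]), g["femme"])
error = math.dist(lhs, g["reine"])
print(f"analogy error in the French space: {error:.2e}")
```
<br />
With real embeddings the analogy and the linear relationship hold only approximately, which is why the alignment has to be learned rather than read off.<br />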
<br />
Conneau et al. (2017)'s word vector alignment scheme is unique in that it requires no parallel data, and uses only the shapes of the two word-vector point clouds to be aligned. I will not go into detail, but the heart of the method is a special GAN, in which only the discriminator is a neural network, and the generator is the map corresponding to an orthogonal matrix.<br />
<br />
This unsupervised alignment method is crucial to the translation scheme of the current paper. From now on we denote by <br />
<math display="inline">A: S' \cup T' \to \mathcal{Z}'</math> the function that maps a source- or target-language word sequence to the corresponding aligned word vector sequence, where <math display="inline">\mathcal{Z}'</math> is the set of finite sequences of aligned word vectors.<br />
<br />
==Encoder ==<br />
The encoder <math display="inline">E </math> reads a sequence of word vectors <math display="inline">(z_1,\ldots, z_m) \in \mathcal{Z}'</math> and outputs a sequence of hidden states <math display="inline">(h_1,\ldots, h_m) \in H'</math> in the latent space. Crucially, because the word vectors of the two languages have been aligned, the same encoder can be applied to both. That is, to map a source sentence <math display="inline">x=(x_1,\ldots, x_M)\in S'</math> to the latent space, we compute <math display="inline">E(A(x))</math>, and to map a target sentence <math display="inline">y=(y_1,\ldots, y_K)\in T'</math> to the latent space, we compute <math display="inline">E(A(y))</math>.<br />
<br />
The encoder consists of two LSTMs, one of which reads the word-vector sequence in the forward direction, and one of which reads it in the backward direction. The hidden state sequence is generated by concatenating the hidden states produced by the forward and backward LSTMs at each word vector.<br />
<br />
==Decoder==<br />
<br />
The decoder is a unidirectional LSTM that accepts a sequence of hidden states <math display="inline">h=(h_1,\ldots, h_M) \in H'</math> from the latent space and a language <math display="inline">L \in \{S,T \}</math> and outputs a probability distribution over sentences in that language. We have<br />
<br />
\begin{align}<br />
D: H' \times \{S,T \} \to \mathcal{P}(S') \cup \mathcal{P}(T').<br />
\end{align}<br />
<br />
The decoder makes use of the attention mechanism of Bahdanau et al. (2014). To compute the probability of a given sentence <math display="inline">y=(y_1,\ldots,y_K)</math>, the LSTM processes the sentence one word at a time, accepting at step <math display="inline">k</math> the aligned word vector of the previous word in the sentence <math display="inline">A(y_{k-1})</math> and a context vector <math display="inline">c_k\in H</math> computed from the hidden sequence <math display="inline">h\in H'</math>, and outputting a probability distribution over possible next words. The LSTM is initialized with a special, language-specific start-of-sequence token. Otherwise, the decoder does not depend on the language of the sentence it is producing. The context vector is computed as described by Bahdanau et al. (2014), where we let <math display="inline">l_{k}</math> denote the hidden state of the LSTM at step <math display="inline">k</math>, and where <math display="inline">U,W</math> are learnable weight matrices, and <math display="inline">v</math> is a learnable weight vector:<br />
\begin{align}<br />
c_k&= \sum_{m=1}^M \alpha_{k,m} h_m\\<br />
\alpha_{k,m}&= \frac{\exp(e_{k,m})}{\sum_{m'=1}^M\exp(e_{k,m'}) },\\<br />
e_{k,m} &= v^T \tanh (Wl_{k-1} + U h_m ).<br />
\end{align}<br />
<br />
<br />
By learning <math display="inline">U,W</math> and <math display="inline">v</math>, the decoder can learn to decide which vectors in the sequence <math display="inline">h</math> are relevant to computing which words in the output sentence.<br />
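<br />
The three attention equations above can be sketched in a few lines of plain Python. This is only an illustration of the formulas: the function names and the toy values of <math display="inline">W, U, v</math> are assumptions of the sketch, not parameters from the paper.<br />
<br />
```python
import math

def matvec(M, x):
    """Multiply the matrix M (a list of rows) by the vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def attention_context(l_prev, h, W, U, v):
    """Return (c_k, alpha_k) given the decoder state l_{k-1} and encoder states h."""
    Wl = matvec(W, l_prev)
    scores = []
    for h_m in h:
        Uh = matvec(U, h_m)
        # e_{k,m} = v^T tanh(W l_{k-1} + U h_m)
        scores.append(sum(v_i * math.tanh(a + b) for v_i, a, b in zip(v, Wl, Uh)))
    # Softmax over the encoder positions (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # c_k = sum_m alpha_{k,m} h_m
    dim = len(h[0])
    context = [sum(a * h_m[d] for a, h_m in zip(alpha, h)) for d in range(dim)]
    return context, alpha

# Toy values, purely hypothetical.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
context, alpha = attention_context([0.5, -0.5], h, identity, identity, [1.0, 1.0])
print("weights:", [round(a, 3) for a in alpha])
print("context:", [round(c, 3) for c in context])
```
<br />
The weights are nonnegative and sum to one, so the context vector is a convex combination of the encoder's hidden states.<br />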
<br />
At step <math display="inline">k</math>, after receiving the context vector <math display="inline">c_k\in H</math> and the aligned word vector of the previous word in the sequence, <math display="inline">A(y_{k-1})</math>, the LSTM outputs a probability distribution over words, which should be interpreted as the distribution of the next word according to the decoder. The probability the decoder assigns to a sentence is then the product of the probabilities computed for each word in this manner.<br />
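<br />
As a small illustration of the last point, the sketch below scores a sentence from a list of per-step distributions. The distributions are made up; in the actual model each one would come from the LSTM decoder. Working with a sum of logs rather than a product of probabilities is a standard numerical convenience, not something specific to this paper.<br />
<br />
```python
import math

def sentence_log_prob(step_distributions, sentence):
    """Sum of log p(y_k | y_1..y_{k-1}, h) over the words of the sentence."""
    return sum(math.log(dist[word]) for dist, word in zip(step_distributions, sentence))

# Hypothetical next-word distributions for a two-word vocabulary.
step_distributions = [
    {"le": 0.5, "chat": 0.5},   # distribution over the first word
    {"le": 0.2, "chat": 0.8},   # distribution over the second word, given the first
]
log_p = sentence_log_prob(step_distributions, ["le", "chat"])
print("sentence probability:", round(math.exp(log_p), 6))
```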
<br />
[[File:paper4_fig2.png|700px|]]<br />
<br />
==Overview of objective ==<br />
The objective function is the sum of:<br />
# The de-noising auto-encoder loss,<br />
# The translation loss,<br />
# The adversarial loss.<br />
I shall describe these in the following sections.<br />
<br />
==De-noising Auto-encoder Loss == <br />
A de-noising auto-encoder is a function optimized to map a corrupted sample from some dataset to the original un-corrupted sample. De-noising auto-encoders were introduced by Vincent et al. (2008), who provided numerous justifications, one of which is particularly illuminating. If we think of the dataset of interest as a thin manifold in a high-dimensional space, the corruption process is likely to perturb a datapoint off the manifold. To learn to restore the corrupted datapoint, the de-noising auto-encoder must learn the shape of the manifold.<br />
<br />
Hill et al. (2016) used a de-noising auto-encoder to learn vectors representing sentences. They corrupted input sentences by randomly dropping and swapping words, and then trained a neural network to map the corrupted sentence to a vector, and then map the vector to the un-corrupted sentence. Interestingly, they found that sentence vectors learned this way were particularly effective when applied to tasks that involved generating paraphrases. This makes some sense: for a vector to be useful in restoring a corrupted sentence, it must capture something of the sentence's underlying meaning.<br />
<br />
The present paper uses the principle of de-noising auto-encoders to compute one of the terms in its loss function. In each iteration, a sentence is sampled from the source or target language, and a corruption process <math display="inline"> C</math> is applied to it. <math display="inline"> C</math> works by deleting each word in the sentence with probability <math display="inline">p_C</math> and applying to the sentence a permutation randomly selected from those that do not move words more than <math display="inline">k_C</math> spots from their original positions. The authors select <math display="inline">p_C=0.1</math> and <math display="inline">k_C=3</math>. The corrupted sentence is then mapped to the latent space using <math display="inline">E\circ A</math>. The loss is then the negative log probability of the original un-corrupted sentence according to the decoder <math display="inline">D</math> applied to the latent-space sequence.<br />
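<br />
A minimal sketch of a corruption process of this kind is given below. The way the bounded permutation is sampled here (sorting positions after adding uniform noise in <math display="inline">[0, k_C+1)</math>) is an assumption of the sketch: it guarantees that no word moves more than <math display="inline">k_C</math> positions, but it is not claimed to be uniform over all such permutations. The order of the two steps (drop first, then shuffle) and all function names are likewise illustrative.<br />
<br />
```python
import random

def drop_words(words, p_drop, rng):
    """Delete each word independently with probability p_drop."""
    return [w for w in words if rng.random() >= p_drop]

def bounded_permutation(n, k, rng):
    """Return sigma, where sigma[j] is the original index of the word placed at j.

    Sorting the keys i + U(0, k + 1) never moves an index more than k positions.
    """
    keys = [i + rng.random() * (k + 1) for i in range(n)]
    return sorted(range(n), key=keys.__getitem__)

def corrupt(words, p_drop=0.1, k=3, rng=None):
    """Hypothetical stand-in for C: word dropout followed by a local shuffle."""
    rng = rng or random.Random()
    kept = drop_words(words, p_drop, rng)
    sigma = bounded_permutation(len(kept), k, rng)
    return [kept[i] for i in sigma]

sentence = "the quick brown fox jumps over the lazy dog".split()
print(corrupt(sentence, rng=random.Random(0)))
```
<br />
Setting <code>p_drop=0</code> and <code>k=0</code> turns the function into the identity, which is a convenient sanity check.<br />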
<br />
The explanation of Vincent et al. (2008) can help us understand this loss-function term: the de-noising auto-encoder loss forces the translation system to learn the shapes of the manifolds of the source and target languages.<br />
<br />
==Translation Loss==<br />
To compute the translation loss, we sample a sentence from one of the languages, translate it with the encoder and decoder of the previous epoch, and then corrupt its output with <math display="inline">C</math>. We then use the current encoder <math display="inline">E</math> to map the corrupted translation to a sequence <math display="inline">h \in H'</math> and the decoder <math display="inline">D</math> to map <math display="inline">h</math> to a probability distribution over sentences. The translation loss is the negative log probability the decoder assigns to the original uncorrupted sentence. <br />
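<br />
The data flow of this loss can be sketched as follows. Every component is passed in as a callable, and the toy stand-ins below (a dictionary lookup for the previous-epoch model, a deterministic "corruption", an identity encoder, and a hand-written scoring rule for the decoder) are hypothetical placeholders chosen only so that the example runs; they bear no resemblance to the real networks.<br />
<br />
```python
import math

def translation_loss(x, lang, other, prev_translate, corrupt, encode, decode_log_prob):
    """Negative log probability of x after a round trip through the other language."""
    y = prev_translate(x, other)   # translation by the frozen previous-epoch model
    y_noisy = corrupt(y)           # corruption process C
    h = encode(y_noisy)            # current encoder E
    return -decode_log_prob(h, lang, x)   # current decoder D, scoring the original x

# --- Toy stand-ins (all hypothetical) ---
EN_TO_FR = {"the": "le", "cat": "chat"}

def prev_translate(sentence, other):
    return [EN_TO_FR[w] for w in sentence]   # word-by-word lookup, ignores `other`

def corrupt(sentence):
    return sentence[1:]                      # deterministic stand-in: drop the first word

def encode(sentence):
    return sentence                          # identity "encoder"

def decode_log_prob(h, lang, sentence):
    # A word is "likely" (0.9) if its translation survived in h, "unlikely" (0.1) otherwise.
    return sum(math.log(0.9 if EN_TO_FR[w] in h else 0.1) for w in sentence)

loss = translation_loss(["the", "cat"], "en", "fr",
                        prev_translate, corrupt, encode, decode_log_prob)
print("translation loss:", round(loss, 4))
```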
<br />
It is interesting and useful to consider why this translation loss, which depends on the translation model of the previous iteration, should promote an improved translation model in the current iteration. One loose way to understand this is to think of the translator as a de-noising translator. We are given a sentence perturbed from the manifold of possible sentences from a given language both by the corruption process and by the poor quality of the translation. The model must learn to both project and translate. The technique employed here resembles that used by Sennrich et al. (2015), who trained a neural machine translation system using both parallel and monolingual data. To make use of the monolingual target-language data, they used an auxiliary model to translate it to the source language, then trained their model to reconstruct the original target-language data from the source-language translation. Sennrich et al. argued that training the model to reconstruct true data from synthetic data was more robust than the opposite approach. The authors of the present paper use similar reasoning.<br />
<br />
==Adversarial Loss ==<br />
The intuition underlying the latent space is that it should encode the meaning of a sentence in a language-independent way. Accordingly, the authors introduce an adversarial loss, to encourage latent-space vectors mapped from the source and target languages to be indistinguishable. Central to this adversarial loss is the discriminator <math display="inline">R:H' \to [0,1]</math>, which makes use of <math display="inline">r: H\to [0,1]</math>, a three-layer fully-connected neural network with 1024 hidden units per layer. Given a sequence of latent-space vectors <math display="inline">h=(h_1,\ldots,h_m)\in H'</math>, the discriminator assigns probability <math display="inline">R(h)=\prod_{i=1}^m r(h_i)</math> that they originated in the target language. Each iteration, the discriminator is trained to maximize the objective function<br />
<br />
\begin{align}<br />
I_T(q) \log (R(E(q))) +(1-I_T(q) )\log(1-R(E(q)))<br />
\end{align}<br />
<br />
where <math display="inline">q</math> is a randomly selected sentence, and <math display="inline">I_T(q)</math> is 1 when <math display="inline">q\in T'</math> is from the target language and 0 when <math display="inline">q\in S'</math> is from the source language.<br />
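<br />
The sketch below evaluates this objective for a toy latent sequence. A single logistic unit stands in for the three-layer network <math display="inline">r</math>, and the weights and latent vectors are invented; only the product form of <math display="inline">R</math> and the shape of the objective follow the description above.<br />
<br />
```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def r(h_i, w, b):
    """Stand-in for the per-vector discriminator r: a single logistic unit."""
    return sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, h_i)) + b)

def R(h, w, b):
    """Probability assigned to 'the whole latent sequence came from the target language'."""
    return math.prod(r(h_i, w, b) for h_i in h)

def discriminator_objective(h, is_target, w, b):
    """I_T log R(h) + (1 - I_T) log(1 - R(h)), which the discriminator maximizes."""
    p = R(h, w, b)
    return math.log(p) if is_target else math.log(1.0 - p)

# Hypothetical latent sequence E(q) and discriminator parameters.
h = [[1.0, 0.0], [0.5, 0.5]]
w, b = [2.0, -1.0], 0.0
print("R(h) =", round(R(h, w, b), 4))
print("objective if q is a target sentence:", round(discriminator_objective(h, True, w, b), 4))
print("objective if q is a source sentence:", round(discriminator_objective(h, False, w, b), 4))
```
<br />
Because <math display="inline">R</math> is a product of per-vector probabilities, longer sequences tend to receive smaller values of <math display="inline">R(h)</math> unless every factor is close to one.<br />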
<br />
The same term is added to the primary objective function, which the encoder and decoder are trained to minimize. The result is that the encoder and decoder learn to fool the discriminator by mapping sentences from the source and target language to similar sequences of latent-space vectors.<br />
<br />
<br />
The authors note that they make use of label smoothing, a technique recommended by Goodfellow (2016) for regularizing GANs, in which the objective described above is replaced by <br />
<br />
\begin{align}<br />
I_T(q)( (1-\alpha)\log (R(E(q))) +\alpha\log(1-R(E(q))) )+(1-I_T(q) ) ( (1-\beta) \log(1-R(E(q))) +\beta\log (R(E(q)) ))<br />
\end{align}<br />
for some small nonnegative values of <math display="inline">\alpha, \beta</math>, the idea being to prevent the discriminator from making extreme predictions. While one-sided label smoothing (<math display="inline">\beta = 0</math>) is generally recommended, the present model differs from a standard GAN in that it is symmetric, and hence two-sided label smoothing would appear more reasonable.<br />
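<br />
To see why smoothing discourages extreme predictions, the sketch below evaluates the smoothed objective on a grid of discriminator outputs and locates its maximizer. The value 0.1 for <math display="inline">\alpha</math> and <math display="inline">\beta</math> is an arbitrary choice for illustration, not a value reported in the paper.<br />
<br />
```python
import math

def smoothed_objective(p_target, is_target, alpha=0.1, beta=0.1):
    """Label-smoothed discriminator objective for one sentence.

    p_target plays the role of R(E(q)), the predicted probability of 'target'.
    """
    if is_target:
        return (1 - alpha) * math.log(p_target) + alpha * math.log(1 - p_target)
    return (1 - beta) * math.log(1 - p_target) + beta * math.log(p_target)

# Grid search for the prediction that maximizes the objective on a target sentence.
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda p: smoothed_objective(p, is_target=True))
print("best prediction for a target sentence:", best)
```
<br />
Setting the derivative of <math display="inline">(1-\alpha)\log p + \alpha \log(1-p)</math> to zero gives <math display="inline">p = 1-\alpha</math>, so with smoothing the discriminator is rewarded for stopping short of a prediction of exactly 1; with <math display="inline">\alpha = 0</math> the objective keeps increasing as <math display="inline">p \to 1</math>.<br />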
<br />
<br />
It is interesting to observe that while the intuition justifying the use of the latent space suggests that the latent space representation of a sentence should be language-independent, this is not actually true: if two sentences are translations of one another, but have different lengths, their latent-space representations will necessarily be different, since a sentence's latent space representation has the same length as the sentence itself.<br />
<br />
==Objective Function==<br />
<br />
Combining the above-described terms, we can write the overall objective function. Let <math display="inline">Q_S</math> denote the monolingual dataset for the source language, and let <math display="inline">Q_T</math> denote the monolingual dataset for the target language. Let <math display="inline">D_S:= D(\cdot, S)</math> and <math display="inline">D_T:= D(\cdot, T)</math> be the decoder restricted to the source and target language, respectively. Let <math display="inline"> M_S </math> and <math display="inline"> M_T </math> denote the target-to-source and source-to-target translation models of the previous epoch. Then our objective function is<br />
<br />
\begin{align}<br />
\mathcal{L}(D,E,R)=\text{T Translation Loss}+\text{T De-noising Loss} +\text{T Adversarial Loss} +\text{S Translation Loss} +\text{S De-noising Loss} +\text{S Adversarial Loss}\\<br />
\end{align}<br />
\begin{align}<br />
=\sum_{q\in Q_T}\left( -\log D_T \circ E \circ C \circ M_S(q) (q) -\log D_T \circ E \circ C (q) (q)+(1-\alpha)\log (R\circ E(q)) +\alpha\log(1-R\circ E(q)) \right)+\sum_{q\in Q_S}\left( -\log D_S \circ E \circ C \circ M_T (q) (q) -\log D_S \circ E \circ C (q) (q)+(1-\beta) \log(1-R \circ E(q)) +\beta\log (R\circ E(q)) \right).<br />
\end{align}<br />
<br />
The authors alternate between iterations minimizing <math display="inline">\mathcal{L} </math> with respect to <math display="inline">E, D</math> and iterations maximizing with respect to <math display="inline">R</math>. Adam is used for minimization, while RMSProp is used for maximization. After each epoch, <math display="inline">M</math> is updated so that <math display="inline">M_S=D_S \circ E</math> and <math display="inline">M_T=D_T \circ E</math>, after which <math display="inline"> M </math> is frozen until the next epoch.<br />
<br />
==Validation==<br />
The authors' aim is for their method to be completely unsupervised, so they do not use parallel corpora even for the selection of hyper-parameters. Instead, they validate by translating sentences to the other language and back, and comparing the resulting sentence with the original according to BLEU, a similarity metric frequently used in translation (Papineni et al. 2002).<br />
<br />
As justification, they show empirically that the score generated by applying BLEU on back-and-forth translation is correlated with applying BLEU using parallel corpora.<br />
[[File:paper4fig3.png]]<br />
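<br />
The round-trip criterion can be sketched as below. The scoring function is a deliberately simplified, sentence-level variant of BLEU (unigram and bigram precision with a brevity penalty), not the full corpus-level metric of Papineni et al. (2002), and the two dictionary "translators" are hypothetical stand-ins for the learned models.<br />
<br />
```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=2):
    """Simplified BLEU: geometric mean of clipped n-gram precisions times a brevity penalty."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, ref[g]) for g, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(sum(log_precisions) / max_n)

def round_trip_score(sentence, to_other, back):
    """Translate to the other language and back, then compare with the original."""
    return simple_bleu(back(to_other(sentence)), sentence)

# Hypothetical word-by-word translators.
en_fr = {"the": "le", "cat": "chat", "sleeps": "dort"}
fr_en_good = {"le": "the", "chat": "cat", "dort": "sleeps"}
fr_en_lossy = {"le": "the", "chat": "cat", "dort": "sleep"}

def lookup(table):
    return lambda sentence: [table[w] for w in sentence]

sentence = ["the", "cat", "sleeps"]
perfect = round_trip_score(sentence, lookup(en_fr), lookup(fr_en_good))
lossy = round_trip_score(sentence, lookup(en_fr), lookup(fr_en_lossy))
print("perfect round trip:", perfect)
print("lossy round trip:", round(lossy, 4))
```
<br />
No parallel sentence is consulted at any point: the only reference is the original monolingual sentence itself.<br />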
<br />
==Experimental Procedure and Results==<br />
<br />
The authors test their method on four data sets. The first is from the English-French translation task of the Workshop on Machine Translation 2014 (WMT14). This data set consists of parallel data. The authors generate a monolingual English corpus by randomly sampling 15 million sentence pairs, and choosing only the English sentences. They then generate a French corpus by selecting the French sentences from those pairs that were not previously chosen. Importantly, this means that the monolingual data sets have no parallel sentences. The second data set is generated from the English-German translation task from WMT14 using the same procedure.<br />
<br />
The third and fourth data sets are generated from the Multi30k data set, which consists of multilingual captions of various images. The images are discarded and the English, French, and German captions are used to generate monolingual data sets in the manner described above. These monolingual corpora are much smaller, consisting of 14500 sentences each.<br />
<br />
The unsupervised translation scheme performs well, though not as well as a supervised translation scheme. It converges after a small number of epochs. Besides supervised translation, the authors compare their method with three other baselines: "Word-by-Word" uses only the previously-discussed word-alignment scheme; "Word-Reordering" uses a simple LSTM based language model and a greedy algorithm to select a reordering of the words produced by "Word-by-Word". "Oracle Word Reordering" means the optimal reordering of the words produced by "Word-by-Word".<br />
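<br />
The "Word-by-Word" baseline can be illustrated with a nearest-neighbour lookup in the shared embedding space. The vectors below are invented, and plain cosine similarity is used as the retrieval rule; that choice is an assumption of this sketch rather than a description of the exact criterion used by the authors.<br />
<br />
```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def word_by_word(sentence, src_vectors, tgt_vectors):
    """Replace each source word by the target word with the most similar aligned vector."""
    output = []
    for word in sentence:
        z = src_vectors[word]
        output.append(max(tgt_vectors, key=lambda t: cosine(z, tgt_vectors[t])))
    return output

# Hypothetical aligned embeddings for a three-word vocabulary in each language.
english = {"the": (1.0, 0.1), "cat": (0.1, 1.0), "black": (1.0, 1.0)}
french = {"le": (1.0, 0.0), "chat": (0.0, 1.0), "noir": (0.9, 1.1)}

result = word_by_word(["the", "black", "cat"], english, french)
print(" ".join(result))
```
<br />
Each word is translated correctly, but the output keeps the English word order instead of the French "le chat noir". Errors of this kind are what the reordering baselines, and the full model, are meant to correct.<br />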
<br />
The discriminator is an MLP with 3 hidden layers of size 1024, Leaky-ReLU activation functions and an output logistic unit. The encoder and the decoder are trained using Adam with a learning rate of 0.0003, and a mini-batch size of 32. The discriminator is trained using RMSProp with a learning rate of 0.0005.<br />
<br />
==Result Figures==<br />
[[File:MC_Translation Results.png]]<br />
[[File:MC_Translation_Convergence.png]]<br />
<br />
==Commentary==<br />
This paper's results are impressive: that it is even possible to translate between languages without parallel data suggests that languages are more similar than we might initially suspect, and that the method the authors present has, at least in part, discovered some common deep structure. As the authors point out, using no parallel data at all, their method is able to produce results comparable to those produced by neural machine translation methods trained on hundreds of thousands of parallel sentences on the WMT dataset. On the other hand, the results they offer come with a few significant caveats.<br />
<br />
The first caveat is that the workhorse of the method is the unsupervised word-vector alignment scheme presented in Conneau et al. (2017) (that paper shares 3 authors with this one). As the ablation study reveals, without word-vector alignment, this method performs extremely poorly. Moreover, word-by-word translation using word-vector alignment alone performs well, albeit not as well as this method. This suggests that the method of this paper mainly learns to perform (sometimes significant) corrections to word-by-word translations by reordering and occasional word substitution. Presumably, it does this by learning something of the natural structure of sentences in each of the two languages, so that it can correct the errors made by word-by-word translation.<br />
<br />
The second caveat is that the best results are attained translating between English and French, two very closely related languages, and the quality of translation between English and German, a slightly less closely related pair, is significantly worse. (According to the ''Shorter Oxford English Dictionary'', 28.3 percent of the English vocabulary is French-derived, 28.2 percent is Latin-derived, and 25 percent is derived from Germanic languages. This probably understates the degree of correspondence between the French and English vocabularies, since French likely derives from Latin many of the same words English does.) The authors do not report results with more distantly-related pairs, but it is reasonable to expect that performance would degrade significantly, for two reasons. Firstly, Conneau et al. (2017) show that the word-alignment scheme performs much worse on more distant language pairs. This may be because there are more one-to-one correspondences between the words of closely related languages than there are between more distant languages. Secondly, because the same encoder is used to read sentences of both languages, the encoder cannot adapt to the unique word-order properties of either language. This would become a problem for language pairs with very different grammar. The authors suggest that their scheme could be a useful tool for translating between language pairs for which there are few parallel corpora. However, language pairs lacking parallel corpora are often (though not always) distantly related, and it is for such pairs that the performance of the present method likely suffers.<br />
<br />
<br />
<br />
<br />
The proposed method always beats Oracle Word Reordering on the Multi30k data set, but sometimes does not on the WMT data set. This may be because the WMT sentences are much more syntactically complex than the simple image captions of the Multi30k data set.<br />
<br />
The ablation study also reveals the importance of the corruption process <math display="inline">C</math>: the absence of <math display="inline">C</math> significantly degrades translation quality, though not as much as the absence of word-vector alignment. We can understand this in two related ways. First of all, if we view the model as learning to correct structural errors in word-by-word translations, then the corruption process introduces more errors of this kind, and so provides additional data upon which the model can train. Second, as Vincent et al. (2008) point out, de-noising auto-encoder training encourages a model to learn the structure of the manifold from which the data is drawn. By learning the structure of the source and target languages, the model can better correct the errors of word-by-word translation.<br />
<br />
[[File:MC_Alignment_Results.png|frame|none|alt=Alt text|From Conneau et al. (2017). The final row shows the performance of the alignment method used in the present paper. Note the degradation in performance for more distant languages.]]<br />
<br />
[[File:MC_Translation_Ablation.png|frame|none|alt=Alt text|From the present paper. Results of an ablation study. Of note are the first, third, and fourth rows, which demonstrate that while the translation component of the loss is relatively unimportant, the word vector alignment scheme and de-noising auto-encoder matter a great deal.]]<br />
<br />
==Future Work==<br />
The principle of performing unsupervised translation by starting with a rough but reasonable guess, and then improving it using knowledge of the structure of the target language, seems promising. Word-by-word translation using word-vector alignment works well for closely related languages like English and French, but is unlikely to work as well for more distant languages. For those languages, a better method for getting an initial guess is required.<br />
<br />
==References==<br />
# Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).<br />
# Conneau, Alexis, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. "Word translation without parallel data." arXiv preprint arXiv:1710.04087 (2017).<br />
# ''Shorter Oxford English Dictionary'' (2007).<br />
# Goodfellow, Ian. "NIPS 2016 tutorial: Generative adversarial networks." arXiv preprint arXiv:1701.00160 (2016).<br />
# Hill, Felix, Kyunghyun Cho, and Anna Korhonen. "Learning distributed representations of sentences from unlabelled data." arXiv preprint arXiv:1602.03483 (2016).<br />
# Lample, Guillaume, Ludovic Denoyer, and Marc'Aurelio Ranzato. "Unsupervised machine translation using monolingual corpora only." arXiv preprint arXiv:1711.00043 (2017).<br />
# Mikolov, Tomas, Quoc V. Le, and Ilya Sutskever. "Exploiting similarities among languages for machine translation." arXiv preprint arXiv:1309.4168 (2013).<br />
# Papineni, Kishore, et al. "BLEU: a method for automatic evaluation of machine translation." Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002.<br />
# Sennrich, Rico, Barry Haddow, and Alexandra Birch. "Improving neural machine translation models with monolingual data." arXiv preprint arXiv:1511.06709 (2015).<br />
# Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.<br />
# Vincent, Pascal, et al. "Extracting and composing robust features with denoising autoencoders." Proceedings of the 25th international conference on Machine learning. ACM, 2008.</div>Ws2chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Unsupervised_Machine_Translation_Using_Monolingual_Corpora_Only&diff=36303stat946w18/Unsupervised Machine Translation Using Monolingual Corpora Only2018-04-18T13:24:17Z<p>Ws2chen: /* Overview of unsupervised translation system */</p>
<hr />
<div><br />
[[File:MC_Translation_Example.png]]<br />
== Introduction ==<br />
Neural machine translation systems are usually trained on large corpora consisting of pairs of pre-translated sentences. The paper ''Unsupervised Machine Translation Using Monolingual Corpora Only'' by Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato proposes an unsupervised neural machine translation system, which can be trained without such parallel data.<br />
<br />
==Motivation==<br />
The authors offer two motivations for their work:<br />
# To translate between languages for which large parallel corpora do not exist<br />
# To provide a strong lower bound on the performance that any semi-supervised machine translation system should achieve<br />
<br />
<br />
=== Note: What is a corpus (plural corpora)? ===<br />
<br />
In linguistics, a corpus (plural corpora), or text corpus, is a large and structured set of texts (nowadays usually electronically stored and processed). Corpora are used for statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory. A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus).<br />
<br />
== Overview of unsupervised translation system ==<br />
The unsupervised translation scheme has the following outline:<br />
* The word-vector embeddings of the source and target languages are aligned in an unsupervised manner.<br />
* Sentences from the source and target language are mapped to a common latent vector space by an encoder, and then mapped to probability distributions over sentences in the target or source language by a decoder.<br />
* A de-noising auto-encoder loss encourages the latent-space representations to be insensitive to noise.<br />
* An adversarial loss encourages the latent-space representations of source and target sentences to be indistinguishable from each other. It is intended that the latent-space representation of a sentence should reflect its meaning, and not the particular language in which it is expressed.<br />
* A reconstruction loss encourages the model to improve on the translation model of the previous epoch.<br />
<br />
This paper investigates whether it is possible to train a general machine translation system without any form of supervision whatsoever, assuming only that a monolingual corpus (explained earlier) exists for each language. This setup is interesting for two reasons. <br />
<br />
* First, this is applicable whenever we encounter a new language pair for which we have no annotation. <br />
<br />
* Second, it provides a strong lower bound performance on what any good semi-supervised approach is expected to yield.<br />
<br />
[[File:paper4_fig1.png|frame|none|alt=Alt text|A toy example illustrating the training process that guides the design of the objective function. The key idea here is to build a common latent space between languages. On the left, the model is trained to reconstruct a sentence from a noisy version of it in the same language. x is the target, C(x) is the noisy input, <math> \hat{x} </math> is the reconstruction. On the right, the model is trained to reconstruct a sentence given the same sentence but in another language.]]<br />
<br />
==Notation==<br />
Let <math>S</math> denote the set of words in the source language, and let <math>T</math> denote the set of words in the target language. Let <math>H \subset \mathbb{R}^{n_H}</math> denote the latent vector space. Moreover, let <math>S'</math> and <math>T'</math> denote the sets of finite sequences of words in the source and target language, and let <math>H'</math> denote the set of finite sequences of vectors in the latent space. For any set X, elide measure-theoretic details and let <math>\mathcal{P}(X)</math> denote the set of probability distributions over X.<br />
<br />
==Word vector alignment ==<br />
<br />
Conneau et al. (2017) describe an unsupervised method for aligning word vectors across languages. By "alignment", I mean that their method maps words with related meanings to nearby vectors, regardless of the language of the words. Moreover, if two words are one another's literal translations, their word vectors tend to be mutual nearest neighbors. <br />
<br />
The underlying idea of the alignment scheme can be summarized as follows: methods like word2vec or GloVe generate vectors for which there is a correspondence between semantics and geometry. If <math display="inline">f</math> maps English words to their corresponding vectors, we have the approximate equation<br />
\begin{align}<br />
f(\text{king}) -f(\text{man}) +f(\text{woman})\approx f(\text{queen}).<br />
\end{align}<br />
Furthermore, if <math display="inline">g</math> maps French words to their corresponding vectors, then <br />
\begin{align}<br />
g(\text{roi}) -g(\text{homme}) +g(\text{femme})\approx g(\text{reine}).<br />
\end{align}<br />
<br />
Thus if <math display="inline">W</math> maps the word vectors of English words to the word vectors of their French translations, we should expect <math display="inline">W</math> to be linear. As was observed by Mikolov et al. (2013), the problem of word-vector alignment then becomes a problem of learning the linear transformation that best aligns two point clouds, one from the source language and one from the target language. For more on the history of the word-vector alignment problem, see my CS698 project ([https://uwaterloo.ca/scholar/sites/ca.scholar/files/pa2forsy/files/project_dec_3_0.pdf link]).<br />
<br />
Conneau et al. (2017)'s word vector alignment scheme is unique in that it requires no parallel data, and uses only the shapes of the two word-vector point clouds to be aligned. I will not go into detail, but the heart of the method is a special GAN, in which only the discriminator is a neural network, and the generator is the map corresponding to an orthogonal matrix.<br />
<br />
This unsupervised alignment method is cr