http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=Z43ma&feedformat=atomstatwiki - User contributions [US]2024-03-29T07:02:40ZUser contributionsMediaWiki 1.41.0http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Wavelet_Pooling_For_Convolutional_Neural_Networks&diff=42172stat946w18/Wavelet Pooling For Convolutional Neural Networks2018-12-01T00:27:15Z<p>Z43ma: Detail in conclusion and proposed method.</p>
<hr />
<div>=Wavelet Pooling For Convolutional Neural Networks=<br />
<br />
<br />
== Introduction, Important Terms and Brief Summary==<br />
<br />
This paper focuses on the following important techniques: <br />
<br />
1) Convolutional Neural Networks (CNNs): These are networks with layered structures that conform to the shape of the inputs rather than relying on vector-based features, and they consistently obtain high accuracy in the classification of images and objects. Researchers continue to work on improving CNN performance. <br />
<br />
2) Pooling: Pooling subsamples the results of the convolution layers and gradually reduces spatial dimensions of the data throughout the network. It is done to reduce parameters, increase computational efficiency and regulate overfitting. <br />
<br />
Some pooling methods, including max pooling and average pooling, are deterministic. Deterministic pooling methods are efficient and simple, but can hinder the potential for optimal network learning. In contrast, mixed pooling and stochastic pooling use a probabilistic approach, which can address some problems of deterministic methods. All of these pooling methods use a neighborhood approach because of its simplicity and efficiency. Nevertheless, the neighborhood approach can cause edge halos, blurring, and aliasing, which need to be minimized. This paper introduces wavelet pooling, which uses a second-level wavelet decomposition to subsample features. Nearest-neighbor interpolation is replaced by an organic, subband method that more accurately represents the feature contents with fewer artifacts. The method decomposes features into a second-level decomposition and discards the first-level subbands to reduce feature dimensions. The method is compared against other state-of-the-art pooling methods on benchmark classification datasets such as MNIST, CIFAR-10, SVHN and KDEF.<br />
<br />
For further information on wavelets, follow this link to MathWorks' [https://www.mathworks.com/videos/understanding-wavelets-part-1-what-are-wavelets-121279.html Understanding Wavelets] video series.<br />
<br />
== Intuition ==<br />
<br />
Convolutional networks commonly employ convolutional layers to extract features and use pooling methods for spatial dimensionality reduction. In this study, wavelet pooling is introduced as an alternative to traditional neighborhood pooling that provides a more structural method of reducing feature dimensions. Max pooling is said to be prone to over-fitting, and average pooling is said to smooth out or 'dilute' details in features.<br />
<br />
Pooling is often introduced within networks to provide local invariance, which helps prevent overfitting to small translational shifts within an image. Despite their effectiveness, traditional pooling methods such as max pooling achieve this translational invariance by discarding information, using methods analogous to nearest-neighbour interpolation. Hoping to provide a more organic way of pooling, the authors use all of the information in the cells that enter a pooling operation, so that the resulting dimension-reduced features contain information from all of the input cells through various dot products.<br />
<br />
== History ==<br />
<br />
The study references the following milestones in the history of pooling methods:<br />
* Manual subsampling in 1979<br />
* Max pooling in 1992<br />
* Mixed pooling in 2014<br />
* Pooling methods with probabilistic approaches in 2014 and 2015<br />
<br />
== Background ==<br />
Average Pooling and Max Pooling are well-known pooling methods and are popular techniques used in the literature. These pooling methods reduce input data dimensionality by taking the maximum value or the average value of specific areas and condense them into one single value. While these methods are simple and effective, they still have some limitations. The authors identify the following limitations:<br />
<br />
'''Limitations of Max Pooling and Average Pooling'''<br />
<br />
'''Max pooling''': takes the maximum value of a region <math>R_{ij} </math> and selects it to obtain a condensed feature map. It can '''erase the details''' of the image (this happens when the main details have lower intensity than the insignificant details) and also commonly '''over-fits''' the training data. Max pooling is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = \max_{(p,q)\in R_{ij}}(a_{kpq})<br />
\end{align}<br />
<br />
'''Average pooling''': calculates the average value of a region and selects it to obtain a condensed feature map. Depending on the data, this method can '''dilute pertinent details''' from an image (this happens when much of the region has values far lower than the significant details). Average pooling is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = \frac{1}{|R_{ij}|}\sum_{(p,q)\in R_{ij}}{{a_{kpq}}}<br />
\end{align}<br />
<br />
Where <math>a_{kij}</math> is the output activation of the <math>k^{th}</math> feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is the input activation at<br />
<math>(p,q)</math> within <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of the pooling region. Figure 2 provides an example of the weaknesses of these two methods using toy images:<br />
<br />
[[File: fig0001.PNG| 700px|center]]<br />
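The two definitions above can be sketched in NumPy as follows. This is a minimal illustration, assuming a single 2-D feature map and non-overlapping square regions; the function name <code>pool2d</code> and the toy input are hypothetical and do not come from the paper.<br />

```python
import numpy as np

def pool2d(a, size=2, mode="max"):
    """Non-overlapping pooling of a 2-D feature map (H and W divisible by size)."""
    h, w = a.shape
    # Split the map into regions R_ij: regions[i, j] is a (size, size) block.
    regions = a.reshape(h // size, size, w // size, size).swapaxes(1, 2)
    if mode == "max":
        return regions.max(axis=(2, 3))   # a_kij = max over R_ij
    return regions.mean(axis=(2, 3))      # a_kij = (1/|R_ij|) * sum over R_ij

# Toy feature map (hypothetical values).
a = np.array([[1., 2., 0., 0.],
              [3., 4., 0., 8.],
              [5., 5., 1., 1.],
              [5., 5., 1., 1.]])
max_out = pool2d(a, mode="max")   # [[4, 8], [5, 1]]
avg_out = pool2d(a, mode="avg")   # [[2.5, 2], [5, 1]]
```

In the top-right region the single large activation (8) is kept by max pooling but diluted to 2 by average pooling, which is the behaviour described above.<br />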
<br />
<br />
'''How do researchers try to combat these issues?'''<br />
By using '''probabilistic pooling methods''' such as:<br />
<br />
1. '''Mixed pooling''': In general, when facing a new problem for which one wants to use a CNN, it is not obvious whether average or max pooling is preferable. Notably, both techniques have significant drawbacks. Average pooling forces the network to consider low-magnitude (and possibly irrelevant) information in constructing representations, while max pooling can force the network to ignore fundamental differences between neighbouring groups of pixels. To counteract this, mixed pooling probabilistically decides which of the two to use during training and testing. Note that, during training, the choice is probabilistic only in the forward pass; during back-propagation the network uses the method chosen in the forward pass. Mixed pooling can be applied in 3 different ways.<br />
<br />
* For all features within a layer<br />
* Mixed between features within a layer<br />
* Mixed between regions for different features within a layer<br />
<br />
Mixed Pooling is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = \lambda \cdot \max_{(p,q)\in R_{ij}}(a_{kpq})+(1-\lambda) \cdot \frac{1}{|R_{ij}|}\sum_{(p,q)\in R_{ij}}{{a_{kpq}}}<br />
\end{align}<br />
<br />
Where <math>\lambda</math> is a random value 0 or 1, indicating max or average pooling.<br />
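The mixed pooling formula can be sketched as below. This is only an illustration of the first variant (one <math>\lambda</math> shared by the whole feature map); the function name <code>mixed_pool2d</code>, the use of NumPy's random generator, and the 50/50 choice between the two modes are assumptions made for the example.<br />

```python
import numpy as np

def mixed_pool2d(a, rng, size=2):
    """Mixed pooling with a single random lambda in {0, 1} for the whole map."""
    h, w = a.shape
    regions = a.reshape(h // size, size, w // size, size).swapaxes(1, 2)
    lam = int(rng.integers(0, 2))  # lambda = 1 selects max, lambda = 0 selects average
    out = lam * regions.max(axis=(2, 3)) + (1 - lam) * regions.mean(axis=(2, 3))
    # lam is returned so that a backward pass could reuse the same choice.
    return out, lam

rng = np.random.default_rng(0)
a = np.arange(16.0).reshape(4, 4)
out, lam = mixed_pool2d(a, rng)
```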
<br />
2. '''Stochastic pooling''': improves upon max pooling by randomly sampling from neighborhood regions based on the probability values of each activation. This is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = a_l ~ \text{where } ~ l\sim P(p_1,p_2,...,p_{|R_{ij}|})<br />
\end{align}<br />
<br />
with probability of activations within each region defined as follows:<br />
<br />
\begin{align}<br />
p_{pq} = \dfrac{a_{pq}}{\sum_{(p,q) \in R_{ij}} a_{pq}}<br />
\end{align}<br />
<br />
The figure below describes the process of stochastic pooling. The panel on the left shows the activations of a given region, and the corresponding probabilities are shown in the center. The activation with the highest probability is the most likely to be selected by the pooling method, but any activation can be selected. In this case, the midrange activation with a probability of 13% is selected. <br />
<br />
[[File: stochastic pooling.jpeg| 700px|center]]<br />
<br />
As stochastic pooling is based on probability and is not deterministic, it avoids the shortcomings of max and average pooling and enjoys some of the advantages of max pooling.<br />
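A minimal sketch of stochastic pooling for a single region is given below. It assumes non-negative activations (for example, after a ReLU); the function name <code>stochastic_pool_region</code>, the handling of an all-zero region, and the toy values are assumptions for illustration only.<br />

```python
import numpy as np

def stochastic_pool_region(region, rng):
    """Sample one activation from a region with probability proportional to its value."""
    a = np.asarray(region, dtype=float).ravel()
    total = a.sum()
    if total == 0.0:          # assumption: an all-zero region pools to zero
        return 0.0
    p = a / total             # p_pq = a_pq / sum of activations in R_ij
    l = rng.choice(a.size, p=p)   # l ~ P(p_1, ..., p_|R_ij|)
    return a[l]

rng = np.random.default_rng(0)
region = np.array([[1.6, 0.0, 0.0],
                   [0.0, 0.0, 0.0],
                   [0.0, 0.0, 2.4]])
sample = stochastic_pool_region(region, rng)  # 1.6 with prob. 0.4, 2.4 with prob. 0.6
```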
<br />
3. '''Top-k activation pooling''': picks the top-k activations in every pooling region, so that as much information as possible passes through the subsampling gates. It is used together with max pooling: after max pooling, to further improve the representation capability, the top-k activations are picked and summed, and the sum is constrained by a constant. <br />
Details are given in this paper: https://www.hindawi.com/journals/wcmc/2018/8196906/<br />
<br />
'''Wavelets and Wavelet Transform'''<br />
A wavelet is a short, localized oscillation used to represent a signal. In a wavelet transform, the signal is expressed as a combination of basis functions built from such wavelets.<br />
<br />
The wavelet transform involves taking the inner product of a signal (in this case, the image), with these basis functions. This produces a set of coefficients for the signal. These coefficients can then be quantized and coded in order to compress the image.<br />
<br />
One issue of note is that wavelets offer a tradeoff between resolution in frequency and resolution in time (or, for images, in location). For example, a sine wave is useful for detecting signals with its own frequency, but cannot detect where along the sine wave this alignment of signals is occurring. Thus, basis functions must be chosen with this tradeoff in mind.<br />
<br />
Source: Compressing still and moving images with wavelets<br />
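The idea of computing coefficients as inner products with basis functions can be illustrated with a small example. The sketch below uses a one-level orthonormal Haar basis for a length-4 signal; the signal values are made up, and the Haar basis is chosen here only because it is the simplest wavelet.<br />

```python
import numpy as np

signal = np.array([4.0, 6.0, 10.0, 12.0])   # hypothetical 1-D signal
s = 1.0 / np.sqrt(2.0)

# Rows are orthonormal Haar basis functions:
# two local averages (approximation) and two local differences (detail).
basis = np.array([[s,  s, 0.0, 0.0],
                  [0.0, 0.0, s,  s],
                  [s, -s, 0.0, 0.0],
                  [0.0, 0.0, s, -s]])

coeffs = basis @ signal            # inner product of the signal with each basis function
reconstructed = basis.T @ coeffs   # an orthonormal basis is inverted by its transpose
```

The detail coefficients are small here because neighbouring samples are similar, which is why such coefficients can be coarsely quantized or discarded for compression.<br />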
<br />
== Proposed Method ==<br />
<br />
The previously highlighted pooling methods use neighborhoods to subsample, almost identical to nearest neighbor interpolation.<br />
<br />
The proposed pooling method uses wavelets (i.e. small waves, generally used in signal processing) to reduce the dimensions of the feature maps. The authors use a wavelet transform to minimize the artifacts that result from neighborhood reduction. They postulate that their approach, which discards the first-order sub-bands, compresses the data more organically, and that this organic reduction lessens the creation of jagged edges and other artifacts that may impede correct image classification.<br />
<br />
* '''Forward Propagation'''<br />
<br />
The proposed wavelet pooling scheme pools features by performing a 2nd order decomposition in the wavelet domain according to the fast wavelet transform (FWT) which is a more efficient implementation of the two-dimensional discrete wavelet transform (DWT) as follows:<br />
<br />
\begin{align}<br />
W_{\varphi}[j+1,k] = h_{\varphi}[-n]*W_{\varphi}[j,n]|_{n=2k,k\leq0}<br />
\end{align}<br />
<br />
\begin{align}<br />
W_{\psi}[j+1,k] = h_{\psi}[-n]*W_{\varphi}[j,n]|_{n=2k,k\leq0}<br />
\end{align}<br />
<br />
where <math>\varphi</math> is the approximation function, <math>\psi</math> is the detail function, and <math>W_{\varphi},W_{\psi}</math> are the approximation and detail coefficients. <math>h_{\varphi}[-n]</math> and <math>h_{\psi}[-n]</math> are the time-reversed scaling and wavelet vectors, <math>n</math> represents the sample in the vector, and <math>j</math> denotes the resolution level.<br />
<br />
When the FWT is used on images, it is applied twice (once on the rows, then again on the columns). Doing this in combination yields the detail sub-bands (LH, HL, HH) at each decomposition level and the approximation sub-band (LL) at the highest decomposition level.<br />
After performing the 2nd order decomposition, the image features are reconstructed using only the 2nd order wavelet sub-bands. This pools the image features by a factor of 2 using the inverse FWT (IFWT), which is based on the inverse DWT (IDWT).<br />
<br />
\begin{align}<br />
W_{\varphi}[j,k] = h_{\varphi}[-n]*W_{\varphi}[j+1,n]+h_{\psi}[-n]*W_{\psi}[j+1,n]|_{n=\frac{k}{2},k\leq0}<br />
\end{align}<br />
<br />
[[File: wavelet pooling forward.PNG| 700px|center]]<br />
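The forward pass described above can be sketched for the Haar wavelet as follows. This is an illustrative NumPy version, not the authors' MatConvNet code: the function names are hypothetical, the orthonormal Haar normalization is an assumption, and the input height and width are assumed to be divisible by 4.<br />

```python
import numpy as np

def haar_dwt2(x):
    """One level of the 2-D orthonormal Haar transform: returns LL, LH, HL, HH."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    return ((a + b + c + d) / 2, (a + b - c - d) / 2,
            (a - b + c - d) / 2, (a - b - c + d) / 2)

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds a map twice the size of each sub-band."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w))
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x

def wavelet_pool(x):
    """Pool by a factor of 2: keep only the 2nd-level sub-bands and reconstruct."""
    ll1, _, _, _ = haar_dwt2(x)            # level 1: detail sub-bands are discarded
    ll2, lh2, hl2, hh2 = haar_dwt2(ll1)    # level 2 decomposition
    return haar_idwt2(ll2, lh2, hl2, hh2)  # reconstruct from level-2 sub-bands only

x = np.arange(64.0).reshape(8, 8)
pooled = wavelet_pool(x)   # shape (4, 4)
```

In this particular sketch the reconstruction from the level-2 sub-bands equals the level-1 LL band, so with the Haar basis the output is a scaled 2x2 average of the input; other wavelet bases would not reduce to this.<br />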
<br />
<br />
* '''Backpropagation'''<br />
<br />
The proposed wavelet pooling algorithm performs backpropagation by reversing the process of its forward propagation. First, the image feature being backpropagated undergoes a 1st order wavelet decomposition. After decomposition, the detail coefficient sub-bands are up-sampled by a factor of 2 to create a new 1st level decomposition. The initial decomposition then becomes the 2nd level decomposition. Finally, this new 2nd order wavelet decomposition reconstructs the image feature for further backpropagation using the IDWT. Figure 5 illustrates the wavelet pooling backpropagation algorithm in detail:<br />
<br />
[[File:wavelet pooling backpropagation.PNG| 700px|center]]<br />
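The backpropagation steps described above might be sketched as follows. This is a rough illustration only: the Haar helpers and function names are hypothetical, and the summary does not say how the detail sub-bands are up-sampled, so nearest-neighbour repetition is used here purely as an assumption.<br />

```python
import numpy as np

def haar_dwt2(x):
    """One level of the 2-D orthonormal Haar transform: returns LL, LH, HL, HH."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    return ((a + b + c + d) / 2, (a + b - c - d) / 2,
            (a - b + c - d) / 2, (a - b - c + d) / 2)

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds a map twice the size of each sub-band."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w))
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x

def upsample2(band):
    """ASSUMPTION: up-sample by 2 using nearest-neighbour repetition."""
    return np.kron(band, np.ones((2, 2)))

def wavelet_pool_backward(grad):
    """Map an n x n gradient back to the 2n x 2n input of the pooling layer."""
    # 1st order decomposition of the incoming gradient; it acts as level 2.
    ll2, lh2, hl2, hh2 = haar_dwt2(grad)
    # The up-sampled detail sub-bands form the new level-1 detail sub-bands.
    lh1, hl1, hh1 = upsample2(lh2), upsample2(hl2), upsample2(hh2)
    # Inverting level 2 gives the level-1 approximation sub-band.
    ll1 = haar_idwt2(ll2, lh2, hl2, hh2)
    # Inverting level 1 gives the gradient at the input resolution.
    return haar_idwt2(ll1, lh1, hl1, hh1)

grad = np.arange(16.0).reshape(4, 4)
grad_in = wavelet_pool_backward(grad)   # shape (8, 8)
```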
<br />
== Results and Discussion ==<br />
<br />
All experiments have been performed using the MatConvNet framework (Vedaldi & Lenc, 2015). Stochastic gradient descent has been used for training. For the proposed method, the Haar wavelet has been chosen as the basis wavelet for its property of having even, square sub-bands. All CNN structures except the one for MNIST use a network loosely based on Zeiler's network (Zeiler & Fergus, 2013). The experiments are repeated with Dropout (Srivastava, 2013), and the Local Response Normalization (Krizhevsky, 2009) is replaced with Batch Normalization (Ioffe & Szegedy, 2015) for CIFAR-10 and SVHN (Dropout only), to examine how these regularization techniques change the pooling results. The authors have tested the proposed method on four different datasets, as shown in the figure:<br />
<br />
[[File: selection of image datasets.PNG| 700px|center]]<br />
<br />
Different methods based on Max, Average, Mixed, Stochastic and Wavelet have been used at the pooling section of each architecture. Accuracy and Model Energy have been used as the metrics to evaluate the performance of the proposed methods. These have been evaluated and their performances have been compared on different data-sets.<br />
<br />
* MNIST:<br />
<br />
The network architecture is based on the example MNIST structure from MatConvNet, with batch normalization inserted. All other parameters are the same. The figure below shows their network structure for the MNIST experiments.<br />
<br />
[[File: CNN MNIST.PNG| 700px|center]]<br />
<br />
The input training data and test data come from the MNIST database of handwritten digits. The full training set of 60,000 images is used, as well as the full testing set of 10,000 images. The table below shows their proposed method outperforms all methods. Given the small number of epochs, max pooling is the only method to start to over-fit the data during training. Mixed and stochastic pooling show a rocky trajectory but do not over-fit. Average and wavelet pooling show a smoother descent in learning and error reduction. The figure below shows the energy of each method per epoch.<br />
<br />
[[File: MNIST pooling method energy.PNG| 700px|center]]<br />
<br />
<br />
The accuracies for both paradigms are shown below:<br />
<br />
<br />
[[File: MNIST perf.PNG| 700px|center]]<br />
<br />
* CIFAR-10:<br />
<br />
The authors perform two sets of experiments with the pooling methods. The first is a regular network structure with no dropout layers. They use this network to observe each pooling method without extra regularization. The second uses dropout and batch normalization and performs over 30 more epochs to observe the effects of these changes. <br />
<br />
[[File: CNN CIFAR.PNG| 700px|center]]<br />
<br />
The input training and test data come from the CIFAR-10 dataset. <br />
The full training set of 50,000 images is used, as well as the full testing set of 10,000 images. In both cases, with and without dropout, the tables below show that the proposed method has the second highest accuracy.<br />
<br />
[[File: fig0000.jpg| 700px|center]]<br />
<br />
Max pooling over-fits fairly quickly, while wavelet pooling resists over-fitting. The change in learning rate prevents their method from over-fitting, and it continues to show a slower propensity for learning. Mixed and stochastic pooling maintain a consistent progression of learning, and their validation sets trend at a similar, but better rate than their proposed method. Average pooling shows the smoothest descent in learning and error reduction, especially in the validation set. The energy of each method per epoch is also shown below:<br />
<br />
[[File: CIFAR_pooling_method_energy.PNG| 700px|center]]<br />
<br />
<br />
* SVHN:<br />
<br />
Two sets of experiments are performed with the pooling methods. The first is a regular network structure with no dropout layers. As with the previous datasets, they use this network to observe each pooling method without extra regularization.<br />
The second network uses dropout to observe the effects of this change. The figure below shows their network structure for the SVHN experiments:<br />
<br />
[[File: CNN SHVN.PNG| 700px|center]]<br />
<br />
The input training and test data come from the SVHN dataset. For the case with no dropout, they use 55,000 images from the training set. For the case with dropout, they use the full training set of 73,257 images, a validation set of 30,000 images that they extract from the extra training set of 531,131 images, and the full testing set of 26,032 images. In both cases, with and without dropout, the tables below show that their proposed method has the second lowest accuracy.<br />
<br />
[[File: SHVN perf.PNG| 700px|center]]<br />
<br />
Max and wavelet pooling both slightly over-fit the data. Their method follows the path of max pooling but performs slightly better in maintaining some stability. Mixed, stochastic, and average pooling maintain a slow progression of learning, and their validation sets trend at near identical rates. The figure below shows the energy of each method per epoch.<br />
<br />
[[File: SHVN pooling method energy.PNG| 700px|center]]<br />
<br />
* KDEF:<br />
<br />
They run one set of experiments with the pooling methods that includes dropout. The figure below shows their network structure for the KDEF experiments:<br />
<br />
[[File:CNN KDEF.PNG| 700px|center]]<br />
<br />
The input training and test data come from the KDEF dataset. This dataset contains 4,900 images of 35 people displaying seven basic emotions (afraid, angry, disgusted, happy, neutral, sad, and surprised) using facial expressions. They display emotions at five poses (full left and right profiles, half left and right profiles, and straight).<br />
<br />
This dataset contains a few errors that they have fixed (missing or corrupted images, uncropped images, etc.). All of the missing images are at angles of -90, -45, 45, or 90 degrees. They fix the missing and corrupt images by mirroring their counterparts in MATLAB and adding them back to the dataset. They manually crop the images that need to match the dimensions set by the creators (762 x 562).<br />
KDEF does not designate a training or test data set. They shuffle the data and separate 3,900 images as training data, and 1,000 images as test data. They resize the images to 128x128 because of memory and time constraints.<br />
<br />
The dropout layers regulate the network and maintain stability in spite of some pooling methods known to over-fit. The table below shows their proposed method has the second highest accuracy. Max pooling eventually over-fits, while wavelet pooling resists over-fitting. Average and mixed pooling resist over-fitting but are unstable for most of the learning. Stochastic pooling maintains a consistent progression of learning. Wavelet pooling also follows a smoother, consistent progression of learning.<br />
The figure below shows the energy of each method per epoch.<br />
<br />
[[File: KDEF pooling method energy.PNG| 700px|center]]<br />
<br />
The accuracies for both paradigms are shown below:<br />
<br />
[[File: KDEF perf.PNG| 700px|center]]<br />
<br />
<br />
<br />
* Computational Complexity:<br />
The experiments and implementations of wavelet pooling above were a proof of concept rather than an optimized method. In terms of mathematical operations, the wavelet pooling method is the least computationally efficient of all the pooling methods mentioned above. Average pooling is the most efficient method, max pooling and mixed pooling are at a similar level, and wavelet pooling is far more expensive to compute.<br />
<br />
== Conclusion ==<br />
<br />
The authors show that wavelet pooling has the potential to equal or eclipse some of the traditional methods currently used in CNNs. Their proposed method outperforms all others on the MNIST dataset, outperforms all but one on the CIFAR-10 and KDEF datasets, and performs within a respectable range of the pooling methods that outdo it on the SVHN dataset. The addition of dropout and batch normalization shows the proposed method's response to network regularization. As in the non-dropout cases, it outperforms all but one on both the CIFAR-10 and KDEF datasets and performs within a respectable range of the pooling methods that outdo it on the SVHN dataset.<br />
<br />
The authors' results confirm previous studies showing that no one pooling method is superior, but that some perform better than others depending on the dataset and network structure (Boureau et al., 2010; Lee et al., 2016). Furthermore, many networks alternate between different pooling methods to maximize the effectiveness of each method. [1]<br />
<br />
Future work in this area could vary the wavelet basis to explore which basis performs best for pooling. Altering the upsampling and downsampling factors in the decomposition and reconstruction could lead to better image feature reductions beyond the 2x2 scale. Retaining the discarded subbands for backpropagation could lead to higher accuracies and fewer errors. Improving the FWT method used could greatly increase computational efficiency. Finally, analyzing the structural similarity (SSIM) of wavelet pooling versus other methods could further demonstrate the value of the authors' approach. [1]<br />
<br />
== Suggested Future work ==<br />
<br />
The upsampling and downsampling factors in the decomposition and reconstruction need to be changed to achieve more feature reduction.<br />
The subbands that are currently discarded should be kept to obtain higher accuracies. To achieve higher computational efficiency, the FWT method needs to be improved.<br />
<br />
== Critiques and Suggestions ==<br />
* The backpropagation process, which could be a strength of the study, is not described in enough detail compared to the existing methods.<br />
* The study centres on wavelet decomposition, yet the reasons for choosing Haar as the mother wavelet and for the number of decomposition levels are not explained and are only mentioned as future work. <br />
* At the beginning, the study states that pooling methods do not receive the attention they should. In the end, the results show that the choice of pooling method depends on the dataset, and the authors mention trial and error as a reasonable way to choose one. In my view, the authors have not really focused on providing a pooling method that effectively improves the current situation. At the least, trying to extract a clearer pattern relating the results to the dataset structure would be helpful.<br />
* The origin of average pooling, which is one of the main pooling algorithms used for comparison, is not even referenced in the introduction.<br />
* Combining wavelet, max and average pooling would be an interesting option to investigate, both in sequence (max/average after wavelet pooling) and combined in the manner of mixed pooling.<br />
* While the current datasets show the performance of the proposed method appropriately, it would be a good idea to evaluate the method on some larger datasets. This might help to clarify whether the size of a dataset affects the overfitting behavior of max pooling mentioned in the paper.<br />
* Adding asymptotic notation to the computational complexity of the proposed algorithm would be meaningful, particularly since the given results are for a single, fixed input size (one image in forward propagation) and consequently do not generalize. <br />
* They could have considered comparing against the Fast Fourier Transform (FFT). A non-wavelet transform seems to be an obvious candidate for comparison.<br />
* Going beyond the 2x2 pooling window would have further supported their method.<br />
* ([https://openreview.net/forum?id=rkhlb8lCZ OpenReview]) The experiments are largely conducted with very small scale datasets. As a result, I am not sure if they are representative enough to show the performance difference between different pooling methods.<br />
* ([https://openreview.net/forum?id=rkhlb8lCZ OpenReview]) No comparison to non-wavelet methods. For example, one obvious comparison would have been to look at using a DCT or FFT transform where the output would discard high-frequency components (this can get very close to the wavelet idea!).<br />
<br />
== References ==<br />
<br />
Williams, Travis, and Robert Li. "Wavelet Pooling for Convolutional Neural Networks." (2018).<br />
<br />
Hilton, Michael L., Björn D. Jawerth, and Ayan Sengupta. "Compressing still and moving images with wavelets." Multimedia systems 2.5 (1994): 218-227.<br />
<br />
<br />
== Revisions == <br />
<br />
* Two reviewers really liked the paper, and one of them placed it in the top 15% of papers at the conference, which supports the novelty and potential of the idea. One other reviewer, however, believed that it was not good enough to be accepted; the main reason given for rejection was the linear nature of wavelets (which was not convincingly described). <br />
<br />
* The main concern of two of the reviewers was the size of the datasets used to test the method, and the authors have mentioned future work on testing the method with bigger datasets.<br />
<br />
* The computational cost section was not included in the paper initially and was added in response to one reviewer's concern. The other reviewers did not raise this point, so unfortunately there is no comment on it from them. However, the description of the inefficient implementation seemed to satisfy that reviewer, which resulted in the paper being accepted. <br />
<br />
[https://openreview.net/forum?id=rkhlb8lCZ Revisions]<br />
<br />
At the end, if you are interested in implementing the method, they are willing to share their code but after making it efficient. So, maybe there will be another paper regarding less computational cost on larger datasets with a publishable code.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Wavelet_Pooling_For_Convolutional_Neural_Networks&diff=42171stat946w18/Wavelet Pooling For Convolutional Neural Networks2018-12-01T00:22:37Z<p>Z43ma: </p>
<hr />
<div>=Wavelet Pooling For Convolutional Neural Networks=<br />
<br />
<br />
== Introduction, Important Terms and Brief Summary==<br />
<br />
This paper focuses on the following important techniques: <br />
<br />
1) Convolutional Neural Networks (CNNs): These are networks with layered structures that conform to the shape of the inputs rather than relying on vector-based features, and they consistently obtain high accuracy in the classification of images and objects. Researchers continue to work on improving CNN performance. <br />
<br />
2) Pooling: Pooling subsamples the results of the convolution layers and gradually reduces spatial dimensions of the data throughout the network. It is done to reduce parameters, increase computational efficiency and regulate overfitting. <br />
<br />
Some pooling methods, including max pooling and average pooling, are deterministic. Deterministic pooling methods are efficient and simple, but can hinder the potential for optimal network learning. In contrast, mixed pooling and stochastic pooling use a probabilistic approach, which can address some problems of deterministic methods. All of these pooling methods use a neighborhood approach because of its simplicity and efficiency. Nevertheless, the neighborhood approach can cause edge halos, blurring, and aliasing, which need to be minimized. This paper introduces wavelet pooling, which uses a second-level wavelet decomposition to subsample features. Nearest-neighbor interpolation is replaced by an organic, subband method that more accurately represents the feature contents with fewer artifacts. The method decomposes features into a second-level decomposition and discards the first-level subbands to reduce feature dimensions. The method is compared against other state-of-the-art pooling methods on benchmark classification datasets such as MNIST, CIFAR-10, SVHN and KDEF.<br />
<br />
For further information on wavelets, follow this link to MathWorks' [https://www.mathworks.com/videos/understanding-wavelets-part-1-what-are-wavelets-121279.html Understanding Wavelets] video series.<br />
<br />
== Intuition ==<br />
<br />
Convolutional networks commonly employ convolutional layers to extract features and use pooling methods for spatial dimensionality reduction. In this study, wavelet pooling is introduced as an alternative to traditional neighborhood pooling that provides a more structural method of reducing feature dimensions. Max pooling is said to be prone to over-fitting, and average pooling is said to smooth out or 'dilute' details in features.<br />
<br />
Pooling is often introduced within networks to provide local invariance, which helps prevent overfitting to small translational shifts within an image. Despite their effectiveness, traditional pooling methods such as max pooling achieve this translational invariance by discarding information, using methods analogous to nearest-neighbour interpolation. Hoping to provide a more organic way of pooling, the authors use all of the information in the cells that enter a pooling operation, so that the resulting dimension-reduced features contain information from all of the input cells through various dot products.<br />
<br />
== History ==<br />
<br />
The study references the following milestones in the history of pooling methods:<br />
* Manual subsampling in 1979<br />
* Max pooling in 1992<br />
* Mixed pooling in 2014<br />
* Pooling methods with probabilistic approaches in 2014 and 2015<br />
<br />
== Background ==<br />
Average Pooling and Max Pooling are well-known pooling methods and are popular techniques used in the literature. These pooling methods reduce input data dimensionality by taking the maximum value or the average value of specific areas and condense them into one single value. While these methods are simple and effective, they still have some limitations. The authors identify the following limitations:<br />
<br />
'''Limitations of Max Pooling and Average Pooling'''<br />
<br />
'''Max pooling''': takes the maximum value of a region <math>R_{ij} </math> and selects it to obtain a condensed feature map. It can '''erase the details''' of the image (this happens when the main details have lower intensity than the insignificant details) and also commonly '''over-fits''' the training data. Max pooling is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = \max_{(p,q)\in R_{ij}}(a_{kpq})<br />
\end{align}<br />
<br />
'''Average pooling''': calculates the average value of a region and selects it to obtain a condensed feature map. Depending on the data, this method can '''dilute pertinent details''' from an image (this happens when much of the region has values far lower than the significant details). Average pooling is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = \frac{1}{|R_{ij}|}\sum_{(p,q)\in R_{ij}}{{a_{kpq}}}<br />
\end{align}<br />
<br />
Where <math>a_{kij}</math> is the output activation of the <math>k^{th}</math> feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is the input activation at<br />
<math>(p,q)</math> within <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of the pooling region. Figure 2 provides an example of the weaknesses of these two methods using toy images:<br />
<br />
[[File: fig0001.PNG| 700px|center]]<br />
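<br />
The two definitions above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' MatConvNet code: the function name pool2d, the non-overlapping square regions, and the toy feature map are assumptions made for the example.<br />
<br />
```python
def pool2d(feature_map, size=2, mode="max"):
    """Pool a 2D list of numbers over non-overlapping size x size regions."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, rows - size + 1, size):
        out_row = []
        for j in range(0, cols - size + 1, size):
            # Collect the activations a_kpq inside the region R_ij.
            region = [feature_map[p][q]
                      for p in range(i, i + size)
                      for q in range(j, j + size)]
            if mode == "max":
                out_row.append(max(region))
            elif mode == "avg":
                out_row.append(sum(region) / len(region))
            else:
                raise ValueError("mode must be 'max' or 'avg'")
        pooled.append(out_row)
    return pooled


# Hypothetical 4x4 feature map (one channel k).
feature_map = [
    [1, 2, 0, 0],
    [3, 4, 0, 8],
    [5, 5, 1, 1],
    [5, 5, 1, 1],
]
print(pool2d(feature_map, mode="max"))  # [[4, 8], [5, 1]]
print(pool2d(feature_map, mode="avg"))  # [[2.5, 2.0], [5.0, 1.0]]
```
<br />
In the top-right region, max pooling keeps only the single large activation (8), while average pooling dilutes it to 2.0, which mirrors the two limitations described above.<br />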
<br />
<br />
'''How do researchers try to combat these issues?'''<br />
By using '''probabilistic pooling methods''' such as:<br />
<br />
1. '''Mixed pooling''': In general, when facing a new problem for which one wants to use a CNN, it is not obvious whether average or max pooling is preferable. Notably, both techniques have significant drawbacks. Average pooling forces the network to consider low-magnitude (and possibly irrelevant) information in constructing representations, while max pooling can force the network to ignore fundamental differences between neighbouring groups of pixels. To counteract this, mixed pooling probabilistically decides which of the two to use during training / testing. It should be noted that, for training, the choice is probabilistic only in the forward pass; during back-propagation the network uses the method chosen in the forward pass. Mixed pooling can be applied in 3 different ways.<br />
<br />
* For all features within a layer<br />
* Mixed between features within a layer<br />
* Mixed between regions for different features within a layer<br />
<br />
Mixed Pooling is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = \lambda \cdot max_{(p,q)\in R_{ij}}(a_{kpq})+(1-\lambda) \cdot \frac{1}{|R_{ij}|}\sum_{(p,q)\in R_{ij}}{{a_{kpq}}}<br />
\end{align}<br />
<br />
Where <math>\lambda</math> is a random value 0 or 1, indicating max or average pooling.<br />
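<br />
As a rough illustration of the formula above, the following sketch pools a single region. The function name mixed_pool_region and the flat-list representation of the region are assumptions for the example; a real layer would apply this per layer, per feature, or per region, as listed above.<br />
<br />
```python
import random


def mixed_pool_region(region, rng=random):
    """Pool one region (a flat list of activations) with mixed pooling.

    lam is drawn as 0 or 1: 1 selects max pooling, 0 selects average pooling.
    The drawn lam is returned so the backward pass can reuse the same choice.
    """
    lam = rng.randint(0, 1)
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return lam * max_value + (1 - lam) * avg_value, lam


rng = random.Random(0)  # seeded only to make the example repeatable
value, lam = mixed_pool_region([1, 2, 3, 4], rng)
print(lam, value)  # value is 4 when lam == 1 and 2.5 when lam == 0
```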
<br />
2. '''Stochastic pooling''': improves upon max pooling by randomly sampling from neighborhood regions based on the probability values of each activation. This is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = a_l ~ \text{where } ~ l\sim P(p_1,p_2,...,p_{|R_{ij}|})<br />
\end{align}<br />
<br />
with probability of activations within each region defined as follows:<br />
<br />
\begin{align}<br />
p_{pq} = \dfrac{a_{pq}}{\sum_{(p,q) \in R_{ij}} a_{pq}}<br />
\end{align}<br />
<br />
The figure below describes the process of Stochastic Pooling. The figure on the left shows the activations of a given region, and the corresponding probabilities are shown in the center. The activation with the highest probability is the most likely to be selected by the pooling method; however, any activation can be selected. In this case, the midrange activation with a probability of 13% is selected. <br />
<br />
[[File: stochastic pooling.jpeg| 700px|center]]<br />
<br />
As stochastic pooling is based on probability and is not deterministic, it avoids the shortcomings of max and average pooling and enjoys some of the advantages of max pooling.<br />
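<br />
The sampling step can be sketched as follows. This is a minimal sketch under stated assumptions: the activations are non-negative (for example after a ReLU) with a positive sum, the region is given as a flat list, and the name stochastic_pool_region is made up for the example.<br />
<br />
```python
import random


def stochastic_pool_region(region, rng=random):
    """Sample one activation from a region with probability proportional to its value.

    Assumes non-negative activations whose sum is positive.
    """
    total = sum(region)
    probabilities = [a / total for a in region]
    # random.choices draws one element according to the given weights.
    chosen = rng.choices(region, weights=probabilities, k=1)[0]
    return chosen, probabilities


rng = random.Random(0)  # seeded only to make the example repeatable
chosen, probabilities = stochastic_pool_region([1.0, 2.0, 5.0, 8.0], rng)
print(probabilities)  # [0.0625, 0.125, 0.3125, 0.5]
print(chosen)         # one of the four activations; larger ones are more likely
```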
<br />
3. '''Top-k activation pooling''': picks the top-k activations in every pooling region, so that as much information as possible passes through the subsampling gates. It is used together with max pooling: after max pooling, to further improve the representation capability, the top-k activations are picked, summed, and the sum is constrained by a constant. <br />
Details in this paper: https://www.hindawi.com/journals/wcmc/2018/8196906/<br />
<br />
'''Wavelets and Wavelet Transform'''<br />
A wavelet is a short, localized wave used to represent a signal. In wavelet transforms, a signal is generally represented as a combination of such basis functions.<br />
<br />
The wavelet transform involves taking the inner product of a signal (in this case, the image), with these basis functions. This produces a set of coefficients for the signal. These coefficients can then be quantized and coded in order to compress the image.<br />
<br />
One issue of note is that wavelets offer a tradeoff between resolution in frequency and resolution in time (or, for images, presumably location). For example, a sine wave is useful for detecting signals at its own frequency, but cannot detect where along the signal this alignment occurs. Thus, basis functions must be chosen with this tradeoff in mind.<br />
<br />
Source: Compressing still and moving images with wavelets<br />
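<br />
To make the inner-product view concrete, here is a small sketch of one level of a 1D Haar transform (the Haar wavelet is the basis the authors use later; the function name haar_dwt_1d and the toy signal are assumptions). Each coefficient is the inner product of the signal with a two-sample basis function, so a detail coefficient is non-zero only where the signal changes locally.<br />
<br />
```python
import math


def haar_dwt_1d(signal):
    """One level of the orthonormal 1D Haar transform (even length assumed)."""
    s = 1 / math.sqrt(2)
    # Approximation: inner product with the scaling function [s, s].
    approx = [s * (signal[i] + signal[i + 1]) for i in range(0, len(signal) - 1, 2)]
    # Detail: inner product with the wavelet function [s, -s].
    detail = [s * (signal[i] - signal[i + 1]) for i in range(0, len(signal) - 1, 2)]
    return approx, detail


approx, detail = haar_dwt_1d([4, 4, 4, 4, 9, 1, 4, 4])
print(approx)
print(detail)  # only the third entry is non-zero: the change sits in the pair (9, 1)
```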
<br />
== Proposed Method ==<br />
<br />
The previously highlighted pooling methods use neighborhoods to subsample, almost identical to nearest neighbor interpolation.<br />
<br />
The proposed pooling method uses wavelets (i.e. small waves - generally used in signal processing) to reduce the dimensions of the feature maps. They use wavelet transform to minimize artifacts resulting from neighborhood reduction. They postulate that their approach, which discards the first-order sub-bands, more organically captures the data compression. The authors say that this organic reduction therefore lessens the creation of jagged edges and other artifacts that may impede correct image classification.<br />
<br />
* '''Forward Propagation'''<br />
<br />
The proposed wavelet pooling scheme pools features by performing a 2nd order decomposition in the wavelet domain according to the fast wavelet transform (FWT) which is a more efficient implementation of the two-dimensional discrete wavelet transform (DWT) as follows:<br />
<br />
\begin{align}<br />
W_{\varphi}[j+1,k] = h_{\varphi}[-n]*W_{\varphi}[j,n]|_{n=2k,k\leq0}<br />
\end{align}<br />
<br />
\begin{align}<br />
W_{\psi}[j+1,k] = h_{\psi}[-n]*W_{\varphi}[j,n]|_{n=2k,k\leq0}<br />
\end{align}<br />
<br />
where <math>\varphi</math> is the approximation function, <math>\psi</math> is the detail function, and <math>W_{\varphi},W_{\psi}</math> are called the approximation and detail coefficients. <math>h_{\varphi}[-n]</math> and <math>h_{\psi}[-n]</math> are the time-reversed scaling and wavelet vectors, <math>n</math> represents the sample in the vector, and <math>j</math> denotes the resolution level.<br />
<br />
When using the FWT on images, it is applied twice (once on the rows, then again on the columns). By doing this in combination, the detail sub-bands (LH, HL, HH) at each decomposition level and the approximation sub-band (LL) for the highest decomposition level are obtained.<br />
After performing the 2nd order decomposition, the image features are reconstructed, but only using the 2nd order wavelet sub-bands. This method pools the image features by a factor of 2 using the inverse FWT (IFWT) which is based off the inverse DWT (IDWT).<br />
<br />
\begin{align}<br />
W_{\varphi}[j,k] = h_{\varphi}[-n]*W_{\varphi}[j+1,n]+h_{\psi}[-n]*W_{\psi}[j+1,n]|_{n=\frac{k}{2},k\leq0}<br />
\end{align}<br />
<br />
[[File: wavelet pooling forward.PNG| 700px|center]]<br />
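<br />
The forward pass described above can be sketched in plain Python with the Haar wavelet. This is a sketch under stated assumptions, not the authors' implementation: the function names are made up, the orthonormal Haar normalization and the LH/HL naming convention are choices made here, and the input side lengths are assumed to be divisible by 4.<br />
<br />
```python
def haar_dwt2(x):
    """One level of the 2D orthonormal Haar transform (even side lengths assumed)."""
    rows, cols = len(x) // 2, len(x[0]) // 2
    ll, lh, hl, hh = ([[0.0] * cols for _ in range(rows)] for _ in range(4))
    for i in range(rows):
        for j in range(cols):
            a, b = x[2 * i][2 * j], x[2 * i][2 * j + 1]
            c, d = x[2 * i + 1][2 * j], x[2 * i + 1][2 * j + 1]
            ll[i][j] = (a + b + c + d) / 2  # approximation
            lh[i][j] = (a - b + c - d) / 2  # detail across columns
            hl[i][j] = (a + b - c - d) / 2  # detail across rows
            hh[i][j] = (a - b - c + d) / 2  # diagonal detail
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds an array with twice the side lengths."""
    rows, cols = len(ll), len(ll[0])
    x = [[0.0] * (2 * cols) for _ in range(2 * rows)]
    for i in range(rows):
        for j in range(cols):
            s, h, v, g = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            x[2 * i][2 * j] = (s + h + v + g) / 2
            x[2 * i][2 * j + 1] = (s - h + v - g) / 2
            x[2 * i + 1][2 * j] = (s + h - v - g) / 2
            x[2 * i + 1][2 * j + 1] = (s - h - v + g) / 2
    return x


def wavelet_pool(x):
    """Second-level decomposition, then reconstruction from level-2 sub-bands only."""
    ll1, _lh1, _hl1, _hh1 = haar_dwt2(x)   # level 1; its detail sub-bands are discarded
    ll2, lh2, hl2, hh2 = haar_dwt2(ll1)    # level 2
    return haar_idwt2(ll2, lh2, hl2, hh2)  # half the side lengths of x


feature_map = [
    [1, 2, 0, 0],
    [3, 4, 0, 8],
    [5, 5, 1, 1],
    [5, 5, 1, 1],
]
print(wavelet_pool(feature_map))  # [[5.0, 4.0], [10.0, 2.0]]
```
<br />
In this sketch the output has half the side lengths of the input. Because the Haar transform used here is exactly invertible, rebuilding from the level-2 sub-bands reproduces the level-1 approximation sub-band.<br />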
<br />
<br />
* '''Backpropagation'''<br />
<br />
The proposed wavelet pooling algorithm performs backpropagation by reversing the process of its forward propagation. First, the image feature being backpropagated undergoes 1st order wavelet decomposition. After decomposition, the detail coefficient sub-bands up-sample by a factor of 2 to create a new 1st level decomposition. The initial decomposition then becomes the 2nd level decomposition. Finally, this new 2nd order wavelet decomposition reconstructs the image feature for further backpropagation using the IDWT. Figure 5 illustrates the wavelet pooling backpropagation algorithm in detail:<br />
<br />
[[File:wavelet pooling backpropagation.PNG| 700px|center]]<br />
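<br />
The backward steps listed above can be sketched as follows. The sketch only follows the textual description and is not the authors' code: the function names are made up, the Haar normalization matches no particular library, and nearest-neighbour replication is an assumed choice for the up-sampling step, which the summary does not specify.<br />
<br />
```python
def haar_dwt2(x):
    """One level of the 2D orthonormal Haar transform (even side lengths assumed)."""
    rows, cols = len(x) // 2, len(x[0]) // 2
    ll, lh, hl, hh = ([[0.0] * cols for _ in range(rows)] for _ in range(4))
    for i in range(rows):
        for j in range(cols):
            a, b = x[2 * i][2 * j], x[2 * i][2 * j + 1]
            c, d = x[2 * i + 1][2 * j], x[2 * i + 1][2 * j + 1]
            ll[i][j] = (a + b + c + d) / 2
            lh[i][j] = (a - b + c - d) / 2
            hl[i][j] = (a + b - c - d) / 2
            hh[i][j] = (a - b - c + d) / 2
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds an array with twice the side lengths."""
    rows, cols = len(ll), len(ll[0])
    x = [[0.0] * (2 * cols) for _ in range(2 * rows)]
    for i in range(rows):
        for j in range(cols):
            s, h, v, g = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            x[2 * i][2 * j] = (s + h + v + g) / 2
            x[2 * i][2 * j + 1] = (s - h + v - g) / 2
            x[2 * i + 1][2 * j] = (s + h - v - g) / 2
            x[2 * i + 1][2 * j + 1] = (s - h - v + g) / 2
    return x


def upsample2(band):
    """Nearest-neighbour up-sampling by 2 (assumed; the text only says 'up-sample')."""
    out = []
    for row in band:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def wavelet_pool_backward(grad):
    """Map a pooled-size feature back to the pre-pooling size."""
    # Step 1: 1st order decomposition; these sub-bands act as the 2nd level.
    ll2, lh2, hl2, hh2 = haar_dwt2(grad)
    # Step 2: up-sampled detail sub-bands form the new 1st level.
    lh1, hl1, hh1 = upsample2(lh2), upsample2(hl2), upsample2(hh2)
    # Step 3: reconstruct level by level with the inverse transform.
    ll1 = haar_idwt2(ll2, lh2, hl2, hh2)
    return haar_idwt2(ll1, lh1, hl1, hh1)


grad = [[1.0, 1.0], [1.0, 1.0]]
for row in wavelet_pool_backward(grad):
    print(row)  # four rows of [0.5, 0.5, 0.5, 0.5]
```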
<br />
== Results and Discussion ==<br />
<br />
All experiments have been performed using the MatConvNet (Vedaldi & Lenc, 2015) framework. Stochastic gradient descent has been used for training. For the proposed method, the Haar wavelet has been chosen as the basis wavelet for its property of having even, square sub-bands. All CNN structures except for MNIST use a network loosely based on Zeiler's network (Zeiler & Fergus, 2013). The experiments are repeated with Dropout (Srivastava, 2013), and Local Response Normalization (Krizhevsky, 2009) is replaced with Batch Normalization (Ioffe & Szegedy, 2015) for CIFAR-10 and SVHN (Dropout only), to examine how these regularization techniques change the pooling results. The authors have tested the proposed method on four different datasets as shown in the figure:<br />
<br />
[[File: selection of image datasets.PNG| 700px|center]]<br />
<br />
Different methods based on Max, Average, Mixed, Stochastic and Wavelet have been used at the pooling section of each architecture. Accuracy and Model Energy have been used as the metrics to evaluate the performance of the proposed methods. These have been evaluated and their performances have been compared on different data-sets.<br />
<br />
* MNIST:<br />
<br />
The network architecture is based on the example MNIST structure from MatConvNet, with batch normalization inserted. All other parameters are the same. The figure below shows their network structure for the MNIST experiments.<br />
<br />
[[File: CNN MNIST.PNG| 700px|center]]<br />
<br />
The input training data and test data come from the MNIST database of handwritten digits. The full training set of 60,000 images is used, as well as the full testing set of 10,000 images. The table below shows that their proposed method outperforms all other methods. Given the small number of epochs, max pooling is the only method to start to over-fit the data during training. Mixed and stochastic pooling show a rocky trajectory but do not over-fit. Average and wavelet pooling show a smoother descent in learning and error reduction. The figure below shows the energy of each method per epoch.<br />
<br />
[[File: MNIST pooling method energy.PNG| 700px|center]]<br />
<br />
<br />
The accuracies for both paradigms are shown below:<br />
<br />
<br />
[[File: MNIST perf.PNG| 700px|center]]<br />
<br />
* CIFAR-10:<br />
<br />
The authors perform two sets of experiments with the pooling methods. The first is a regular network structure with no dropout layers. They use this network to observe each pooling method without extra regularization. The second uses dropout and batch normalization and performs over 30 more epochs to observe the effects of these changes. <br />
<br />
[[File: CNN CIFAR.PNG| 700px|center]]<br />
<br />
The input training and test data come from the CIFAR-10 dataset. <br />
The full training set of 50,000 images is used, as well as the full testing set of 10,000 images. In both cases, with and without dropout, the tables below show that the proposed method has the second highest accuracy.<br />
<br />
[[File: fig0000.jpg| 700px|center]]<br />
<br />
Max pooling over-fits fairly quickly, while wavelet pooling resists over-fitting. The change in learning rate prevents their method from over-fitting, and it continues to show a slower propensity for learning. Mixed and stochastic pooling maintain a consistent progression of learning, and their validation sets trend at a similar, but better rate than their proposed method. Average pooling shows the smoothest descent in learning and error reduction, especially in the validation set. The energy of each method per epoch is also shown below:<br />
<br />
[[File: CIFAR_pooling_method_energy.PNG| 700px|center]]<br />
<br />
<br />
* SVHN:<br />
<br />
Two sets of experiments are performed with the pooling methods. The first is a regular network structure with no dropout layers. As with the previous datasets, they use this network to observe each pooling method without extra regularization.<br />
The second network uses dropout to observe the effects of this change. The figure below shows their network structure for the SVHN experiments:<br />
<br />
[[File: CNN SHVN.PNG| 700px|center]]<br />
<br />
The input training and test data come from the SVHN dataset. For the case with no dropout, they use 55,000 images from the training set. For the case with dropout, they use the full training set of 73,257 images, a validation set of 30,000 images extracted from the extra training set of 531,131 images, and the full testing set of 26,032 images. In both cases, with and without dropout, the tables below show that their proposed method has the second lowest accuracy.<br />
<br />
[[File: SHVN perf.PNG| 700px|center]]<br />
<br />
Max and wavelet pooling both slightly over-fit the data. Their method follows the path of max pooling but performs slightly better in maintaining some stability. Mixed, stochastic, and average pooling maintain a slow progression of learning, and their validation sets trend at near identical rates. The figure below shows the energy of each method per epoch.<br />
<br />
[[File: SHVN pooling method energy.PNG| 700px|center]]<br />
<br />
* KDEF:<br />
<br />
They run one set of experiments with the pooling methods that includes dropout. The figure below shows their network structure for the KDEF experiments:<br />
<br />
[[File:CNN KDEF.PNG| 700px|center]]<br />
<br />
The input training and test data come from the KDEF dataset. This dataset contains 4,900 images of 35 people displaying seven basic emotions (afraid, angry, disgusted, happy, neutral, sad, and surprised) using facial expressions. They display emotions at five poses (full left and right profiles, half left and right profiles, and straight).<br />
<br />
This dataset contains a few errors that they have fixed (missing or corrupted images, uncropped images, etc.). All of the missing images are at angles of -90, -45, 45, or 90 degrees. They fix the missing and corrupt images by mirroring their counterparts in MATLAB and adding them back to the dataset. They manually crop the images that need cropping to match the dimensions set by the creators (762 x 562).<br />
KDEF does not designate a training or test data set. They shuffle the data and separate 3,900 images as training data, and 1,000 images as test data. They resize the images to 128x128 because of memory and time constraints.<br />
<br />
The dropout layers regulate the network and maintain stability in spite of some pooling methods known to over-fit. The table below shows their proposed method has the second highest accuracy. Max pooling eventually over-fits, while wavelet pooling resists over-fitting. Average and mixed pooling resist over-fitting but are unstable for most of the learning. Stochastic pooling maintains a consistent progression of learning. Wavelet pooling also follows a smoother, consistent progression of learning.<br />
The figure below shows the energy of each method per epoch.<br />
<br />
[[File: KDEF pooling method energy.PNG| 700px|center]]<br />
<br />
The accuracies for both paradigms are shown below:<br />
<br />
[[File: KDEF perf.PNG| 700px|center]]<br />
<br />
<br />
<br />
* Computational Complexity:<br />
The above experiments and implementations of wavelet pooling are more of a proof of concept than an optimized method. In terms of mathematical operations, wavelet pooling is the least computationally efficient of all the pooling methods mentioned above. Average pooling is the most efficient, max pooling and mixed pooling are at a similar level, and wavelet pooling is far more expensive to compute.<br />
<br />
== Conclusion ==<br />
<br />
The authors show that wavelet pooling has the potential to equal or eclipse some of the traditional methods currently used in CNNs. Their proposed method outperforms all others on the MNIST dataset, outperforms all but one on the CIFAR-10 and KDEF datasets, and performs within a respectable range of the pooling methods that outdo it on the SVHN dataset. The addition of dropout and batch normalization shows the proposed method's response to network regularization. As in the non-dropout cases, it outperforms all but one method on both the CIFAR-10 and KDEF datasets and performs within a respectable range of the pooling methods that outdo it on the SVHN dataset.<br />
<br />
== Suggested Future work ==<br />
<br />
The upsampling and downsampling factors in decomposition and reconstruction need to be changed to achieve more feature reduction.<br />
The subbands that are currently discarded could be kept to obtain higher accuracies. To achieve higher computational efficiency, the FWT method needs to be improved.<br />
<br />
== Critiques and Suggestions ==<br />
* The backpropagation process, which could be a strength of the study, is not described in enough detail compared with the existing methods.<br />
* The study centers on wavelet decomposition, yet the reasons for choosing Haar as the mother wavelet and for the number of decomposition levels are not explained and are only mentioned as future work. <br />
* At the beginning, the study mentions that pooling methods do not receive the attention they should. In the end, the results show that the choice of pooling method depends on the dataset, and the authors suggest trial and error as a reasonable approach to choosing one. In my view, the authors have not really focused on providing a pooling method that effectively improves on the current situation. At the very least, extracting a clearer pattern relating the results to the dataset structure would be helpful.<br />
* The origin of average pooling, which is the main pooling algorithm used for comparison, is not even referenced in the introduction.<br />
* Combination of the wavelet, Max and Average pooling can be an interesting option to investigate more on this topic; both in a row(Max/Avg after wavelet pooling) and combined like mix pooling option.<br />
* While the current datasets express the performance of the proposed method in an appropriate way, it could be a good idea to evaluate the method using some larger datasets. Maybe it helps to understand whether the size of a dataset can affect the overfitting behavior of max pooling which is mentioned in the paper.<br />
* Adding asymptotic notations to the computational complexity of the proposed algorithm would be meaningful, particularly since the given results are for a single/fixed input size (one image in forward propagation) and consequently are not generalizable. <br />
* They could have considered comparing against Fast Fourier Transform (FFT). Including a non wavelet form seems to be an obvious candidate for comparison<br />
* If they went beyond the 2x2 pooling window this would have further supported their method<br />
* ([https://openreview.net/forum?id=rkhlb8lCZ OpenReview]) The experiments are largely conducted with very small scale datasets. As a result, I am not sure if they are representative enough to show the performance difference between different pooling methods.<br />
* ([https://openreview.net/forum?id=rkhlb8lCZ OpenReview]) No comparison to non-wavelet methods. For example, one obvious comparison would have been to look at using a DCT or FFT transform where the output would discard high-frequency components (this can get very close to the wavelet idea!).<br />
<br />
== References ==<br />
<br />
Williams, Travis, and Robert Li. "Wavelet Pooling for Convolutional Neural Networks." (2018).<br />
<br />
Hilton, Michael L., Björn D. Jawerth, and Ayan Sengupta. "Compressing still and moving images with wavelets." Multimedia systems 2.5 (1994): 218-227.<br />
<br />
<br />
== Revisions == <br />
<br />
*Two reviewers really liked the paper, and one of them placed it in the top 15% of papers at the conference, which supports the novelty and potential of the idea. One other reviewer, however, believed that the paper was not good enough to be accepted; the main reason given for rejection was the linear nature of wavelets (which was not convincingly described). <br />
<br />
*The main concern of two of the reviewers has been the size of the datasets that have been used to test the method and the authors have mentioned future works concerning bigger datasets to test the method.<br />
<br />
*The computational cost section was not included in the paper initially and was added in response to one reviewer's concern. The other reviewers did not raise this point, so unfortunately there is no comment on it from them. However, the explanation of the inefficient implementation seemed to satisfy the reviewer, which resulted in the paper being accepted. <br />
<br />
[https://openreview.net/forum?id=rkhlb8lCZ Revisions]<br />
<br />
Finally, for readers interested in implementing the method, the authors are willing to share their code, but only after making it efficient. A follow-up paper with lower computational cost, larger datasets, and publishable code may therefore appear.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Wavelet_Pooling_For_Convolutional_Neural_Networks&diff=42170stat946w18/Wavelet Pooling For Convolutional Neural Networks2018-12-01T00:19:43Z<p>Z43ma: grammer update</p>
<hr />
<div>=Wavelet Pooling For Convolutional Neural Networks=<br />
<br />
[https://goo.gl/forms/8NucSpF36K6IUZ0V2 Your feedback on presentations]<br />
<br />
<br />
== Introduction, Important Terms and Brief Summary==<br />
<br />
This paper focuses on the following important techniques: <br />
<br />
1) Convolutional Neural Nets (CNN): These are networks with layered structures that conform to the shape of inputs rather than vector-based features and consistently obtain high accuracies in the classification of images and objects. Researchers continue to focus on CNN to improve their performances. <br />
<br />
2) Pooling: Pooling subsamples the results of the convolution layers and gradually reduces spatial dimensions of the data throughout the network. It is done to reduce parameters, increase computational efficiency and regulate overfitting. <br />
<br />
Some of the pooling methods, including max pooling and average pooling, are deterministic. Deterministic pooling methods are efficient and simple, but can hinder the potential for optimal network learning. In contrast, mixed pooling and stochastic pooling use a probabilistic approach, which can address some problems of deterministic methods. The neighborhood approach is used in all the mentioned pooling methods due to its simplicity and efficiency. Nevertheless, the approach can cause edge halos, blurring, and aliasing which need to be minimized. This paper introduces wavelet pooling, which uses a second-level wavelet decomposition to subsample features. The nearest neighbor interpolation is replaced by an organic, subband method that more accurately represents the feature contents with fewer artifacts. The method decomposes features into a second level decomposition and discards first level subbands to reduce feature dimensions. This method is compared to other state-of-the-art pooling methods to demonstrate superior results. Tests are conducted on benchmark classification tests like MNIST, CIFAR10, SHVN and KDEF.<br />
<br />
For further information on wavelets, follow this link to MathWorks' [https://www.mathworks.com/videos/understanding-wavelets-part-1-what-are-wavelets-121279.html Understanding Wavelets] video series.<br />
<br />
== Intuition ==<br />
<br />
Convolutional networks commonly employ convolutional layers to extract features and use pooling methods for spatial dimensionality reduction. In this study, wavelet pooling is introduced as an alternative to traditional neighborhood pooling, providing a more structural feature dimension reduction method. Max pooling is said to suffer from over-fitting, and average pooling is said to smooth out or 'dilute' details in features.<br />
<br />
Pooling is often introduced within networks to provide local invariance, so that small translational shifts within an image do not lead to overfitting. Despite their effectiveness, traditional pooling methods such as max pooling introduce this translational invariance by discarding information, using methods analogous to nearest neighbour interpolation. Hoping to provide a more organic way of pooling, the authors use all of the information within the cells fed into a pooling operation, so that the resulting dimension-reduced features contain information from all high-level cells through various dot products.<br />
<br />
== History ==<br />
<br />
The study introduces and references the following pooling methods, listed chronologically:<br />
* Manual subsampling in 1979<br />
* Max pooling in 1992<br />
* Mixed pooling in 2014<br />
* Pooling methods with probabilistic approaches in 2014 and 2015<br />
<br />
== Background ==<br />
Average Pooling and Max Pooling are well-known pooling methods and are popular techniques used in the literature. These pooling methods reduce input data dimensionality by taking the maximum value or the average value of specific areas and condense them into one single value. While these methods are simple and effective, they still have some limitations. The authors identify the following limitations:<br />
<br />
'''Limitations of Max Pooling and Average Pooling'''<br />
<br />
'''Max pooling''': takes the maximum value of a region <math>R_{ij} </math> and selects it to obtain a condensed feature map. It can '''erase the details''' of the image (happens if the main details have less intensity than the insignificant details) and also commonly '''over-fits''' the training data. The max-pooling is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = max_{(p,q)\in R_{ij}}(a_{kpq})<br />
\end{align}<br />
<br />
'''Average pooling''': calculates the average value of a region and selects it to obtain a condensed feature map. Depending on the data, this method can '''dilute pertinent details''' from an image (this happens for data with values much lower than the significant details). Average pooling is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = \frac{1}{|R_{ij}|}\sum_{(p,q)\in R_{ij}}{{a_{kpq}}}<br />
\end{align}<br />
<br />
Where <math>a_{kij}</math> is the output activation of the <math>k^{th}</math> feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is the input activation at<br />
<math>(p,q)</math> within <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of the pooling region. Figure 2 provides an example of the weaknesses of these two methods using toy images:<br />
<br />
[[File: fig0001.PNG| 700px|center]]<br />
<br />
<br />
'''How do researchers try to combat these issues?'''<br />
By using '''probabilistic pooling methods''' such as:<br />
<br />
1. '''Mixed pooling''': In general, when facing a new problem for which one wants to use a CNN, it is not obvious whether average or max pooling is preferable. Notably, both techniques have significant drawbacks. Average pooling forces the network to consider low-magnitude (and possibly irrelevant) information in constructing representations, while max pooling can force the network to ignore fundamental differences between neighbouring groups of pixels. To counteract this, mixed pooling probabilistically decides which of the two to use during training / testing. It should be noted that, for training, the choice is probabilistic only in the forward pass; during back-propagation the network uses the method chosen in the forward pass. Mixed pooling can be applied in 3 different ways.<br />
<br />
* For all features within a layer<br />
* Mixed between features within a layer<br />
* Mixed between regions for different features within a layer<br />
<br />
Mixed Pooling is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = \lambda \cdot max_{(p,q)\in R_{ij}}(a_{kpq})+(1-\lambda) \cdot \frac{1}{|R_{ij}|}\sum_{(p,q)\in R_{ij}}{{a_{kpq}}}<br />
\end{align}<br />
<br />
Where <math>\lambda</math> is a random value 0 or 1, indicating max or average pooling.<br />
<br />
2. '''Stochastic pooling''': improves upon max pooling by randomly sampling from neighborhood regions based on the probability values of each activation. This is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = a_l ~ \text{where } ~ l\sim P(p_1,p_2,...,p_{|R_{ij}|})<br />
\end{align}<br />
<br />
with probability of activations within each region defined as follows:<br />
<br />
\begin{align}<br />
p_{pq} = \dfrac{a_{pq}}{\sum_{(p,q) \in R_{ij}} a_{pq}}<br />
\end{align}<br />
<br />
The figure below describes the process of Stochastic Pooling. The figure on the left shows the activations of a given region, and the corresponding probabilities are shown in the center. The activation with the highest probability is the most likely to be selected by the pooling method; however, any activation can be selected. In this case, the midrange activation with a probability of 13% is selected. <br />
<br />
[[File: stochastic pooling.jpeg| 700px|center]]<br />
<br />
As stochastic pooling is based on probability and is not deterministic, it avoids the shortcomings of max and average pooling and enjoys some of the advantages of max pooling.<br />
<br />
3. '''Top-k activation pooling''': picks the top-k activations in every pooling region, so that as much information as possible passes through the subsampling gates. It is used together with max pooling: after max pooling, to further improve the representation capability, the top-k activations are picked, summed, and the sum is constrained by a constant. <br />
Details in this paper: https://www.hindawi.com/journals/wcmc/2018/8196906/<br />
<br />
'''Wavelets and Wavelet Transform'''<br />
A wavelet is a short, localized wave used to represent a signal. In wavelet transforms, a signal is generally represented as a combination of such basis functions.<br />
<br />
The wavelet transform involves taking the inner product of a signal (in this case, the image), with these basis functions. This produces a set of coefficients for the signal. These coefficients can then be quantized and coded in order to compress the image.<br />
<br />
One issue of note is that wavelets offer a tradeoff between resolution in frequency and resolution in time (or, for images, presumably location). For example, a sine wave is useful for detecting signals at its own frequency, but cannot detect where along the signal this alignment occurs. Thus, basis functions must be chosen with this tradeoff in mind.<br />
<br />
Source: Compressing still and moving images with wavelets<br />
<br />
== Proposed Method ==<br />
<br />
The proposed pooling method uses wavelets (i.e. small waves - generally used in signal processing) to reduce the dimensions of the feature maps. They use wavelet transform to minimize artifacts resulting from neighborhood reduction. They postulate that their approach, which discards the first-order sub-bands, more organically captures the data compression. The authors say that this organic reduction therefore lessens the creation of jagged edges and other artifacts that may impede correct image classification.<br />
<br />
* '''Forward Propagation'''<br />
<br />
The proposed wavelet pooling scheme pools features by performing a 2nd order decomposition in the wavelet domain according to the fast wavelet transform (FWT) which is a more efficient implementation of the two-dimensional discrete wavelet transform (DWT) as follows:<br />
<br />
\begin{align}<br />
W_{\varphi}[j+1,k] = h_{\varphi}[-n]*W_{\varphi}[j,n]|_{n=2k,k\leq0}<br />
\end{align}<br />
<br />
\begin{align}<br />
W_{\psi}[j+1,k] = h_{\psi}[-n]*W_{\varphi}[j,n]|_{n=2k,k\leq0}<br />
\end{align}<br />
<br />
where <math>\varphi</math> is the approximation function, <math>\psi</math> is the detail function, and <math>W_{\varphi},W_{\psi}</math> are called the approximation and detail coefficients. <math>h_{\varphi}[-n]</math> and <math>h_{\psi}[-n]</math> are the time-reversed scaling and wavelet vectors, <math>n</math> represents the sample in the vector, and <math>j</math> denotes the resolution level.<br />
<br />
When using the FWT on images, it is applied twice (once on the rows, then again on the columns). By doing this in combination, the detail sub-bands (LH, HL, HH) at each decomposition level and the approximation sub-band (LL) for the highest decomposition level are obtained.<br />
After performing the 2nd order decomposition, the image features are reconstructed, but only using the 2nd order wavelet sub-bands. This method pools the image features by a factor of 2 using the inverse FWT (IFWT) which is based off the inverse DWT (IDWT).<br />
<br />
\begin{align}<br />
W_{\varphi}[j,k] = h_{\varphi}[-n]*W_{\varphi}[j+1,n]+h_{\psi}[-n]*W_{\psi}[j+1,n]|_{n=\frac{k}{2},k\leq0}<br />
\end{align}<br />
<br />
[[File: wavelet pooling forward.PNG| 700px|center]]<br />
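<br />
The forward pass described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' MatConvNet implementation: it assumes the Haar wavelet with orthonormal scaling, a single feature map stored as a list of rows, and height and width divisible by 4. All function names are hypothetical.<br />

```python
import math

S = 1.0 / math.sqrt(2.0)  # orthonormal Haar weight (assumed normalization)


def transpose(m):
    return [list(col) for col in zip(*m)]


def dwt_rows(m):
    """Haar analysis along each row: returns the (low, high) half-width maps."""
    low = [[S * (r[2 * k] + r[2 * k + 1]) for k in range(len(r) // 2)] for r in m]
    high = [[S * (r[2 * k] - r[2 * k + 1]) for k in range(len(r) // 2)] for r in m]
    return low, high


def idwt_rows(low, high):
    """Inverse of dwt_rows: interleaves the reconstructed sample pairs."""
    out = []
    for lo, hi in zip(low, high):
        row = []
        for a, d in zip(lo, hi):
            row += [S * (a + d), S * (a - d)]
        out.append(row)
    return out


def dwt2(m):
    """One-level 2D Haar DWT: rows first, then columns."""
    low, high = dwt_rows(m)
    ll, lh = (transpose(x) for x in dwt_rows(transpose(low)))
    hl, hh = (transpose(x) for x in dwt_rows(transpose(high)))
    return ll, lh, hl, hh


def idwt2(ll, lh, hl, hh):
    """One-level 2D Haar inverse DWT: columns first, then rows."""
    low = transpose(idwt_rows(transpose(ll), transpose(lh)))
    high = transpose(idwt_rows(transpose(hl), transpose(hh)))
    return idwt_rows(low, high)


def wavelet_pool(feature):
    ll1, _lh1, _hl1, _hh1 = dwt2(feature)  # first level; its detail sub-bands are discarded
    ll2, lh2, hl2, hh2 = dwt2(ll1)         # second level, computed on LL1
    return idwt2(ll2, lh2, hl2, hh2)       # rebuild from the level-2 sub-bands only


feature = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
pooled = wavelet_pool(feature)  # a 2x2 map, approximately [[7, 11], [23, 27]]
```

Because the Haar pair reconstructs perfectly, the output of this sketch coincides with the first-level LL sub-band, which has half the height and width of the input.<br />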
<br />
<br />
* '''Backpropagation'''<br />
<br />
The proposed wavelet pooling algorithm performs backpropagation by reversing the process of its forward propagation. First, the image feature being backpropagated undergoes 1st order wavelet decomposition. After decomposition, the detail coefficient sub-bands are up-sampled by a factor of 2 to create a new 1st level decomposition. The initial decomposition then becomes the 2nd level decomposition. Finally, this new 2nd order wavelet decomposition reconstructs the image feature for further backpropagation using the IDWT. Figure 5 illustrates the wavelet pooling backpropagation algorithm in detail:<br />
<br />
[[File:wavelet pooling backpropagation.PNG| 700px|center]]<br />
<br />
== Results and Discussion ==<br />
<br />
All experiments have been performed using the MatConvNet (Vedaldi & Lenc, 2015) toolbox. Stochastic gradient descent has been used for training. For the proposed method, the Haar wavelet has been chosen as the basis wavelet for its property of having even, square sub-bands. All CNN structures except the one for MNIST use a network loosely based on Zeiler's network (Zeiler & Fergus, 2013). The experiments are repeated with Dropout (Srivastava, 2013), and the Local Response Normalization (Krizhevsky, 2009) is replaced with Batch Normalization (Ioffe & Szegedy, 2015) for CIFAR-10 and SVHN (Dropout only) to examine how these regularization techniques change the pooling results. The authors have tested the proposed method on four different datasets as shown in the figure:<br />
<br />
[[File: selection of image datasets.PNG| 700px|center]]<br />
<br />
Different methods based on Max, Average, Mixed, Stochastic and Wavelet have been used at the pooling section of each architecture. Accuracy and Model Energy have been used as the metrics to evaluate the performance of the proposed methods. These have been evaluated and their performances have been compared on different data-sets.<br />
<br />
* MNIST:<br />
<br />
The network architecture is based on the example MNIST structure from MatConvNet, with batch normalization inserted. All other parameters are the same. The figure below shows their network structure for the MNIST experiments.<br />
<br />
[[File: CNN MNIST.PNG| 700px|center]]<br />
<br />
The input training data and test data come from the MNIST database of handwritten digits. The full training set of 60,000 images is used, as well as the full testing set of 10,000 images. The table below shows that their proposed method outperforms all other methods. Given the small number of epochs, max pooling is the only method to start to over-fit the data during training. Mixed and stochastic pooling show a rocky trajectory but do not over-fit. Average and wavelet pooling show a smoother descent in learning and error reduction. The figure below shows the energy of each method per epoch.<br />
<br />
[[File: MNIST pooling method energy.PNG| 700px|center]]<br />
<br />
<br />
The accuracies for both paradigms are shown below:<br />
<br />
<br />
[[File: MNIST perf.PNG| 700px|center]]<br />
<br />
* CIFAR-10:<br />
<br />
The authors perform two sets of experiments with the pooling methods. The first is a regular network structure with no dropout layers. They use this network to observe each pooling method without extra regularization. The second uses dropout and batch normalization and performs over 30 more epochs to observe the effects of these changes. <br />
<br />
[[File: CNN CIFAR.PNG| 700px|center]]<br />
<br />
The input training and test data come from the CIFAR-10 dataset. <br />
The full training set of 50,000 images is used, as well as the full testing set of 10,000 images. For both cases, with and without dropout, the tables below show that the proposed method has the second highest accuracy.<br />
<br />
[[File: fig0000.jpg| 700px|center]]<br />
<br />
Max pooling over-fits fairly quickly, while wavelet pooling resists over-fitting. The change in learning rate prevents their method from over-fitting, and it continues to show a slower propensity for learning. Mixed and stochastic pooling maintain a consistent progression of learning, and their validation sets trend at a similar, but better rate than their proposed method. Average pooling shows the smoothest descent in learning and error reduction, especially in the validation set. The energy of each method per epoch is also shown below:<br />
<br />
[[File: CIFAR_pooling_method_energy.PNG| 700px|center]]<br />
<br />
<br />
* SVHN:<br />
<br />
Two sets of experiments are performed with the pooling methods. The first is a regular network structure with no dropout layers. As with the previous datasets, they use this network to observe each pooling method without extra regularization.<br />
The second network uses dropout to observe the effects of this change. The figure below shows their network structure for the SVHN experiments:<br />
<br />
[[File: CNN SHVN.PNG| 700px|center]]<br />
<br />
The input training and test data come from the SVHN dataset. For the case with no dropout, they use 55,000 images from the training set. For the case with dropout, they use the full training set of 73,257 images, a validation set of 30,000 images that they extract from the extra training set of 531,131 images, and the full testing set of 26,032 images. For both cases, with and without dropout, the tables below show that their proposed method has the second lowest accuracy.<br />
<br />
[[File: SHVN perf.PNG| 700px|center]]<br />
<br />
Max and wavelet pooling both slightly over-fit the data. Their method follows the path of max pooling but performs slightly better in maintaining some stability. Mixed, stochastic, and average pooling maintain a slow progression of learning, and their validation sets trend at near identical rates. The figure below shows the energy of each method per epoch.<br />
<br />
[[File: SHVN pooling method energy.PNG| 700px|center]]<br />
<br />
* KDEF:<br />
<br />
They run one set of experiments with the pooling methods that includes dropout. The figure below shows their network structure for the KDEF experiments:<br />
<br />
[[File:CNN KDEF.PNG| 700px|center]]<br />
<br />
The input training and test data come from the KDEF dataset. This dataset contains 4,900 images of 35 people displaying seven basic emotions (afraid, angry, disgusted, happy, neutral, sad, and surprised) using facial expressions. They display emotions at five poses (full left and right profiles, half left and right profiles, and straight).<br />
<br />
This dataset contains a few errors that they have fixed (missing or corrupted images, uncropped images, etc.). All of the missing images are at angles of -90, -45, 45, or 90 degrees. They fix the missing and corrupt images by mirroring their counterparts in MATLAB and adding them back to the dataset. They manually crop the images that do not yet match the dimensions set by the creators (762 x 562).<br />
KDEF does not designate a training or test data set. They shuffle the data and separate 3,900 images as training data, and 1,000 images as test data. They resize the images to 128x128 because of memory and time constraints.<br />
<br />
The dropout layers regulate the network and maintain stability in spite of some pooling methods known to over-fit. The table below shows their proposed method has the second highest accuracy. Max pooling eventually over-fits, while wavelet pooling resists over-fitting. Average and mixed pooling resist over-fitting but are unstable for most of the learning. Stochastic pooling maintains a consistent progression of learning. Wavelet pooling also follows a smoother, consistent progression of learning.<br />
The figure below shows the energy of each method per epoch.<br />
<br />
[[File: KDEF pooling method energy.PNG| 700px|center]]<br />
<br />
The accuracies for both paradigms are shown below:<br />
<br />
[[File: KDEF perf.PNG| 700px|center]]<br />
<br />
<br />
<br />
* Computational Complexity:<br />
The above experiments and implementations of wavelet pooling were more of a proof of concept than an optimized method. In terms of mathematical operations, wavelet pooling is the least computationally efficient of all the pooling methods mentioned above. Among these methods, average pooling is the most efficient, max pooling and mixed pooling are at a similar level, and wavelet pooling is far more expensive to compute.<br />
<br />
== Conclusion ==<br />
<br />
They show that wavelet pooling has the potential to equal or eclipse some of the traditional methods currently utilized in CNNs. Their proposed method outperforms all others on the MNIST dataset, outperforms all but one on the CIFAR-10 and KDEF datasets, and performs within a respectable range of the pooling methods that outdo it on the SVHN dataset. The addition of dropout and batch normalization shows their proposed method's response to network regularization. As in the non-dropout cases, it outperforms all but one on both the CIFAR-10 and KDEF datasets and performs within a respectable range of the pooling methods that outdo it on the SVHN dataset.<br />
<br />
== Suggested Future work ==<br />
<br />
The upsampling and downsampling factors in decomposition and reconstruction need to be changed to achieve more feature reduction.<br />
The subbands that are currently discarded should be kept to achieve higher accuracy. To achieve higher computational efficiency, the FWT method needs to be improved.<br />
<br />
== Critiques and Suggestions ==<br />
* The backpropagation process, which could be a strong point of the study, is not described in enough detail compared to the existing methods.<br />
* The study centers on wavelet decomposition, yet the reasons for choosing Haar as the mother wavelet and for the chosen number of decomposition levels are not explained and are only mentioned as future work.<br />
* At the beginning, the study mentions that pooling methods do not receive the attention they deserve. In the end, the results show that the choice of pooling method depends on the dataset, and the authors mention trial and error as a reasonable approach to choosing the pooling method. In my view, the authors have not really focused on providing a pooling method that effectively improves on the current situation. At the least, trying to extract a clearer pattern relating the results to the dataset structure would have been helpful.<br />
* The origin of average pooling, which is mentioned as the main pooling algorithm to compare with, is not even referenced in the introduction.<br />
* A combination of wavelet, max and average pooling could be an interesting option to investigate further, both in sequence (max/average after wavelet pooling) and combined as in mixed pooling.<br />
* While the current datasets express the performance of the proposed method in an appropriate way, it could be a good idea to evaluate the method using some larger datasets. This may help to understand whether the size of a dataset affects the overfitting behavior of max pooling that is mentioned in the paper.<br />
* Adding asymptotic notation to the computational complexity of the proposed algorithm would be meaningful, particularly since the given results are for a single, fixed input size (one image in forward propagation) and consequently are not generalizable. <br />
* They could have considered comparing against the Fast Fourier Transform (FFT). A non-wavelet transform seems to be an obvious candidate for comparison.<br />
* Going beyond the 2x2 pooling window would have further supported their method.<br />
* ([https://openreview.net/forum?id=rkhlb8lCZ]) The experiments are largely conducted with very small scale datasets. As a result, I am not sure if they are representative enough to show the performance difference between different pooling methods.<br />
* ([https://openreview.net/forum?id=rkhlb8lCZ]) No comparison to non-wavelet methods. For example, one obvious comparison would have been to look at using a DCT or FFT transform where the output would discard high-frequency components (this can get very close to the wavelet idea!).<br />
<br />
== References ==<br />
<br />
Williams, Travis, and Robert Li. "Wavelet Pooling for Convolutional Neural Networks." (2018).<br />
<br />
Hilton, Michael L., Björn D. Jawerth, and Ayan Sengupta. "Compressing still and moving images with wavelets." Multimedia systems 2.5 (1994): 218-227.<br />
<br />
<br />
== Revisions == <br />
<br />
*Two reviewers really liked the paper, and one of them placed it in the top 15% of papers at the conference, which supports the novelty and potential of the idea. One other reviewer, however, believed that it was not good enough to be accepted; the main reason given for rejection was the linear nature of wavelets (which was not convincingly described). <br />
<br />
*The main concern of two of the reviewers has been the size of the datasets that have been used to test the method and the authors have mentioned future works concerning bigger datasets to test the method.<br />
<br />
*The computational cost section was not included in the paper initially and was added in response to one reviewer's concern. The other reviewers did not raise this point, and unfortunately there is no comment on it from them. However, the description of the inefficient implementation seemed to satisfy that reviewer, which resulted in the paper being accepted. <br />
<br />
[https://openreview.net/forum?id=rkhlb8lCZ Revisions]<br />
<br />
At the end, if you are interested in implementing the method, they are willing to share their code but after making it efficient. So, maybe there will be another paper regarding less computational cost on larger datasets with a publishable code.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Towards_Image_Understanding_From_Deep_Compression_Without_Decoding&diff=42169stat946w18/Towards Image Understanding From Deep Compression Without Decoding2018-12-01T00:17:54Z<p>Z43ma: </p>
<hr />
<div>Paper Title: Towards Image Understanding from Deep Compression Without Decoding - ICLR 2018<br />
<br />
Presented By: Aravind Ravi<br />
<br />
== Introduction ==<br />
Recent advances in the deep neural network (DNN) based image compression methods have shown potential improvements in image quality, savings in storage and bandwidth reduction. These methods leverage common neural network architectures such as convolutional autoencoders or recurrent neural networks to compress and reconstruct RGB images and outperform classical techniques such as JPEG2000 and BPG on perceptual metrics such as structural similarity index (SSIM) and multi-scale structural similarity index (MS-SSIM).<br />
<br />
These approaches encode an image <math>x </math> to some feature map (compressed representation), which is subsequently quantized to a set of symbols <math>z </math>. These symbols are then losslessly compressed to a bitstream, from which a decoder reconstructs an image <math>{\hat{x}} </math>, of the same dimensions as <math>x </math>.<br />
<br />
Learned compression algorithms have an advantage over engineered compression algorithms in that they can be adapted to specific domains much more easily. For example, a learned compression algorithm might be able to achieve good performance on compressing medical images without the algorithm being specifically tuned.<br />
<br />
In this paper, the authors explore the idea of applying the learned representations to perform inference without reconstructing the compressed image. Specifically, instead of reconstructing an RGB image from the compressed representation and feeding it to a network for inference, the paper proposes to use a modified network that bypasses reconstruction of the RGB image.<br />
<br />
The rationale behind this approach is that the neural network architectures commonly used for learned compression (in particular the encoders) are similar to the ones commonly used for inference, and learned image encoders are hence, in principle, capable of extracting features relevant for inference tasks. The encoder might learn features relevant for inference purely by training on the compression task, and can be forced to learn these features by training on the compression and inference tasks jointly.<br />
<br />
The advantage of learning an encoder for image compression which produces compressed representation containing features relevant for inference is obvious in scenarios where images are transmitted (e.g. from a mobile device) before processing (e.g. in the cloud), as it saves reconstruction of the RGB image as well as part of the feature extraction and hence speeds up processing. A typical use case is a cloud photo storage application where every image is processed immediately upon upload for indexing and search purposes.<br />
<br />
Note: [https://en.wikipedia.org/wiki/Structural_similarity More Information on SSIM, MSSIM]<br />
<br />
== Intuition ==<br />
<br />
Compression techniques (something as common as zipping) are used routinely in day-to-day file handling tasks. Most often these are engineered compression techniques. Deep Neural Networks (DNNs) are nonlinear function approximators which act as feature extractors, extracting features from inputs (like images or sound files). They can be seen as learning-based compression techniques, since they can perform compression and can be trained using backpropagation. If image classification can be done on these compressed files, large image data sets like hyperspectral images and MRI images can be stored efficiently and the compressed files can be used directly by the DNNs for classification or reinforcement learning tasks.<br />
<br />
==Motivation and Contributions==<br />
The authors propose to perform image understanding tasks such as image classification and segmentation directly on DNN based compressed representations. Performing the image understanding tasks on the compressed representations/encoded feature maps has two advantages. <br />
# This method bypasses the process of decoding the image into the RGB space before classification.<br />
# The authors show that it reduces the overall computational complexity by up to a factor of 2.<br />
<br />
=== Contributions of the Paper ===<br />
* A method to perform image classification and semantic segmentation from compressed representations. In large scale image understanding problems, learning from a compressed representation is definitely something that is interesting. <br />
* The proposed method offers classification accuracy similar to that achieved on decompressed images while reducing the computational complexity by a factor of 2.<br />
* Semantic segmentation has been shown to be as accurate as on decompressed images for moderate compression rates and more accurate for aggressive compression rates. In addition, this method achieves lower computational complexity.<br />
* Joint training for image compression and classification has been shown to improve the quality of the image and to increase the accuracy of classification and segmentation.<br />
<br />
==Related Work==<br />
<br />
Prior work has shown image classification from compressed images based on engineered codecs. Some of the works in this area are:<br />
<br />
* In video analysis domain: Action recognition (Yeo et al., 2008; Kantorov & Laptev, 2014)<br />
* Classification of compressed hyperspectral images (Hahn et al., 2014; Aghagolzadeh & Radha, 2015)<br />
* Discrete Cosine Transform based compression performed on images before feeding into a neural network, which shows an improvement in training speed by up to 10 times (Fu & Guimaraes, 2016)<br />
* Video analysis on compressed video (using engineered codecs) has also been studied in the past (Babu et al., 2016)<br />
* Criticism on document image analysis methods (Javed et al., 2017)<br />
<br />
The authors propose a method that does inference on top of learned feature representation and hence has a direct relation to unsupervised feature learning using autoencoders.<br />
They also claim that so far there hasn't been any work using learned compressed representations for image classification and segmentation.<br />
<br />
==Learned Deeply Compressed Representations==<br />
<br />
The image compression task is performed based on a convolutional autoencoder architecture proposed by Theis et al. (2017) (shown in the figure below), and a variant of the training procedure described by Agustsson et al. (2017). <br />
<br />
[[File:AR_theisAutoencoder.png|600px|center]]<br />
<br />
Some points to better understand the architecture:<br />
<br />
1. Most convolutions are done in a convolved, lower-dimensional space to speed up computation<br />
<br />
2. Different activation functions are used. Blank arrows indicate the identity function (no additional nonlinearity), while black arrows indicate leaky rectifications<br />
<br />
3. The “round” box simply rounds all elements in the tensor to the nearest integer<br />
<br />
4. The “subpix” block is just an upsampling/reconstruction block where the feature map’s coefficients are reshuffled after a convolution<br />
<br />
<br />
<br />
=== Compression Architecture ===<br />
<br />
The compression network is an autoencoder that takes an input image <math>x </math> and outputs <math>{\hat{x}} </math> as the approximation to the input. <br />
<br />
[[File:AR_Fig2a.png|300px|center]]<br />
<br />
The encoder has the following structure: It starts with 2 convolutional layers with spatial subsampling by a factor of 2, followed by 3 residual units, and a final convolutional layer with spatial subsampling by a factor of 2. This results in a <math>w/8</math> x <math>h/8</math> x <math>C</math> dimensional representation, where <math>w </math> and <math>h </math> are the spatial dimensions of <math>x </math>, and the number of channels C is a hyperparameter related to the rate <math>R </math>. This representation is then quantized to a discrete set of symbols, forming a compressed representation, <math>z </math>.<br />
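<br />
As a small arithmetic sketch of the encoder output size, assuming that the input dimensions are divisible by 8, the helper below applies the three factor-of-2 subsampling stages. The name <code>compressed_shape</code> is hypothetical.<br />

```python
def compressed_shape(width, height, channels):
    """Shape of the compressed representation after three stride-2 subsampling stages."""
    factor = 2 ** 3  # three convolutions, each subsampling by a factor of 2
    return width // factor, height // factor, channels


print(compressed_shape(224, 224, 32))  # (28, 28, 32)
```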
<br />
To get the reconstruction <math>{\hat{x}} </math>, the compressed representation is fed into the decoder, which mirrors the encoder, but uses upsampling and deconvolutions instead of subsampling and convolutions.<br />
<br />
Quantizing the compressed representation imposes a distortion <math>D </math> on <math>{\hat{x}} </math> w.r.t. <math>x </math>, i.e., it increases the reconstruction error. This is traded for a decrease in entropy of the quantized compressed representation <math>z </math>, which leads to a decrease in the length of the bitstream as measured by the rate <math>R </math>. Thus, to train the image compression network, the classical rate-distortion trade-off <math>D + \beta R</math> is minimized. As a metric for <math>D </math>, the mean squared error (MSE) between <math>x </math> and <math>{\hat{x}} </math> is used, and <math>R</math> is estimated using <math>H(q)</math>. <math>H(q)</math> is the entropy of the probability distribution over the symbols and is estimated using a histogram of the probability distribution (as done by Agustsson et al., 2017). The trade-off between MSE and the entropy is controlled by adjusting <math>\beta </math>. For each <math>\beta </math>, an operating point is derived where the images have a certain bit rate, as measured by bits per pixel (bpp), and a corresponding MSE. To better control the bpp, the authors introduce a target entropy <math>H_t</math> and formulate the loss as:<br />
<br />
\begin{align}<br />
\mathcal{L_c} = \text{MSE}(x,{\hat{x}})+\beta\max({H(q)}-{H_t},0)<br />
\end{align}<br />
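<br />
A minimal sketch of this loss on flat lists of values is given below. It is illustrative only: it assumes a hard histogram over already quantized symbols and entropy measured in bits, whereas training requires a differentiable estimate. All function names are hypothetical.<br />

```python
import math
from collections import Counter


def histogram_entropy(symbols):
    """Entropy (in bits, an assumed unit) of the empirical symbol distribution."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mse(x, x_hat):
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def compression_loss(x, x_hat, symbols, beta, target_entropy):
    # The rate term is only penalized while the entropy exceeds the target H_t
    rate_penalty = max(histogram_entropy(symbols) - target_entropy, 0.0)
    return mse(x, x_hat) + beta * rate_penalty


x = [0.0, 1.0, 2.0, 3.0]
x_hat = [0.0, 1.0, 2.0, 5.0]  # MSE = 1.0
z = [0, 0, 1, 1]              # two equally likely symbols: entropy = 1 bit
print(compression_loss(x, x_hat, z, beta=2.0, target_entropy=0.5))  # 2.0
```

Once the entropy drops below the target, the penalty vanishes and only the reconstruction error is optimized.<br />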
<br />
Agustsson et al. (2017) proposed a differentiable approximation to the quantization step to overcome its non-differentiability. This method has been adapted to suit the current application in the paper.<br />
<br />
Three operating points at 0.0983 bpp (C=8), 0.330 bpp (C=16), and 0.635 bpp (C=32) are obtained empirically. All further experiments are performed with these three operating points and the results for the same are presented in the following sections.<br />
<br />
==Image Classification from Compressed Representations==<br />
<br />
=== Classification on RGB Images ===<br />
<br />
For the image classification task based on the RGB images, the authors use the ResNet-50 architecture. <br />
Further information on residual networks can be found in the following links: <br />
[https://youtu.be/K0uoBKBQ1gA ResNets Part-1]<br />
[https://youtu.be/GSsKdtoatm8 ResNets Part-2]<br />
<br />
The details of the architecture are presented in the table below:<br />
<br />
[[File:AR_Tab1.png|400px|center]]<br />
<br />
In this paper, the number of 14x14 (conv4_x) blocks has been modified to obtain a new architecture called ResNet-71. <br />
<br />
=== Classification on Compressed Representations ===<br />
<br />
For input images with spatial dimension 224x224, the encoder of the compression network outputs a compressed representation with dimensions 28x28xC, where C is the number of channels. To use this compressed representation as input to the classification network, a simple variant of the ResNet architecture is proposed. This variant is referred to as cResNet-k, where c stands for “compressed representation” and k is the number of convolutional layers in the network. These networks are constructed by simply “cutting off” the front of the regular (RGB) ResNet. The root-block of the network and the residual layers that have a larger spatial dimension than 28x28 are removed. To adjust the number of layers k, the ResNet architecture proposed by He et al. (2015) is used and the number of 14x14 (conv4_x) residual blocks is modified.<br />
<br />
In this way, three different architectures are derived:<br />
* cResNet-39 is ResNet-50 with the first 11 layers removed as described above, and this significantly reduces computational cost<br />
* cResNet-51<br />
* cResNet-72<br />
<br />
cResNet-51 and cResNet-72 are obtained by adding 14x14 residual blocks to match the computational cost of ResNet-50 and ResNet-71 respectively.<br />
<br />
Detailed descriptions of all the network architectures are presented below:<br />
<br />
[[File:AR_Tab3.png|600px|center]]<br />
<br />
==Semantic Segmentation from Compressed Representations==<br />
<br />
For semantic segmentation, the ResNet-based DeepLab architecture is adapted for the proposed application. The cResNet and ResNet image classification architectures are re-purposed with atrous convolutions, where the filters are upsampled instead of downsampling the feature maps. This is done to increase their receptive field and to prevent aggressive subsampling of the feature maps. For segmentation, the ResNet architecture is restructured such that the output feature map has 8 times smaller spatial dimensions than the original RGB image (instead of subsampling by a factor of 32 as for classification). When using the cResNets, the output feature map has the same spatial dimensions as the input compressed representation (instead of subsampling by a factor of 4 as for classification). This results in comparably sized feature maps for both the compressed representation and the reconstructed RGB images. Finally, the last 1000-way classification layer of these classification architectures is replaced by an atrous spatial pyramid pooling (ASPP) with four parallel branches with rates {6, 12, 18, 24}, which provides the final pixel-wise classification.<br />
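<br />
To illustrate how an atrous (dilated) filter enlarges the receptive field without subsampling, here is a one-dimensional sketch. It is a simplified analogue of the 2D convolutions used in DeepLab, not the actual implementation, and the function name <code>atrous_conv1d</code> is hypothetical.<br />

```python
def atrous_conv1d(signal, kernel, rate):
    """'Valid' 1D correlation whose kernel taps are spaced `rate` samples apart."""
    span = (len(kernel) - 1) * rate + 1  # receptive field of a single output value
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(w * signal[start + i * rate] for i, w in enumerate(kernel)))
    return out


signal = list(range(10))
print(atrous_conv1d(signal, [1, 1, 1], rate=1))  # each output covers 3 neighbouring samples
print(atrous_conv1d(signal, [1, 1, 1], rate=2))  # same 3 weights, but each output spans 5 samples
```

The number of weights stays the same for both rates; only the spacing between the taps, and therefore the span of input each output sees, changes.<br />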
<br />
==Joint Training for Compression and Image Classification==<br />
<br />
The authors propose a joint training strategy to combine compression and classification tasks. To do this, the proposed method combines the compression network and the cResNet-51 architecture. The figure below shows the combined pipeline:<br />
<br />
[[File:AR_Fig2b.png|300px|center]]<br />
<br />
All parts, encoder, decoder, and inference network, are trained at the same time. The compressed representation is fed<br />
to the decoder to optimize for mean-squared reconstruction error and to a cResNet-51 network to<br />
optimize for classification using a cross-entropy loss. The combined loss function takes the form:<br />
<br />
\begin{align}<br />
\mathcal{L_c} = \gamma(\text{MSE}(x,{\hat{x}})+\beta\max({H(q)}-{H_t},0))+l_{ce}(y,{\hat{y}})<br />
\end{align}<br />
<br />
where the loss terms for the compression network, <math> \mathcal{L_c} = \text{MSE}(x,{\hat{x}})+\beta\max({H(q)}-{H_t},0)</math>, are the same as in training for compression only. <math> l_{ce}</math> is the cross-entropy loss for classification.<br />
<math>\gamma </math> controls the trade-off between the compression loss and the classification loss.<br />
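<br />
The combined objective can be sketched as follows, assuming that the compression loss has already been computed as a scalar and that the classifier outputs a list of class probabilities. The names <code>cross_entropy</code> and <code>joint_loss</code> are hypothetical.<br />

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability that the classifier assigns to the true class."""
    return -math.log(probs[label])


def joint_loss(compression_loss, probs, label, gamma):
    # gamma scales the compression term relative to the classification term
    return gamma * compression_loss + cross_entropy(probs, label)


loss = joint_loss(2.0, [0.25, 0.5, 0.25], label=1, gamma=0.1)  # 0.1 * 2.0 + ln(2)
```

A larger <math>\gamma </math> puts more weight on reconstruction quality and rate, while a smaller one favors classification accuracy.<br />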
<br />
==Experiments and Results==<br />
<br />
=== Learned Deeply Compressed Representations Results ===<br />
<br />
All experiments have been performed on the ILSVRC2012 dataset.<br />
<br />
The metrics used to measure the compression quality are as follows: <br />
* PSNR (Peak Signal-to-Noise Ratio) is a standard measure, depending monotonically on mean squared error defined as: <br />
<br />
\begin{align}<br />
PSNR = 10(\log_{10}(255^2/MSE))<br />
\end{align}<br />
<br />
* SSIM (Structural Similarity Index) and MS-SSIM (Multi-Scale SSIM) are metrics proposed to measure the similarity of images as perceived by humans<br />
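<br />
The PSNR formula above can be sketched for 8-bit images stored as flat lists of pixel values. The function name <code>psnr</code> and the handling of identical images are assumptions made for illustration.<br />

```python
import math


def psnr(x, x_hat, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    if mse == 0:
        return math.inf  # identical images: the ratio is unbounded
    return 10.0 * math.log10(peak ** 2 / mse)


print(psnr([10, 20], [11, 19]))  # MSE = 1, so roughly 48.13 dB
```

Since PSNR decreases monotonically as the MSE grows, higher values indicate a reconstruction closer to the original.<br />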
<br />
The figure below depicts the performance of the deep compression models vs. standard JPEG and JPEG2000. Higher values are better. The proposed technique outperforms JPEG and JPEG2000 at the operating points used in this paper.<br />
<br />
[[File:AR_Fig8.png|600px|center]]<br />
<br />
The learned compressed representations are illustrated in the figure below. <br />
<br />
[[File:AR_Fig9.png|500px|center]]<br />
<br />
In the above figure, the original RGB image is shown along with compressed versions of the RGB image which are reconstructed from the compressed representations. The 4 channels with the highest entropy are shown in the visualizations. These visualizations indicate how the networks compress an image: as the rate (bpp) gets lower, the entropy cost of the network forces the compressed representation to use fewer quantization levels, as can clearly be seen. For the most aggressive compression, the channel maps use only 2 levels for the compressed representation.<br />
<br />
=== Classification on Compressed Representations ===<br />
<br />
All experiments have been performed on the ILSVRC2012 dataset. It consists of 1.28 million training images and 50k validation images. These images are distributed across 1000 diverse classes. For image classification, the top-1 classification accuracy and top-5 classification accuracy are reported on the validation set on 224x224 center crops for RGB images and 28x28 center crops for the compressed representation.<br />
<br />
==== Training Procedure ====<br />
<br />
The compression network is fixed while training the classification network, both when training with compressed representations and with reconstructed compressed RGB images. For the compressed representations, the output of the fixed encoder (the compressed representation) is provided as input to the cResNets (the decoder is not needed). When training on the reconstructed compressed RGB images, the output of the fixed encoder-decoder (RGB image) is provided as input to the ResNet. This is done for each operating point.<br />
<br />
Refer to Appendix A, Section A4 of the paper for details on the hyperparameters and optimization used for training the network [1].<br />
<br />
==== Classification Results ====<br />
<br />
The table below presents the classification results at each operating point, both when classifying from the compressed representation and from the corresponding reconstructed compressed RGB images.<br />
<br />
[[File:AR_Tab2.png|400px|center]]<br />
<br />
The figure below shows the validation curves for ResNet-50, cResNet-51, and cResNet-39. <br />
<br />
[[File:AR_Fig3.png|700px|center]]<br />
<br />
For the 2 classification architectures with the same computational complexity (ResNet-50 and cResNet-51), the validation curves at the 0.635 bpp compression operating point almost coincide, with ResNet-50 performing slightly better. As the rate (bpp) decreases, this performance gap shrinks. The table above shows the classification results after the different architectures have converged. At the 0.635 bpp operating point, ResNet-50 performs only 0.5% better in top-5 accuracy than cResNet-51, while at the 0.0983 bpp operating point the difference is only 0.3%.<br />
Using the same pre-processing and the same learning rate schedule but starting from the original uncompressed RGB images yields 89.96% top-5 accuracy. The top-5 accuracy obtained from the compressed representation at the 0.635 bpp operating point, 87.85%, is therefore competitive with that obtained for the original images, at a significantly lower storage cost. Specifically, at 0.635 bpp the ImageNet dataset requires 24.8 GB of storage space instead of 144 GB for the original version, a reduction by a factor of about 5.8.<br />
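<br />
The storage figures above follow from simple arithmetic on the rate. The sketch below is only an illustration: the function names are hypothetical, and the 224x224 image size is an assumption borrowed from the crop size used for classification, not the size of the stored images.<br />

```python
def compressed_size_bytes(bpp: float, width: int, height: int) -> float:
    """Approximate bitstream size in bytes of one image coded at `bpp` bits per pixel."""
    return bpp * width * height / 8.0


def storage_reduction_factor(original_gb: float, compressed_gb: float) -> float:
    """Ratio between the original and the compressed dataset size."""
    return original_gb / compressed_gb


if __name__ == "__main__":
    # The three operating points reported in the summary.
    for bpp in (0.0983, 0.330, 0.635):
        size = compressed_size_bytes(bpp, 224, 224)
        print(f"{bpp} bpp -> {size:.1f} bytes for a 224x224 image")
    # Dataset sizes quoted in the text: 144 GB original, 24.8 GB at 0.635 bpp.
    print(round(storage_reduction_factor(144.0, 24.8), 1))
```

The last line reproduces the reduction factor quoted in the text from the two dataset sizes.<br />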
<br />
Notes on top-1 and top-5 accuracy:<br />
<br />
* Top-1 accuracy: This is the conventional accuracy metric used in machine learning. An input is counted as correctly classified if its true label matches the class to which the model assigns the highest predicted probability; otherwise it is counted as incorrectly classified.<br />
* Top-5 accuracy: An input is counted as correctly classified if its true label is among the 5 classes to which the model assigns the highest predicted probabilities; otherwise it is counted as incorrectly classified.<br />
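<br />
The two metrics differ only in how many of the highest-scoring classes are checked against the true label. A minimal sketch in plain Python, where the function name and the toy scores are illustrative assumptions rather than anything from the paper:<br />

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   k: int) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    correct = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the best k.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label in top_k:
            correct += 1
    return correct / len(labels)


if __name__ == "__main__":
    # Toy example with 3 samples and 3 classes (made-up numbers).
    scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
    labels = [1, 1, 0]
    print(top_k_accuracy(scores, labels, k=1))
    print(top_k_accuracy(scores, labels, k=2))
```

With k=1 this is top-1 accuracy, and with k=5 on the 1000 ImageNet classes it is the top-5 accuracy reported above.<br />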
<br />
===Semantic Segmentation Results===<br />
<br />
All experiments have been performed on the PASCAL VOC-2012 dataset for semantic segmentation. It has 20 object foreground classes and 1 background class. The dataset<br />
consists of 1464 training and 1449 validation images. In every image, each pixel is annotated with<br />
one of the 20 + 1 classes. The original dataset is furthermore augmented with extra annotations, so the final dataset has 10,582 images for training and 1449 images for validation.<br />
<br />
Performance is measured by the pixel-wise intersection-over-union (IoU) averaged over all classes, i.e., the mean intersection-over-union (mIoU), on the validation set. <br />
<br />
[https://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/ Details on IoU]<br />
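<br />
As a rough illustration of the metric, the sketch below computes the mIoU for one flattened label map. The function name is hypothetical, and skipping classes that are absent from both prediction and ground truth is one common convention, assumed here; benchmark evaluation code typically accumulates the counts over the whole validation set instead.<br />

```python
from typing import List, Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean over classes of |pred == c AND target == c| / |pred == c OR target == c|."""
    ious: List[float] = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: ignore classes absent from both maps
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in prediction or target")
    return sum(ious) / len(ious)


if __name__ == "__main__":
    # Four pixels, two classes (made-up labels).
    pred = [0, 0, 1, 1]
    target = [0, 1, 1, 1]
    print(mean_iou(pred, target, num_classes=2))
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is their average.<br />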
<br />
==== Training Procedure ====<br />
The cResNet/ResNet networks are pre-trained on the ImageNet dataset using the procedure described earlier for the image classification task, with the encoder and decoder fixed as before. The architectures are then adapted with dilated convolutions (cResNet-d/ResNet-d) and finetuned on the semantic segmentation task.<br />
<br />
Refer to Appendix A, Section A5 of the paper for details on the hyperparameters and optimization used for training the network [1].<br />
<br />
==== Segmentation Results ====<br />
<br />
The table below shows the mIoU results for the segmentation task.<br />
<br />
[[File:AR_Tab2.png|450px|center]]<br />
<br />
The figure below illustrates the segmentation results with respect to each compression operating point.<br />
<br />
[[File:AR_Fig4.png|700px|center]]<br />
<br />
For semantic segmentation, ResNet-50-d and cResNet-51-d perform equally well at the 0.635 bpp compression operating point. At the 0.330 bpp operating point, segmentation from the compressed representation performs slightly better (by 0.37%), and at the 0.0983 bpp operating point it performs considerably better than segmentation from the reconstructed compressed RGB images (by 1.65%).<br />
<br />
[[File:AR_Fig5.png|600px|center]]<br />
<br />
The above figure shows the predicted segmentation visually for both the cResNet-51-d and the ResNet-50-d<br />
architecture at each operating point. Along with the segmentation, it also shows the original uncompressed<br />
RGB image and the reconstructed compressed RGB image. These images highlight<br />
the challenging nature of these segmentation tasks, but they can nevertheless be performed using the<br />
compressed representation. They also clearly indicate that the compression affects the segmentation,<br />
as lowering the rate (bpp) progressively removes details in the image. Comparing the segmentation<br />
from the reconstructed RGB images to the segmentation from the compressed representation visually,<br />
the performance is similar.<br />
<br />
The figure below is another example of visual results of segmentation from compressed representation and reconstructed RGB<br />
images. The performance is visually similar for all operating points except for the 0.0983<br />
bpp operating point where the reconstructed RGB image fails to capture the back part of<br />
the train, while the compressed representation manages to capture that aspect of the image in the<br />
segmentation.<br />
<br />
[[File:AR_Fig10.png|600px|center]]<br />
<br />
=== Results on Computational Gains ===<br />
<br />
[[File:AR_Fig6.png|400px|center]]<br />
<br />
=====Computational Gains on Classification=====<br />
<br />
The figure on the left shows the top-5 classification accuracy as a function of computational complexity for the 0.0983 bpp compression operating point.<br />
At a fixed computational cost, the reconstructed compressed RGB images perform about 0.25% better. At a fixed classification performance, inference from the compressed representation costs about <math>0.6 \times 10^9</math> FLOPs more. However, when the decoding cost is taken into account at a fixed classification performance, inference from the reconstructed compressed RGB images costs <math>2.2 \times 10^9</math> FLOPs more than inference from the compressed representation.<br />
<br />
=====Computational Gains on Segmentation=====<br />
<br />
The figure on the right shows the mIoU validation performance as a function of computational complexity for the 0.0983 bpp compression operating point. <br />
Here, even without accounting for the decoding cost of the reconstructed images, the compressed representation performs better. At a fixed computational cost, segmentation from the compressed representation gives about 0.7% better mIoU, and at a fixed mIoU the computational cost is about <math>3.3 \times 10^9</math> FLOPs lower for compressed representations. Accounting for the decoding cost, this difference becomes <math>6.1 \times 10^9</math> FLOPs. Due to the nature of the dilated convolutions and the increased feature map size, the relative computational gains for segmentation are not as pronounced as for classification.<br />
<br />
===Joint Training for Compression and Image Classification===<br />
<br />
==== Training Procedure ====<br />
<br />
When doing joint training, the compression network and the classification networks are first initialized<br />
from a trained state obtained as described previously. After initialization, the networks are<br />
both finetuned jointly. For a detailed<br />
description of hyperparameters used and the training schedule see Appendix A8 in the paper.<br />
<br />
To verify that the change in classification accuracy is not only due to (1) a better compression operating point or (2) the fact that the cResNet is trained for longer, the following control is performed. A new operating point is obtained by finetuning only the compression network using the schedule described above. A cResNet-51 is then trained from scratch on top of this new operating point. Finally, with the compression network fixed at the new operating point, the cResNet-51 is trained for 9 epochs. <br />
<br />
To obtain segmentation results, the jointly trained network is used. The operating point is fixed and the jointly finetuned classification network is adapted for segmentation (cResNet-51-d).<br />
<br />
==== Joint Training Results ====<br />
<br />
[[File:AR_Fig7.png|400px|center]]<br />
<br />
It can be seen from the figure that the classification and segmentation results “move up” from the baseline through finetuning. When training jointly, the improvement for classification is larger and a significant improvement for segmentation is achieved. For the 0.635 bpp operating point, the classification performance is similar whether the networks are trained jointly or only the compression network is trained, but when these operating points are used for segmentation the difference is considerable.<br />
<br />
The results presented by the authors suggest an improvement in classification by 2%, a performance gain which would<br />
require an additional 75% of the computational complexity of cResNet-51. The segmentation<br />
performance after training the networks jointly is 1.7% better in mIoU than training only<br />
the compression network.<br />
<br />
==Critique==<br />
<br />
The paper shows how previous work on autoencoders and image compression can be extended effectively to a novel combined task of image compression and recognition. The work provides extensive experimental evaluation and evidence suggesting that learned compressed representations can be effective in classification and segmentation tasks. The authors show that the proposed method can offer significant computational gains while maintaining state-of-the-art performance. Possible applications include multimedia communication, wireless transmission of images, and video surveillance on the mobile edge. With the advent of 5G and other new wireless technologies, this method can be used to conserve wireless bandwidth and save storage while retaining the perceptual quality of images.<br />
The joint training of the compression and classification networks provides some added advantages and also shows that at aggressive compression rates the performance in classification and segmentation can be improved significantly.<br />
<br />
Another critique is that the authors do not answer the fundamental question of why image understanding should be done from a compressed space. Intuitively, the learning algorithm could simply learn from the original feature space, which contains more information, so the benefit of learning from a compressed space over learning directly from the original feature space needs to be justified.<br />
<br />
The authors mention that the complexity of the current approach is still high in comparison with methods like JPEG or JPEG2000. They also mention that this can be overcome when the networks are trained and run on GPUs. Although this has been seen as a drawback, the limitation can be overcome with subsequent improvements in hardware and more specialized deep learning platforms. While the authors did thorough experiments and gave extensive results on compressed representations and their advantages, the idea itself is not very novel. Finally, in providing extensive experimental contributions, the authors have written quite a lengthy paper. Ideas are repeated in several parts of the paper, which could have been avoided to give the article a more balanced length.<br />
<br />
* ([https://openreview.net/forum?id=HkXWCMbRW]) As mentioned in the paper, solving a vision problem directly from a compressed image is not a novel method (e.g., DCT coefficients have been used for both vision and audio data to solve a task without any decompression).<br />
<br />
==Conclusion==<br />
<br />
The paper proposes performing inference tasks, namely classification and semantic segmentation, directly on compressed image representations without decoding them. The approach is demonstrated through an extensive set of experiments. The results show significant improvements in computational complexity while maintaining state-of-the-art classification and segmentation performance. As future work, the authors intend to explore other computer vision tasks based on compressed representations. They also suggest that this could lead to a better understanding of the features/compressed representations learned by image compression networks, with applications in unsupervised or semi-supervised learning.<br />
<br />
==References==<br />
# Torfason, R., Mentzer, F., Agustsson, E., Tschannen, M., Timofte, R., & Van Gool, L. (2018). Towards image understanding from deep compression without decoding. arXiv preprint arXiv:1803.06131.<br />
# Theis, L., Shi, W., Cunningham, A., & Huszár, F. (2017). Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395.<br />
# Agustsson, E., Mentzer, F., Tschannen, M., Cavigelli, L., Timofte, R., Benini, L., & Gool, L. V. (2017). Soft-to-hard vector quantization for end-to-end learning compressible representations. In Advances in Neural Information Processing Systems (pp. 1141-1151).<br />
# He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).<br />
# Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2018). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4), 834-848.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Towards_Image_Understanding_From_Deep_Compression_Without_Decoding&diff=42168stat946w18/Towards Image Understanding From Deep Compression Without Decoding2018-12-01T00:15:37Z<p>Z43ma: Grammer update.</p>
<hr />
<div>Paper Title: Towards Image Understanding from Deep Compression Without Decoding - ICLR 2018<br />
<br />
Presented By: Aravind Ravi<br />
<br />
== Introduction ==<br />
Recent advances in deep neural network (DNN) based image compression methods have shown potential improvements in image quality, storage savings, and bandwidth reduction. These methods leverage common neural network architectures such as convolutional autoencoders or recurrent neural networks to compress and reconstruct RGB images, and they outperform classical techniques such as JPEG2000 and BPG on perceptual metrics such as the structural similarity index (SSIM) and the multi-scale structural similarity index (MS-SSIM).<br />
<br />
These approaches encode an image <math>x </math> to some feature map (compressed representation), which is subsequently quantized to a set of symbols <math>z </math>. These symbols are then losslessly compressed to a bitstream, from which a decoder reconstructs an image <math>{\hat{x}} </math>, of the same dimensions as <math>x </math>.<br />
<br />
Learned compression algorithms have an advantage over engineered compression algorithms in that they can be adapted much more easily to specific domains. For example, a learned compression algorithm might be able to achieve good performance on compressing medical images without the algorithm being specifically tuned by hand.<br />
<br />
In this paper, the authors explore the idea of applying the learned representations to perform inference without reconstructing the compressed image. Specifically, instead of reconstructing an RGB image from the compressed representation and feeding it to a network for inference, the paper proposes to use a modified network that bypasses reconstruction of the RGB image.<br />
<br />
The rationale behind this approach is that the neural network architectures commonly used for learned compression (in particular the encoders) are similar to the ones commonly used for inference, and learned image encoders are hence, in principle, capable of extracting features relevant for inference tasks. The encoder might learn features relevant for inference purely by training on the compression task, and can be forced to learn these features by training on the compression and inference tasks jointly.<br />
<br />
The advantage of learning an encoder for image compression which produces compressed representation containing features relevant for inference is obvious in scenarios where images are transmitted (e.g. from a mobile device) before processing (e.g. in the cloud), as it saves reconstruction of the RGB image as well as part of the feature extraction and hence speeds up processing. A typical use case is a cloud photo storage application where every image is processed immediately upon upload for indexing and search purposes.<br />
<br />
Note: [https://en.wikipedia.org/wiki/Structural_similarity More Information on SSIM, MSSIM]<br />
<br />
== Intuition ==<br />
<br />
Compression techniques (something as common as zipping) are routinely used in day-to-day file handling, and most often these are engineered compression techniques. Deep Neural Networks (DNNs) are nonlinear function approximators that act as feature extractors, extracting features from inputs such as images or sound files. They can be seen as learning-based compression techniques, since they can perform compression and can be trained using backpropagation. If image classification can be done on the compressed files, large image datasets such as hyperspectral images and MRI images can be stored efficiently, and the compressed files can be used directly by DNNs for classification or reinforcement learning tasks.<br />
<br />
==Motivation and Contributions==<br />
The authors propose to perform image understanding tasks such as image classification and segmentation directly on DNN based compressed representations. Performing the image understanding tasks on the compressed representations/encoded feature maps has two advantages. <br />
# This method bypasses the process of decoding the image into the RGB space before classification.<br />
# The authors show that it reduces the overall computational complexity by up to a factor of 2.<br />
<br />
=== Contributions of the Paper ===<br />
* A method to perform image classification and semantic segmentation from compressed representations, which is of particular interest for large-scale image understanding problems. <br />
* The proposed method offers classification accuracy similar to that achieved on decompressed images while reducing the computational complexity by up to 2 times.<br />
* Semantic segmentation from compressed representations is shown to be as accurate as on decompressed images at moderate compression rates and more accurate at aggressive compression rates, at a lower computational complexity.<br />
* Joint training for image compression and classification is shown to improve image quality and to increase classification and segmentation accuracy.<br />
<br />
==Related Work==<br />
<br />
The prior work has shown image classification from compressed images based on engineered codecs. Some of the works in this area are:<br />
<br />
* In video analysis domain: Action recognition (Yeo et al., 2008; Kantorov & Laptev, 2014)<br />
* Classification of compressed hyperspectral images (Hahn et al., 2014; Aghagolzadeh & Radha, 2015)<br />
* Discrete Cosine Transform based compression performed on images before feeding them into a neural network, which shows an improvement in training speed by up to 10 times (Fu & Guimaraes, 2016)<br />
* Video analysis on compressed video (using engineered codecs) has also been studied in the past (Babu et al., 2016)<br />
* Criticism on document image analysis methods (Javed et al., 2017)<br />
<br />
The authors propose a method that does inference on top of learned feature representation and hence has a direct relation to unsupervised feature learning using autoencoders.<br />
They also claim that so far there hasn't been any work using learned compressed representations for image classification and segmentation.<br />
<br />
==Learned Deeply Compressed Representations==<br />
<br />
The image compression task is performed with a convolutional autoencoder architecture proposed by Theis et al. (2017), shown in the figure below, and a variant of the training procedure described by Agustsson et al. (2017). <br />
<br />
[[File:AR_theisAutoencoder.png|600px|center]]<br />
<br />
Some points to better understand the architecture:<br />
<br />
1. Most convolutions are done in a downsampled, lower-dimensional space to speed up computation<br />
<br />
2. Different activation functions are used. Blank arrows indicate the identity function (no additional nonlinearity), while black arrows indicate leaky rectifications<br />
<br />
3. The “round” box simply rounds all elements in the tensor to the nearest integer<br />
<br />
4. The “subpix” block is an upsampling/reconstruction block where the feature map’s coefficients are reshuffled after a convolution<br />
<br />
<br />
<br />
=== Compression Architecture ===<br />
<br />
The compression network is an autoencoder that takes an input image <math>x </math> and outputs <math>{\hat{x}} </math> as the approximation to the input. <br />
<br />
[[File:AR_Fig2a.png|300px|center]]<br />
<br />
The encoder has the following structure: It starts with 2 convolutional layers with spatial subsampling by a factor of 2, followed by 3 residual units, and a final convolutional layer with spatial subsampling by a factor of 2. This results in a <math>w/8</math> x <math>h/8</math> x <math>C</math> dimensional representation, where <math>w </math> and <math>h </math> are the spatial dimensions of <math>x </math>, and the number of channels C is a hyperparameter related to the rate <math>R </math>. This representation is then quantized to a discrete set of symbols, forming a compressed representation, <math>z </math>.<br />
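<br />
The size of the compressed representation follows directly from the three stride-2 stages. A small sketch, where the helper name is hypothetical and floor division is an assumption about how sizes that are not multiples of 8 are handled:<br />

```python
from typing import Tuple


def compressed_shape(width: int, height: int, channels: int,
                     num_stride2_layers: int = 3) -> Tuple[int, int, int]:
    """Spatial size after the stride-2 convolutions, plus the channel count C."""
    factor = 2 ** num_stride2_layers  # 2 * 2 * 2 = 8 for the encoder described above
    return (width // factor, height // factor, channels)


if __name__ == "__main__":
    # A 224x224 input with C = 32 channels gives a 28x28x32 representation.
    print(compressed_shape(224, 224, 32))
```

For a 224x224 input this gives the 28x28xC representation that the cResNets take as input.<br />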
<br />
To get the reconstruction <math>{\hat{x}} </math>, the compressed representation is fed into the decoder, which mirrors the encoder, but uses upsampling and deconvolutions instead of subsampling and convolutions.<br />
<br />
Quantizing the compressed representation imposes a distortion <math>D </math> on <math>{\hat{x}} </math> w.r.t. <math>x </math>, i.e., it increases the reconstruction error. This is traded for a decrease in the entropy of the quantized compressed representation <math>z </math>, which leads to a shorter bitstream as measured by the rate <math>R </math>. Thus, to train the image compression network, the classical rate-distortion trade-off <math>D + \beta R</math> is minimized. As a metric for <math>D </math>, the mean squared error (MSE) between <math>x </math> and <math>{\hat{x}} </math> is used, and <math>R</math> is estimated using <math>H(q)</math>, the entropy of the probability distribution over the symbols, which is itself estimated using a histogram (as done by Agustsson et al., 2017). The trade-off between MSE and the entropy is controlled by adjusting <math>\beta </math>. For each <math>\beta </math> an operating point is derived where the images have a certain bit rate, as measured by bits per pixel (bpp), and a corresponding MSE. To better control the bpp, the authors introduce a target entropy <math>H_t</math> and define the loss as:<br />
<br />
\begin{align}<br />
\mathcal{L_c} = \text{MSE}(x,{\hat{x}})+\beta\max({H(q)}-{H_t},0)<br />
\end{align}<br />
<br />
Agustsson et al. (2017) proposed a differentiable approximation to the quantization step to overcome its non-differentiability. This method has been adapted to suit the current application in the paper.<br />
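<br />
The compression loss defined above can be evaluated numerically as in the sketch below. This is an illustration only: the function names are hypothetical, the entropy is computed in bits from a hard histogram of the symbols (an assumption), and the computation is not differentiable, unlike the soft approximation used for training.<br />

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def mse(x: Sequence[float], x_hat: Sequence[float]) -> float:
    """Mean squared error between an image and its reconstruction (flattened)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def histogram_entropy(symbols: Sequence[Hashable]) -> float:
    """Entropy H(q) in bits of the empirical symbol distribution."""
    n = len(symbols)
    counts = Counter(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def compression_loss(x: Sequence[float], x_hat: Sequence[float],
                     symbols: Sequence[Hashable],
                     beta: float, target_entropy: float) -> float:
    """MSE(x, x_hat) + beta * max(H(q) - H_t, 0)."""
    excess = max(histogram_entropy(symbols) - target_entropy, 0.0)
    return mse(x, x_hat) + beta * excess


if __name__ == "__main__":
    x, x_hat = [0.0, 0.0], [1.0, 1.0]  # toy "image" and reconstruction
    z = [0, 0, 1, 1]                   # toy quantized symbols, entropy = 1 bit
    print(compression_loss(x, x_hat, z, beta=2.0, target_entropy=0.5))
```

The max term means the entropy is only penalized while it exceeds the target, so once the target rate is reached the loss reduces to the MSE.<br />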
<br />
Three operating points at 0.0983 bpp (C=8), 0.330 bpp (C=16), and 0.635 bpp (C=32) are obtained empirically. All further experiments are performed with these three operating points and the results for the same are presented in the following sections.<br />
<br />
==Image Classification from Compressed Representations==<br />
<br />
=== Classification on RGB Images ===<br />
<br />
For the image classification task based on the RGB images, the authors use the ResNet-50 architecture. <br />
Further information on residual networks can be found in the following links: <br />
[https://youtu.be/K0uoBKBQ1gA ResNets Part-1]<br />
[https://youtu.be/GSsKdtoatm8 ResNets Part-2]<br />
<br />
The details of the architecture are presented in the table below:<br />
<br />
[[File:AR_Tab1.png|400px|center]]<br />
<br />
In this paper, the number of 14x14 (conv4_x) blocks has been modified to obtain a new architecture called ResNet-71. <br />
<br />
=== Classification on Compressed Representations ===<br />
<br />
For input images with spatial dimension 224x224, the encoder of the compression network outputs a compressed representation with dimensions 28x28xC, where C is the number of channels. To use this compressed representation as input to the classification network, a simple variant of the ResNet architecture is proposed. This variant is referred to as cResNet-k, where c stands for “compressed representation” and k is the number of convolutional layers in the network. These networks are constructed by simply “cutting off” the front of the regular (RGB) ResNet: the root block of the network and the residual layers that have a larger spatial dimension than 28x28 are removed. To adjust the number of layers k, the ResNet architecture proposed by He et al. (2015) is used and the number of 14x14 (conv4_x) residual blocks is modified.<br />
<br />
In this way, three different architectures are derived:<br />
* cResNet-39 is ResNet-50 with the first 11 layers removed as described above, and this significantly reduces computational cost<br />
* cResNet-51<br />
* cResNet-72<br />
<br />
cResNet-51 and cResNet-72 are obtained by adding 14x14 residual blocks to match the computational cost of ResNet-50 and ResNet-71 respectively.<br />
<br />
Detailed descriptions of all the network architectures are presented below:<br />
<br />
[[File:AR_Tab3.png|600px|center]]<br />
<br />
==Semantic Segmentation from Compressed Representations==<br />
<br />
For semantic segmentation, the ResNet-based DeepLab architecture is adapted for the proposed application. The cResNet and ResNet image classification architectures are re-purposed with atrous convolutions, where the filters are upsampled instead of downsampling the feature maps. This is done to increase their receptive field and to prevent aggressive subsampling of the feature maps. For segmentation, the ResNet architecture is restructured such that the output feature map has 8 times smaller spatial dimensions than the original RGB image (instead of being subsampled by a factor of 32 as for classification). When using the cResNets, the output feature map has the same spatial dimensions as the input compressed representation (instead of being subsampled by a factor of 4 as for classification). This results in comparably sized feature maps for both the compressed representation and the reconstructed RGB images. Finally, the last 1000-way classification layer of these classification architectures is replaced by an atrous spatial pyramid pooling (ASPP) module with four parallel branches with rates {6, 12, 18, 24}, which provides the final pixel-wise classification.<br />
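<br />
To see how the atrous rates enlarge the receptive field, the sketch below computes the effective extent of a dilated filter. The helper name is hypothetical, and the 3x3 kernel size is an assumption (it is the size commonly used in ASPP branches); the summary does not state it.<br />

```python
def effective_kernel_size(kernel_size: int, rate: int) -> int:
    """Extent covered by a filter with (rate - 1) zeros inserted between its taps."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


if __name__ == "__main__":
    # ASPP rates from the text, with an assumed 3x3 kernel.
    for rate in (6, 12, 18, 24):
        size = effective_kernel_size(3, rate)
        print(f"rate {rate}: covers {size}x{size} positions with 3x3 weights")
```

Each branch keeps the same number of weights but looks at a wider context, which is why the feature maps do not need to be subsampled further.<br />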
<br />
==Joint Training for Compression and Image Classification==<br />
<br />
The authors propose a joint training strategy to combine compression and classification tasks. To do this, the proposed method combines the compression network and the cResNet-51 architecture. The figure below shows the combined pipeline:<br />
<br />
[[File:AR_Fig2b.png|300px|center]]<br />
<br />
All parts, encoder, decoder, and inference network, are trained at the same time. The compressed representation is fed<br />
to the decoder to optimize for mean-squared reconstruction error and to a cResNet-51 network to<br />
optimize for classification using a cross-entropy loss. The combined loss function takes the form:<br />
<br />
\begin{align}<br />
\mathcal{L_c} = \gamma(\text{MSE}(x,{\hat{x}})+\beta\max({H(q)}-{H_t},0))+l_{ce}(y,{\hat{y}})<br />
\end{align}<br />
<br />
where the term in parentheses, <math>\text{MSE}(x,{\hat{x}})+\beta\max({H(q)}-{H_t},0)</math>, is the compression loss used when training for compression only, and <math> l_{ce}</math> is the cross-entropy loss for classification.<br />
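<br />
The combined objective above can be sketched for a single sample as follows. The function names are hypothetical, the compression loss is passed in as an already computed number, and the class probabilities are assumed to be the softmax output of the classifier.<br />

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], label: int) -> float:
    """Cross-entropy l_ce for one sample: negative log-probability of the true class."""
    return -math.log(probs[label])


def joint_loss(compression_loss: float, probs: Sequence[float],
               label: int, gamma: float) -> float:
    """gamma * (compression loss) + cross-entropy classification loss."""
    return gamma * compression_loss + cross_entropy(probs, label)


if __name__ == "__main__":
    # Made-up numbers: compression loss 2.0, two classes, true class 0.
    print(joint_loss(2.0, [0.5, 0.5], label=0, gamma=0.1))
```

In a real training loop both terms would be averaged over a mini-batch and backpropagated through the encoder, decoder, and cResNet together.<br />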
<math>\gamma </math> controls the trade-off between the compression loss and the classification loss.<br />
<br />
==Experiments and Results==<br />
<br />
=== Learned Deeply Compressed Representations Results ===<br />
<br />
All experiments have been performed on the ILSVRC2012 dataset.<br />
<br />
The metrics used to measure the compression quality are as follows: <br />
* PSNR (Peak Signal-to-Noise Ratio) is a standard measure, depending monotonically on mean squared error defined as: <br />
<br />
\begin{align}<br />
\text{PSNR} = 10\log_{10}\left(\frac{255^2}{\text{MSE}}\right)<br />
\end{align}<br />
<br />
* SSIM (Structural Similarity Index) and MS-SSIM (Multi-Scale SSIM) are metrics proposed to measure the similarity of images as perceived by humans<br />
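<br />
The PSNR definition above translates directly into code. A minimal sketch for 8-bit images, where the function name and the handling of a zero MSE are assumptions:<br />

```python
import math


def psnr(mse: float, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse <= 0.0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    # Lower MSE gives higher PSNR (made-up MSE values).
    for err in (650.25, 65.025, 6.5025):
        print(err, round(psnr(err), 2))
```

Because PSNR is a monotonically decreasing function of the MSE, ranking codecs by PSNR is equivalent to ranking them by MSE.<br />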
<br />
The figure below depicts the performance of the deep compression models vs. standard JPEG and JPEG2000. Higher values are better. The proposed technique outperforms both JPEG and JPEG2000 at the operating points used in this paper.<br />
<br />
[[File:AR_Fig8.png|600px|center]]<br />
<br />
The learned compressed representations are illustrated in the figure below. <br />
<br />
[[File:AR_Fig9.png|500px|center]]<br />
<br />
In the above figure, the original RGB image is shown along with compressed versions of it, which are reconstructed from the compressed representations. The 4 channels with the highest entropy are shown in the visualizations. These visualizations indicate how the networks compress an image: as the rate (bpp) gets lower, the entropy cost forces the compressed representation to use fewer quantization levels. For the most aggressive compression, the channel maps use only 2 levels for the compressed representation.<br />
<br />
=== Classification on Compressed Representations ===<br />
<br />
All experiments have been performed on the ILSVRC2012 dataset. It consists of 1.28 million training images and 50k validation images. These images are distributed across 1000 diverse classes. For image classification, the top-1 classification accuracy and top-5 classification accuracy are reported on the validation set on 224x224 center crops for RGB images and 28x28 center crops for the compressed representation.<br />
<br />
==== Training Procedure ====<br />
<br />
The compression network is fixed while training the classification network, both when training with compressed representations and with reconstructed compressed RGB images. For the compressed representations, the output of the fixed encoder (the compressed representation) is provided as input to the cResNets (the decoder is not needed). When training on the reconstructed compressed RGB images, the output of the fixed encoder-decoder (an RGB image) is provided as input to the ResNet. This is done for each operating point.<br />
<br />
Refer to Appendix A Section A4, of the paper for details on the hyperparameters and optimization used for training the network [1].<br />
<br />
==== Classification Results ====<br />
<br />
The tables below present the results of the classification at each operating point, both classifying from the compressed representation and the corresponding reconstructed compressed RGB images.<br />
<br />
[[File:AR_Tab2.png|400px|center]]<br />
<br />
The figure below shows the validation curves for ResNet-50, cResNet-51, and cResNet-39.<br />
<br />
[[File:AR_Fig3.png|700px|center]]<br />
<br />
For the 2 classification architectures with the same computational complexity (ResNet-50 and cResNet-51), the validation curves at the 0.635 bpp compression operating point almost coincide, with ResNet-50 performing slightly better. As the rate (bpp) decreases, this performance gap narrows. The table above shows the classification results when the different architectures have converged. At the 0.635 bpp operating point, ResNet-50 performs only 0.5% better in top-5 accuracy than cResNet-51, while at the 0.0983 bpp operating point the difference is only 0.3%.<br />
Using the same pre-processing and the same learning rate schedule but starting from the original uncompressed RGB images yields 89.96% top-5 accuracy. The top-5 accuracy obtained from the compressed representation at the 0.635 bpp compression operating point, 87.85%, is competitive with that obtained for the original images, at a significantly lower storage cost. Specifically, at 0.635 bpp the ImageNet dataset requires 24.8 GB of storage space instead of 144 GB for the original version, a reduction by a factor of 5.8.<br />
<br />
Notes on top-1 and top-5 accuracy:<br />
<br />
* Top-1 accuracy: This is the conventional accuracy metric used in machine learning. If the true label of the input matches the class with the highest predicted probability in the output of the CNN's last layer, the input is counted as correctly classified; otherwise it is counted as incorrectly classified.<br />
* Top-5 accuracy: In this case, if any of the model's 5 highest classification probabilities match with the true label of the input, then this is considered as a correct classification, else it is an incorrect classification.<br />
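<br />
The two metrics above can be illustrated with a minimal Python sketch. The function name <code>top_k_accuracy</code> and the toy scores are made up for illustration and do not come from the paper; top-1 and top-5 are simply the cases k = 1 and k = 5.<br />
<br />
```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    correct = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score; keep the first k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label in top_k:
            correct += 1
    return correct / len(labels)


# Toy data (hypothetical): 3 samples, 4 classes.
scores = [
    [0.1, 0.6, 0.2, 0.1],  # true class 1 has the highest score
    [0.5, 0.1, 0.3, 0.1],  # true class 2 has the second-highest score
    [0.4, 0.3, 0.2, 0.1],  # true class 3 has the lowest score
]
labels = [1, 2, 3]

print(top_k_accuracy(scores, labels, k=1))
print(top_k_accuracy(scores, labels, k=2))
```
<br />
By construction, accuracy can only stay the same or increase as k grows, which is why top-5 figures are always at least as high as top-1 figures.<br />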
<br />
===Semantic Segmentation Results===<br />
<br />
All experiments have been performed on the PASCAL VOC-2012 dataset for semantic segmentation. It has 20 object foreground classes and 1 background class. The dataset consists of 1464 training and 1449 validation images. In every image, each pixel is annotated with one of the 20 + 1 classes. The original dataset is furthermore augmented with extra annotations, so the final dataset has 10,582 images for training and 1449 images for validation.<br />
<br />
Performance is measured by the pixel-wise intersection-over-union (IoU) averaged over all classes, i.e., the mean intersection-over-union (mIoU), on the validation set.<br />
<br />
[https://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/ Details on IoU]<br />
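<br />
As a rough illustration of the mIoU metric, the following Python sketch computes per-class IoU on a single flattened label map and averages it. The function name <code>mean_iou</code> and the toy labels are hypothetical. It is also a simplification: benchmark evaluations typically accumulate intersections and unions over the whole validation set before averaging, which this sketch does not do.<br />
<br />
```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes for flat lists of per-pixel class ids."""
    ious = []
    for c in range(num_classes):
        # Pixels labelled c in both maps (intersection) or in either map (union).
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # ignore classes absent from both maps
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in either label map")
    return sum(ious) / len(ious)


# Toy example (hypothetical): 6 pixels, 3 classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```
<br />
In the toy example the per-class IoUs are 1/3, 2/3 and 1/2, so the mean is 0.5.<br />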
<br />
==== Training Procedure ====<br />
The cResNet/ResNet networks are pre-trained on the ImageNet dataset using the procedure described earlier for the image classification task; the encoder and decoder are fixed as in the earlier scenario. The architectures are then adapted with dilated convolutions (cResNet-d/ResNet-d) and finetuned on the semantic segmentation task.<br />
<br />
Refer to Appendix A Section A5, of the paper for details on the hyperparameters and optimization used for training the network [1].<br />
<br />
==== Segmentation Results ====<br />
<br />
The table below shows the mIoU results for the segmentation task.<br />
<br />
[[File:AR_Tab2.png|450px|center]]<br />
<br />
The figure below illustrates the segmentation results with respect to each compression operating point.<br />
<br />
[[File:AR_Fig4.png|700px|center]]<br />
<br />
For semantic segmentation, ResNet-50-d and cResNet-51-d perform equally well at the 0.635 bpp compression operating point. At the 0.330 bpp operating point, segmentation from the compressed representation performs slightly better (by 0.37%), and at the 0.0983 bpp operating point it performs considerably better than segmentation from the reconstructed compressed RGB images, by 1.65%.<br />
<br />
[[File:AR_Fig5.png|600px|center]]<br />
<br />
The above figure shows the predicted segmentation visually for both the cResNet-51-d and the ResNet-50-d architecture at each operating point. Along with the segmentation, it also shows the original uncompressed RGB image and the reconstructed compressed RGB image. These images highlight the challenging nature of these segmentation tasks, which can nevertheless be performed using the compressed representation. They also clearly indicate that the compression affects the segmentation, as lowering the rate (bpp) progressively removes details in the image. A visual comparison of the segmentation from the reconstructed RGB images with the segmentation from the compressed representation shows similar performance.<br />
<br />
The figure below is another example of visual results of segmentation from the compressed representation and from reconstructed RGB images. The performance is visually similar for all operating points except the 0.0983 bpp operating point, where segmentation from the reconstructed RGB image fails to capture the back part of the train, while segmentation from the compressed representation manages to capture it.<br />
<br />
[[File:AR_Fig10.png|600px|center]]<br />
<br />
=== Results on Computational Gains ===<br />
<br />
[[File:AR_Fig6.png|400px|center]]<br />
<br />
=====Computational Gains on Classification=====<br />
<br />
The figure on the left shows the top-5 classification accuracy as a function of computational complexity for the 0.0983 bpp compression operating point.<br />
At a fixed computational cost, the reconstructed compressed RGB images perform about 0.25% better. At a fixed classification performance, inference from the compressed representation costs about 0.6 * 10^9 FLOPs more. However, when accounting for the decoding cost at a fixed classification performance, inference from the reconstructed compressed RGB images costs 2.2*10^9 FLOPs more than inference from the compressed representation.<br />
<br />
=====Computational Gains on Segmentation=====<br />
<br />
In the figure on the right, the mIoU validation performance is shown as a function of computational complexity for the 0.0983 bpp compression operating point.<br />
Here, even without accounting for the decoding cost of the reconstructed images, the compressed representation performs better. At a fixed computational cost, segmentation from the compressed representation gives about 0.7% better mIoU, and at a fixed mIoU the computational cost is about 3.3*10^9 FLOPs lower for compressed representations. Accounting for the decoding costs, this difference becomes 6.1*10^9 FLOPs. Due to the nature of the dilated convolutions and the increased feature map size, the relative computational gains for segmentation are not as pronounced as for classification.<br />
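<br />
The FLOP figures quoted in this section can be cross-checked with a little arithmetic. The sketch below treats each quoted number as a cost gap (RGB pipeline minus compressed pipeline) and takes the difference between the gap with and without decoding. The decoder cost obtained this way is an inference from the quoted numbers, which are themselves approximate readings, and is not a value stated in this summary; the function name is hypothetical.<br />
<br />
```python
def implied_decoder_cost(gap_without_decoder, gap_with_decoder):
    """Decoder cost implied by two cost gaps, in units of 1e9 FLOPs.

    Each gap is (cost of RGB pipeline) - (cost of compressed pipeline),
    so a negative gap means the compressed pipeline is more expensive.
    """
    return gap_with_decoder - gap_without_decoder


# Classification: compressed costs 0.6 more before decoding is counted,
# RGB costs 2.2 more once decoding is counted.
cls_decoder = implied_decoder_cost(-0.6, 2.2)

# Segmentation: compressed is 3.3 cheaper before decoding, 6.1 cheaper after.
seg_decoder = implied_decoder_cost(3.3, 6.1)

print(round(cls_decoder, 1), round(seg_decoder, 1))
```
<br />
Both tasks give the same implied decoding cost of roughly 2.8*10^9 FLOPs, which suggests the classification and segmentation figures are mutually consistent.<br />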
<br />
===Joint Training for Compression and Image Classification===<br />
<br />
==== Training Procedure ====<br />
<br />
When doing joint training, the compression network and the classification network are first initialized from a trained state obtained as described previously. After initialization, both networks are finetuned jointly. For a detailed description of the hyperparameters used and the training schedule, see Appendix A8 of the paper [1].<br />
<br />
To check that the change in classification accuracy is not only due to (1) a better compression operating point or (2) the fact that the cResNet is trained longer, the following control is performed. A new operating point is obtained by finetuning only the compression network using the schedule described above. The cResNet-51 is trained on top of this new operating point from scratch. Finally, the compression network is fixed at the new operating point, and the cResNet-51 is trained for 9 epochs.<br />
<br />
To obtain segmentation results, the jointly trained network is used. The operating point is fixed and the jointly finetuned classification network is adapted for segmentation (cResNet-51-d).<br />
<br />
==== Joint Training Results ====<br />
<br />
[[File:AR_Fig7.png|400px|center]]<br />
<br />
It can be seen from the figure that the classification and segmentation results “move up” from the baseline through finetuning. When training jointly, the improvement for classification is larger and a significant improvement for segmentation is achieved. For the 0.635 bpp operating point, the classification performance is similar for training the network jointly and training the compression network only, but when using these operating points for segmentation the difference is considerable.<br />
<br />
The results presented by the authors suggest an improvement in classification of 2%, a performance gain which would otherwise require an additional 75% of the computational complexity of cResNet-51. The segmentation performance after training the networks jointly is 1.7% better in mIoU than when training only the compression network.<br />
<br />
==Critique==<br />
<br />
The paper shows how previous work on auto-encoders and image compression can be extended effectively to a novel combined image compression and recognition task. The work provides extensive experimental evaluation and evidence suggesting that learned compressed representations can be effective in classification and segmentation tasks. While keeping performance close to the state of the art, the authors show that the proposed method can offer significant computational gains. Potential applications include multimedia communication, wireless transmission of images, video surveillance on the mobile edge, etc. With the advent of 5G and other new wireless technologies, this method offers capabilities that can be used to conserve wireless bandwidth and save storage while retaining the perceptual quality of images.<br />
The joint training of the compression and classification networks provides some added advantages and also shows that, at aggressive compression rates, the performance in classification and segmentation can be improved significantly.<br />
<br />
Another critique is that the authors did not answer the question of why we would want to do image understanding from a compressed space. Intuitively, a learning algorithm could simply learn from the original feature space, which contains more information. The paper does not address the more fundamental question of what benefit learning from a compressed space brings compared with learning directly from the original feature space.<br />
<br />
The authors mention that the complexity of the current approach is still high in comparison with methods like JPEG or JPEG2000. They also mention that this can be overcome when the networks are trained and run on GPUs. Although this has been seen as a drawback, subsequent improvements in physical hardware and more specialized deep learning platforms may overcome this limitation. While the authors did thorough experiments and gave extensive results on compressed representations and their advantages, the idea itself is not very novel. Finally, in providing extensive experimental contributions, the authors have written quite a lengthy paper. Ideas are repeated frequently in parts of the paper; avoiding this would have led to a better-balanced article length.<br />
<br />
* ([https://openreview.net/forum?id=HkXWCMbRW OpenReview]) As mentioned in the paper, solving a vision problem directly from a compressed image is not a novel method (e.g., DCT coefficients have been used for both vision and audio data to solve a task without any decompression).<br />
<br />
==Conclusion==<br />
<br />
The paper proposes performing inference tasks, namely classification and semantic segmentation, directly on compressed image representations without the need to decode them. Through a set of rigorous experiments, the paper demonstrates that this approach is feasible for the intended tasks. The results show significant improvements in computational complexity while maintaining state-of-the-art classification and segmentation performance. As future work, the authors intend to explore other computer vision tasks based on compressed representations. They also suggest that this could lead to a better understanding of the features/compressed representations learned by image compression networks, with applications in unsupervised or semi-supervised learning.<br />
<br />
==References==<br />
# Torfason, R., Mentzer, F., Agustsson, E., Tschannen, M., Timofte, R., & Van Gool, L. (2018). Towards image understanding from deep compression without decoding. arXiv preprint arXiv:1803.06131.<br />
# Theis, L., Shi, W., Cunningham, A., & Huszár, F. (2017). Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395.<br />
# Agustsson, E., Mentzer, F., Tschannen, M., Cavigelli, L., Timofte, R., Benini, L., & Gool, L. V. (2017). Soft-to-hard vector quantization for end-to-end learning compressible representations. In Advances in Neural Information Processing Systems (pp. 1141-1151).<br />
# He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).<br />
# Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2018). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4), 834-848.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Annotating_Object_Instances_with_a_Polygon_RNN&diff=42167Annotating Object Instances with a Polygon RNN2018-12-01T00:12:29Z<p>Z43ma: Some details in training, grammer update</p>
<hr />
<div>Summary of the CVPR '17 best [https://www.cs.utoronto.ca/~fidler/papers/paper_polyrnn.pdf ''paper'']<br />
<br />
A presentation video of the paper is available [https://www.youtube.com/watch?v=S1UUR4FlJ84 here].<br />
<br />
= Background =<br />
<br />
If a snapshot of an image is given to a human, how will he/she describe a scene? He/she might identify that there is a car parked near the curb, or that the car is parked right beside a street light. This ability to decompose objects in scenes into separate entities is key to understanding what is around us and it helps to reason about the behavior of objects in the scene.<br />
<br />
Automating this process is a classic computer vision problem and is often termed "object detection". There are four distinct levels of detection (refer to Figure 1 for a visual cue):<br />
<br />
1. Classification + Localization: This is the most basic level. It detects whether '''an''' object is present in the image and then identifies the position of the object within the image in the form of a bounding box overlaid on the image.<br />
<br />
2. Object Detection: The classic definition of object detection refers to the detection and localization of '''multiple''' objects of interest in the image. The output of the detection is still a bounding box overlaid on the image at the position corresponding to the location of each object.<br />
<br />
3. Semantic Segmentation: This is a pixel level approach, i.e., each pixel in the image is assigned to a category label. Here, there is no difference between instances; this is to say that there are objects present from three distinct categories in the image, without tracking or reporting the number of appearances of each instance within a category. <br />
<br />
4. Instance Segmentation (''This paper performs this''): The goal is not only to assign pixel-level categorical labels, but also to identify each entity separately as sheep 1, sheep 2, sheep 3, grass, and so on.<br />
<br />
[[File:Figure_1.jpeg | 450px|thumb|center|Figure 1: Different levels of detection in an image.]]<br />
<br />
<br />
== Motivation ==<br />
<br />
Semantic segmentation helps us achieve a deeper understanding of images than image classification or object detection. Over and above this, instance segmentation is crucial in applications where multiple objects of the same category are to be tracked, especially in autonomous driving, mobile robotics, and medical image processing. This paper deals with a novel method to tackle the instance segmentation problem pertaining specifically to the field of autonomous driving, but shown to generalize well in other fields such as medical image processing.<br />
A polygon is a natural form of annotation. Current human-annotated instance segmentations use polygons because a polygon can represent an object with a small number of vertices instead of many pixels, and it makes user modifications easy to incorporate.<br />
<br />
[[File:polygon.png|600px|center]]<br />
<br />
== Goal ==<br />
<br />
Most of the recent approaches to instance segmentation are based on deep neural networks and have demonstrated impressive performance. Given that these approaches require a lot of computational resources and that their performance depends on the amount of accessible training data, there has been an increase in the demand to label/annotate large-scale datasets. This is both expensive and time-consuming.<br />
<br />
{| class=wikitable width=700 align=center<br />
|Thus, the '''main goal''' of the paper is to enable '''semi-automatic''' annotation of object instances.<br />
|}<br />
<br />
Figure 2 shows what the interface looks like.<br />
<br />
Most of the datasets available pass through a stage where annotators manually outline the objects with a closed polygon. Polygons allow annotation of objects with a small number of clicks (30 - 40) compared to other methods. This approach works as the silhouette of an object is typically connected without holes. <br />
<br />
{| class=wikitable width=900 align=center<br />
|Thus, the authors propose adopting this same technique of annotating images with polygons, except that they aim to automate the method and replace or reduce manual labeling. The '''intuition''' behind the success of this method is the '''sparse''' nature of these polygons, which allows an object to be annotated as a cluster of pixels rather than through classification at the pixel level.<br />
|}<br />
<br />
[[File:Annotating Object Instances Example.png | 450px|thumb|center|Figure 2: Given a bounding box, a polygon outlining the object instance inside the box is predicted. This approach is designed to facilitate annotation, and easily incorporates user corrections of points to improve the overall object’s polygon. ]]<br />
<br />
<br />
= Related Works =<br />
<br />
Some of the techniques used in semi-automatic annotation are as follows:<br />
<br />
1. '''GrabCut''': In general, GrabCut is a method to separate the foreground and background of an image with minimal user interaction. Specifically, the user need only create a rectangular bounding box containing the foreground, and the algorithm will extract the object in the foreground. A major contribution of the paper is that labelling (of the object in the foreground) was not required, as the algorithm was able to identify where significant changes in colour pattern occurred. In this sense, it mimics automatic segmentation when combined with a Region Proposal Network. <br />
<br />
[[File:GrabCut_Example.png | 450px|thumb|center|Figure 3: Illustration of GrabCut.]]<br />
<br />
2. '''GrabCut + CNN''': Scribbles have also been used to train CNNs for semantic image segmentation. <br />
<br />
3. '''Superpixels''': Superpixels in the form of small polygons where the color intensity within each superpixel is similar, to a certain threshold, have been used to provide a sparse representation of the large number of pixels in an image. However, the performance of this technique depends on the scale of the superpixels and hence sometimes merges small objects.<br />
<br />
[[File:Superpixel_idea.jpg | 450px|thumb|center|Figure 4: Illustration of the superpixel idea.]]<br />
<br />
= Model =<br />
<br />
As an '''input''' to the model, an annotator or perhaps another neural network provides a bounding box containing an object of interest, and the model auto-generates a polygon outlining the object instance using a Recurrent Neural Network, which the authors call Polygon-RNN.<br />
<br />
The RNN model predicts a vertex of the polygon at each time step given a CNN representation of the image, the vertices predicted at the last two time steps, and the first vertex location. The first vertex is predicted differently, as described below. The information from the previous two time steps helps the RNN create a polygon in a specific direction, and the first vertex provides a cue for loop closure of the polygon edges.<br />
<br />
The polygon is parametrized as a sequence of 2D vertices and is assumed to be closed. Because a polygon is a cyclic structure, there are multiple ways to write it as a sequence, so the polygon generation is fixed to follow a clockwise orientation. The starting point of the sequence, however, can be any of the vertices of the polygon.<br />
<br />
== Architecture ==<br />
<br />
There are two primary networks at play: 1. CNN with skip connections, and 2. One-to-many type RNN.<br />
<br />
[[File:Figure_2_Neel.JPG | 800px|thumb|center|Figure 5: Model architecture for Polygon-RNN depicting a CNN with skip connections feeding into a 2 layer ConvLSTM (One-to-many type) ('''Note''': A possible point of confusion - the authors have only shown the layers of VGG16 architecture here that have the skip connections introduced).]]<br />
<br />
1. '''CNN with skip connections''':<br />
<br />
The authors have adopted the VGG16 feature extractor architecture with a few modifications pertaining to the preservation of features fused together in a tensor that can feed into the RNN (refer to Figure 5). Namely, the last max-pooling layer (''pool5'') present in the VGG16 CNN has been removed. The image fed into the CNN is pre-shrunk to a 224x224x3 tensor (3 being the Red, Green, and Blue channels). The image passes through 2 pooling layers and 2 convolutional layers. Since the features extracted after each operation are to be preserved and fused later on, the idea is to have a tensor with a common width of 512 at each of these four steps; so the output tensor at pool2 is convolved with 4 3x3x128 filters and the output tensor at pool3 is convolved with 2 3x3x256 filters. The skip connections from the four layers allow the CNN to extract low-level edge and corner features (which help to follow the object's boundaries) as well as boundary/semantic information about the instances (which helps to identify the object). Finally, a 3x3 convolution applied along with a ReLU non-linearity results in a 28x28x128 tensor that contains semantic information pertinent to the image frame and is taken as an input by the RNN.<br />
<br />
2. '''RNN - 2 Layer ConvLSTM'''<br />
<br />
The RNN is employed to capture information about the previous vertices in the time-series. Specifically, a Convolutional LSTM is used as a decoder. The ConvLSTM preserves the 2D spatial information received from the CNN and reduces the number of parameters compared to a fully connected RNN. The ConvLSTM uses a kernel size of 3x3 and 16 channels, and outputs a vertex at each time step. At time step t, the ConvLSTM gets as input a tensor which concatenates 4 features: the CNN feature representation of the image, the one-hot encodings of the previous predicted vertex and of the vertex predicted two time steps ago, and the one-hot encoding of the first predicted vertex.<br />
<br />
The Convolutional LSTM computes the hidden state <math display = "inline">h_t</math> given the input <math display = "inline">x_t</math> based on the following equations:<br />
<center><br />
<math display="block"><br />
\begin{pmatrix}<br />
i_t \\<br />
f_t \\<br />
o_t \\<br />
g_t \\<br />
\end{pmatrix}<br />
= W_h * h_{t-1} + W_x * x_t + b<br />
</math><br />
<br />
<math display="block"><br />
c_t = \sigma(f_t) \odot c_{t-1} + \sigma(i_t) \odot \tanh(g_t)<br />
</math><br />
<br />
<math display="block"><br />
h_t = \sigma(o_t) \odot \tanh(c_t)<br />
</math><br />
</center><br />
where <math display = "inline">i, f, o</math> denote the input, forget, and output gate, <math display = "inline">h</math> is the hidden state and <math display = "inline">c</math> is the cell state. Also, <math display = "inline">\sigma</math> denotes the sigmoid function, <math display = "inline">\odot</math> indicates an element-wise product and <math display = "inline">*</math> a convolution. <math display = "inline">W_h</math> denotes the hidden-to-state convolution kernel and <math display = "inline">W_x</math> the input-to-state convolution kernel.<br />
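<br />
The gating part of these equations can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the convolutions with <math display = "inline">W_h</math> and <math display = "inline">W_x</math> are assumed to have been applied already, so the function receives the resulting pre-activations as flat lists and applies the element-wise update. The function names are hypothetical.<br />
<br />
```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_gate_update(i_t, f_t, o_t, g_t, c_prev):
    """Element-wise cell and hidden state update from gate pre-activations.

    All arguments are flat lists of equal length (one entry per spatial
    location and channel). The convolution step that produces the
    pre-activations is deliberately omitted from this sketch.
    """
    c_t = [
        sigmoid(f) * c + sigmoid(i) * math.tanh(g)
        for i, f, g, c in zip(i_t, f_t, g_t, c_prev)
    ]
    h_t = [sigmoid(o) * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# With all pre-activations at zero every gate is half open, so the new
# cell state is half of the previous one.
h_t, c_t = lstm_gate_update([0.0], [0.0], [0.0], [0.0], [1.0])
print(h_t, c_t)
```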
<br />
The authors treat the vertex prediction task as a classification task: the location of each vertex is given by a one-hot representation of dimension DxD + 1 (D chosen to be 28 by the authors in tests). The one additional dimension is the cue for loop closure of the polygon. Because the one-hot representations of the two previously predicted vertices and of the first vertex are taken as input, a clockwise direction (or, for that matter, any fixed direction) can be enforced for the creation of the polygon. Coming back to the prediction of the first vertex: as a polygon is a closed cycle, any of its vertices can be used as a starting point. The authors therefore treat the starting point as a special case, handled by a further modification of the CNN that adds two DxD layers, with one branch predicting object instance boundaries while the other takes in this output as well as the image features to predict the vertices of the polygon. The boundary and vertex predictions are treated as a binary classification problem in each cell of the output grid. This CNN is trained separately. Here, <math display = "inline">y_t</math> denotes the one-hot encoding of the vertex and is the output at time step <math>t</math>.<br />
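<br />
The DxD + 1 one-hot representation can be sketched as follows. Row-major indexing of the grid and the use of the last slot as the loop-closure token are assumptions made for this illustration; the summary does not specify the layout, and the function names are hypothetical.<br />
<br />
```python
D = 28  # grid size reported above


def encode_vertex(row, col, d=D):
    """One-hot list of length d*d + 1 for a vertex at grid cell (row, col)."""
    vec = [0] * (d * d + 1)
    vec[row * d + col] = 1  # assumed row-major layout
    return vec


def encode_end_of_polygon(d=D):
    """One-hot list whose extra slot marks loop closure (assumed to be last)."""
    vec = [0] * (d * d + 1)
    vec[d * d] = 1
    return vec


def decode(vec, d=D):
    """Return (row, col) for a vertex, or None for the loop-closure token."""
    idx = vec.index(1)
    return None if idx == d * d else divmod(idx, d)


v = encode_vertex(3, 5)
print(len(v), decode(v), decode(encode_end_of_polygon()))
```
<br />
With D = 28 each encoding has 28*28 + 1 = 785 entries.<br />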
<br />
== Training ==<br />
<br />
The training of the model is done as follows:<br />
<br />
1. Cross-entropy is used for the RNN loss function. To avoid over-penalizing mispredictions that are close to the ground-truth vertex, non-zero probability mass is assigned to locations within a distance of 2 in the D × D output grid.<br />
<br />
2. The typical training regime is followed, in which the model makes a prediction at each time step but the ground-truth vertex information is fed into the next step. Instead of Stochastic Gradient Descent, Adam is used for optimization: batch size = 8, learning rate = 1e-4 (the learning rate decays by a factor of 10 after 10 epochs). This choice of optimizer makes development easier, but switching back to SGD may give better experimental results because of Adam's convergence problems.<br />
<br />
3. For the first vertex prediction, the modified CNN mentioned previously is trained using a multi-task cost function. In particular, the authors used the logistic loss for every location in the grid.<br />
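<br />
The smoothed target described in step 1 could look roughly like the sketch below. The summary only says that non-zero mass is assigned within a distance of 2, so the uniform weighting and the chessboard distance used here are assumptions, as are the function names.<br />
<br />
```python
import math


def smoothed_target(row, col, d=28, radius=2):
    """Target distribution over d*d + 1 slots centred on the true vertex.

    Assumption of this sketch: mass is spread uniformly over all grid
    cells whose chessboard distance to (row, col) is at most `radius`.
    The extra last slot (loop closure) receives no mass here.
    """
    cells = [
        (r, c)
        for r in range(d)
        for c in range(d)
        if max(abs(r - row), abs(c - col)) <= radius
    ]
    target = [0.0] * (d * d + 1)
    for r, c in cells:
        target[r * d + c] = 1.0 / len(cells)
    return target


def cross_entropy(target, probs, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs) if t > 0)


target = smoothed_target(10, 10)
uniform_prediction = [1.0 / len(target)] * len(target)
print(cross_entropy(target, uniform_prediction))
```
<br />
A prediction that lands one cell away from the ground truth still overlaps the target mass, so it is penalized less than a prediction far from the true vertex.<br />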
<br />
The reported time for training is one day on an NVIDIA Titan-X GPU.<br />
<br />
The resolution of the polygon is 28 x 28, based on the downsampling factor and the ConvLSTM resolution. The authors simplified the polygon by removing vertices that lie on a grid line and duplicate vertices that fall in the same grid cell. For data augmentation, they also randomly flipped images, enlarged the original bounding boxes and randomly selected the starting vertex of the polygon annotation.<br />
<br />
== Importance of Human Annotator in the Loop ==<br />
<br />
The model allows for the prediction at a given time step to be corrected and this corrected vertex is then fed into the next time step of the RNN, effectively rejecting the network predicted vertex. This has the simple effect of putting the model "back on the right track". Note that this is only possible due to the adoption of the RNN architecture i.e. the inherent nature of the RNN to accept previous outputs allows incorporation of the user's judgement. The typical inference time as quoted by the paper is 250ms per object.<br />
<br />
= Results =<br />
<br />
== Evaluation Metrics ==<br />
<br />
The evaluation of the model performance was conducted based on the Cityscapes and KITTI Datasets. There are two metrics used for evaluation:<br />
<br />
1. '''IoU''': The standard Intersection over Union (IoU) measure is used for comparison. The calculation for IoU takes both the predicted and ground-truth object boundaries. The intersection (the area contained in both boundaries at once) is divided by the union (the area contained by at least one of the boundaries). A low score on this metric means that there is little overlap between the boundaries, or large areas of non-overlap, and a score of 1.0 indicates that the two boundaries contain the same area.<br />
<br />
2. '''Number of Clicks''': To evaluate the speed-up factor, the chessboard distance is used to measure the distance between the ground truth (GT) and the output of the Polygon RNN. A set of distance thresholds <math display = "inline">T \in \{1,2,3,4\}</math> is used: if the distance exceeds the given threshold, the annotator makes a correction to match the GT, and the number of such clicks is used to evaluate the speed-up factor.<br />
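<br />
The click-counting idea can be sketched in Python as below. The chessboard distance is taken to be the maximum of the horizontal and vertical offsets. The sketch assumes that predicted and ground-truth vertices are already paired one-to-one and ignores the fact that, in the real protocol, a correction changes the model's subsequent predictions; the function names and toy vertices are hypothetical.<br />
<br />
```python
def chessboard_distance(p, q):
    """Maximum of the absolute row and column offsets between two grid points."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def count_clicks(predicted, ground_truth, threshold):
    """Number of vertices whose distance to the paired GT vertex exceeds the threshold."""
    return sum(
        1
        for p, g in zip(predicted, ground_truth)
        if chessboard_distance(p, g) > threshold
    )


# Toy polygon (hypothetical): distances to the GT are 1, 4, 2 and 0.
predicted = [(0, 0), (5, 5), (10, 12), (20, 20)]
ground_truth = [(0, 1), (5, 9), (10, 10), (20, 20)]

for t in (1, 2, 3, 4):
    print(t, count_clicks(predicted, ground_truth, t))
```
<br />
A larger threshold tolerates more deviation from the ground truth and therefore needs fewer clicks, at the price of a lower IoU.<br />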
<br />
== Baseline Techniques ==<br />
<br />
1. '''SharpMask''': a 50 layer ResNet considered as the state of the art annotation method.<br />
<br />
2. '''DeepMask''': a build-up on the 50 layer ResNet with an addition of another CNN.<br />
<br />
3. '''Dilation10''': another simple technique using purely convolutional operations.<br />
<br />
4. '''SquareBox''': a simple technique where an entire bounding box is labeled as an object<br />
<br />
== Quantitative Results ==<br />
<br />
The IoU metric is reported in Table 1. The Polygon RNN method outperforms the baselines in 6 out of the 8 categories and has a mean IoU greater than all of the baselines. In particular, in the car, person, and rider categories, it achieves 12%, 7%, and 6% higher IoU than SharpMask, respectively.<br />
<br />
[[File:Table_1_Neel.JPG | 800px|thumb|center|Table 1: IoU performance on Cityscapes data without any annotator intervention.]]<br />
<br />
In addition, with the help of the annotator, a speed-up factor of 7.3 was achieved with under 5 clicks, which the authors claim is the main advantage of this method.<br />
<br />
[[File:Table_0_Neel.JPG | 800px|thumb|center|Table 2: IoU performance on Cityscapes data with annotator intervention.]]<br />
<br />
The method also works well with other datasets such as KITTI:<br />
<br />
[[File:Table_2_Neel.JPG | 800px|thumb|center|Table 3: IoU performance on KITTI data.]]<br />
<br />
== Effect of object size ==<br />
Fig. 4 shows how the model performs relative to the baselines on different instance sizes. For small instances, the model performs significantly better than the baselines. For larger objects, the baselines have an advantage due to their larger output resolution.<br />
<br />
[[File:IoU_vs_size_of_instance.PNG | 500px|thumb|center|Fig 4: IoU vs. size of instance.]]<br />
<br />
== Qualitative Results ==<br />
<br />
Most of the comparisons with human annotators show that the method is on par with human-level annotation.<br />
<br />
<gallery widths=500px heights=500px perrow=2 mode="packed"><br />
File:Figure_3_Neel.JPG|Figure 6: Qualitative results: comparison with human annotator.|alt=alt language<br />
File:Figure_4_Neel.JPG|Figure 7: Qualitative results: comparison with human annotator.|alt=alt language<br />
</gallery><br />
<br />
=Conclusion=<br />
<br />
The important conclusions from this paper are:<br />
<br />
1. The paper presented a powerful generic annotation tool for modelling complex annotations as a simple polygon that works on different unseen datasets. <br />
<br />
2. Significant improvement in annotation time can be achieved with the Polygon-RNN method itself (speed-up factor of 4.74).<br />
<br />
3. However, the flexibility of having inputs from a human annotator helps increase the IoU for a certain range of clicks.<br />
<br />
4. The model architecture has a down-sampling factor of 16, and the final output resolution and accuracy are sensitive to object size.<br />
<br />
5. Another downside of the model architecture is that training time is increased due to the training of the CNN for the first vertex.<br />
<br />
=Critique=<br />
<br />
1. With the human annotator in the loop, the model speeds up the annotation process by over 7 times, which could be a significant cost and time saving for companies.<br />
<br />
2. Given that this model uses the VGG16 architecture compared to the 50 layer ResNet in SharpMask, this method is quite efficient.<br />
<br />
3. The method requires training an entire CNN for the first vertex and is inefficient in that sense, as this introduces additional parameters that add to the computation time and resource demand.<br />
<br />
4. The baseline methods have the upper hand over this model when it comes to larger objects, because of the down-scaled structure adopted by this model.<br />
<br />
5. In terms of future work, elimination of the additional CNN for the first vertex as well as an enhanced architecture to remain insensitive to the size of the object to be annotated should be implemented.<br />
<br />
6. Compared to other models, the model was shown to not perform as well for larger objects (see table 3). This is likely because vertex locations are determined in a highly compressed (28x28) representation compared to the input image (224x224). For larger objects, bounding boxes are larger, so each vertex represents many pixels. When up-converted back to the input image/bounding box size, this may lead to errors, especially when a very precise evaluation metric (intersection over union) is used. Potentially, the results can be improved by considering a higher resolution for the internal representation or one that scales with the size of the bounding box.<br />
<br />
7. While the model outperforms the baseline for certain categories of object, it is surprising that it underperforms in categories such as 'bus' and 'train'. With human annotators in the loop, one would expect the model to outperform in all categories.<br />
<br />
8. One of the major contributions of this paper lies in the fact that it presents a method with practical value in the real world. The paper shows that the method can greatly reduce human labeling effort and that, with human collaboration, the algorithm can help tackle the image labeling problem much more efficiently. However, it does not provide a theoretical explanation of why an RNN would work better than a CNN in this case; a more in-depth analysis would make the paper better.<br />
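<br />
The resolution argument in point 6 can be made concrete with a small arithmetic sketch. It assumes, for illustration only, that a square bounding box is mapped onto the 28x28 vertex grid and that a predicted vertex snaps to the centre of its cell; the function names are hypothetical.<br />

```python
GRID = 28  # side length of the vertex prediction grid, from the text


def cell_size(box_side_px, grid=GRID):
    """Pixels covered by one grid cell along one axis."""
    return box_side_px / grid


def max_axis_error(box_side_px, grid=GRID):
    """Worst-case per-axis error when a vertex snaps to a cell centre."""
    return cell_size(box_side_px, grid) / 2


# Bounding boxes of increasing side length (in original image pixels).
for side in (56, 224, 896):
    print(side, cell_size(side), max_axis_error(side))
```

Under these assumptions, a 224-pixel box gives 8 pixels per cell, and the per-axis quantization error grows linearly with the box size, which is consistent with the weaker results on large objects.<br />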
<br />
=Code=<br />
# [https://github.com/AlexMa011/pytorch-polygon-rnn] (unofficial)<br />
# Code for an updated version of the model is available at [https://github.com/fidler-lab/polyrnn-pp] (official)</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Robot_Learning_in_Homes:_Improving_Generalization_and_Reducing_Dataset_Bias&diff=42166Robot Learning in Homes: Improving Generalization and Reducing Dataset Bias2018-12-01T00:05:13Z<p>Z43ma: format update</p>
<hr />
<div>==Introduction==<br />
<br />
The use of data-driven approaches in robotics has increased in the last decade. Instead of using hand-designed models, these data-driven approaches work on large-scale datasets and learn appropriate policies that map from high-dimensional observations to actions. Since collecting data using an actual robot in real-time is very expensive, most of the data-driven approaches in robotics use simulators in order to collect simulated data. The concern here is whether these approaches have the capability to be robust enough to domain shift and to be used for real-world data. It is an undeniable fact that there is a wide reality gap between simulators and the real world.<br />
<br />
This has motivated the robotics community to increase their efforts in collecting real-world physical interaction data for a variety of tasks. This effort has been accelerated by the declining costs of hardware. This approach has been quite successful at tasks such as grasping, pushing, poking and imitation learning. However, the major problem is that the performance of these learning models is not good enough and tends to plateau fast. Furthermore, robotic action data have not led to gains similar to those seen in other areas such as computer vision and natural language processing. As the paper claims, the solution for all of these obstacles is using “real data”. Current robotic datasets lack diversity of environment. Learning-based approaches need to move out of simulators and labs and into real environments such as real homes so that they can learn from real datasets. <br />
<br />
The process of collecting real-world data is made difficult by a number of problems. First, there is a need for cheap and compact robots to collect data in homes, but current industrial robots (e.g. Sawyer and Baxter) are too expensive. Second, cheap robots are not accurate enough to collect reliable data. Third, there is a lack of constant supervision for data collection in homes. Finally, there is a circular dependency problem in home robotics: real-world data are needed to improve current robots, but current robots are not good enough to collect reliable data in homes. These challenges, in addition to some other external factors, will likely result in noisy data collection. In this paper, a first systematic effort is presented for collecting a dataset inside homes. In accomplishing this goal, the authors: <br />
<br />
1. Build a cheap robot costing less than USD 3K which is appropriate for use in homes<br />
<br />
2. Collect training data in 6 different homes and testing data in 3 homes<br />
<br />
3. Propose a method for modelling the noise in the labelled data<br />
<br />
4. Demonstrate that the diversity in the collected data provides superior performance and requires little-to-no domain adaptation<br />
<br />
[[File:aa1.PNG|600px|thumb|center|]]<br />
<br />
==Overview==<br />
<br />
This paper emphasizes the importance of diversifying the data for robotic learning in order to achieve greater generalization, focusing on the task of grasping. A diverse dataset also helps remove biases in the data. The paper argues that even for simple tasks like grasping, datasets collected in labs suffer from strong biases such as simple backgrounds and identical environment dynamics. Hence, models learned from such data do not generalize well to real datasets.<br />
<br />
To collect large-scale data inside a large number of homes, a low-cost robot is needed. For this reason, the authors introduced a customized mobile manipulator: a Dobot Magician robotic arm mounted on a Kobuki, which is a low-cost mobile robot base equipped with sensors such as bumper contact sensors and wheel encoders. The resulting robot arm has five degrees of freedom (DOF) (x, y, z, roll, pitch). The gripper is a two-fingered electric gripper with a 0.3kg payload. They also added an Intel R200 RGBD camera to the robot at a height of 1m above the ground. An onboard laptop with an Intel Core i5 processor performs all the processing. The whole system can run for 1.5 hours on a single charge.<br />
<br />
There is a trade-off: a low-cost robot is less accurate to control. The low-cost robot, which is built from cheaper components than expensive setups such as Baxter and Sawyer, suffers from higher calibration errors and execution errors. This means that the dataset collected with this approach is diverse and large, but it has noisy labels. To illustrate, suppose the robot wants to grasp at location <math> {(x, y)}</math>. Because of noise in the execution, the robot may actually perform this action at location <math> {(x + \delta_{x}, y+ \delta_{y})}</math>, so the success or failure label of this action is assigned to the wrong place. To solve this problem, the authors used an approach for learning from noisy data: they modeled the noise as a latent variable and used two networks, one for predicting the noise and one for predicting the action to execute.<br />
<br />
==Learning on low-cost robot data==<br />
<br />
This paper uses a patch grasping framework in its proposed architecture. As mentioned before, datasets collected by inaccurate, cheap robots tend to have noisy labels. The noise in the labels can be caused by hardware execution error, inaccurate kinematics, camera calibration, proprioception, wear and tear, etc. The following subsections explain the parts of the architecture that disentangle the noise between the low-cost robot’s actual and commanded executions.<br />
<br />
===Grasping Formulation===<br />
<br />
This architecture focuses on planar grasping, which means that all the objects are grasped at the same height and vertical to the ground (i.e. with a fixed end-effector pitch). The final goal is to find <math>{(x, y, \theta)}</math> given an observation <math> {I}</math> of the object, where <math> {x}</math> and <math> {y}</math> are the translational degrees of freedom and <math> {\theta}</math> is the rotational degree of freedom (roll of the end-effector). For the purpose of comparison, they used a model that does not predict <math>{(x, y, \theta)}</math> directly from the image <math> {I}</math>, but instead samples several smaller patches <math> {I_{P}}</math> at different locations <math>{(x, y)}</math>. The angle of grasp <math> {\theta}</math> is then predicted from these patches. To allow multi-modal predictions, the angle <math> {\theta}</math> is discretized into steps <math> {\theta_{D}}</math>. <br />
<br />
Hence, each datapoint consists of an image <math> {I}</math>, the executed grasp <math>{(x, y, \theta)}</math> and the grasp success/failure label <math> g </math>. The image <math> {I}</math> and the angle <math> {\theta}</math> are converted to an image patch <math> {I_{P}}</math> and a discrete angle <math> {\theta_{D}}</math>. A binary cross entropy loss between the predicted and ground-truth label <math> g </math> is then minimized. A convolutional neural network with weights initialized from pre-training on ImageNet is used for this formulation.<br />
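<br />
The formulation above can be sketched as follows. This is a minimal illustration, not the authors' code: the number of angle bins (18 bins of 10 degrees) is an assumption not stated in this summary, the per-bin probabilities are made-up values, and applying the loss only to the executed angle bin is likewise assumed.<br />

```python
import math

NUM_ANGLE_BINS = 18  # assumed: 18 bins of 10 degrees over [0, 180)


def discretize_angle(theta_deg, num_bins=NUM_ANGLE_BINS):
    """Map a grasp angle in degrees to a bin index theta_D."""
    bin_width = 180.0 / num_bins
    return int((theta_deg % 180.0) // bin_width)


def binary_cross_entropy(p, g, eps=1e-12):
    """BCE between predicted success probability p and label g in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(g * math.log(p) + (1 - g) * math.log(1.0 - p))


# Hypothetical network output: one success probability per angle bin
# for a single patch I_P. Only the executed bin receives a loss (assumed).
predicted = [0.1] * NUM_ANGLE_BINS
theta_d = discretize_angle(47.0)
predicted[theta_d] = 0.8

loss_success = binary_cross_entropy(predicted[theta_d], g=1)
loss_failure = binary_cross_entropy(predicted[theta_d], g=0)
print(theta_d, round(loss_success, 4), round(loss_failure, 4))
```

A confident prediction of success is penalized lightly when the grasp succeeded and heavily when it failed.<br />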
<br />
(Note: On Cross Entropy:<br />
<br />
If we think of a distribution as the tool we use to encode symbols, then entropy measures the number of bits we'll need if we use the correct tool. This is optimal, in that we can't encode the symbols using fewer bits on average.<br />
In contrast, cross entropy is the number of bits we'll need if we encode symbols from <math>y</math> using the wrong tool <math> {\hat y}</math>. This consists of encoding the <math> {i^{th}}</math> symbol using <math> {\log(\frac{1}{{\hat y_i}})}</math> bits instead of <math> {\log(\frac{1}{{ y_i}})}</math> bits. We of course still take the expected value with respect to the true distribution <math>y</math>, since it's the distribution that truly generates the symbols:<br />
<br />
\begin{align}<br />
H(y,\hat y) = \sum_i{y_i\log{\frac{1}{\hat y_i}}}<br />
\end{align}<br />
<br />
Cross entropy is never smaller than entropy; encoding symbols according to the wrong distribution <math> {\hat y}</math> will always make us use more bits. The only exception is the trivial case where <math>y</math> and <math> {\hat y}</math> are equal, and in this case entropy and cross entropy are equal.)<br />
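<br />
A small numeric check of the note above, using base-2 logarithms so that the results are in bits. The two distributions are made-up examples chosen so the arithmetic is exact.<br />

```python
import math


def entropy(y):
    """H(y) in bits; terms with y_i = 0 contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in y if p > 0)


def cross_entropy(y, y_hat):
    """H(y, y_hat) in bits: expected code length under the wrong model."""
    return sum(p * math.log2(1.0 / q) for p, q in zip(y, y_hat) if p > 0)


y = [0.5, 0.25, 0.25]      # true distribution
y_hat = [0.25, 0.25, 0.5]  # mismatched model

print(entropy(y))               # 1.5
print(cross_entropy(y, y_hat))  # 1.75
print(cross_entropy(y, y))      # 1.5
```

The mismatched model costs 0.25 extra bits per symbol on average, and the cross entropy of a distribution with itself reduces to its entropy.<br />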
<br />
===Modeling noise as latent variable===<br />
<br />
In order to tackle the problem of inaccurate position control and calibration caused by the cheap robot, the authors looked for structure in the noise that depends on the robot and its design. They modeled this structured noise as a latent variable and decoupled it during training. The approach is shown in Figure 2: <br />
<br />
<br />
[[File:aa2.PNG|600px|thumb|center|]]<br />
<br />
The conventional approach models the grasp success probability for a given image patch at a given angle and treats the environment variables that can introduce noise into the system as insignificant, because expensive commercial robots are highly accurate. However, in the low-cost setting with multiple robots collecting data in parallel, this noise becomes an important consideration for learning. <br />
<br />
The grasp success probability for image patch <math> {I_{P}}</math> at angle <math> {\theta_{D}}</math> is represented as <math> {P(g|I_{P},\theta_{D}; \mathcal{R} )}</math> where <math> \mathcal{R}</math> represents environment variables that can add noise to the system.<br />
<br />
The conditional probability of grasp success at a noisy image patch <math>I_P</math> for this model is computed by:<br />
<br />
<br />
\[ { P(g|I_{P},\theta_{D}, \mathcal{R} ) = \sum_{\widehat{I_P} \in \mathcal{P}} P(g|z=\widehat{I_P},\theta_{D},\mathcal{R}) \cdot P(z=\widehat{I_P} | \theta_{D},I_P,\mathcal{R})} \]<br />
<br />
<br />
Here, <math> {z}</math> models the latent variable of the actual patch executed, and <math>\widehat{I_P}</math> belongs to a set of possible neighboring patches <math> \mathcal{P}</math>. <math> P(z=\widehat{I_P}|\theta_D,I_P,\mathcal{R})</math> represents the noise that can be caused by the <math>\mathcal{R}</math> variables and is implemented as the Noise Modelling Network (NMN). <math> {P(g|z=\widehat{I_P},\theta_{D}, \mathcal{R} )}</math> represents the grasp prediction probability given the true patch and is implemented as the Grasp Prediction Network (GPN). The overall Robust-Grasp model is computed by marginalizing GPN and NMN.<br />
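<br />
The marginalization can be illustrated numerically. In this sketch the GPN and NMN are replaced by made-up numbers for three candidate patches (the paper uses a neighborhood of patches); <code>softmax</code> and <code>marginal_grasp_probability</code> are hypothetical helper names, not the authors' code.<br />

```python
import math


def softmax(scores):
    """Turn raw NMN scores into a distribution over candidate patches."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def marginal_grasp_probability(gpn_probs, nmn_probs):
    """Sum over patches of P(g | z = patch) * P(z = patch | R)."""
    return sum(g * n for g, n in zip(gpn_probs, nmn_probs))


# Hypothetical values for three candidate patches.
gpn_probs = [0.9, 0.2, 0.4]           # P(g = 1 | z = patch, theta_D, R)
nmn_probs = softmax([2.0, 0.0, 0.0])  # P(z = patch | R)

p_success = marginal_grasp_probability(gpn_probs, nmn_probs)
print(round(p_success, 4))
```

Because the NMN output is a probability distribution, the marginal is a weighted average of the per-patch GPN predictions and always lies between their minimum and maximum.<br />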
<br />
===Learning the latent noise model===<br />
<br />
This section concerns what the inputs to the NMN should be and how the NMN should be trained. The authors assume that <math> {z}</math> is conditionally independent of the local patch-specific variables <math> {(I_{P}, \theta_{D})}</math> given the global information <math>\mathcal{R}</math>, i.e. <math> P(z=\widehat{I_P}|\theta_D,I_P,\mathcal{R}) \equiv P(z=\widehat{I_P}|\mathcal{R})</math>. Apart from the patch <math> I_{P} </math> and grasp information <math>(x, y, \theta)</math>, they use information such as the image of the entire scene, the ID of the robot and the raw pixel location of the grasp. They argue that the image of the full scene could contain essential information about the system, such as the location of the camera relative to the ground, which may change over the lifetime of the robot. The identification number of the robot might give cues about errors specific to particular hardware. Finally, the raw pixel location of execution carries calibration-specific information: calibration error is coupled with pixel location because a least-squares fit is used to compute the calibration parameters.<br />
<br />
Explicit labels for <math>z</math> are not available to train the NMN. The latent variable could be estimated using a technique such as Expectation-Maximization, but the authors instead used direct optimization to learn both NMN and GPN jointly from the noisy labels. The inputs of the NMN are the image of the entire scene and the environment information, i.e. the robot ID and the raw-pixel grasp location. The output of the NMN is a probability distribution over the actual patches where the grasps are executed. Finally, a binary cross entropy loss is applied between the marginalized output of these two networks and the true grasp label <math>g</math>.<br />
<br />
===Training details===<br />
<br />
They implemented their model in PyTorch and fine-tuned a pretrained ResNet-18 model. For the NMN, they concatenated the 512-dimensional ResNet feature with a one-hot vector of the robot ID and the raw pixel location of the grasp. This passes through a series of three fully connected layers and a SoftMax layer to convert the correct-patch predictions into a probability distribution. The inputs of the GPN are the original noisy patch plus 8 other equidistant patches around it. The angle predictions for all the patches are passed through a sigmoid activation at the end to obtain the grasp success probability for a specific patch at a specific angle.<br />
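<br />
The input shapes described above can be sketched without any deep learning library. Everything beyond the 512-dimensional feature and the "original patch plus 8 neighbours" layout is an assumption made for illustration: the number of robots, the patch stride, the 3x3 arrangement of the neighbours, and all function names are hypothetical.<br />

```python
def one_hot(index, size):
    """One-hot encoding of a robot ID."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def nmn_input(scene_feature, robot_id, num_robots, grasp_pixel):
    """Concatenate scene feature, robot one-hot and raw grasp pixel."""
    return list(scene_feature) + one_hot(robot_id, num_robots) + list(grasp_pixel)


def candidate_patch_centers(x, y, stride):
    """Original patch centre plus its 8 neighbours on a 3x3 grid (assumed layout)."""
    return [(x + dx * stride, y + dy * stride)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


feature = [0.0] * 512  # stand-in for the 512-dimensional ResNet feature
vec = nmn_input(feature, robot_id=2, num_robots=5, grasp_pixel=(320.0, 240.0))
centers = candidate_patch_centers(320, 240, stride=20)
print(len(vec), len(centers), centers[4])
```

With these assumed sizes the NMN input has 512 + 5 + 2 = 519 entries, and the NMN output would be a 9-way distribution over the candidate patches.<br />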
<br />
The training of the network takes place in two stages. It starts with training only the GPN over 5 epochs of the data. Then, the NMN and the marginalization operator are added to the model, and the NMN and GPN are trained simultaneously in an end-to-end fashion for the remaining 25 epochs.<br />
<br />
This two-stage approach is crucial for effective training of their networks, without which NMN trivially selects the same patch irrespective of the input. The optimizer used for training is Adam [16].<br />
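<br />
The two-stage schedule can be written down as a small lookup. This is only a sketch of the schedule described above; the function name and the dictionary layout are hypothetical, and the actual optimizer steps are omitted.<br />

```python
GPN_ONLY_EPOCHS = 5  # stage 1, from the text
JOINT_EPOCHS = 25    # stage 2, from the text


def stage_for_epoch(epoch):
    """Describe the training setup for a 0-based epoch index."""
    if epoch < GPN_ONLY_EPOCHS:
        # Stage 1: only the GPN is trained, without marginalization.
        return {"train": ("GPN",), "marginalize": False}
    # Stage 2: NMN and GPN are trained jointly through the marginalized output.
    return {"train": ("GPN", "NMN"), "marginalize": True}


schedule = [stage_for_epoch(e) for e in range(GPN_ONLY_EPOCHS + JOINT_EPOCHS)]
print(len(schedule), schedule[4]["train"], schedule[5]["train"])
```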
<br />
==Results==<br />
<br />
In the results part of the paper, the authors show that collecting a dataset in homes is essential for generalizing to unseen environments. They also show that modelling the noise in their Low-Cost Arm (LCA) can improve grasping performance.<br />
<br />
They collected data in parallel using multiple robots in 6 different homes, as shown in Figure 3. Because the input data were unstructured, they used an object detector to locate objects; tiny-YOLO was chosen because of the LCA's limited memory and computational capabilities. Once an object location was detected, class information was discarded and a grasp was attempted. The grasp location in 3D was computed using point cloud data. They scattered different objects in homes within a 2m area to prevent the robot from colliding with obstacles, and let the robot move randomly and grasp objects. Finally, they collected a dataset with 28K grasp results.<br />
<br />
[[File:aa3.PNG|600px|thumb|center|]]<br />
<br />
To evaluate their approach in a more quantitative way, they used three test settings:<br />
<br />
- The first one is binary classification on held-out data. The test set is collected by performing random grasps on objects. They measure the performance of binary classification by predicting the success or failure of a grasp, given a location and an angle. Using binary classification allows for testing many models without running them on real robots. They collected two held-out datasets using the LCA in the lab and in homes, and also used the dataset for the Baxter robot.<br />
<br />
- The second one is Real Low-Cost Arm (Real-LCA). Here, they evaluate their model by running it in three unseen homes. They put 20 new objects in these three homes in different orientations. Since the objects and the environments are completely new, this test measures the generalization of the model.<br />
<br />
- The third one is Real Sawyer (Real-Sawyer). They evaluate the performance of their model by running the model on the Sawyer robot which is more accurate than the LCA. They tested their model in the lab environment to show that training models with the datasets collected from homes can improve the performance of models even in lab environments.<br />
<br />
They used baselines for both their data, which were collected in homes, and their model, which is Robust-Grasp. For the data, they used two baseline datasets: the dataset collected by a Baxter robot in the lab (Lab-Baxter) and the dataset collected by their LCA in the lab (Lab-LCA).<br />
For the model, they compared Robust-Grasp with the noise-independent patch grasping model (Patch-Grasp) [4]. They also compared their data and model with DexNet-3.0 (DexNet) as a strong real-world grasping baseline.<br />
<br />
===Experiment 1: Performance on held-out data===<br />
<br />
Table 1 shows that the models trained on lab data cannot generalize to the Home-LCA environment (i.e. they overfit to their respective environments and attain a lower binary classification score). However, the model trained on Home-LCA has a good performance on both lab data and home environment.<br />
<br />
[[File:aa4.PNG|600px|thumb|center|]]<br />
<br />
===Experiment 2: Performance on Real LCA Robot===<br />
<br />
In Table 2, the performance of the model trained on Home-LCA is compared against a pre-trained DexNet and the model trained on Lab-Baxter. Training on the Home-LCA dataset performs 43.7% better than training on the Lab-Baxter dataset and 33% better than DexNet. The low performance of DexNet can be explained by noise in the depth images caused by natural light: DexNet requires high-quality depth sensing and cannot perform well in these scenarios. Because the LCA uses cheap commodity RGBD cameras, noise in the depth images is not a matter of concern for the proposed model, which has no expectation of high-quality sensing.<br />
<br />
[[File:aa5.PNG|600px|thumb|center|]]<br />
<br />
===Experiment 3: Performance on Real Sawyer===<br />
<br />
To compare the performance of the Robust-Grasp model against the Patch-Grasp model without collecting noise-free data, they used Lab-Baxter for benchmarking, since the Baxter is an accurate and better-calibrated robot. The Sawyer robot is used for testing to ensure that the testing robot is different from both training robots. As shown in Table 3, the Robust-Grasp model trained on Home-LCA outperforms the Patch-Grasp model and achieves 77.5% accuracy. This accuracy is similar to that reported in several recent papers; however, this model was trained and tested in different environments. The Robust-Grasp model also outperforms Patch-Grasp by about 4% on binary classification. Furthermore, the visualizations of predicted noise corrections in Figure 4 show that the corrections depend on both the pixel locations of the noisy grasp and the robot.<br />
<br />
[[File:aa6.PNG|600px|thumb|center|]]<br />
<br />
[[File:aa7.PNG|600px|thumb|center|]]<br />
<br />
==Related work==<br />
<br />
Over the last few years, interest in scaling up robot learning with large-scale datasets has increased, and many papers have been published in this area. A hand-annotated grasping dataset, a self-supervised grasping dataset, and grasping using reinforcement learning are some examples of using large-scale datasets for grasping. The work mentioned above used high-cost hardware and data labeling mechanisms. There were also many papers that worked on other robotic tasks like material recognition, pushing objects and manipulating a rope. However, none of these papers worked on real data in real environments like homes; they all used lab data.<br />
<br />
Furthermore, since grasping is one of the basic problems in robotics, there have been efforts to improve it. Classical approaches focused on physics-based issues of grasping and required 3D models of the objects. Recent works, however, focus on data-driven approaches which learn to grasp objects from visual observations. Simulation and real-world robots have both been used for large-scale data collection. A versatile grasping model was proposed that achieves 90% performance on a bin-picking task. However, these approaches usually require high-quality depth as input; high-quality depth sensing is costly to implement in hardware and is thus a barrier to the practical use of robots in real environments.<br />
<br />
Most labs use industrial robots or standard collaborative hardware for their experiments. Therefore, little research has used low-cost robots. One example is learning to stack multiple blocks using a cheap, inaccurate robot. Although mobile robots like iRobot’s Roomba have been in the home consumer electronics market for a decade, it is not clear whether learning approaches are used in them alongside mapping and planning.<br />
<br />
Learning from noisy inputs is another challenge, specifically in computer vision. A controversial question which is often raised in this area is whether learning from noise can improve the performance. Some works show it could have bad effects on the performance; however, other works find it valuable when the noise is independent of, or statistically dependent on, the environment. In this paper, the authors used a model that can exploit the noise and learn a better grasping model.<br />
<br />
==Conclusion==<br />
<br />
All in all, the paper presents an approach for collecting large-scale robot data in real home environments. The authors implemented their approach using a mobile manipulator which is much cheaper than existing industrial robots. They collected a dataset of 28K grasps in six different homes. In order to solve the problem of noisy labels caused by their inaccurate robots, they presented a framework to factor out the noise in the data. They tested their model by physically grasping 20 new objects in three new homes and in the lab. The model trained with the home dataset showed a 43.7% improvement over the models trained with lab data. Their framework performed 33% better than a baseline DexNet model, which struggled with the typically poor depth sensing in common household environments with a lot of natural light. Their results also showed that their model can improve the grasping performance even in lab environments. They also demonstrated that their architecture for modeling the noise improved the performance by about 10%.<br />
<br />
==Critiques==<br />
<br />
This paper does not contain a significant algorithmic contribution; it combines a large number of data engineering techniques for the robot learning problem. The authors claim that they have obtained 43.7% more accuracy than baseline models, but it does not seem to be a fair comparison, as the data collection for the other methods happened in simulated settings in the lab, whereas the authors use the home dataset. The authors should also have discussed safety issues when training robots in real environments as opposed to simulated environments like labs. The authors are encouraging other researchers to look outside the labs, but are not discussing the critical safety issues in this approach.<br />
<br />
Another strange finding is that the paper mentions that they "follow a model architecture similar to [Pinto and Gupta [4]]"; however, the proposed model is, in fact, a fine-tuned ResNet-18 architecture. Pinto and Gupta implement a version similar to AlexNet, as shown below in Figure 5.<br />
<br />
[[File:Figure_5_PandG.JPG | 450px|thumb|center|Figure 5: AlexNet architecture implemented in Pinto and Gupta [4].]]<br />
<br />
<br />
The paper argues that the dataset collected by the LCA is noisy, since the robot is cheap and inaccurate. It further asserts that in order to handle the noise in the dataset, they can model the noise as a latent variable and their model can improve the performance of grasping. Although learning from noisy data and achieving a good performance is valuable, it is better that they test their noise modeling network for other robots as well. Since their noise modelling network takes robot information as an input, it would be a good idea to generalize it by testing it using different inaccurate robots to ensure that it would perform well.<br />
<br />
They did not mention other aspects of their comparison; for example, they could have reported their training time compared to other models or the size of other datasets.<br />
<br />
==References==<br />
<br />
#Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. "Domain randomization for transferring deep neural networks from simulation to the real world." 2017. URL https://arxiv.org/abs/1703.06907.<br />
#Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. "Sim-to-real transfer of robotic control with dynamics randomization." arXiv preprint arXiv:1710.06537,2017.<br />
#Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. "Asymmetric actor-critic for image-based robot learning." Robotics Science and Systems, 2018.<br />
#Lerrel Pinto and Abhinav Gupta. "Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours." CoRR, abs/1509.06825, 2015. URL http://arxiv.org/abs/1509.06825.<br />
#Adithyavairavan Murali, Lerrel Pinto, Dhiraj Gandhi, and Abhinav Gupta. "CASSL: Curriculum accelerated self-supervised learning." International Conference on Robotics and Automation, 2018.<br />
# Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. "End-to-end training of deep visuomotor policies." The Journal of Machine Learning Research, 17(1):1334–1373, 2016.<br />
#Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. "Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection." CoRR, abs/1603.02199, 2016. URL http://arxiv.org/abs/1603.02199.<br />
#Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. "Learning to poke by poking: Experiential learning of intuitive physics." 2016. URL http://arxiv.org/abs/1606.07419.<br />
#Chelsea Finn, Ian Goodfellow, and Sergey Levine. "Unsupervised learning for physical interaction through video prediction." In Advances in neural information processing systems, 2016.<br />
#Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey Levine. "Combining self-supervised learning and imitation for vision-based rope manipulation." International Conference on Robotics and Automation, 2017.<br />
#Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. "Revisiting unreasonable effectiveness of data in deep learning era." ICCV, 2017.<br />
#Marc Peter Deisenroth, Carl Edward Rasmussen, and Dieter Fox. Learning to control a low-cost manipulator using data-efficient reinforcement learning. RSS, 2011.<br />
#David F Nettleton, Albert Orriols-Puig, and Albert Fornells. A study of the effect of different types of noise on the precision of supervised learning techniques. Artificial intelligence review, 33(4):275–306, 2010.<br />
#Benoît Frénay and Michel Verleysen. Classification in the presence of label noise: a survey. IEEE transactions on neural networks and learning systems, 25(5):845–869, 2014.<br />
#Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2691–2699, 2015.<br />
#Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Robot_Learning_in_Homes:_Improving_Generalization_and_Reducing_Dataset_Bias&diff=42165Robot Learning in Homes: Improving Generalization and Reducing Dataset Bias2018-12-01T00:04:01Z<p>Z43ma: grammer update</p>
<hr />
<div>==Introduction==<br />
<br />
The use of data-driven approaches in robotics has increased in the last decade. Instead of relying on hand-designed models, these approaches work on large-scale datasets and learn policies that map high-dimensional observations to actions. Since collecting data with a physical robot is very expensive, most data-driven approaches in robotics collect their data in simulators. The concern is whether policies learned this way are robust enough to domain shift to be used on real-world data, since there is a wide reality gap between simulators and the real world.<br />
<br />
This has motivated the robotics community to increase its efforts in collecting real-world physical interaction data for a variety of tasks, an effort accelerated by the declining cost of hardware. The approach has been quite successful at tasks such as grasping, pushing, poking and imitation learning. However, the major problem is that the performance of these learning models is not good enough and tends to plateau quickly. Furthermore, robotic action data has not led to gains comparable to those seen in other areas such as computer vision and natural language processing. The paper claims that the solution to these obstacles is using “real data”: current robotic datasets lack diversity of environment, so learning-based approaches need to move out of simulators and labs and into real environments such as homes, where they can learn from real datasets. <br />
<br />
Collecting real-world data is made difficult by a number of problems. First, cheap and compact robots are needed to collect data in homes, but current industrial robots (e.g. Sawyer and Baxter) are too expensive. Second, cheap robots are not accurate enough to collect reliable data. Third, data collection in homes cannot be constantly supervised. Finally, there is a circular dependency problem in home robotics: real-world data is needed to improve current robots, but current robots are not good enough to collect reliable data in homes. These challenges, together with other external factors, will likely result in noisy data. This paper presents a first systematic effort to collect a dataset inside homes. In accomplishing this goal, the authors: <br />
<br />
1. Build a cheap robot costing less than USD 3K which is appropriate for use in homes<br />
<br />
2. Collect training data in 6 different homes and testing data in 3 homes<br />
<br />
3. Propose a method for modelling the noise in the labelled data<br />
<br />
4. Demonstrate that the diversity in the collected data provides superior performance and requires little-to-no domain adaptation<br />
<br />
[[File:aa1.PNG|600px|thumb|center|]]<br />
<br />
==Overview==<br />
<br />
This paper emphasizes the importance of diversifying the data for robotic learning in order to achieve greater generalization, focusing on the task of grasping. A diverse dataset also helps remove biases in the data. The paper argues that even for simple tasks like grasping, datasets collected in labs suffer from strong biases such as simple backgrounds and identical environment dynamics. Hence, models trained on them do not generalize well to real-world data.<br />
<br />
Collecting large-scale data inside a large number of homes requires a low-cost robot. For this reason, the authors introduced a customized mobile manipulator: a Dobot Magician robotic arm mounted on a Kobuki, a low-cost mobile robot base equipped with sensors such as bumper contact sensors and wheel encoders. The resulting robot arm has five degrees of freedom (DOF) (x, y, z, roll, pitch). The gripper is a two-fingered electric gripper with a 0.3kg payload. An Intel R200 RGBD camera is mounted on the robot at a height of 1m above the ground, and an onboard laptop with an Intel Core i5 processor performs all the processing. The whole system can run for 1.5 hours on a single charge.<br />
<br />
The trade-off of a low-cost robot is less accurate control. Because it is built from cheaper components than expensive setups such as Baxter and Sawyer, the low-cost robot suffers from higher calibration and execution errors. As a result, the dataset collected with this approach is large and diverse, but its labels are noisy. To illustrate, suppose the robot is commanded to grasp at location <math> {(x, y)}</math>. Because of execution noise, it may actually grasp at <math> {(x + \delta_{x}, y+ \delta_{y})}</math>, so the success or failure label is assigned to the wrong location. To address this, the authors learn from the noisy data directly: they model the noise as a latent variable and use two networks, one that predicts the noise and one that predicts the action to execute.<br />
<br />
==Learning on low-cost robot data==<br />
<br />
This paper uses a patch grasping framework in its proposed architecture. As mentioned before, datasets collected by cheap, inaccurate robots tend to have noisy labels. The noise can come from hardware execution error, inaccurate kinematics, camera calibration, proprioception, wear and tear, etc. The following subsections describe the parts of the architecture that disentangle the noise between the low-cost robot’s commanded and actual executions.<br />
<br />
===Grasping Formulation===<br />
<br />
The architecture addresses planar grasping, meaning that all objects are grasped at the same height and vertical to the ground (i.e. with a fixed end-effector pitch). The goal is to find <math>{(x, y, \theta)}</math> given an observation <math> {I}</math> of the object, where <math> {x}</math> and <math> {y}</math> are the translational degrees of freedom and <math> {\theta}</math> is the rotational degree of freedom (roll of the end-effector). For the purpose of comparison, they used a model which does not predict <math>{(x, y, \theta)}</math> directly from the image <math> {I}</math>, but instead samples several smaller patches <math> {I_{P}}</math> at different locations <math>{(x, y)}</math> and predicts the grasp angle <math> {\theta}</math> from each patch. To allow multi-modal predictions, the angle <math> {\theta}</math> is discretized into steps, denoted <math> {\theta_{D}}</math>. <br />
<br />
Hence, each datapoint consists of an image <math> {I}</math>, the executed grasp <math>{(x, y, \theta)}</math> and the grasp success/failure label <math> g </math>. The image <math> {I}</math> and the angle <math> {\theta}</math> are converted to an image patch <math> {I_{P}}</math> and a discretized angle <math> {\theta_{D}}</math>. A binary cross entropy loss is then used to minimize the classification error between the predicted and ground truth label <math> g </math>. A convolutional neural network with weights initialized from pre-training on ImageNet is used for this formulation.<br />
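<br />
The conversion from a raw datapoint to a training example can be sketched in plain Python as follows. This is only an illustration and not the authors' code: the number of angle bins (18), the wrapping of angles into [0, π) (which assumes a symmetric two-fingered gripper), the patch size and the helper names <code>discretize_angle</code>, <code>crop_patch</code> and <code>to_training_example</code> are all assumptions.<br />

```python
import math

N_ANGLE_BINS = 18  # hypothetical; the summary does not state the number of bins


def discretize_angle(theta, n_bins=N_ANGLE_BINS):
    """Map a grasp angle in radians to a bin index in [0, n_bins).

    Assumption: a two-fingered planar grasp is unchanged by a half turn,
    so angles are first wrapped into [0, pi).
    """
    theta = theta % math.pi
    return min(int(theta / (math.pi / n_bins)), n_bins - 1)


def crop_patch(image, x, y, size):
    """Crop a size-by-size patch around pixel (x, y) from a 2D list.

    Rows are indexed by y and columns by x. The window is shifted
    inwards when (x, y) is too close to the image border.
    """
    half = size // 2
    top = max(0, min(y - half, len(image) - size))
    left = max(0, min(x - half, len(image[0]) - size))
    return [row[left:left + size] for row in image[top:top + size]]


def to_training_example(image, x, y, theta, g, patch_size=3):
    """Turn (I, (x, y, theta), g) into (I_P, theta_D, g)."""
    return crop_patch(image, x, y, patch_size), discretize_angle(theta), g
```

With these assumed settings, a grasp at angle 0.5 rad falls into the same bin as one at 0.5 + π rad, which is the intended effect of the wrap.<br />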
<br />
(Note: On Cross Entropy:<br />
<br />
If we think of a distribution as the tool we use to encode symbols, then entropy measures the number of bits we'll need if we use the correct tool. This is optimal, in that we can't encode the symbols using fewer bits on average.<br />
In contrast, cross entropy is the number of bits we'll need if we encode symbols from <math> {y}</math> using the wrong tool <math> {\hat y}</math>. This consists of encoding the <math> {i}</math>-th symbol using <math> {\log(\frac{1}{{\hat y_i}})}</math> bits instead of <math> {\log(\frac{1}{{ y_i}})}</math> bits. We still take the expected value with respect to the true distribution <math> {y}</math>, since it is the distribution that truly generates the symbols:<br />
<br />
\begin{align}<br />
H(y,\hat y) = \sum_i{y_i\log{\frac{1}{\hat y_i}}}<br />
\end{align}<br />
<br />
Cross entropy is never smaller than entropy: encoding symbols according to the wrong distribution <math> {\hat y}</math> always costs more bits on average, except in the trivial case where <math> {y}</math> and <math> {\hat y}</math> are equal, in which case entropy and cross entropy coincide.)<br />
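<br />
A small numerical check of the note above, written as a sketch in plain Python. The two distributions are made-up examples, and the function names are chosen here for illustration. Logarithms are taken in base 2 so that the results are in bits, and the code assumes <math> {\hat y_i} > 0</math> wherever <math> {y_i} > 0</math>.<br />

```python
import math


def entropy(y):
    """Average number of bits when symbols from y are encoded with y itself."""
    return sum(p * math.log2(1.0 / p) for p in y if p > 0)


def cross_entropy(y, y_hat):
    """Average number of bits when symbols from y are encoded with y_hat.

    Assumes y_hat[i] > 0 wherever y[i] > 0.
    """
    return sum(p * math.log2(1.0 / q) for p, q in zip(y, y_hat) if p > 0)


y = [0.5, 0.25, 0.25]      # true distribution (made-up example)
y_hat = [0.25, 0.25, 0.5]  # wrong distribution used for encoding

print(entropy(y))               # 1.5
print(cross_entropy(y, y_hat))  # 1.75
print(cross_entropy(y, y))      # 1.5, equal to the entropy
```

Encoding with the wrong distribution costs 0.25 extra bits per symbol in this example, and the gap vanishes when the two distributions are equal.<br />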
<br />
===Modeling noise as latent variable===<br />
<br />
To tackle the inaccurate position control and calibration of the cheap robot, the authors exploit the fact that the noise has structure that depends on the robot and its design. They model this structured noise as a latent variable and decouple it during training. The approach is shown in Figure 2: <br />
<br />
<br />
[[File:aa2.PNG|600px|thumb|center|]]<br />
<br />
The conventional approach models the grasp success probability for a given image patch at a given angle and ignores the environment variables that can introduce noise into the system, because their effect is generally insignificant on accurate, expensive commercial robots. However, in the low-cost setting with multiple robots collecting data in parallel, this noise becomes an important consideration for learning. <br />
<br />
The grasp success probability for image patch <math> {I_{P}}</math> at angle <math> {\theta_{D}}</math> is represented as <math> {P(g|I_{P},\theta_{D}, \mathcal{R} )}</math> where <math> \mathcal{R}</math> represents environment variables that can add noise to the system.<br />
<br />
The conditional probability of grasp success at a noisy image patch <math>I_P</math> under this model is computed by:<br />
<br />
<br />
\[ { P(g|I_{P},\theta_{D}, \mathcal{R} ) = \sum_{ \widehat{I_P} \in \mathcal{P}} P(g|z=\widehat{I_P},\theta_{D},\mathcal{R}) \cdot P(z=\widehat{I_P} | \theta_{D},I_P,\mathcal{R})} \]<br />
<br />
<br />
Here, <math> {z}</math> models the latent variable of the actual patch executed, and <math>\widehat{I_P}</math> belongs to a set of possible neighboring patches <math> \mathcal{P}</math>. <math> P(z=\widehat{I_P}|\theta_D,I_P,\mathcal{R})</math> represents the noise which can be caused by the <math>\mathcal{R}</math> variables and is implemented as the Noise Modelling Network (NMN). <math> {P(g|z=\widehat{I_P},\theta_{D}, \mathcal{R} )}</math> represents the grasp prediction probability given the true patch and is implemented as the Grasp Prediction Network (GPN). The overall Robust-Grasp model is computed by marginalizing the GPN output over the NMN distribution.<br />
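<br />
The marginalization in the equation above is a weighted sum over candidate patches. The following sketch shows it in plain Python, assuming the two networks have already produced their outputs as lists; the function name and the example numbers are made up for illustration and do not come from the paper.<br />

```python
def robust_grasp_probability(gpn_probs, nmn_probs):
    """Marginalize grasp success over the latent executed patch z.

    gpn_probs[k]: P(g = 1 | z = k-th candidate patch, theta_D), GPN output
    nmn_probs[k]: P(z = k-th candidate patch | ...), NMN output, sums to 1
    """
    if len(gpn_probs) != len(nmn_probs):
        raise ValueError("one GPN and one NMN value is needed per patch")
    if abs(sum(nmn_probs) - 1.0) > 1e-6:
        raise ValueError("NMN output must be a probability distribution")
    return sum(g * n for g, n in zip(gpn_probs, nmn_probs))


# Three candidate patches (made-up numbers): the NMN believes the first
# patch is the most likely place where the grasp was really executed.
p = robust_grasp_probability([0.9, 0.1, 0.5], [0.5, 0.25, 0.25])
```

If the NMN puts all of its mass on the commanded patch, the result reduces to the GPN prediction for that patch, i.e. to the noise-independent model.<br />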
<br />
===Learning the latent noise model===<br />
<br />
This section concerns what the inputs to the NMN should be and how the network should be trained. The authors assume that <math> {z}</math> is conditionally independent of the local patch-specific variables <math> {(I_{P}, \theta_{D})}</math> given the global information <math>\mathcal{R}</math>, i.e. <math> P(z=\widehat{I_P}|\theta_D,I_P,\mathcal{R}) \equiv P(z=\widehat{I_P}|\mathcal{R})</math>. Apart from the patch <math> I_{P} </math> and grasp information <math>{(x, y, \theta)}</math>, they use other information such as the image of the entire scene, the ID of the robot and the raw pixel location of the grasp. They argue that the image of the full scene could contain essential information about the system, such as the location of the camera relative to the ground, which may change over the lifetime of the robot. The ID of the robot might give cues about errors specific to a particular piece of hardware. Finally, the raw pixel location of the executed grasp carries calibration-specific information: calibration error is coupled with pixel location because a least squares fit is used to compute the calibration parameters.<br />
<br />
Explicit labels for training the NMN are not available. The latent variable <math>z</math> could be estimated using a technique such as Expectation-Maximization, but the authors instead use direct optimization to learn the NMN and GPN jointly from the noisy labels. The inputs of the NMN are the image of the entire scene and the environment information, as well as the robot ID and the raw-pixel grasp location. Its output is a probability distribution over the patches where the grasp may actually have been executed. Finally, a binary cross entropy loss is applied between the marginalized output of the two networks and the true grasp label <math>g</math>.<br />
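<br />
The loss described above needs only the grasp label <math>g</math> and no label for <math>z</math>. A minimal sketch in plain Python, assuming the GPN and NMN outputs are given as lists; the function names, the clipping constant <code>eps</code> and the example numbers are assumptions made for illustration.<br />

```python
import math


def marginal_success_probability(gpn_probs, nmn_probs):
    """Weighted sum of GPN predictions under the NMN patch distribution."""
    return sum(g * n for g, n in zip(gpn_probs, nmn_probs))


def robust_grasp_loss(gpn_probs, nmn_probs, g, eps=1e-7):
    """Binary cross entropy between the marginalized prediction and label g.

    g is 1 for a successful grasp and 0 for a failed one. The probability
    is clipped to [eps, 1 - eps] to keep the logarithms finite.
    """
    p = marginal_success_probability(gpn_probs, nmn_probs)
    p = min(max(p, eps), 1.0 - eps)
    return -(g * math.log(p) + (1 - g) * math.log(1.0 - p))


# Made-up example with two candidate patches and a successful grasp (g = 1).
loss_uniform = robust_grasp_loss([0.9, 0.1], [0.5, 0.5], g=1)
loss_shifted = robust_grasp_loss([0.9, 0.1], [1.0, 0.0], g=1)
```

In this example the loss is lower when the NMN shifts its mass to the patch that the GPN considers graspable, which is how the single label can provide a training signal to both networks.<br />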
<br />
===Training details===<br />
<br />
They implemented their model in PyTorch and fine-tuned a pretrained ResNet-18 model. For the NMN, they concatenated the 512-dimensional ResNet feature with a one-hot vector of the robot ID and the raw pixel location of the grasp. This passes through a series of three fully connected layers and a softmax layer that converts the correct-patch predictions to a probability distribution. The inputs of the GPN are the original noisy patch plus 8 other patches equidistant from the original one. The angle predictions for all the patches are passed through a sigmoid activation at the end to obtain the grasp success probability for a specific patch at a specific angle.<br />
The training of the network takes place in two stages. It starts with training only the GPN for 5 epochs. Then the NMN and the marginalization operator are added to the model, and the NMN and GPN are trained simultaneously in an end-to-end fashion for the remaining 25 epochs.<br />
This two-stage approach is crucial for effective training of their networks; without it, the NMN trivially selects the same patch irrespective of the input. The optimizer used for training is Adam [16].<br />
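<br />
The two-stage schedule can be written down as a small generator. This is a sketch of the schedule only, with the epoch counts taken from the description above; the function name and the stage labels are hypothetical, and the actual PyTorch training step is left out.<br />

```python
def training_schedule(gpn_only_epochs=5, joint_epochs=25):
    """Yield (epoch, stage, trainable_networks) for the two-stage training.

    Stage 1 trains only the GPN. Stage 2 adds the NMN and the
    marginalization operator and trains both networks end to end.
    """
    for epoch in range(gpn_only_epochs + joint_epochs):
        if epoch < gpn_only_epochs:
            yield epoch, "gpn_only", ("GPN",)
        else:
            yield epoch, "joint", ("GPN", "NMN")


schedule = list(training_schedule())
for epoch, stage, networks in schedule:
    # A real training loop would run one pass over the data here,
    # updating only the networks listed for this epoch.
    pass
```

Starting the joint stage from an already trained GPN matches the motivation given above: the NMN then has a meaningful grasp predictor to marginalize over instead of collapsing onto a single patch.<br />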
<br />
==Results==<br />
<br />
In the results section of the paper, the authors show that collecting data in homes is essential for learning models that generalize to unseen environments. They also show that modelling the noise in their Low-Cost Arm (LCA) can improve grasping performance.<br />
They collected data in parallel using multiple robots in 6 different homes, as shown in Figure 3. Because the visual input in homes is unstructured, they used an object detector; tiny-YOLO was chosen because of the LCA's limited memory and computational capabilities. Once an object location was detected, the class information was discarded and a grasp was attempted. The grasp location in 3D was computed using PointCloud data. They scattered different objects within a 2m area in each home to prevent the robot from colliding with obstacles, and let the robot move randomly and grasp objects. In total, they collected a dataset of 28K grasp results.<br />
<br />
[[File:aa3.PNG|600px|thumb|center|]]<br />
<br />
To evaluate their approach in a more quantitative way, they used three test settings:<br />
<br />
- The first one is binary classification on held-out data. The test set is collected by performing random grasps on objects. Performance is measured by predicting the success or failure of a grasp given a location and an angle. Using binary classification allows many models to be tested without running them on real robots. They collected two held-out datasets using the LCA, in the lab and in homes, and also used the dataset for the Baxter robot.<br />
<br />
- The second one is Real Low-Cost Arm (Real-LCA). Here, they evaluate their model by running it in three unseen homes, placing 20 new objects in these homes in different orientations. Since the objects and the environments are completely new, this test measures the generalization of the model.<br />
<br />
- The third one is Real Sawyer (Real-Sawyer). They evaluate their model by running it on the Sawyer robot, which is more accurate than the LCA. The test takes place in the lab to show that training on data collected in homes can improve performance even in lab environments.<br />
<br />
They used baselines for both their data, which is collected in homes, and their model, Robust-Grasp. For the data, they used two baseline datasets: the dataset collected in the lab with a Baxter robot (Lab-Baxter) and the dataset collected by their LCA in the lab (Lab-LCA).<br />
For the model, they compared Robust-Grasp with the noise-independent patch grasping model (Patch-Grasp) [4]. They also compared their data and model with DexNet-3.0 (DexNet) as a strong real-world grasping baseline.<br />
<br />
===Experiment 1: Performance on held-out data===<br />
<br />
Table 1 shows that the models trained on lab data cannot generalize to the Home-LCA environment (i.e. they overfit to their respective environments and attain a lower binary classification score). However, the model trained on Home-LCA performs well on both the lab data and the home environment.<br />
<br />
[[File:aa4.PNG|600px|thumb|center|]]<br />
<br />
===Experiment 2: Performance on Real LCA Robot===<br />
<br />
In Table 2, the performance of the model trained on Home-LCA is compared against a pre-trained DexNet and the model trained on Lab-Baxter. Training on the Home-LCA dataset performs 43.7% better than training on the Lab-Baxter dataset and 33% better than DexNet. The low performance of DexNet can be explained by noise in the depth images caused by natural light: DexNet requires high-quality depth sensing and cannot perform well in these scenarios. Because the LCA uses a cheap commodity RGBD camera, noise in the depth images is less of a concern, as the model has no expectation of high-quality sensing.<br />
<br />
[[File:aa5.PNG|600px|thumb|center|]]<br />
<br />
===Experiment 3: Performance on Real Sawyer===<br />
<br />
To compare the Robust-Grasp model against the Patch-Grasp model without collecting noise-free data, they benchmarked on Lab-Baxter, which was collected with a more accurate and better calibrated robot. The Sawyer robot is used for testing to ensure that the testing robot is different from both training robots. As shown in Table 3, the Robust-Grasp model trained on Home-LCA outperforms the Patch-Grasp model and achieves 77.5% accuracy. This accuracy is similar to that reported in several recent papers, even though this model was trained and tested in different environments. The Robust-Grasp model also outperforms Patch-Grasp by about 4% on binary classification. Furthermore, the visualizations of predicted noise corrections in Figure 4 show that the corrections depend on both the pixel location of the noisy grasp and the robot.<br />
<br />
[[File:aa6.PNG|600px|thumb|center|]]<br />
<br />
[[File:aa7.PNG|600px|thumb|center|]]<br />
<br />
==Related work==<br />
<br />
Over the last few years, interest in scaling up robot learning with large-scale datasets has increased, and many papers have been published in this area. A hand-annotated grasping dataset, a self-supervised grasping dataset, and grasping using reinforcement learning are some examples of using large-scale datasets for grasping. These works used high-cost hardware and data labeling mechanisms. Many other papers worked on other robotic tasks such as material recognition, pushing objects and manipulating a rope. However, none of these papers worked on real data in real environments like homes; they all used lab data.<br />
<br />
Furthermore, since grasping is one of the basic problems in robotics, there have been many efforts to improve it. Classical approaches focused on physics-based analysis of grasping and required 3D models of the objects. More recent works focus on data-driven approaches that learn to grasp objects from visual observations, using both simulation and real-world robots for large-scale data collection. A versatile grasping model was proposed that achieves 90% performance on a bin-picking task. However, these methods usually require high-quality depth as input. High-quality depth sensing is expensive to implement in hardware and is therefore a barrier to the practical use of robots in real environments.<br />
<br />
Most labs use industrial robots or standard collaborative hardware for their experiments, so little research has used low-cost robots. One example is learning to stack multiple blocks with a cheap, inaccurate robot. Although mobile robots like iRobot’s Roomba have been in the home consumer electronics market for a decade, it is not clear whether they use learning approaches alongside mapping and planning.<br />
<br />
Learning from noisy inputs is another challenge, particularly in computer vision. An often-raised question in this area is whether learning from noise can improve performance. Some works show that noise can hurt performance, while others find it valuable when the noise is independent of, or statistically dependent on, the environment. In this paper, the authors use a model that exploits the noise to learn a better grasping model.<br />
<br />
==Conclusion==<br />
<br />
The paper presents an approach for collecting large-scale robot data in real home environments. The authors implemented it using a mobile manipulator that is much cheaper than existing industrial robots, and collected a dataset of 28K grasps in six different homes. To handle the noisy labels caused by their inaccurate robots, they presented a framework that factors out the noise in the data. They tested their model by physically grasping 20 new objects in three new homes and in the lab. The model trained on the home dataset showed a 43.7% improvement over models trained on lab data, and their framework performed 33% better than a baseline DexNet model, which struggled with the typically poor depth sensing in common household environments with a lot of natural light. Their results also showed that their model can improve grasping performance even in lab environments, and that their architecture for modeling the noise improved performance by about 10%.<br />
<br />
==Critiques==<br />
<br />
This paper does not contain a significant algorithmic contribution; it combines a large number of data engineering techniques for the robot learning problem. The authors claim a 43.7% improvement in accuracy over baseline models, but the comparison does not seem fair, since the data for the other methods was collected in lab settings whereas the authors use the home dataset. The authors should also have discussed the safety issues of training robots in real environments as opposed to labs: they encourage other researchers to look outside the labs but do not discuss the critical safety issues of this approach.<br />
<br />
It is also odd that the paper says the authors "follow a model architecture similar to [Pinto and Gupta [4]]," when the proposed model is, in fact, a fine-tuned ResNet-18 architecture. Pinto and Gupta implement a version similar to AlexNet, as shown below in Figure 5.<br />
<br />
[[File:Figure_5_PandG.JPG | 450px|thumb|center|Figure 5: AlexNet architecture implemented in Pinto and Gupta [4].]]<br />
<br />
<br />
The paper argues that the dataset collected by the LCA is noisy because the robot is cheap and inaccurate. It further asserts that modelling this noise as a latent variable improves grasping performance. Although learning from noisy data and achieving good performance is valuable, it would be better to test the noise modelling network on other robots as well. Since the network takes robot information as an input, testing it on different inaccurate robots would show whether it generalizes.<br />
<br />
They also left out other aspects of the comparison; for example, they could have reported their training time relative to other models or the sizes of the other datasets.<br />
<br />
==References==<br />
<br />
#Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. "Domain randomization for transferring deep neural networks from simulation to the real world." 2017. URL https://arxiv.org/abs/1703.06907.<br />
#Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. "Sim-to-real transfer of robotic control with dynamics randomization." arXiv preprint arXiv:1710.06537, 2017.<br />
#Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. "Asymmetric actor-critic for image-based robot learning." Robotics Science and Systems, 2018.<br />
#Lerrel Pinto and Abhinav Gupta. "Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours." CoRR, abs/1509.06825, 2015. URL http://arxiv.org/abs/1509.06825.<br />
#Adithyavairavan Murali, Lerrel Pinto, Dhiraj Gandhi, and Abhinav Gupta. "CASSL: Curriculum accelerated self-supervised learning." International Conference on Robotics and Automation, 2018.<br />
#Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. "End-to-end training of deep visuomotor policies." The Journal of Machine Learning Research, 17(1):1334–1373, 2016.<br />
#Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. "Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection." CoRR, abs/1603.02199, 2016. URL http://arxiv.org/abs/1603.02199.<br />
#Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. "Learning to poke by poking: Experiential learning of intuitive physics." 2016. URL http://arxiv.org/abs/1606.07419<br />
#Chelsea Finn, Ian Goodfellow, and Sergey Levine. "Unsupervised learning for physical interaction through video prediction." In Advances in neural information processing systems, 2016.<br />
#Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey Levine. "Combining self-supervised learning and imitation for vision-based rope manipulation." International Conference on Robotics and Automation, 2017.<br />
#Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. "Revisiting unreasonable effectiveness of data in deep learning era." ICCV, 2017.<br />
#Marc Peter Deisenroth, Carl Edward Rasmussen, and Dieter Fox. Learning to control a low-cost manipulator using data-efficient reinforcement learning. RSS, 2011.<br />
#David F Nettleton, Albert Orriols-Puig, and Albert Fornells. A study of the effect of different types of noise on the precision of supervised learning techniques. Artificial intelligence review, 33(4):275–306, 2010.<br />
#Benoît Frénay and Michel Verleysen. Classification in the presence of label noise: a survey. IEEE transactions on neural networks and learning systems, 25(5):845–869, 2014.<br />
#Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2691–2699, 2015.<br />
#Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Robot_Learning_in_Homes:_Improving_Generalization_and_Reducing_Dataset_Bias&diff=42164Robot Learning in Homes: Improving Generalization and Reducing Dataset Bias2018-12-01T00:02:17Z<p>Z43ma: some details of why certain features are included</p>
<hr />
<div>==Introduction==<br />
<br />
The use of data-driven approaches in robotics has increased in the last decade. Instead of relying on hand-designed models, these approaches work on large-scale datasets and learn policies that map high-dimensional observations to actions. Since collecting data with a physical robot is very expensive, most data-driven approaches in robotics collect their data in simulators. The concern is whether policies learned this way are robust enough to domain shift to be used on real-world data, since there is a wide reality gap between simulators and the real world.<br />
<br />
This has motivated the robotics community to increase its efforts in collecting real-world physical interaction data for a variety of tasks, an effort accelerated by the declining cost of hardware. The approach has been quite successful at tasks such as grasping, pushing, poking and imitation learning. However, the major problem is that the performance of these learning models is not good enough and tends to plateau quickly. Furthermore, robotic action data has not led to gains comparable to those seen in other areas such as computer vision and natural language processing. The paper claims that the solution to these obstacles is using “real data”: current robotic datasets lack diversity of environment, so learning-based approaches need to move out of simulators and labs and into real environments such as homes, where they can learn from real datasets. <br />
<br />
Collecting real-world data is made difficult by a number of problems. First, cheap and compact robots are needed to collect data in homes, but current industrial robots (e.g. Sawyer and Baxter) are too expensive. Second, cheap robots are not accurate enough to collect reliable data. Third, data collection in homes cannot be constantly supervised. Finally, there is a circular dependency problem in home robotics: real-world data is needed to improve current robots, but current robots are not good enough to collect reliable data in homes. These challenges, together with other external factors, will likely result in noisy data. This paper presents a first systematic effort to collect a dataset inside homes. In accomplishing this goal, the authors: <br />
<br />
1. Build a cheap robot costing less than USD 3K which is appropriate for use in homes<br />
<br />
2. Collect training data in 6 different homes and testing data in 3 homes<br />
<br />
3. Propose a method for modelling the noise in the labelled data<br />
<br />
4. Demonstrate that the diversity in the collected data provides superior performance and requires little-to-no domain adaptation<br />
<br />
[[File:aa1.PNG|600px|thumb|center|]]<br />
<br />
==Overview==<br />
<br />
This paper emphasizes the importance of diversifying the data for robotic learning in order to achieve greater generalization, focusing on the task of grasping. A diverse dataset also helps remove biases in the data. The paper argues that even for simple tasks like grasping, datasets collected in labs suffer from strong biases such as simple backgrounds and identical environment dynamics. Hence, models trained on them do not generalize well to real-world data.<br />
<br />
Collecting large-scale data inside a large number of homes requires a low-cost robot. For this reason, the authors introduced a customized mobile manipulator: a Dobot Magician robotic arm mounted on a Kobuki, a low-cost mobile robot base equipped with sensors such as bumper contact sensors and wheel encoders. The resulting robot arm has five degrees of freedom (DOF) (x, y, z, roll, pitch). The gripper is a two-fingered electric gripper with a 0.3kg payload. An Intel R200 RGBD camera is mounted on the robot at a height of 1m above the ground, and an onboard laptop with an Intel Core i5 processor performs all the processing. The whole system can run for 1.5 hours on a single charge.<br />
<br />
The trade-off of a low-cost robot is less accurate control. Because it is built from cheaper components than expensive setups such as Baxter and Sawyer, the low-cost robot suffers from higher calibration and execution errors. As a result, the dataset collected with this approach is large and diverse, but its labels are noisy. To illustrate, suppose the robot is commanded to grasp at location <math> {(x, y)}</math>. Because of execution noise, it may actually grasp at <math> {(x + \delta_{x}, y+ \delta_{y})}</math>, so the success or failure label is assigned to the wrong location. To address this, the authors learn from the noisy data directly: they model the noise as a latent variable and use two networks, one that predicts the noise and one that predicts the action to execute.<br />
<br />
==Learning on low-cost robot data==<br />
<br />
This paper uses a patch grasping framework in its proposed architecture. As mentioned before, datasets collected by cheap, inaccurate robots tend to have noisy labels. The noise in the labels can be caused by hardware execution error, inaccurate kinematics, camera calibration, proprioception, wear and tear, etc. The following subsections explain the parts of the architecture that disentangle the low-cost robot’s actual executions from its commanded executions.<br />
<br />
===Grasping Formulation===<br />
<br />
This architecture addresses planar grasping, meaning that all objects are grasped at the same height and with the gripper perpendicular to the ground (i.e., a fixed end-effector pitch). The final goal is to find <math>{(x, y, \theta)}</math> given an observation <math> {I}</math> of the object, where <math> {x}</math> and <math> {y}</math> are the translational degrees of freedom and <math> {\theta}</math> is the rotational degree of freedom (roll of the end-effector). For the purpose of comparison, they used a model which does not predict <math>{(x, y, \theta)}</math> directly from the image <math> {I}</math>, but instead samples several smaller patches <math> {I_{P}}</math> at different locations <math>{(x, y)}</math> and predicts the grasp angle <math> {\theta}</math> from each patch. To allow multi-modal predictions, the angle <math> {\theta}</math> is discretized into steps, denoted <math> {\theta_{D}}</math>. <br />
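<br />
The angle discretization and patch sampling described above can be sketched as follows. This is a minimal illustration, not the authors' code: the number of angle bins (18), the assumption that a two-fingered grasp is unchanged by a half turn (so angles are wrapped to [0°, 180°)), and the helper names <code>discretize_angle</code> and <code>crop_patch</code> are all assumptions made for the example.<br />

```python
NUM_ANGLE_BINS = 18  # assumption: the summary does not state the number of bins


def discretize_angle(theta_deg, num_bins=NUM_ANGLE_BINS):
    """Map a roll angle in degrees to a bin index in [0, num_bins)."""
    # Assumption: a two-fingered grasp is symmetric under a 180 degree turn,
    # so the angle is wrapped to [0, 180) before binning.
    wrapped = theta_deg % 180.0
    bin_width = 180.0 / num_bins
    return min(int(wrapped // bin_width), num_bins - 1)


def crop_patch(image, x, y, size):
    """Return the size-by-size patch of a 2D list centred at column x, row y.

    Assumes the patch lies fully inside the image (no boundary handling).
    """
    half = size // 2
    rows = image[y - half:y - half + size]
    return [row[x - half:x - half + size] for row in rows]


# Toy 5x5 "image" whose pixel value encodes its position.
image = [[r * 5 + c for c in range(5)] for r in range(5)]
patch = crop_patch(image, x=2, y=2, size=3)
angle_bin = discretize_angle(95.0)
```

Each training example then pairs a patch with one discrete angle index and a binary success label.<br />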
<br />
Hence, each datapoint consists of an image <math> {I}</math>, the executed grasp <math>{(x, y, \theta)}</math> and the grasp success/failure label <math> g </math>. The image <math> {I}</math> and the angle <math> {\theta}</math> are then converted to an image patch <math> {I_{P}}</math> and a discrete angle <math> {\theta_{D}}</math>. A binary cross entropy loss is used to minimize the classification error between the predicted and ground truth label <math> g </math>. A convolutional neural network with weights initialized from pre-training on ImageNet is used for this formulation.<br />
<br />
(Note: On Cross Entropy:<br />
<br />
If we think of a distribution as the tool we use to encode symbols, then entropy measures the number of bits we'll need if we use the correct tool. This is optimal, in that we can't encode the symbols using fewer bits on average.<br />
In contrast, cross entropy is the number of bits we'll need if we encode symbols from <math> {y}</math> using the wrong tool <math> {\hat y}</math>. This consists of encoding the <math> {i^{th}}</math> symbol using <math> {\log(\frac{1}{{\hat y_i}})}</math> bits instead of <math> {\log(\frac{1}{{ y_i}})}</math> bits. We of course still take the expected value with respect to the true distribution <math> {y}</math>, since it's the distribution that truly generates the symbols:<br />
<br />
\begin{align}<br />
H(y,\hat y) = \sum_i{y_i\log{\frac{1}{\hat y_i}}}<br />
\end{align}<br />
<br />
Cross entropy is never smaller than entropy; encoding symbols according to the wrong distribution <math> {\hat y}</math> makes us use more bits on average. The two are equal only in the trivial case where <math> {y}</math> and <math> {\hat y}</math> are equal.)<br />
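<br />
As a small numerical check of the note above, the sketch below computes entropy and cross entropy in bits for a hypothetical three-symbol distribution. The distributions <code>y</code> and <code>y_hat</code> are made up for illustration, and the code assumes <code>y_hat</code> is nonzero wherever <code>y</code> is nonzero.<br />

```python
import math


def entropy(y):
    """Expected code length in bits when symbols from y are coded with y itself."""
    return sum(p * math.log2(1.0 / p) for p in y if p > 0)


def cross_entropy(y, y_hat):
    """Expected code length in bits when symbols from y are coded with y_hat."""
    return sum(p * math.log2(1.0 / q) for p, q in zip(y, y_hat) if p > 0)


y = [0.5, 0.25, 0.25]      # true distribution (hypothetical)
y_hat = [0.25, 0.25, 0.5]  # mismatched coding distribution (hypothetical)

print(entropy(y))               # 1.5
print(cross_entropy(y, y_hat))  # 1.75
print(cross_entropy(y, y))      # 1.5, equal to the entropy
```

Coding with the mismatched distribution costs 0.25 extra bits per symbol on average in this example.<br />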
<br />
===Modeling noise as latent variable===<br />
<br />
To tackle the inaccurate position control and calibration of the cheap robot, they looked for structure in the noise, which depends on the robot and its design. They modeled this structured noise as a latent variable and decoupled it during training. The approach is shown in Figure 2: <br />
<br />
<br />
[[File:aa2.PNG|600px|thumb|center|]]<br />
<br />
The conventional approach models the grasp success probability for a given image patch at a given angle. In that setting, the environment variables that can introduce noise into the system are generally insignificant because expensive commercial robots are highly accurate. However, in the low-cost setting with multiple robots collecting data in parallel, this noise becomes an important consideration for learning. <br />
<br />
The grasp success probability for image patch <math> {I_{P}}</math> at angle <math> {\theta_{D}}</math> is represented as <math> {P(g|I_{P},\theta_{D}; \mathcal{R} )}</math> where <math> \mathcal{R}</math> represents environment variables that can add noise to the system.<br />
<br />
The conditional probability of grasping at a noisy image patch <math>I_P</math> for this model is computed by:<br />
<br />
<br />
\[ { P(g|I_{P},\theta_{D}, \mathcal{R} ) = \sum_{\widehat{I_P} \in \mathcal{P}} P(g|z=\widehat{I_P},\theta_{D},\mathcal{R}) \cdot P(z=\widehat{I_P} | \theta_{D},I_P,\mathcal{R})} \]<br />
<br />
<br />
Here, <math> {z}</math> models the latent variable of the actual patch executed, and <math>\widehat{I_P}</math> belongs to a set of possible neighboring patches <math> \mathcal{P}</math>. <math> P(z=\widehat{I_P}|\theta_D,I_P,\mathcal{R})</math> represents the noise which can be caused by the <math>\mathcal{R}</math> variables and is implemented as the Noise Modelling Network (NMN). <math> {P(g|z=\widehat{I_P},\theta_{D}, \mathcal{R} )}</math> represents the grasp prediction probability given the true patch and is implemented as the Grasp Prediction Network (GPN). The overall Robust-Grasp model is computed by marginalizing GPN and NMN.<br />
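<br />
The marginalization above reduces to a weighted sum over the candidate patches. The sketch below assumes the two networks have already produced their outputs as plain lists; the function name and the example numbers are hypothetical, not taken from the paper.<br />

```python
def marginal_grasp_probability(gpn_probs, nmn_probs):
    """Marginalize the grasp probability over the latent executed patch z.

    gpn_probs[k]: P(g = 1 | z = k-th candidate patch, theta_D, R)  (GPN output)
    nmn_probs[k]: P(z = k-th candidate patch | theta_D, I_P, R)    (NMN output)
    """
    if len(gpn_probs) != len(nmn_probs):
        raise ValueError("one GPN probability is needed per candidate patch")
    if abs(sum(nmn_probs) - 1.0) > 1e-6:
        raise ValueError("NMN output must be a probability distribution")
    return sum(p_g * p_z for p_g, p_z in zip(gpn_probs, nmn_probs))


# Hypothetical outputs for three candidate patches.
gpn_probs = [0.9, 0.2, 0.4]
nmn_probs = [0.5, 0.3, 0.2]
p_success = marginal_grasp_probability(gpn_probs, nmn_probs)  # 0.45 + 0.06 + 0.08
```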
<br />
===Learning the latent noise model===<br />
<br />
This section concerns what the inputs to the NMN should be and how the network should be trained. The authors assume that <math> {z}</math> is conditionally independent of the local patch-specific variables <math> {(I_{P}, \theta_{D})}</math> given the global information <math>\mathcal{R}</math>, i.e. <math> P(z=\widehat{I_P}|\theta_D,I_P,\mathcal{R}) \equiv P(z=\widehat{I_P}|\mathcal{R})</math>, so the latent variable <math> {z}</math> is estimated from <math>\mathcal{R}</math> alone. Apart from the patch <math> I_{P} </math> and grasp information (x, y, θ), they use information such as the image of the entire scene, the ID of the robot and the raw pixel location of the grasp. They argue that the image of the full scene could contain essential information about the system, such as the location of the camera relative to the ground, which may change over the lifetime of the robot. The identification number of the robot might give cues about errors specific to particular hardware. Finally, the raw pixel location of the executed grasp carries calibration-specific information: the calibration error is coupled with pixel location because the calibration parameters are computed with a least squares fit.<br />
<br />
Explicit labels are not available to train the NMN. Although the latent variable <math>z</math> could be estimated using a technique such as Expectation-Maximization, they used direct optimization to learn both NMN and GPN from the noisy labels. The inputs of the NMN are the entire image of the scene and the environment information, namely the robot ID and the raw-pixel grasp location. The output of the NMN is a probability distribution over the actual patches where the grasps are executed. Finally, a binary cross entropy loss is applied between the marginalized output of the two networks and the true grasp label <math>g</math>.<br />
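<br />
To see why the grasp label alone can provide a training signal for the NMN, the sketch below applies a binary cross entropy loss to the marginalized probability for two hypothetical noise distributions. All numbers and names are illustrative assumptions; the natural logarithm is used, so the loss is in nats.<br />

```python
import math


def binary_cross_entropy(p_success, g, eps=1e-7):
    """Binary cross entropy between a predicted success probability and label g."""
    p = min(max(p_success, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(g * math.log(p) + (1 - g) * math.log(1.0 - p))


def marginalized_loss(gpn_probs, nmn_probs, g):
    """Loss on the marginal sum_k P(g | z = k) * P(z = k), as described above."""
    p_success = sum(p_g * p_z for p_g, p_z in zip(gpn_probs, nmn_probs))
    return binary_cross_entropy(p_success, g)


gpn_probs = [0.9, 0.2, 0.1]     # hypothetical GPN outputs for three patches
nmn_on_good = [0.8, 0.1, 0.1]   # noise model favouring the graspable patch
nmn_on_bad = [0.1, 0.1, 0.8]    # noise model favouring a poor patch

loss_good = marginalized_loss(gpn_probs, nmn_on_good, g=1)
loss_bad = marginalized_loss(gpn_probs, nmn_on_bad, g=1)
```

For a successful grasp, the noise distribution that places its mass on the patch the GPN rates highly gives the lower loss, so minimizing the loss can shape the NMN output without patch-level labels.<br />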
<br />
===Training details===<br />
<br />
They implemented their model in PyTorch and fine-tuned a pretrained ResNet-18 model. For the NMN, they concatenated the 512-dimensional ResNet feature with a one-hot vector of the robot ID and the raw pixel location of the grasp. This passes through a series of three fully connected layers and a SoftMax layer, which converts the predictions of the correct patch into a probability distribution. The inputs of the GPN are the original noisy patch plus 8 other patches equidistant from the original one. The angle predictions for all the patches are passed through a sigmoid activation at the end to obtain the grasp success probability for a specific patch at a specific angle.<br />
The training of the network takes place in two stages. It starts with training only the GPN for 5 epochs over the data. Then, the NMN and the marginalization operator are added to the model, and NMN and GPN are trained simultaneously in an end-to-end fashion for a further 25 epochs.<br />
This two-stage approach is crucial for effective training of their networks, without which NMN trivially selects the same patch irrespective of the input. The optimizer used for training is Adam [16].<br />
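<br />
The input assembly and two-stage schedule described above can be sketched in plain Python. This is only an outline under stated assumptions: the number of robots, the helper names, and the unnormalized pixel coordinates are hypothetical, and the three fully connected layers are omitted.<br />

```python
import math

FEATURE_DIM = 512     # ResNet-18 feature size stated in the text
NUM_PATCHES = 9       # original noisy patch plus 8 neighbouring patches
GPN_ONLY_EPOCHS = 5   # stage 1
JOINT_EPOCHS = 25     # stage 2


def nmn_input(resnet_feature, robot_id, num_robots, pixel_xy):
    """Concatenate image feature, one-hot robot ID and raw pixel location."""
    one_hot = [1.0 if i == robot_id else 0.0 for i in range(num_robots)]
    return list(resnet_feature) + one_hot + [float(pixel_xy[0]), float(pixel_xy[1])]


def softmax(logits):
    """Convert NMN logits into a probability distribution over patches."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def trainable_networks(epoch):
    """Return the networks updated in a zero-indexed epoch of the schedule."""
    return ("GPN",) if epoch < GPN_ONLY_EPOCHS else ("GPN", "NMN")


# Hypothetical example: robot 2 of 4, grasp at pixel (120, 80).
vector = nmn_input([0.0] * FEATURE_DIM, robot_id=2, num_robots=4, pixel_xy=(120, 80))
patch_distribution = softmax([0.0] * NUM_PATCHES)
```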
<br />
==Results==<br />
<br />
In the results part of the paper, they show that collecting data in homes is essential for generalizing to unseen environments. They also show that modelling the noise in their Low-Cost Arm (LCA) can improve grasping performance.<br />
They collected data in parallel using multiple robots in 6 different homes, as shown in Figure 3. Because the visual input in homes is unstructured, they used an object detector (tiny-YOLO, chosen because of the LCA's limited memory and computational capabilities). Once an object location was detected, the class information was discarded and a grasp was attempted. The grasp location in 3D was computed using PointCloud data. They scattered different objects in homes within a 2m area to prevent the robot from colliding with obstacles, and let the robot move randomly and grasp objects. In total, they collected a dataset with 28K grasp results.<br />
<br />
[[File:aa3.PNG|600px|thumb|center|]]<br />
<br />
To evaluate their approach in a more quantitative way, they used three test settings:<br />
<br />
- The first one is binary classification on held-out data. The test set is collected by performing random grasps on objects. They measure the performance of binary classification by predicting the success or failure of a grasp, given a location and an angle. Using binary classification allows many models to be tested without running them on real robots. They collected two held-out datasets using the LCA, in the lab and in homes, and used the dataset for the Baxter robot.<br />
<br />
- The second one is Real Low-Cost Arm (Real-LCA). Here, they evaluate their model by running it in three unseen homes. They put 20 new objects in these three homes in different orientations. Since the objects and the environments are completely new, this test measures the generalization of the model.<br />
<br />
- The third one is Real Sawyer (Real-Sawyer). They evaluate the performance of their model by running the model on the Sawyer robot which is more accurate than the LCA. They tested their model in the lab environment to show that training models with the datasets collected from homes can improve the performance of models even in lab environments.<br />
<br />
They used baselines for both their data, which is collected in homes, and their model, which is Robust-Grasp. Two datasets serve as data baselines: the dataset collected with a Baxter robot in the lab (Lab-Baxter) and the dataset collected by their LCA in the lab (Lab-LCA).<br />
They compared their Robust-Grasp model with the noise independent patch grasping model (Patch-Grasp) [4]. They also compared their data and model with DexNet-3.0 (DexNet) for a strong real-world grasping baseline.<br />
<br />
===Experiment 1: Performance on held-out data===<br />
<br />
Table 1 shows that the models trained on lab data cannot generalize to the Home-LCA environment (i.e. they overfit to their respective environments and attain a lower binary classification score). However, the model trained on Home-LCA has a good performance on both lab data and home environment.<br />
<br />
[[File:aa4.PNG|600px|thumb|center|]]<br />
<br />
===Experiment 2: Performance on Real LCA Robot===<br />
<br />
In Table 2, the performance of the model trained on Home-LCA is compared against a pre-trained DexNet and the model trained on Lab-Baxter. Training on the Home-LCA dataset performs 43.7% better than training on the Lab-Baxter dataset and 33% better than DexNet. The low performance of DexNet can be explained by noise in the depth images caused by natural light. DexNet, which requires high-quality depth sensing, cannot perform well in these scenarios. Because the LCA is trained with a cheap commodity RGBD camera, noise in the depth images is not a matter of concern, as the model has no expectation of high-quality sensing.<br />
<br />
[[File:aa5.PNG|600px|thumb|center|]]<br />
<br />
===Experiment 3: Performance on Real Sawyer===<br />
<br />
To compare the performance of the Robust-Grasp model against the Patch-Grasp model without collecting noise-free data, they used Lab-Baxter for benchmarking, since it was collected with a more accurate and better calibrated robot. The Sawyer robot is used for testing to ensure that the testing robot is different from both training robots. As shown in Table 3, the Robust-Grasp model trained on Home-LCA outperforms the Patch-Grasp model and achieves 77.5% accuracy. This accuracy is similar to that reported in several recent papers, even though this model was trained and tested in different environments. The Robust-Grasp model also outperforms Patch-Grasp by about 4% on binary classification. Furthermore, the visualizations of predicted noise corrections in Figure 4 show that the corrections depend on both the pixel location of the noisy grasp and the robot.<br />
<br />
[[File:aa6.PNG|600px|thumb|center|]]<br />
<br />
[[File:aa7.PNG|600px|thumb|center|]]<br />
<br />
==Related work==<br />
<br />
Over the last few years, interest in scaling up robot learning with large-scale datasets has increased, and many papers have been published in this area. A hand-annotated grasping dataset, a self-supervised grasping dataset, and grasping using reinforcement learning are some examples of using large-scale datasets for grasping. The work mentioned above used high-cost hardware and data labeling mechanisms. Many papers also worked on other robotic tasks such as material recognition, pushing objects and manipulating a rope. However, none of these papers worked on real data in real environments like homes; they all used lab data.<br />
<br />
Furthermore, since grasping is one of the basic problems in robotics, there have been many efforts to improve it. Classical approaches focused on physics-based issues of grasping and required 3D models of the objects. Recent works, however, focus on data-driven approaches which learn to grasp objects from visual observations. Simulation and real-world robots are both required for large-scale data collection. A versatile grasping model was proposed that achieves 90% performance on a bin-picking task. The point here is that such models usually require high-quality depth as input. High-quality depth sensing is costly to implement in hardware and is therefore a barrier to the practical use of robots in real environments.<br />
<br />
Most labs use industrial robots or standard collaborative hardware for their experiments. Therefore, little research has used low-cost robots. One example is learning to stack multiple blocks using a cheap, inaccurate robot. Although mobile robots like iRobot’s Roomba have been in the home consumer electronics market for a decade, it is not clear whether they use learning approaches alongside mapping and planning.<br />
<br />
Learning from noisy inputs is another challenge, specifically in computer vision. A controversial question often raised in this area is whether learning from noise can improve performance. Some works show it can hurt performance; however, other works find it valuable when the noise is independent or statistically dependent on the environment. In this paper, they used a model that can exploit the noise and learn a better grasping model.<br />
<br />
==Conclusion==<br />
<br />
All in all, the paper presents an approach for collecting large-scale robot data in real home environments. They implemented their approach using a mobile manipulator which is a lot cheaper than existing industrial robots. They collected a dataset of 28K grasps in six different homes. To solve the problem of noisy labels caused by their inaccurate robots, they presented a framework to factor out the noise in the data. They tested their model by physically grasping 20 new objects in three new homes and in the lab. The model trained with the home dataset showed a 43.7% improvement over the models trained with lab data. Their framework performed 33% better than a baseline DexNet model, which struggled with the typically poor depth sensing in common household environments with a lot of natural light. Their results also showed that their model can improve grasping performance even in lab environments. They also demonstrated that their architecture for modeling the noise improved the performance by about 10%.<br />
<br />
==Critiques==<br />
<br />
This paper does not contain a significant algorithmic contribution; it combines a large number of data engineering techniques for the robot learning problem. The authors claim that they have obtained 43.7% more accuracy than baseline models, but this does not seem to be a fair comparison, as the data collection for the other methods happened in simulated settings in the lab, whereas the authors use the home dataset. The authors should also have discussed the safety issues of training robots in real environments as opposed to controlled environments like labs. The authors encourage other researchers to look outside the labs, but do not discuss the critical safety issues in this approach.<br />
<br />
Another strange finding is that the paper mentions that they "follow a model architecture similar to [Pinto and Gupta [4]]," however, the proposed model is, in fact, a fine-tuned ResNet-18 architecture. Pinto and Gupta implement a version similar to AlexNet, as shown below in Figure 5.<br />
<br />
[[File:Figure_5_PandG.JPG | 450px|thumb|center|Figure 5: AlexNet architecture implemented in Pinto and Gupta [4].]]<br />
<br />
<br />
The paper argues that the dataset collected by the LCA is noisy, since the robot is cheap and inaccurate. It further asserts that the noise in the dataset can be handled by modeling it as a latent variable, and that doing so improves grasping performance. Although learning from noisy data and achieving good performance is valuable, it would be better to test the noise modeling network on other robots as well. Since the noise modelling network takes robot information as an input, testing it with different inaccurate robots would show whether it generalizes and still performs well.<br />
<br />
They did not report other aspects of their comparison; for example, they could have reported their training time compared to other models, or the size of the other datasets.<br />
<br />
==References==<br />
<br />
#Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. "Domain randomization for transferring deep neural networks from simulation to the real world." 2017. URL https://arxiv.org/abs/1703.06907.<br />
#Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. "Sim-to-real transfer of robotic control with dynamics randomization." arXiv preprint arXiv:1710.06537, 2017.<br />
#Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. "Asymmetric actor-critic for image-based robot learning." Robotics Science and Systems, 2018.<br />
#Lerrel Pinto and Abhinav Gupta. "Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours." CoRR, abs/1509.06825, 2015. URL http://arxiv.org/abs/1509.06825.<br />
#Adithyavairavan Murali, Lerrel Pinto, Dhiraj Gandhi, and Abhinav Gupta. "CASSL: Curriculum accelerated self-supervised learning." International Conference on Robotics and Automation, 2018.<br />
#Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. "End-to-end training of deep visuomotor policies." The Journal of Machine Learning Research, 17(1):1334–1373, 2016.<br />
#Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. "Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection." CoRR, abs/1603.02199, 2016. URL http://arxiv.org/abs/1603.02199.<br />
#Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. "Learning to poke by poking: Experiential learning of intuitive physics." 2016. URL http://arxiv.org/abs/1606.07419.<br />
#Chelsea Finn, Ian Goodfellow, and Sergey Levine. "Unsupervised learning for physical interaction through video prediction." In Advances in neural information processing systems, 2016.<br />
#Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey Levine. "Combining self-supervised learning and imitation for vision-based rope manipulation." International Conference on Robotics and Automation, 2017.<br />
#Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. "Revisiting unreasonable effectiveness of data in deep learning era." ICCV, 2017.<br />
#Marc Peter Deisenroth, Carl Edward Rasmussen, and Dieter Fox. Learning to control a low-cost manipulator using data-efficient reinforcement learning. RSS, 2011.<br />
#David F Nettleton, Albert Orriols-Puig, and Albert Fornells. A study of the effect of different types of noise on the precision of supervised learning techniques. Artificial intelligence review, 33(4):275–306, 2010.<br />
#Benoît Frénay and Michel Verleysen. Classification in the presence of label noise: a survey. IEEE transactions on neural networks and learning systems, 25(5):845–869, 2014.<br />
#Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2691–2699, 2015.<br />
#Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=conditional_neural_process&diff=42163conditional neural process2018-11-30T23:53:36Z<p>Z43ma: Format update.</p>
<hr />
<div>== Motivation ==<br />
<br />
Deep neural networks are good at function approximation, yet they are typically trained from scratch for each new function. Bayesian methods such as Gaussian Processes (GPs), on the other hand, exploit prior knowledge to quickly infer the shape of a new function at test time. Yet GPs are computationally expensive, and it can be hard to design appropriate priors. Hence the authors propose a family of neural models, called Conditional Neural Processes (CNPs), that combine the benefits of both. <br />
<br />
== Introduction ==<br />
<br />
To train a model effectively, deep neural networks typically require large datasets. To mitigate this data efficiency problem, learning in two phases is one approach: the first phase learns the statistics of a generic domain without committing to a specific learning task; the second phase learns a function for a specific task but does so using only a small number of data points by exploiting the domain-wide statistics already learned. Taking a probabilistic stance and specifying a distribution over functions (stochastic processes) is another approach -- Gaussian Processes being a commonly used example of this. Such Bayesian methods can be computationally expensive. <br />
<br />
The authors of the paper propose a family of models that represent solutions to the supervised problem, and an end-to-end training approach to learning them that combines neural networks with features reminiscent of Gaussian Processes. They call this family of models Conditional Neural Processes (CNPs). CNPs can be trained on very few data points to make accurate predictions, while they also have the capacity to scale to complex functions and large datasets.<br />
<br />
== Model ==<br />
Consider a data set <math display="inline"> \{x_i, y_i\} </math> with evaluations <math display="inline">y_i = f(x_i) </math> for some unknown function <math display="inline">f</math>. Assume <math display="inline">g</math> is an approximating function of <math display="inline">f</math>. The aim is to minimize the loss between <math display="inline">f</math> and <math display="inline">g</math> on the entire space <math display="inline">X</math>. In practice, the routine is evaluated on a finite set of observations.<br />
<br />
Let the training set be <math display="inline"> O = \{x_i, y_i\}_{i = 0} ^{n-1}</math>, and let the test set <math display="inline"> T = \{x_i, y_i\}_{i = n} ^ {n + m - 1} \subset X</math> be a set of unlabelled points.<br />
<br />
Let P be a probability distribution over functions <math display="inline"> F : X \to Y</math>, formally known as a stochastic process. P then defines a joint distribution over the random variables <math display="inline"> \{f(x_i)\}_{i = 0} ^{n + m - 1}</math>, and therefore also a conditional distribution <math display="inline"> P(f(T)|O, T)</math>. The task is to predict the output values <math display="inline">f(x_i)</math> for <math display="inline"> x_i \in T</math>, given <math display="inline"> O</math>. <br />
<br />
The authors give a good example: consider a random 1-dimensional function <math>f \sim P</math> defined on the real line (i.e., <math>X := \mathbb{R}</math>, <math>Y := \mathbb{R}</math>). <math>O</math> would constitute <math>n</math> observations of <math>f</math>’s value <math>y_i</math> at different locations <math>x_i</math> on the real line. Given these observations, we are interested in predicting <math>f</math>’s value at new locations on the real line. <br />
<br />
A common assumption made on P is that any finite set of function evaluations of <math display="inline"> f </math> is jointly Gaussian distributed. This class of random functions is called Gaussian Processes (GPs). The stochastic process framework allows a model to be data efficient; however, it is hard to design appropriate priors, and stochastic processes are computationally expensive, scaling poorly with <math>n</math> and <math>m</math>. GPs are one example, with a running time of <math>O((n+m)^3)</math>.<br />
<br />
[[File:001.jpg|300px|center]]<br />
<br />
== Conditional Neural Process ==<br />
<br />
Conditional Neural Process models directly parametrize conditional stochastic processes without imposing consistency with respect to some prior process. CNPs parametrize distributions over <math display="inline">f(T)</math> given a distributed representation of <math display="inline">O</math> of fixed dimensionality. Thus, the mathematical guarantees associated with stochastic processes are traded off for functional flexibility and scalability.<br />
<br />
A CNP is a conditional stochastic process <math display="inline">Q_\theta</math> that defines distributions over <math display="inline">f(x_i)</math> for <math display="inline">x_i \in T</math>, given a set of observations <math display="inline">O</math>. As for stochastic processes, the authors assume that <math display="inline">Q_{\theta}</math> is invariant to permutations of <math display="inline">O</math> and <math display="inline">T</math>, i.e. <math display="inline">Q_\theta(f(T) | O, T)= Q_\theta(f(T') | O, T')=Q_\theta(f(T) | O', T) </math> when <math> O', T'</math> are permutations of <math display="inline">O</math> and <math display="inline">T </math>. In this work, permutation invariance with respect to <math display="inline">T</math> is generally enforced by assuming a factored structure, which is the easiest way to ensure a valid stochastic process. That is, <math display="inline">Q_\theta(f(T) | O, T) = \prod _{x \in T} Q_\theta(f(x) | O, x)</math>. Moreover, this framework can be extended to non-factored distributions.<br />
<br />
In detail, the following architecture is used.<br />
<br />
<math display="inline">r_i = h_\theta(x_i, y_i)</math> &forall; <math display="inline">(x_i, y_i) \in O</math>, where <math display="inline">h_\theta : X \times Y \to \mathbb{R} ^ d</math><br />
<br />
<math display="inline">r = r_1 * r_2 * ... * r_n</math>, where <math display="inline">*</math> is a commutative operation that takes elements in <math display="inline">\mathbb{R}^d</math> and maps them into a single element of <math display="inline">\mathbb{R} ^ d</math><br />
<br />
<math display="inline">\Phi_i = g_\theta(x_i, r)</math> &forall; <math display="inline">x_i \in T</math>, where <math display="inline">g_\theta : X \times \mathbb{R} ^ d \to \mathbb{R} ^ e</math> and <math display="inline">\Phi_i</math> are parameters for <math display="inline">Q_\theta</math><br />
<br />
Note that this architecture ensures permutation invariance and <math display="inline">O(n + m)</math> scaling for conditional prediction. Also, since <math display="inline">r = r_1 * r_2 * ... * r_n</math> can be computed in <math display="inline">O(n)</math>, this architecture supports streaming observations with minimal overhead.<br />
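<br />
The encoder, aggregator and decoder structure above can be illustrated with a toy forward pass. This is a sketch under stated assumptions, not the authors' implementation: the encoder and decoder are single hand-written layers with hypothetical weights, the mean is used as the commutative operation, the decoder outputs a Gaussian mean and standard deviation, and the lower bound of 0.1 on the standard deviation is an arbitrary choice.<br />

```python
import math


def encode(x, y, enc_weights):
    """Toy encoder h_theta: one tanh unit per representation dimension."""
    return [math.tanh(wx * x + wy * y + b) for wx, wy, b in enc_weights]


def aggregate(representations):
    """Commutative aggregation: element-wise mean over all observations."""
    n = len(representations)
    return [sum(column) / n for column in zip(*representations)]


def decode(x, r, dec_weights):
    """Toy decoder g_theta: returns (mean, standard deviation) for target x."""
    (mu_wx, mu_wr, mu_b), (s_wx, s_wr, s_b) = dec_weights
    mu = mu_wx * x + sum(w * v for w, v in zip(mu_wr, r)) + mu_b
    raw = s_wx * x + sum(w * v for w, v in zip(s_wr, r)) + s_b
    sigma = 0.1 + 0.9 * math.log1p(math.exp(raw))  # softplus keeps sigma positive
    return mu, sigma


def cnp_predict(observations, targets, enc_weights, dec_weights):
    """One pass over n observations, then one pass over m targets."""
    r = aggregate([encode(x, y, enc_weights) for x, y in observations])
    return [decode(x, r, dec_weights) for x in targets]


# Hypothetical weights for a representation of dimension d = 2.
enc_weights = [(0.5, -0.3, 0.1), (-0.2, 0.8, 0.0)]
dec_weights = ((0.4, [0.7, -0.5], 0.0), (0.1, [0.2, 0.3], -1.0))

observations = [(0.0, 1.0), (1.0, 0.5), (2.0, -0.3)]
targets = [0.5, 1.5]
predictions = cnp_predict(observations, targets, enc_weights, dec_weights)
shuffled = cnp_predict(observations[::-1], targets, enc_weights, dec_weights)
```

Reordering the observations leaves the predictions unchanged up to floating-point rounding, because the mean does not depend on the order of its inputs.<br />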
<br />
<math display="inline">Q_\theta</math> is trained by asking it to predict <math display="inline">O</math> conditioned on a randomly chosen subset of <math display="inline">O</math>. This gives the model a signal of the uncertainty over the space X inherent in the distribution P given a set of observations. The authors let <math display="inline"> f \sim P</math>, <math display="inline"> O = \{(x_i, y_i)\}_{i = 0} ^{n-1}</math>, and <math display="inline">N \sim \text{uniform}[0, \ldots, n-1]</math>. The subset <math display="inline"> O_N = \{(x_i, y_i)\}_{i = 0} ^{N}</math>, consisting of the first elements of <math display="inline">O</math>, is used as the conditioning set. The negative conditional log probability is given by<br />
\[\mathcal{L}(\theta)=-\mathbb{E}_{f \sim P}[\mathbb{E}_{N}[\log Q_\theta(\{y_i\}_{i = 0} ^{n-1}|O_{N}, \{x_i\}_{i = 0} ^{n-1})]]\]<br />
Thus, the targets on which <math display="inline">Q_\theta</math> is scored include both the observed and unobserved values. In practice, Monte Carlo estimates of the gradient of this loss are taken by sampling <math display="inline">f</math> and <math display="inline">N</math>. <br />
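<br />
A single Monte Carlo sample of this loss can be sketched as follows, assuming a factored Gaussian form for <math display="inline">Q_\theta</math>. The <code>predict</code> callable stands in for the model, and <code>mean_predictor</code> is a hypothetical placeholder that ignores the target locations; neither is part of the paper.<br />

```python
import math
import random


def gaussian_log_prob(y, mu, sigma):
    """Log density of a univariate Gaussian N(mu, sigma^2) at y."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2.0 * sigma ** 2)


def cnp_loss_sample(points, predict, rng):
    """One Monte Carlo sample of the loss for points drawn from one function f.

    points:  list of (x, y) pairs, playing the role of O
    predict: callable(context, xs) -> list of (mu, sigma), standing in for Q_theta
    """
    n = len(points)
    num = rng.randint(0, n - 1)      # N ~ uniform over {0, ..., n-1}
    context = points[:num + 1]       # indices 0..N, matching the definition of O_N
    xs = [x for x, _ in points]
    predictions = predict(context, xs)
    # All n points are scored, both the observed context and the unobserved rest.
    return -sum(gaussian_log_prob(y, mu, sigma)
                for (_, y), (mu, sigma) in zip(points, predictions))


def mean_predictor(context, xs):
    """Placeholder model: predicts the context mean with unit standard deviation."""
    mean_y = sum(y for _, y in context) / len(context)
    return [(mean_y, 1.0) for _ in xs]


rng = random.Random(0)
points = [(0.0, 0.0), (1.0, 0.0)]
loss = cnp_loss_sample(points, mean_predictor, rng)
```

Averaging such samples over many functions and many draws of <math display="inline">N</math> gives the estimate of the loss whose gradient is used for training.<br />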
<br />
This approach shifts the burden of imposing prior knowledge from an analytic prior to empirical data. This has the advantage of liberating a practitioner from having to specify an analytic form for the prior, which is ultimately intended to summarize their empirical experience. Still, we emphasize that the <math display="inline">Q_\theta</math> are not necessarily a consistent set of conditionals for all observation sets, and the training routine does not guarantee that.<br />
<br />
In summary,<br />
<br />
1. A CNP is a conditional distribution over functions trained to model the empirical conditional distributions of functions <math display="inline">f \sim P</math>.<br />
<br />
2. A CNP is permutation invariant in <math display="inline">O</math> and <math display="inline">T</math>.<br />
<br />
3. A CNP is scalable, achieving a running time complexity of <math display="inline">O(n + m)</math> for making <math display="inline">m</math> predictions with <math display="inline">n</math> observations.<br />
<br />
== Related Work ==<br />
<br />
===Gaussian Process Framework===<br />
<br />
A Gaussian Process (GP) is a non-parametric method used extensively for regression and classification problems in the machine learning community. A GP is defined as a collection of random variables, any finite number of which have a joint Gaussian distribution.<br />
A standard approach is to model data as <math>y = m(X, \phi) + \epsilon</math><br />
where <math>m</math> is the mean function with parameter vector <math>\phi</math>, and <math>\epsilon</math> represents independent and identically distributed (i.i.d.) Gaussian noise: <math>\epsilon \sim N(0,\sigma^2)</math><br />
<br />
For more info on Gaussian Process Framework:<br />
[https://arxiv.org/abs/1506.07304 A Gaussian process framework for modeling instrumental systematics: application to transmission spectroscopy]<br />
<br />
Several papers attempt to address various issues with GPs. These include:<br />
* Using sparse GPs to aid in scaling (Snelson & Ghahramani, 2006)<br />
* Using Deep GPs to achieve more expressiveness (Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017)<br />
* Using neural networks to learn more expressive kernels (Wilson et al., 2016)<br />
<br />
A Python resource for implementing the Gaussian Process framework: [https://github.com/SheffieldML/GPy GPy, a Gaussian Process framework in Python]<br />
<br />
The goal of this paper is to combine ideas from standard neural networks with Gaussian processes in order to overcome the drawbacks of both. Bayesian techniques work better with less data, but complex Bayesian networks become intractable even on moderately sized datasets. NNs, on the other hand, cannot make use of prior knowledge and often have to be retrained from scratch. Without sufficient data, they also perform poorly. Combining both frameworks yields Conditional Neural Processes, which learn the kernels of the Gaussian Process through neural networks and use these learned kernels in a framework similar to GPs for prediction.<br />
<br />
===Meta Learning===<br />
<br />
Meta-Learning attempts to allow neural networks to learn more generalizable functions, as opposed to only approximating one function. This can be done by learning deep generative models which can do few-shot estimations of data. This can be implemented with attention mechanisms (Reed et al., 2017) or additional memory units in a VAE model (Bornschein et al., 2017). Another successful latent variable approach is to explicitly condition on some context during inference (J. Rezende et al., 2016). Given the generative nature of these models they are usually applied to image generation tasks, but models that include a conditioning class-variable can be used for classification as well. Recently meta-learning has also been applied to a wide range of tasks like RL (Wang et al., 2016; Finn et al., 2017) or program induction (Devlin et al., 2017).<br />
<br />
Classification is another common task in meta-learning. Few-shot classification algorithms usually rely on some distance metric in feature space to compare target images and the observations (Koch et al., 2015; Santoro et al., 2016). Matching networks (Vinyals et al., 2016; Bartunov & Vetrov, 2016) are closely related to CNPs. In their case features of samples are compared with target features using an attention kernel. At a higher level one can interpret this model as a CNP where the aggregator is just the concatenation over all input samples and the decoder <math>g</math> contains an explicitly defined distance kernel. In this sense matching networks are closer to GPs than to CNPs, since they require the specification of a distance kernel that CNPs learn from the data instead. In addition, as matching networks carry out all-to-all comparisons they scale with <math> O(n \times m) </math>, although they can be modified to have the same complexity of <math>O(n + m)</math> as CNPs (Snell et al., 2017).<br />
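<br />
The all-to-all comparison can be sketched as follows. This is a simplified illustration of an attention-kernel classifier, assuming cosine similarity followed by a softmax; the feature vectors are made up and no learned embedding is involved.<br />
<br />
```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def attention_predict(support_features, support_labels, target_feature, num_classes):
    # Softmax over the similarities between the target and every support sample.
    sims = [cosine_similarity(f, target_feature) for f in support_features]
    max_sim = max(sims)
    weights = [math.exp(s - max_sim) for s in sims]
    total = sum(weights)
    # Class probabilities: attention-weighted sum of the one-hot support labels.
    probs = [0.0] * num_classes
    for w, label in zip(weights, support_labels):
        probs[label] += w / total
    return probs


support_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
support_labels = [0, 0, 1, 1]
targets = [[1.0, 0.2], [0.2, 1.0]]

# Each of the m targets is compared with each of the n support samples: n * m similarities.
predictions = [attention_predict(support_features, support_labels, t, 2) for t in targets]
predicted_classes = [p.index(max(p)) for p in predictions]
```
<br />
The distance kernel here is fixed by hand, which is the contrast drawn with CNPs, where the comparison is learned from data through the decoder.<br />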
<br />
Another area within meta-learning is neural architecture search, currently one of its most popular trends. It requires three things to be defined: the search space, the search strategy, and the performance evaluation strategy. The idea is to define some search space and let algorithms decide which architecture and hyperparameters would be best for a particular task. Since evaluating a neural network is expensive (the network must be trained first), a well-designed performance evaluation strategy is needed to reduce the computational cost.<br />
<br />
A model that is conceptually very similar to CNPs (and in particular the latent variable version) is the “neural statistician” (Edwards & Storkey, 2016) and the related variational homoencoder (Hewitt et al., 2018). As with the other generative models, the neural statistician learns to estimate the density of the observed data but does not allow for targeted sampling at what we have been referring to as input positions <math>x_i</math>. Instead, one can only generate i.i.d. samples from the estimated density. Finally, the latest variant of Conditional Neural Process can also be seen as an approximated amortized version of Bayesian deep learning (Gal & Ghahramani, 2016; Blundell et al., 2015; Louizos et al., 2017; Louizos & Welling, 2017). For example, Gal & Ghahramani (2016) develop a theoretical framework casting dropout training in deep neural networks as approximate Bayesian inference in deep Gaussian processes. Their theory extracts information from existing models and provides tools to model uncertainty.<br />
<br />
== Experimental Result I: Function Regression ==<br />
<br />
Classical 1D regression, a common baseline for GPs, is the first example.<br />
The authors generated two different datasets consisting of functions generated from a GP with an exponential kernel. In the first dataset they used a kernel with fixed parameters; in the second dataset, the function switched at some random point on the real line between two functions, each sampled with different kernel parameters. At every training step, they sampled a curve from the GP and selected a subset of n points as observations and a subset of t points as target points. The observed points are encoded using a three-layer MLP encoder h with a 128-dimensional output representation. The representations are aggregated into a single representation <math display="inline">r = \frac{1}{n} \sum_i r_i</math>, which is concatenated to <math display="inline">x_t</math> and passed to a decoder g consisting of a five-layer MLP. The decoder outputs a Gaussian mean and variance for the target outputs. The model is trained to maximize the log-likelihood of the target points using the Adam optimizer.<br />
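<br />
The training signal described above can be made concrete with a short sketch of the Gaussian log-likelihood for factored outputs. The target values and the three sets of predictions are invented for illustration; in the paper the means and variances come from the decoder.<br />
<br />
```python
import math


def gaussian_log_likelihood(y, mean, variance):
    # Log density of y under a univariate Gaussian N(mean, variance).
    return -0.5 * (math.log(2.0 * math.pi * variance) + (y - mean) ** 2 / variance)


def target_log_likelihood(targets, predictions):
    # Factored outputs: the joint log-likelihood is a sum over target points.
    return sum(gaussian_log_likelihood(y, mu, var)
               for y, (mu, var) in zip(targets, predictions))


target_values = [0.1, -0.3, 0.8]

# Hypothetical decoder outputs as (mean, variance) pairs.
confident_good = [(0.1, 0.01), (-0.3, 0.01), (0.8, 0.01)]
confident_bad = [(0.9, 0.01), (0.5, 0.01), (0.0, 0.01)]
uncertain_bad = [(0.9, 1.0), (0.5, 1.0), (0.0, 1.0)]

ll_good = target_log_likelihood(target_values, confident_good)
ll_confident_bad = target_log_likelihood(target_values, confident_bad)
ll_uncertain_bad = target_log_likelihood(target_values, uncertain_bad)
```
<br />
A wrong prediction made with small variance is penalized far more heavily than the same prediction made with large variance, which is why maximizing this objective pushes the model to report calibrated uncertainty.<br />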
<br />
Two examples of the regression results obtained for each<br />
of the datasets are shown in the following figure.<br />
<br />
[[File:007.jpg|300px|center]]<br />
<br />
They compared the model to the predictions generated by a GP with the correct hyperparameters, which constitutes an upper bound on the CNP's performance. Although the prediction generated by the GP is smoother than the CNP's prediction both for the mean and variance, the model is able to learn to regress from a few context points for both the fixed kernels and switching kernels. As the number of context points grows, the accuracy of the model improves and its approximated uncertainty decreases. Crucially, the model learns to estimate its own uncertainty given the observations very accurately.<br />
Furthermore, the model achieves similarly good performance on the switching kernel task. This type of regression task is not trivial for GPs, whereas for a CNP only the dataset used for training has to change.<br />
<br />
== Experimental Result II: Image Completion for Digits ==<br />
<br />
[[File:002.jpg|600px|center]]<br />
<br />
They also tested CNPs on the MNIST dataset, using the test set to evaluate performance. As shown in the above figure, the model learns to make good predictions of the underlying digit even for a small number of context points. Crucially, when conditioned only on one non-informative context point, the model’s prediction corresponds to the average over all MNIST digits. As the number of context points increases, the predictions become more similar to the underlying ground truth. This demonstrates the model’s capacity to extract dataset-specific prior knowledge. It is worth mentioning that even with a complete set of observations, the model does not achieve pixel-perfect reconstruction, as there is a bottleneck at the representation level.<br />
Since this implementation of CNP returns factored outputs, the best prediction it can produce given limited context information is to average over all possible predictions that agree with the context. An alternative is to add latent variables to the model such that they can be sampled conditioned on the context to produce predictions with high probability in the data distribution.<br />
<br />
<br />
An important aspect of the model is its ability to estimate the uncertainty of the prediction. As shown in the bottom row of the above figure, as more observations are added, the variance shifts from being almost uniformly spread over the digit positions to being localized around areas that are specific to the underlying digit, specifically its edges. Being able to model the uncertainty given some context can be helpful for many tasks. One example is active exploration, where the model has a choice over where to observe.<br />
They tested this by comparing the predictions of CNP when the observations are chosen according to uncertainty, versus random pixels. This method is a very simple way of doing active exploration, but it already produces better prediction results than selecting the conditioning points at random.<br />
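<br />
One step of such uncertainty-driven selection might look like the sketch below. The variance values are made up, and choosing the single highest-variance pixel is an assumed selection rule, not necessarily the exact one used in the paper.<br />
<br />
```python
def next_observation(variances, observed):
    # Among pixels not yet observed, pick the index with the highest predicted variance.
    candidates = [i for i in range(len(variances)) if i not in observed]
    return max(candidates, key=lambda i: variances[i])


# Hypothetical per-pixel predictive variances for a flattened 5-pixel image.
variances = [0.05, 0.40, 0.10, 0.90, 0.30]
observed = {3}  # pixel 3 has already been revealed

choice = next_observation(variances, observed)
```
<br />
Here <code>choice</code> is 1, the most uncertain pixel that has not been observed yet. In a full loop the model would be re-run after each new observation so that the variances reflect the enlarged context.<br />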
<br />
== Experimental Result III: Image Completion for Faces ==<br />
<br />
<br />
[[File:003.jpg|400px|center]]<br />
<br />
<br />
They also applied CNP to CelebA, a dataset of images of<br />
celebrity faces and reported performance obtained on the<br />
test set.<br />
<br />
As shown in the above figure, the model is able to capture the complex shapes and colors of this dataset, with predictions conditioned on fewer than 10% of the pixels being already close to the ground truth. As before, given only a few context points the model averages over all possible faces, but as the number of context pairs increases the predictions capture image-specific details like face orientation and facial expression. Furthermore, as the number of context points increases the variance is shifted towards the edges in the image.<br />
<br />
[[File:004.jpg|400px|center]]<br />
<br />
An important aspect of CNPs demonstrated in the above figure is their flexibility, not only in the number of observations and targets they receive but also with regard to their input values. It is interesting to compare this property to GPs on one hand, and to trained generative models (van den Oord et al., 2016; Gregor et al., 2015) on the other hand.<br />
The first type of flexibility can be seen when conditioning on subsets that the model has not encountered during training. Consider conditioning the model on one half of the image, for example. This forces the model to not only predict the pixel values according to some stationary smoothness property of the images, but also according to global spatial properties, e.g. symmetry and the relative location of different parts of faces. As seen in the first row of the figure, CNPs are able to capture those properties. A GP with a stationary kernel cannot capture this, and in the absence of observations would revert to its mean (the mean itself can be non-stationary, but usually this would not be enough to capture the interesting properties).<br />
<br />
In addition, the model is flexible with regard to the target input values. This means, for example, that the model can be queried at resolutions it has not seen during training. The authors take a model that has only been trained using pixel coordinates of a specific resolution and predict at test time subpixel values for targets between the original coordinates. As shown in the figure above (Figure 5 in the paper), with one forward pass the model can be queried at different resolutions. While GPs also exhibit this type of flexibility, it is not the case for trained generative models, which can only predict values for the pixel coordinates on which they were trained. In this sense, CNPs capture the best of both worlds: they are flexible with regard to the conditioning and prediction task and have the capacity to extract domain knowledge from a training set.<br />
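<br />
Querying at a new resolution only requires a different set of target inputs. The sketch below builds such target sets, assuming pixel centres are normalized to the unit square; the normalization convention and the resolutions 32 and 64 are illustrative assumptions.<br />
<br />
```python
def coordinate_grid(height, width):
    # Pixel centres normalised to (0, 1) in each dimension (an assumed convention).
    return [((row + 0.5) / height, (col + 0.5) / width)
            for row in range(height) for col in range(width)]


# Target inputs at the training resolution and at twice that resolution.
train_targets = coordinate_grid(32, 32)
query_targets = coordinate_grid(64, 64)
```
<br />
Every coordinate in <code>query_targets</code> lies between the training coordinates, so passing this list to the decoder amounts to asking for subpixel predictions without retraining.<br />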
<br />
[[File:010.jpg|400px|center]]<br />
<br />
<br />
They compared CNPs quantitatively to two related models: kNNs and GPs. As shown in the above table, CNPs outperform both when the number of context points is small (empirically when half of the image or less is provided as context). When the majority of the image is given as context, exact methods like GPs and kNN perform better. From the table we can also see that the order in which the context points are provided is less important for CNPs, since providing the context points in order from top to bottom still results in good performance. Both insights point to the fact that CNPs learn a data-specific ‘prior’ that will generate good samples even when the number of context points is very small.<br />
<br />
== Experimental Result IV: Classification ==<br />
Finally, they applied the model to one-shot classification using the Omniglot dataset. This dataset consists of 1,623 classes of characters from 50 different alphabets. Each class has only 20 examples and as such this dataset is particularly suitable for few-shot learning algorithms. The authors used 1,200 randomly selected classes as their training set and the remainder as the testing data set.<br />
<br />
Additionally, to apply data augmentation the authors cropped the image from 32 × 32 to 28 × 28, applied small random translations and rotations to the inputs, and also increased the number of classes by rotating every character by 90 degrees and defining that to be a new class. They generated the labels for an N-way classification task by choosing N random classes at each training step and arbitrarily assigning the labels <math>0, \ldots, N - 1</math> to each.<br />
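<br />
The label assignment for one training step can be sketched as follows. The class names are placeholders and <code>sample_task</code> is a hypothetical helper; only the idea of drawing N classes and numbering them arbitrarily comes from the text.<br />
<br />
```python
import random


def sample_task(class_names, n_way, rng):
    # Draw n_way distinct classes and assign them the labels 0, ..., n_way - 1
    # in the (arbitrary) order in which they were drawn.
    chosen = rng.sample(class_names, n_way)
    return {name: label for label, name in enumerate(chosen)}


rng = random.Random(0)
all_classes = ["char_%04d" % i for i in range(1200)]  # placeholder class names

task = sample_task(all_classes, 5, rng)
```
<br />
Because the mapping from class to label is redrawn at every step, the model cannot memorize labels and has to rely on the observations supplied for the current task.<br />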
<br />
<br />
[[File:008.jpg|400px|center]]<br />
<br />
Given that the input points are images, they modified the architecture of the encoder h to include convolution layers, as mentioned in section 2 of the paper. In addition, they only aggregated over inputs of the same class by using the information provided by the input label. The aggregated class-specific representations are then concatenated to form the final representation. Given that both the size of the class-specific representations and the number of classes are constant, the size of the final representation is still constant and thus the <math>O(n + m)</math> runtime still holds.<br />
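<br />
A minimal sketch of this class-wise aggregation is given below, assuming mean aggregation within each class and using made-up encoder outputs; <code>class_representation</code> is a hypothetical helper name.<br />
<br />
```python
def class_representation(encoded, labels, num_classes):
    # Average the encoder outputs within each class, then concatenate the
    # per-class means in label order.
    dim = len(encoded[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for r, label in zip(encoded, labels):
        counts[label] += 1
        for k in range(dim):
            sums[label][k] += r[k]
    final = []
    for c in range(num_classes):
        # Assumes every class has at least one observation.
        final.extend(s / counts[c] for s in sums[c])
    return final


# Hypothetical 2-dimensional encoder outputs for four images from two classes.
encoded = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
labels = [0, 1, 0, 1]

final = class_representation(encoded, labels, num_classes=2)
one_shot = class_representation(encoded[:2], labels[:2], num_classes=2)
```
<br />
Here <code>final</code> is <code>[3.0, 4.0, 5.0, 6.0]</code>. Its length depends only on the number of classes and the representation size, not on how many examples per class were observed, which is what keeps the decoder input fixed in size.<br />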
The results of the classification are summarized in the following table.<br />
CNPs achieve higher accuracy than models that are significantly more complex (like MANN). While CNPs do not beat the state of the art for one-shot classification, their accuracy values are comparable. Crucially, they reached those values using a significantly simpler architecture (three convolutional layers for the encoder and a three-layer MLP for the decoder) and with a lower runtime of <math>O(n + m)</math> at test time as opposed to <math>O(nm)</math>.<br />
<br />
== Conclusion ==<br />
<br />
The paper introduced Conditional Neural Processes,<br />
a model that is both flexible at test time and has the<br />
capacity to extract prior knowledge from training data.<br />
<br />
The authors demonstrated its ability to perform a variety of tasks including regression, classification and image completion. The paper compared CNPs to Gaussian Processes on one hand, and deep learning methods on the other, and also discussed the relation to meta-learning and few-shot learning. It is important to note that the specific CNP implementations described here are just simple proofs-of-concept and can be substantially extended, e.g. by including more elaborate architectures in line with modern deep learning advances.<br />
To summarize, this work can be seen as a step towards learning high-level abstractions, one of the grand challenges of contemporary machine learning. Functions learned by most conventional deep learning models are tied to a specific, constrained statistical context at any stage of training. A trained CNP is more general, in that it encapsulates the high-level statistics of a family of functions. As such it constitutes a high-level abstraction that can be reused for multiple tasks. In future work, the authors plan to explore how far these models can help in tackling the many key machine learning problems that seem to hinge on abstraction, such as transfer learning, meta-learning, and data efficiency.<br />
<br />
== Critiques ==<br />
<br />
This paper introduces a method for reducing the computational complexity of the better-known Gaussian Process model, but the stated complexity of O(n + m) is almost the same order as that of an RBF-kernel GP. With respect to performance across the sequence of tasks, the authors have not made metric comparisons to GP methods to prove the superiority of their approach.<br />
<br />
It appears that the proposed model is effective in making accurate predictions using lower quality inputs, for example a dataset with fewer data points or an image with fewer pixels. However, it is not clear whether the proposed algorithm can be trained with a smaller amount of input data.<br />
<br />
== Other Sources ==<br />
# Code for this model and a simpler explanation can be found at [https://github.com/deepmind/conditional-neural-process]<br />
# A newer version of the model is described in this paper [https://arxiv.org/pdf/1807.01622.pdf]<br />
# A good blog post on neural processes [https://kasparmartens.rbind.io/post/np/]<br />
<br />
== Reference ==<br />
Bartunov, S. and Vetrov, D. P. Fast adaptation in generative<br />
models with generative matching networks. arXiv<br />
preprint arXiv:1612.02192, 2016.<br />
<br />
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra,<br />
D. Weight uncertainty in neural networks. arXiv preprint<br />
arXiv:1505.05424, 2015.<br />
<br />
Bornschein, J., Mnih, A., Zoran, D., and J. Rezende, D.<br />
Variational memory addressing in generative models. In<br />
Advances in Neural Information Processing Systems, pp.<br />
3923–3932, 2017.<br />
<br />
Damianou, A. and Lawrence, N. Deep gaussian processes.<br />
In Artificial Intelligence and Statistics, pp. 207–215,<br />
2013.<br />
<br />
Devlin, J., Bunel, R. R., Singh, R., Hausknecht, M., and<br />
Kohli, P. Neural program meta-induction. In Advances in<br />
Neural Information Processing Systems, pp. 2077–2085,<br />
2017.<br />
<br />
Edwards, H. and Storkey, A. Towards a neural statistician.<br />
2016.<br />
<br />
Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning<br />
for fast adaptation of deep networks. arXiv<br />
preprint arXiv:1703.03400, 2017.<br />
<br />
Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation:<br />
Representing model uncertainty in deep learning.<br />
In international conference on machine learning, pp.<br />
1050–1059, 2016.<br />
<br />
Garnelo, M., Arulkumaran, K., and Shanahan, M. Towards<br />
deep symbolic reinforcement learning. arXiv preprint<br />
arXiv:1609.05518, 2016.<br />
<br />
Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and<br />
Wierstra, D. Draw: A recurrent neural network for image<br />
generation. arXiv preprint arXiv:1502.04623, 2015.<br />
<br />
Hewitt, L., Gane, A., Jaakkola, T., and Tenenbaum, J. B. The<br />
variational homoencoder: Learning to infer high-capacity<br />
generative models from few examples. 2018.<br />
<br />
J. Rezende, D., Danihelka, I., Gregor, K., Wierstra, D.,<br />
et al. One-shot generalization in deep generative models.<br />
In International Conference on Machine Learning, pp.<br />
1521–1529, 2016.<br />
<br />
Kingma, D. P. and Ba, J. Adam: A method for stochastic<br />
optimization. arXiv preprint arXiv:1412.6980, 2014.<br />
<br />
Kingma, D. P. and Welling, M. Auto-encoding variational<br />
bayes. arXiv preprint arXiv:1312.6114, 2013.<br />
<br />
Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural<br />
networks for one-shot image recognition. In ICML Deep<br />
Learning Workshop, volume 2, 2015.<br />
<br />
Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B.<br />
Human-level concept learning through probabilistic program<br />
induction. Science, 350(6266):1332–1338, 2015.<br />
<br />
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman,<br />
S. J. Building machines that learn and think like<br />
people. Behavioral and Brain Sciences, 40, 2017.<br />
<br />
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based<br />
learning applied to document recognition. Proceedings<br />
of the IEEE, 86(11):2278–2324, 1998.<br />
<br />
Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face<br />
attributes in the wild. In Proceedings of International<br />
Conference on Computer Vision (ICCV), December 2015.<br />
<br />
Louizos, C. and Welling, M. Multiplicative normalizing<br />
flows for variational bayesian neural networks. arXiv<br />
preprint arXiv:1703.01961, 2017.<br />
<br />
Louizos, C., Ullrich, K., and Welling, M. Bayesian compression<br />
for deep learning. In Advances in Neural Information<br />
Processing Systems, pp. 3290–3300, 2017.<br />
<br />
Rasmussen, C. E. and Williams, C. K. Gaussian processes<br />
in machine learning. In Advanced lectures on machine<br />
learning, pp. 63–71. Springer, 2004.<br />
<br />
Reed, S., Chen, Y., Paine, T., Oord, A. v. d., Eslami, S.,<br />
J. Rezende, D., Vinyals, O., and de Freitas, N. Few-shot<br />
autoregressive density estimation: Towards learning to<br />
learn distributions. 2017.<br />
<br />
Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic<br />
backpropagation and approximate inference in deep generative<br />
models. arXiv preprint arXiv:1401.4082, 2014.<br />
<br />
Salimbeni, H. and Deisenroth, M. Doubly stochastic variational<br />
inference for deep gaussian processes. In Advances<br />
in Neural Information Processing Systems, pp.<br />
4591–4602, 2017.<br />
<br />
Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and<br />
Lillicrap, T. One-shot learning with memory-augmented<br />
neural networks. arXiv preprint arXiv:1605.06065, 2016.<br />
<br />
Snell, J., Swersky, K., and Zemel, R. Prototypical networks<br />
for few-shot learning. In Advances in Neural Information<br />
Processing Systems, pp. 4080–4090, 2017.<br />
<br />
Snelson, E. and Ghahramani, Z. Sparse gaussian processes<br />
using pseudo-inputs. In Advances in neural information<br />
processing systems, pp. 1257–1264, 2006.<br />
<br />
van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals,<br />
O., Graves, A., et al. Conditional image generation with<br />
pixelcnn decoders. In Advances in Neural Information<br />
Processing Systems, pp. 4790–4798, 2016.<br />
<br />
Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.<br />
Matching networks for one shot learning. In Advances in<br />
Neural Information Processing Systems, pp. 3630–3638,<br />
2016.<br />
<br />
Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H.,<br />
Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and<br />
Botvinick, M. Learning to reinforcement learn. arXiv<br />
preprint arXiv:1611.05763, 2016.<br />
<br />
Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P.<br />
Deep kernel learning. In Artificial Intelligence and Statistics,<br />
pp. 370–378, 2016.<br />
<br />
</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=conditional_neural_process&diff=42162conditional neural process2018-11-30T23:49:50Z<p>Z43ma: Example in Model, fixed mistake about GP</p>
<hr />
<div>== Motivation ==<br />
<br />
Deep neural networks are good at function approximation, yet they are typically trained from scratch for each new function. Bayesian methods such as Gaussian Processes (GPs), on the other hand, exploit prior knowledge to quickly infer the shape of a new function at test time. Yet GPs are computationally expensive, and it can be hard to design appropriate priors. Hence the authors propose a family of neural models called Conditional Neural Processes (CNPs) that combine the benefits of both.<br />
<br />
== Introduction ==<br />
<br />
To train a model effectively, deep neural networks typically require large datasets. To mitigate this data efficiency problem, learning in two phases is one approach: the first phase learns the statistics of a generic domain without committing to a specific learning task; the second phase learns a function for a specific task but does so using only a small number of data points by exploiting the domain-wide statistics already learned. Taking a probabilistic stance and specifying a distribution over functions (stochastic processes) is another approach -- Gaussian Processes being a commonly used example of this. Such Bayesian methods can be computationally expensive. <br />
<br />
The authors of the paper propose a family of models that represent solutions to the supervised problem, and an end-to-end training approach to learning them that combines neural networks with features reminiscent of Gaussian Processes. They call this family of models Conditional Neural Processes (CNPs). CNPs can be trained on very few data points to make accurate predictions, while they also have the capacity to scale to complex functions and large datasets.<br />
<br />
== Model ==<br />
Consider a data set <math display="inline"> \{x_i, y_i\} </math> with evaluations <math display="inline">y_i = f(x_i) </math> for some unknown function <math display="inline">f</math>. Assume <math display="inline">g</math> is an approximating function of f. The aim is to minimize the loss between <math display="inline">f</math> and <math display="inline">g</math> on the entire space <math display="inline">X</math>. In practice, the routine is evaluated on a finite set of observations.<br />
<br />
Let the training set be <math display="inline"> O = \{(x_i, y_i)\}_{i = 0} ^{n-1}</math>, and let the test set be <math display="inline"> T = \{x_i\}_{i = n} ^ {n + m - 1} \subset X</math>, a set of unlabelled points.<br />
<br />
Let P be a probability distribution over functions <math display="inline"> F : X \to Y</math>, formally known as a stochastic process. Thus, P defines a joint distribution over the random variables <math display="inline"> \{f(x_i)\}_{i = 0} ^{n + m - 1}</math>, and therefore a conditional distribution <math display="inline"> P(f(T)|O, T)</math>. The task is to predict the output values <math display="inline">f(x_i)</math> for <math display="inline"> x_i \in T</math>, given <math display="inline"> O</math>. <br />
<br />
The authors give a good example: consider a random 1-dimensional function <math>f \sim P</math> defined on the real line (i.e., <math>X := \mathbb{R}</math>, <math>Y := \mathbb{R}</math>). <math>O</math> would constitute <math>n</math> observations of <math>f</math>’s value <math>y_i</math> at different locations <math>x_i</math> on the real line. Given these observations, we are interested in predicting <math>f</math>’s value at new locations on the real line. <br />
<br />
A common assumption made on P is that all function evaluations of <math display="inline"> f </math> are jointly Gaussian distributed. This class of random functions is called Gaussian Processes (GPs). The stochastic process framework allows a model to be data efficient; however, it is hard to design appropriate priors, and stochastic processes are computationally expensive, scaling poorly with <math>n</math> and <math>m</math>. GPs are one example, with a running time of <math>O((n+m)^3)</math>.<br />
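<br />
To see where the cubic cost comes from, the sketch below computes a GP posterior mean in plain Python. The RBF kernel, unit length scale and small noise term are assumptions made for the example. The sketch only factorizes the <math>n \times n</math> covariance of the observations; a joint treatment of all <math>n + m</math> points would factorize the larger matrix in the same way.<br />
<br />
```python
import math


def rbf_kernel(a, b, length_scale=1.0):
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def cholesky(matrix):
    # Lower-triangular factor L with L L^T = matrix. The two explicit loops plus
    # the inner sum give the O(n^3) cost that dominates GP inference.
    n = len(matrix)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                L[i][j] = (matrix[i][j] - s) / L[j][j]
    return L


def solve_cholesky(L, b):
    n = len(b)
    z = [0.0] * n
    for i in range(n):  # forward substitution: L z = b
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    a = [0.0] * n
    for i in reversed(range(n)):  # back substitution: L^T a = z
        a[i] = (z[i] - sum(L[k][i] * a[k] for k in range(i + 1, n))) / L[i][i]
    return a


def gp_posterior_mean(x_obs, y_obs, x_new, noise=1e-6):
    n = len(x_obs)
    K = [[rbf_kernel(x_obs[i], x_obs[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve_cholesky(cholesky(K), y_obs)
    return [sum(rbf_kernel(x, x_obs[i]) * alpha[i] for i in range(n)) for x in x_new]


x_obs = [-1.0, 0.0, 1.0, 2.0]
y_obs = [0.5, 1.0, 0.2, -0.4]
posterior = gp_posterior_mean(x_obs, y_obs, x_obs + [0.5])
```
<br />
With almost no noise the posterior mean passes through the observations, and the extra query at 0.5 is interpolated by the kernel. The factorization step is the part that a CNP replaces with a single pass through an encoder.<br />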
<br />
[[File:001.jpg|300px|center]]<br />
<br />
== Conditional Neural Process ==<br />
<br />
Conditional Neural Process models directly parametrize conditional stochastic processes without imposing consistency with respect to some prior process. CNPs parametrize distributions over <math display="inline">f(T)</math> given a distributed representation of <math display="inline">O</math> of fixed dimensionality. Thus, the mathematical guarantees associated with stochastic processes are traded off for functional flexibility and scalability.<br />
<br />
A CNP is a conditional stochastic process <math display="inline">Q_\theta</math> that defines distributions over <math display="inline">f(x_i)</math> for <math display="inline">x_i \in T</math>, given a set of observations <math display="inline">O</math>. As for stochastic processes, the authors assume that <math display="inline">Q_{\theta}</math> is invariant to permutations, i.e. <math display="inline">Q_\theta(f(T) | O, T)= Q_\theta(f(T') | O, T')=Q_\theta(f(T) | O', T) </math> when <math> O', T'</math> are permutations of <math display="inline">O</math> and <math display="inline">T </math>. In this work, permutation invariance with respect to <math display="inline">T</math> is generally enforced by assuming a factored structure, which is the easiest way to ensure a valid stochastic process. That is, <math display="inline">Q_\theta(f(T) | O, T) = \prod _{x \in T} Q_\theta(f(x) | O, x)</math>. Moreover, this framework can be extended to non-factored distributions.<br />
<br />
In detail, the following architecture is used<br />
<br />
<math display="inline">r_i = h_\theta(x_i, y_i)</math> &forall; <math display="inline">(x_i, y_i) \in O</math>, where <math display="inline">h_\theta : X \times Y \to \mathbb{R} ^ d</math><br />
<br />
<math display="inline">r = r_1 * r_2 * \ldots * r_n</math>, where <math display="inline">*</math> is a commutative operation that takes elements in <math display="inline">\mathbb{R}^d</math> and maps them into a single element of <math display="inline">\mathbb{R} ^ d</math><br />
<br />
<math display="inline">\Phi_i = g_\theta(x_i, r)</math> &forall; <math display="inline">x_i \in T</math>, where <math display="inline">g_\theta : X \times \mathbb{R} ^ d \to \mathbb{R} ^ e</math> and <math display="inline">\Phi_i</math> are parameters for <math display="inline">Q_\theta</math><br />
<br />
Note that this architecture ensures permutation invariance and <math display="inline">O(n + m)</math> scaling for conditional prediction. Also, since <math display="inline">r = r_1 * r_2 * \ldots * r_n</math> can be computed in <math display="inline">O(n)</math>, this architecture supports streaming observations with minimal overhead.<br />
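<br />
The streaming property can be sketched with a running mean, assuming the commutative operation is the mean (as in the regression experiment). <code>StreamingMean</code> is a hypothetical helper, and the representation vectors are made up.<br />
<br />
```python
class StreamingMean:
    """Keeps the aggregate r up to date as representations r_i arrive one at a time."""

    def __init__(self, dim):
        self.count = 0
        self.mean = [0.0] * dim

    def update(self, r_i):
        # Incremental mean update: constant work per new observation,
        # with no need to store or revisit earlier observations.
        self.count += 1
        self.mean = [m + (v - m) / self.count for m, v in zip(self.mean, r_i)]
        return self.mean


# Hypothetical encoder outputs arriving as a stream.
representations = [[1.0, 4.0], [3.0, 0.0], [5.0, 2.0]]

aggregator = StreamingMean(dim=2)
for r_i in representations:
    aggregator.update(r_i)

# Reference value computed from the full batch.
batch_mean = [sum(column) / len(representations) for column in zip(*representations)]
```
<br />
After the three updates <code>aggregator.mean</code> is <code>[3.0, 2.0]</code>, the same as the batch mean, so new observations can be folded in without re-encoding the old ones.<br />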
<br />
We train <math display="inline">Q_\theta</math> by asking it to predict <math display="inline">O</math> conditioned on a randomly
chosen subset of <math display="inline">O</math>. This gives the model a signal of the uncertainty over the space <math display="inline">X</math> inherent in the distribution
<math display="inline">P</math> given a set of observations. The authors let <math display="inline"> f \sim P</math>, <math display="inline"> O = \{(x_i, y_i)\}_{i = 0} ^{n-1}</math>, and <math display="inline">N \sim \text{uniform}[0, \ldots, n-1]</math>. The subset <math display="inline"> O_N = \{(x_i, y_i)\}_{i = 0} ^{N}</math>, an initial segment of <math display="inline">O</math>, serves as the conditioning set. The negative conditional log probability is given by<br />
\[\mathcal{L}(\theta)=-\mathbb{E}_{f \sim P}[\mathbb{E}_{N}[\log Q_\theta(\{y_i\}_{i = 0} ^{n-1}|O_{N}, \{x_i\}_{i = 0} ^{n-1})]]\]<br />
Thus, the targets on which <math display="inline">Q_\theta</math> is scored include both the observed
and unobserved values. In practice, Monte Carlo estimates of the gradient of this loss are taken by sampling <math display="inline">f</math> and <math display="inline">N</math>. <br />
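<br />
A Monte Carlo estimate of this loss can be sketched as below. It is only an illustration under stated assumptions: <code>toy_predict</code> is a hypothetical stand-in for <math display="inline">Q_\theta</math> (it predicts the context mean with unit scale), <code>sample_sine</code> is a hypothetical stand-in for drawing <math display="inline">f \sim P</math>, and the output distribution is assumed to be a factored Gaussian. A real implementation would differentiate this quantity with respect to <math display="inline">\theta</math>.<br />

```python
import numpy as np

def gaussian_nll(y, mean, sigma):
    """Negative log-likelihood of y under independent N(mean, sigma^2), summed."""
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + (y - mean)**2 / (2 * sigma**2))

def toy_predict(x_ctx, y_ctx, x_tgt):
    """Hypothetical stand-in for Q_theta: context mean, unit scale."""
    mean = np.full_like(x_tgt, y_ctx.mean(), dtype=float)
    return mean, np.ones_like(mean)

def sample_sine(n, rng):
    """Hypothetical stand-in for f ~ P: a sine with a random phase."""
    xs = rng.uniform(-2.0, 2.0, n)
    return xs, np.sin(xs + rng.uniform(0.0, np.pi))

def monte_carlo_loss(sample_f, predict, n, num_samples, rng):
    """Average NLL over sampled functions f and sampled context sizes N."""
    total = 0.0
    for _ in range(num_samples):
        xs, ys = sample_f(n, rng)               # one draw of f at n inputs
        N = rng.integers(0, n)                  # N ~ uniform{0, ..., n-1}
        x_ctx, y_ctx = xs[:N + 1], ys[:N + 1]   # O_N: elements 0..N of O
        mean, sigma = predict(x_ctx, y_ctx, xs) # score all n targets
        total += gaussian_nll(ys, mean, sigma)
    return total / num_samples
```

Note that the targets passed to <code>predict</code> are all <math display="inline">n</math> inputs, so both observed and unobserved values contribute to the loss.<br />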
<br />
This approach shifts the burden of imposing prior knowledge from an analytic prior to empirical data. This has the advantage of liberating a practitioner from having to specify an analytic form for the prior, which is ultimately<br />
intended to summarize their empirical experience. Still, we emphasize that the <math display="inline">Q_\theta</math> are not necessarily a consistent set of conditionals for all observation sets, and the training routine does not guarantee that.<br />
<br />
In summary,<br />
<br />
1. A CNP is a conditional distribution over functions<br />
trained to model the empirical conditional distributions<br />
of functions <math display="inline">f \sim P</math>.<br />
<br />
2. A CNP is permutation invariant in <math display="inline">O</math> and <math display="inline">T</math>.<br />
<br />
3. A CNP is scalable, achieving a running time complexity<br />
of <math display="inline">O(n + m)</math> for making <math display="inline">m</math> predictions with <math display="inline">n</math><br />
observations.<br />
<br />
== Related Work ==<br />
<br />
===Gaussian Process Framework===<br />
<br />
A Gaussian Process (GP) is a non-parametric method used extensively for regression and classification problems in the machine learning community. A GP is defined as a collection of random variables, any finite number of which have a joint Gaussian distribution.<br />
A standard approach is to model data as <math>y = m(X, \phi) + \epsilon</math><br />
where <math>m</math> is the mean function with parameter vector <math>\phi</math>, and <math>\epsilon</math> represents independent and identically distributed (i.i.d.) Gaussian noise: <math>\epsilon \sim N(0,\sigma^2)</math><br />
<br />
For more info on Gaussian Process Framework:<br />
[https://arxiv.org/abs/1506.07304 A Gaussian process framework for modeling instrumental systematics: application to transmission spectroscopy]<br />
<br />
Several papers attempt to address various issues with GPs. These include:<br />
* Using sparse GPs to aid in scaling (Snelson & Ghahramani, 2006)<br />
* Using Deep GPs to achieve more expressiveness (Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017)<br />
* Using neural networks to learn more expressive kernels (Wilson et al., 2016)<br />
<br />
A Python resource for Gaussian Process Framework implementation: [https://github.com/SheffieldML/GPy Gaussian Process Framework in Python]<br />
<br />
<br />
The goal of this paper is to incorporate ideas from standard neural networks with Gaussian processes in order to overcome drawbacks of both. Bayesian techniques work better with less data, but complex Bayesian networks become intractable even on moderately sized datasets. NNs, on the other hand, cannot make use of prior knowledge and often have to be retrained from scratch; without sufficient data, they also perform poorly. Combining both frameworks yields Conditional Neural Processes, which learn the kernels of the Gaussian process through neural networks and use these learned kernels in a framework similar to GPs for prediction.<br />
<br />
===Meta Learning===<br />
<br />
Meta-Learning attempts to allow neural networks to learn more generalizable functions, as opposed to only approximating one function. This can be done by learning deep generative models which can do few-shot estimations of data. This can be implemented with attention mechanisms (Reed et al., 2017) or additional memory units in a VAE model (Bornschein et al., 2017). Another successful latent variable approach is to explicitly condition on some context during inference (J. Rezende et al., 2016). Given the generative nature of these models they are usually applied to image generation tasks, but models that include a conditioning class-variable can be used for classification as well. Recently meta-learning has also been applied to a wide range of tasks like RL (Wang et al., 2016; Finn et al., 2017) or program induction (Devlin et al., 2017).<br />
<br />
Classification is another common task in meta-learning. Few-shot classification algorithms usually rely on some distance metric in feature space to compare target images and the observations (Koch et al., 2015; Santoro et al., 2016). Matching networks (Vinyals et al., 2016; Bartunov & Vetrov, 2016) are closely related to CNPs. In their case, features of samples are compared with target features using an attention kernel. At a higher level one can interpret this model as a CNP where the aggregator is just the concatenation over all input samples and the decoder <math>g</math> contains an explicitly defined distance kernel. In this sense matching networks are closer to GPs than to CNPs, since they require the specification of a distance kernel that CNPs learn from the data instead. In addition, as MNs carry out all-to-all comparisons they scale with <math> O(n \times m) </math>, although they can be modified to have the same complexity of <math>O(n + m)</math> as CNPs (Snell et al., 2017).<br />
<br />
Another area of meta-learning is neural architecture search, currently one of the most popular trends in the field. It requires three things to be defined: the search space, the search strategy, and the performance evaluation strategy. The idea is to define a search space and let an algorithm decide which architecture and hyperparameters are best for a particular task. Since evaluating a neural network is expensive (the network has to be trained first), a well-designed performance evaluation strategy is needed to reduce the computational cost.<br />
<br />
A model that is conceptually very similar to CNPs (and in particular the latent variable version) is the “neural statistician” (Edwards & Storkey, 2016) and the related variational homoencoder (Hewitt et al., 2018). As with the other generative models, the neural statistician learns to estimate the density of the observed data but does not allow for targeted sampling at what we have been referring to as input positions <math>x_i</math>. Instead, one can only generate i.i.d. samples from the estimated density. Finally, the latent variant of the Conditional Neural Process can also be seen as an approximated amortized version of Bayesian DL (Gal & Ghahramani, 2016; Blundell et al., 2015; Louizos et al., 2017; Louizos & Welling, 2017). For example, Gal & Ghahramani (2016) develop a theoretical framework casting dropout training in deep neural networks as approximate Bayesian inference in deep Gaussian processes. Their theory extracts information from existing models and gives us tools to model uncertainty.<br />
<br />
== Experimental Result I: Function Regression ==<br />
<br />
The first example is a classical 1D regression task that is commonly used as a baseline for GPs. <br />
The authors generated two different datasets consisting of functions
sampled from a GP with an exponential kernel. In the first dataset they used a kernel with fixed parameters; in the second dataset, the function switched at some random point on the real line between two functions, each sampled with
different kernel parameters. At every training step, they sampled a curve from the GP, selected
a subset of <math>n</math> points as observations, and a subset of <math>t</math> points as target points. The observed points are encoded using a three-layer MLP encoder <math>h</math> with a 128-dimensional output representation. The representations are aggregated into a single representation
<math display="inline">r = \frac{1}{n} \sum r_i</math>,
which is concatenated to <math display="inline">x_t</math> and passed to a decoder <math>g</math> consisting of a five-layer
MLP. The decoder outputs a Gaussian mean and variance for the target outputs. The model is trained to maximize the log-likelihood of the target points using the Adam optimizer. <br />
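<br />
The data-generation step for the fixed-kernel dataset can be sketched as follows. This is an illustration only: the summary does not give the kernel's exact form or parameters, so a squared-exponential kernel with an arbitrary length scale is assumed, and the input range, point counts and function names are hypothetical.<br />

```python
import numpy as np

def sq_exp_kernel(xa, xb, length_scale=0.4, signal_std=1.0):
    """Squared-exponential kernel (an assumed form of the 'exponential kernel')."""
    diff = xa[:, None] - xb[None, :]
    return signal_std**2 * np.exp(-0.5 * (diff / length_scale)**2)

def sample_task(rng, num_points=50, n_context=5, n_target=20):
    """Draw one curve from the GP prior and split it into context and targets."""
    xs = np.sort(rng.uniform(-2.0, 2.0, num_points))
    K = sq_exp_kernel(xs, xs) + 1e-6 * np.eye(num_points)  # jitter for stability
    ys = np.linalg.cholesky(K) @ rng.standard_normal(num_points)
    idx = rng.permutation(num_points)
    ctx = idx[:n_context]
    tgt = idx[:n_context + n_target]   # targets include the context points
    return (xs[ctx], ys[ctx]), (xs[tgt], ys[tgt])

rng = np.random.default_rng(0)
(context_x, context_y), (target_x, target_y) = sample_task(rng)
```

A new curve and a new split would be drawn at every training step, so the model never sees the same function twice.<br />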
<br />
Two examples of the regression results obtained for each<br />
of the datasets are shown in the following figure.<br />
<br />
[[File:007.jpg|300px|center]]<br />
<br />
They compared the model to the predictions generated by a GP with the correct
hyperparameters, which constitutes an upper bound on the model's
performance. Although the prediction generated by the GP
is smoother than the CNP's prediction for both the mean
and the variance, the model is able to learn to regress from a few
context points for both the fixed kernels and the switching kernels.
As the number of context points grows, the accuracy
of the model improves and its approximated uncertainty
decreases. Crucially, the model learns
to estimate its own uncertainty given the observations very
accurately.
Furthermore, the model achieves similarly good performance
on the switching kernel task. This type of regression task
is not trivial for GPs, whereas for a CNP only the dataset
used for training has to change.
<br />
== Experimental Result II: Image Completion for Digits ==<br />
<br />
[[File:002.jpg|600px|center]]<br />
<br />
They also tested CNP on the MNIST dataset and used the test
set to evaluate its performance. As shown in the above figure, the
model learns to make good predictions of the underlying
digit even for a small number of context points. Crucially,
when conditioned on only one non-informative context point, the model’s prediction corresponds
to the average over all MNIST digits. As the number
of context points increases, the predictions become more
similar to the underlying ground truth. This demonstrates
the model’s capacity to extract dataset-specific prior knowledge.
It is worth mentioning that even with a complete set
of observations, the model does not achieve pixel-perfect
reconstruction, as there is a bottleneck at the representation
level.
Since this implementation of CNP returns factored outputs,
the best prediction it can produce given limited context
information is to average over all possible predictions that
agree with the context. An alternative is to add
latent variables to the model such that they can be sampled
conditioned on the context to produce predictions with high
probability in the data distribution. <br />
<br />
<br />
An important aspect of the model is its ability to estimate
the uncertainty of the prediction. As shown in the bottom
row of the above figure, as more observations are added, the variance
shifts from being almost uniformly spread over the digit
positions to being localized around areas that are specific
to the underlying digit, specifically its edges. Being able to
model the uncertainty given some context can be helpful for
many tasks. One example is active exploration, where the
model has a choice over where to observe.
They tested this by
comparing the predictions of CNP when the observations
are chosen according to uncertainty versus at random. This is a very simple way of doing active
exploration, but it already produces better prediction results
than selecting the conditioning points at random.<br />
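<br />
The two selection strategies compared above can be sketched as follows. The summary does not spell out the selection rule, so this assumes that "chosen according to uncertainty" means picking the unobserved pixel with the largest predicted variance; the function names and the toy variance map are hypothetical.<br />

```python
import numpy as np

def next_pixel_by_uncertainty(variance, observed_mask):
    """(row, col) of the unobserved pixel with the largest predicted variance."""
    masked = np.where(observed_mask, -np.inf, variance)
    return np.unravel_index(np.argmax(masked), variance.shape)

def next_pixel_at_random(observed_mask, rng):
    """(row, col) of a uniformly chosen unobserved pixel (the baseline)."""
    candidates = np.argwhere(~observed_mask)
    return tuple(candidates[rng.integers(len(candidates))])

# Toy 2x2 variance map standing in for the model's predicted variance.
variance = np.array([[0.1, 0.9],
                     [0.5, 0.2]])
observed = np.zeros((2, 2), dtype=bool)
first = next_pixel_by_uncertainty(variance, observed)
observed[first] = True
second = next_pixel_by_uncertainty(variance, observed)
```

In an actual exploration loop the variance map would be recomputed by the model after each new observation; here it is held fixed to keep the example small.<br />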
<br />
== Experimental Result III: Image Completion for Faces ==<br />
<br />
<br />
[[File:003.jpg|400px|center]]<br />
<br />
<br />
They also applied CNP to CelebA, a dataset of images of<br />
celebrity faces and reported performance obtained on the<br />
test set.<br />
<br />
As shown in the above figure, the model is able to capture
the complex shapes and colors of this dataset, with predictions
conditioned on less than 10% of the pixels being
already close to the ground truth. As before, given only a few context
points the model averages over all possible faces, but as
the number of context pairs increases the predictions capture
image-specific details like face orientation and facial
expression. Furthermore, as the number of context points
increases the variance is shifted towards the edges in the
image.<br />
<br />
[[File:004.jpg|400px|center]]<br />
<br />
An important aspect of CNPs demonstrated in the above figure is
their flexibility, not only in the number of observations and
targets they receive but also with regard to their input values.
It is interesting to compare this property to GPs on one hand,
and to trained generative models (van den Oord et al., 2016;
Gregor et al., 2015) on the other hand.
The first type of flexibility can be seen when conditioning on
subsets that the model has not encountered during training.
Consider conditioning the model on one half of the image,
for example. This forces the model to not only predict the pixel
values according to some stationary smoothness property of
the images, but also according to global spatial properties,
e.g. symmetry and the relative location of different parts of
faces. As seen in the first row of the figure, CNPs are able to
capture those properties. A GP with a stationary kernel cannot
capture this, and in the absence of observations would
revert to its mean (the mean itself can be non-stationary, but
usually this would not be enough to capture the interesting
properties).<br />
<br />
In addition, the model is flexible with regard to the target
input values. This means, for example, that the model can be queried
at resolutions it has not seen during training. The authors take a
model that has only been trained using pixel coordinates of
a specific resolution and predict at test time subpixel values
for targets between the original coordinates. As shown in
Figure 5 of the paper, with one forward pass the model can be queried at
different resolutions. While GPs also exhibit this type of
flexibility, it is not the case for trained generative models,
which can only predict values for the pixel coordinates on
which they were trained. In this sense, CNPs capture the best
of both worlds: they are flexible with regard to the conditioning
and prediction task and have the capacity to extract domain
knowledge from a training set.<br />
<br />
[[File:010.jpg|400px|center]]<br />
<br />
<br />
They compared CNPs quantitatively to two related models:
kNNs and GPs. As shown in the above table, CNPs outperform
both when the number of context points is small (empirically,
when half of the image or less is provided as context).
When the majority of the image is given as context, exact
methods like GPs and kNN perform better. From the table
we can also see that the order in which the context points
are provided is less important for CNPs, since providing the
context points in order from top to bottom still results in
good performance. Both insights point to the fact that CNPs
learn a data-specific ‘prior’ that will generate good samples
even when the number of context points is very small.<br />
<br />
== Experimental Result IV: Classification ==<br />
Finally, they applied the model to one-shot classification using the Omniglot dataset. This dataset consists of 1,623 classes of characters from 50 different alphabets. Each class has only 20 examples and as such this dataset is particularly suitable for few-shot learning algorithms. The authors used 1,200 randomly selected classes as their training set and the remainder as the testing data set.<br />
<br />
Additionally, to apply data augmentation the authors cropped the image from 32 × 32 to 28 × 28, applied small random<br />
translations and rotations to the inputs, and also increased<br />
the number of classes by rotating every character by 90<br />
degrees and defining that to be a new class. They generated<br />
the labels for an N-way classification task by choosing N<br />
random classes at each training step and arbitrarily assigning<br />
the labels 0, ..., N − 1 to each.<br />
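<br />
The label-generation step for an N-way task described above can be sketched as follows. This is an assumption-laden illustration: the array layout, the number of support examples per class (<code>k_shot</code>), the single query image per class, and the function name <code>sample_episode</code> are all hypothetical, and the cropping, translation and rotation augmentations are omitted.<br />

```python
import numpy as np

def sample_episode(images_by_class, n_way, k_shot, rng):
    """Build one N-way episode.

    images_by_class has shape (num_classes, examples_per_class, H, W).
    Returns support images/labels and one query image/label per class.
    """
    chosen = rng.choice(images_by_class.shape[0], size=n_way, replace=False)
    sx, sy, qx, qy = [], [], [], []
    for label, cls in enumerate(chosen):   # arbitrary labels 0..N-1
        order = rng.permutation(images_by_class.shape[1])
        sx.extend(images_by_class[cls, order[:k_shot]])
        sy.extend([label] * k_shot)
        qx.append(images_by_class[cls, order[k_shot]])
        qy.append(label)
    return np.array(sx), np.array(sy), np.array(qx), np.array(qy)

# Toy stand-in for the dataset: 6 classes, 20 examples each, 4x4 "images"
# whose pixel values equal the class index.
toy = np.arange(6)[:, None, None, None] * np.ones((6, 20, 4, 4))
support_x, support_y, query_x, query_y = sample_episode(
    toy, n_way=3, k_shot=2, rng=np.random.default_rng(0))
```

Because the labels are reassigned at every step, the model cannot memorize a fixed class-to-label mapping and has to rely on the support examples.<br />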
<br />
<br />
[[File:008.jpg|400px|center]]<br />
<br />
Given that the input points are images, they modified the architecture
of the encoder <math>h</math> to include convolution layers, as
mentioned in section 2. In addition, they only aggregated over
inputs of the same class by using the information provided
by the input label. The aggregated class-specific representations
are then concatenated to form the final representation.
Given that both the size of the class-specific representations
and the number of classes are constant, the size of the final
representation is still constant and thus the O(n + m)
runtime still holds.
The results of the classification are summarized in the following table.
CNPs achieve higher accuracy than models that are significantly
more complex (like MANN). While CNPs do not
beat the state of the art for one-shot classification, their accuracy
values are comparable. Crucially, they reached those values
using a significantly simpler architecture (three convolutional
layers for the encoder and a three-layer MLP for the
decoder) and with a lower runtime of O(n + m) at test time,
as opposed to O(nm).<br />
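<br />
The class-wise aggregation described above (aggregate within each class, then concatenate) can be sketched as follows. The mean is assumed as the per-class aggregator, every class is assumed to have at least one example, and the function name is hypothetical.<br />

```python
import numpy as np

def classwise_representation(reps, labels, n_way):
    """Aggregate encoder outputs per class and concatenate the results.

    reps:   (n, d) array of per-example representations r_i
    labels: (n,) integer labels in 0..n_way-1 (each class assumed non-empty)
    Returns a vector of fixed length n_way * d, independent of n.
    """
    per_class = [reps[labels == c].mean(axis=0) for c in range(n_way)]
    return np.concatenate(per_class)

reps = np.array([[1.0, 2.0],
                 [3.0, 4.0],
                 [10.0, 20.0]])
labels = np.array([0, 0, 1])
final_rep = classwise_representation(reps, labels, n_way=2)
```

The output length depends only on the number of classes and the representation size, which is why the O(n + m) runtime is preserved.<br />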
<br />
== Conclusion ==<br />
<br />
The paper introduced Conditional Neural Processes,<br />
a model that is both flexible at test time and has the<br />
capacity to extract prior knowledge from training data.<br />
<br />
The authors demonstrated its ability to perform a variety of tasks
including regression, classification and image completion.
The paper compared CNPs to Gaussian Processes on one hand, and
deep learning methods on the other, and also discussed the
relation to meta-learning and few-shot learning.
It is important to note that the specific CNP implementations<br />
described here are just simple proofs-of-concept and can<br />
be substantially extended, e.g. by including more elaborate<br />
architectures in line with modern deep learning advances.<br />
To summarize, this work can be seen as a step towards learning<br />
high-level abstractions, one of the grand challenges of<br />
contemporary machine learning. Functions learned by most<br />
conventional deep learning models are tied to a specific, constrained<br />
statistical context at any stage of training. A trained<br />
CNP is more general, in that it encapsulates the high-level<br />
statistics of a family of functions. As such it constitutes a<br />
high-level abstraction that can be reused for multiple tasks.<br />
In future work, they are going to explore how far these models can<br />
help in tackling the many key machine learning problems<br />
that seem to hinge on abstraction, such as transfer learning,<br />
meta-learning, and data efficiency.<br />
<br />
== Critiques ==<br />
<br />
This paper introduces a method for reducing the computational complexity of the better-known Gaussian Process model, but the complexity the authors report, O(n + m), is almost the same order as that of an RBF-kernel GP. With respect to performance across the sequence of tasks, the authors have not provided metric comparisons to GP methods to demonstrate the superiority of their approach.<br />
<br />
It appears that the proposed model is effective in making accurate predictions using lower quality inputs, for example a dataset with fewer data points or an image with fewer pixels. However, it is not clear whether the proposed algorithm can be trained with a smaller amount of input data.<br />
<br />
== Other Sources ==<br />
# Code for this model and a simpler explanation can be found at [https://github.com/deepmind/conditional-neural-process]<br />
# A newer version of the model is described in this paper [https://arxiv.org/pdf/1807.01622.pdf]<br />
# A good blog post on neural processes [https://kasparmartens.rbind.io/post/np/]<br />
<br />
== Reference ==<br />
Bartunov, S. and Vetrov, D. P. Fast adaptation in generative<br />
models with generative matching networks. arXiv<br />
preprint arXiv:1612.02192, 2016.<br />
<br />
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra,<br />
D. Weight uncertainty in neural networks. arXiv preprint<br />
arXiv:1505.05424, 2015.<br />
<br />
Bornschein, J., Mnih, A., Zoran, D., and J. Rezende, D.<br />
Variational memory addressing in generative models. In<br />
Advances in Neural Information Processing Systems, pp.<br />
3923–3932, 2017.<br />
<br />
Damianou, A. and Lawrence, N. Deep gaussian processes.<br />
In Artificial Intelligence and Statistics, pp. 207–215,<br />
2013.<br />
<br />
Devlin, J., Bunel, R. R., Singh, R., Hausknecht, M., and<br />
Kohli, P. Neural program meta-induction. In Advances in<br />
Neural Information Processing Systems, pp. 2077–2085,<br />
2017.<br />
<br />
Edwards, H. and Storkey, A. Towards a neural statistician.<br />
2016.<br />
<br />
Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning
for fast adaptation of deep networks. arXiv
preprint arXiv:1703.03400, 2017.<br />
<br />
Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation:<br />
Representing model uncertainty in deep learning.<br />
In international conference on machine learning, pp.<br />
1050–1059, 2016.<br />
<br />
Garnelo, M., Arulkumaran, K., and Shanahan, M. Towards<br />
deep symbolic reinforcement learning. arXiv preprint<br />
arXiv:1609.05518, 2016.<br />
<br />
Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and<br />
Wierstra, D. Draw: A recurrent neural network for image<br />
generation. arXiv preprint arXiv:1502.04623, 2015.<br />
<br />
Hewitt, L., Gane, A., Jaakkola, T., and Tenenbaum, J. B. The<br />
variational homoencoder: Learning to infer high-capacity<br />
generative models from few examples. 2018.<br />
<br />
J. Rezende, D., Danihelka, I., Gregor, K., Wierstra, D.,<br />
et al. One-shot generalization in deep generative models.<br />
In International Conference on Machine Learning, pp.<br />
1521–1529, 2016.<br />
<br />
Kingma, D. P. and Ba, J. Adam: A method for stochastic<br />
optimization. arXiv preprint arXiv:1412.6980, 2014.<br />
<br />
Kingma, D. P. and Welling, M. Auto-encoding variational<br />
bayes. arXiv preprint arXiv:1312.6114, 2013.<br />
<br />
Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural<br />
networks for one-shot image recognition. In ICML Deep<br />
Learning Workshop, volume 2, 2015.<br />
<br />
Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B.<br />
Human-level concept learning through probabilistic program<br />
induction. Science, 350(6266):1332–1338, 2015.<br />
<br />
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman,<br />
S. J. Building machines that learn and think like<br />
people. Behavioral and Brain Sciences, 40, 2017.<br />
<br />
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based
learning applied to document recognition. Proceedings
of the IEEE, 86(11):2278–2324, 1998.<br />
<br />
Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face<br />
attributes in the wild. In Proceedings of International<br />
Conference on Computer Vision (ICCV), December 2015.<br />
<br />
Louizos, C. and Welling, M. Multiplicative normalizing<br />
flows for variational bayesian neural networks. arXiv<br />
preprint arXiv:1703.01961, 2017.<br />
<br />
Louizos, C., Ullrich, K., and Welling, M. Bayesian compression<br />
for deep learning. In Advances in Neural Information<br />
Processing Systems, pp. 3290–3300, 2017.<br />
<br />
Rasmussen, C. E. and Williams, C. K. Gaussian processes<br />
in machine learning. In Advanced lectures on machine<br />
learning, pp. 63–71. Springer, 2004.<br />
<br />
Reed, S., Chen, Y., Paine, T., Oord, A. v. d., Eslami, S.,<br />
J. Rezende, D., Vinyals, O., and de Freitas, N. Few-shot<br />
autoregressive density estimation: Towards learning to<br />
learn distributions. 2017.<br />
<br />
Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic<br />
backpropagation and approximate inference in deep generative<br />
models. arXiv preprint arXiv:1401.4082, 2014.<br />
<br />
Salimbeni, H. and Deisenroth, M. Doubly stochastic variational<br />
inference for deep gaussian processes. In Advances<br />
in Neural Information Processing Systems, pp.<br />
4591–4602, 2017.<br />
<br />
Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and<br />
Lillicrap, T. One-shot learning with memory-augmented<br />
neural networks. arXiv preprint arXiv:1605.06065, 2016.<br />
<br />
Snell, J., Swersky, K., and Zemel, R. Prototypical networks<br />
for few-shot learning. In Advances in Neural Information<br />
Processing Systems, pp. 4080–4090, 2017.<br />
<br />
Snelson, E. and Ghahramani, Z. Sparse gaussian processes<br />
using pseudo-inputs. In Advances in neural information<br />
processing systems, pp. 1257–1264, 2006.<br />
<br />
van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals,<br />
O., Graves, A., et al. Conditional image generation with<br />
pixelcnn decoders. In Advances in Neural Information<br />
Processing Systems, pp. 4790–4798, 2016.<br />
<br />
Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.<br />
Matching networks for one shot learning. In Advances in<br />
Neural Information Processing Systems, pp. 3630–3638,<br />
2016.<br />
<br />
Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H.,<br />
Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and<br />
Botvinick, M. Learning to reinforcement learn. arXiv<br />
preprint arXiv:1611.05763, 2016.<br />
<br />
Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P.<br />
Deep kernel learning. In Artificial Intelligence and Statistics,<br />
pp. 370–378, 2016.<br />
<br />
</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_Bayesian_Perspective_on_Generalization_and_Stochastic_Gradient_Descent&diff=42161A Bayesian Perspective on Generalization and Stochastic Gradient Descent2018-11-30T23:40:31Z<p>Z43ma: </p>
<hr />
<div>==Introduction==<br />
This paper shows that Bayesian principles can explain many recent observations in the deep learning literature and provide practical new insights. This work builds on Zhang et al. (2016), who showed that deep neural networks can easily memorize randomly labelled training data, despite generalizing well on real labels of the same inputs. The authors consider two questions: how can we predict if a minimum will generalize to the test set, and why does stochastic gradient descent find minima that generalize well? <br />
<br />
The paper shows that the same phenomenon occurs even in small linear models. These observations are explained by the Bayesian evidence, which penalizes sharp minima but is invariant to model parameterization. They also demonstrate that, when one holds the learning rate fixed, there is an optimum batch size which maximizes the test set accuracy.<br />
<br />
The authors propose that the noise introduced by small mini-batches drives the parameters towards minima whose evidence is large. Interpreting stochastic gradient descent as a stochastic differential equation, they identify the “noise scale” <math display="inline"> g \approx \epsilon N/B </math>, where <math display="inline">\epsilon</math> is the learning rate, <math display="inline">N</math> the training set size and <math display="inline">B</math> the batch size. Consequently the optimum batch size is proportional to both the learning rate and the size of the training set, <math display="inline">B_{opt} \propto \epsilon N</math>. The authors verify these predictions empirically.<br />
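<br />
The scaling rule implied by the noise scale can be illustrated with a few lines of arithmetic. The function names and the numerical values below are hypothetical; the only relation taken from the text is <math display="inline">g \approx \epsilon N/B</math>.<br />

```python
import math

def noise_scale(learning_rate, train_size, batch_size):
    """Approximate SGD noise scale g = eps * N / B."""
    return learning_rate * train_size / batch_size

def batch_size_for_noise_scale(g, learning_rate, train_size):
    """Batch size that keeps the noise scale at g: B = eps * N / g."""
    return learning_rate * train_size / g

# Hypothetical baseline configuration.
g = noise_scale(learning_rate=0.1, train_size=50_000, batch_size=100)

# Doubling the learning rate, or doubling the training set size,
# requires doubling the batch size to keep g unchanged.
b_double_lr = batch_size_for_noise_scale(g, 0.2, 50_000)
b_double_n = batch_size_for_noise_scale(g, 0.1, 100_000)
```

Holding <math display="inline">g</math> fixed is what makes <math display="inline">B_{opt}</math> scale linearly with both <math display="inline">\epsilon</math> and <math display="inline">N</math>.<br />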
<br />
==Motivation and Related Work==<br />
Zhang et al. (2016) trained deep convolutional networks on ImageNet and CIFAR10, achieving excellent accuracy on both training and test sets. They then took the same input images, but randomized the labels, and found that while their networks were now unable to generalize to the test set, they still memorized the training labels. They claimed these results contradict learning theory, although this claim is disputed (Kawaguchi et al., 2017; Dziugaite & Roy, 2017). Nonetheless, their results raise the question: if our models can assign arbitrary labels to the training set, why do they work so well in practice? <br />
<br />
Meanwhile, Keskar et al. (2016) observed that if we hold the learning rate fixed and increase the batch size, the test accuracy usually falls. This striking result shows improving the estimate of the full-batch gradient can harm performance. Goyal et al. (2017) observed a linear scaling rule between batch size and learning rate in a deep ResNet, while Hoffer et al. (2017) proposed a square root rule on theoretical grounds.<br />
<br />
Many authors have suggested “broad minima” whose curvature is small may generalize better than “sharp minima” whose curvature is large (Chaudhari et al., 2016; Hochreiter & Schmidhuber, 1997). Indeed, Dziugaite & Roy (2017) argued the results of Zhang et al. (2016) can be understood using “nonvacuous” PAC-Bayes generalization bounds which penalize sharp minima, while Keskar et al. (2016) showed stochastic gradient descent (SGD) finds wider minima as the batch size is reduced. However, Dinh et al. (2017) challenged this interpretation, by arguing that the curvature of a minimum can be arbitrarily increased by changing the model parameterization.<br />
<br />
==Contribution==<br />
<br />
The main contributions of this paper are to show that:<br />
* The results of Zhang et al. (2016) are not unique to deep learning; the same phenomenon is observed in a small “over-parameterized” linear model. Overparameterization occurs when a model is able to effectively “remember” training data. This occurs when there are enough parameters that the system of equations ends up with an infinite number of possible solutions. One can see why this over-training would lead to poor results in test cases, as this “memorization” learns noise as opposed to the inherent structure of different classes. It is demonstrated that this phenomenon is straightforwardly understood by evaluating the Bayesian evidence in favor of each model, which penalizes sharp minima but is invariant to the model parameterization.<br />
* SGD integrates a stochastic differential equation whose “noise scale” is <math>g \approx \epsilon N/B</math>, where <math>\epsilon</math> is the learning rate, <math>N</math> is the training set size, and <math>B</math> is the batch size. Noise drives SGD away from sharp minima, and therefore there is an optimal batch size which maximizes the test set accuracy. This optimal batch size is '''proportional to the learning rate and training set size'''.<br />
<br />
Zhang et al. (2016) showed high training competency of neural networks under informative labels, but drastic overfitting on improper labels. This implies weak generalizability even when a small proportion of labels are improper. The authors show that generalization is strongly correlated with the Bayesian evidence, a weighted combination of the depth of a minimum (the cost function) and its breadth (the Occam factor). Bayesians tend to make distributional assumptions on gradient updates by adding isotropic Gaussian noise. This paper builds upon these Bayesian principles by arguing that noise drives SGD away from sharp minima and towards broad minima (the broader the minimum, the better the generalization, since small perturbations of the input have less influence). The stochastic differential equation used as a component of gradient updates effectively serves as injected noise that improves a network's generalizability.<br />
<br />
==Main Results==<br />
<br />
The weakly regularized model memorizes random labels, yet generalizes properly on informative labels; in both cases its predictions are overconfident. The authors also showed that the test accuracy peaks at an optimal batch size, if one holds the other SGD hyper-parameters constant. It is postulated that the optimum represents a tradeoff between depth and breadth in the Bayesian evidence. However, it is the underlying scale of random fluctuations in the SGD dynamics which controls the tradeoff, not the batch size itself. Furthermore, this test accuracy peak shifts as the training set size rises. The authors observed that the best found batch size is proportional to the learning rate. This scaling rule allowed the authors to increase the learning rate by simultaneously increasing the batch size with no loss in test accuracy and no increase in computational cost, so parallelism across multiple GPUs can be leveraged to decrease training time. The scaling rule could also be applied to production models by progressively increasing the batch size as new training data is introduced.<br />
<br />
==Bayesian Model Comparison==<br />
<br />
===Introduction to Bayesian Statistics===<br />
Bayes' theorem is a fundamental theorem in Bayesian statistics, as it is used by Bayesian methods to update probabilities, which are degrees of belief, after obtaining new data. Given two events <math>A</math> and <math>B</math> with <math>P(B) \neq 0</math>, Bayes' theorem expresses the conditional probability of <math>A</math> given <math>B</math> as<br />
\begin{align*}\displaystyle P(A\mid B)={\frac {P(B\mid A)P(A)}{P(B)}}\end{align*}<br />
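<br />
As a small worked example of the formula above, the following sketch computes <math>P(A\mid B)</math> from a prior and two likelihoods. The function name and the numbers are illustrative assumptions, not taken from the paper.<br />

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A | B) from P(A), P(B | A) and P(B | not A) via Bayes' theorem."""
    # P(B) by the law of total probability
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / p_b

# Hypothetical numbers: 1% prior, 90% true-positive rate, 5% false-positive rate
p = posterior(0.01, 0.90, 0.05)
print(round(p, 4))
```

Even with a strong likelihood, the small prior keeps the posterior modest, which is the kind of belief update the theorem formalizes.<br />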
<br />
Bayesian networks are DAGs whose nodes represent variables in the Bayesian sense: they may be observable quantities, latent variables, unknown parameters or hypotheses. Edges represent conditional dependencies; nodes that are not connected (no path connects one node to another) represent variables that are conditionally independent of each other. Each node is associated with a probability function that takes, as input, a particular set of values for the node's parent variables, and gives (as output) the probability (or probability distribution, if applicable) of the variable represented by the node. For example, if <math>m </math> parent nodes represent <math>m </math> Boolean variables then the probability function could be represented by a table of <math>2^{m} </math> entries, one entry for each of the <math>2^{m} </math> possible parent combinations. <br />
<br />
===Bayesian Model Comparison in Neural Networks===<br />
MacKay (1992) applied Bayesian model comparison to neural networks. An overview is presented below. <br />
<br />
We first consider a classification model <math>M </math> with a single parameter <math>\omega </math>, training inputs <math>x </math> and training labels <math>y </math>. We can infer a posterior probability distribution over the parameter by applying Bayes' theorem:<br />
<br />
\begin{align*}P(\omega\mid y,x;M) = \frac{P(y\mid \omega,x;M)P(\omega;M) }{P(y\mid x;M)}\end{align*}<br />
<br />
The likelihood is <math>P(y\mid \omega,x;M) = \prod_i P(y_i\mid \omega,x_i;M) = e^{-H(\omega;M)} </math>, where <math>H(\omega;M) = -\sum_i \ln P(y_i\mid \omega,x_i;M) </math> denotes the cross-entropy of unique categorical labels. Using a Gaussian prior, <math>P(\omega;M) = \sqrt{\lambda/2\pi}\, e^{-\lambda\omega^2/2} </math>, the posterior probability density of the parameter given the training data is <math>P(\omega\mid y,x;M) \propto \sqrt{\lambda/2\pi}\, e^{-C(\omega;M)} </math>, where <math>C(\omega;M) = H(\omega;M) + \lambda\omega^2/2 </math> denotes the L2 regularized cross entropy, or “cost function”, and <math>\lambda </math> is the regularization coefficient. <br />
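<br />
The L2 regularized cost can be sketched in a few lines. This is a minimal illustration that assumes a one-parameter logistic model, <math>P(y=1\mid \omega,x) = 1/(1+e^{-\omega x})</math>; that model, the function names and the toy data are assumptions made here, not the paper's setup.<br />

```python
import math

def cross_entropy(omega, xs, ys):
    """H(omega) = -sum_i ln P(y_i | omega, x_i) for an assumed 1-parameter logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p1 = 1.0 / (1.0 + math.exp(-omega * x))  # P(y = 1 | omega, x)
        total -= math.log(p1 if y == 1 else 1.0 - p1)
    return total

def cost(omega, xs, ys, lam):
    """C(omega) = H(omega) + lam * omega^2 / 2, the L2 regularized cross-entropy."""
    return cross_entropy(omega, xs, ys) + 0.5 * lam * omega ** 2

# Hypothetical toy data: the sign of x determines the label
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]
print(cost(0.0, xs, ys, lam=0.1), cost(1.0, xs, ys, lam=0.1))
```

At <math>\omega = 0</math> every label receives probability 1/2, so the cost equals <math>N\ln 2</math>; a parameter that fits the data lowers the cross-entropy at the price of a larger L2 penalty.<br />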
<br />
The value <math>\omega_0 </math> which minimizes the cost function lies at the maximum of this posterior. To predict an unknown label <math>y_t </math> of a new input <math>x_t </math>, we should compute the integral,<br />
<br />
\begin{align*} P(y_t\mid x_t,y,x;M) &= \int d\omega\, P(y_t\mid \omega,x_t;M)\,P(\omega\mid y,x;M)\\ &= \frac{\int d \omega\, P(y_t \mid \omega ,x_t;M)e^{-C(\omega;M)}}{\int d \omega\, e^{-C(\omega;M)}} \end{align*}<br />
<br />
However, these integrals are dominated by the region near <math>\omega_0 </math> . We usually approximate <math>P(y_t\mid x_t,x,y;M) \approx P(y_t\mid \omega_0,x_t;M) </math>. Having minimized <math>C(\omega;M) </math> to find <math>\omega_0 </math>, we now wish to compare two different models and select the best one. We use the probability ratio<br />
<br />
\begin{align*}\frac{P(M_1\mid y,x)}{P (M_2\mid y, x)} = \frac{P(y\mid x;M_1) P(M_1)}{ P (y\mid x; M_2) P (M_2)} . \end{align*} <br />
<br />
The second factor on the right is the prior ratio, which describes which model is most plausible before seeing the data. To avoid unnecessary subjectivity, we usually set this to 1. Meanwhile the first factor on the right is the evidence ratio, which controls how much the training data changes our prior beliefs.<br />
<br />
Germain et al. (2016) showed that maximizing the evidence (or “marginal likelihood”) minimizes a PAC-Bayes generalization bound. To compute it, we evaluate <br />
\begin{align*}P(y\mid x;M) &= \int d\omega P(y\mid \omega,x;M)P(\omega;M) \\ &=\sqrt{\frac{\lambda}{2\pi}}\int d \omega e^{-C(\omega;M)}\end{align*}<br />
<br />
Notice that the evidence is computed by integrating out the parameters, and consequently it is invariant to the model parameterization. <br />
Since this integral is dominated by the region near the minimum <math>\omega_0 </math>, we can estimate the evidence by Taylor expanding <math>C(\omega; M) \approx C(\omega_0) + C''(\omega_0)(\omega - \omega_0)^2/2</math>. This gives us<br />
<br />
\begin{align*} P(y\mid x;M) &\approx e^{-C(\omega_0)}\sqrt{\frac{\lambda}{2\pi}} \int d \omega\, e^{-C''(\omega_0)(\omega - \omega_0)^2/2}\\ &= \exp \big\{- C(\omega_0)-\frac{1}{2}\ln(C''(\omega_0)/\lambda) \big\}.\end{align*}<br />
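<br />
The Laplace estimate above can be checked numerically in one dimension. The sketch below compares it with a trapezoid-rule evaluation of the evidence integral on a hypothetical quadratic toy cost (chosen here for illustration, since the Gaussian integral is then exact); the function names and constants are assumptions.<br />

```python
import math

def laplace_log_evidence(cost_at_min, curvature, lam):
    """ln P(y|x;M) under the Laplace approximation: -C(w0) - 0.5*ln(C''(w0)/lam)."""
    return -cost_at_min - 0.5 * math.log(curvature / lam)

def numeric_log_evidence(cost, lam, lo=-10.0, hi=10.0, steps=20000):
    """ln of sqrt(lam/2pi) * integral of exp(-C(w)) dw, using the trapezoid rule."""
    h = (hi - lo) / steps
    total = 0.5 * (math.exp(-cost(lo)) + math.exp(-cost(hi)))
    for k in range(1, steps):
        total += math.exp(-cost(lo + k * h))
    return math.log(math.sqrt(lam / (2.0 * math.pi)) * total * h)

# Hypothetical toy cost: H(w) = 0.5*a*(w - b)^2 + h0, plus the L2 term lam*w^2/2
a, b, h0, lam = 4.0, 1.0, 0.5, 1.0
toy_cost = lambda w: 0.5 * a * (w - b) ** 2 + h0 + 0.5 * lam * w ** 2
w0 = a * b / (a + lam)   # minimizer of the toy cost
curvature = a + lam      # second derivative C''(w0)

approx = laplace_log_evidence(toy_cost(w0), curvature, lam)
exact = numeric_log_evidence(toy_cost, lam)
print(approx, exact)
```

For a non-quadratic cost the two values would differ, and the gap indicates how well the quadratic expansion around <math>\omega_0</math> describes the posterior.<br />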
<br />
The evidence is controlled by the value of the cost function at the minimum, and by the logarithm of the ratio of the curvature about this minimum compared to the regularization constant. In models with <math>p</math> parameters, the same expansion gives <br />
\begin{align*} P(y\mid x;M) &\approx \frac{\lambda^{p/2}\, e^{-C(\omega_0)}}{|\nabla\nabla C(\omega)|_{\omega_0}^{1/2}} \\ &= \exp \big\{- C(\omega_0)-\frac{1}{2} \sum_{i=1}^p \ln (\lambda_i/\lambda) \big\},\end{align*}<br />
where <math>|\nabla\nabla C(\omega)|_{\omega_0}</math> is the determinant of the Hessian at the minimum, i.e. the product of its eigenvalues <math>\lambda_i</math>.<br />
<br />
Occam’s factor arises from the log ratio <math>\ln (\lambda_i/\lambda) </math>. The Occam factor describes the fraction of the prior parameter space consistent with the data. Occam’s factor penalizes the amount of information the model must learn about the parameters to accurately model the training data. Since the fraction is always less than one, the authors propose to approximate <math>P(y\mid x;M) </math> away from local minima by only performing the summation over eigenvalues <math>\lambda_i \geq \lambda </math>.<br />
<br />
The authors compare evidence against a null model which assumes the labels are entirely random. This model has no parameters, and so the evidence is controlled by the likelihood alone: <math>P(y\mid x;NULL) = (1/n)^N = e^{-N \ln(n)} </math>, where <math>n </math> denotes the number of model classes and <math>N</math> the number of training labels. The evidence ratio is<br />
\begin{equation*}\frac{P(y\mid x;M) }{P(y\mid x;NULL) } = e ^{-E(\omega_0)} \end{equation*}<br />
where <math>E(\omega_0) = C(\omega_0)+\frac{1}{2} \sum_{i=1}^p \ln (\lambda_i/\lambda) - N\ln (n) </math> is the log evidence ratio in favor of the null model.<br />
The authors assign confidence to the predictions of a model only if <math>E(\omega_0) < 0 </math>.<br />
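<br />
Dividing the Laplace evidence <math>\exp\{-C(\omega_0) - \frac{1}{2}\sum_i \ln(\lambda_i/\lambda)\}</math> by the null-model likelihood <math>e^{-N\ln(n)}</math> gives a quantity that is easy to compute once the Hessian eigenvalues are known. The sketch below is illustrative only: the function name, the <code>clip</code> flag and all numeric values are assumptions, with the clipping standing in for the summation over <math>\lambda_i \geq \lambda</math> described above.<br />

```python
import math

def log_evidence_ratio(cost_at_min, hessian_eigenvalues, lam,
                       num_labels, num_classes, clip=True):
    """E(w0) = C(w0) + 0.5 * sum_i ln(lambda_i / lam) - N * ln(n).

    With clip=True only eigenvalues >= lam enter the sum. With clip=False
    every eigenvalue must be strictly positive, otherwise math.log raises.
    """
    occam = 0.5 * sum(math.log(ev / lam)
                      for ev in hessian_eigenvalues
                      if ev >= lam or not clip)
    return cost_at_min + occam - num_labels * math.log(num_classes)

# Hypothetical values: cost 100 at the minimum, three eigenvalues,
# 800 binary training labels, regularization coefficient 1
E = log_evidence_ratio(100.0, [10.0, 100.0, 0.5], lam=1.0,
                       num_labels=800, num_classes=2)
print(E, "model preferred" if E < 0 else "null model preferred")
```

A negative value means the model is more plausible than random guessing; sharper minima (larger eigenvalues) push the value upward and so weaken the case for the model.<br />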
<br />
The evidence supports the intuition that broad minima generalize better than sharp minima, but unlike the curvature it does not depend on the model parameterization. Dinh et al. (2017) showed one can increase the Hessian eigenvalues by rescaling the parameters, but they must simultaneously rescale the regularization coefficients, otherwise the model changes. Since Occam’s factor arises from the log ratio, <math>\ln (\lambda_i/\lambda) </math> , these two effects cancel out. Note however that while the evidence itself is invariant to model parameterization, one can find reparameterizations which change the approximate evidence after the Laplace approximation. It is difficult to evaluate the evidence for deep networks, as we cannot compute the Hessian of millions of parameters. Additionally, neural networks exhibit many equivalent minima, since we can permute the hidden units without changing the model. To compute the evidence we must carefully account for this “degeneracy”. The authors argue these issues are not a major limitation, since the intuition they build studying the evidence in simple cases will be sufficient to explain the results of both Zhang et al. (2016) and Keskar et al. (2016).<br />
<br />
==Bayes Theorem and Generalization==<br />
Zhang et al. (2016) showed that deep neural networks generalize well on training inputs with informative labels, but the same model can overfit on the same input images when the labels are randomized, perfectly memorizing the training set. To demonstrate that these observations are not unique to deep networks, the authors use logistic regression. They form a small balanced training set comprising 800 images from MNIST, of which half have true label “0” and half true label “1”. The test set is balanced, comprising 5000 MNIST images of zeros and 5000 MNIST images of ones. There are two tasks. In the first task, the labels of both the training and test sets are randomized. In the second task, the labels are informative, matching the true MNIST labels. The model has 784 weights and 1 bias.<br />
<br />
The accuracy of the model predictions on both the training and test sets is shown in figure 1. When trained on the informative labels, the model generalizes well to the test set, so long as it is weakly regularized. However, the model also perfectly memorizes the random labels, replicating the observations of Zhang et al. (2016) in deep networks. No significant improvement in model performance is observed as the regularization coefficient increases. For completeness, we also evaluate the mean margin between training examples and the decision boundary. For both random and informative labels, the margin drops significantly as we reduce the regularization coefficient. When weakly regularized, the mean margin is roughly 50% larger for informative labels than for random labels.<br />
<br />
[[File:bg1.png|800px|thumb|center|]]<br />
<br />
Now consider figure 2, where we plot the mean cross-entropy of the model predictions, evaluated on both training and test sets, as well as the Bayesian log evidence ratio defined in the previous section. Looking first at the random label experiment in figure 2a, while the cross-entropy on the training set vanishes when the model is weakly regularized, the cross-entropy on the test set explodes. Not only does the model make random predictions, but it is extremely confident in those predictions. As the regularization coefficient is increased the test set cross-entropy falls, settling at <math>ln(2)</math>, the cross-entropy of assigning equal probability to both classes. Now consider the Bayesian evidence, which we evaluate on the training set. The log evidence ratio is large and positive when the model is weakly regularized, indicating that the model is exponentially less plausible than assigning equal probabilities to each class. As the regularization parameter is increased, the log evidence ratio falls, but it is always positive, indicating that the model can never be expected to generalize well.<br />
Now consider figure 2b (informative labels). Once again, the training cross-entropy falls to zero when the model is weakly regularized, while the test cross-entropy is high. Even though the model makes accurate predictions, those predictions are overconfident. As the regularization coefficient increases, the test cross-entropy falls below <math>ln(2)</math>, indicating that the model is successfully generalizing to the test set. Now consider the Bayesian evidence. The log evidence ratio is large and positive when the model is weakly regularized, but as the regularization coefficient increases, the log evidence ratio drops below zero, indicating that the model is exponentially more plausible than assigning equal probabilities to each class. As we further increase the regularization, the log evidence ratio rises to zero while the test cross-entropy rises to <math>ln(2)</math>. Test cross-entropy and Bayesian evidence are strongly correlated, with minima at the same regularization strength.<br />
<br />
Bayesian model comparison has explained our results in a logistic regression. Meanwhile, Krueger et al. (2017) showed the largest Hessian eigenvalue also increased when training on random labels in deep networks, implying the evidence is falling. We conclude that Bayesian model comparison is quantitatively consistent with the results of Zhang et al. (2016) in linear models where we can compute the evidence, and qualitatively consistent with their results in deep networks where we cannot. Dziugaite & Roy (2017) recently demonstrated the results of Zhang et al. (2016) can also be understood by minimising a PAC-Bayes generalization bound which penalizes sharp minima.<br />
[[File:bg2.png|800px|thumb|center|]]<br />
==Bayes Theorem and Stochastic Gradient Descent ==<br />
<br />
We showed above that generalization is strongly correlated with the Bayesian evidence, a weighted combination of the depth of a minimum (the cost function) and its breadth (the Occam factor). Consequently Bayesians often add isotropic Gaussian noise to the gradient (Welling & Teh, 2011). In appendix A, we show this drives the parameters towards broad minima whose evidence is large. The noise introduced by small batch training is not isotropic, and its covariance matrix is a function of the parameter values, but empirically Keskar et al. (2016) found it has similar effects, driving the SGD away from sharp minima. This paper therefore proposes Bayesian principles also account for the “generalization gap”, whereby the test set accuracy often falls as the SGD batch size is increased (holding all other hyper-parameters constant). Since the gradient drives the SGD towards deep minima, while noise drives the SGD towards broad minima, we expect the test set performance to show a peak at an optimal batch size, which balances these competing contributions to the evidence.<br />
We were unable to observe a generalization gap in linear models (since linear models are convex there are no sharp minima to avoid). Instead we consider a shallow neural network with 800 hidden units and RELU hidden activations, trained on MNIST without regularization. We use SGD with a momentum parameter of 0.9. Unless otherwise stated, we use a constant learning rate of 1.0 which does not depend on the batch size or decay during training. Furthermore, we train on just 1000 images, selected at random from the MNIST training set. This enables us to compare small batch to full batch training. We emphasize that we are not trying to achieve optimal performance, but to study a simple model which shows a generalization gap between small and large batch training.<br />
In figure 3, we exhibit the evolution of the test accuracy and test cross-entropy during training. Our small batches are composed of 30 images, randomly sampled from the training set. Looking first at figure 3a, small batch training takes longer to converge, but after a thousand gradient updates a clear generalization gap in model accuracy emerges between small and large training batches. Now consider figure 3b. While the test cross-entropy for small batch training is lower at the end of training, the cross-entropy of both small and large training batches is increasing, indicative of over-fitting. Both models exhibit a minimum test cross-entropy, although after different numbers of gradient updates. Intriguingly, we show in appendix B that the generalization gap between small and large batch training shrinks significantly when we introduce L2 regularization.<br />
<br />
[[File:bg3.png|800px|thumb|center|]]<br />
<br />
From now on we focus on the test set accuracy (since this converges as the number of gradient updates increases). In figure 4a, we exhibit training curves for a range of batch sizes between 1 and 1000. We find that the model cannot train when the batch size <math>B \leq 10</math>. In figure 4b we plot the mean test set accuracy after 10,000 training steps. A clear peak emerges, indicating that there is indeed an optimum batch size which maximizes the test accuracy, consistent with Bayesian intuition. The results of Keskar et al. (2016) focused on the decay in test accuracy above this optimum batch size.<br />
[[File:bg4.png|800px|thumb|center|]]<br />
<br />
==Stochastic Differential Equations and Scaling Rules==<br />
The results shown above indicate that the test accuracy peaks at an optimal batch size, if one holds the other SGD hyper-parameters constant. It is argued that this peak arises from the tradeoff between depth and breadth in the Bayesian evidence. However, it is not the batch size itself which controls this tradeoff, but the underlying scale of random fluctuations in the SGD dynamics. The following content identifies this SGD “noise scale”, and uses it to derive three scaling rules which predict how the optimal batch size depends on the learning rate, training set size and momentum coefficient. <br />
First, interpret the gradient update as the discrete update of a stochastic differential equation <br />
\begin{equation*}\frac{d\omega}{dt} = -\frac{dC}{d\omega} + \eta(t)\end{equation*}<br />
Here <math>\eta(t)</math> represents noise with mean <math>\langle \eta(t) \rangle = 0</math> and covariance <math> \langle \eta (t)\eta (t')\rangle = gF (\omega)\delta (t-t')</math>,<br />
<math>t</math> is a continuous variable, and <math>F(\omega)</math> is a matrix describing the gradient covariances.<br />
The SGD noise scale is taken to be <math>g \approx \epsilon N/B</math>, where <math>\epsilon</math> is the learning rate, <math>N</math> the training set size and <math>B</math> the batch size.<br />
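<br />
To make the role of the noise scale concrete, the following sketch integrates a one-dimensional version of such an equation with the Euler–Maruyama scheme, using the descent drift <math>-dC/d\omega</math>. It assumes a scalar parameter and a constant gradient covariance <math>F</math>, whereas in the paper <math>F(\omega)</math> is a matrix that depends on the parameters; the function name and all values are illustrative.<br />

```python
import math
import random

def simulate_sgd_sde(grad, omega0, g, F, dt, steps, seed=0):
    """Euler-Maruyama integration of d(omega)/dt = -dC/d(omega) + eta(t),
    where <eta(t) eta(t')> = g * F * delta(t - t') and F is a constant scalar."""
    rng = random.Random(seed)
    omega = omega0
    for _ in range(steps):
        # Noise integrated over one step of length dt has variance g * F * dt
        noise = math.sqrt(g * F * dt) * rng.gauss(0.0, 1.0)
        omega += -grad(omega) * dt + noise
    return omega

# Hypothetical quadratic cost C(w) = w^2 / 2, so dC/dw = w
quadratic_grad = lambda w: w
no_noise = simulate_sgd_sde(quadratic_grad, omega0=1.0, g=0.0, F=1.0, dt=0.1, steps=200)
with_noise = simulate_sgd_sde(quadratic_grad, omega0=1.0, g=0.5, F=1.0, dt=0.1, steps=200)
print(no_noise, with_noise)
```

With <math>g = 0</math> the trajectory settles at the minimum, while a larger <math>g</math> keeps the parameter fluctuating around it, which is the mechanism the text associates with escaping sharp minima.<br />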
[[File:bg5.png|800px|thumb|center|]]<br />
[[File:bg6.png|800px|thumb|center|]]<br />
[[File:bg7.png|800px|thumb|center|]]<br />
The noise scale falls when the batch size <math>B</math> increases, consistent with our earlier observation of an optimal batch size <math>B_{opt}</math> while holding the other hyper-parameters fixed. Notice that one would equivalently observe an optimal learning rate if one held the batch size constant. A similar analysis of the SGD was recently performed by Mandt et al. (2017), although their treatment only holds near local minima where the covariances <math>F(\omega)</math> are stationary. Our analysis holds throughout training, which is necessary since Keskar et al. (2016) found that the beneficial influence of noise was most pronounced at the start of training.<br />
When we vary the learning rate or the training set size, we should keep the noise scale fixed, which implies that <math>B_{opt} \propto \epsilon N</math>. In figure 5a, we plot the test accuracy as a function of batch size after <math>(10000/\epsilon)</math> training steps, for a range of learning rates. Exactly as predicted, the peak moves to the right as <math>\epsilon</math> increases. Additionally, the peak test accuracy achieved at a given learning rate does not begin to fall until <math>\epsilon \sim 3</math>, indicating that there is no significant discretization error in integrating the stochastic differential equation below this point. Above this point, the discretization error begins to dominate and the peak test accuracy falls rapidly. In figure 5b, we plot the best observed batch size as a function of learning rate, observing a clear linear trend, <math>B_{opt} \propto \epsilon</math>. The error bars indicate the distance from the best observed batch size to the next batch size sampled in our experiments.<br />
<br />
This scaling rule allows us to increase the learning rate with no loss in test accuracy and no increase in computational cost, simply by simultaneously increasing the batch size. We can then exploit increased parallelism across multiple GPUs, reducing model training times (Goyal et al., 2017). A similar scaling rule was independently proposed by Jastrzebski et al. (2017) and Chaudhari & Soatto (2017), although neither work identifies the existence of an optimal noise scale. A number of authors have proposed adjusting the batch size adaptively during training (Friedlander & Schmidt, 2012; Byrd et al., 2012; De et al., 2017), while Balles et al. (2016) proposed linearly coupling the learning rate and batch size within this framework. In Smith et al. (2017), we show empirically that decaying the learning rate during training and increasing the batch size during training are equivalent.<br />
In figure 6a we exhibit the test set accuracy as a function of batch size, for a range of training set sizes after 10000 steps (<math>\epsilon = 1</math> everywhere). Once again, the peak shifts right as the training set size rises, although the generalization gap becomes less pronounced as the training set size increases. In figure 6b, we plot the best observed batch size as a function of training set size, observing another linear trend, <math>B_{opt} \propto N</math>. This scaling rule could be applied to production models, progressively growing the batch size as new training data is collected. We expect production datasets to grow considerably over time, and consequently large batch training is likely to become increasingly common.<br />
Finally, for SGD with momentum the noise scale becomes <math>g \approx \epsilon N/(B(1-m))</math>, where <math>m</math> denotes the momentum coefficient; this reduces to the noise scale of conventional SGD as <math>m \rightarrow 0</math>. When <math>m > 0</math>, we obtain an additional scaling rule <math>B_{opt} \propto 1/(1 - m)</math>. This scaling rule predicts that the optimal batch size will increase when the momentum coefficient is increased. In figure 7a we plot the test set performance as a function of batch size after 10000 gradient updates (<math>\epsilon = 1</math> everywhere), for a range of momentum coefficients. In figure 7b, we plot the best observed batch size as a function of the momentum coefficient, and fit our results to the scaling rule above, obtaining remarkably good agreement.<br />
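<br />
The three scaling rules can be collected into one helper that keeps the noise scale fixed when hyper-parameters change. This is a sketch assuming the momentum form of the noise scale, <math>g \approx \epsilon N/(B(1-m))</math>; the function names and the starting configuration are hypothetical.<br />

```python
def noise_scale(lr, train_size, batch_size, momentum=0.0):
    """g ~ lr * N / (B * (1 - m)); momentum=0 recovers g ~ lr * N / B."""
    return lr * train_size / (batch_size * (1.0 - momentum))

def rescaled_batch_size(batch_size, lr, train_size, momentum,
                        new_lr, new_train_size, new_momentum):
    """Batch size that keeps the noise scale unchanged under the new settings."""
    g = noise_scale(lr, train_size, batch_size, momentum)
    return new_lr * new_train_size / (g * (1.0 - new_momentum))

# Hypothetical starting point: B = 30, lr = 1.0, N = 1000, no momentum
double_lr = rescaled_batch_size(30, 1.0, 1000, 0.0, 2.0, 1000, 0.0)
double_data = rescaled_batch_size(30, 1.0, 1000, 0.0, 1.0, 2000, 0.0)
add_momentum = rescaled_batch_size(30, 1.0, 1000, 0.0, 1.0, 1000, 0.9)
print(double_lr, double_data, add_momentum)
```

Doubling the learning rate or the training set size doubles the suggested batch size, and moving from no momentum to <math>m = 0.9</math> multiplies it by ten. The result is a real number and would be rounded to an integer batch size in practice.<br />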
<br />
==Critiques==<br />
<br />
#Bayesian statistics is not, at present, provably a theory that can explain why a learning algorithm works. The Bayesian theory is too optimistic: we introduce a prior and model and then trust both implicitly. Relative to any particular prior and model (likelihood), the Bayesian posterior is the optimal summary of the data, but if either part is misspecified, then the Bayesian posterior carries no optimality guarantee. The prior is chosen for convenience here. <br />
#There is no discussion of the information bottleneck analysis, which also addresses the generalization ability of the model. <br />
#There is no discussion of real online learning with streaming data, where the total number of data points is unknown.<br />
#The paper presents how mini-batch noise in SGD can improve the performance of neural networks. However, the usefulness of the approach could be analyzed in greater detail if the authors reported performance on a variety of well-known real-world datasets.<br />
<br />
==Conclusion==<br />
<br />
The paper showed that mini-batch noise helps SGD move away from sharp minima, and provided evidence that there is an optimal batch size which maximizes the test accuracy. Based on interpreting SGD as integrating a stochastic differential equation, this batch size is proportional to the learning rate and the training set size. Moreover, the authors showed that <math>B_{opt} \propto 1/(1 - m) </math>, where <math>m</math> is the momentum coefficient. Further analysis of the relation between the learning rate, effective learning rate, and batch size is presented in a follow-up ICLR 2018 paper, where the authors showed experimentally that the benefits of decaying the learning rate can be obtained by instead increasing the batch size, which also dramatically reduces the number of parameter updates; they were also able to use hyper-parameters from the literature without any further tuning (Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, Quoc V. Le).<br />
<br />
==References==<br />
<br />
#Alessandro Achille and Stefano Soatto. On the emergence of invariance and disentangling in deep representations. arXiv preprint arXiv:1706.01350, 2017.<br />
#Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086, 2016.<br />
#Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012. <br />
#Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. arXiv preprint arXiv:1710.11029, 2017.<br />
#Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.<br />
#Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.<br />
#Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933, 2017.<br />
#Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.<br />
#Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.<br />
#Crispin W Gardiner. Handbook of Stochastic Methods, volume 4. Springer Berlin, 1985.<br />
#Pascal Germain, Francis Bach, Alexandre Lacoste, and Simon Lacoste-Julien. PAC-bayesian theory meets bayesian inference. In Advances in Neural Information Processing Systems, pp. 1884–1892, 2016.<br />
#Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.<br />
#Stephen F Gull. Bayesian inductive inference and maximum entropy. In Maximum-entropy and Bayesian methods in science and engineering, pp. 53–74. Springer, 1988.<br />
#Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pp. 5–13. ACM, 1993.<br />
#Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.<br />
#Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.<br />
#Stanisław Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.<br />
#Robert E Kass and Adrian E Raftery. Bayes factors. Journal of the american statistical association, 90(430):773–795, 1995.<br />
#Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017.<br />
#Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.<br />
#David Krueger, Nicolas Ballas, Stanislaw Jastrzebski, Devansh Arpit, Maxinder S Kanwal, Tegan Maharaj, Emmanuel Bengio, Asja Fischer, and Aaron Courville. Deep nets don’t learn via memorization. ICLR Workshop, 2017.<br />
#Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, pp. 2101–2110, 2017.<br />
#David JC MacKay. A practical bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992.<br />
#Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.<br />
#Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.<br />
#Samuel L. Smith, Pieter-Jan Kindermans, and Quoc V. Le. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.<br />
#Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.<br />
#Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_Bayesian_Perspective_on_Generalization_and_Stochastic_Gradient_Descent&diff=42160A Bayesian Perspective on Generalization and Stochastic Gradient Descent2018-11-30T23:40:08Z<p>Z43ma: </p>
<hr />
<div>==Introduction==<br />
This paper shows that Bayesian principles can explain many recent observations in the deep learning literature, and provide practical new insights. This work builds on Zhang et al. (2016), who showed deep neural networks can easily memorize randomly labelled training data, despite generalizing well on real labels of the same inputs. The authors consider two questions: how can we predict if a minimum will generalize to the test set, and why does stochastic gradient descent find minima that generalize well? <br />
<br />
The paper shows that the same phenomenon occurs even in small linear models. These observations are explained by the Bayesian evidence, which penalizes sharp minima but is invariant to model parameterization. They also demonstrate that, when one holds the learning rate fixed, there is an optimum batch size which maximizes the test set accuracy.<br />
<br />
The authors propose that the noise introduced by small mini-batches drives the parameters towards minima whose evidence is large. Interpreting stochastic gradient descent as a stochastic differential equation, the authors identify the “noise scale” <math display="inline"> g \approx \epsilon N/B </math> where <math display="inline">\epsilon</math> is the learning rate, <math display="inline">N</math> the training set size and <math display="inline">B</math> the batch size. Consequently the optimum batch size is proportional to both the learning rate and the size of the training set, <math display="inline">B_{opt} \propto \epsilon N</math>. The authors verify these predictions empirically.<br />
<br />
==Motivation and Related Work==<br />
Zhang et al. (2016) trained deep convolutional networks on ImageNet and CIFAR10, achieving excellent accuracy on both training and test sets. They then took the same input images, but randomized the labels, and found that while their networks were now unable to generalize to the test set, they still memorized the training labels. They claimed these results contradict learning theory, although this claim is disputed (Kawaguchi et al., 2017; Dziugaite & Roy, 2017). Nonetheless, their results raise the question: if our models can assign arbitrary labels to the training set, why do they work so well in practice? <br />
<br />
Meanwhile, Keskar et al. (2016) observed that if we hold the learning rate fixed and increase the batch size, the test accuracy usually falls. This striking result shows improving the estimate of the full-batch gradient can harm performance. Goyal et al. (2017) observed a linear scaling rule between batch size and learning rate in a deep ResNet, while Hoffer et al. (2017) proposed a square root rule on theoretical grounds.<br />
<br />
Many authors have suggested “broad minima” whose curvature is small may generalize better than “sharp minima” whose curvature is large (Chaudhari et al., 2016; Hochreiter & Schmidhuber, 1997). Indeed, Dziugaite & Roy (2017) argued the results of Zhang et al. (2016) can be understood using “nonvacuous” PAC-Bayes generalization bounds which penalize sharp minima, while Keskar et al. (2016) showed stochastic gradient descent (SGD) finds wider minima as the batch size is reduced. However, Dinh et al. (2017) challenged this interpretation, by arguing that the curvature of a minimum can be arbitrarily increased by changing the model parameterization.<br />
<br />
==Contribution==<br />
<br />
The main contributions of this paper are to show that:<br />
* The results of Zhang et al. (2016) are not unique to deep learning; the same phenomenon is observed in a small “over-parameterized” linear model. A model is over-parameterized when it has enough parameters to effectively “remember” the training data, so that the system of equations it must satisfy admits infinitely many solutions. Such memorization can generalize poorly, because it fits noise rather than the inherent structure of the classes. The phenomenon is straightforwardly understood by evaluating the Bayesian evidence in favor of each model, which penalizes sharp minima but is invariant to the model parameterization.<br />
* SGD integrates a stochastic differential equation whose “noise scale” is <math>g \approx \epsilon N/B</math>, where <math>\epsilon</math> is the learning rate, <math>N</math> the training set size, and <math>B</math> the batch size. Noise drives SGD away from sharp minima, and therefore there is an optimal batch size which maximizes the test set accuracy. This optimal batch size is '''proportional to the learning rate and training set size'''.<br />
<br />
Zhang et al. (2016) showed that neural networks train well on informative labels, but overfit drastically on improper labels. This implies weak generalizability even when a small proportion of labels are improper. The authors show that generalization is strongly correlated with the Bayesian evidence, a weighted combination of the depth of a minimum (the cost function) and its breadth (the Occam factor). Bayesians tend to make distributional assumptions on gradient updates by adding isotropic Gaussian noise. This paper builds upon these Bayesian principles by arguing that noise drives SGD away from sharp minima and towards broad minima (the broader the minimum, the better the generalization, since small perturbations of the input have less influence). The noise term in the stochastic differential equation that describes the gradient updates effectively serves as injected noise that improves a network's generalizability.<br />
<br />
==Main Results==<br />
<br />
The weakly regularized model memorizes random labels but generalizes properly on informative labels, although its predictions are overconfident. The authors also show that the test accuracy peaks at an optimal batch size if one holds the other SGD hyper-parameters constant. It is postulated that this optimum represents a tradeoff between depth and breadth in the Bayesian evidence. However, it is the underlying scale of random fluctuations in the SGD dynamics which controls the tradeoff, not the batch size itself. Furthermore, the test accuracy peak shifts as the training set size rises. The authors observe that the best batch size found is proportional to the learning rate. This scaling rule allows the learning rate to be increased with no loss in test accuracy and no increase in computational cost by simultaneously increasing the batch size, so parallelism across multiple GPUs can be exploited to decrease training time. The scaling rule could also be applied to production models by progressively increasing the batch size as new training data is introduced.<br />
<br />
==Bayesian Model Comparison==<br />
<br />
===Introduction to Bayesian Statistics===<br />
Bayes' theorem is a fundamental theorem in Bayesian statistics: Bayesian methods use it to update probabilities, which are degrees of belief, after obtaining new data. Given two events <math>A</math> and <math>B</math> with <math>P(B) \neq 0</math>, Bayes' theorem states that the conditional probability of <math>A</math> given <math>B</math> is<br />
\begin{align*}\displaystyle P(A\mid B)={\frac {P(B\mid A)P(A)}{P(B)}}\end{align*}<br />
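As a quick numerical illustration of the formula above, the following sketch applies Bayes' theorem to a made-up two-event scenario. The function name <code>posterior</code> and the numbers (a 1% prior, <math>P(B\mid A) = 0.95</math>, <math>P(B\mid \neg A) = 0.05</math>) are assumptions chosen only for illustration; they do not come from the paper.<br />

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) using Bayes' theorem.

    P(B) is obtained from the law of total probability:
    P(B) = P(B|A) P(A) + P(B|not A) P(not A).
    """
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / p_b


# Hypothetical numbers: a rare event A (1% prior) and a fairly reliable signal B.
p_a_given_b = posterior(prior_a=0.01, p_b_given_a=0.95, p_b_given_not_a=0.05)
print(round(p_a_given_b, 3))  # roughly 0.161: the posterior stays modest because A is rare
```

Even with a reliable signal, the small prior keeps the posterior well below one half, which is the kind of belief update the theorem formalizes.<br />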
<br />
Bayesian networks are DAGs whose nodes represent variables in the Bayesian sense: they may be observable quantities, latent variables, unknown parameters or hypotheses. Edges represent conditional dependencies; nodes that are not connected (no path connects one node to another) represent variables that are conditionally independent of each other. Each node is associated with a probability function that takes, as input, a particular set of values for the node's parent variables, and gives (as output) the probability (or probability distribution, if applicable) of the variable represented by the node. For example, if <math>m </math> parent nodes represent <math>m </math> Boolean variables then the probability function could be represented by a table of <math>2^{m} </math> entries, one entry for each of the <math>2^{m} </math> possible parent combinations. <br />
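The table of <math>2^{m}</math> entries mentioned above can be sketched as a dictionary keyed by parent assignments. This is only an illustration: the helper names <code>make_cpt</code> and <code>noisy_or</code>, and the leak and strength values, are hypothetical and not taken from the paper.<br />

```python
from itertools import product


def make_cpt(m, prob_fn):
    """Build a conditional probability table for a Boolean node with m Boolean parents.

    prob_fn maps a tuple of parent values to P(node = True | parents).
    The table has one entry per parent combination, i.e. 2**m entries.
    """
    return {parents: prob_fn(parents) for parents in product([False, True], repeat=m)}


def noisy_or(parents, leak=0.1, strength=0.8):
    """Hypothetical rule: each active parent independently switches the node on."""
    p_off = 1.0 - leak
    for value in parents:
        if value:
            p_off *= 1.0 - strength
    return 1.0 - p_off


cpt = make_cpt(3, noisy_or)
print(len(cpt))  # 8 entries for m = 3 parents
```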
<br />
===Bayesian Model Comparison in Neural Networks===<br />
MacKay (1992) applied Bayesian model comparison to neural networks. An overview is presented below. <br />
<br />
We first consider a classification model <math>M </math> with a single parameter <math>\omega </math>, training inputs <math>x </math> and training labels <math>y </math>. We can infer a posterior probability distribution over the parameter by applying Bayes' theorem:<br />
<br />
\begin{align*}P(\omega\mid y,x;M) = \frac{P(y\mid \omega,x;M)P(\omega;M) }{P(y\mid x;M)}\end{align*}<br />
<br />
The likelihood is <math>P(y\mid \omega,x;M) = \prod_i P(y_i\mid \omega,x_i;M) = e^{-H(\omega;M)} </math>, where <math>H(\omega;M) = -\sum_i \ln P(y_i\mid \omega,x_i;M) </math> denotes the cross-entropy of unique categorical labels. Using a Gaussian prior, <math>P(\omega;M) = \sqrt{\lambda/2\pi}\, e^{-\lambda\omega^2/2} </math>, the posterior probability density of the parameter given the training data is <math>P(\omega\mid y,x;M) \propto \sqrt{\lambda/2\pi}\, e^{-C(\omega;M)} </math>, where <math>C(\omega;M) = H(\omega;M) + \lambda\omega^2/2 </math> denotes the L2 regularized cross-entropy, or “cost function”, and <math>\lambda </math> is the regularization coefficient.<br />
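To make the relation between likelihood, cross-entropy and cost function concrete, here is a minimal sketch for a one-parameter logistic model <math>P(y=1\mid \omega,x) = \sigma(\omega x)</math>. The choice of a logistic model, the toy data and all function names are assumptions made for illustration; the text above only requires some classification model with a single parameter.<br />

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(w, xs, ys):
    """H(w) = -sum_i ln P(y_i | w, x_i) for the one-parameter logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total


def likelihood(w, xs, ys):
    """Product of the per-example label probabilities."""
    prod = 1.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x)
        prod *= p if y == 1 else 1.0 - p
    return prod


def cost(w, xs, ys, lam):
    """C(w) = H(w) + lam * w**2 / 2, the L2 regularized cross-entropy."""
    return cross_entropy(w, xs, ys) + 0.5 * lam * w * w


# Hypothetical toy data: negative inputs are class 0, positive inputs class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]
w = 0.7
print(likelihood(w, xs, ys), math.exp(-cross_entropy(w, xs, ys)))  # the two agree
```

The printed pair illustrates that the likelihood equals <math>e^{-H}</math>, so maximizing the posterior is the same as minimizing the cost function.<br />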
<br />
The value <math>\omega_0 </math> which minimizes the cost function lies at the maximum of this posterior. To predict an unknown label <math>y_t </math> of a new input <math>x_t </math>, we should compute the integral,<br />
<br />
\begin{align*} P(y_t\mid x_t,y,x;M) &= \int d\omega\, P(y_t\mid \omega,x_t;M)\,P(\omega\mid y,x;M)\\ &= \frac{\int d \omega\, P(y_t \mid \omega ,x_t;M)e^{-C(\omega;M)}}{\int d \omega\, e^{-C(\omega;M)}} \end{align*}<br />
<br />
However, these integrals are dominated by the region near <math>\omega_0 </math>, so we usually approximate <math>P(y_t\mid x_t,y,x;M) \approx P(y_t\mid \omega_0,x_t;M) </math>. Having minimized <math>C(\omega;M) </math> to find <math>\omega_0 </math>, we now wish to compare two different models and select the best one. We use the probability ratio<br />
<br />
\begin{align*}\frac{P(M_1\mid y,x)}{P (M_2\mid y, x)} = \frac{P(y\mid x;M_1) P(M_1)}{ P (y\mid x; M_2) P (M_2)} . \end{align*} <br />
<br />
The second factor on the right is the prior ratio, which describes which model is most plausible. To avoid unnecessary subjectivity, we usually set this to 1. Meanwhile the first factor on the right is the evidence ratio, which controls how much the training data changes our prior beliefs.<br />
<br />
Germain et al. (2016) showed that maximizing the evidence (or “marginal likelihood”) minimizes a PAC-Bayes generalization bound. To compute it, we evaluate <br />
\begin{align*}P(y\mid x;M) &= \int d\omega\, P(y\mid \omega,x;M)P(\omega;M) \\ &=\sqrt{\frac{\lambda}{2\pi}}\int d \omega\, e^{-C(\omega;M)}\end{align*}<br />
<br />
Notice that the evidence is computed by integrating out the parameters, and consequently it is invariant to the model parameterization.<br />
Since this integral is dominated by the region near the minimum <math>\omega_0 </math>, we can estimate the evidence by Taylor expanding <math>C(\omega; M) \approx C(\omega_0) + C''(\omega_0)(\omega - \omega_0)^2/2</math>. This gives us<br />
<br />
\begin{align*} P(y\mid x;M) &\approx e^{-C(\omega_0)}\sqrt{\frac{\lambda}{2\pi}} \int d \omega\, e^{-C''(\omega_0)(\omega - \omega_0)^2/2}\\ &= \exp \big\{- C(\omega_0)-\frac{1}{2}\ln(C''(\omega_0)/\lambda) \big\}.\end{align*}<br />
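The Gaussian integral in this step can be checked numerically. The sketch below compares a trapezoid-rule estimate of <math>\sqrt{\lambda/2\pi}\int d\omega\, e^{-C(\omega)}</math> with the closed form <math>\exp\{-C(\omega_0) - \tfrac{1}{2}\ln(C''(\omega_0)/\lambda)\}</math> for a cost that is exactly quadratic, in which case the two should agree. The function names, the integration range and the toy values of <math>C(\omega_0)</math>, <math>C''(\omega_0)</math> and <math>\lambda</math> are assumptions for illustration only.<br />

```python
import math


def evidence_numeric(cost_fn, lam, lo=-20.0, hi=20.0, steps=40_000):
    """sqrt(lam / (2 pi)) * integral of exp(-C(w)) dw, using the trapezoid rule."""
    h = (hi - lo) / steps
    total = 0.5 * (math.exp(-cost_fn(lo)) + math.exp(-cost_fn(hi)))
    for k in range(1, steps):
        total += math.exp(-cost_fn(lo + k * h))
    return math.sqrt(lam / (2.0 * math.pi)) * total * h


def evidence_laplace(cost_at_min, curvature, lam):
    """exp(-C(w0) - 0.5 * ln(C''(w0) / lam))."""
    return math.exp(-cost_at_min - 0.5 * math.log(curvature / lam))


# Hypothetical quadratic cost: C(w) = 1.5 + 4 * (w - 0.3)**2 / 2, with lam = 0.5.
c0, curvature, w0, lam = 1.5, 4.0, 0.3, 0.5


def quadratic_cost(w):
    return c0 + 0.5 * curvature * (w - w0) ** 2


numeric = evidence_numeric(quadratic_cost, lam)
laplace = evidence_laplace(c0, curvature, lam)
print(numeric, laplace)  # the two estimates agree closely for a quadratic cost
```

For a cost that is not exactly quadratic the closed form is only an approximation, whose quality depends on how sharply the integrand is peaked around <math>\omega_0</math>.<br />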
<br />
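The evidence is controlled by the value of the cost function at the minimum, and by the logarithm of the ratio of the curvature about this minimum compared to the regularization constant. In models with <math>p</math> parameters, the same expansion gives<br />
\begin{align*} P(y\mid x;M) &\approx e^{-C(\omega_0)}\left(\frac{\lambda}{2\pi}\right)^{p/2} \int d^p \omega\, e^{-(\omega - \omega_0)^T \nabla\nabla C(\omega_0) (\omega - \omega_0)/2} \\ &= \frac{\lambda^{p/2} e^{-C(\omega_0)}}{|\nabla\nabla C(\omega_0)|^{1/2}} = \exp \big\{- C(\omega_0)-\frac{1}{2} \sum_{i=1}^p \ln (\lambda_i/\lambda) \big\},\end{align*}<br />
where <math>\lambda_i</math> are the eigenvalues of the Hessian <math>\nabla\nabla C(\omega_0)</math>.<br />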
<br />
Occam’s factor arises from the log ratio <math>\ln (\lambda_i/\lambda) </math>. The Occam factor describes the fraction of the prior parameter space consistent with the data. It penalizes the amount of information the model must learn about the parameters to accurately model the training data. Since the fraction is always less than one, the authors propose to approximate <math>P(y\mid x;M) </math> away from local minima by only performing the summation over eigenvalues <math>\lambda_i \geq \lambda </math>.<br />
<br />
The authors compare the evidence against a null model which assumes the labels are entirely random. This model has no parameters, and so the evidence is controlled by the likelihood alone: <math>P(y\mid x;NULL) = (1/n)^N = e^{-N \ln(n)} </math>, where <math>n </math> denotes the number of model classes and <math>N</math> the number of training labels. The evidence ratio is<br />
\begin{equation*}\frac{P(y\mid x;M) }{P(y\mid x;NULL) } = e ^{-E(\omega_0)}, \end{equation*}<br />
where <math>E(\omega_0) = C(\omega_0)+\frac{1}{2} \sum_{i=1}^p \ln (\lambda_i/\lambda) - N\ln (n) </math> is the log evidence ratio in favor of the null model.<br />
The authors assign confidence to the predictions of a model only if <math>E(\omega_0) < 0 </math>.<br />
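Dividing the Laplace estimate of the evidence, <math>\exp\{-C(\omega_0) - \tfrac{1}{2}\sum_i \ln(\lambda_i/\lambda)\}</math>, by the null-model evidence <math>e^{-N\ln(n)}</math> gives <math>E(\omega_0) = C(\omega_0) + \tfrac{1}{2}\sum_i \ln(\lambda_i/\lambda) - N\ln(n)</math>. The sketch below evaluates this quantity from a list of Hessian eigenvalues, summing only over <math>\lambda_i \geq \lambda</math> as described above. The function name and all numerical inputs are hypothetical and chosen only to show the bookkeeping.<br />

```python
import math


def log_evidence_ratio(cost_at_min, hessian_eigs, lam, num_labels, num_classes):
    """E(w0) = C(w0) + 0.5 * sum_i ln(lambda_i / lam) - N * ln(n).

    Only eigenvalues with lambda_i >= lam contribute to the Occam term.
    A negative value favors the model over the null model.
    """
    occam = 0.5 * sum(math.log(e / lam) for e in hessian_eigs if e >= lam)
    return cost_at_min + occam - num_labels * math.log(num_classes)


# Hypothetical values: cost 100 at the minimum, three Hessian eigenvalues,
# regularization coefficient 1, and N = 800 labels over n = 2 classes.
E = log_evidence_ratio(100.0, [2.0, 4.0, 0.5], lam=1.0, num_labels=800, num_classes=2)
print(E < 0)  # True here: the model is preferred over the null model
```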
<br />
The evidence supports the intuition that broad minima generalize better than sharp minima, but unlike the curvature it does not depend on the model parameterization. Dinh et al. (2017) showed one can increase the Hessian eigenvalues by rescaling the parameters, but they must simultaneously rescale the regularization coefficients, otherwise the model changes. Since Occam’s factor arises from the log ratio, <math>\ln (\lambda_i/\lambda) </math> , these two effects cancel out. Note however that while the evidence itself is invariant to model parameterization, one can find reparameterizations which change the approximate evidence after the Laplace approximation. It is difficult to evaluate the evidence for deep networks, as we cannot compute the Hessian of millions of parameters. Additionally, neural networks exhibit many equivalent minima, since we can permute the hidden units without changing the model. To compute the evidence we must carefully account for this “degeneracy”. The authors argue these issues are not a major limitation, since the intuition they build studying the evidence in simple cases will be sufficient to explain the results of both Zhang et al. (2016) and Keskar et al. (2016).<br />
<br />
==Bayes Theorem and Generalization==<br />
Zhang et al. (2016) showed that deep neural networks generalize well on training inputs with informative labels, but the same model can overfit on the same input images when the labels are randomized, perfectly memorizing the training set. To demonstrate that these observations are not unique to deep networks, the authors use logistic regression. They form a small balanced training set comprising 800 images from MNIST, of which half have true label “0” and half true label “1”. The test set is balanced, comprising 5000 MNIST images of zeros and 5000 MNIST images of ones. There are two tasks. In the first task, the labels of both the training and test sets are randomized. In the second task, the labels are informative, matching the true MNIST labels. The model has 784 weights and 1 bias.<br />
<br />
The accuracy of the model predictions on both the training and test sets is shown in figure 1. When trained on the informative labels, the model generalizes well to the test set, so long as it is weakly regularized. However, the model also perfectly memorizes the random labels, replicating the observations of Zhang et al. (2016) in deep networks. No significant improvement in model performance is observed as the regularization coefficient increases. For completeness, the authors also evaluate the mean margin between training examples and the decision boundary. For both random and informative labels, the margin drops significantly as the regularization coefficient is reduced. When weakly regularized, the mean margin is roughly 50% larger for informative labels than for random labels.<br />
<br />
[[File:bg1.png|800px|thumb|center|]]<br />
<br />
Now consider figure 2, which plots the mean cross-entropy of the model predictions, evaluated on both training and test sets, as well as the Bayesian log evidence ratio defined in the previous section. Looking first at the random label experiment in figure 2a, while the cross-entropy on the training set vanishes when the model is weakly regularized, the cross-entropy on the test set explodes. Not only does the model make random predictions, but it is extremely confident in those predictions. As the regularization coefficient is increased the test set cross-entropy falls, settling at <math>\ln(2)</math>, the cross-entropy of assigning equal probability to both classes. Now consider the Bayesian evidence, which is evaluated on the training set. The log evidence ratio is large and positive when the model is weakly regularized, indicating that the model is exponentially less plausible than assigning equal probabilities to each class. As the regularization parameter is increased, the log evidence ratio falls, but it is always positive, indicating that the model can never be expected to generalize well.<br />
Now consider figure 2b (informative labels). Once again, the training cross-entropy falls to zero when the model is weakly regularized, while the test cross-entropy is high. Even though the model makes accurate predictions, those predictions are overconfident. As the regularization coefficient increases, the test cross-entropy falls below <math>\ln(2)</math>, indicating that the model is successfully generalizing to the test set. Now consider the Bayesian evidence. The log evidence ratio is large and positive when the model is weakly regularized, but as the regularization coefficient increases, the log evidence ratio drops below zero, indicating that the model is exponentially more plausible than assigning equal probabilities to each class. As the regularization is increased further, the log evidence ratio rises to zero while the test cross-entropy rises to <math>\ln(2)</math>. Test cross-entropy and Bayesian evidence are strongly correlated, with minima at the same regularization strength.<br />
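The role of <math>\ln(2)</math> as a reference level can be illustrated with a few lines of code. The sketch below computes the mean cross-entropy of binary predictions, first for a model that assigns probability 0.5 to each class and then for a model that is right only half the time but extremely confident. The labels and predicted probabilities are made up for illustration and are not the paper's data.<br />

```python
import math


def mean_cross_entropy(probs_class1, labels):
    """Mean of -ln P(true label), given predicted probabilities for class 1."""
    total = 0.0
    for p, y in zip(probs_class1, labels):
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(labels)


labels = [0, 1, 0, 1]

# Assigning equal probability to both classes gives exactly ln(2).
uniform = mean_cross_entropy([0.5, 0.5, 0.5, 0.5], labels)

# Hypothetical overconfident model: correct on two examples, wrong on two,
# but always 99.9% sure of its answer.
overconfident = mean_cross_entropy([0.999, 0.999, 0.001, 0.001], labels)

print(uniform, overconfident)  # the overconfident model scores far worse than ln(2)
```

Confident mistakes are penalized by <math>-\ln(0.001)</math> each, which is why the test cross-entropy of a weakly regularized model that memorized random labels can be far above <math>\ln(2)</math>.<br />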
<br />
Bayesian model comparison explains the results in logistic regression. Meanwhile, Krueger et al. (2017) showed the largest Hessian eigenvalue also increased when training on random labels in deep networks, implying the evidence is falling. The authors conclude that Bayesian model comparison is quantitatively consistent with the results of Zhang et al. (2016) in linear models where the evidence can be computed, and qualitatively consistent with their results in deep networks where it cannot. Dziugaite & Roy (2017) recently demonstrated the results of Zhang et al. (2016) can also be understood by minimising a PAC-Bayes generalization bound which penalizes sharp minima.<br />
[[File:bg2.png|800px|thumb|center|]]<br />
==Bayes Theorem and Stochastic Gradient Descent ==<br />
<br />
The previous section showed that generalization is strongly correlated with the Bayesian evidence, a weighted combination of the depth of a minimum (the cost function) and its breadth (the Occam factor). Consequently, Bayesians often add isotropic Gaussian noise to the gradient (Welling & Teh, 2011). In appendix A of the paper, the authors show this drives the parameters towards broad minima whose evidence is large. The noise introduced by small batch training is not isotropic, and its covariance matrix is a function of the parameter values, but empirically Keskar et al. (2016) found it has similar effects, driving the SGD away from sharp minima. This paper therefore proposes that Bayesian principles also account for the “generalization gap”, whereby the test set accuracy often falls as the SGD batch size is increased (holding all other hyper-parameters constant). Since the gradient drives the SGD towards deep minima, while noise drives the SGD towards broad minima, the test set performance is expected to show a peak at an optimal batch size, which balances these competing contributions to the evidence.<br />
The authors were unable to observe a generalization gap in linear models (since linear models are convex, there are no sharp minima to avoid). Instead they consider a shallow neural network with 800 hidden units and ReLU hidden activations, trained on MNIST without regularization, using SGD with a momentum parameter of 0.9. Unless otherwise stated, the learning rate is held constant at 1.0; it does not depend on the batch size or decay during training. Furthermore, the network is trained on just 1000 images, selected at random from the MNIST training set, which makes it possible to compare small batch to full batch training. The aim is not to achieve optimal performance, but to study a simple model which shows a generalization gap between small and large batch training.<br />
Figure 3 shows the evolution of the test accuracy and test cross-entropy during training. The small batches are composed of 30 images, randomly sampled from the training set. Looking first at figure 3a, small batch training takes longer to converge, but after a thousand gradient updates a clear generalization gap in model accuracy emerges between small and large training batches. Now consider figure 3b. While the test cross-entropy for small batch training is lower at the end of training, the cross-entropy of both small and large training batches is increasing, indicative of over-fitting. Both models exhibit a minimum test cross-entropy, although after different numbers of gradient updates. Intriguingly, appendix B of the paper shows that the generalization gap between small and large batch training shrinks significantly when L2 regularization is introduced.<br />
<br />
[[File:bg3.png|800px|thumb|center|]]<br />
<br />
From now on the focus is on the test set accuracy (since this converges as the number of gradient updates increases). Figure 4a shows training curves for a range of batch sizes between 1 and 1000. The model cannot train when the batch size <math>B \leq 10</math>. Figure 4b plots the mean test set accuracy after 10,000 training steps. A clear peak emerges, indicating that there is indeed an optimum batch size which maximizes the test accuracy, consistent with Bayesian intuition. The results of Keskar et al. (2016) focused on the decay in test accuracy above this optimum batch size.<br />
[[File:bg4.png|800px|thumb|center|]]<br />
<br />
==Stochastic Differential Equations and Scaling Rules==<br />
The results shown above indicate that the test accuracy peaks at an optimal batch size, if one holds the other SGD hyper-parameters constant. It is argued that this peak arises from the tradeoff between depth and breadth in the Bayesian evidence. However, it is not the batch size itself which controls this tradeoff, but the underlying scale of random fluctuations in the SGD dynamics. This section identifies this SGD “noise scale”, and uses it to derive three scaling rules which predict how the optimal batch size depends on the learning rate, training set size and momentum coefficient.<br />
First, interpret the gradient update as the discrete update of a stochastic differential equation,<br />
\begin{equation*}\frac{d\omega}{dt} = -\frac{dC}{d\omega} + \eta(t).\end{equation*}<br />
Here <math>\eta(t)</math> represents noise with <math>\langle \eta(t) \rangle = 0</math> and <math> \langle \eta (t)\eta (t')\rangle = gF (\omega)\delta (t-t')</math>, <math>t</math> is a continuous variable, and <math>F(\omega)</math> is a matrix describing the gradient covariances.<br />
The SGD noise scale is taken to be <math>g \approx \epsilon N/B</math> where <math>\epsilon</math> is the learning rate, <math>N</math> training set size and <math>B</math> the batch size.<br />
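To build intuition for how the noise scale <math>g</math> enters such dynamics, the sketch below integrates a one-dimensional equation of the form <math>d\omega/dt = -dC/d\omega + \eta(t)</math> with the Euler-Maruyama scheme. It is a toy under stated assumptions, not the paper's experiment: the cost is a hypothetical quadratic <math>C(\omega) = a\omega^2/2</math>, <math>F</math> is held constant at 1, and the function names, step size and seed are arbitrary. For this linear system the stationary variance of <math>\omega</math> is <math>gF/(2a)</math>, so fluctuations around the minimum grow in proportion to <math>g</math>.<br />

```python
import math
import random


def simulate_sde(grad_fn, w0, g, F, dt, steps, seed=0):
    """Euler-Maruyama integration of dw/dt = -dC/dw + eta(t),
    where <eta(t) eta(t')> = g * F * delta(t - t') and F is constant."""
    rng = random.Random(seed)
    noise_std = math.sqrt(g * F * dt)  # standard deviation of the noise added per step
    w = w0
    path = []
    for _ in range(steps):
        w = w - dt * grad_fn(w) + noise_std * rng.gauss(0.0, 1.0)
        path.append(w)
    return path


def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def grad(w, a=1.0):
    """Gradient of the hypothetical quadratic cost C(w) = a * w**2 / 2."""
    return a * w


low_noise = simulate_sde(grad, 0.0, g=0.1, F=1.0, dt=0.01, steps=200_000)
high_noise = simulate_sde(grad, 0.0, g=1.0, F=1.0, dt=0.01, steps=200_000)
var_low, var_high = variance(low_noise), variance(high_noise)
print(var_low, var_high)  # a ten times larger g gives ten times larger fluctuations
```

Because both runs share the same seed and the dynamics are linear, the two trajectories differ only by a constant factor, which makes the dependence on <math>g</math> easy to see.<br />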
[[File:bg5.png|800px|thumb|center|]]<br />
[[File:bg6.png|800px|thumb|center|]]<br />
[[File:bg7.png|800px|thumb|center|]]<br />
The noise scale falls when the batch size <math>B</math> increases, consistent with the earlier observation of an optimal batch size <math>B_{opt}</math> while holding the other hyper-parameters fixed. Notice that one would equivalently observe an optimal learning rate if one held the batch size constant. A similar analysis of SGD was recently performed by Mandt et al. (2017), although their treatment only holds near local minima where the covariances <math>F(\omega)</math> are stationary. The authors' analysis holds throughout training, which is necessary since Keskar et al. (2016) found that the beneficial influence of noise was most pronounced at the start of training.<br />
When the learning rate or the training set size is varied, the noise scale should be kept fixed, which implies that <math>B_{opt} \propto \epsilon N</math>. Figure 5a plots the test accuracy as a function of batch size after <math>(10000/\epsilon)</math> training steps, for a range of learning rates. Exactly as predicted, the peak moves to the right as <math>\epsilon</math> increases. Additionally, the peak test accuracy achieved at a given learning rate does not begin to fall until <math>\epsilon \sim 3</math>, indicating that there is no significant discretization error in integrating the stochastic differential equation below this point. Above this point, the discretization error begins to dominate and the peak test accuracy falls rapidly. Figure 5b plots the best observed batch size as a function of learning rate, showing a clear linear trend, <math>B_{opt} \propto \epsilon</math>. The error bars indicate the distance from the best observed batch size to the next batch size sampled in the experiments.<br />
<br />
This scaling rule allows the learning rate to be increased with no loss in test accuracy and no increase in computational cost, simply by simultaneously increasing the batch size. One can then exploit increased parallelism across multiple GPUs, reducing model training times (Goyal et al., 2017). A similar scaling rule was independently proposed by Jastrzebski et al. (2017) and Chaudhari & Soatto (2017), although neither work identifies the existence of an optimal noise scale. A number of authors have proposed adjusting the batch size adaptively during training (Friedlander & Schmidt, 2012; Byrd et al., 2012; De et al., 2017), while Balles et al. (2016) proposed linearly coupling the learning rate and batch size within this framework. Smith et al. (2017) show empirically that decaying the learning rate during training and increasing the batch size during training are equivalent.<br />
Figure 6a shows the test set accuracy as a function of batch size, for a range of training set sizes after 10000 steps (<math>\epsilon = 1</math> everywhere). Once again, the peak shifts right as the training set size rises, although the generalization gap becomes less pronounced as the training set size increases. Figure 6b plots the best observed batch size as a function of training set size, showing another linear trend, <math>B_{opt} \propto N</math>. This scaling rule could be applied to production models, progressively growing the batch size as new training data is collected. Production datasets are expected to grow considerably over time, and consequently large batch training is likely to become increasingly common.<br />
Finally, for SGD with momentum coefficient <math>m</math>, the noise scale becomes <math>g \approx \epsilon N/(B(1-m))</math>, which reduces to the noise scale of conventional SGD as <math>m \to 0</math>. When <math>m > 0</math>, this gives an additional scaling rule <math>B_{opt} \propto 1/(1 - m)</math>. This scaling rule predicts that the optimal batch size will increase when the momentum coefficient is increased. Figure 7a plots the test set performance as a function of batch size after 10000 gradient updates (<math>\epsilon = 1</math> everywhere), for a range of momentum coefficients. Figure 7b plots the best observed batch size as a function of the momentum coefficient, together with a fit to the scaling rule above, which shows remarkably good agreement.<br />
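The three scaling rules can be combined into a small helper that picks a new batch size so that the noise scale stays unchanged when other hyper-parameters change. The sketch assumes the noise scale has the form <math>g \approx \epsilon N/(B(1-m))</math>, which is consistent with the three proportionalities stated above; the function names and example numbers are hypothetical.<br />

```python
def noise_scale(lr, train_size, batch_size, momentum=0.0):
    """Assumed form g ~ lr * N / (B * (1 - m)); equals lr * N / B when m = 0."""
    return lr * train_size / (batch_size * (1.0 - momentum))


def rescale_batch_size(batch_size, lr, train_size, momentum,
                       new_lr, new_train_size, new_momentum):
    """Batch size that keeps the noise scale fixed under new hyper-parameters."""
    g = noise_scale(lr, train_size, batch_size, momentum)
    return new_lr * new_train_size / (g * (1.0 - new_momentum))


# Hypothetical starting point: B = 32, learning rate 0.1, N = 1000, no momentum.
doubled_lr = rescale_batch_size(32, 0.1, 1000, 0.0, 0.2, 1000, 0.0)    # about 64
doubled_data = rescale_batch_size(32, 0.1, 1000, 0.0, 0.1, 2000, 0.0)  # about 64
with_momentum = rescale_batch_size(32, 0.1, 1000, 0.0, 0.1, 1000, 0.9) # about 320
print(doubled_lr, doubled_data, with_momentum)
```

In practice the result would be rounded to a batch size the hardware handles well; the point of the sketch is only that learning rate, training set size and momentum enter through a single quantity.<br />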
<br />
==Critiques==<br />
<br />
#At present, Bayesian statistics is not provably a theory that explains why a learning algorithm works. The Bayesian theory is too optimistic: we introduce a prior and a model and then trust both implicitly. Relative to any particular prior and model (likelihood), the Bayesian posterior is the optimal summary of the data, but if either part is misspecified, the Bayesian posterior carries no optimality guarantee. The prior is chosen for convenience here.<br />
#There is no discussion of the information bottleneck analysis, which also addresses the generalization ability of models.<br />
#There is no discussion of real online learning with streaming data, where the total number of data points is unknown.<br />
#The paper shows how mini-batch noise in SGD can improve the performance of neural networks. However, the usefulness of the approach could be analyzed in greater detail if the authors reported performance on a range of well-known real-life datasets.<br />
<br />
==Conclusion==<br />
<br />
The paper showed that mini-batch noise helps SGD move away from sharp minima, and provided evidence that there is an optimum batch size which maximizes the test accuracy. Based on interpreting SGD as integrating a stochastic differential equation, this batch size is proportional to the learning rate and the training set size. Moreover, the authors showed that <math>B_{opt} \propto 1/(1 - m) </math>, where <math>m</math> is the momentum coefficient. Further analysis of the relation between the learning rate, effective learning rate, and batch size was presented at ICLR 2018, where the authors showed experimentally that the benefits of decaying the learning rate can be obtained by instead increasing the batch size, which also dramatically reduces the number of parameter updates, and that literature hyper-parameters could be reused without further tuning (Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, Quoc V. Le).<br />
<br />
==References==<br />
<br />
#Alessandro Achille and Stefano Soatto. On the emergence of invariance and disentangling in deep representations. arXiv preprint arXiv:1706.01350, 2017.<br />
#Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086, 2016.<br />
#Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012. <br />
#Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. arXiv preprint arXiv:1710.11029, 2017.<br />
#Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.<br />
#Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.<br />
#Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933, 2017.<br />
#Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.<br />
#Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.<br />
#Crispin W Gardiner. Handbook of Stochastic Methods, volume 4. Springer Berlin, 1985.<br />
#Pascal Germain, Francis Bach, Alexandre Lacoste, and Simon Lacoste-Julien. PAC-bayesian theory meets bayesian inference. In Advances in Neural Information Processing Systems, pp. 1884–1892, 2016.<br />
#Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.<br />
#Stephen F Gull. Bayesian inductive inference and maximum entropy. In Maximum-entropy and Bayesian methods in science and engineering, pp. 53–74. Springer, 1988.<br />
#Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pp. 5–13. ACM, 1993.<br />
#Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.<br />
#Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.<br />
#Stanisław Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.<br />
#Robert E Kass and Adrian E Raftery. Bayes factors. Journal of the american statistical association, 90(430):773–795, 1995.<br />
#Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017.<br />
#Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.<br />
#David Krueger, Nicolas Ballas, Stanislaw Jastrzebski, Devansh Arpit, Maxinder S Kanwal, Tegan Maharaj, Emmanuel Bengio, Asja Fischer, and Aaron Courville. Deep nets don’t learn via memorization. ICLR Workshop, 2017.<br />
#Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, pp. 2101–2110, 2017.<br />
#David JC MacKay. A practical bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992.<br />
#Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.<br />
#Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.<br />
#Samuel L. Smith, Pieter-Jan Kindermans, and Quoc V. Le. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.<br />
#Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.<br />
#Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_Bayesian_Perspective_on_Generalization_and_Stochastic_Gradient_Descent&diff=42159A Bayesian Perspective on Generalization and Stochastic Gradient Descent2018-11-30T23:39:05Z<p>Z43ma: </p>
<hr />
<div>==Introduction==<br />
This paper shows that Bayesian principles can explain many recent observations in the deep learning literature and provide practical new insights. This work builds on Zhang et al. (2016), who showed that deep neural networks can easily memorize randomly labelled training data, despite generalizing well on real labels of the same inputs. The authors consider two questions: how can we predict if a minimum will generalize to the test set, and why does stochastic gradient descent find minima that generalize well? <br />
<br />
The paper shows that the same phenomenon occurs even in small linear models. These observations are explained by the Bayesian evidence, which penalizes sharp minima but is invariant to model parameterization. They also demonstrate that, when one holds the learning rate fixed, there is an optimum batch size which maximizes the test set accuracy.<br />
<br />
The authors propose that the noise introduced by small mini-batches drives the parameters towards minima whose evidence is large. Interpreting stochastic gradient descent as a stochastic differential equation, they identify the “noise scale” <math display="inline"> g \approx \epsilon N/B </math>, where <math display="inline">\epsilon</math> is the learning rate, <math display="inline">N</math> the training set size and <math display="inline">B</math> the batch size. Consequently the optimum batch size is proportional to both the learning rate and the size of the training set, <math display="inline">B_{opt} \propto \epsilon N</math>. The authors verify these predictions empirically.<br />
<br />
==Motivation and Related Work==<br />
Zhang et al. (2016) trained deep convolutional networks on ImageNet and CIFAR10, achieving excellent accuracy on both training and test sets. They then took the same input images, but randomized the labels, and found that while their networks were now unable to generalize to the test set, they still memorized the training labels. They claimed these results contradict learning theory, although this claim is disputed (Kawaguchi et al., 2017; Dziugaite & Roy, 2017). Nonetheless, their results raise the question: if our models can assign arbitrary labels to the training set, why do they work so well in practice? <br />
<br />
Meanwhile, Keskar et al. (2016) observed that if we hold the learning rate fixed and increase the batch size, the test accuracy usually falls. This striking result shows improving the estimate of the full-batch gradient can harm performance. Goyal et al. (2017) observed a linear scaling rule between batch size and learning rate in a deep ResNet, while Hoffer et al. (2017) proposed a square root rule on theoretical grounds.<br />
<br />
Many authors have suggested “broad minima” whose curvature is small may generalize better than “sharp minima” whose curvature is large (Chaudhari et al., 2016; Hochreiter & Schmidhuber, 1997). Indeed, Dziugaite & Roy (2017) argued the results of Zhang et al. (2016) can be understood using “nonvacuous” PAC-Bayes generalization bounds which penalize sharp minima, while Keskar et al. (2016) showed stochastic gradient descent (SGD) finds wider minima as the batch size is reduced. However, Dinh et al. (2017) challenged this interpretation, by arguing that the curvature of a minimum can be arbitrarily increased by changing the model parameterization.<br />
<br />
==Contribution==<br />
<br />
The main contributions of this paper are to show that:<br />
* The results of Zhang et al. (2016) are not unique to deep learning; the same phenomenon is observed in a small “over-parameterized” linear model. Over-parameterization occurs when a model is able to effectively “remember” training data. This happens when there are enough parameters that the system of equations ends up with an infinite number of possible solutions. One can see why this over-training would lead to poor results in test cases, as this “memorization” learns noise as opposed to the inherent structure of different classes. It is demonstrated that this phenomenon is straightforwardly understood by evaluating the Bayesian evidence in favor of each model, which penalizes sharp minima but is invariant to the model parameterization.<br />
* SGD integrates a stochastic differential equation whose “noise scale” is <math>g \approx \epsilon N/B</math>, where <math>\epsilon</math> is the learning rate, <math>N</math> the training set size, and <math>B</math> the batch size. Noise drives SGD away from sharp minima, and therefore there is an optimal batch size which maximizes the test set accuracy. This optimal batch size is '''proportional to the learning rate and training set size'''.<br />
<br />
Zhang et al. (2016) showed that neural networks train well under informative labels, but overfit drastically on improper labels. This implies weak generalizability even when a small proportion of labels are improper. The authors show that generalization is strongly correlated with the Bayesian evidence, a weighted combination of the depth of a minimum (the cost function) and its breadth (the Occam factor). Bayesians often make distributional assumptions on gradient updates by adding isotropic Gaussian noise. This paper builds upon these Bayesian principles by arguing that noise drives SGD away from sharp minima and towards broad minima (the broader the minimum, the better the generalization, since small perturbations of the input have less influence). The noise term in the stochastic differential equation that models the gradient updates effectively serves as injected noise that improves a network's generalizability.<br />
<br />
==Main Results==<br />
<br />
The weakly regularized model memorizes random labels but generalizes properly on informative labels, although its predictions are overconfident. The authors also show that the test accuracy peaks at an optimal batch size if one holds the other SGD hyper-parameters constant. It is postulated that the optimum represents a tradeoff between depth and breadth in the Bayesian evidence. However, it is the underlying scale of random fluctuations in the SGD dynamics which controls the tradeoff, not the batch size itself. Furthermore, this test accuracy peak shifts as the training set size rises. The authors observed that the best batch size found is proportional to the learning rate. This scaling rule allows the learning rate to be increased with no loss in test accuracy and no increase in computational cost, simply by simultaneously increasing the batch size, so parallelism across multiple GPUs can be fully leveraged to decrease training time. The scaling rule could also be applied to production models by progressively increasing the batch size as new training data is introduced.<br />
<br />
==Bayesian Model Comparison==<br />
<br />
===Introduction to Bayesian Statistics===<br />
Bayes' theorem is fundamental to Bayesian statistics: Bayesian methods use it to update probabilities, which are degrees of belief, after obtaining new data. Given two events <math>A</math> and <math>B</math> with <math>P(B) > 0</math>, Bayes' theorem states that the conditional probability of <math>A</math> given <math>B</math> is<br />
\begin{align*}\displaystyle P(A\mid B)={\frac {P(B\mid A)P(A)}{P(B)}}\end{align*}<br />
<br />
Bayesian networks are DAGs whose nodes represent variables in the Bayesian sense: they may be observable quantities, latent variables, unknown parameters or hypotheses. Edges represent conditional dependencies; nodes that are not connected (no path connects one node to another) represent variables that are conditionally independent of each other. Each node is associated with a probability function that takes, as input, a particular set of values for the node's parent variables, and gives (as output) the probability (or probability distribution, if applicable) of the variable represented by the node. For example, if <math>m </math> parent nodes represent <math>m </math> Boolean variables then the probability function could be represented by a table of <math>2^{m} </math> entries, one entry for each of the <math>2^{m} </math> possible parent combinations. <br />
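The table of <math>2^{m} </math> entries described above can be sketched in a few lines of Python. This is only an illustration: the helper <code>make_cpt</code> and the noisy-OR rule with its <code>leak</code> and <code>strength</code> values are hypothetical choices, not something taken from the paper.<br />

```python
from itertools import product

def make_cpt(num_parents, prob_true):
    """Build a conditional probability table for a Boolean node.

    prob_true maps a tuple of parent values to P(node = True | parents).
    """
    table = {}
    for parents in product([False, True], repeat=num_parents):
        table[parents] = prob_true(parents)
    return table

def noisy_or(parents, leak=0.05, strength=0.8):
    """Hypothetical rule: each active parent independently switches the node on."""
    p_off = 1.0 - leak
    for value in parents:
        if value:
            p_off *= 1.0 - strength
    return 1.0 - p_off

cpt = make_cpt(3, noisy_or)
print(len(cpt))  # one entry per combination of the 3 Boolean parents
for parents, p in cpt.items():
    print(parents, round(p, 4))
```

The table grows exponentially with the number of parents, which is why sparse graphs are preferred in practice.<br />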
<br />
===Bayesian Model Comparison in Neural Networks===<br />
MacKay (1992) applied Bayesian model comparison to neural networks. An overview is presented below. <br />
<br />
We first consider a classification model <math>M </math> with a single parameter <math>\omega </math>, training inputs <math>x </math> and training labels <math>y </math>. We can infer a posterior probability distribution over the parameter by applying Bayes' theorem:<br />
<br />
\begin{align*}P(\omega\mid y,x;M) = \frac{P(y\mid \omega,x;M)P(\omega;M) }{P(y\mid x;M)}\end{align*}<br />
<br />
The likelihood is <math>P(y\mid \omega,x;M) = \prod_i P(y_i\mid \omega,x_i;M) = e^{-H(\omega;M)} </math>, where <math>H(\omega;M) </math> denotes the cross-entropy of unique categorical labels. Using a Gaussian prior, <math>P(\omega;M) = \sqrt{\lambda/2\pi}\, e^{-\lambda\omega^2/2} </math>, the posterior probability density of the parameter given the training data is <math>P(\omega\mid y,x;M) \propto \sqrt{\lambda/2\pi}\, e^{-C(\omega;M)} </math>, where <math>C(\omega;M) = H(\omega;M) + \lambda\omega^2/2 </math> denotes the L2 regularized cross entropy, or “cost function”, and <math>\lambda </math> is the regularization coefficient. <br />
<br />
The value <math>\omega_0 </math> which minimizes the cost function lies at the maximum of this posterior. To predict an unknown label <math>y_t </math> of a new input <math>x_t </math>, we should compute the integral,<br />
<br />
\begin{align*} P(y_t\mid x_t,y,x;M) &= \int d\omega\, P(y_t\mid \omega,x_t;M)\,P(\omega\mid y,x;M)\\ &= \frac{\int d \omega\, P(y_t \mid \omega ,x_t;M)e^{-C(\omega;M)}}{\int d \omega\, e^{-C(\omega;M)}} \end{align*}<br />
<br />
However, these integrals are dominated by the region near <math>\omega_0 </math> . We usually approximate <math>P(y_t\mid x_t,x,y;M) \approx P(y_t\mid \omega_0,x_t;M) </math>. Having minimized <math>C(\omega;M) </math> to find <math>\omega_0 </math>, we now wish to compare two different models and select the best one. We use the probability ratio<br />
<br />
\begin{align*}\frac{P(M_1\mid y,x)}{P (M_2\mid y, x)} = \frac{P(y\mid x;M_1) P(M_1)}{ P (y\mid x; M_2) P (M_2)} . \end{align*} <br />
<br />
The second factor on the right is the prior ratio, which describes which model is most plausible. To avoid unnecessary subjectivity, we usually set this to 1. Meanwhile the first factor on the right is the evidence ratio, which controls how much the training data changes our prior beliefs.<br />
<br />
Germain et al. (2016) showed that maximizing the evidence (or “marginal likelihood”) minimizes a PAC-Bayes generalization bound. To compute it, we evaluate <br />
\begin{align*}P(y\mid x;M) &= \int d\omega\, P(y\mid \omega,x;M)P(\omega;M) \\ &=\sqrt{\frac{\lambda}{2\pi}}\int d \omega\, e^{-C(\omega;M)}\end{align*}<br />
<br />
Notice that the evidence is computed by integrating out the parameters, and consequently it is invariant to the model parameterization. <br />
Since this integral is dominated by the region near the minimum <math>\omega_0 </math>, we can estimate the evidence by Taylor expanding <math>C(\omega; M) \approx C(\omega_0) + C''(\omega_0)(\omega - \omega_0)^2/2</math>. This gives us<br />
<br />
\begin{align*} P(y\mid x;M) &\approx e^{-C(\omega_0)}\sqrt{\frac{\lambda}{2\pi}} \int d \omega\, e^{-C''(\omega_0)(\omega - \omega_0)^2/2}\\ &= \exp \big\{- C(\omega_0)-\frac{1}{2}\ln(C''(\omega_0)/\lambda) \big\}.\end{align*}<br />
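As a rough numerical check of this approximation, the sketch below compares the integral <math>\sqrt{\lambda/2\pi}\int d\omega\, e^{-C(\omega)}</math> with the closed form <math>e^{-C(\omega_0)}\sqrt{\lambda/C''(\omega_0)}</math> for a one-parameter logistic model. The data points, the value of <math>\lambda</math>, and all helper names are hypothetical and chosen only for illustration; labels are encoded as <math>\pm 1</math>.<br />

```python
import math

def log1pexp(z):
    """Numerically stable log(1 + exp(z))."""
    return z + math.log1p(math.exp(-z)) if z > 0 else math.log1p(math.exp(z))

def make_cost(xs, ys, lam):
    """C(w) = H(w) + lam * w**2 / 2, with cross-entropy H for labels in {-1, +1}."""
    def cost(w):
        h = sum(log1pexp(-y * w * x) for x, y in zip(xs, ys))
        return h + 0.5 * lam * w * w
    return cost

def second_derivative(f, w, eps=1e-4):
    """Central finite-difference estimate of f''(w)."""
    return (f(w + eps) - 2.0 * f(w) + f(w - eps)) / (eps * eps)

def minimize_1d(f, lo=-50.0, hi=50.0, iters=200):
    """Ternary search for the minimum; assumes f is convex on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

def evidence_numeric(cost, lam, lo=-30.0, hi=30.0, steps=60000):
    """sqrt(lam / 2 pi) * integral of exp(-C(w)) dw, using the midpoint rule."""
    dw = (hi - lo) / steps
    total = sum(math.exp(-cost(lo + (i + 0.5) * dw)) for i in range(steps))
    return math.sqrt(lam / (2.0 * math.pi)) * total * dw

def evidence_laplace(cost, lam):
    """exp(-C(w0)) * sqrt(lam / C''(w0)), the second-order approximation."""
    w0 = minimize_1d(cost)
    return math.exp(-cost(w0)) * math.sqrt(lam / second_derivative(cost, w0))

# Hypothetical, non-separable toy data.
xs = [0.5, 1.0, 1.5, 2.0, -1.0, 0.3]
ys = [1, 1, 1, 1, -1, -1]
lam = 1.0
cost = make_cost(xs, ys, lam)
print("numerical evidence:", evidence_numeric(cost, lam))
print("Laplace estimate:  ", evidence_laplace(cost, lam))
```

For a cost that is exactly quadratic the two numbers coincide; for the logistic cost they differ by the contribution of the higher-order terms dropped in the Taylor expansion.<br />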
<br />
The evidence is controlled by the value of the cost function at the minimum, and by the logarithm of the ratio of the curvature about this minimum compared to the regularization constant. In models with <math>p</math> parameters, the same expansion gives<br />
\begin{align*} P(y\mid x;M) &\approx \frac{\lambda^{p/2}\, e^{-C(\omega_0)}}{\left|\nabla\nabla C(\omega)\right|_{\omega_0}^{1/2}} \\ &= \exp \big\{- C(\omega_0)-\frac{1}{2} \sum_{i=1}^p \ln (\lambda_i/\lambda) \big\},\end{align*}<br />
where <math>\left|\nabla\nabla C(\omega)\right|_{\omega_0}</math> is the determinant of the Hessian of the cost at the minimum and <math>\lambda_i</math> are the Hessian eigenvalues.<br />
<br />
Occam’s factor arises from the log ratio <math>\ln (\lambda_i/\lambda) </math>. The Occam factor describes the fraction of the prior parameter space consistent with the data. It penalizes the amount of information the model must learn about the parameters to accurately model the training data. Since the fraction is always less than one, the authors propose to approximate <math>P(y\mid x;M) </math> away from local minima by only performing the summation over eigenvalues <math>\lambda_i \geq \lambda </math>.<br />
<br />
The authors compare the evidence against a null model which assumes the labels are entirely random. This model has no parameters, and so the evidence is controlled by the likelihood alone: <math>P(y\mid x;NULL) = (1/n)^N = e^{-N \ln(n)} </math>, where <math>n </math> denotes the number of model classes and <math>N</math> the number of training labels. The evidence ratio is<br />
\begin{equation*}\frac{P(y\mid x;M) }{P(y\mid x;NULL) } = e ^{-E(\omega_0)}, \end{equation*}<br />
where <math>E(\omega_0) = C(\omega_0)+\frac{1}{2} \sum_{i=1}^p \ln (\lambda_i/\lambda) - N\ln (n) </math> is the log evidence ratio in favor of the null model.<br />
The authors assign confidence to the predictions of a model if and only if <math>E(\omega_0) < 0 </math>.<br />
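The log evidence ratio follows from dividing the model evidence <math>\exp\{-C(\omega_0)-\frac{1}{2}\sum_i \ln(\lambda_i/\lambda)\}</math> by the null evidence <math>e^{-N\ln(n)}</math>. A minimal sketch of that arithmetic is given below; the function names and all numerical inputs are hypothetical, and in practice the cost and Hessian eigenvalues would come from a trained model.<br />

```python
import math

def log_evidence(cost_at_min, hessian_eigs, lam):
    """ln P(y|x;M) ~= -C(w0) - 1/2 * sum_i ln(lambda_i / lam)."""
    occam = 0.5 * sum(math.log(eig / lam) for eig in hessian_eigs)
    return -cost_at_min - occam

def log_evidence_ratio(cost_at_min, hessian_eigs, lam, num_labels, num_classes):
    """E(w0) = -ln[P(y|x;M) / P(y|x;NULL)]; negative values favor the model."""
    log_null = -num_labels * math.log(num_classes)
    return -(log_evidence(cost_at_min, hessian_eigs, lam) - log_null)

# Hypothetical numbers: 800 binary labels, three Hessian eigenvalues.
E = log_evidence_ratio(cost_at_min=100.0, hessian_eigs=[5.0, 2.0, 1.0],
                       lam=1.0, num_labels=800, num_classes=2)
print("log evidence ratio E:", E)
print("model preferred over null:", E < 0)
```

A model whose cost equals <math>N\ln(n)</math> and whose curvature equals the regularization constant in every direction gives <math>E = 0</math>, so it is no more plausible than random guessing.<br />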
<br />
The evidence supports the intuition that broad minima generalize better than sharp minima, but unlike the curvature it does not depend on the model parameterization. Dinh et al. (2017) showed one can increase the Hessian eigenvalues by rescaling the parameters, but they must simultaneously rescale the regularization coefficients, otherwise the model changes. Since Occam’s factor arises from the log ratio <math>\ln (\lambda_i/\lambda) </math>, these two effects cancel out. Note however that while the evidence itself is invariant to model parameterization, one can find reparameterizations which change the approximate evidence after the Laplace approximation. It is difficult to evaluate the evidence for deep networks, as we cannot compute the Hessian of millions of parameters. Additionally, neural networks exhibit many equivalent minima, since we can permute the hidden units without changing the model. To compute the evidence we must carefully account for this “degeneracy”. The authors argue these issues are not a major limitation, since the intuition they build studying the evidence in simple cases will be sufficient to explain the results of both Zhang et al. (2016) and Keskar et al. (2016).<br />
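The cancellation under rescaling can be checked numerically on a toy problem. The sketch below assumes a hypothetical one-parameter quadratic cost and substitutes <math>\omega = \alpha v</math>: the curvature with respect to <math>v</math> and the regularization coefficient both pick up a factor <math>\alpha^2</math>, so their log ratio is unchanged. All names and constants are illustrative.<br />

```python
import math

LAM = 0.5  # hypothetical regularization coefficient in the original coordinates

def cost_w(w):
    """Toy cost C(w) = H(w) + LAM * w**2 / 2 with H(w) = 2 * (w - 1)**2."""
    return 2.0 * (w - 1.0) ** 2 + 0.5 * LAM * w * w

def second_derivative(f, x, eps=1e-3):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + eps) - 2.0 * f(x) + f(x - eps)) / (eps * eps)

def occam_log_ratio(alpha):
    """ln(curvature / regularizer) after substituting w = alpha * v."""
    def cost_v(v):
        return cost_w(alpha * v)
    # The cost is quadratic, so its curvature is the same at every point.
    curvature_v = second_derivative(cost_v, 0.0)
    # LAM * w**2 / 2 == (alpha**2 * LAM) * v**2 / 2
    lam_v = alpha ** 2 * LAM
    return math.log(curvature_v / lam_v)

ratios = [occam_log_ratio(alpha) for alpha in (0.1, 1.0, 10.0)]
print(ratios)  # the three values agree up to finite-difference error
```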
<br />
==Bayes Theorem and Generalization==<br />
Zhang et al. (2016) showed that deep neural networks generalize well on training inputs with informative labels, but the same model can overfit on the same input images when the labels are randomized, perfectly memorizing the training set. To demonstrate that these observations are not unique to deep networks, the authors use logistic regression. They form a small balanced training set comprising 800 images from MNIST, of which half have true label “0” and half true label “1”. The test set is balanced, comprising 5000 MNIST images of zeros and 5000 MNIST images of ones. There are two tasks. In the first task, the labels of both the training and test sets are randomized. In the second task, the labels are informative, matching the true MNIST labels. The model has 784 weights and 1 bias.<br />
<br />
The accuracy of the model predictions on both the training and test sets is shown in figure 1. When trained on the informative labels, the model generalizes well to the test set, so long as it is weakly regularized. However the model also perfectly memorizes the random labels, replicating the observations of Zhang et al. (2016) in deep networks. No significant improvement in model performance is observed as the regularization coefficient increases. For completeness, we also evaluate the mean margin between training examples and the decision boundary. For both random and informative labels, the margin drops significantly as we reduce the regularization coefficient. When weakly regularized, the mean margin is roughly 50% larger for informative labels than for random labels.<br />
<br />
[[File:bg1.png|800px|thumb|center|]]<br />
<br />
Now consider figure 2, where we plot the mean cross-entropy of the model predictions, evaluated on both training and test sets, as well as the Bayesian log evidence ratio defined in the previous section. Looking first at the random label experiment in figure 2a, while the cross-entropy on the training set vanishes when the model is weakly regularized, the cross-entropy on the test set explodes. Not only does the model make random predictions, but it is extremely confident in those predictions. As the regularization coefficient is increased the test set cross-entropy falls, settling at <math>\ln(2)</math>, the cross-entropy of assigning equal probability to both classes. Now consider the Bayesian evidence, which we evaluate on the training set. The log evidence ratio is large and positive when the model is weakly regularized, indicating that the model is exponentially less plausible than assigning equal probabilities to each class. As the regularization parameter is increased, the log evidence ratio falls, but it is always positive, indicating that the model can never be expected to generalize well.<br />
Now consider figure 2b (informative labels). Once again, the training cross-entropy falls to zero when the model is weakly regularized, while the test cross-entropy is high. Even though the model makes accurate predictions, those predictions are overconfident. As the regularization coefficient increases, the test cross-entropy falls below <math>\ln(2)</math>, indicating that the model is successfully generalizing to the test set. Now consider the Bayesian evidence. The log evidence ratio is large and positive when the model is weakly regularized, but as the regularization coefficient increases, the log evidence ratio drops below zero, indicating that the model is exponentially more plausible than assigning equal probabilities to each class. As we further increase the regularization, the log evidence ratio rises to zero while the test cross-entropy rises to <math>\ln(2)</math>. Test cross-entropy and Bayesian evidence are strongly correlated, with minima at the same regularization strength.<br />
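The reference value <math>\ln(2)</math> and the effect of overconfidence can be reproduced with a few lines of arithmetic. The sketch below uses four hypothetical binary predictions, not data from the paper: a model that always outputs probability 0.5 scores exactly <math>\ln(2)</math>, while a model that is right only half the time but nearly certain each time scores far worse.<br />

```python
import math

def mean_cross_entropy(probs_of_class_one, labels):
    """Mean binary cross-entropy; labels are 0 or 1."""
    total = 0.0
    for p, y in zip(probs_of_class_one, labels):
        total -= math.log(p if y == 1 else 1.0 - p)
    return total / len(labels)

labels = [0, 1, 0, 1]
# Equal probability for both classes on every example.
uninformed = mean_cross_entropy([0.5, 0.5, 0.5, 0.5], labels)
# Two confident mistakes followed by two confident correct predictions.
overconfident = mean_cross_entropy([0.999, 0.001, 0.001, 0.999], labels)

print("uninformed:   ", uninformed, "vs ln 2 =", math.log(2.0))
print("overconfident:", overconfident)
```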
<br />
Bayesian model comparison has explained our results in a logistic regression. Meanwhile, Krueger et al. (2017) showed the largest Hessian eigenvalue also increased when training on random labels in deep networks, implying the evidence is falling. We conclude that Bayesian model comparison is quantitatively consistent with the results of Zhang et al. (2016) in linear models where we can compute the evidence, and qualitatively consistent with their results in deep networks where we cannot. Dziugaite & Roy (2017) recently demonstrated the results of Zhang et al. (2016) can also be understood by minimising a PAC-Bayes generalization bound which penalizes sharp minima.<br />
[[File:bg2.png|800px|thumb|center|]]<br />
==Bayes Theorem and Stochastic Gradient Descent ==<br />
<br />
We showed above that generalization is strongly correlated with the Bayesian evidence, a weighted combination of the depth of a minimum (the cost function) and its breadth (the Occam factor). Consequently Bayesians often add isotropic Gaussian noise to the gradient (Welling & Teh, 2011). In appendix A, we show this drives the parameters towards broad minima whose evidence is large. The noise introduced by small batch training is not isotropic, and its covariance matrix is a function of the parameter values, but empirically Keskar et al. (2016) found it has similar effects, driving the SGD away from sharp minima. This paper therefore proposes Bayesian principles also account for the “generalization gap”, whereby the test set accuracy often falls as the SGD batch size is increased (holding all other hyper-parameters constant). Since the gradient drives the SGD towards deep minima, while noise drives the SGD towards broad minima, we expect the test set performance to show a peak at an optimal batch size, which balances these competing contributions to the evidence.<br />
We were unable to observe a generalization gap in linear models (since linear models are convex there are no sharp minima to avoid). Instead we consider a shallow neural network with 800 hidden units and ReLU hidden activations, trained on MNIST without regularization. We use SGD with a momentum parameter of 0.9. Unless otherwise stated, we use a constant learning rate of 1.0 which does not depend on the batch size or decay during training. Furthermore, we train on just 1000 images, selected at random from the MNIST training set. This enables us to compare small batch to full batch training. We emphasize that we are not trying to achieve optimal performance, but to study a simple model which shows a generalization gap between small and large batch training.<br />
In figure 3, we exhibit the evolution of the test accuracy and test cross-entropy during training. Our small batches are composed of 30 images, randomly sampled from the training set. Looking first at figure 3a, small batch training takes longer to converge, but after a thousand gradient updates a clear generalization gap in model accuracy emerges between small and large training batches. Now consider figure 3b. While the test cross-entropy for small batch training is lower at the end of training, the cross-entropy of both small and large training batches is increasing, indicative of over-fitting. Both models exhibit a minimum test cross-entropy, although after different numbers of gradient updates. Intriguingly, we show in appendix B that the generalization gap between small and large batch training shrinks significantly when we introduce L2 regularization.<br />
<br />
[[File:bg3.png|800px|thumb|center|]]<br />
<br />
From now on we focus on the test set accuracy (since this converges as the number of gradient updates increases). In figure 4a, we exhibit training curves for a range of batch sizes between 1 and 1000. We find that the model cannot train when the batch size <math>B \lesssim 10</math>. In figure 4b we plot the mean test set accuracy after 10000 training steps. A clear peak emerges, indicating that there is indeed an optimum batch size which maximizes the test accuracy, consistent with Bayesian intuition. The results of Keskar et al. (2016) focused on the decay in test accuracy above this optimum batch size.<br />
[[File:bg4.png|800px|thumb|center|]]<br />
<br />
==Stochastic Differential Equations and Scaling Rules==<br />
The results shown above indicate that the test accuracy peaks at an optimal batch size if one holds the other SGD hyper-parameters constant. It is argued that this peak arises from the tradeoff between depth and breadth in the Bayesian evidence. However, it is not the batch size itself which controls this tradeoff, but the underlying scale of random fluctuations in the SGD dynamics. This section identifies this SGD “noise scale” and uses it to derive three scaling rules which predict how the optimal batch size depends on the learning rate, training set size and momentum coefficient. <br />
First, interpret the gradient update as the discrete update of a stochastic differential equation <br />
\begin{equation*}\frac{d\omega}{dt} = -\frac{dC}{d\omega} + \eta(t)\end{equation*}<br />
Here <math>\eta(t)</math> represents noise with <math>\langle \eta(t) \rangle = 0</math> and <math> \langle \eta (t)\eta (t')\rangle = gF (\omega)\delta (t-t')</math>, <br />
<math>t</math> is a continuous variable, and <math>F(\omega)</math> is a matrix describing the gradient covariances.<br />
The SGD noise scale is taken to be <math>g \approx \epsilon N/B</math>, where <math>\epsilon</math> is the learning rate, <math>N</math> the training set size and <math>B</math> the batch size.<br />
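To see how the noise scale controls the size of the fluctuations, the sketch below integrates the stochastic differential equation with the gradient-descent drift <math>-dC/d\omega</math> using the Euler-Maruyama scheme. It is a toy setting and not the authors' experiment: a single parameter, a hypothetical quadratic cost <math>C = k\omega^2/2</math>, a constant scalar <math>F</math>, and arbitrary hyper-parameter values. In the continuous-time limit this linear model has stationary variance <math>gF/(2k)</math>, so a fourfold increase in <math>g</math> should give a fourfold increase in variance.<br />

```python
import math
import random

def noise_scale(learning_rate, train_size, batch_size):
    """g ~= epsilon * N / B, as stated in the text."""
    return learning_rate * train_size / batch_size

def simulate_variance(g, curvature=1.0, grad_cov=1.0, dt=0.01,
                      steps=100_000, seed=0):
    """Euler-Maruyama integration of dw/dt = -curvature * w + eta(t),
    with <eta(t) eta(t')> = g * grad_cov * delta(t - t').
    Returns the empirical variance of w after a burn-in period."""
    rng = random.Random(seed)
    noise_std = math.sqrt(g * grad_cov * dt)
    burn_in = steps // 10
    w = 0.0
    total = 0.0
    total_sq = 0.0
    count = 0
    for step in range(steps):
        w += -curvature * w * dt + noise_std * rng.gauss(0.0, 1.0)
        if step >= burn_in:
            total += w
            total_sq += w * w
            count += 1
    mean = total / count
    return total_sq / count - mean * mean

# Hypothetical settings: same learning rate and training set, two batch sizes.
g_large_batch = noise_scale(0.1, 1000, 100)
g_small_batch = noise_scale(0.1, 1000, 25)
var_large_batch = simulate_variance(g_large_batch)
var_small_batch = simulate_variance(g_small_batch)
print("noise scales:", g_large_batch, g_small_batch)
print("variance ratio:", var_small_batch / var_large_batch)
```

Both runs share the same random seed, so the comparison isolates the effect of the noise scale rather than sampling variability.<br />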
[[File:bg5.png|800px|thumb|center|]]<br />
[[File:bg6.png|800px|thumb|center|]]<br />
[[File:bg7.png|800px|thumb|center|]]<br />
The noise scale falls when the batch size <math>B</math> increases, consistent with our earlier observation of an optimal batch size <math>B_{opt}</math> while holding the other hyper-parameters fixed. Notice that one would equivalently observe an optimal learning rate if one held the batch size constant. A similar analysis of the SGD was recently performed by Mandt et al. (2017), although their treatment only holds near local minima where the covariances <math>F(\omega)</math> are stationary. Our analysis holds throughout training, which is necessary since Keskar et al. (2016) found that the beneficial influence of noise was most pronounced at the start of training.<br />
When we vary the learning rate or the training set size, we should keep the noise scale fixed, which implies that <math>B_{opt} \propto \epsilon N</math>. In figure 5a, we plot the test accuracy as a function of batch size after <math>(10000/\epsilon)</math> training steps, for a range of learning rates. Exactly as predicted, the peak moves to the right as <math>\epsilon</math> increases. Additionally, the peak test accuracy achieved at a given learning rate does not begin to fall until <math>\epsilon \sim 3</math>, indicating that there is no significant discretization error in integrating the stochastic differential equation below this point. Above this point, the discretization error begins to dominate and the peak test accuracy falls rapidly. In figure 5b, we plot the best observed batch size as a function of learning rate, observing a clear linear trend, <math>B_{opt} \propto \epsilon</math>. The error bars indicate the distance from the best observed batch size to the next batch size sampled in our experiments.<br />
<br />
This scaling rule allows us to increase the learning rate with no loss in test accuracy and no increase in computational cost, simply by simultaneously increasing the batch size. We can then exploit increased parallelism across multiple GPUs, reducing model training times (Goyal et al., 2017). A similar scaling rule was independently proposed by Jastrzebski et al. (2017) and Chaudhari & Soatto (2017), although neither work identifies the existence of an optimal noise scale. A number of authors have proposed adjusting the batch size adaptively during training (Friedlander & Schmidt, 2012; Byrd et al., 2012; De et al., 2017), while Balles et al. (2016) proposed linearly coupling the learning rate and batch size within this framework. Smith et al. (2017) show empirically that decaying the learning rate during training and increasing the batch size during training are equivalent.<br />
In figure 6a we exhibit the test set accuracy as a function of batch size, for a range of training set sizes after 10000 steps (<math>\epsilon = 1</math> everywhere). Once again, the peak shifts right as the training set size rises, although the generalization gap becomes less pronounced as the training set size increases. In figure 6b, we plot the best observed batch size as a function of training set size, observing another linear trend, <math>B_{opt} \propto N</math>. This scaling rule could be applied to production models, progressively growing the batch size as new training data is collected. We expect production datasets to grow considerably over time, and consequently large batch training is likely to become increasingly common.<br />
Finally, for SGD with momentum coefficient <math>m</math>, the noise scale becomes <math>g \approx \epsilon N / (B(1-m))</math>, which reduces to the noise scale of conventional SGD as <math>m \to 0</math>. When <math>m > 0</math>, we obtain an additional scaling rule <math>B_{opt} \propto 1/(1 - m)</math>. This scaling rule predicts that the optimal batch size will increase when the momentum coefficient is increased. In figure 7a we plot the test set performance as a function of batch size after 10000 gradient updates (<math>\epsilon = 1</math> everywhere), for a range of momentum coefficients. In figure 7b, we plot the best observed batch size as a function of the momentum coefficient, and fit our results to the scaling rule above, obtaining remarkably good agreement.<br />
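The three scaling rules amount to holding the noise scale fixed while other hyper-parameters change. The sketch below assumes the momentum form of the noise scale, <math>g \approx \epsilon N/(B(1-m))</math>, which is consistent with the rule <math>B_{opt} \propto 1/(1-m)</math> stated above; the function names and the baseline values are hypothetical.<br />

```python
def noise_scale(learning_rate, train_size, batch_size, momentum=0.0):
    """g ~= epsilon * N / (B * (1 - m)); reduces to epsilon * N / B when m = 0."""
    return learning_rate * train_size / (batch_size * (1.0 - momentum))

def matched_batch_size(target_g, learning_rate, train_size, momentum=0.0):
    """Batch size B that solves g = epsilon * N / (B * (1 - m)) for a target g."""
    return learning_rate * train_size / (target_g * (1.0 - momentum))

# Hypothetical baseline that is assumed to sit at the optimal noise scale.
g_opt = noise_scale(learning_rate=1.0, train_size=1000, batch_size=30)

print(matched_batch_size(g_opt, learning_rate=2.0, train_size=1000))   # doubled learning rate
print(matched_batch_size(g_opt, learning_rate=1.0, train_size=2000))   # doubled training set
print(matched_batch_size(g_opt, learning_rate=1.0, train_size=1000,
                         momentum=0.5))                                # momentum 0.5
```

In practice the result would be rounded to an integer batch size, and the rule is only expected to hold while the learning rate is small enough that discretization error stays negligible.<br />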
<br />
==Critiques==<br />
<br />
#Bayesian statistics is not, at present, provably a theory that can be used to explain why a learning algorithm works. The Bayesian theory is too optimistic: we introduce a prior and model and then trust both implicitly. Relative to any particular prior and model (likelihood), the Bayesian posterior is the optimal summary of the data, but if either part is misspecified, then the Bayesian posterior carries no optimality guarantee. The prior is chosen for convenience here. <br />
#There is no discussion of the information bottleneck analysis, which also addresses the generalization ability of the model. <br />
#There is no discussion of true online learning with streaming data, where the total number of data points is unknown.<br />
#The paper shows how mini-batch noise in SGD can improve the performance of neural networks. However, the usefulness of the approach could be described and analyzed in greater detail if the authors reported performance on various well-known real-life datasets.<br />
<br />
==Conclusion==<br />
<br />
The paper showed that mini-batch noise helps SGD move away from sharp minima, and provided evidence that there is an optimal batch size which maximizes the test accuracy. Interpreting SGD as integrating a stochastic differential equation, this batch size is proportional to the learning rate and the training set size. Moreover, the authors showed that <math>B_{opt} \propto 1/(1 - m) </math>, where <math>m</math> is the momentum coefficient. Further analysis of the relation between the learning rate, effective learning rate, and batch size was presented at ICLR 2018 (Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, Quoc V. Le), where the authors showed experimentally that the benefits of decaying the learning rate can be obtained by instead increasing the batch size, which also dramatically reduces the number of parameter updates, and that they were able to use hyper-parameters from the literature without any additional tuning.<br />
<br />
==References==<br />
<br />
#Alessandro Achille and Stefano Soatto. On the emergence of invariance and disentangling in deep representations. arXiv preprint arXiv:1706.01350, 2017.<br />
#Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086, 2016.<br />
#Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012. <br />
#Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. arXiv preprint arXiv:1710.11029, 2017.<br />
#Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.<br />
#Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.<br />
#Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933, 2017.<br />
#Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.<br />
#Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.<br />
#Crispin W Gardiner. Handbook of Stochastic Methods, volume 4. Springer Berlin, 1985.<br />
#Pascal Germain, Francis Bach, Alexandre Lacoste, and Simon Lacoste-Julien. PAC-bayesian theory meets bayesian inference. In Advances in Neural Information Processing Systems, pp. 1884–1892, 2016.<br />
#Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.<br />
#Stephen F Gull. Bayesian inductive inference and maximum entropy. In Maximum-entropy and Bayesian methods in science and engineering, pp. 53–74. Springer, 1988.<br />
#Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pp. 5–13. ACM, 1993.<br />
#Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.<br />
#Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.<br />
#Stanisław Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.<br />
#Robert E Kass and Adrian E Raftery. Bayes factors. Journal of the american statistical association, 90(430):773–795, 1995.<br />
#Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017.<br />
#Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.<br />
#David Krueger, Nicolas Ballas, Stanislaw Jastrzebski, Devansh Arpit, Maxinder S Kanwal, Tegan Maharaj, Emmanuel Bengio, Asja Fischer, and Aaron Courville. Deep nets don’t learn via memorization. ICLR Workshop, 2017.<br />
#Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, pp. 2101–2110, 2017.<br />
#David JC MacKay. A practical bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992.<br />
#Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.<br />
#Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.<br />
#Samuel L. Smith, Pieter-Jan Kindermans, and Quoc V. Le. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.<br />
#Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.<br />
#Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_Bayesian_Perspective_on_Generalization_and_Stochastic_Gradient_Descent&diff=42158A Bayesian Perspective on Generalization and Stochastic Gradient Descent2018-11-30T23:36:08Z<p>Z43ma: </p>
<hr />
<div>==Introduction==<br />
This paper shows that Bayesian principles can explain many recent observations in the deep learning literature, and provide practical new insights. This work builds on Zhang et al. (2016), who showed that deep neural networks can easily memorize randomly labeled training data, despite generalizing well on real labels of the same inputs. The authors consider two questions: how can we predict if a minimum will generalize to the test set, and why does stochastic gradient descent find minima that generalize well? <br />
<br />
The paper shows that the same phenomenon occurs even in small linear models. These observations are explained by the Bayesian evidence, which penalizes sharp minima but is invariant to model parameterization. They also demonstrate that, when one holds the learning rate fixed, there is an optimum batch size which maximizes the test set accuracy.<br />
<br />
The authors propose that the noise introduced by small mini-batches drives the parameters towards minima whose evidence is large. Interpreting stochastic gradient descent as a stochastic differential equation, they identify the “noise scale” <math display="inline"> g \approx \epsilon N/B </math> where <math display="inline">\epsilon</math> is the learning rate, <math display="inline">N</math> the training set size and <math display="inline">B</math> the batch size. Consequently the optimum batch size is proportional to both the learning rate and the size of the training set, <math display="inline">B_{opt} \propto \epsilon N</math>. The authors verify these predictions empirically.<br />
<br />
==Motivation and Related Work==<br />
Zhang et al. (2016) trained deep convolutional networks on ImageNet and CIFAR10, achieving excellent accuracy on both training and test sets. They then took the same input images, but randomized the labels, and found that while their networks were now unable to generalize to the test set, they still memorized the training labels. They claimed these results contradict learning theory, although this claim is disputed (Kawaguchi et al., 2017; Dziugaite & Roy, 2017). Nonetheless, their results raise the question: if our models can assign arbitrary labels to the training set, why do they work so well in practice? <br />
<br />
Meanwhile, Keskar et al. (2016) observed that if we hold the learning rate fixed and increase the batch size, the test accuracy usually falls. This striking result shows improving the estimate of the full-batch gradient can harm performance. Goyal et al. (2017) observed a linear scaling rule between batch size and learning rate in a deep ResNet, while Hoffer et al. (2017) proposed a square root rule on theoretical grounds.<br />
<br />
Many authors have suggested “broad minima” whose curvature is small may generalize better than “sharp minima” whose curvature is large (Chaudhari et al., 2016; Hochreiter & Schmidhuber, 1997). Indeed, Dziugaite & Roy (2017) argued the results of Zhang et al. (2016) can be understood using “nonvacuous” PAC-Bayes generalization bounds which penalize sharp minima, while Keskar et al. (2016) showed stochastic gradient descent (SGD) finds wider minima as the batch size is reduced. However, Dinh et al. (2017) challenged this interpretation, by arguing that the curvature of a minimum can be arbitrarily increased by changing the model parameterization.<br />
<br />
==Contribution==<br />
<br />
The main contributions of this paper are to show that:<br />
* The results of Zhang et al. (2016) are not unique to deep learning; the same phenomenon is observed in a small “over-parameterized” linear model. Over-parameterization occurs when a model is able to effectively “remember” the training data, which happens when there are enough parameters that the system of equations has an infinite number of possible solutions. One can see why this over-training would lead to poor results on test cases, as this “memorization” learns noise as opposed to the inherent structure of the different classes. It is demonstrated that this phenomenon is straightforwardly understood by evaluating the Bayesian evidence in favor of each model, which penalizes sharp minima but is invariant to the model parameterization.<br />
* SGD integrates a stochastic differential equation whose “noise scale” is <math>g \approx \epsilon N/B</math>, where <math>\epsilon</math> is the learning rate, <math>N</math> the training set size, and <math>B</math> the batch size. Noise drives SGD away from sharp minima, and therefore there is an optimal batch size which maximizes the test set accuracy. This optimal batch size is '''proportional to the learning rate and training set size'''.<br />
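<br />
The remark that an over-parameterized system has infinitely many solutions can be checked numerically. The following is only a minimal sketch under stated assumptions: it uses plain least squares on synthetic Gaussian inputs (not the logistic regression on MNIST used in the paper), and the sizes, seed and variable names are hypothetical.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_params = 20, 50                  # fewer equations than unknowns
X = rng.standard_normal((n_samples, n_params))
y = rng.choice([-1.0, 1.0], size=n_samples)   # random labels, no structure

# Minimum-norm solution of the underdetermined system X w = y.
w_min, *_ = np.linalg.lstsq(X, y, rcond=None)

# Any null-space direction of X can be added without changing the fit.
_, _, vt = np.linalg.svd(X, full_matrices=True)
null_direction = vt[-1]                       # X @ null_direction is ~0
w_other = w_min + 3.0 * null_direction

fits_exactly = np.allclose(X @ w_min, y)
other_fits_too = np.allclose(X @ w_other, y)
```

Both weight vectors reproduce the random labels exactly, which illustrates why fitting the training set alone says little about generalization.<br />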
<br />
Zhang et al. (2016) showed high training competency of neural networks under informative labels, but drastic overfitting on improper labels. This implies weak generalizability even when a small proportion of labels are improper. The authors show that generalization is strongly correlated with the Bayesian evidence, a weighted combination of the depth of a minimum (the cost function) and its breadth (the Occam factor). Bayesians tend to make distributional assumptions on gradient updates by adding isotropic Gaussian noise. This paper builds upon these Bayesian principles by driving SGD away from sharp minima and towards broad minima (the broader the minimum, the better the generalization, since small perturbations of the input have less influence). The noise term in the stochastic differential equation that describes the gradient updates effectively serves as injected noise that improves a network's generalizability.<br />
<br />
==Main Results==<br />
<br />
When weakly regularized, the model memorizes random labels yet still generalizes properly on informative labels, although its predictions are overconfident. The authors also showed that the test accuracy peaks at an optimal batch size, if one holds the other SGD hyper-parameters constant. It is postulated that the optimum represents a tradeoff between depth and breadth in the Bayesian evidence. However, it is the underlying scale of random fluctuations in the SGD dynamics which controls the tradeoff, not the batch size itself. Furthermore, this test accuracy peak shifts as the training set size rises. The authors observed that the best observed batch size is proportional to the learning rate. This scaling rule allowed the authors to increase the learning rate by simultaneously increasing the batch size with no loss in test accuracy and no increase in computational cost, so parallelism across multiple GPUs can be fully leveraged to decrease training time. The scaling rule could also be applied to production models by progressively increasing the batch size as new training data is introduced.<br />
<br />
==Bayesian Model Comparison==<br />
<br />
===Introduction to Bayesian Statistics===<br />
Bayes' theorem is a fundamental theorem in Bayesian statistics, as it is used by Bayesian methods to update probabilities, which are degrees of belief, after obtaining new data. Given two events <math>A</math> and <math>B</math> with <math>P(B) \neq 0</math>, Bayes' theorem gives the conditional probability of <math>A</math> given that <math>B</math> is true:<br />
\begin{align*}\displaystyle P(A\mid B)={\frac {P(B\mid A)P(A)}{P(B)}}\end{align*}<br />
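<br />
As a small worked example of the formula above, the sketch below computes a posterior from a prior and two likelihoods. The numbers (a 1% prior, a 95% true-positive rate and a 5% false-positive rate) are made up purely for illustration, and the function name is hypothetical.<br />

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """P(A|B) from Bayes' theorem, with P(B) expanded by total probability."""
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1.0 - prior_a))
    return likelihood_b_given_a * prior_a / evidence

# Illustrative numbers only: rare event A, fairly reliable observation B.
p = posterior(prior_a=0.01,
              likelihood_b_given_a=0.95,
              likelihood_b_given_not_a=0.05)
```

With these inputs the posterior is roughly 0.16: observing <math>B</math> raises the belief in <math>A</math> well above the 1% prior, but the small prior still dominates.<br />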
<br />
Bayesian networks are DAGs whose nodes represent variables in the Bayesian sense: they may be observable quantities, latent variables, unknown parameters or hypotheses. Edges represent conditional dependencies; nodes that are not connected (no path connects one node to another) represent variables that are conditionally independent of each other. Each node is associated with a probability function that takes, as input, a particular set of values for the node's parent variables, and gives (as output) the probability (or probability distribution, if applicable) of the variable represented by the node. For example, if <math>m </math> parent nodes represent <math>m </math> Boolean variables then the probability function could be represented by a table of <math>2^{m} </math> entries, one entry for each of the <math>2^{m} </math> possible parent combinations. <br />
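<br />
The table of <math>2^{m}</math> entries described above can be sketched as a dictionary keyed by parent assignments. This is a minimal illustration; the helper name and the rule used to fill in the probabilities are hypothetical.<br />

```python
from itertools import product

def build_cpt(num_parents, prob_true):
    """Map every Boolean parent assignment to P(child = True | parents)."""
    return {parents: prob_true(parents)
            for parents in product([False, True], repeat=num_parents)}

# Hypothetical rule: the child is more likely true as more parents are true.
m = 3
cpt = build_cpt(m, lambda parents: 0.1 + 0.8 * sum(parents) / m)
num_entries = len(cpt)   # one entry per parent combination
```

Looking up <code>cpt[(True, False, True)]</code> returns the probability that the child is true for that particular combination of parent values.<br />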
<br />
===Bayesian Model Comparison in Neural Networks===<br />
MacKay (1992) applied Bayesian model comparison to neural networks. An overview is presented below. <br />
<br />
We first consider a classification model <math>M </math> with a single parameter <math>\omega </math>, training inputs <math>x </math> and training labels <math>y </math>. We can infer a posterior probability distribution over the parameter by applying Bayes' theorem:<br />
<br />
\begin{align*}P(\omega\mid y,x;M) = \frac{P(y\mid \omega,x;M)P(\omega;M) }{P(y\mid x;M)}\end{align*}<br />
<br />
The likelihood is <math>P(y\mid \omega,x;M) = \prod_i P(y_i\mid \omega,x_i;M) = e^{-H(\omega;M)} </math>, where <math>H(\omega;M) </math> denotes the cross-entropy of unique categorical labels. Using a Gaussian prior, <math>P(\omega;M) = \sqrt{\lambda/2\pi}\, e^{-\lambda\omega^2/2} </math>, the posterior probability density of the parameter given the training data is <math>P(\omega\mid y,x;M) \propto \sqrt{\lambda/2\pi}\, e^{-C(\omega;M)} </math>, where <math>C(\omega;M) = H(\omega;M) + \lambda\omega^2/2 </math> denotes the L2 regularized cross entropy, or “cost function”, and <math>\lambda </math> is the regularization coefficient. <br />
<br />
The value <math>\omega_0 </math> which minimizes the cost function lies at the maximum of this posterior. To predict an unknown label <math>y_t </math> of a new input <math>x_t </math>, we should compute the integral,<br />
<br />
\begin{align*} P(y_t\mid x_t,y,x;M) &= \int d\omega \, P(y_t\mid \omega,x_t;M)\, P(\omega\mid y,x;M)\\ &= \frac{\int d \omega P(y_t \mid \omega ,x_t;M)e^{-C(\omega;M)}}{\int d \omega e^{-C(\omega;M)}} \end{align*}<br />
<br />
However, these integrals are dominated by the region near <math>\omega_0 </math> . We usually approximate <math>P(y_t\mid x_t,x,y;M) \approx P(y_t\mid \omega_0,x_t;M) </math>. Having minimized <math>C(\omega;M) </math> to find <math>\omega_0 </math>, we now wish to compare two different models and select the best one. We use the probability ratio<br />
<br />
\begin{align*}\frac{P(M_1\mid y,x)}{P (M_2\mid y, x)} = \frac{P(y\mid x;M_1) P(M_1)}{ P (y\mid x; M_2) P (M_2)} . \end{align*} <br />
<br />
The second factor on the right is the prior ratio, which describes which model is most plausible. To avoid unnecessary subjectivity, we usually set this to 1. Meanwhile the first factor on the right is the evidence ratio, which controls how much the training data changes our prior beliefs.<br />
<br />
Germain et al. (2016) showed that maximizing the evidence (or “marginal likelihood”) minimizes a PAC-Bayes generalization bound. To compute it, we evaluate <br />
\begin{align*}P(y\mid x;M) &= \int d\omega P(y\mid \omega,x;M)P(\omega;M) \\ &=\sqrt{\frac{\lambda}{2\pi}}\int d \omega e^{-C(\omega;M)}\end{align*}<br />
<br />
Notice that the evidence is computed by integrating out the parameters; and consequently it is invariant to the model parameterization. <br />
Since this integral is dominated by the region near the minimum <math>\omega_0 </math>, we can estimate the evidence by Taylor expanding <math>C(\omega; M) \approx C(\omega_0) + C''(\omega_0)(\omega - \omega_0)^2/2</math>. This gives us<br />
<br />
\begin{align*} P(y\mid x;M) &\approx e^{-C(\omega_0)}\sqrt{\frac{\lambda}{2\pi}} \int d \omega e^{-C''(\omega_0)(\omega - \omega_0)^2/2}\\ &= \exp \big\{- C(\omega_0)-\frac{1}{2}\ln(C''(\omega_0)/\lambda) \big\}.\end{align*}<br />
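<br />
The estimate above can be compared with a brute-force evaluation of the evidence integral. The sketch below is only an illustration under stated assumptions: the cost is a made-up quadratic, for which the second-order expansion is exact, and the grid, step size and names are hypothetical choices.<br />

```python
import numpy as np

lam = 0.5  # regularization coefficient (precision of the Gaussian prior)

def cost(omega):
    """Toy cost C = H + lam * omega^2 / 2 with a made-up quadratic H."""
    return 3.0 * (omega - 1.0) ** 2 / 2.0 + lam * omega ** 2 / 2.0

grid = np.linspace(-20.0, 20.0, 200001)
values = cost(grid)

# Brute force: sqrt(lam / 2 pi) * integral of exp(-C), by the trapezoid rule.
integrand = np.exp(-values)
integral = np.sum((integrand[1:] + integrand[:-1]) * np.diff(grid)) / 2.0
evidence_numeric = np.sqrt(lam / (2.0 * np.pi)) * integral

# Estimate: locate the minimum, measure its curvature by finite differences.
omega0 = grid[np.argmin(values)]
h = 1e-3
curvature = (cost(omega0 + h) - 2.0 * cost(omega0) + cost(omega0 - h)) / h ** 2
evidence_laplace = np.exp(-cost(omega0) - 0.5 * np.log(curvature / lam))
```

For this quadratic cost the two numbers agree to within numerical error; for a non-quadratic cost such as a cross-entropy they would differ by the size of the neglected higher-order terms.<br />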
<br />
The evidence is controlled by the value of the cost function at the minimum, and by the logarithm of the ratio of the curvature about this minimum compared to the regularization constant. In models with <math>p</math> parameters, the curvature is replaced by the determinant of the Hessian of the cost function at <math>\omega_0</math>, whose eigenvalues we denote <math>\lambda_i</math>:<br />
\begin{align*} P(y\mid x;M) &\approx \frac{\lambda^{p/2} e^{-C(\omega_0)}}{\left|\nabla\nabla C(\omega)\right|_{\omega_0}^{1/2}} \\ &= \exp \big\{- C(\omega_0)-\frac{1}{2} \sum_{i=1}^p \ln (\lambda_i/\lambda) \big\}.\end{align*}<br />
<br />
Occam’s factor arises from the log ratio <math>\ln (\lambda_i/\lambda) </math>. The Occam factor describes the fraction of the prior parameter space consistent with the data, and penalizes the amount of information the model must learn about the parameters to accurately model the training data. Since this fraction is always less than one, the authors propose to approximate <math>P(y\mid x;M) </math> away from local minima by only performing the summation over eigenvalues <math>\lambda_i \geq \lambda </math>.<br />
<br />
The authors compare the evidence against a null model which assumes the labels are entirely random. This model has no parameters, and so the evidence is controlled by the likelihood alone: <math>P(y\mid x;NULL) = (1/n)^N = e^{-N \ln(n)} </math>, where <math>n </math> denotes the number of model classes and <math>N</math> the number of training labels. The evidence ratio is<br />
\begin{equation*}\frac{P(y\mid x;M) }{P(y\mid x;NULL) } = e ^{-E(\omega_0)} \end{equation*}<br />
where <math>E(\omega_0) = C(\omega_0)+\frac{1}{2} \sum_{i=1}^p \ln (\lambda_i/\lambda) - N\ln (n) </math> is the log evidence ratio in favor of the null model.<br />
The authors assign confidence to the predictions of a model iff <math>E(\omega_0) < 0 </math>.<br />
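<br />
The log evidence ratio can be sketched as a short function once the cost at the minimum and the Hessian eigenvalues are known. This is a minimal illustration, not the authors' code: the sign of the Occam term is the one obtained by dividing the model evidence by the null evidence, the function name and the <code>clip_to_prior</code> flag (which keeps only eigenvalues <math>\lambda_i \geq \lambda</math>) are hypothetical, and the input numbers are made up.<br />

```python
import math

def log_evidence_ratio(cost_at_min, hessian_eigs, lam, num_labels, num_classes,
                       clip_to_prior=True):
    """E(omega_0) = C(omega_0) + 0.5 * sum_i ln(lambda_i / lam) - N ln(n).

    Negative values favor the trained model over the null model.
    """
    eigs = [e for e in hessian_eigs if e >= lam] if clip_to_prior else list(hessian_eigs)
    occam = 0.5 * sum(math.log(e / lam) for e in eigs)
    return cost_at_min + occam - num_labels * math.log(num_classes)

# Made-up inputs: two parameters, 20 binary training labels.
e_clipped = log_evidence_ratio(10.0, [2.0, 0.5], lam=1.0,
                               num_labels=20, num_classes=2)
e_full = log_evidence_ratio(10.0, [2.0, 0.5], lam=1.0,
                            num_labels=20, num_classes=2, clip_to_prior=False)
confident = e_clipped < 0
```

Both values are negative for these inputs, so under the rule above the model's predictions would be trusted over random guessing.<br />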
<br />
The evidence supports the intuition that broad minima generalize better than sharp minima, but unlike the curvature it does not depend on the model parameterization. Dinh et al. (2017) showed one can increase the Hessian eigenvalues by rescaling the parameters, but they must simultaneously rescale the regularization coefficients, otherwise the model changes. Since Occam’s factor arises from the log ratio <math>\ln (\lambda_i/\lambda) </math>, these two effects cancel out. Note however that while the evidence itself is invariant to model parameterization, one can find reparameterizations which change the approximate evidence after the Laplace approximation. It is difficult to evaluate the evidence for deep networks, as we cannot compute the Hessian of millions of parameters. Additionally, neural networks exhibit many equivalent minima, since we can permute the hidden units without changing the model. To compute the evidence we must carefully account for this “degeneracy”. The authors argue these issues are not a major limitation, since the intuition they build studying the evidence in simple cases will be sufficient to explain the results of both Zhang et al. (2016) and Keskar et al. (2016).<br />
<br />
==Bayes Theorem and Generalization==<br />
Zhang et al. (2016) showed that deep neural networks generalize well on training inputs with informative labels, but the same model can overfit on the same input images when the labels are randomized, perfectly memorizing the training set. To demonstrate that these observations are not unique to deep networks, the authors use logistic regression. They form a small balanced training set comprising 800 images from MNIST, of which half have true label “0” and half true label “1”. The test set is balanced, comprising 5000 MNIST images of zeros and 5000 MNIST images of ones. There are two tasks. In the first task, the labels of both the training and test sets are randomized. In the second task, the labels are informative, matching the true MNIST labels. The model has 784 weights and 1 bias.<br />
<br />
The accuracy of the model predictions on both the training and test sets is shown in figure 1. When trained on the informative labels, the model generalizes well to the test set, so long as it is weakly regularized. However the model also perfectly memorizes the random labels, replicating the observations of Zhang et al. (2016) in deep networks. No significant improvement in model performance is observed as the regularization coefficient increases. For completeness, we also evaluate the mean margin between training examples and the decision boundary. For both random and informative labels, the margin drops significantly as we reduce the regularization coefficient. When weakly regularized, the mean margin is roughly 50% larger for informative labels than for random labels.<br />
<br />
[[File:bg1.png|800px|thumb|center|]]<br />
<br />
Now consider figure 2, where we plot the mean cross-entropy of the model predictions, evaluated on both training and test sets, as well as the Bayesian log evidence ratio defined in the previous section. Looking first at the random label experiment in figure 2a, while the cross-entropy on the training set vanishes when the model is weakly regularized, the cross-entropy on the test set explodes. Not only does the model make random predictions, but it is extremely confident in those predictions. As the regularization coefficient is increased the test set cross-entropy falls, settling at ln 2, the cross-entropy of assigning equal probability to both classes. Now consider the Bayesian evidence, which we evaluate on the training set. The log evidence ratio is large and positive when the model is weakly regularized, indicating that the model is exponentially less plausible than assigning equal probabilities to each class. As the regularization parameter is increased, the log evidence ratio falls, but it is always positive, indicating that the model can never be expected to generalize well.<br />
Now consider figure 2b (informative labels). Once again, the training cross-entropy falls to zero when the model is weakly regularized, while the test cross-entropy is high. Even though the model makes accurate predictions, those predictions are overconfident. As the regularization coefficient increases, the test cross-entropy falls below ln 2, indicating that the model is successfully generalizing to the test set. Now consider the Bayesian evidence. The log evidence ratio is large and positive when the model is weakly regularized, but as the regularization coefficient increases, the log evidence ratio drops below zero, indicating that the model is exponentially more plausible than assigning equal probabilities to each class. As we further increase the regularization, the log evidence ratio rises to zero while the test cross-entropy rises to ln 2. Test cross-entropy and Bayesian evidence are strongly correlated, with minima at the same regularization strength.<br />
Bayesian model comparison has explained our results in a logistic regression. Meanwhile, Krueger et al. (2017) showed the largest Hessian eigenvalue also increased when training on random labels in deep networks, implying the evidence is falling. We conclude that Bayesian model comparison is quantitatively consistent with the results of Zhang et al. (2016) in linear models where we can compute the evidence, and qualitatively consistent with their results in deep networks where we cannot. Dziugaite & Roy (2017) recently demonstrated the results of Zhang et al. (2016) can also be understood by minimising a PAC-Bayes generalization bound which penalizes sharp minima.<br />
[[File:bg2.png|800px|thumb|center|]]<br />
==Bayes Theorem and Stochastic Gradient Descent ==<br />
<br />
We showed above that generalization is strongly correlated with the Bayesian evidence, a weighted combination of the depth of a minimum (the cost function) and its breadth (the Occam factor). Consequently Bayesians often add isotropic Gaussian noise to the gradient (Welling & Teh, 2011). In appendix A, we show this drives the parameters towards broad minima whose evidence is large. The noise introduced by small batch training is not isotropic, and its covariance matrix is a function of the parameter values, but empirically Keskar et al. (2016) found it has similar effects, driving the SGD away from sharp minima. This paper therefore proposes Bayesian principles also account for the “generalization gap”, whereby the test set accuracy often falls as the SGD batch size is increased (holding all other hyper-parameters constant). Since the gradient drives the SGD towards deep minima, while noise drives the SGD towards broad minima, we expect the test set performance to show a peak at an optimal batch size, which balances these competing contributions to the evidence.<br />
We were unable to observe a generalization gap in linear models (since linear models are convex there are no sharp minima to avoid). Instead we consider a shallow neural network with 800 hidden units and ReLU hidden activations, trained on MNIST without regularization. We use SGD with a momentum parameter of 0.9. Unless otherwise stated, we use a constant learning rate of 1.0 which does not depend on the batch size or decay during training. Furthermore, we train on just 1000 images, selected at random from the MNIST training set. This enables us to compare small batch to full batch training. We emphasize that we are not trying to achieve optimal performance, but to study a simple model which shows a generalization gap between small and large batch training.<br />
In figure 3, we exhibit the evolution of the test accuracy and test cross-entropy during training. Our small batches are composed of 30 images, randomly sampled from the training set. Looking first at figure 3a, small batch training takes longer to converge, but after a thousand gradient updates a clear generalization gap in model accuracy emerges between small and large training batches. Now consider figure 3b. While the test cross-entropy for small batch training is lower at the end of training, the cross-entropy of both small and large training batches is increasing, indicative of over-fitting. Both models exhibit a minimum test cross-entropy, although after different numbers of gradient updates. Intriguingly, we show in appendix B that the generalization gap between small and large batch training shrinks significantly when we introduce L2 regularization.<br />
<br />
[[File:bg3.png|800px|thumb|center|]]<br />
<br />
From now on we focus on the test set accuracy (since this converges as the number of gradient updates increases). In figure 4a, we exhibit training curves for a range of batch sizes between 1 and 1000. We find that the model cannot train when the batch size <math>B \lesssim 10</math>. In figure 4b we plot the mean test set accuracy after 10000 training steps. A clear peak emerges, indicating that there is indeed an optimum batch size which maximizes the test accuracy, consistent with Bayesian intuition. The results of Keskar et al. (2016) focused on the decay in test accuracy above this optimum batch size.<br />
[[File:bg4.png|800px|thumb|center|]]<br />
<br />
==Stochastic Differential Equations and Scaling Rules==<br />
The results shown above indicate that the test accuracy peaks at an optimal batch size, if one holds the other SGD hyper-parameters constant. It is argued that this peak arises from the tradeoff between depth and breadth in the Bayesian evidence. However it is not the batch size itself which controls this tradeoff, but the underlying scale of random fluctuations in the SGD dynamics. The following content identifies this SGD “noise scale”, and uses it to derive three scaling rules which predict how the optimal batch size depends on the learning rate, training set size and momentum coefficient. <br />
First, interpret the SGD gradient update as the discrete update of a stochastic differential equation <br />
\begin{equation*}\frac{d\omega}{dt} = -\frac{dC}{d\omega} + \eta(t)\end{equation*}<br />
Here <math>\eta</math> represents noise with <math>\langle \eta(t) \rangle = 0</math> and <math> \langle \eta (t)\eta (t')\rangle = gF (\omega)\delta (t-t')</math>.<br />
<math>t</math> is a continuous variable, and <math>F(\omega)</math> is a matrix describing the gradient covariances.<br />
The SGD noise scale is taken to be <math>g \approx \epsilon N/B</math> where <math>\epsilon</math> is the learning rate, <math>N</math> the training set size and <math>B</math> the batch size.<br />
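<br />
To illustrate how the noise scale sets the size of the fluctuations, the sketch below integrates a one-dimensional version of the stochastic differential equation with the Euler-Maruyama method. Everything here is an assumption made for illustration: the cost is a toy quadratic, <math>F</math> is a constant, and the step size, seed and function names are hypothetical. Both runs deliberately share the same random seed, so the two trajectories differ only through <math>g</math>.<br />

```python
import numpy as np

def noise_scale(eps, n_train, batch_size):
    """Approximate SGD noise scale g = eps * N / B."""
    return eps * n_train / batch_size

def simulate_sde(g, curvature=1.0, F=1.0, dt=0.01, steps=5000, seed=0):
    """Euler-Maruyama for d(omega)/dt = -C'(omega) + eta(t), with the toy
    cost C = curvature * omega^2 / 2 and <eta eta'> = g * F * delta(t - t')."""
    rng = np.random.default_rng(seed)
    omega = 0.0
    path = np.empty(steps)
    for i in range(steps):
        noise = np.sqrt(g * F * dt) * rng.standard_normal()
        omega += -curvature * omega * dt + noise
        path[i] = omega
    return path

small_batch = simulate_sde(noise_scale(0.1, 1000, 10))    # g = 10
large_batch = simulate_sde(noise_scale(0.1, 1000, 100))   # g = 1
variance_ratio = small_batch.var() / large_batch.var()
```

Because the dynamics are linear and the noise draws are shared, the variance ratio equals the ratio of the two noise scales (10 here), which is the sense in which a smaller batch produces larger fluctuations around the minimum.<br />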
[[File:bg5.png|800px|thumb|center|]]<br />
[[File:bg6.png|800px|thumb|center|]]<br />
[[File:bg7.png|800px|thumb|center|]]<br />
The noise scale falls when the batch size <math>B</math> increases, consistent with our earlier observation of an optimal batch size <math>B_{opt}</math> while holding the other hyper-parameters fixed. Notice that one would equivalently observe an optimal learning rate if one held the batch size constant. A similar analysis of the SGD was recently performed by Mandt et al. (2017), although their treatment only holds near local minima where the covariances <math>F(\omega)</math> are stationary. Our analysis holds throughout training, which is necessary since Keskar et al. (2016) found that the beneficial influence of noise was most pronounced at the start of training.<br />
When we vary the learning rate or the training set size, we should keep the noise scale fixed, which implies that <math>B_{opt} \propto \epsilon N</math>. In figure 5a, we plot the test accuracy as a function of batch size after <math>(10000/\epsilon)</math> training steps, for a range of learning rates. Exactly as predicted, the peak moves to the right as <math>\epsilon</math> increases. Additionally, the peak test accuracy achieved at a given learning rate does not begin to fall until <math>\epsilon \sim 3</math>, indicating that there is no significant discretization error in integrating the stochastic differential equation below this point. Above this point, the discretization error begins to dominate and the peak test accuracy falls rapidly. In figure 5b, we plot the best observed batch size as a function of learning rate, observing a clear linear trend, <math>B_{opt} \propto \epsilon</math>. The error bars indicate the distance from the best observed batch size to the next batch size sampled in our experiments.<br />
<br />
This scaling rule allows us to increase the learning rate with no loss in test accuracy and no increase in computational cost, simply by simultaneously increasing the batch size. We can then exploit increased parallelism across multiple GPUs, reducing model training times (Goyal et al., 2017). A similar scaling rule was independently proposed by Jastrzebski et al. (2017) and Chaudhari & Soatto (2017), although neither work identifies the existence of an optimal noise scale. A number of authors have proposed adjusting the batch size adaptively during training (Friedlander & Schmidt, 2012; Byrd et al., 2012; De et al., 2017), while Balles et al. (2016) proposed linearly coupling the learning rate and batch size within this framework. In Smith et al. (2017), we show empirically that decaying the learning rate during training and increasing the batch size during training are equivalent.<br />
In figure 6a we exhibit the test set accuracy as a function of batch size, for a range of training set sizes after 10000 steps (<math>\epsilon = 1</math> everywhere). Once again, the peak shifts right as the training set size rises, although the generalization gap becomes less pronounced as the training set size increases. In figure 6b, we plot the best observed batch size as a function of training set size, observing another linear trend, <math>B_{opt} \propto N</math>. This scaling rule could be applied to production models, progressively growing the batch size as new training data is collected. We expect production datasets to grow considerably over time, and consequently large batch training is likely to become increasingly common.<br />
Finally, extending the analysis to SGD with momentum gives the noise scale <math>g \approx \epsilon N / (B(1-m))</math>, where <math>m</math> is the momentum coefficient; this reduces to the noise scale of conventional SGD as <math>m \to 0</math>. When <math>m > 0</math>, we obtain an additional scaling rule <math>B_{opt} \propto 1/(1 - m)</math>. This scaling rule predicts that the optimal batch size will increase when the momentum coefficient is increased. In figure 7a we plot the test set performance as a function of batch size after 10000 gradient updates (<math>\epsilon = 1</math> everywhere), for a range of momentum coefficients. In figure 7b, we plot the best observed batch size as a function of the momentum coefficient, and fit our results to the scaling rule above, obtaining remarkably good agreement.<br />
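<br />
The three scaling rules can be combined into one helper that rescales a batch size so that the noise scale stays fixed. This is a minimal sketch, assuming the momentum noise scale has the form <math>g \approx \epsilon N/(B(1-m))</math>, which is consistent with the three proportionalities stated above; the function name and the reference configuration are hypothetical.<br />

```python
def rescaled_batch_size(b_ref, eps_ref, n_ref, m_ref, eps_new, n_new, m_new):
    """Batch size that keeps g = eps * N / (B * (1 - m)) at its reference value."""
    g_ref = eps_ref * n_ref / (b_ref * (1.0 - m_ref))
    return eps_new * n_new / (g_ref * (1.0 - m_new))

# Hypothetical reference configuration, assumed to be well tuned already.
ref = dict(b_ref=32, eps_ref=0.1, n_ref=50000, m_ref=0.9)

b_double_lr = rescaled_batch_size(**ref, eps_new=0.2, n_new=50000, m_new=0.9)
b_double_data = rescaled_batch_size(**ref, eps_new=0.1, n_new=100000, m_new=0.9)
b_more_momentum = rescaled_batch_size(**ref, eps_new=0.1, n_new=50000, m_new=0.95)
```

Doubling the learning rate, doubling the training set, or raising the momentum coefficient from 0.9 to 0.95 each doubles the suggested batch size, from 32 to about 64. The helper only transfers an existing optimum; it does not find the optimal noise scale itself.<br />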
<br />
==Critiques==<br />
<br />
#Bayesian statistics is not provably, at present, a theory that can be used to explain why a learning algorithm works. The Bayesian theory is too optimistic: we introduce a prior and model and then trust both implicitly. Relative to any particular prior and model (likelihood), the Bayesian posterior is the optimal summary of the data, but if either part is misspecified, then the Bayesian posterior carries no optimality guarantee. The prior is chosen for convenience here. <br />
#The paper does not discuss the information bottleneck analysis, which also addresses the generalization ability of the model. <br />
#The paper does not discuss real online learning with streaming data, where the total number of data points is unknown.<br />
#The paper presents how mini-batch noise in SGD can improve the performance of neural networks. However, the usefulness of the approach could be described and analyzed in greater detail if the authors reported the performance on a variety of well-known real-life data sets.<br />
<br />
==Conclusion==<br />
<br />
The paper showed that mini-batch noise helps SGD move away from sharp minima, and provided evidence that there is an optimal batch size which maximizes the test accuracy. Based on interpreting SGD as integrating a stochastic differential equation, this batch size is proportional to the learning rate and the training set size. Moreover, the authors showed that <math>B_{opt} \propto 1/(1 - m)</math>, where <math>m</math> is the momentum coefficient. Further analysis of the relation between the learning rate, effective learning rate, and batch size was presented at ICLR 2018, where the authors showed experimentally that all the benefits of decaying the learning rate are achieved by increasing the batch size, which also dramatically reduces the number of parameter updates; they were also able to use hyperparameters from the literature without any additional hyperparameter tuning (Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, Quoc V. Le).<br />
<br />
==References==<br />
<br />
#Alessandro Achille and Stefano Soatto. On the emergence of invariance and disentangling in deep representations. arXiv preprint arXiv:1706.01350, 2017.<br />
#Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086, 2016.<br />
#Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012. <br />
#Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. arXiv preprint arXiv:1710.11029, 2017.<br />
#Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.<br />
#Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.<br />
#Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933, 2017.<br />
#Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.<br />
#Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.<br />
#Crispin W Gardiner. Handbook of Stochastic Methods, volume 4. Springer Berlin, 1985.<br />
#Pascal Germain, Francis Bach, Alexandre Lacoste, and Simon Lacoste-Julien. PAC-bayesian theory meets bayesian inference. In Advances in Neural Information Processing Systems, pp. 1884–1892, 2016.<br />
#Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.<br />
#Stephen F Gull. Bayesian inductive inference and maximum entropy. In Maximum-entropy and Bayesian methods in science and engineering, pp. 53–74. Springer, 1988.<br />
#Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pp. 5–13. ACM, 1993.<br />
#Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.<br />
#Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.<br />
#Stanisław Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.<br />
#Robert E Kass and Adrian E Raftery. Bayes factors. Journal of the american statistical association, 90(430):773–795, 1995.<br />
#Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017.<br />
#Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.<br />
#David Krueger, Nicolas Ballas, Stanislaw Jastrzebski, Devansh Arpit, Maxinder S Kanwal, Tegan Maharaj, Emmanuel Bengio, Asja Fischer, and Aaron Courville. Deep nets don’t learn via memorization. ICLR Workshop, 2017.<br />
#Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, pp. 2101–2110, 2017.<br />
#David JC MacKay. A practical bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992.<br />
#Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.<br />
#Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.<br />
#Samuel L. Smith, Pieter-Jan Kindermans, and Quoc V. Le. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.<br />
#Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.<br />
#Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_Bayesian_Perspective_on_Generalization_and_Stochastic_Gradient_Descent&diff=42157A Bayesian Perspective on Generalization and Stochastic Gradient Descent2018-11-30T23:34:19Z<p>Z43ma: Grammer update</p>
<hr />
<div>==Introduction==<br />
This paper shows that Bayesian principles can explain many recent observations in the deep learning literature, and provide practical new insights. This work builds on Zhang et al. (2016), who showed that deep neural networks can easily memorize randomly labeled training data, despite generalizing well on real labels of the same inputs. The authors consider two questions: how can we predict if a minimum will generalize to the test set, and why does stochastic gradient descent find minima that generalize well? <br />
<br />
The paper shows that the same phenomenon occurs even in small linear models. These observations are explained by the Bayesian evidence, which penalizes sharp minima but is invariant to model parameterization. They also demonstrate that, when one holds the learning rate fixed, there is an optimum batch size which maximizes the test set accuracy.<br />
<br />
The authors propose that the noise introduced by small mini-batches drives the parameters towards minima whose evidence is large. Interpreting stochastic gradient descent as a stochastic differential equation, they identify the “noise scale” <math display="inline"> g \approx \epsilon N/B </math>, where <math display="inline">\epsilon</math> is the learning rate, <math display="inline">N</math> the training set size and <math display="inline">B</math> the batch size. Consequently the optimum batch size is proportional to both the learning rate and the size of the training set, <math display="inline">B_{opt} \propto \epsilon N</math>. The authors verify these predictions empirically.<br />
<br />
==Motivation and Related Work==<br />
Zhang et al. (2016) trained deep convolutional networks on ImageNet and CIFAR10, achieving excellent accuracy on both training and test sets. They then took the same input images, but randomized the labels, and found that while their networks were now unable to generalize to the test set, they still memorized the training labels. They claimed these results contradict learning theory, although this claim is disputed (Kawaguchi et al., 2017; Dziugaite & Roy, 2017). Nonetheless, their results raise the question: if our models can assign arbitrary labels to the training set, why do they work so well in practice? <br />
<br />
Meanwhile, Keskar et al. (2016) observed that if we hold the learning rate fixed and increase the batch size, the test accuracy usually falls. This striking result shows improving the estimate of the full-batch gradient can harm performance. Goyal et al. (2017) observed a linear scaling rule between batch size and learning rate in a deep ResNet, while Hoffer et al. (2017) proposed a square root rule on theoretical grounds.<br />
<br />
Many authors have suggested “broad minima” whose curvature is small may generalize better than “sharp minima” whose curvature is large (Chaudhari et al., 2016; Hochreiter & Schmidhuber, 1997). Indeed, Dziugaite & Roy (2017) argued the results of Zhang et al. (2016) can be understood using “nonvacuous” PAC-Bayes generalization bounds which penalize sharp minima, while Keskar et al. (2016) showed stochastic gradient descent (SGD) finds wider minima as the batch size is reduced. However, Dinh et al. (2017) challenged this interpretation, by arguing that the curvature of a minimum can be arbitrarily increased by changing the model parameterization.<br />
<br />
==Contribution==<br />
<br />
The main contributions of this paper are to show that:<br />
* The results of Zhang et al. (2016) are not unique to deep learning; the same phenomenon is observed in a small “over-parameterized” linear model. Over-parameterization occurs when a model is able to effectively “remember” the training data, which happens when there are enough parameters that the system of equations has an infinite number of possible solutions. One can see why this over-training would lead to poor results in test cases, as this “memorization” learns noise as opposed to the inherent structure of different classes. It is demonstrated that this phenomenon is straightforwardly understood by evaluating the Bayesian evidence in favor of each model, which penalizes sharp minima but is invariant to the model parameterization.<br />
* SGD integrates a stochastic differential equation whose “noise scale” is <math>g \approx \epsilon N/B</math>, where <math>\epsilon</math> is the learning rate, <math>N</math> the training set size, and <math>B</math> the batch size. Noise drives SGD away from sharp minima, and therefore there is an optimal batch size which maximizes the test set accuracy. This optimal batch size is '''proportional to the learning rate and training set size'''.<br />
<br />
Zhang et al. (2016) showed high training competency of neural networks under informative labels, but drastic overfitting on improper labels. This implies weak generalizability even when a small proportion of labels are improper. The authors show that generalization is strongly correlated with the Bayesian evidence, a weighted combination of the depth of a minimum (the cost function) and its breadth (the Occam factor). Bayesians tend to make distributional assumptions on gradient updates by adding isotropic Gaussian noise. This paper builds upon these Bayesian principles by driving SGD away from sharp minima, and towards broad minima (the more broad, the better generalization due to less influence from small perturbations within input). The stochastic differential equation used as a component of gradient updates effectively serves as injected noise that improves a network's generalizability.<br />
<br />
==Main Results==<br />
<br />
The weakly regularized model memorizes random labels, yet generalizes properly on informative labels. Moreover, its predictions are overconfident. The authors also showed that the test accuracy peaks at an optimal batch size, if one holds the other SGD hyper-parameters constant. It is postulated that the optimum represents a tradeoff between depth and breadth in the Bayesian evidence. However, it is the underlying scale of random fluctuations in the SGD dynamics which controls the tradeoff, not the batch size itself. Furthermore, this test accuracy peak shifts as the training set size rises. The authors observed that the best found batch size is proportional to the learning rate. This scaling rule allowed the authors to increase the learning rate by simultaneously increasing the batch size with no loss in test accuracy and no increase in computational cost, so parallelism across multiple GPUs can be leveraged to decrease training time. The scaling rule could also be applied to production models by progressively increasing the batch size as new training data is introduced.<br />
<br />
==Bayesian Model Comparison==<br />
<br />
===Introduction to Bayesian Statistics===<br />
Bayes' theorem is fundamental to Bayesian statistics: Bayesian methods use it to update probabilities, which are interpreted as degrees of belief, after obtaining new data. Given two events <math>A</math> and <math>B</math> with <math>P(B) \neq 0</math>, Bayes' theorem states that the conditional probability of <math>A</math> given that <math>B </math> is true is<br />
\begin{align*}\displaystyle P(A\mid B)={\frac {P(B\mid A)P(A)}{P(B)}}\end{align*}<br />
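<br />
As a small numerical illustration of the formula above, the following Python sketch updates a prior belief after observing data. The function name and all numbers are hypothetical and chosen only for illustration; they do not come from the paper.<br />

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(A | B) from P(A), P(B | A) and P(B | not A) via Bayes' theorem."""
    # The law of total probability gives the normalizer P(B).
    evidence_b = (likelihood_b_given_a * prior_a
                  + likelihood_b_given_not_a * (1.0 - prior_a))
    return likelihood_b_given_a * prior_a / evidence_b


# Hypothetical numbers: A has prior probability 0.01, and B is observed with
# probability 0.9 when A holds and 0.05 when it does not.
p = posterior(prior_a=0.01, likelihood_b_given_a=0.9, likelihood_b_given_not_a=0.05)
print(round(p, 4))
```

Even though <math>B</math> is far more likely under <math>A</math> than under its complement, the small prior keeps the posterior well below one half, which is the kind of prior-weighted update the model comparison below relies on.<br />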
<br />
Bayesian networks are DAGs whose nodes represent variables in the Bayesian sense: they may be observable quantities, latent variables, unknown parameters or hypotheses. Edges represent conditional dependencies; nodes that are not connected (no path connects one node to another) represent variables that are conditionally independent of each other. Each node is associated with a probability function that takes, as input, a particular set of values for the node's parent variables, and gives (as output) the probability (or probability distribution, if applicable) of the variable represented by the node. For example, if <math>m </math> parent nodes represent <math>m </math> Boolean variables then the probability function could be represented by a table of <math>2^{m} </math> entries, one entry for each of the <math>2^{m} </math> possible parent combinations. <br />
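<br />
The table-based probability function described above can be sketched in a few lines of Python. The node, its parent names and the probability rule are all made up for illustration; the point is only that <math>m</math> Boolean parents give <math>2^{m}</math> table entries.<br />

```python
from itertools import product


def make_cpt(parent_names, prob_true):
    """Build a conditional probability table for a Boolean node.

    prob_true maps a tuple of parent values to P(node = True | parents).
    """
    table = {}
    for values in product([False, True], repeat=len(parent_names)):
        table[values] = prob_true(values)
    return table


# Hypothetical node with m = 3 Boolean parents; the probability rule is made up.
parents = ["rain", "sprinkler", "hose"]
cpt = make_cpt(parents, lambda values: 0.1 + 0.25 * sum(values))
print(len(cpt))  # one entry per combination of parent values
```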
<br />
===Bayesian Model Comparison in Neural Networks===<br />
MacKay (1992) applied Bayesian model comparison to neural networks. An overview is presented below. <br />
<br />
We first consider a classification model <math>M </math> with a single parameter <math>\omega </math>, training inputs <math>x </math> and training labels <math>y </math>. We can infer a posterior probability distribution over the parameter by applying Bayes' theorem:<br />
<br />
\begin{align*}P(\omega\mid y,x;M) = \frac{P(y\mid \omega,x;M)P(\omega;M) }{P(y\mid x;M)}\end{align*}<br />
<br />
The likelihood is <math>P(y\mid \omega,x;M) = \prod_i P(y_i\mid \omega,x_i;M) = e^{-H(\omega;M)} </math>, where <math>H(\omega;M) </math> denotes the cross-entropy of unique categorical labels. We use a Gaussian prior, <math>P(\omega;M) = \sqrt{\lambda/2\pi}\, e^{-\lambda\omega^2/2} </math>, and therefore the posterior probability density of the parameter given the training data is <math>P(\omega\mid y,x;M) \propto \sqrt{\lambda/2\pi}\, e^{-C(\omega;M)} </math>, where <math>C(\omega;M) = H(\omega;M) + \lambda\omega^2/2 </math> denotes the L2 regularized cross entropy, or “cost function”, and <math>\lambda </math> is the regularization coefficient. <br />
<br />
The value <math>\omega_0 </math> which minimizes the cost function lies at the maximum of this posterior. To predict an unknown label <math>y_t </math> of a new input <math>x_t </math>, we should compute the integral,<br />
<br />
\begin{align*} P(y_t\mid x_t,y,x;M) &= \int d\omega\, P(y_t\mid \omega,x_t;M)\, P(\omega\mid y,x;M)\\ &= \frac{\int d \omega\, P(y_t \mid \omega ,x_t;M)e^{-C(\omega;M)}}{\int d \omega\, e^{-C(\omega;M)}} \end{align*}<br />
<br />
However, these integrals are dominated by the region near <math>\omega_0 </math> . We usually approximate <math>P(y_t\mid x_t,x,y;M) \approx P(y_t\mid \omega_0,x_t;M) </math>. Having minimized <math>C(\omega;M) </math> to find <math>\omega_0 </math>, we now wish to compare two different models and select the best one. We use the probability ratio<br />
<br />
\begin{align*}\frac{P(M_1\mid y,x)}{P (M_2\mid y, x)} = \frac{P(y\mid x;M_1) P(M_1)}{ P (y\mid x; M_2) P (M_2)} . \end{align*} <br />
<br />
The second factor on the right is the prior ratio, which describes which model is most plausible. To avoid unnecessary subjectivity, we usually set this to 1. Meanwhile the first factor on the right is the evidence ratio, which controls how much the training data changes our prior beliefs.<br />
<br />
Germain et al. (2016) showed that maximizing the evidence (or “marginal likelihood”) minimizes a PAC-Bayes generalization bound. To compute it, we evaluate <br />
\begin{align*}P(y\mid x;M) &= \int d\omega\, P(y\mid \omega,x;M)P(\omega;M) \\ &=\sqrt{\frac{\lambda}{2\pi}}\int d \omega\, e^{-C(\omega;M)}\end{align*}<br />
<br />
Notice that the evidence is computed by integrating out the parameters; and consequently it is invariant to the model parameterization. <br />
Since this integral is dominated by the region near the minimum <math>\omega_0 </math>, we can estimate the evidence by Taylor expanding <math>C(\omega; M) \approx C(\omega_0) + C''(\omega_0)(\omega - \omega_0)^2/2</math>. This gives us<br />
<br />
\begin{align*} P(y\mid x;M) &\approx e^{-C(\omega_0)}\sqrt{\frac{\lambda}{2\pi}} \int d \omega\, e^{-C''(\omega_0)(\omega - \omega_0)^2/2}\\ &= \exp \big\{- C(\omega_0)-\frac{1}{2}\ln(C''(\omega_0)/\lambda) \big\}.\end{align*}<br />
<br />
The evidence is controlled by the value of the cost function at the minimum, and by the logarithm of the ratio of the curvature about this minimum compared to the regularization constant. In models with <math>p</math> parameters, the curvature is replaced by the Hessian of the cost function at the minimum, whose eigenvalues we denote <math>\lambda_i</math>: <br />
\begin{align*} P(y\mid x;M) &\approx \frac{\lambda^{p/2}\, e^{-C(\omega_0)}}{|\nabla\nabla C(\omega)|_{\omega_0}^{1/2}} \\ &= \exp \big\{- C(\omega_0)-\frac{1}{2} \sum_{i=1}^p \ln (\lambda_i/\lambda) \big\}.\end{align*}<br />
<br />
Occam’s factor arises from the log ratio <math>\ln (\lambda_i/\lambda) </math>. The Occam factor describes the fraction of the prior parameter space consistent with the data. Occam’s factor penalizes the amount of information the model must learn about the parameters to accurately model the training data. Since the fraction is always less than one, the authors propose to approximate <math>P(y\mid x;M) </math> away from local minima by only performing the summation over eigenvalues <math>\lambda_i \geq \lambda </math>.<br />
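<br />
The single-parameter Laplace estimate of the evidence can be checked numerically. The sketch below is only an illustration under stated assumptions: the eight-point data set, the logistic model with a single weight, and <math>\lambda = 1</math> are all hypothetical. It finds <math>\omega_0</math> with Newton's method, evaluates <math>\exp\{-C(\omega_0) - \frac{1}{2}\ln(C''(\omega_0)/\lambda)\}</math>, and compares it with a direct numerical integration of <math>\sqrt{\lambda/2\pi}\int d\omega\, e^{-C(\omega)}</math>.<br />

```python
import math

# Hypothetical 1D data set: inputs x_i with labels s_i in {-1, +1}.
XS = [0.5, 1.0, 1.5, 2.0, -0.5, -1.0, -1.5, 0.3]
SS = [1, 1, 1, 1, -1, -1, -1, -1]
LAM = 1.0  # regularization coefficient lambda (also the prior precision)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost(w):
    """L2-regularized cross-entropy C(w) = H(w) + LAM * w**2 / 2."""
    h = sum(math.log1p(math.exp(-s * w * x)) for x, s in zip(XS, SS))
    return h + 0.5 * LAM * w * w


def cost_grad(w):
    return sum(-s * x * sigmoid(-s * w * x) for x, s in zip(XS, SS)) + LAM * w


def cost_curv(w):
    # sigmoid(z) * (1 - sigmoid(z)) is even in z, so the label sign drops out.
    return sum(x * x * sigmoid(w * x) * (1.0 - sigmoid(w * x)) for x in XS) + LAM


# Newton's method for the minimum w0 of the (strictly convex) cost.
w0 = 0.0
for _ in range(50):
    w0 -= cost_grad(w0) / cost_curv(w0)

# Laplace estimate: exp{-C(w0) - (1/2) ln(C''(w0) / lambda)}.
laplace = math.exp(-cost(w0) - 0.5 * math.log(cost_curv(w0) / LAM))

# Reference value: sqrt(lambda / 2 pi) * integral of exp(-C(w)) dw (trapezoid rule).
lo, hi, n = w0 - 10.0, w0 + 10.0, 20000
dw = (hi - lo) / n
vals = [math.exp(-cost(lo + i * dw)) for i in range(n + 1)]
integral = dw * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
exact = math.sqrt(LAM / (2.0 * math.pi)) * integral

print(f"w0 = {w0:.4f}, Laplace = {laplace:.4e}, numerical = {exact:.4e}")
```

For a posterior that is close to Gaussian around <math>\omega_0</math>, the two numbers should agree closely; larger discrepancies would indicate that the quadratic expansion is a poor description of the cost near the minimum.<br />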
<br />
The authors compare the evidence against a null model which assumes the labels are entirely random. This model has no parameters, and so the evidence is controlled by the likelihood alone: <math>P(y\mid x;NULL) = (1/n)^N = e^{-N \ln(n)} </math>, where <math>n </math> denotes the number of model classes and <math>N</math> the number of training labels. The evidence ratio is<br />
\begin{equation*}\frac{P(y\mid x;M) }{P(y\mid x;NULL) } = e ^{-E(\omega_0)}, \end{equation*}<br />
where <math>E(\omega_0) = C(\omega_0)+\frac{1}{2} \sum_{i=1}^p \ln (\lambda_i/\lambda) - N\ln (n) </math> is the log evidence ratio in favor of the null model.<br />
The authors assign confidence to the predictions of a model if and only if <math>E(\omega_0) < 0 </math>.<br />
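<br />
A minimal sketch of this criterion follows. It takes <math>E(\omega_0)</math> to be minus the logarithm of the evidence ratio, so that dividing the Laplace evidence by the null evidence gives <math>E = C(\omega_0) + \frac{1}{2}\sum_i \ln(\lambda_i/\lambda) - N\ln(n)</math>. The function name, the option to skip eigenvalues below <math>\lambda</math>, and all numbers are hypothetical choices for illustration, not the authors' code.<br />

```python
import math


def log_evidence_ratio(cost_at_min, hessian_eigenvalues, lam, num_labels,
                       num_classes, skip_small_eigenvalues=True):
    """Log evidence ratio E in favor of the null model (E < 0 favors the model)."""
    occam = 0.0
    for eig in hessian_eigenvalues:
        # Optionally sum only over eigenvalues >= lam, as described in the text.
        if skip_small_eigenvalues and eig < lam:
            continue
        occam += 0.5 * math.log(eig / lam)
    return cost_at_min + occam - num_labels * math.log(num_classes)


# Hypothetical values: 800 binary labels, a well-fitting model and a poor one.
e_good = log_evidence_ratio(100.0, [5.0, 2.0, 0.5], lam=1.0,
                            num_labels=800, num_classes=2)
e_poor = log_evidence_ratio(600.0, [5.0, 2.0, 0.5], lam=1.0,
                            num_labels=800, num_classes=2)
print(e_good < 0, e_poor < 0)
```

Under these made-up numbers only the first model would be trusted, since its cost at the minimum is small enough to outweigh the Occam penalty relative to random guessing.<br />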
<br />
The evidence supports the intuition that broad minima generalize better than sharp minima, but unlike the curvature it does not depend on the model parameterization. Dinh et al. (2017) showed one can increase the Hessian eigenvalues by rescaling the parameters, but they must simultaneously rescale the regularization coefficients, otherwise the model changes. Since Occam’s factor arises from the log ratio, <math>\ln (\lambda_i/\lambda) </math>, these two effects cancel out. Note however that while the evidence itself is invariant to model parameterization, one can find reparameterizations which change the approximate evidence after the Laplace approximation. It is difficult to evaluate the evidence for deep networks, as we cannot compute the Hessian of millions of parameters. Additionally, neural networks exhibit many equivalent minima, since we can permute the hidden units without changing the model. To compute the evidence we must carefully account for this “degeneracy”. The authors argue these issues are not a major limitation, since the intuition they build studying the evidence in simple cases will be sufficient to explain the results of both Zhang et al. (2016) and Keskar et al. (2016).<br />
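<br />
The cancellation under rescaling can be seen in a one-parameter example. The sketch below assumes a hypothetical quadratic cost <math>C(\omega) = a(\omega - b)^2/2 + \lambda\omega^2/2</math> and the linear reparameterization <math>\omega = \alpha\theta</math>; the values of <math>a</math>, <math>\lambda</math> and <math>\alpha</math> are arbitrary.<br />

```python
import math


def occam_term(curvature, reg_coeff):
    """Single-parameter Occam term (1/2) * ln(C''(w0) / lambda)."""
    return 0.5 * math.log(curvature / reg_coeff)


# Hypothetical quadratic cost C(w) = a * (w - b)**2 / 2 + lam * w**2 / 2,
# whose curvature with respect to w is a + lam.
a, lam = 3.0, 0.5
results = []
for alpha in (1.0, 10.0, 100.0):
    # Substituting w = alpha * theta multiplies both the curvature and the
    # coefficient of theta**2 / 2 in the penalty by alpha**2.
    curvature_theta = alpha ** 2 * (a + lam)
    lam_theta = alpha ** 2 * lam
    results.append((alpha, curvature_theta, occam_term(curvature_theta, lam_theta)))

for alpha, curvature, occam in results:
    print(f"alpha = {alpha:6.1f}  curvature = {curvature:10.1f}  Occam term = {occam:.6f}")
```

The curvature grows with <math>\alpha^2</math>, so the minimum can be made to look arbitrarily sharp, while the Occam term is unchanged.<br />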
<br />
==Bayes Theorem and Generalization==<br />
Zhang et al. (2016) showed that deep neural networks generalize well on training inputs with informative labels, but the same model can overfit on the same input images when the labels are randomized, perfectly memorizing the training set. To demonstrate that these observations are not unique to deep networks, the authors use logistic regression. They form a small balanced training set comprising 800 images from MNIST, of which half have true label “0” and half true label “1”. The test set is balanced, comprising 5000 MNIST images of zeros and 5000 MNIST images of ones. There are two tasks. In the first task, the labels of both the training and test sets are randomized. In the second task, the labels are informative, matching the true MNIST labels. The model has 784 weights and 1 bias.<br />
<br />
The accuracy of the model predictions on both the training and test sets is shown in figure 1. When trained on the informative labels, the model generalizes well to the test set, so long as it is weakly regularized. However the model also perfectly memorizes the random labels, replicating the observations of Zhang et al. (2016) in deep networks. No significant improvement in model performance is observed as the regularization coefficient increases. For completeness, we also evaluate the mean margin between training examples and the decision boundary. For both random and informative labels, the margin drops significantly as we reduce the regularization coefficient. When weakly regularized, the mean margin is roughly 50% larger for informative labels than for random labels.<br />
<br />
[[File:bg1.png|800px|thumb|center|]]<br />
<br />
Now consider figure 2, where we plot the mean cross-entropy of the model predictions, evaluated on both training and test sets, as well as the Bayesian log evidence ratio defined in the previous section. Looking first at the random label experiment in figure 2a, while the cross-entropy on the training set vanishes when the model is weakly regularized, the cross-entropy on the test set explodes. Not only does the model make random predictions, but it is extremely confident in those predictions. As the regularization coefficient is increased the test set cross-entropy falls, settling at ln 2, the cross-entropy of assigning equal probability to both classes. Now consider the Bayesian evidence, which we evaluate on the training set. The log evidence ratio is large and positive when the model is weakly regularized, indicating that the model is exponentially less plausible than assigning equal probabilities to each class. As the regularization parameter is increased, the log evidence ratio falls, but it is always positive, indicating that the model can never be expected to generalize well.<br />
Now consider figure 2b (informative labels). Once again, the training cross-entropy falls to zero when the model is weakly regularized, while the test cross-entropy is high. Even though the model makes accurate predictions, those predictions are overconfident. As the regularization coefficient increases, the test cross-entropy falls below ln 2, indicating that the model is successfully generalizing to the test set. Now consider the Bayesian evidence. The log evidence ratio is large and positive when the model is weakly regularized, but as the regularization coefficient increases, the log evidence ratio drops below zero, indicating that the model is exponentially more plausible than assigning equal probabilities to each class. As we further increase the regularization, the log evidence ratio rises to zero while the test cross-entropy rises to ln 2. Test cross-entropy and Bayesian evidence are strongly correlated, with minima at the same regularization strength.<br />
Bayesian model comparison has explained our results in a logistic regression. Meanwhile, Krueger et al. (2017) showed the largest Hessian eigenvalue also increased when training on random labels in deep networks, implying the evidence is falling. We conclude that Bayesian model comparison is quantitatively consistent with the results of Zhang et al. (2016) in linear models where we can compute the evidence, and qualitatively consistent with their results in deep networks where we cannot. Dziugaite & Roy (2017) recently demonstrated the results of Zhang et al. (2016) can also be understood by minimising a PAC-Bayes generalization bound which penalizes sharp minima.<br />
[[File:bg2.png|800px|thumb|center|]]<br />
==Bayes Theorem and Stochastic Gradient Descent ==<br />
<br />
We showed above that generalization is strongly correlated with the Bayesian evidence, a weighted combination of the depth of a minimum (the cost function) and its breadth (the Occam factor). Consequently Bayesians often add isotropic Gaussian noise to the gradient (Welling & Teh, 2011). In appendix A, we show this drives the parameters towards broad minima whose evidence is large. The noise introduced by small batch training is not isotropic, and its covariance matrix is a function of the parameter values, but empirically Keskar et al. (2016) found it has similar effects, driving the SGD away from sharp minima. This paper therefore proposes Bayesian principles also account for the “generalization gap”, whereby the test set accuracy often falls as the SGD batch size is increased (holding all other hyper-parameters constant). Since the gradient drives the SGD towards deep minima, while noise drives the SGD towards broad minima, we expect the test set performance to show a peak at an optimal batch size, which balances these competing contributions to the evidence.<br />
We were unable to observe a generalization gap in linear models (since linear models are convex there are no sharp minima to avoid). Instead we consider a shallow neural network with 800 hidden units and ReLU hidden activations, trained on MNIST without regularization. We use SGD with a momentum parameter of 0.9. Unless otherwise stated, we use a constant learning rate of 1.0 which does not depend on the batch size or decay during training. Furthermore, we train on just 1000 images, selected at random from the MNIST training set. This enables us to compare small batch to full batch training. We emphasize that we are not trying to achieve optimal performance, but to study a simple model which shows a generalization gap between small and large batch training.<br />
In figure 3, we exhibit the evolution of the test accuracy and test cross-entropy during training. Our small batches are composed of 30 images, randomly sampled from the training set. Looking first at figure 3a, small batch training takes longer to converge, but after a thousand gradient updates a clear generalization gap in model accuracy emerges between small and large training batches. Now consider figure 3b. While the test cross-entropy for small batch training is lower at the end of training; the cross-entropy of both small and large training batches is increasing, indicative of over-fitting. Both models exhibit a minimum test cross-entropy, although after different numbers of gradient updates. Intriguingly, we show in appendix B that the generalization gap between small and large batch training shrinks significantly when we introduce L2 regularization.<br />
<br />
[[File:bg3.png|800px|thumb|center|]]<br />
<br />
From now on we focus on the test set accuracy (since this converges as the number of gradient updates increases). In figure 4a, we exhibit training curves for a range of batch sizes between 1 and 1000. We find that the model cannot train when the batch size <math>B \lesssim 10</math>. In figure 4b we plot the mean test set accuracy after 10000 training steps. A clear peak emerges, indicating that there is indeed an optimum batch size which maximizes the test accuracy, consistent with Bayesian intuition. The results of Keskar et al. (2016) focused on the decay in test accuracy above this optimum batch size.<br />
[[File:bg4.png|800px|thumb|center|]]<br />
<br />
==Stochastic Differential Equations and Scaling Rules==<br />
The results shown above indicate that the test accuracy peaks at an optimal batch size, if one holds the other SGD hyper-parameters constant. It is argued that this peak arises from the tradeoff between depth and breadth in the Bayesian evidence. However it is not the batch size itself which controls this tradeoff, but the underlying scale of random fluctuations in the SGD dynamics. The following content identifies this SGD “noise scale”, and uses it to derive three scaling rules which predict how the optimal batch size depends on the learning rate, training set size and momentum coefficient. <br />
First, interpret the SGD gradient update as the discrete update of a stochastic differential equation that follows the negative gradient of the cost function, <br />
\begin{equation*}\frac{d\omega}{dt} = -\frac{dC}{d\omega} + \eta(t),\end{equation*}<br />
where <math>\eta(t)</math> represents noise with <math>\langle \eta(t) \rangle = 0</math> and <math> \langle \eta (t)\eta (t')\rangle = gF (\omega)\delta (t-t')</math>.<br />
Here <math>t</math> is a continuous variable, and <math>F(\omega)</math> is a matrix describing the gradient covariances.<br />
The SGD noise scale is taken to be <math>g \approx \epsilon N/B</math> where <math>\epsilon</math> is the learning rate, <math>N</math> the training set size and <math>B</math> the batch size.<br />
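<br />
To make the role of the noise scale concrete, the sketch below integrates a one-dimensional version of this stochastic differential equation with the Euler-Maruyama method, taking the drift to be the negative gradient of the cost, as gradient descent requires. Everything specific here is an assumption for illustration: the quadratic cost <math>C(\omega) = k\omega^2/2</math>, a constant scalar <math>F = 1</math>, the step size, and the chosen learning rate, training set size and batch sizes. For this linear model the stationary variance of <math>\omega</math> is <math>gF/(2k)</math>, so a larger noise scale spreads the parameter over a wider region around the minimum.<br />

```python
import math
import random


def noise_scale(learning_rate, train_size, batch_size):
    """SGD noise scale g ~ epsilon * N / B."""
    return learning_rate * train_size / batch_size


def stationary_variance(g, curvature=1.0, grad_cov=1.0, dt=0.01,
                        steps=200_000, seed=0):
    """Euler-Maruyama for dw/dt = -C'(w) + eta(t) with C(w) = curvature * w**2 / 2.

    Returns the empirical variance of w after discarding a burn-in period.
    """
    rng = random.Random(seed)
    # Over a step dt the integrated noise has variance g * F * dt.
    noise_std = math.sqrt(g * grad_cov * dt)
    burn_in = steps // 10
    w, total, total_sq, count = 0.0, 0.0, 0.0, 0
    for step in range(steps):
        w += -curvature * w * dt + noise_std * rng.gauss(0.0, 1.0)
        if step >= burn_in:
            total += w
            total_sq += w * w
            count += 1
    mean = total / count
    return total_sq / count - mean * mean


# Hypothetical settings: epsilon = 0.01, N = 1000, and two batch sizes.
g_small_batch = noise_scale(0.01, 1000, 100)
g_large_batch = noise_scale(0.01, 1000, 1000)
var_small_batch = stationary_variance(g_small_batch)
var_large_batch = stationary_variance(g_large_batch)
print(var_small_batch, g_small_batch / 2.0)  # empirical vs. predicted g * F / (2 k)
print(var_large_batch, g_large_batch / 2.0)
```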
[[File:bg5.png|800px|thumb|center|]]<br />
[[File:bg6.png|800px|thumb|center|]]<br />
[[File:bg7.png|800px|thumb|center|]]<br />
The noise scale falls when the batch size <math>B</math> increases, consistent with our earlier observation of an optimal batch size <math>B_{opt}</math> while holding the other hyper-parameters fixed. Notice that one would equivalently observe an optimal learning rate if one held the batch size constant. A similar analysis of the SGD was recently performed by Mandt et al. (2017), although their treatment only holds near local minima where the covariances <math>F(\omega)</math> are stationary. Our analysis holds throughout training, which is necessary since Keskar et al. (2016) found that the beneficial influence of noise was most pronounced at the start of training.<br />
When we vary the learning rate or the training set size, we should keep the noise scale fixed, which implies that <math>B_{opt} \propto \epsilon N</math>. In figure 5a, we plot the test accuracy as a function of batch size after <math>(10000/\epsilon)</math> training steps, for a range of learning rates. Exactly as predicted, the peak moves to the right as <math>\epsilon</math> increases. Additionally, the peak test accuracy achieved at a given learning rate does not begin to fall until <math>\epsilon \sim 3</math>, indicating that there is no significant discretization error in integrating the stochastic differential equation below this point. Above this point, the discretization error begins to dominate and the peak test accuracy falls rapidly. In figure 5b, we plot the best observed batch size as a function of learning rate, observing a clear linear trend, <math>B_{opt} \propto \epsilon</math>. The error bars indicate the distance from the best observed batch size to the next batch size sampled in our experiments.<br />
<br />
This scaling rule allows us to increase the learning rate with no loss in test accuracy and no increase in computational cost, simply by simultaneously increasing the batch size. We can then exploit increased parallelism across multiple GPUs, reducing model training times (Goyal et al., 2017). A similar scaling rule was independently proposed by Jastrzebski et al. (2017) and Chaudhari & Soatto (2017), although neither work identifies the existence of an optimal noise scale. A number of authors have proposed adjusting the batch size adaptively during training (Friedlander & Schmidt, 2012; Byrd et al., 2012; De et al., 2017), while Balles et al. (2016) proposed linearly coupling the learning rate and batch size within this framework. In Smith et al. (2017), we show empirically that decaying the learning rate during training and increasing the batch size during training are equivalent.<br />
In figure 6a we exhibit the test set accuracy as a function of batch size, for a range of training set sizes after 10000 steps (<math>\epsilon = 1</math> everywhere). Once again, the peak shifts right as the training set size rises, although the generalization gap becomes less pronounced as the training set size increases. In figure 6b, we plot the best observed batch size as a function of training set size, observing another linear trend, <math>B_{opt} \propto N</math>. This scaling rule could be applied to production models, progressively growing the batch size as new training data is collected. We expect production datasets to grow considerably over time, and consequently large batch training is likely to become increasingly common.<br />
With momentum, the noise scale becomes <math>g \approx \epsilon N/(B(1-m))</math>, where <math>m</math> is the momentum coefficient; this reduces to the noise scale of conventional SGD as <math>m \to 0</math>. When <math>m > 0</math>, we obtain an additional scaling rule <math>B_{opt} \propto 1/(1-m)</math>. This scaling rule predicts that the optimal batch size will increase when the momentum coefficient is increased. In figure 7a we plot the test set performance as a function of batch size after 10000 gradient updates (<math>\epsilon = 1</math> everywhere), for a range of momentum coefficients. In figure 7b, we plot the best observed batch size as a function of the momentum coefficient, and fit our results to the scaling rule above, obtaining remarkably good agreement.<br />
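The three scaling rules can be collected into one small helper. The sketch below is illustrative only: the function names <code>noise_scale</code> and <code>rescaled_batch_size</code> are hypothetical, and it assumes the momentum-corrected noise scale <math>g \approx \epsilon N/(B(1-m))</math>, which is what the rules <math>B_{opt} \propto \epsilon</math>, <math>B_{opt} \propto N</math> and <math>B_{opt} \propto 1/(1-m)</math> jointly imply when the noise scale is held fixed.<br />

```python
def noise_scale(lr, train_size, batch_size, momentum=0.0):
    """Approximate SGD noise scale g = lr * N / (B * (1 - m)).

    This is a sketch of the scaling rules, not code from the paper.
    """
    if not 0.0 <= momentum < 1.0:
        raise ValueError("momentum must lie in [0, 1)")
    return lr * train_size / (batch_size * (1.0 - momentum))


def rescaled_batch_size(batch_size, lr, train_size, momentum,
                        new_lr, new_train_size, new_momentum):
    """Batch size that keeps the noise scale unchanged under new settings."""
    g = noise_scale(lr, train_size, batch_size, momentum)
    # Solve g = new_lr * new_N / (B_new * (1 - new_m)) for B_new.
    return new_lr * new_train_size / (g * (1.0 - new_momentum))


# Hypothetical starting point: batch size 50 tuned at lr = 1.0 on 1000 examples.
print(rescaled_batch_size(50, 1.0, 1000, 0.0, 2.0, 1000, 0.0))  # learning rate doubled
print(rescaled_batch_size(50, 1.0, 1000, 0.0, 1.0, 2000, 0.0))  # training set doubled
```

In both usage lines the suggested batch size doubles, matching the linear trends reported in figures 5b and 6b. The returned value is a starting point for a search rather than a guaranteed optimum.<br />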
<br />
==Critiques==<br />
<br />
#Bayesian statistics is not, at present, a theory that provably explains why a learning algorithm works. The Bayesian approach is too optimistic: we introduce a prior and a model and then trust both implicitly. Relative to any particular prior and model (likelihood), the Bayesian posterior is the optimal summary of the data, but if either part is misspecified, then the Bayesian posterior carries no optimality guarantee. The prior is chosen for convenience here. <br />
#The paper does not discuss the information bottleneck analysis, which also addresses the generalization ability of a model. <br />
#The paper does not discuss true online learning with streaming data, where the total number of data points is unknown.<br />
#The paper shows how mini-batch noise in SGD can improve the performance of neural networks. However, the usefulness of the approach could be analyzed in greater detail if the authors reported performance on a range of well-known real-world datasets.<br />
<br />
==Conclusion==<br />
<br />
The paper showed that mini-batch noise helps SGD move away from sharp minima, and provided evidence that there is an optimal batch size which maximizes the test accuracy. Interpreting SGD as integrating a stochastic differential equation, this batch size is proportional to the learning rate and the training set size. Moreover, the authors showed that <math>B_{opt} \propto 1/(1 - m) </math>, where <math>m</math> is the momentum coefficient. Further analysis of the relation between the learning rate, effective learning rate, and batch size was presented at ICLR 2018, where the authors showed experimentally that all the benefits of decaying the learning rate can be achieved by increasing the batch size instead, which also dramatically reduces the number of parameter updates; they were also able to use hyper-parameters from the literature without any additional tuning (Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, Quoc V. Le).<br />
<br />
==References==<br />
<br />
#Alessandro Achille and Stefano Soatto. On the emergence of invariance and disentangling in deep representations. arXiv preprint arXiv:1706.01350, 2017.<br />
#Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086, 2016.<br />
#Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012. <br />
#Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. arXiv preprint arXiv:1710.11029, 2017.<br />
#Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.<br />
#Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.<br />
#Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933, 2017.<br />
#Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.<br />
#Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.<br />
#Crispin W Gardiner. Handbook of Stochastic Methods, volume 4. Springer Berlin, 1985.<br />
#Pascal Germain, Francis Bach, Alexandre Lacoste, and Simon Lacoste-Julien. PAC-bayesian theory meets bayesian inference. In Advances in Neural Information Processing Systems, pp. 1884–1892, 2016.<br />
#Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.<br />
#Stephen F Gull. Bayesian inductive inference and maximum entropy. In Maximum-entropy and Bayesian methods in science and engineering, pp. 53–74. Springer, 1988.<br />
#Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pp. 5–13. ACM, 1993.<br />
#Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.<br />
#Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.<br />
#Stanisław Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.<br />
#Robert E Kass and Adrian E Raftery. Bayes factors. Journal of the american statistical association, 90(430):773–795, 1995.<br />
#Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017.<br />
#Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.<br />
#David Krueger, Nicolas Ballas, Stanislaw Jastrzebski, Devansh Arpit, Maxinder S Kanwal, Tegan Maharaj, Emmanuel Bengio, Asja Fischer, and Aaron Courville. Deep nets don’t learn via memorization. ICLR Workshop, 2017.<br />
#Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, pp. 2101–2110, 2017.<br />
#David JC MacKay. A practical bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992.<br />
#Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.<br />
#Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.<br />
#Samuel L. Smith, Pieter-Jan Kindermans, and Quoc V. Le. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.<br />
#Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.<br />
#Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_Bayesian_Perspective_on_Generalization_and_Stochastic_Gradient_Descent&diff=42156A Bayesian Perspective on Generalization and Stochastic Gradient Descent2018-11-30T23:29:57Z<p>Z43ma: </p>
<hr />
<div>==Introduction==<br />
This paper shows that Bayesian principles can explain many recent observations in the deep learning literature, and provide practical new insights. This work builds on Zhang et al. (2016), who showed that deep neural networks can easily memorize randomly labeled training data, despite generalizing well on real labels of the same inputs. The authors consider two questions: how can we predict if a minimum will generalize to the test set, and why does stochastic gradient descent find minima that generalize well? <br />
<br />
The paper shows that the same phenomenon occurs even in small linear models. These observations are explained by the Bayesian evidence, which penalizes sharp minima but is invariant to model parameterization. They also demonstrate that, when one holds the learning rate fixed, there is an optimum batch size which maximizes the test set accuracy.<br />
<br />
The authors propose that the noise introduced by small mini-batches drives the parameters towards minima whose evidence is large. Interpreting stochastic gradient descent as a stochastic differential equation, the authors identify the “noise scale” <math display="inline"> g \approx \epsilon N/B </math>, where <math display="inline">\epsilon</math> is the learning rate, <math display="inline">N</math> the training set size and <math display="inline">B</math> the batch size. Consequently the optimum batch size is proportional to both the learning rate and the size of the training set, <math display="inline">B_{opt} \propto \epsilon N</math>. The authors verify these predictions empirically.<br />
<br />
==Motivation and Related Work==<br />
Zhang et al. (2016) trained deep convolutional networks on ImageNet and CIFAR10, achieving excellent accuracy on both training and test sets. They then took the same input images, but randomized the labels, and found that while their networks were now unable to generalize to the test set, they still memorized the training labels. They claimed these results contradict learning theory, although this claim is disputed (Kawaguchi et al., 2017; Dziugaite & Roy, 2017). Nonetheless, their results raise the question: if our models can assign arbitrary labels to the training set, why do they work so well in practice? <br />
<br />
Meanwhile, Keskar et al. (2016) observed that if we hold the learning rate fixed and increase the batch size, the test accuracy usually falls. This striking result shows improving the estimate of the full-batch gradient can harm performance. Goyal et al. (2017) observed a linear scaling rule between batch size and learning rate in a deep ResNet, while Hoffer et al. (2017) proposed a square root rule on theoretical grounds.<br />
<br />
Many authors have suggested “broad minima” whose curvature is small may generalize better than “sharp minima” whose curvature is large (Chaudhari et al., 2016; Hochreiter & Schmidhuber, 1997). Indeed, Dziugaite & Roy (2017) argued the results of Zhang et al. (2016) can be understood using “nonvacuous” PAC-Bayes generalization bounds which penalize sharp minima, while Keskar et al. (2016) showed stochastic gradient descent (SGD) finds wider minima as the batch size is reduced. However, Dinh et al. (2017) challenged this interpretation, by arguing that the curvature of a minimum can be arbitrarily increased by changing the model parameterization.<br />
<br />
==Contribution==<br />
<br />
The main contributions of this paper are to show that:<br />
* The results of Zhang et al. (2016) are not unique to deep learning; the same phenomenon is observed in a small “over-parameterized” linear model. Over-parameterization occurs when a model is able to effectively “remember” the training data, which happens when there are enough parameters that the system of equations has infinitely many solutions. This memorization fits noise rather than the inherent structure of the different classes, which is why it leads to poor results on test cases. The phenomenon is straightforwardly understood by evaluating the Bayesian evidence in favor of each model, which penalizes sharp minima but is invariant to the model parameterization.<br />
* SGD integrates a stochastic differential equation whose “noise scale” is <math>g \approx \epsilon N/B</math>, where <math>\epsilon</math> is the learning rate, <math>N</math> the training set size, and <math>B</math> the batch size. Noise drives SGD away from sharp minima, and therefore there is an optimal batch size which maximizes the test set accuracy. This optimal batch size is '''proportional to the learning rate and training set size'''.<br />
<br />
Zhang et al. (2016) showed high training competency of neural networks under informative labels, but drastic overfitting on improper labels. This implies weak generalizability even when a small proportion of labels are improper. The authors show that generalization is strongly correlated with the Bayesian evidence, a weighted combination of the depth of a minimum (the cost function) and its breadth (the Occam factor). Bayesians tend to make distributional assumptions on gradient updates by adding isotropic Gaussian noise. This paper builds upon these Bayesian principles by arguing that noise drives SGD away from sharp minima and towards broad minima (the broader the minimum, the better the generalization, since small perturbations of the input have less influence). The noise term of the stochastic differential equation that models the gradient updates effectively serves as injected noise that improves a network's generalizability.<br />
<br />
==Main Results==<br />
<br />
The weakly regularized model memorizes random labels, yet generalizes properly on informative labels. Its predictions are also overconfident. The authors also showed that the test accuracy peaks at an optimal batch size, if one holds the other SGD hyper-parameters constant. It is postulated that the optimum represents a tradeoff between depth and breadth in the Bayesian evidence. However it is the underlying scale of random fluctuations in the SGD dynamics which controls the tradeoff, not the batch size itself. Furthermore, this test accuracy peak shifts as the training set size rises. The authors observed that the best found batch size is proportional to the learning rate. This scaling rule allowed the authors to increase the learning rate by simultaneously increasing the batch size with no loss in test accuracy and no increase in computational cost, so parallelism across multiple GPUs can be leveraged to decrease training time. The scaling rule could also be applied to production models by progressively increasing the batch size as new training data is introduced.<br />
<br />
==Bayesian Model Comparison==<br />
<br />
===Introduction to Bayesian Statistics===<br />
Bayes' theorem is fundamental to Bayesian statistics: Bayesian methods use it to update probabilities, which are interpreted as degrees of belief, after obtaining new data. Given two events <math>A</math> and <math>B</math> with <math>P(B) \neq 0</math>, Bayes' theorem states that the conditional probability of <math>A</math> given that <math>B </math> is true is<br />
\begin{align*}\displaystyle P(A\mid B)={\frac {P(B\mid A)P(A)}{P(B)}}\end{align*}<br />
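As a worked illustration with made-up numbers (they are not taken from the paper), suppose <math>P(A) = 0.01</math>, <math>P(B\mid A) = 0.9</math> and <math>P(B\mid \neg A) = 0.05</math>. The law of total probability gives the denominator, and Bayes' theorem then gives the updated belief in <math>A</math>:<br />
\begin{align*} P(B) &= P(B\mid A)P(A) + P(B\mid \neg A)P(\neg A) = 0.9 \times 0.01 + 0.05 \times 0.99 = 0.0585, \\ P(A\mid B) &= \frac{0.9 \times 0.01}{0.0585} \approx 0.154. \end{align*}<br />
Observing <math>B</math> therefore raises the probability of <math>A</math> from 1% to roughly 15%.<br />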
<br />
Bayesian networks are DAGs whose nodes represent variables in the Bayesian sense: they may be observable quantities, latent variables, unknown parameters or hypotheses. Edges represent conditional dependencies; nodes that are not connected (no path connects one node to another) represent variables that are conditionally independent of each other. Each node is associated with a probability function that takes, as input, a particular set of values for the node's parent variables, and gives (as output) the probability (or probability distribution, if applicable) of the variable represented by the node. For example, if <math>m </math> parent nodes represent <math>m </math> Boolean variables then the probability function could be represented by a table of <math>2^{m} </math> entries, one entry for each of the <math>2^{m} </math> possible parent combinations. <br />
<br />
===Bayesian Model Comparison in Neural Networks===<br />
MacKay (1992) applied Bayesian model comparison to neural networks. An overview is presented below. <br />
<br />
We first consider a classification model <math>M </math> with a single parameter <math>\omega </math>, training inputs <math>x </math> and training labels <math>y </math>. We can infer a posterior probability distribution over the parameter by applying Bayes' theorem:<br />
<br />
\begin{align*}P(\omega\mid y,x;M) = \frac{P(y\mid \omega,x;M)P(\omega;M) }{P(y\mid x;M)}\end{align*}<br />
<br />
The likelihood is <math>P(y\mid \omega,x;M) = \prod_i P(y_i\mid \omega,x_i;M) = e^{-H(\omega;M)} </math>, where <math>H(\omega;M) = -\sum_i \ln P(y_i\mid \omega,x_i;M) </math> denotes the cross-entropy of unique categorical labels. Using a Gaussian prior, <math>P(\omega;M) = \sqrt{\lambda/2\pi}\, e^{-\lambda\omega^2/2} </math>, the posterior probability density of the parameter given the training data is <math>P(\omega\mid y,x;M) \propto \sqrt{\lambda/2\pi}\, e^{-C(\omega;M)} </math>, where <math>C(\omega;M) = H(\omega;M) + \lambda\omega^2/2 </math> denotes the L2 regularized cross entropy, or “cost function”, and <math>\lambda </math> is the regularization coefficient. <br />
<br />
The value <math>\omega_0 </math> which minimizes the cost function lies at the maximum of this posterior. To predict an unknown label <math>y_t </math> of a new input <math>x_t </math>, we should compute the integral,<br />
<br />
\begin{align*} P(y_t\mid x_t,y,x;M) &= \int d\omega\, P(y_t\mid \omega,x_t;M)\,P(\omega\mid y,x;M)\\ &= \frac{\int d \omega\, P(y_t \mid \omega ,x_t;M)e^{-C(\omega;M)}}{\int d \omega\, e^{-C(\omega;M)}} \end{align*}<br />
<br />
However, these integrals are dominated by the region near <math>\omega_0 </math> . We usually approximate <math>P(y_t\mid x_t,x,y;M) \approx P(y_t\mid \omega_0,x_t;M) </math>. Having minimized <math>C(\omega;M) </math> to find <math>\omega_0 </math>, we now wish to compare two different models and select the best one. We use the probability ratio<br />
<br />
\begin{align*}\frac{P(M_1\mid y,x)}{P (M_2\mid y, x)} = \frac{P(y\mid x;M_1) P(M_1)}{ P (y\mid x; M_2) P (M_2)} . \end{align*} <br />
<br />
The second factor on the right is the prior ratio, which describes which model is more plausible before seeing the data. To avoid unnecessary subjectivity, we usually set this to 1. Meanwhile the first factor on the right is the evidence ratio, which controls how much the training data changes our prior beliefs.<br />
<br />
Germain et al. (2016) showed that maximizing the evidence (or “marginal likelihood”) minimizes a PAC-Bayes generalization bound. To compute it, we evaluate <br />
\begin{align*}P(y\mid x;M) &= \int d\omega\, P(y\mid \omega,x;M)P(\omega;M) \\ &=\sqrt{\frac{\lambda}{2\pi}}\int d \omega\, e^{-C(\omega;M)}\end{align*}<br />
<br />
Notice that the evidence is computed by integrating out the parameters, and consequently it is invariant to the model parameterization. <br />
Since this integral is dominated by the region near the minimum <math>\omega_0 </math>, we can estimate the evidence by Taylor expanding <math>C(\omega; M) \approx C(\omega_0) + C''(\omega_0)(\omega - \omega_0)^2/2</math>. This gives us<br />
<br />
\begin{align*} P(y\mid x;M) &\approx e^{-C(\omega_0)}\sqrt{\frac{\lambda}{2\pi}} \int d \omega\, e^{-C''(\omega_0)(\omega - \omega_0)^2/2}\\ &= \exp \big\{- C(\omega_0)-\frac{1}{2}\ln(C''(\omega_0)/\lambda) \big\}.\end{align*}<br />
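The Laplace estimate can be checked numerically in one dimension. The sketch below is illustrative only: the toy inputs <code>x</code>, the labels <code>y</code>, the value of <code>lam</code> and all helper names are assumptions made for this example, not values from the paper. It evaluates the evidence integral <math>\sqrt{\lambda/2\pi}\int d\omega\, e^{-C(\omega)}</math> on a grid for a single-parameter logistic model and compares it with the estimate built from the cost and curvature at the minimum.<br />

```python
import numpy as np


def cost(w, x, y, lam):
    """L2-regularized cross-entropy C(w) = H(w) + lam * w^2 / 2.

    w is a 1-D array of candidate parameter values, x the inputs and
    y the binary labels (0 or 1) of a single-parameter logistic model.
    """
    z = np.outer(w, x)                       # logits, shape (len(w), len(x))
    # Cross-entropy of a sigmoid output in a numerically stable form.
    h = np.sum(np.logaddexp(0.0, z) - y * z, axis=1)
    return h + 0.5 * lam * w ** 2


def evidence_numeric(x, y, lam, half_width=20.0, points=20001):
    """sqrt(lam / 2 pi) * integral of exp(-C(w)) dw, by a Riemann sum."""
    grid = np.linspace(-half_width, half_width, points)
    step = grid[1] - grid[0]
    integral = np.sum(np.exp(-cost(grid, x, y, lam))) * step
    return np.sqrt(lam / (2.0 * np.pi)) * integral


def evidence_laplace(x, y, lam, half_width=20.0, points=20001):
    """Laplace estimate exp(-C(w0) - 0.5 * ln(C''(w0) / lam))."""
    grid = np.linspace(-half_width, half_width, points)
    c = cost(grid, x, y, lam)
    i0 = np.argmin(c)                        # grid point closest to w0
    s = 1.0 / (1.0 + np.exp(-grid[i0] * x))  # sigmoid outputs at w0
    curvature = np.sum(x ** 2 * s * (1.0 - s)) + lam  # analytic C''(w0)
    return np.exp(-c[i0] - 0.5 * np.log(curvature / lam))


# Toy data (hypothetical): six scalar inputs with noisy binary labels.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
lam = 1.0

print("numeric evidence:", evidence_numeric(x, y, lam))
print("Laplace evidence:", evidence_laplace(x, y, lam))
```

The two numbers need not agree exactly, because the cost of a logistic model is not exactly quadratic around its minimum; the gap is the error of the second-order Taylor expansion.<br />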
<br />
The evidence is controlled by the value of the cost function at the minimum, and by the logarithm of the ratio of the curvature about this minimum compared to the regularization constant. In models with <math>p</math> parameters, the curvature is replaced by the determinant of the Hessian <math>\nabla\nabla C(\omega)</math> at <math>\omega_0</math>, whose eigenvalues are denoted <math>\lambda_i</math>:<br />
\begin{align*} P(y\mid x;M) &\approx \frac{\lambda^{p/2}\, e^{-C(\omega_0)}}{\left|\nabla\nabla C(\omega)\right|_{\omega_0}^{1/2}} \\ &= \exp \big\{- C(\omega_0)-\frac{1}{2} \sum_{i=1}^p \ln (\lambda_i/\lambda) \big\}.\end{align*}<br />
<br />
Occam’s factor arises from the log ratio <math>\ln (\lambda_i/\lambda) </math>. The Occam factor describes the fraction of the prior parameter space consistent with the data, and penalizes the amount of information the model must learn about the parameters to accurately model the training data. Since this fraction is always less than one, the authors propose to approximate <math>P(y\mid x;M) </math> away from local minima by performing the summation only over eigenvalues <math>\lambda_i \geq \lambda </math>.<br />
<br />
The authors compare the evidence against a null model which assumes the labels are entirely random. This model has no parameters, and so its evidence is controlled by the likelihood alone: <math>P(y\mid x;NULL) = (1/n)^N = e^{-N \ln(n)} </math>, where <math>n </math> denotes the number of model classes and <math>N</math> the number of training labels. The evidence ratio is<br />
\begin{equation*}\frac{P(y\mid x;M) }{P(y\mid x;NULL) } = e ^{-E(\omega_0)}, \end{equation*}<br />
where <math>E(\omega_0) = C(\omega_0)+\frac{1}{2} \sum_{i=1}^p \ln (\lambda_i/\lambda) - N\ln (n) </math> is the log evidence ratio in favor of the null model.<br />
The authors assign confidence to the predictions of a model only if <math>E(\omega_0) < 0 </math>.<br />
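The log evidence ratio is straightforward to compute once the cost at the minimum and the Hessian eigenvalues are known. The sketch below is a minimal illustration, not the authors' code: the function name <code>log_evidence_ratio</code> and all numbers in the usage lines are hypothetical. It follows the evidence expression <math>\exp\{-C(\omega_0) - \frac{1}{2}\sum_i \ln(\lambda_i/\lambda)\}</math> divided by the null evidence <math>e^{-N\ln(n)}</math>, and keeps only eigenvalues <math>\lambda_i \geq \lambda</math> as the authors propose.<br />

```python
import numpy as np


def log_evidence_ratio(cost_at_min, hessian_eigs, lam, num_labels, num_classes):
    """Log evidence ratio E(w0) in favor of the null model.

    E(w0) = C(w0) + 0.5 * sum_i ln(lambda_i / lam) - N * ln(n),
    where the sum runs only over Hessian eigenvalues lambda_i >= lam.
    """
    eigs = np.asarray(hessian_eigs, dtype=float)
    kept = eigs[eigs >= lam]                      # drop eigenvalues below lam
    occam = 0.5 * np.sum(np.log(kept / lam))      # Occam term
    null_term = num_labels * np.log(num_classes)  # N * ln(n)
    return cost_at_min + occam - null_term


# Hypothetical numbers: 800 binary labels, a made-up cost at the minimum
# and made-up Hessian eigenvalues.
E = log_evidence_ratio(cost_at_min=100.0,
                       hessian_eigs=[0.1 * np.e ** 2, 0.1, 0.05],
                       lam=0.1, num_labels=800, num_classes=2)
print("E(w0) =", E)
print("assign confidence to the model" if E < 0 else "prefer the null model")
```

A negative value means the model is more plausible than random guessing, which is the condition under which the authors assign confidence to its predictions.<br />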
<br />
The evidence supports the intuition that broad minima generalize better than sharp minima, but unlike the curvature it does not depend on the model parameterization. Dinh et al. (2017) showed one can increase the Hessian eigenvalues by rescaling the parameters, but they must simultaneously rescale the regularization coefficients, otherwise the model changes. Since Occam’s factor arises from the log ratio, <math>\ln (\lambda_i/\lambda) </math>, these two effects cancel out. Note however that while the evidence itself is invariant to model parameterization, one can find reparameterizations which change the approximate evidence after the Laplace approximation. It is difficult to evaluate the evidence for deep networks, as we cannot compute the Hessian of millions of parameters. Additionally, neural networks exhibit many equivalent minima, since we can permute the hidden units without changing the model. To compute the evidence we must carefully account for this “degeneracy”. The authors argue these issues are not a major limitation, since the intuition they build studying the evidence in simple cases will be sufficient to explain the results of both Zhang et al. (2016) and Keskar et al. (2016).<br />
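The cancellation between the rescaled Hessian eigenvalues and the rescaled regularization coefficient can be checked directly in the single-parameter case. As an illustration (the rescaling factor <math>\alpha</math> and the tilde notation are introduced here for this example), write <math>\omega = \alpha\tilde{\omega}</math> with <math>\alpha > 0</math>. The cost in the new coordinate is <math>\tilde{C}(\tilde{\omega}) = C(\alpha\tilde{\omega})</math>, and keeping the same Gaussian prior on <math>\omega</math> requires the regularization coefficient <math>\tilde{\lambda} = \alpha^2\lambda</math>, since <math>\lambda\omega^2/2 = \alpha^2\lambda\tilde{\omega}^2/2</math>. Then<br />
\begin{align*} \tilde{C}''(\tilde{\omega}_0) = \alpha^2 C''(\omega_0), \qquad \frac{\tilde{C}''(\tilde{\omega}_0)}{\tilde{\lambda}} = \frac{\alpha^2 C''(\omega_0)}{\alpha^2 \lambda} = \frac{C''(\omega_0)}{\lambda}, \end{align*}<br />
so the curvature alone can be made arbitrarily large or small by the choice of <math>\alpha</math>, while the Occam term <math>\frac{1}{2}\ln(C''(\omega_0)/\lambda)</math> is unchanged.<br />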
<br />
==Bayes Theorem and Generalization==<br />
Zhang et al. (2016) showed that deep neural networks generalize well on training inputs with informative labels, but the same model can overfit on the same input images when the labels are randomized, perfectly memorizing the training set. To demonstrate that these observations are not unique to deep networks, the authors use logistic regression. They form a small balanced training set comprising 800 images from MNIST, of which half have true label “0” and half true label “1”. The test set is balanced, comprising 5000 MNIST images of zeros and 5000 MNIST images of ones. There are two tasks. In the first task, the labels of both the training and test sets are randomized. In the second task, the labels are informative, matching the true MNIST labels. The model has 784 weights and 1 bias.<br />
<br />
The accuracy of the model predictions on both the training and test sets is shown in figure 1. When trained on the informative labels, the model generalizes well to the test set, so long as it is weakly regularized. However the model also perfectly memorizes the random labels, replicating the observations of Zhang et al. (2016) in deep networks. No significant improvement in model performance is observed as the regularization coefficient increases. For completeness, we also evaluate the mean margin between training examples and the decision boundary. For both random and informative labels, the margin drops significantly as we reduce the regularization coefficient. When weakly regularized, the mean margin is roughly 50% larger for informative labels than for random labels.<br />
<br />
[[File:bg1.png|800px|thumb|center|]]<br />
<br />
Now consider figure 2, where we plot the mean cross-entropy of the model predictions, evaluated on both training and test sets, as well as the Bayesian log evidence ratio defined in the previous section. Looking first at the random label experiment in figure 2a, while the cross-entropy on the training set vanishes when the model is weakly regularized, the cross-entropy on the test set explodes. Not only does the model make random predictions, but it is extremely confident in those predictions. As the regularization coefficient is increased the test set cross-entropy falls, settling at ln 2, the cross-entropy of assigning equal probability to both classes. Now consider the Bayesian evidence, which we evaluate on the training set. The log evidence ratio is large and positive when the model is weakly regularized, indicating that the model is exponentially less plausible than assigning equal probabilities to each class. As the regularization parameter is increased, the log evidence ratio falls, but it is always positive, indicating that the model can never be expected to generalize well.<br />
Now consider figure 2b (informative labels). Once again, the training cross-entropy falls to zero when the model is weakly regularized, while the test cross-entropy is high. Even though the model makes accurate predictions, those predictions are overconfident. As the regularization coefficient increases, the test cross-entropy falls below ln 2, indicating that the model is successfully generalizing to the test set. Now consider the Bayesian evidence. The log evidence ratio is large and positive when the model is weakly regularized, but as the regularization coefficient increases, the log evidence ratio drops below zero, indicating that the model is exponentially more plausible than assigning equal probabilities to each class. As we further increase the regularization, the log evidence ratio rises to zero while the test cross-entropy rises to ln 2. Test cross-entropy and Bayesian evidence are strongly correlated, with minima at the same regularization strength.<br />
Bayesian model comparison has explained our results in a logistic regression. Meanwhile, Krueger et al. (2017) showed the largest Hessian eigenvalue also increased when training on random labels in deep networks, implying the evidence is falling. We conclude that Bayesian model comparison is quantitatively consistent with the results of Zhang et al. (2016) in linear models where we can compute the evidence, and qualitatively consistent with their results in deep networks where we cannot. Dziugaite & Roy (2017) recently demonstrated the results of Zhang et al. (2016) can also be understood by minimising a PAC-Bayes generalization bound which penalizes sharp minima.<br />
[[File:bg2.png|800px|thumb|center|]]<br />
==Bayes Theorem and Stochastic Gradient Descent ==<br />
<br />
We showed above that generalization is strongly correlated with the Bayesian evidence, a weighted combination of the depth of a minimum (the cost function) and its breadth (the Occam factor). Consequently Bayesians often add isotropic Gaussian noise to the gradient (Welling & Teh, 2011). In appendix A, we show this drives the parameters towards broad minima whose evidence is large. The noise introduced by small batch training is not isotropic, and its covariance matrix is a function of the parameter values, but empirically Keskar et al. (2016) found it has similar effects, driving the SGD away from sharp minima. This paper therefore proposes Bayesian principles also account for the “generalization gap”, whereby the test set accuracy often falls as the SGD batch size is increased (holding all other hyper-parameters constant). Since the gradient drives the SGD towards deep minima, while noise drives the SGD towards broad minima, we expect the test set performance to show a peak at an optimal batch size, which balances these competing contributions to the evidence.<br />
We were unable to observe a generalization gap in linear models (since linear models are convex there are no sharp minima to avoid). Instead we consider a shallow neural network with 800 hidden units and ReLU hidden activations, trained on MNIST without regularization. We use SGD with a momentum parameter of 0.9. Unless otherwise stated, we use a constant learning rate of 1.0 which does not depend on the batch size or decay during training. Furthermore, we train on just 1000 images, selected at random from the MNIST training set. This enables us to compare small batch to full batch training. We emphasize that we are not trying to achieve optimal performance, but to study a simple model which shows a generalization gap between small and large batch training.<br />
In figure 3, we exhibit the evolution of the test accuracy and test cross-entropy during training. Our small batches are composed of 30 images, randomly sampled from the training set. Looking first at figure 3a, small batch training takes longer to converge, but after a thousand gradient updates a clear generalization gap in model accuracy emerges between small and large training batches. Now consider figure 3b. While the test cross-entropy for small batch training is lower at the end of training, the cross-entropy of both small and large training batches is increasing, indicative of over-fitting. Both models exhibit a minimum test cross-entropy, although after different numbers of gradient updates. Intriguingly, we show in appendix B that the generalization gap between small and large batch training shrinks significantly when we introduce L2 regularization.<br />
<br />
[[File:bg3.png|800px|thumb|center|]]<br />
<br />
From now on we focus on the test set accuracy (since this converges as the number of gradient updates increases). In figure 4a, we exhibit training curves for a range of batch sizes between 1 and 1000. We find that the model cannot train when the batch size <math>B \lesssim 10</math>. In figure 4b we plot the mean test set accuracy after 10000 training steps. A clear peak emerges, indicating that there is indeed an optimum batch size which maximizes the test accuracy, consistent with Bayesian intuition. The results of Keskar et al. (2016) focused on the decay in test accuracy above this optimum batch size.<br />
[[File:bg4.png|800px|thumb|center|]]<br />
<br />
==Stochastic Differential Equations and Scaling Rules==<br />
The results shown above indicate that the test accuracy peaks at an optimal batch size if one holds the other SGD hyper-parameters constant. The authors argue that this peak arises from the tradeoff between depth and breadth in the Bayesian evidence. However, it is not the batch size itself which controls this tradeoff, but the underlying scale of random fluctuations in the SGD dynamics. This section identifies this SGD “noise scale” and uses it to derive three scaling rules which predict how the optimal batch size depends on the learning rate, training set size and momentum coefficient. <br />
First, interpret the gradient update as the discrete update of a stochastic differential equation,<br />
\begin{equation*}\frac{d\omega}{dt} = -\frac{dC}{d\omega} + \eta(t)\end{equation*}<br />
Here <math>\eta(t)</math> represents noise with mean <math>\langle \eta(t) \rangle = 0</math> and covariance <math> \langle \eta (t)\eta (t')\rangle = gF (\omega)\delta (t-t')</math>.<br />
<math>t</math> is a continuous variable, and <math>F(\omega)</math> is a matrix describing the gradient covariances.<br />
The SGD noise scale is taken to be <math>g \approx \epsilon N/B</math>, where <math>\epsilon</math> is the learning rate, <math>N</math> is the training set size and <math>B</math> is the batch size.<br />
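As an illustration, a stochastic differential equation of this form can be integrated numerically with the Euler–Maruyama scheme, with the gradient term taken to descend the cost. The sketch below is not from the paper: it assumes a one-dimensional toy cost <math>C(\omega) = k\omega^2/2</math> with constant gradient covariance <math>F</math>, and the names <code>simulate_sde</code>, <code>curvature</code> and <code>dt</code> are hypothetical. For this toy cost the stationary variance of <math>\omega</math> is approximately <math>gF/(2k)</math>, so a larger noise scale makes the parameter fluctuate more widely about the minimum.<br />

```python
import math
import random


def simulate_sde(g, curvature=1.0, F=1.0, dt=0.01, steps=200_000, seed=0):
    """Euler-Maruyama integration of d(omega)/dt = -dC/d(omega) + eta(t).

    Assumes the toy cost C(omega) = curvature * omega**2 / 2 and a constant
    gradient covariance F, so <eta(t) eta(t')> = g * F * delta(t - t').
    Returns the empirical variance of omega over the second half of the run.
    """
    rng = random.Random(seed)
    omega = 0.0
    noise_std = math.sqrt(g * F * dt)  # standard deviation of the noise per step
    samples = []
    for step in range(steps):
        grad = curvature * omega       # dC/d(omega) for the quadratic toy cost
        omega += -grad * dt + noise_std * rng.gauss(0.0, 1.0)
        if step >= steps // 2:         # discard the first half as burn-in
            samples.append(omega)
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)


var_low_noise = simulate_sde(g=0.1)
var_high_noise = simulate_sde(g=0.4)
```

Because the toy dynamics are linear in the noise, running the two simulations with the same seed gives variances whose ratio is close to four, matching the linear dependence of the stationary variance on <math>g</math>.<br />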
[[File:bg5.png|800px|thumb|center|]]<br />
[[File:bg6.png|800px|thumb|center|]]<br />
[[File:bg7.png|800px|thumb|center|]]<br />
The noise scale falls when the batch size <math>B</math> increases, consistent with our earlier observation of an optimal batch size <math>B_{opt}</math> while holding the other hyper-parameters fixed. Notice that one would equivalently observe an optimal learning rate if one held the batch size constant. A similar analysis of the SGD was recently performed by Mandt et al. (2017), although their treatment only holds near local minima where the covariances <math>F(\omega)</math> are stationary. Our analysis holds throughout training, which is necessary since Keskar et al. (2016) found that the beneficial influence of noise was most pronounced at the start of training.<br />
When we vary the learning rate or the training set size, we should keep the noise scale fixed, which implies that Bopt ∝ εN. In figure 5a, we plot the test accuracy as a function of batch size after (10000/ε) training steps, for a range of learning rates. Exactly as predicted, the peak moves to the right as ε increases. Additionally, the peak test accuracy achieved at a given learning rate does not begin to fall until ε ∼ 3, indicating that there is no significant discretization error in integrating the stochastic differential equation below this point. Above this point, the discretization error begins to dominate and the peak test accuracy falls rapidly. In figure 5b, we plot the best observed batch size as a function of learning rate, observing a clear linear trend, Bopt ∝ ε. The error bars indicate the distance from the best observed batch size to the next batch size sampled in our experiments.<br />
<br />
This scaling rule allows us to increase the learning rate with no loss in test accuracy and no increase in computational cost, simply by simultaneously increasing the batch size. We can then exploit increased parallelism across multiple GPUs, reducing model training times (Goyal et al., 2017). A similar scaling rule was independently proposed by Jastrzebski et al. (2017) and Chaudhari & Soatto (2017), although neither work identifies the existence of an optimal noise scale. A number of authors have proposed adjusting the batch size adaptively during training (Friedlander & Schmidt, 2012; Byrd et al., 2012; De et al., 2017), while Balles et al. (2016) proposed linearly coupling the learning rate and batch size within this framework. In Smith et al. (2017), we show empirically that decaying the learning rate during training and increasing the batch size during training are equivalent.<br />
In figure 6a we exhibit the test set accuracy as a function of batch size, for a range of training set sizes after 10000 steps (ε = 1 everywhere). Once again, the peak shifts right as the training set size rises, although the generalization gap becomes less pronounced as the training set size increases. In figure 6b, we plot the best observed batch size as a function of training set size, observing another linear trend, Bopt ∝ N. This scaling rule could be applied to production models, progressively growing the batch size as new training data is collected. We expect production datasets to grow considerably over time, and consequently large batch training is likely to become increasingly common.<br />
When momentum is included, the noise scale becomes <math>g \approx \epsilon N/(B(1-m))</math>, where <math>m</math> is the momentum coefficient; this reduces to the noise scale of conventional SGD as <math>m \rightarrow 0</math>. When <math>m > 0</math>, we obtain an additional scaling rule <math>B_{opt} \propto 1/(1-m)</math>. This scaling rule predicts that the optimal batch size will increase when the momentum coefficient is increased. In figure 7a we plot the test set performance as a function of batch size after 10000 gradient updates (ε = 1 everywhere), for a range of momentum coefficients. In figure 7b, we plot the best observed batch size as a function of the momentum coefficient, and fit our results to the scaling rule above, obtaining remarkably good agreement.<br />
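The three scaling rules all follow from holding the noise scale fixed while another hyper-parameter changes. The following is a minimal sketch, assuming the momentum form <math>g \approx \epsilon N/(B(1-m))</math>; the function names are hypothetical and the batch size of 30 is only an illustrative starting point, not a recommendation from the paper.<br />

```python
def noise_scale(learning_rate, train_size, batch_size, momentum=0.0):
    """SGD noise scale g ~ eps * N / (B * (1 - m)); m = 0 gives g ~ eps * N / B."""
    return learning_rate * train_size / (batch_size * (1.0 - momentum))


def batch_size_for_noise_scale(target_g, learning_rate, train_size, momentum=0.0):
    """Batch size that reproduces a target noise scale under new hyper-parameters."""
    return learning_rate * train_size / (target_g * (1.0 - momentum))


# Reference configuration: eps = 1, N = 1000, m = 0.9, illustrative B = 30.
g_ref = noise_scale(1.0, 1000, 30, momentum=0.9)

# Doubling the learning rate or the training set size doubles the matching batch size.
b_double_lr = batch_size_for_noise_scale(g_ref, 2.0, 1000, momentum=0.9)
b_double_n = batch_size_for_noise_scale(g_ref, 1.0, 2000, momentum=0.9)

# Removing momentum shrinks the matching batch size by the factor (1 - m).
b_no_momentum = batch_size_for_noise_scale(g_ref, 1.0, 1000, momentum=0.0)
```

In practice the batch size would be rounded to an integer; the sketch leaves it as a float so the proportionality is easy to see.<br />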
<br />
==Critiques==<br />
<br />
#Bayesian statistics is not, at present, a theory that can provably explain why a learning algorithm works. The Bayesian approach is too optimistic: we introduce a prior and a model and then trust both implicitly. Relative to any particular prior and model (likelihood), the Bayesian posterior is the optimal summary of the data, but if either part is misspecified, the Bayesian posterior carries no optimality guarantee. The prior is chosen for convenience here. <br />
#The paper does not discuss the information bottleneck analysis, which also addresses the generalization ability of a model. <br />
#The paper does not discuss real online learning with streaming data, where the total number of data points is unknown.<br />
#The paper presents how mini-batch noise in SGD can improve the performance of neural networks. However, the usefulness of the approach could be described and analyzed in greater detail if the authors reported performance on a variety of well-known real-life datasets.<br />
<br />
==Conclusion==<br />
<br />
The paper showed that mini-batch noise helps SGD move away from sharp minima, and provided evidence that there is an optimum batch size which maximizes the test accuracy. Based on interpreting SGD as integrating a stochastic differential equation, this batch size is proportional to the learning rate and the training set size. Moreover, the authors showed that <math>B_{opt} \propto 1/(1-m) </math>, where <math>m</math> is the momentum coefficient. Further analysis of the relation between the learning rate, effective learning rate, and batch size was presented at ICLR 2018, where the authors showed experimentally that all the benefits of decaying the learning rate can be obtained by instead increasing the batch size, which also dramatically reduces the number of parameter updates, and that hyper-parameters from the literature could be reused without any additional tuning (Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, Quoc V. Le).<br />
<br />
==References==<br />
<br />
#Alessandro Achille and Stefano Soatto. On the emergence of invariance and disentangling in deep representations. arXiv preprint arXiv:1706.01350, 2017.<br />
#Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086, 2016.<br />
#Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012. <br />
#Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. arXiv preprint arXiv:1710.11029, 2017.<br />
#Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.<br />
#Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.<br />
#Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933, 2017.<br />
#Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.<br />
#Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.<br />
#Crispin W Gardiner. Handbook of Stochastic Methods, volume 4. Springer Berlin, 1985.<br />
#Pascal Germain, Francis Bach, Alexandre Lacoste, and Simon Lacoste-Julien. PAC-bayesian theory meets bayesian inference. In Advances in Neural Information Processing Systems, pp. 1884–1892, 2016.<br />
#Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.<br />
#Stephen F Gull. Bayesian inductive inference and maximum entropy. In Maximum-entropy and Bayesian methods in science and engineering, pp. 53–74. Springer, 1988.<br />
#Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pp. 5–13. ACM, 1993.<br />
#Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.<br />
#Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.<br />
#Stanisław Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.<br />
#Robert E Kass and Adrian E Raftery. Bayes factors. Journal of the american statistical association, 90(430):773–795, 1995.<br />
#Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017.<br />
#Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.<br />
#David Krueger, Nicolas Ballas, Stanislaw Jastrzebski, Devansh Arpit, Maxinder S Kanwal, Tegan Maharaj, Emmanuel Bengio, Asja Fischer, and Aaron Courville. Deep nets don’t learn via memorization. ICLR Workshop, 2017.<br />
#Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, pp. 2101–2110, 2017.<br />
#David JC MacKay. A practical bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992.<br />
#Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.<br />
#Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.<br />
#Samuel L. Smith, Pieter-Jan Kindermans, and Quoc V. Le. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.<br />
#Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.<br />
#Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_Bayesian_Perspective_on_Generalization_and_Stochastic_Gradient_Descent&diff=42155A Bayesian Perspective on Generalization and Stochastic Gradient Descent2018-11-30T23:29:36Z<p>Z43ma: </p>
<hr />
<div>==Introduction==<br />
This paper shows that Bayesian principles can explain many recent observations in the deep learning literature and provide practical new insights. This work builds on Zhang et al. (2016), who showed that deep neural networks can easily memorize randomly labeled training data, despite generalizing well on real labels of the same inputs. The authors consider two questions: how can we predict if a minimum will generalize to the test set, and why does stochastic gradient descent find minima that generalize well? <br />
<br />
The paper shows that the same phenomenon occurs even in small linear models. These observations are explained by the Bayesian evidence, which penalizes sharp minima but is invariant to model parameterization. They also demonstrate that, when one holds the learning rate fixed, there is an optimum batch size which maximizes the test set accuracy.<br />
<br />
The authors propose that the noise introduced by small mini-batches drives the parameters towards minima whose evidence is large. Interpreting stochastic gradient descent as a stochastic differential equation, they identify the “noise scale” <math display="inline"> g \approx \epsilon N/B </math>, where <math display="inline">\epsilon</math> is the learning rate, <math display="inline">N</math> the training set size and <math display="inline">B</math> the batch size. Consequently the optimum batch size is proportional to both the learning rate and the size of the training set, <math display="inline">B_{opt} \propto \epsilon N</math>. The authors verify these predictions empirically.<br />
<br />
==Motivation and Related Work==<br />
Zhang et al. (2016) trained deep convolutional networks on ImageNet and CIFAR10, achieving excellent accuracy on both training and test sets. They then took the same input images, but randomized the labels, and found that while their networks were now unable to generalize to the test set, they still memorized the training labels. They claimed these results contradict learning theory, although this claim is disputed (Kawaguchi et al., 2017; Dziugaite & Roy, 2017). Nonetheless, their results raise the question: if our models can assign arbitrary labels to the training set, why do they work so well in practice? <br />
<br />
Meanwhile, Keskar et al. (2016) observed that if we hold the learning rate fixed and increase the batch size, the test accuracy usually falls. This striking result shows improving the estimate of the full-batch gradient can harm performance. Goyal et al. (2017) observed a linear scaling rule between batch size and learning rate in a deep ResNet, while Hoffer et al. (2017) proposed a square root rule on theoretical grounds.<br />
<br />
Many authors have suggested “broad minima” whose curvature is small may generalize better than “sharp minima” whose curvature is large (Chaudhari et al., 2016; Hochreiter & Schmidhuber, 1997). Indeed, Dziugaite & Roy (2017) argued the results of Zhang et al. (2016) can be understood using “nonvacuous” PAC-Bayes generalization bounds which penalize sharp minima, while Keskar et al. (2016) showed stochastic gradient descent (SGD) finds wider minima as the batch size is reduced. However, Dinh et al. (2017) challenged this interpretation, by arguing that the curvature of a minimum can be arbitrarily increased by changing the model parameterization.<br />
<br />
==Contribution==<br />
<br />
The main contributions of this paper are to show that:<br />
* The results of Zhang et al. (2016) are not unique to deep learning; the same phenomenon is observed in a small “over-parameterized” linear model. Over-parameterization occurs when a model is able to effectively “remember” training data, which happens when there are enough parameters that the system of equations has an infinite number of possible solutions. Such over-training leads to poor results on test cases, because this “memorization” learns noise rather than the inherent structure of the different classes. The authors demonstrate that this phenomenon is straightforwardly understood by evaluating the Bayesian evidence in favor of each model, which penalizes sharp minima but is invariant to the model parameterization.<br />
* SGD integrates a stochastic differential equation whose “noise scale” is <math>g \approx \epsilon N/B</math>, where <math>\epsilon</math> is the learning rate, <math>N</math> the training set size, and <math>B</math> the batch size. Noise drives SGD away from sharp minima, and therefore there is an optimal batch size which maximizes the test set accuracy. This optimal batch size is '''proportional to the learning rate and training set size'''.<br />
<br />
Zhang et al. (2016) showed high training competency of neural networks under informative labels, but drastic overfitting on improper labels. This implies weak generalizability even when a small proportion of labels are improper. The authors show that generalization is strongly correlated with the Bayesian evidence, a weighted combination of the depth of a minimum (the cost function) and its breadth (the Occam factor). Bayesians often add isotropic Gaussian noise to the gradient updates. This paper builds upon these Bayesian principles by arguing that mini-batch noise drives SGD away from sharp minima and towards broad minima (the broader the minimum, the better the generalization, since small perturbations of the input have less influence). The noise term in the stochastic differential equation that describes the gradient updates effectively serves as injected noise that improves a network's generalizability.<br />
<br />
==Main Results==<br />
<br />
The weakly regularized model memorizes random labels, yet generalizes properly on informative labels, although its predictions are overconfident. The authors also showed that the test accuracy peaks at an optimal batch size if one holds the other SGD hyper-parameters constant. It is postulated that the optimum represents a tradeoff between depth and breadth in the Bayesian evidence. However, it is the underlying scale of random fluctuations in the SGD dynamics which controls the tradeoff, not the batch size itself. Furthermore, this test accuracy peak shifts as the training set size rises. The authors observed that the best found batch size is proportional to the learning rate. This scaling rule allowed the authors to increase the learning rate by simultaneously increasing the batch size with no loss in test accuracy and no increase in computational cost, so parallelism across multiple GPUs can be leveraged to decrease training time. The scaling rule could also be applied to production models by progressively increasing the batch size as new training data is introduced.<br />
<br />
==Bayesian Model Comparison==<br />
<br />
===Introduction to Bayesian Statistics===<br />
Bayes' theorem is a fundamental theorem in Bayesian statistics, as it is used by Bayesian methods to update probabilities, which are degrees of belief, after obtaining new data. Given two events <math>A</math> and <math>B</math> with <math>P(B) > 0</math>, Bayes' theorem states that the conditional probability of <math>A</math> given that <math>B </math> is true is<br />
\begin{align*}\displaystyle P(A\mid B)={\frac {P(B\mid A)P(A)}{P(B)}}\end{align*}<br />
<br />
Bayesian networks are DAGs whose nodes represent variables in the Bayesian sense: they may be observable quantities, latent variables, unknown parameters or hypotheses. Edges represent conditional dependencies; nodes that are not connected (no path connects one node to another) represent variables that are conditionally independent of each other. Each node is associated with a probability function that takes, as input, a particular set of values for the node's parent variables, and gives (as output) the probability (or probability distribution, if applicable) of the variable represented by the node. For example, if <math>m </math> parent nodes represent <math>m </math> Boolean variables then the probability function could be represented by a table of <math>2^{m} </math> entries, one entry for each of the <math>2^{m} </math> possible parent combinations. <br />
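To make the table size concrete, the sketch below enumerates the rows of a conditional probability table for a node with <math>m</math> Boolean parents. The function name is hypothetical and the probability 0.5 is a placeholder chosen only for illustration.<br />

```python
from itertools import product


def empty_cpt(num_parents):
    """Build a table with one row per combination of Boolean parent values.

    Each row maps a tuple of parent values to P(node = True | parents);
    0.5 is a placeholder, not a learned probability.
    """
    return {combo: 0.5 for combo in product((False, True), repeat=num_parents)}


cpt = empty_cpt(3)
print(len(cpt))  # 8 rows, i.e. 2**3
```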
<br />
===Bayesian Model Comparison in Neural Networks===<br />
MacKay (1992) applied Bayesian model comparison to neural networks. An overview is presented below. <br />
<br />
We first consider a classification model <math>M </math> with a single parameter <math>\omega </math>, training inputs <math>x </math> and training labels <math>y </math>. We can infer a posterior probability distribution over the parameter by applying Bayes' theorem:<br />
<br />
\begin{align*}P(\omega\mid y,x;M) = \frac{P(y\mid \omega,x;M)P(\omega;M) }{P(y\mid x;M)}\end{align*}<br />
<br />
The likelihood is <math>P(y\mid \omega,x;M) = \prod_i P(y_i\mid \omega,x_i;M) = e^{-H(\omega;M)} </math>, where <math>H(\omega;M) = -\sum_i \ln P(y_i\mid \omega,x_i;M) </math> denotes the cross-entropy of unique categorical labels. Using a Gaussian prior, <math>P(\omega;M) = \sqrt{\lambda/2\pi}\, e^{-\lambda\omega^2/2} </math>, the posterior probability density of the parameter given the training data is <math>P(\omega\mid y,x;M) \propto \sqrt{\lambda/2\pi}\, e^{-C(\omega;M)} </math>, where <math>C(\omega;M) = H(\omega;M) + \lambda\omega^2/2 </math> denotes the L2 regularized cross entropy, or “cost function”, and <math>\lambda </math> is the regularization coefficient. <br />
<br />
The value <math>\omega_0 </math> which minimizes the cost function lies at the maximum of this posterior. To predict an unknown label <math>y_t </math> of a new input <math>x_t </math>, we should compute the integral,<br />
<br />
\begin{align*} P(y_t\mid x_t,y,x;M) &= \int d\omega\, P(y_t\mid \omega,x_t;M)\,P(\omega\mid y,x;M)\\ &= \frac{\int d \omega\, P(y_t \mid \omega ,x_t;M)e^{-C(\omega;M)}}{\int d \omega\, e^{-C(\omega;M)}} \end{align*}<br />
<br />
However, these integrals are dominated by the region near <math>\omega_0 </math> . We usually approximate <math>P(y_t\mid x_t,x,y;M) \approx P(y_t\mid \omega_0,x_t;M) </math>. Having minimized <math>C(\omega;M) </math> to find <math>\omega_0 </math>, we now wish to compare two different models and select the best one. We use the probability ratio<br />
<br />
\begin{align*}\frac{P(M_1\mid y,x)}{P (M_2\mid y, x)} = \frac{P(y\mid x;M_1) P(M_1)}{ P (y\mid x; M_2) P (M_2)} . \end{align*} <br />
<br />
The second factor on the right is the prior ratio, which describes which model is most plausible. To avoid unnecessary subjectivity, we usually set this to 1. Meanwhile the first factor on the right is the evidence ratio, which controls how much the training data changes our prior beliefs.<br />
<br />
Germain et al. (2016) showed that maximizing the evidence (or “marginal likelihood”) minimizes a PAC-Bayes generalization bound. To compute it, we evaluate <br />
\begin{align*}P(y\mid x;M) &= \int d\omega\, P(y\mid \omega,x;M)P(\omega;M) \\ &=\sqrt{\frac{\lambda}{2\pi}}\int d \omega\, e^{-C(\omega;M)}\end{align*}<br />
<br />
Notice that the evidence is computed by integrating out the parameters; and consequently it is invariant to the model parameterization. <br />
Since this integral is dominated by the region near the minimum <math>\omega_0 </math>, we can estimate the evidence by Taylor expanding <math>C(\omega; M) \approx C(\omega_0) + C''(\omega_0)(\omega - \omega_0)^2/2</math>. This gives us<br />
<br />
\begin{align*} P(y\mid x;M) &\approx e^{-C(\omega_0)}\sqrt{\frac{\lambda}{2\pi}} \int d \omega\, e^{-C''(\omega_0)(\omega - \omega_0)^2/2}\\ &= \exp \big\{- C(\omega_0)-\frac{1}{2}\ln(C''(\omega_0)/\lambda) \big\}.\end{align*}<br />
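The Laplace approximation above can be checked numerically. The following sketch assumes a hypothetical toy cross-entropy <math>H(\omega) = a(\omega-1)^2/2</math>, for which the cost is exactly quadratic and the approximation is therefore exact; the function names and the values <math>a = 4</math>, <math>\lambda = 1</math> are illustrative only. The curvature entering the logarithm is the second derivative <math>C''(\omega_0)</math>.<br />

```python
import math

A, LAM = 4.0, 1.0  # toy curvature of H and regularization coefficient (assumed values)


def cost(omega):
    """C(omega) = H(omega) + LAM * omega**2 / 2 with H(omega) = A * (omega - 1)**2 / 2."""
    return A * (omega - 1.0) ** 2 / 2.0 + LAM * omega ** 2 / 2.0


def numeric_evidence(lo=-20.0, hi=20.0, n=200_000):
    """Midpoint-rule estimate of sqrt(LAM / (2 pi)) * integral of exp(-C(omega))."""
    h = (hi - lo) / n
    total = sum(math.exp(-cost(lo + (i + 0.5) * h)) for i in range(n))
    return math.sqrt(LAM / (2.0 * math.pi)) * total * h


def laplace_evidence():
    """exp{-C(omega_0) - ln(C''(omega_0) / LAM) / 2} evaluated at the minimum omega_0."""
    omega_0 = A / (A + LAM)  # solves C'(omega) = 0 for the toy cost
    curvature = A + LAM      # C''(omega), constant for a quadratic cost
    return math.exp(-cost(omega_0) - 0.5 * math.log(curvature / LAM))
```

For a non-quadratic cost the two values would differ, and the size of the difference indicates how well the second-order Taylor expansion describes the region around <math>\omega_0</math>.<br />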
<br />
The evidence is controlled by the value of the cost function at the minimum, and by the logarithm of the ratio of the curvature about this minimum to the regularization constant. In models with <math>p</math> parameters, the curvature is replaced by the determinant of the Hessian <math>\nabla\nabla C(\omega)</math> at <math>\omega_0</math>, whose eigenvalues we denote <math>\lambda_i</math>: <br />
\begin{align*} P(y\mid x;M) &\approx \frac{\lambda^{p/2}\, e^{-C(\omega_0)}}{\left|\nabla\nabla C(\omega)\right|_{\omega_0}^{1/2}} \\ &= \exp \big\{- C(\omega_0)-\frac{1}{2} \sum_{i=1}^p \ln (\lambda_i/\lambda) \big\}.\end{align*}<br />
<br />
Occam’s factor arises from the log ratio <math>\ln (\lambda_i/\lambda) </math>. It describes the fraction of the prior parameter space consistent with the data, and penalizes the amount of information the model must learn about the parameters to accurately model the training data. Since the fraction is always less than one, the authors propose to approximate <math>P(y\mid x;M) </math> away from local minima by only performing the summation over eigenvalues <math>\lambda_i \geq \lambda </math>.<br />
<br />
The authors compare the evidence against a null model which assumes the labels are entirely random. This model has no parameters, and so the evidence is controlled by the likelihood alone: <math>P(y\mid x;NULL) = (1/n)^N = e^{-N \ln(n)} </math>, where <math>n </math> denotes the number of model classes and <math>N</math> the number of training labels. The evidence ratio is<br />
\begin{equation*}\frac{P(y\mid x;M) }{P(y\mid x;NULL) } = e ^{-E(\omega_0)} \end{equation*}<br />
where <math>E(\omega_0) = C(\omega_0)+\frac{1}{2} \sum_{i=1}^p \ln (\lambda_i/\lambda) - N\ln (n) </math> is the log evidence ratio in favor of the null model.<br />
The authors assign confidence to the predictions of a model only if <math>E(\omega_0) < 0 </math>.<br />
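The confidence criterion can be written as a short function. This is a sketch under the sign convention that <math>E(\omega_0)</math> is the log evidence ratio in favor of the null model, so that dividing the Laplace evidence by <math>e^{-N\ln(n)}</math> gives <math>E(\omega_0) = C(\omega_0) + \frac{1}{2}\sum_i \ln(\lambda_i/\lambda) - N\ln(n)</math>. The function name and the example numbers are hypothetical and are not results from the paper.<br />

```python
import math


def log_evidence_ratio(cost_at_min, hessian_eigs, reg, num_labels, num_classes):
    """E(omega_0) = C(omega_0) + 0.5 * sum_i ln(lambda_i / lambda) - N * ln(n).

    Negative values favor the trained model over the null (random-label) model.
    """
    occam = 0.5 * sum(math.log(eig / reg) for eig in hessian_eigs)
    return cost_at_min + occam - num_labels * math.log(num_classes)


# Illustrative numbers only: 800 binary labels and three Hessian eigenvalues.
E = log_evidence_ratio(cost_at_min=100.0, hessian_eigs=[4.0, 2.0, 1.0],
                       reg=1.0, num_labels=800, num_classes=2)
trust_model = E < 0
```

A sharper minimum (larger eigenvalues) or a larger cost at the minimum raises <math>E(\omega_0)</math>, which is how the evidence trades depth against breadth.<br />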
<br />
The evidence supports the intuition that broad minima generalize better than sharp minima, but unlike the curvature it does not depend on the model parameterization. Dinh et al. (2017) showed one can increase the Hessian eigenvalues by rescaling the parameters, but they must simultaneously rescale the regularization coefficients, otherwise the model changes. Since Occam’s factor arises from the log ratio, <math>\ln (\lambda_i/\lambda) </math>, these two effects cancel out. Note however that while the evidence itself is invariant to model parameterization, one can find reparameterizations which change the approximate evidence after the Laplace approximation. It is difficult to evaluate the evidence for deep networks, as we cannot compute the Hessian of millions of parameters. Additionally, neural networks exhibit many equivalent minima, since we can permute the hidden units without changing the model. To compute the evidence we must carefully account for this “degeneracy”. The authors argue these issues are not a major limitation, since the intuition they build studying the evidence in simple cases will be sufficient to explain the results of both Zhang et al. (2016) and Keskar et al. (2016).<br />
<br />
==Bayes Theorem and Generalization==<br />
Zhang et al. (2016) showed that deep neural networks generalize well on training inputs with informative labels, but the same model can overfit on the same input images when the labels are randomized, perfectly memorizing the training set. To demonstrate that these observations are not unique to deep networks, the authors use logistic regression. They form a small balanced training set comprising 800 images from MNIST, of which half have true label “0” and half true label “1”. The test set is balanced, comprising 5000 MNIST images of zeros and 5000 MNIST images of ones. There are two tasks. In the first task, the labels of both the training and test sets are randomized. In the second task, the labels are informative, matching the true MNIST labels. The model has 784 weights and 1 bias.<br />
<br />
The accuracy of the model predictions on both the training and test sets is shown in figure 1. When trained on the informative labels, the model generalizes well to the test set, so long as it is weakly regularized. However the model also perfectly memorizes the random labels, replicating the observations of Zhang et al. (2016) in deep networks. No significant improvement in model performance is observed as the regularization coefficient increases. For completeness, we also evaluate the mean margin between training examples and the decision boundary. For both random and informative labels, the margin drops significantly as we reduce the regularization coefficient. When weakly regularized, the mean margin is roughly 50% larger for informative labels than for random labels.<br />
<br />
[[File:bg1.png|800px|thumb|center|]]<br />
<br />
Now consider figure 2, where we plot the mean cross-entropy of the model predictions, evaluated on both training and test sets, as well as the Bayesian log evidence ratio defined in the previous section. Looking first at the random label experiment in figure 2a, while the cross-entropy on the training set vanishes when the model is weakly regularized, the cross-entropy on the test set explodes. Not only does the model make random predictions, but it is extremely confident in those predictions. As the regularization coefficient is increased the test set cross-entropy falls, settling at ln 2, the cross-entropy of assigning equal probability to both classes. Now consider the Bayesian evidence, which we evaluate on the training set. The log evidence ratio is large and positive when the model is weakly regularized, indicating that the model is exponentially less plausible than assigning equal probabilities to each class. As the regularization parameter is increased, the log evidence ratio falls, but it is always positive, indicating that the model can never be expected to generalize well.<br />
Now consider figure 2b (informative labels). Once again, the training cross-entropy falls to zero when the model is weakly regularized, while the test cross-entropy is high. Even though the model makes accurate predictions, those predictions are overconfident. As the regularization coefficient increases, the test cross-entropy falls below ln 2, indicating that the model is successfully generalizing to the test set. Now consider the Bayesian evidence. The log evidence ratio is large and positive when the model is weakly regularized, but as the regularization coefficient increases, the log evidence ratio drops below zero, indicating that the model is exponentially more plausible than assigning equal probabilities to each class. As we further increase the regularization, the log evidence ratio rises to zero while the test cross-entropy rises to ln 2. Test cross-entropy and Bayesian evidence are strongly correlated, with minima at the same regularization strength.<br />
Bayesian model comparison has explained our results in a logistic regression. Meanwhile, Krueger et al. (2017) showed the largest Hessian eigenvalue also increased when training on random labels in deep networks, implying the evidence is falling. We conclude that Bayesian model comparison is quantitatively consistent with the results of Zhang et al. (2016) in linear models where we can compute the evidence, and qualitatively consistent with their results in deep networks where we cannot. Dziugaite & Roy (2017) recently demonstrated the results of Zhang et al. (2016) can also be understood by minimising a PAC-Bayes generalization bound which penalizes sharp minima.<br />
[[File:bg2.png|800px|thumb|center|]]<br />
==Bayes Theorem and Stochastic Gradient Descent ==<br />
<br />
We showed above that generalization is strongly correlated with the Bayesian evidence, a weighted combination of the depth of a minimum (the cost function) and its breadth (the Occam factor). Consequently Bayesians often add isotropic Gaussian noise to the gradient (Welling & Teh, 2011). In appendix A, we show this drives the parameters towards broad minima whose evidence is large. The noise introduced by small batch training is not isotropic, and its covariance matrix is a function of the parameter values, but empirically Keskar et al. (2016) found it has similar effects, driving the SGD away from sharp minima. This paper therefore proposes Bayesian principles also account for the “generalization gap”, whereby the test set accuracy often falls as the SGD batch size is increased (holding all other hyper-parameters constant). Since the gradient drives the SGD towards deep minima, while noise drives the SGD towards broad minima, we expect the test set performance to show a peak at an optimal batch size, which balances these competing contributions to the evidence.<br />
We were unable to observe a generalization gap in linear models (since linear models are convex there are no sharp minima to avoid). Instead we consider a shallow neural network with 800 hidden units and RELU hidden activations, trained on MNIST without regularization. We use SGD with a momentum parameter of 0.9. Unless otherwise stated, we use a constant learning rate of 1.0 which does not depend on the batch size or decay during training. Furthermore, we train on just 1000 images, selected at random from the MNIST training set. This enables us to compare small batch to full batch training. We emphasize that we are not trying to achieve optimal performance, but to study a simple model which shows a generalization gap between small and large batch training.<br />
In figure 3, we exhibit the evolution of the test accuracy and test cross-entropy during training. Our small batches are composed of 30 images, randomly sampled from the training set. Looking first at figure 3a, small batch training takes longer to converge, but after a thousand gradient updates a clear generalization gap in model accuracy emerges between small and large training batches. Now consider figure 3b. While the test cross-entropy for small batch training is lower at the end of training, the cross-entropy of both small and large training batches is increasing, which is indicative of over-fitting. Both models exhibit a minimum test cross-entropy, although after different numbers of gradient updates. Intriguingly, we show in appendix B that the generalization gap between small and large batch training shrinks significantly when we introduce L2 regularization.<br />
<br />
[[File:bg3.png|800px|thumb|center|]]<br />
<br />
From now on we focus on the test set accuracy (since this converges as the number of gradient updates increases). In figure 4a, we exhibit training curves for a range of batch sizes between 1 and 1000. We find that the model cannot train when the batch size <math>B \lesssim 10</math>. In figure 4b we plot the mean test set accuracy after 10000 training steps. A clear peak emerges, indicating that there is indeed an optimum batch size which maximizes the test accuracy, consistent with Bayesian intuition. The results of Keskar et al. (2016) focused on the decay in test accuracy above this optimum batch size.<br />
[[File:bg4.png|800px|thumb|center|]]<br />
<br />
==Stochastic Differential Equations and Scaling Rules==<br />
The results shown above indicate that the test accuracy peaks at an optimal batch size, if one holds the other SGD hyper-parameters constant. It is argued that this peak arises from the tradeoff between depth and breadth in the Bayesian evidence. However, it is not the batch size itself which controls this tradeoff, but the underlying scale of random fluctuations in the SGD dynamics. The following content identifies this SGD “noise scale”, and uses it to derive three scaling rules which predict how the optimal batch size depends on the learning rate, training set size and momentum coefficient. <br />
First, interpret the SGD gradient update as a discrete step in the integration of a stochastic differential equation,<br />
\begin{equation*}\frac{d\omega}{dt} = -\frac{dC}{d\omega} + \eta(t)\end{equation*}<br />
Here <math>t</math> is a continuous variable and <math>\eta(t)</math> represents noise with mean <math>\langle \eta(t) \rangle = 0</math> and covariance <math> \langle \eta (t)\eta (t')\rangle = gF (\omega)\delta (t-t')</math>, where <math>F(\omega)</math> is a matrix describing the gradient covariances.<br />
The SGD noise scale is taken to be <math>g \approx \epsilon N/B</math>, where <math>\epsilon</math> is the learning rate, <math>N</math> the training set size and <math>B</math> the batch size.<br />
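To make the role of the noise scale concrete, the following is a minimal Python sketch rather than anything taken from the paper. It assumes a one-dimensional quadratic cost <math>C(\omega) = a\omega^2/2</math> and a constant gradient covariance <math>F</math>, and integrates <math>d\omega/dt = -dC/d\omega + \eta(t)</math> with Euler-Maruyama steps. The names <code>simulate_sde</code> and <code>variance</code> are illustrative. Under these toy assumptions the stationary variance of <math>\omega</math> is approximately <math>gF/(2a)</math>, so a larger noise scale spreads the parameter more widely around the minimum.<br />

```python
import math
import random


def simulate_sde(g, a=1.0, f=1.0, dt=0.01, steps=200_000, seed=0):
    """Euler-Maruyama integration of d(omega)/dt = -a*omega + eta(t).

    Toy assumptions (not from the paper): cost C = a*omega^2/2 and a
    constant gradient covariance f, so <eta(t) eta(t')> = g*f*delta(t-t').
    Returns the trajectory with the first tenth discarded as burn-in.
    """
    rng = random.Random(seed)
    omega = 1.0
    noise_std = math.sqrt(g * f * dt)  # std of the noise added in one step
    path = []
    for _ in range(steps):
        grad = a * omega  # dC/d(omega) for the quadratic cost
        omega += -grad * dt + noise_std * rng.gauss(0.0, 1.0)
        path.append(omega)
    return path[steps // 10:]


def variance(xs):
    """Population variance of a list of floats."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)
```

Calling <code>variance(simulate_sde(g))</code> for a few values of <code>g</code> and comparing against <code>g / 2</code> illustrates the trend; with <code>g = 0</code> the trajectory simply decays to the minimum, as in noiseless gradient descent.<br />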
[[File:bg5.png|800px|thumb|center|]]<br />
[[File:bg6.png|800px|thumb|center|]]<br />
[[File:bg7.png|800px|thumb|center|]]<br />
The noise scale falls when the batch size <math>B</math> increases, consistent with our earlier observation of an optimal batch size <math>B_{opt}</math> while holding the other hyper-parameters fixed. Notice that one would equivalently observe an optimal learning rate if one held the batch size constant. A similar analysis of the SGD was recently performed by Mandt et al. (2017), although their treatment only holds near local minima where the covariances <math>F(\omega)</math> are stationary. Our analysis holds throughout training, which is necessary since Keskar et al. (2016) found that the beneficial influence of noise was most pronounced at the start of training.<br />
When we vary the learning rate or the training set size, we should keep the noise scale fixed, which implies that Bopt ∝ εN. In figure 5a, we plot the test accuracy as a function of batch size after (10000/ε) training steps, for a range of learning rates. Exactly as predicted, the peak moves to the right as ε increases. Additionally, the peak test accuracy achieved at a given learning rate does not begin to fall until ε ∼ 3, indicating that there is no significant discretization error in integrating the stochastic differential equation below this point. Above this point, the discretization error begins to dominate and the peak test accuracy falls rapidly. In figure 5b, we plot the best observed batch size as a function of learning rate, observing a clear linear trend, Bopt ∝ ε. The error bars indicate the distance from the best observed batch size to the next batch size sampled in our experiments.<br />
<br />
This scaling rule allows us to increase the learning rate with no loss in test accuracy and no increase in computational cost, simply by simultaneously increasing the batch size. We can then exploit increased parallelism across multiple GPUs, reducing model training times (Goyal et al., 2017). A similar scaling rule was independently proposed by Jastrzebski et al. (2017) and Chaudhari & Soatto (2017), although neither work identifies the existence of an optimal noise scale. A number of authors have proposed adjusting the batch size adaptively during training (Friedlander & Schmidt, 2012; Byrd et al., 2012; De et al., 2017), while Balles et al. (2016) proposed linearly coupling the learning rate and batch size within this framework. In Smith et al. (2017), we show empirically that decaying the learning rate during training and increasing the batch size during training are equivalent.<br />
In figure 6a we exhibit the test set accuracy as a function of batch size, for a range of training set sizes after 10000 steps (ε = 1 everywhere). Once again, the peak shifts right as the training set size rises, although the generalization gap becomes less pronounced as the training set size increases. In figure 6b, we plot the best observed batch size as a function of training set size; observing another linear trend, Bopt ∝ N. This scaling rule could be applied to production models, progressively growing the batch size as new training data is collected. We expect production datasets to grow considerably over time, and consequently large batch training is likely to become increasingly common.<br />
With momentum, the noise scale becomes <math>g \approx \epsilon N/(B(1-m))</math>, where <math>m</math> denotes the momentum coefficient; this reduces to the noise scale of conventional SGD as m → 0. When m > 0, we obtain an additional scaling rule Bopt ∝ 1/(1 − m). This scaling rule predicts that the optimal batch size will increase when the momentum coefficient is increased. In figure 7a we plot the test set performance as a function of batch size after 10000 gradient updates (ε = 1 everywhere), for a range of momentum coefficients. In figure 7b, we plot the best observed batch size as a function of the momentum coefficient, and fit our results to the scaling rule above, obtaining remarkably good agreement.<br />
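The three scaling rules can be combined into a small helper. The sketch below is illustrative only: it assumes the noise scale with momentum has the form <math>g \approx \epsilon N/(B(1-m))</math>, which is what the rule <math>B_{opt} \propto 1/(1-m)</math> implies when <math>g</math> is held fixed, and the function names <code>noise_scale</code> and <code>rescaled_batch_size</code> are hypothetical. Given a reference run whose batch size is already known to work well, it returns the batch size that keeps the noise scale unchanged under a new learning rate, training set size and momentum.<br />

```python
def noise_scale(lr, train_size, batch_size, momentum=0.0):
    """Approximate SGD noise scale g ~ lr * N / (B * (1 - m))."""
    if not 0.0 <= momentum < 1.0:
        raise ValueError("momentum must lie in [0, 1)")
    return lr * train_size / (batch_size * (1.0 - momentum))


def rescaled_batch_size(ref_batch, ref, new):
    """Batch size that preserves the noise scale of a reference run.

    `ref` and `new` are (lr, train_size, momentum) tuples. The result
    follows B_opt proportional to lr * N / (1 - m).
    """
    ref_lr, ref_n, ref_m = ref
    new_lr, new_n, new_m = new
    factor = (new_lr / ref_lr) * (new_n / ref_n) * (1.0 - ref_m) / (1.0 - new_m)
    return ref_batch * factor
```

For example, doubling the learning rate, quadrupling the training set and raising the momentum from 0.9 to 0.95 multiplies the suggested batch size by roughly 16. In practice the result would be rounded to an integer, and the rule is only expected to hold while the learning rate is small enough that discretization error stays negligible.<br />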
<br />
==Critiques==<br />
<br />
#Bayesian statistics is not provably, at present, a theory that can be used to explain why a learning algorithm works. The Bayesian theory is too optimistic: we introduce a prior and model and then trust both implicitly. Relative to any particular prior and model (likelihood), the Bayesian posterior is the optimal summary of the data, but if either part is misspecified, then the Bayesian posterior carries no optimality guarantee. The prior is chosen for convenience here. <br />
#The paper does not discuss the information bottleneck analysis, which also addresses the generalization ability of a model. <br />
#The paper does not discuss real online learning with streaming data, where the total number of data points is unknown.<br />
#The paper shows how mini-batch noise in SGD can improve the performance of neural networks. However, the usefulness of the approach could be described and analyzed in greater detail if the authors reported performance on a range of well-known real-life datasets.<br />
<br />
==Conclusion==<br />
<br />
The paper showed that mini-batch noise helps SGD move away from sharp minima, and provided evidence that there is an optimum batch size which maximizes the test accuracy. Interpreting SGD as the integration of a stochastic differential equation, this batch size is proportional to the learning rate and the training set size. Moreover, the authors showed that <math>B_{opt} \propto 1/(1 - m) </math>, where <math>m</math> is the momentum coefficient. Further analysis of the relation between the learning rate, effective learning rate, and batch size was presented at ICLR 2018, where the authors showed experimentally that the benefits of decaying the learning rate can be obtained by instead increasing the batch size, which also dramatically reduces the number of parameter updates, and that they were able to use hyper-parameters from the literature without any further tuning (Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, Quoc V. Le).<br />
<br />
==References==<br />
<br />
#Alessandro Achille and Stefano Soatto. On the emergence of invariance and disentangling in deep representations. arXiv preprint arXiv:1706.01350, 2017.<br />
#Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086, 2016.<br />
#Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012. <br />
#Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. arXiv preprint arXiv:1710.11029, 2017.<br />
#Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.<br />
#Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.<br />
#Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933, 2017.<br />
#Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.<br />
#Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.<br />
#Crispin W Gardiner. Handbook of Stochastic Methods, volume 4. Springer Berlin, 1985.<br />
#Pascal Germain, Francis Bach, Alexandre Lacoste, and Simon Lacoste-Julien. PAC-bayesian theory meets bayesian inference. In Advances in Neural Information Processing Systems, pp. 1884–1892, 2016.<br />
#Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.<br />
#Stephen F Gull. Bayesian inductive inference and maximum entropy. In Maximum-entropy and Bayesian methods in science and engineering, pp. 53–74. Springer, 1988.<br />
#Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pp. 5–13. ACM, 1993.<br />
#Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.<br />
#Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.<br />
#Stanisław Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.<br />
#Robert E Kass and Adrian E Raftery. Bayes factors. Journal of the american statistical association, 90(430):773–795, 1995.<br />
#Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017.<br />
#Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.<br />
#David Krueger, Nicolas Ballas, Stanislaw Jastrzebski, Devansh Arpit, Maxinder S Kanwal, Tegan Maharaj, Emmanuel Bengio, Asja Fischer, and Aaron Courville. Deep nets don’t learn via memorization. ICLR Workshop, 2017.<br />
#Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, pp. 2101–2110, 2017.<br />
#David JC MacKay. A practical bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992.<br />
#Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.<br />
#Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.<br />
#Samuel L. Smith, Pieter-Jan Kindermans, and Quoc V. Le. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.<br />
#Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.<br />
#Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_Bayesian_Perspective_on_Generalization_and_Stochastic_Gradient_Descent&diff=42154A Bayesian Perspective on Generalization and Stochastic Gradient Descent2018-11-30T23:28:33Z<p>Z43ma: Format update.</p>
<hr />
<div>==Introduction==<br />
This paper shows that Bayesian principles can explain many recent observations in the deep learning literature, and provide practical new insights. This work builds on Zhang et al. (2016), who showed deep neural networks can easily memorize randomly labeled training data, despite generalizing well on real labels of the same inputs. The authors consider two questions: how can we predict if a minimum will generalize to the test set, and why does stochastic gradient descent find minima that generalize well? <br />
<br />
The paper shows that the same phenomenon occurs even in small linear models. These observations are explained by the Bayesian evidence, which penalizes sharp minima but is invariant to model parameterization. They also demonstrate that, when one holds the learning rate fixed, there is an optimum batch size which maximizes the test set accuracy.<br />
<br />
The authors propose that the noise introduced by small mini-batches drives the parameters towards minima whose evidence is large. Interpreting stochastic gradient descent as a stochastic differential equation, we identify the “noise scale” <math display="inline"> g \approx \epsilon N/B </math> where <math display="inline">ε</math> is the learning rate, <math display="inline">N</math> the training set size and <math display="inline">B</math> the batch size. Consequently the optimum batch size is proportional to both the learning rate and the size of the training set, <math display="inline">B_{opt} \propto \epsilon N</math>. The authors verify these predictions empirically.<br />
<br />
==Motivation and Related Work==<br />
Zhang et al. (2016) trained deep convolutional networks on ImageNet and CIFAR10, achieving excellent accuracy on both training and test sets. They then took the same input images, but randomized the labels, and found that while their networks were now unable to generalize to the test set, they still memorized the training labels. They claimed these results contradict learning theory, although this claim is disputed (Kawaguchi et al., 2017; Dziugaite & Roy, 2017). Nonetheless, their results beg the question; if our models can assign arbitrary labels to the training set, why do they work so well in practice? <br />
<br />
Meanwhile, Keskar et al. (2016) observed that if we hold the learning rate fixed and increase the batch size, the test accuracy usually falls. This striking result shows improving the estimate of the full-batch gradient can harm performance. Goyal et al. (2017) observed a linear scaling rule between batch size and learning rate in a deep ResNet, while Hoffer et al. (2017) proposed a square root rule on theoretical grounds.<br />
<br />
Many authors have suggested “broad minima” whose curvature is small may generalize better than “sharp minima” whose curvature is large (Chaudhari et al., 2016; Hochreiter & Schmidhuber, 1997). Indeed, Dziugaite & Roy (2017) argued the results of Zhang et al. (2016) can be understood using “nonvacuous” PAC-Bayes generalization bounds which penalize sharp minima, while Keskar et al. (2016) showed stochastic gradient descent (SGD) finds wider minima as the batch size is reduced. However, Dinh et al. (2017) challenged this interpretation, by arguing that the curvature of a minimum can be arbitrarily increased by changing the model parameterization.<br />
<br />
==Contribution==<br />
<br />
The main contributions of this paper are to show that:<br />
* The results of Zhang et al. (2016) are not unique to deep learning; the same phenomenon is observed in a small “over-parameterized” linear model. Overparameterization occurs when a model is able to effectively “remember” training data. This occurs when there are enough parameters that the system of equations ends up with an infinite number of possible solutions. One can see why this over-training would lead to poor results in test cases, as this “memorization” learns noise as opposed to the inherent structure of different classes. It is demonstrated that this phenomenon is straightforwardly understood by evaluating the Bayesian evidence in favor of each model, which penalizes sharp minima but is invariant to the model parameterization.<br />
* SGD integrates a stochastic differential equation whose “noise scale” is <math>g \approx \epsilon N/B</math>, where <math>\epsilon</math> is the learning rate, <math>N</math> the training set size, and <math>B</math> the batch size. Noise drives SGD away from sharp minima, and therefore there is an optimal batch size which maximizes the test set accuracy. This optimal batch size is '''proportional to the learning rate and training set size'''.<br />
<br />
Zhang et al. (2016) showed high training competency of neural networks under informative labels, but drastic overfitting on improper labels. This implies weak generalizability even when a small proportion of labels are improper. The authors show that generalization is strongly correlated with the Bayesian evidence, a weighted combination of the depth of a minimum (the cost function) and its breadth (the Occam factor). Bayesians tend to make distributional assumptions on gradient updates by adding isotropic Gaussian noise. This paper builds upon these Bayesian principles by driving SGD away from sharp minima, and towards broad minima (the more broad, the better generalization due to less influence from small perturbations within input). The stochastic differential equation used as a component of gradient updates effectively serves as injected noise that improves a network's generalizability.<br />
<br />
==Main Results==<br />
<br />
When weakly regularized, the model memorizes random labels but still generalizes properly on informative labels, although its predictions are overconfident. The authors also showed that the test accuracy peaks at an optimal batch size, if one holds the other SGD hyper-parameters constant. It is postulated that the optimum represents a tradeoff between depth and breadth in the Bayesian evidence. However, it is the underlying scale of random fluctuations in the SGD dynamics which controls the tradeoff, not the batch size itself. Furthermore, this test accuracy peak shifts as the training set size rises. The authors observed that the best found batch size is proportional to the learning rate. This scaling rule allowed the authors to increase the learning rate by simultaneously increasing the batch size with no loss in test accuracy and no increase in computational cost, so parallelism across multiple GPUs can be exploited to decrease training time. The scaling rule could also be applied to production models by progressively increasing the batch size as new training data is introduced.<br />
<br />
==Bayesian Model Comparison==<br />
<br />
===Introduction to Bayesian Statistics===<br />
Bayes' theorem is a fundamental theorem in Bayesian statistics, as it is used by Bayesian methods to update probabilities, which are degrees of belief, after obtaining new data. Given two events <math>A</math> and <math>B</math> with <math>P(B) > 0</math>, Bayes' theorem gives the conditional probability of <math>A</math> given that <math>B </math> is true:<br />
\begin{align*}\displaystyle P(A\mid B)={\frac {P(B\mid A)P(A)}{P(B)}}\end{align*}<br />
<br />
Bayesian networks are DAGs whose nodes represent variables in the Bayesian sense: they may be observable quantities, latent variables, unknown parameters or hypotheses. Edges represent conditional dependencies; nodes that are not connected (no path connects one node to another) represent variables that are conditionally independent of each other. Each node is associated with a probability function that takes, as input, a particular set of values for the node's parent variables, and gives (as output) the probability (or probability distribution, if applicable) of the variable represented by the node. For example, if <math>m </math> parent nodes represent <math>m </math> Boolean variables then the probability function could be represented by a table of <math>2^{m} </math> entries, one entry for each of the <math>2^{m} </math> possible parent combinations. <br />
<br />
===Bayesian Model Comparison in Neural Networks===<br />
MacKay (1992) applied Bayesian model comparison to neural networks. An overview is presented below. <br />
<br />
We first consider a classification model <math>M </math> with a single parameter <math>\omega </math>, training inputs <math>x </math> and training labels <math>y </math>. We can infer a posterior probability distribution over the parameter by applying Bayes theorem :<br />
<br />
\begin{align*}P(\omega\mid y,x;M) = \frac{P(y\mid \omega,x;M)P(\omega;M) }{P(y\mid x;M)}\end{align*}<br />
<br />
The likelihood is <math>P(y\mid \omega,x;M) = \prod_i P(y_i\mid \omega,x_i;M) = e^{-H(\omega;M)} </math>, where <math>H(\omega;M) = -\sum_i \ln P(y_i\mid \omega,x_i;M)</math> denotes the cross-entropy of unique categorical labels. Using a Gaussian prior, <math>P(\omega;M) = \sqrt{\lambda/2\pi}\, e^{-\lambda\omega^2/2} </math>, the posterior probability density of the parameter given the training data is <math>P(\omega\mid y,x;M) \propto \sqrt{\lambda/2\pi}\, e^{-C(\omega;M)} </math>, where <math>C(\omega;M) = H(\omega;M) + \lambda\omega^2/2 </math> denotes the L2 regularized cross entropy, or “cost function”, and <math>\lambda </math> is the regularization coefficient. <br />
<br />
The value <math>\omega_0 </math> which minimizes the cost function lies at the maximum of this posterior. To predict an unknown label <math>y_t </math> of a new input <math>x_t </math>, we should compute the integral,<br />
<br />
\begin{align*} P(y_t\mid x_t,y,x;M) &= \int d\omega\, P(y_t\mid \omega,x_t;M)\,P(\omega\mid y,x;M)\\ &= \frac{\int d \omega\, P(y_t \mid \omega ,x_t;M)e^{-C(\omega;M)}}{\int d \omega\, e^{-C(\omega;M)}} \end{align*}<br />
<br />
However, these integrals are dominated by the region near <math>\omega_0 </math> . We usually approximate <math>P(y_t\mid x_t,x,y;M) \approx P(y_t\mid \omega_0,x_t;M) </math>. Having minimized <math>C(\omega;M) </math> to find <math>\omega_0 </math>, we now wish to compare two different models and select the best one. We use the probability ratio<br />
<br />
\begin{align*}\frac{P(M_1\mid y,x)}{P (M_2\mid y, x)} = \frac{P(y\mid x;M_1) P(M_1)}{ P (y\mid x; M_2) P (M_2)} . \end{align*} <br />
<br />
The second factor on the right is the prior ratio, which describes which model is most plausible. To avoid unnecessary subjectivity, we usually set this to 1. Meanwhile the first factor on the right is the evidence ratio, which controls how much the training data changes our prior beliefs.<br />
<br />
Germain et al. (2016) showed that maximizing the evidence (or “marginal likelihood”) minimizes a PAC-Bayes generalization bound. To compute it, we evaluate <br />
\begin{align*}P(y\mid x;M) &= \int d\omega P(y\mid \omega,x;M)P(\omega;M) \\ &=\sqrt{\frac{\lambda}{2\pi}}\int d \omega e^{-C(\omega;M)}\end{align*}<br />
<br />
Notice that the evidence is computed by integrating out the parameters; and consequently it is invariant to the model parameterization. <br />
Since this integral is dominated by the region near the minimum <math>\omega_0 </math>, we can estimate the evidence by Taylor expanding <math>C(\omega; M) \approx C(\omega_0) + C''(\omega_0)(\omega - \omega_0)^2/2</math>. This gives us<br />
<br />
\begin{align*} P(y\mid x;M) &\approx e^{-C(\omega_0)}\sqrt{\frac{\lambda}{2\pi}} \int d \omega e^{-C''(\omega_0)(\omega - \omega_0)^2/2}\\ &= \exp \big\{- C(\omega_0)-\frac{1}{2}\ln(C''(\omega_0)/\lambda) \big\}.\end{align*}<br />
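The single-parameter estimate can be checked numerically. The sketch below is a toy illustration, not code from the paper: it evaluates the Laplace estimate <math>\exp\{-C(\omega_0) - \tfrac{1}{2}\ln(C''(\omega_0)/\lambda)\}</math> and compares it with a trapezoid-rule evaluation of <math>\sqrt{\lambda/2\pi}\int d\omega\, e^{-C(\omega)}</math>. The cost <code>toy_cost</code> is a hypothetical quadratic, for which the two values agree up to integration error; for a non-quadratic cost the Laplace value is only an approximation.<br />

```python
import math


def laplace_evidence(cost_at_min, curvature, lam):
    """Laplace estimate exp{-C(w0) - 0.5*ln(C''(w0)/lam)} of the evidence."""
    return math.exp(-cost_at_min - 0.5 * math.log(curvature / lam))


def numeric_evidence(cost, lam, lo, hi, n=200_000):
    """Trapezoid-rule value of sqrt(lam/2pi) * integral of exp(-C(w)) dw."""
    h = (hi - lo) / n
    total = 0.5 * (math.exp(-cost(lo)) + math.exp(-cost(hi)))
    for k in range(1, n):
        total += math.exp(-cost(lo + k * h))
    return math.sqrt(lam / (2.0 * math.pi)) * total * h


def toy_cost(w):
    # Hypothetical cost with minimum C(w0) = 0.5 at w0 = 1 and curvature 2.
    return 0.5 + 2.0 * (w - 1.0) ** 2 / 2.0
```

With <code>lam = 0.5</code>, <code>laplace_evidence(0.5, 2.0, 0.5)</code> and <code>numeric_evidence(toy_cost, 0.5, -20.0, 22.0)</code> should agree closely. Increasing the curvature while keeping the cost at the minimum fixed lowers the evidence, which is the sense in which sharp minima are penalized.<br />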
<br />
The evidence is controlled by the value of the cost function at the minimum, and by the logarithm of the ratio of the curvature about this minimum compared to the regularization constant. In models with <math>p</math> parameters, the curvature is described by the Hessian <math>\nabla\nabla C(\omega)</math> evaluated at <math>\omega_0</math>, whose determinant is the product of its eigenvalues <math>\lambda_i</math>, so that <br />
\begin{align*} P(y\mid x;M) &\approx \frac{\lambda^{p/2}\, e^{-C(\omega_0)}}{\left|\nabla\nabla C(\omega)\right|^{1/2}_{\omega_0}} \\ &= \exp \big\{- C(\omega_0)-\frac{1}{2} \sum_{i=1}^p \ln (\lambda_i/\lambda) \big\}.\end{align*}<br />
<br />
Occam’s factor arises from the log ratio <math>\ln (\lambda_i/\lambda) </math>. The Occam factor describes the fraction of the prior parameter space consistent with the data. Occam’s factor penalizes the amount of information the model must learn about the parameters to accurately model the training data. Since the fraction is always less than one, the authors propose to approximate <math>P(y\mid x;M) </math> away from local minima by only performing the summation over eigenvalues <math>\lambda_i \geq \lambda </math>.<br />
<br />
The authors compare evidence against a null model which assumes the labels are entirely random. This model has no parameters, and so the evidence is controlled by the likelihood alone. <math>P(y\mid x;NULL) = (1/n)^N = e^{-N \ln(n)} </math>, where <math>n </math> denotes the number of model classes and <math>N</math> the number of training labels. The evidence ratio :<br />
\begin{equation*}\frac{P(y\mid x;M) }{P(y\mid x;NULL) } = e ^{-E(\omega_0)} \end{equation*}<br />
where <math>E(\omega_0) = C(\omega_0)+\frac{1}{2} \sum_{i=1}^p \ln (\lambda_i/\lambda) - N\ln (n) </math> is the log evidence ratio in favor of the null model.<br />
The authors assign confidence to the predictions of a model if and only if <math>E(\omega_0) < 0 </math>.<br />
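As a hedged illustration of how this quantity might be computed, the sketch below divides the Laplace estimate <math>\exp\{-C(\omega_0) - \tfrac{1}{2}\sum_i \ln(\lambda_i/\lambda)\}</math> by the null evidence <math>e^{-N\ln(n)}</math>, which gives <math>E(\omega_0) = C(\omega_0) + \tfrac{1}{2}\sum_i \ln(\lambda_i/\lambda) - N\ln(n)</math>. The function name <code>log_evidence_ratio</code> and its <code>clip</code> flag are hypothetical; the flag mirrors the restriction of the sum to eigenvalues <math>\lambda_i \geq \lambda</math> described above. The Hessian eigenvalues are assumed to have been computed elsewhere.<br />

```python
import math


def log_evidence_ratio(cost_at_min, hessian_eigs, lam, n_labels, n_classes,
                       clip=True):
    """E = C(w0) + 0.5*sum(ln(ev/lam)) - N*ln(n), in favor of the null model.

    With clip=True only eigenvalues ev >= lam enter the sum, so each
    retained term is non-negative. A negative result means the trained
    model is more plausible than assigning equal probability to each class.
    """
    if clip:
        eigs = [ev for ev in hessian_eigs if ev >= lam]
    else:
        eigs = list(hessian_eigs)
    occam = 0.5 * sum(math.log(ev / lam) for ev in eigs)
    return cost_at_min + occam - n_labels * math.log(n_classes)
```

For a binary task with 800 training labels the null term is <math>800 \ln 2 \approx 554.5</math>, so the model is only preferred when its cost plus Occam penalty stays below that value.<br />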
<br />
The evidence supports the intuition that broad minima generalize better than sharp minima, but unlike the curvature it does not depend on the model parameterization. Dinh et al. (2017) showed one can increase the Hessian eigenvalues by rescaling the parameters, but they must simultaneously rescale the regularization coefficients, otherwise the model changes. Since Occam’s factor arises from the log ratio, <math>\ln (\lambda_i/\lambda) </math>, these two effects cancel out. Note however that while the evidence itself is invariant to model parameterization, one can find reparameterizations which change the approximate evidence after the Laplace approximation. It is difficult to evaluate the evidence for deep networks, as we cannot compute the Hessian of millions of parameters. Additionally, neural networks exhibit many equivalent minima, since we can permute the hidden units without changing the model. To compute the evidence we must carefully account for this “degeneracy”. The authors argue these issues are not a major limitation, since the intuition they build by studying the evidence in simple cases will be sufficient to explain the results of both Zhang et al. (2016) and Keskar et al. (2016).<br />
<br />
==Bayes Theorem and Generalization==<br />
Zhang et al. (2016) showed that deep neural networks generalize well on training inputs with informative labels, but the same model can overfit on the same input images when the labels are randomized, perfectly memorizing the training set. To demonstrate that these observations are not unique to deep networks, the authors use logistic regression. They form a small balanced training set comprising 800 images from MNIST, of which half have true label “0” and half true label “1”. The test set is balanced, comprising 5000 MNIST images of zeros and 5000 MNIST images of ones. There are two tasks. In the first task, the labels of both the training and test sets are randomized. In the second task, the labels are informative, matching the true MNIST labels. The model has 784 weights and 1 bias.<br />
<br />
The accuracy of the model predictions on both the training and test sets is shown in figure 1. When trained on the informative labels, the model generalizes well to the test set, so long as it is weakly regularized. However the model also perfectly memorizes the random labels, replicating the observations of Zhang et al. (2016) in deep networks. No significant improvement in model performance is observed as the regularization coefficient increases. For completeness, we also evaluate the mean margin between training examples and the decision boundary. For both random and informative labels, the margin drops significantly as we reduce the regularization coefficient. When weakly regularized, the mean margin is roughly 50% larger for informative labels than for random labels.<br />
<br />
[[File:bg1.png|800px|thumb|center|]]<br />
<br />
Now consider figure 2, where we plot the mean cross-entropy of the model predictions, evaluated on both training and test sets, as well as the Bayesian log evidence ratio defined in the previous section. Looking first at the random label experiment in figure 2a, while the cross-entropy on the training set vanishes when the model is weakly regularized, the cross-entropy on the test set explodes. Not only does the model make random predictions, but it is extremely confident in those predictions. As the regularization coefficient is increased the test set cross-entropy falls, settling at ln 2, the cross-entropy of assigning equal probability to both classes. Now consider the Bayesian evidence, which we evaluate on the training set. The log evidence ratio is large and positive when the model is weakly regularized, indicating that the model is exponentially less plausible than assigning equal probabilities to each class. As the regularization parameter is increased, the log evidence ratio falls, but it is always positive, indicating that the model can never be expected to generalize well.<br />
Now consider figure 2b (informative labels). Once again, the training cross-entropy falls to zero when the model is weakly regularized, while the test cross-entropy is high. Even though the model makes accurate predictions, those predictions are overconfident. As the regularization coefficient increases, the test cross-entropy falls below ln 2, indicating that the model is successfully gener- alizing to the test set. Now consider the Bayesian evidence. The log evidence ratio is large and positive when the model is weakly regularized, but as the regularization coefficient increases, the log evidence ratio drops below zero, indicating that the model is exponentially more plausible than assigning equal probabilities to each class. As we further increase the regularization, the log evi- dence ratio rises to zero while the test cross-entropy rises to ln 2. Test cross-entropy and Bayesian evidence are strongly correlated, with minima at the same regularization strength.<br />
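<br />
The role of ln 2 as the reference value can be checked with a few lines of arithmetic. This is a minimal sketch, not taken from the paper; the probabilities 1e-6 and 1 - 1e-6 are hypothetical stand-ins for an "extremely confident" model.<br />

```python
import math

def binary_cross_entropy(p_true_class):
    """Cross-entropy (in nats) of one prediction, given the probability
    the model assigned to the correct class."""
    return -math.log(p_true_class)

uniform = binary_cross_entropy(0.5)               # equal probability for both classes
confident_wrong = binary_cross_entropy(1e-6)      # overconfident and wrong
confident_right = binary_cross_entropy(1 - 1e-6)  # overconfident and right

# A model that guesses at random with extreme confidence is right half the time,
# so its mean cross-entropy is dominated by the confident mistakes.
mean_overconfident = 0.5 * confident_wrong + 0.5 * confident_right
print(uniform, mean_overconfident)
```

The uniform prediction scores exactly ln 2, while the confident random guesser scores far worse even though both have 50% accuracy, which is the behaviour described for the weakly regularized model on random labels.<br />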
Bayesian model comparison has explained our results in a logistic regression. Meanwhile, Krueger et al. (2017) showed the largest Hessian eigenvalue also increased when training on random labels in deep networks, implying the evidence is falling. We conclude that Bayesian model comparison is quantitatively consistent with the results of Zhang et al. (2016) in linear models where we can compute the evidence, and qualitatively consistent with their results in deep networks where we cannot. Dziugaite & Roy (2017) recently demonstrated the results of Zhang et al. (2016) can also be understood by minimising a PAC-Bayes generalization bound which penalizes sharp minima.<br />
[[File:bg2.png|800px|thumb|center|]]<br />
==Bayes Theorem and Stochastic Gradient Descent ==<br />
<br />
We showed above that generalization is strongly correlated with the Bayesian evidence, a weighted combination of the depth of a minimum (the cost function) and its breadth (the Occam factor). Consequently, Bayesians often add isotropic Gaussian noise to the gradient (Welling & Teh, 2011). In appendix A, we show this drives the parameters towards broad minima whose evidence is large. The noise introduced by small batch training is not isotropic, and its covariance matrix is a function of the parameter values, but empirically Keskar et al. (2016) found it has similar effects, driving the SGD away from sharp minima. This paper therefore proposes that Bayesian principles also account for the “generalization gap”, whereby the test set accuracy often falls as the SGD batch size is increased (holding all other hyper-parameters constant). Since the gradient drives the SGD towards deep minima, while noise drives the SGD towards broad minima, we expect the test set performance to show a peak at an optimal batch size, which balances these competing contributions to the evidence.<br />
We were unable to observe a generalization gap in linear models (since linear models are convex there are no sharp minima to avoid). Instead we consider a shallow neural network with 800 hidden units and ReLU hidden activations, trained on MNIST without regularization. We use SGD with a momentum parameter of 0.9. Unless otherwise stated, we use a constant learning rate of 1.0 which does not depend on the batch size or decay during training. Furthermore, we train on just 1000 images, selected at random from the MNIST training set. This enables us to compare small batch to full batch training. We emphasize that we are not trying to achieve optimal performance, but to study a simple model which shows a generalization gap between small and large batch training.<br />
In figure 3, we exhibit the evolution of the test accuracy and test cross-entropy during training. Our small batches are composed of 30 images, randomly sampled from the training set. Looking first at figure 3a, small batch training takes longer to converge, but after a thousand gradient updates a clear generalization gap in model accuracy emerges between small and large training batches. Now consider figure 3b. While the test cross-entropy for small batch training is lower at the end of training, the cross-entropy of both small and large training batches is increasing, indicative of over-fitting. Both models exhibit a minimum test cross-entropy, although after different numbers of gradient updates. Intriguingly, we show in appendix B that the generalization gap between small and large batch training shrinks significantly when we introduce L2 regularization.<br />
<br />
[[File:bg3.png|800px|thumb|center|]]<br />
<br />
From now on we focus on the test set accuracy (since this converges as the number of gradient updates increases). In figure 4a, we exhibit training curves for a range of batch sizes between 1 and 1000. We find that the model cannot train when the batch size <math>B \lesssim 10</math>. In figure 4b we plot the mean test set accuracy after 10000 training steps. A clear peak emerges, indicating that there is indeed an optimum batch size which maximizes the test accuracy, consistent with Bayesian intuition. The results of Keskar et al. (2016) focused on the decay in test accuracy above this optimum batch size.<br />
[[File:bg4.png|800px|thumb|center|]]<br />
<br />
==Stochastic Differential Equations and Scaling Rules==<br />
The results shown above indicate that the test accuracy peaks at an optimal batch size, if one holds the other SGD hyper-parameters constant. It is argued that this peak arises from the tradeoff between depth and breadth in the Bayesian evidence. However, it is not the batch size itself which controls this tradeoff, but the underlying scale of random fluctuations in the SGD dynamics. The following content identifies this SGD “noise scale”, and uses it to derive three scaling rules which predict how the optimal batch size depends on the learning rate, training set size and momentum coefficient. <br />
First, the gradient update is interpreted as the discrete update of a stochastic differential equation:<br />
\begin{equation*}\frac{d\omega}{dt} = \frac{dC}{d\omega} + \eta(t)\end{equation*}<br />
Here <math>\eta(t)</math> represents noise with <math>\langle \eta(t) \rangle = 0</math> and <math> \langle \eta (t)\eta (t')\rangle = gF (\omega)\delta (t-t')</math>.<br />
<math>t</math> is a continuous variable, and <math>F(\omega)</math> is a matrix describing the gradient covariances.<br />
The SGD noise scale is taken to be <math>g \approx \epsilon N/B</math>, where <math>\epsilon</math> is the learning rate, <math>N</math> is the training set size and <math>B</math> is the batch size.<br />
[[File:bg5.png|800px|thumb|center|]]<br />
[[File:bg6.png|800px|thumb|center|]]<br />
[[File:bg7.png|800px|thumb|center|]]<br />
The noise scale falls when the batch size increases, consistent with our earlier observation of an optimal batch size Bopt while holding the other hyper-parameters fixed. Notice that one would equivalently observe an optimal learning rate if one held the batch size constant. A similar analysis of the SGD was recently performed by Mandt et al. (2017), although their treatment only holds near local minima where the covariances F(ω) are stationary. Our analysis holds throughout training, which is necessary since Keskar et al. (2016) found that the beneficial influence of noise was most pronounced at the start of training.<br />
When we vary the learning rate or the training set size, we should keep the noise scale fixed, which implies that Bopt ∝ εN. In figure 5a, we plot the test accuracy as a function of batch size after (10000/ε) training steps, for a range of learning rates. Exactly as predicted, the peak moves to the right as ε increases. Additionally, the peak test accuracy achieved at a given learning rate does not begin to fall until ε ∼ 3, indicating that there is no significant discretization error in integrating the stochastic differential equation below this point. Above this point, the discretization error begins to dominate and the peak test accuracy falls rapidly. In figure 5b, we plot the best observed batch size as a function of learning rate, observing a clear linear trend, Bopt ∝ ε. The error bars indicate the distance from the best observed batch size to the next batch size sampled in our experiments.<br />
<br />
This scaling rule allows us to increase the learning rate with no loss in test accuracy and no increase in computational cost, simply by simultaneously increasing the batch size. We can then exploit increased parallelism across multiple GPUs, reducing model training times (Goyal et al., 2017). A similar scaling rule was independently proposed by Jastrzebski et al. (2017) and Chaudhari & Soatto (2017), although neither work identifies the existence of an optimal noise scale. A number of authors have proposed adjusting the batch size adaptively during training (Friedlander & Schmidt, 2012; Byrd et al., 2012; De et al., 2017), while Balles et al. (2016) proposed linearly coupling the learning rate and batch size within this framework. In Smith et al. (2017), we show empirically that decaying the learning rate during training and increasing the batch size during training are equivalent.<br />
In figure 6a we exhibit the test set accuracy as a function of batch size, for a range of training set sizes after 10000 steps (ε = 1 everywhere). Once again, the peak shifts right as the training set size rises, although the generalization gap becomes less pronounced as the training set size increases. In figure 6b, we plot the best observed batch size as a function of training set size, observing another linear trend, Bopt ∝ N. This scaling rule could be applied to production models, progressively growing the batch size as new training data is collected. We expect production datasets to grow considerably over time, and consequently large batch training is likely to become increasingly common.<br />
When SGD is run with momentum, the noise scale becomes <math>g \approx \epsilon N/(B(1-m))</math>, where <math>m</math> is the momentum coefficient. This reduces to the noise scale of conventional SGD as m → 0. When m > 0, we obtain an additional scaling rule Bopt ∝ 1/(1 − m). This scaling rule predicts that the optimal batch size will increase when the momentum coefficient is increased. In figure 7a we plot the test set performance as a function of batch size after 10000 gradient updates (ε = 1 everywhere), for a range of momentum coefficients. In figure 7b, we plot the best observed batch size as a function of the momentum coefficient, and fit our results to the scaling rule above, obtaining remarkably good agreement.<br />
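<br />
The three scaling rules all follow from holding the noise scale fixed, which can be sketched as a small calculation. The function names and the baseline configuration below are hypothetical and chosen only for illustration: the sketch assumes that a batch size of 30 was found to work well at learning rate 1, 1000 training images and momentum 0.9, which the paper does not claim.<br />

```python
def noise_scale(lr, n_train, batch_size, momentum=0.0):
    """SGD noise scale g ~ lr * N / (B * (1 - m)); m = 0 gives plain SGD."""
    return lr * n_train / (batch_size * (1.0 - momentum))

def rescaled_batch_size(base_batch, base_lr, base_n, base_m, lr, n_train, momentum):
    """Batch size that keeps the noise scale equal to that of a baseline run."""
    g = noise_scale(base_lr, base_n, base_batch, base_m)
    return lr * n_train / (g * (1.0 - momentum))

# Hypothetical baseline: B = 30 at lr = 1.0, N = 1000, m = 0.9.
base = (30, 1.0, 1000, 0.9)
b_lr = rescaled_batch_size(*base, lr=2.0, n_train=1000, momentum=0.9)   # double the learning rate
b_n = rescaled_batch_size(*base, lr=1.0, n_train=4000, momentum=0.9)    # four times the data
b_m = rescaled_batch_size(*base, lr=1.0, n_train=1000, momentum=0.95)   # higher momentum
print(b_lr, b_n, b_m)
```

Under these assumptions, doubling the learning rate doubles the suggested batch size, quadrupling the training set quadruples it, and moving the momentum from 0.9 to 0.95 doubles it, mirroring Bopt ∝ εN/(1 − m).<br />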
<br />
==Critiques==<br />
<br />
#Bayesian statistics is not, at present, provably a theory that can explain why a learning algorithm works. Bayesian theory is too optimistic: we introduce a prior and a model and then trust both implicitly. Relative to any particular prior and model (likelihood), the Bayesian posterior is the optimal summary of the data, but if either part is misspecified, the Bayesian posterior carries no optimality guarantee. Here, the prior is chosen for convenience. <br />
#There is no discussion of the information bottleneck analysis, which also addresses the generalization ability of models. <br />
#There is no discussion of true online learning with streaming data, where the total number of data points is unknown.<br />
#The paper presents how mini-batch noise in SGD can improve the performance of neural networks. However, the usefulness of the approach could be described and analyzed in greater detail if the authors reported performance on a variety of well-known real-world datasets.<br />
<br />
==Conclusion==<br />
<br />
The paper showed that mini-batch noise helps SGD move away from sharp minima, and provided evidence that there is an optimum batch size that maximizes the test accuracy. Interpreting SGD as the integration of a stochastic differential equation, this batch size is proportional to the learning rate and the training set size. Moreover, the authors showed that <math>B_{opt} \propto 1/(1 - m) </math>, where <math>m</math> is the momentum coefficient. Further analysis of the relation between the learning rate, effective learning rate, and batch size was presented at ICLR 2018 (Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, Quoc V. Le), where the authors showed experimentally that the benefits of decaying the learning rate can also be obtained by increasing the batch size, which in addition dramatically reduces the number of parameter updates, and that hyper-parameters from the literature could be reused without any further tuning.<br />
<br />
==References==<br />
<br />
#Alessandro Achille and Stefano Soatto. On the emergence of invariance and disentangling in deep representations. arXiv preprint arXiv:1706.01350, 2017.<br />
#Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086, 2016.<br />
#Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012. <br />
#Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. arXiv preprint arXiv:1710.11029, 2017.<br />
#Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.<br />
#Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.<br />
#Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933, 2017.<br />
#Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.<br />
#Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.<br />
#Crispin W Gardiner. Handbook of Stochastic Methods, volume 4. Springer Berlin, 1985.<br />
#Pascal Germain, Francis Bach, Alexandre Lacoste, and Simon Lacoste-Julien. PAC-bayesian theory meets bayesian inference. In Advances in Neural Information Processing Systems, pp. 1884–1892, 2016.<br />
#Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.<br />
#Stephen F Gull. Bayesian inductive inference and maximum entropy. In Maximum-entropy and Bayesian methods in science and engineering, pp. 53–74. Springer, 1988.<br />
#Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pp. 5–13. ACM, 1993.<br />
#Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.<br />
#Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.<br />
#Stanisław Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.<br />
#Robert E Kass and Adrian E Raftery. Bayes factors. Journal of the american statistical association, 90(430):773–795, 1995.<br />
#Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017.<br />
#Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.<br />
#David Krueger, Nicolas Ballas, Stanislaw Jastrzebski, Devansh Arpit, Maxinder S Kanwal, Tegan Maharaj, Emmanuel Bengio, Asja Fischer, and Aaron Courville. Deep nets don’t learn via memorization. ICLR Workshop, 2017.<br />
#Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, pp. 2101–2110, 2017.<br />
#David JC MacKay. A practical bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992.<br />
#Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.<br />
#Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.<br />
#Samuel L. Smith, Pieter-Jan Kindermans, and Quoc V. Le. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.<br />
#Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.<br />
#Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Reinforcement_Learning_in_Continuous_Action_Spaces_a_Case_Study_in_the_Game_of_Simulated_Curling&diff=42153Deep Reinforcement Learning in Continuous Action Spaces a Case Study in the Game of Simulated Curling2018-11-30T23:20:24Z<p>Z43ma: </p>
<hr />
<div>This page provides a summary and critique of the paper '''Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling''' [http://proceedings.mlr.press/v80/lee18b/lee18b.pdf Online Source], published in ICML 2018. The source code for this paper is available [https://github.com/leekwoon/KR-DL-UCT here].<br />
<br />
= Introduction and Motivation =<br />
<br />
In recent years, reinforcement learning methods have been applied to many different games, such as chess and checkers. More recently, the use of CNNs has allowed neural networks to outperform humans in many difficult games, such as Go. However, many of these cases involve a discrete state or action space; the number of actions a player can take and/or the number of possible game states is finite. Deep CNNs are not directly applicable to large, non-convex continuous action spaces. To solve this issue, the authors conduct a policy search with an efficient stochastic continuous action search on top of policy samples generated from a deep CNN. The deep CNN still discretizes the state space and the action space. However, in the stochastic continuous action search, the restriction of the deterministic discretization is lifted and a local search procedure is conducted in a physical simulator with continuous action samples. In this way, the benefits of both deep neural networks and physical simulators can be realized.<br />
<br />
Interacting with the real world (e.g., a scenario that involves moving physical objects) typically involves working with a continuous action space. It is thus important to develop strategies for dealing with continuous action spaces. Deep neural networks that are designed to succeed in finite action spaces are not necessarily suitable for continuous action space problems. This is because deterministic discretization of a continuous action space causes strong biases in policy evaluation and improvement. <br />
<br />
This paper introduces a method to allow learning with continuous action spaces. A CNN is used to perform learning on discretized state and action spaces, and then a continuous action search is performed on these discrete results.<br />
<br />
Curling is chosen as the domain on which to test the network, due to its large action space, potential for complicated strategies, and need for precise interactions.<br />
<br />
== Curling ==<br />
<br />
Curling is a sport played by two teams on a long sheet of ice. Roughly, the goal is for each team to slide rocks closer to the target on the other end of the sheet than the other team. The next sections will provide a background on the game play, and potential challenges/concerns for learning algorithms. A terminology section follows.<br />
<br />
=== Game play ===<br />
<br />
A game of curling is divided into ends. In each end, players from both teams alternate throwing (sliding) eight rocks to the other end of the ice sheet, known as the house. Rocks must land in a certain area in order to stay in play, and must touch or be inside concentric rings (12ft diameter and smaller) in order to score points. At the end of each end, the team with rocks closest to the center of the house scores points.<br />
<br />
When throwing a rock, the curler can spin the rock. This allows the rock to 'curl' its path towards the house and can allow rocks to travel around other rocks. Team members are also able to sweep the ice in front of a moving rock in order to decrease friction, which allows for fine-tuning of distance (though the physics of sweeping are not implemented in the simulation used).<br />
<br />
Curling offers many possible high-level actions, which are directed by a team member to the throwing member. An example set of these includes:<br />
<br />
* Draw: Throw a rock to a target location<br />
* Freeze: Draw a rock up against another rock<br />
* Takeout: Knock another rock out of the house. Can be combined with different ricochet directions<br />
* Guard: Place a rock in front of another, to block other rocks (ex: takeouts)<br />
<br />
=== Challenges for AI ===<br />
<br />
Curling offers many challenges for AI based on its physics and rules. This section lists a few concerns.<br />
<br />
The effect of changing actions can be highly nonlinear and discontinuous. This can be seen when considering that a 1-cm deviation in a path can make the difference between a high-speed collision, or lack of collision.<br />
<br />
Curling will require both offensive and defensive strategies. For example, consider the fact that the last team to throw a rock each end only needs to place that rock closer than the opposing team's rocks to score a point and invalidate any opposing rocks in the house. The opposing team should thus be considering how to prevent this from happening, in addition to scoring points themselves.<br />
<br />
Curling also has a concept known as 'the hammer'. The hammer belongs to the team which throws the last rock each end, providing an advantage, and is given to the team that does not score points each end. It could very well be a good strategy to try not to win a single point in an end (if already ahead in points, etc), as this would give the advantage to the opposing team.<br />
<br />
Finally, curling has a rule known as the 'Free Guard Zone'. This applies to the first 4 rocks thrown (2 from each team). If they land short of the house, but still in play, then the rocks are not allowed to be removed (via collisions) until all of the first 4 rocks have been thrown.<br />
<br />
=== Terminology ===<br />
<br />
* End: A round of the game<br />
* House: The end of the sheet of ice, which contains the concentric scoring rings<br />
* Hammer: The team that throws the last rock of an end 'has the hammer'<br />
* Hog Line: Thick line drawn in front of the house, orthogonal to the length of the ice sheet. Rocks must pass this line to remain in play.<br />
* Back Line: Thick line drawn just behind the house. Rocks that pass this line are removed from play.<br />
<br />
<br />
== Related Work ==<br />
<br />
=== AlphaGo Lee ===<br />
<br />
AlphaGo Lee (Silver et al., 2016, [5]) refers to an algorithm used to play the game Go, which was able to defeat international champion Lee Sedol. <br />
<br />
<br />
Go game:<br />
* The game starts with an empty 19x19 board<br />
* One player takes the black stones and the other takes the white stones<br />
* The two players take turns placing stones on the board<br />
* Once a stone has been placed, it cannot be moved<br />
* Rules:<br />
*# If a connected group of stones is completely surrounded by the opponent's stones, it is removed from the board<br />
*# Ko rule: a play may not repeat a previous board position<br />
* The game ends when there are no valuable moves left<br />
* The territory of both players is then counted. The objective of the game is to capture more territory than the opponent. The player with the black stones plays first; in exchange, Black gives White 7.5 points of compensation (called komi). The exact number of compensation points varies between the rule sets used in different Asian countries.<br />
* Go used to be a huge challenge for artificial intelligence for two reasons. First, the search space is extremely large: it is estimated to be on the order of <math>10^{172}</math>, which is more than the number of atoms in the universe and much larger than the number of game states in chess (<math>10^{47}</math>). Second, there was no good heuristic function for evaluating a position in Go, so the traditional alpha-beta pruning algorithm performs poorly. In AlphaGo Lee, the CNN plays the role of a good heuristic function, which results in a huge performance improvement.<br />
[[File:go.JPG|700px|center]]<br />
<br />
Two neural networks were trained on the moves of human experts, to act as both a policy network and a value network. A Monte Carlo Tree Search algorithm was used for policy improvement.<br />
<br />
The AlphaGo Lee policy network predicts the best move given a board configuration. It has a CNN architecture with 13 hidden layers, and it is trained using expert game play data and improved through self-play.<br />
<br />
The value network evaluates the probability of winning given a board configuration. It consists of a CNN with 14 hidden layers, and it is trained using self-play data from the policy network. <br />
<br />
Finally, the two networks are combined using Monte-Carlo Tree Search, which performs a look-ahead search to select the actions for gameplay.<br />
<br />
The use of both policy and value networks is reflected in this paper's work.<br />
<br />
=== AlphaGo Zero ===<br />
<br />
AlphaGo Zero (Silver et al., 2017, [6]) is an improvement on the AlphaGo Lee algorithm. AlphaGo Zero uses a unified neural network in place of the separate policy and value networks and is trained on self-play, without the need for expert training data.<br />
Previous versions of AlphaGo initially trained on thousands of human amateur and professional games to learn how to play Go. AlphaGo Zero skips this step and learns to play simply by playing games against itself, starting from completely random play. In doing so, it quickly surpassed human level of play and defeated the previously published champion-defeating version of AlphaGo by 100 games to 0.<br />
It is able to do this by using a novel form of reinforcement learning, in which AlphaGo Zero becomes its own teacher. The system starts off with a neural network that knows nothing about the game of Go. It then plays games against itself, by combining this neural network with a powerful search algorithm. As it plays, the neural network is tuned and updated to predict moves, as well as the eventual winner of the games.<br />
<br />
This updated neural network is then recombined with the search algorithm to create a new, stronger version of AlphaGo Zero, and the process begins again. In each iteration, the performance of the system improves by a small amount, and the quality of the self-play games increases, leading to more and more accurate neural networks and ever stronger versions of AlphaGo Zero.<br />
<br />
This technique is more powerful than previous versions of AlphaGo because it is no longer constrained by the limits of human knowledge. Instead, it is able to learn tabula rasa from the strongest player in the world: AlphaGo itself.<br />
<br />
Other differences from the previous AlphaGo iterations are as follows. AlphaGo Zero only uses the black and white stones from the Go board as its input, whereas previous versions of AlphaGo included a small number of hand-engineered features. It uses one neural network rather than two. Earlier versions of AlphaGo used a “policy network” to select the next move to play and a “value network” to predict the winner of the game from each position. These are combined in AlphaGo Zero, allowing it to be trained and evaluated more efficiently. AlphaGo Zero does not use “rollouts” - fast, random games used by other Go programs to predict which player will win from the current board position. Instead, it relies on its high quality neural networks to evaluate positions. All of these differences help improve the performance of the system and make it more general. But it is the algorithmic change that makes the system much more powerful and efficient.<br />
<br />
The unification of networks and self-play are also reflected in this paper.<br />
<br />
=== Curling Algorithms ===<br />
<br />
Some past algorithms have been proposed to deal with continuous action spaces. For example, (Yammamoto et al, 2015, [7]) use game tree search methods in a discretized space. The value of an action is taken as the average of nearby values, with respect to some knowledge of execution uncertainty.<br />
<br />
=== Monte Carlo Tree Search ===<br />
<br />
Monte Carlo Tree Search algorithms have been applied to continuous action spaces. These algorithms, to be discussed in further detail, balance exploration of different states with knowledge of paths of execution through past games. One such MCTS variant, KR-UCT, has been applied to continuous action spaces: it finds effective selections by using kernel regression (KR) and kernel density estimation (KDE) to estimate rewards from neighborhood information. <br />
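<br />
The idea of estimating a reward from neighboring actions can be sketched with plain kernel regression. The code below is a generic Nadaraya-Watson estimator, not the KR-UCT implementation; the Gaussian kernel, the bandwidth and the function names are assumptions made for illustration.<br />

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    """Unnormalised Gaussian kernel between two action vectors."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.exp(-0.5 * np.dot(diff, diff) / bandwidth**2))

def kernel_estimates(query, actions, rewards, bandwidth=1.0):
    """Return (kernel-regression reward estimate, kernel density) at `query`."""
    weights = np.array([gaussian_kernel(query, a, bandwidth) for a in actions])
    density = weights.sum()                      # how much nearby data supports the estimate
    value = float(weights @ np.asarray(rewards, dtype=float) / density)
    return value, float(density)

# Three previously tried actions with observed rewards (made-up numbers).
actions = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
rewards = [1.0, 0.0, 10.0]
value, density = kernel_estimates([0.5, 0.0], actions, rewards)
print(value, density)
```

The query lies midway between the first two actions, so its estimate is close to the average of their rewards, while the distant third action contributes almost nothing. The density indicates how well explored the neighborhood of the query is.<br />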
<br />
For bandit problems, hierarchical optimistic optimization (HOO) has been used to build a cover tree that divides the action space into progressively smaller ranges at increasing depths, so that the most promising nodes yield fine-grained estimates.<br />
<br />
=== Curling Physics and Simulation ===<br />
<br />
Several references in the paper refer to the study and simulation of curling physics. Scholars have analyzed friction coefficients between curling stones and ice. Because modelling the changes in friction on ice is not possible, a fixed friction coefficient was predefined in the simulation. The behavior of the stones was also modeled. Important parameters are trained from professional players. The authors used the same parameters in this paper.<br />
<br />
== General Background of Algorithms ==<br />
<br />
=== Policy and Value Functions ===<br />
<br />
A policy function is trained to provide the best action to take, given a current state. Policy iteration is an algorithm used to improve a policy over time. This is done by alternating between policy evaluation and policy improvement.<br />
<br />
POLICY IMPROVEMENT: LEARNING ACTION POLICY<br />
<br />
The action policy <math> p_{\rho}(a|s) </math> outputs a probability distribution over all eligible moves <math> a </math>. Here <math> \rho </math> denotes the weights of a neural network that approximates the policy, <math>s</math> denotes a state of the environment and <math>a</math> denotes an action taken in that state. The policy is a function that returns an action given the state the agent is in. Policy gradient reinforcement learning can be used to train the action policy. It is updated by stochastic gradient ascent in the direction that maximizes the expected outcome at each time step t,<br />
\[ \Delta \rho \propto \frac{\partial \log p_{\rho}(a_t|s_t)}{\partial \rho} r(s_t) \]<br />
where <math> r(s_t) </math> is the return.<br />
<br />
POLICY EVALUATION: LEARNING VALUE FUNCTIONS<br />
<br />
A value function <math> v_{\theta}(s) </math> with parameters <math> \theta </math> is trained to estimate the value of being in a certain state. It is trained on records of state-outcome pairs <math> (s, r(s)) </math> by using stochastic gradient descent to minimize the mean squared error (MSE) between the predicted regression value and the corresponding outcome,<br />
\[ \Delta \theta \propto \frac{\partial v_{\theta}(s)}{\partial \theta}(r(s)-v_{\theta}(s)) \]<br />
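<br />
As a rough illustration of the update rule above, the sketch below applies stochastic gradient steps to a linear value function. The linear model, the feature list, and the learning rate <code>lr</code> are assumptions made for this example only; the paper uses a neural network for <math> v_{\theta} </math>.<br />
<br />
```python
def value_sgd_step(theta, features, outcome, lr=0.1):
    """One SGD step on the squared error between v_theta(s) and r(s).

    theta, features: equal-length lists of floats (hypothetical linear model,
    v_theta(s) = sum_i theta_i * features_i).
    outcome: the observed return r(s).
    """
    prediction = sum(t * f for t, f in zip(theta, features))
    error = outcome - prediction  # r(s) - v_theta(s)
    # For a linear model, dv/dtheta_i is simply the i-th feature.
    return [t + lr * error * f for t, f in zip(theta, features)]
```
<br />
Repeating the step on the same sample shrinks the error geometrically, which mirrors how the MSE objective pulls the prediction toward the observed outcome.<br />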
<br />
=== Monte Carlo Tree Search ===<br />
<br />
Monte Carlo Tree Search (MCTS) is a search algorithm used for finite-horizon tasks (ex: in curling, only 16 moves, i.e. stone throws, are made each end).<br />
<br />
MCTS is a tree search algorithm similar to minimax. However, MCTS is probabilistic and does not need to explore a full game tree or even a tree reduced with alpha-beta pruning. This makes it tractable for games such as Go and curling.<br />
<br />
Nodes of the tree are game states, and branches represent actions. Each node stores statistics on how many times it has been visited by the MCTS, as well as the number of wins encountered by playouts from that position. A node is considered 'visited' if a full playout has started from that node. A node is considered 'expanded' if all its children have been visited.<br />
<br />
MCTS begins with the '''selection''' phase, which involves traversing known states/actions. Beginning at the root node, the algorithm selects the child with the highest 'score'. From each successive node, a path down to a leaf node is explored in a similar fashion.<br />
<br />
The next phase, '''expansion''', begins when the algorithm reaches a node where not all children have been visited (ie: the node has not been fully expanded). In the expansion phase, children of the node are visited, and '''simulations''' run from their states.<br />
<br />
Once the new child is expanded, '''simulation''' takes place. This refers to a full playout of the game from the point of the current node, and can involve many strategies, such as randomly taken moves, the use of heuristics, etc.<br />
<br />
The final phase is '''update''' or '''back-propagation''' (unrelated to the neural network algorithm). In this phase, the result of the '''simulation''' (ie: win/lose) is updated in the statistics of all parent nodes.<br />
<br />
A selection function known as Upper Confidence Bound applied to Trees (UCT) can be used to choose which node to select. The formula is shown below [[https://www.baeldung.com/java-monte-carlo-tree-search source]]. Note that the first term essentially acts as an average score of games played from a certain node. The second term, meanwhile, will grow when sibling nodes are expanded. This means that unexplored nodes will gradually increase their UCT score, and be selected in the future. This formula serves the purpose of balancing exploitation (first term) and exploration (second term) in Monte Carlo Tree Search. The philosophy is that nodes with high rewards and nodes that are poorly explored should both be explored more often.<br />
<br />
Note that, in theory, the Upper Confidence Bound (UCB) formula can achieve the optimal solution of the multi-armed bandit problem.<br />
<br />
<math> \frac{w_i}{n_i} + c \sqrt{\frac{\ln t}{n_i}} </math><br />
<br />
In which<br />
<br />
* <math> w_i = </math> number of wins after the <math> i</math>th move<br />
* <math> n_i = </math> number of simulations after the <math> i</math>th move<br />
* <math> c = </math> exploration parameter (theoretically equal to <math> \sqrt{2}</math>)<br />
* <math> t = </math> total number of simulations for the parent node<br />
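<br />
The UCT formula and the variables listed above can be sketched in a few lines of Python. The tuple representation of children and the convention of giving unvisited children an infinite score are assumptions for illustration, not details taken from the paper.<br />
<br />
```python
import math


def uct_score(wins, visits, parent_visits, c=math.sqrt(2)):
    """UCT value of a child: average score plus an exploration bonus."""
    if visits == 0:
        # Common convention (assumed here): try unvisited children first.
        return math.inf
    exploitation = wins / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration


def select_child(children, parent_visits):
    """Return the index of the child with the highest UCT score.

    children: list of (wins, visits) tuples (hypothetical representation).
    """
    scores = [uct_score(w, n, parent_visits) for w, n in children]
    return scores.index(max(scores))
```
<br />
With children <code>[(6, 10), (3, 4)]</code> and 14 parent visits, the second child is selected: it has both a higher average score and a larger exploration bonus because it has been visited less often.<br />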
<br />
<br />
Sources: 2,3,4<br />
<br />
[[File:MCTS_Diagram.jpg | 500px|center]]<br />
<br />
=== Kernel Regression ===<br />
<br />
Kernel regression is a form of weighted averaging which uses a kernel function as a weight to estimate the conditional expectation of a random variable. Given data points '''x''', each of which has an associated value '''y''', and a choice of kernel '''K''', the kernel function outputs a weighting factor for each pair of points. An estimate of the value of a new, unseen point is then calculated as the weighted average of the values of surrounding points.<br />
<br />
A typical kernel is a Gaussian kernel, shown below. The formula for calculating estimated value is shown below as well (sources: Lee et al.).<br />
<br />
[[File:gaussian_kernel.png | 400 px]]<br />
<br />
[[File:kernel_regression.png | 250 px]]<br />
<br />
The denominator of the conditional expectation is related to kernel density estimation, which is defined as <math display="inline">W(x)=\sum_{i=0}^n K(x,x_i)</math>.<br />
<br />
In this case, the combination of the two acts to weigh scores of samples closest to '''x''' more strongly.<br />
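<br />
A minimal sketch of kernel regression and kernel density estimation is given below. It assumes scalar inputs and an unnormalized Gaussian kernel with a hypothetical <code>bandwidth</code> parameter; the exact kernel used in the paper is the one shown in the figure above and may differ in its normalization.<br />
<br />
```python
import math


def gaussian_kernel(x, xi, bandwidth=1.0):
    """Unnormalized Gaussian kernel between two scalar points."""
    return math.exp(-((x - xi) ** 2) / (2.0 * bandwidth ** 2))


def kernel_density(x, xs, bandwidth=1.0):
    """W(x): the sum of kernel weights contributed by all samples."""
    return sum(gaussian_kernel(x, xi, bandwidth) for xi in xs)


def kernel_regression(x, xs, ys, bandwidth=1.0):
    """Estimate the value at x as the kernel-weighted average of ys."""
    weights = [gaussian_kernel(x, xi, bandwidth) for xi in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)
```
<br />
For samples at -1 and 1 with values 0 and 2, the estimate at 0 is the plain average of the two values, and it moves toward 2 as the query point approaches 1.<br />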
<br />
= Methods =<br />
<br />
== Variable Definitions ==<br />
<br />
The following variables are used often in the paper:<br />
<br />
* <math>s</math>: A state in the game, as described below as the input to the network.<br />
* <math>s_t</math>: The state at a certain time-step of the game. Time-steps refer to full turns in the game<br />
* <math>a_t</math>: The action taken in state <math>s_t</math><br />
* <math>A_t</math>: The actions taken for sibling nodes related to <math>a_t</math> in MCTS<br />
* <math>n_{a_t}</math>: The number of visits to node <math>a_t</math> in MCTS<br />
* <math>v_{a_t}</math>: The MCTS value estimate of a node<br />
<br />
== Network Design ==<br />
<br />
The authors design a CNN called the 'policy-value' network. The network consists of a common network structure, which is then split into 'policy' and 'value' outputs. This network is trained to learn a probability distribution of actions to take, and expected rewards, given an input state.<br />
<br />
=== Shared Structure ===<br />
<br />
The network consists of 1 convolutional layer followed by 9 residual blocks, each block consisting of 2 convolutional layers with 32 3x3 filters. The structure of this network is shown below:<br />
<br />
<br />
[[File:curling_network_layers.png|600px|thumb|center|Figure 2. A detailed description of our policy-value network. The shared network is composed of one convolutional layer and nine residual blocks. Each residual block (explained in b) has two convolutional layers with batch normalization (Ioffe & Szegedy, 2015[11]) followed by the addition of the input and the residual block. Each layer in the shared network uses 3x3 filters. The policy head has two more convolutional layers, while the value head has two fully connected layers on top of a convolutional layer. For the activation function of each convolutional layer, ReLU (Nair & Hinton[12]) is used.]]<br />
<br />
<br />
<br />
The input to this network is the following:<br />
* Location of stones<br />
* Order to tee (the center of the sheet)<br />
* A 32x32 grid representation of the ice sheet, indicating which stones are present in each grid cell.<br />
<br />
The authors do not describe how the stone-based information is added to the 32x32 grid as input to the network.<br />
<br />
=== Policy Network ===<br />
<br />
The policy head is created by adding 2 convolutional layers with 2 (two) 3x3 filters to the main body of the network. The output of the policy head is a distribution of probabilities of the actions to select the best shot out of a 32x32x2 set of actions. The actions represent target locations in the grid and spin direction of the stone.<br />
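<br />
To make the 32x32x2 action set concrete, the sketch below decodes a flat action index into a grid cell and a spin direction. The index layout (spin varying fastest, then column, then row) and the spin encoding are hypothetical; the paper summary does not specify how the outputs are ordered.<br />
<br />
```python
GRID = 32
SPINS = (-1, +1)  # assumed encoding of the two spin directions


def decode_action(index):
    """Map a flat index in [0, 32*32*2) to (row, col, spin)."""
    if not 0 <= index < GRID * GRID * len(SPINS):
        raise ValueError("action index out of range")
    cell, spin_id = divmod(index, len(SPINS))
    row, col = divmod(cell, GRID)
    return row, col, SPINS[spin_id]


def best_action(probabilities):
    """Decode the action that has the highest predicted probability."""
    index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return decode_action(index)
```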
<br />
[[File:policy-value-net.PNG | 700px]]<br />
<br />
=== Value Network ===<br />
<br />
The value head is created by adding a convolutional layer with one 3x3 filter, and dense layers of 256 and 17 units, to the shared network. The 17 output units represent a probability distribution over scores in the range of [-8,8], which are the possible scores at each end of a curling game.<br />
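<br />
As a small illustration of how the 17 outputs can be read, the sketch below turns a probability vector over the scores -8 to 8 into an expected end score. The assumption that output unit 0 corresponds to a score of -8 is made for this example only.<br />
<br />
```python
SCORES = list(range(-8, 9))  # the 17 possible scores of a single end


def expected_score(probabilities):
    """Expected end score under a distribution over the 17 outcomes."""
    if len(probabilities) != len(SCORES):
        raise ValueError("expected 17 probabilities")
    return sum(p * s for p, s in zip(probabilities, SCORES))
```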
<br />
== Continuous Action Search ==<br />
<br />
The policy head of the network only outputs actions from a discretized action space. For real-life interactions, and especially in curling, this will not suffice, as very fine adjustments to actions can make significant differences in outcomes.<br />
<br />
Actions in the continuous space are generated using an MCTS algorithm, with the following steps:<br />
<br />
=== Selection ===<br />
<br />
From a given state, the list of already-visited actions is denoted as A<sub>t</sub>. Scores and the number of visits to each node are estimated using the equations below (the first equation shows the expectation of the end value for one-end games). These are likely estimated rather than simply taken from the MCTS statistics to help account for the differences in a continuous action space.<br />
<br />
[[File:curling_kernel_equations.png | 400px]]<br />
<br />
The UCB formula is then used to select an action to expand.<br />
<br />
The actions that are taken in the simulator appear to be drawn from a Gaussian centered around <math>a_t</math>. This allows exploration in the continuous action space.<br />
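<br />
The Gaussian perturbation described above could look like the following sketch. The two-dimensional action and the standard deviation <code>sigma</code> are assumptions for illustration; the summary only states that actions appear to be drawn from a Gaussian centered on <math>a_t</math>.<br />
<br />
```python
import random


def sample_near(action, sigma=0.1, rng=random):
    """Draw a continuous action from a Gaussian centered on `action`.

    action: (x, y) target position (hypothetical 2-D parameterization).
    rng: any object with a gauss(mu, sigma) method, e.g. random.Random(seed).
    """
    x, y = action
    return (rng.gauss(x, sigma), rng.gauss(y, sigma))
```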
<br />
=== Expansion ===<br />
<br />
The authors use a variant of regular UCT for expansion. In this case, they expand a new node only when existing nodes have been visited a certain number of times. The authors utilize a widening approach to overcome problems with standard UCT performing a shallow search when there is a large action space.<br />
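<br />
One common way to express a widening rule of this kind is progressive widening, sketched below. The threshold form and the constants <code>c</code> and <code>alpha</code> are assumptions; this is not necessarily the exact criterion used by the authors.<br />
<br />
```python
def should_expand(num_children, total_visits, c=1.0, alpha=0.5):
    """Progressive-widening test (assumed form).

    A new child may be added only while the number of existing children
    is below c * total_visits ** alpha, so the tree widens slowly as the
    parent accumulates visits.
    """
    return num_children < c * total_visits ** alpha
```
<br />
With the default constants, a node with 16 visits may hold up to four children, so a fifth action is added only after further visits.<br />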
<br />
=== Simulation ===<br />
<br />
Instead of simulating with a random game playout, the authors use the value network to estimate the likely score associated with a state. This speeds up simulation (assuming the network is well trained), as the game does not actually need to be simulated.<br />
<br />
=== Backpropagation ===<br />
<br />
Standard backpropagation is used, updating both the values and number of visits stored in the path of parent nodes.<br />
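<br />
The update along the path of parent nodes can be sketched as follows. The <code>Node</code> class is a hypothetical minimal structure, and the sketch ignores the sign flip that is often applied between alternating players.<br />
<br />
```python
class Node:
    """Minimal MCTS node holding only the statistics needed for backup."""

    def __init__(self, parent=None):
        self.parent = parent
        self.visits = 0
        self.total_value = 0.0


def backpropagate(node, value):
    """Add `value` and one visit to `node` and every ancestor up to the root."""
    while node is not None:
        node.visits += 1
        node.total_value += value
        node = node.parent
```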
<br />
<br />
== Supervised Learning ==<br />
<br />
During supervised training, data is gathered from the program AyumuGAT'16 ([8]), a high-performance AI curling program that is also based on an MCTS algorithm. 400,000 state-action pairs were generated during this training.<br />
<br />
=== Policy Network ===<br />
<br />
The policy network was trained to learn the action taken in each state. Here, the likelihood of the taken action was set to be 1, and the likelihood of other actions to be 0.<br />
<br />
=== Value Network ===<br />
<br />
The value network was trained by 'd-depth simulations and bootstrapping of the prediction to handle the high variance in rewards resulting from a sequence of stochastic moves' (quote taken from paper). In this case, ''m'' state-action pairs were sampled from the training data. For each pair, <math>(s_t, a_t)</math>, a state ''d'' steps ahead, <math>s_{t+d}</math>, was generated. This process dealt with uncertainty by treating all actions in this rollout as having no uncertainty, except for the last action, ''a<sub>t+d-1</sub>'', where uncertainty was allowed. The value network is used to predict the value for this state, <math>z_t</math>, and the value is used for learning the value at ''s<sub>t</sub>''.<br />
<br />
=== Policy-Value Network ===<br />
<br />
The policy-value network was trained to maximize the similarity of the predicted policy and value, and the actual policy and value from a state. The learning algorithm parameters are:<br />
<br />
* Algorithm: stochastic gradient descent<br />
* Batch size: 256<br />
* Momentum: 0.9<br />
* L2 regularization: 0.0001<br />
* Training time: ~100 epochs<br />
* Learning rate: initialized at 0.01, reduced twice<br />
<br />
A multi-task loss function was used. This takes the summation of the cross-entropy losses of each prediction:<br />
<br />
[[File:curling_loss_function.png | 300px]]<br />
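<br />
A minimal sketch of a loss built as the sum of two cross-entropy terms is shown below. The function and argument names are hypothetical, and the small constant <code>eps</code> is added only to avoid taking the logarithm of zero; the paper's exact loss is the one given in the figure above.<br />
<br />
```python
import math


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def policy_value_loss(pi_target, pi_pred, z_target, z_pred):
    """Multi-task loss: policy cross-entropy plus value cross-entropy."""
    return cross_entropy(pi_target, pi_pred) + cross_entropy(z_target, z_pred)
```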
<br />
== Self-Play Reinforcement Learning ==<br />
<br />
After initialization by supervised learning, the algorithm uses self-play to further train itself. During this training, the policy network learns probabilities from the MCTS process, while the value network learns from game outcomes.<br />
<br />
At a game state ''s<sub>t</sub>'':<br />
<br />
1) the algorithm outputs a prediction ''z<sub>t</sub>''. This is an estimate of game score probabilities. It is based on similar past actions, and computed using kernel regression.<br />
<br />
2) the algorithm outputs a prediction <math>\pi_t</math>, representing a probability distribution of actions. These are proportional to estimated visit counts from MCTS, based on kernel density estimation.<br />
<br />
It is not clear how these predictions are created. It would seem likely that the policy-value network generates these, but the wording of the paper suggests they are generated from MCTS statistics.<br />
<br />
The policy-value network is updated by sampling data <math>(s, \pi, z)</math> from recent history of self-play. The same loss function is used as before.<br />
<br />
It is not clear how the improved network is used, as MCTS seems to be the driving process at this point.<br />
<br />
== Long-Term Strategy Learning ==<br />
<br />
Finally, the authors implement a new strategy to augment their algorithm for long-term play. In this context, this refers to playing a game over many ends, where the strategy to win a single end may not be a good strategy to win a full game. For example, scoring one point in an end, while being one point ahead, gives the advantage to the other team in the next round (as they will throw the last stone). The other team could then use the advantage to score two points, taking the lead.<br />
<br />
The authors build a 'winning percentage' table. This table stores the percentage of games won, based on the number of ends left, and the difference in score (current team - opposing team). This can be computed iteratively and using the probability distribution estimation of one-end scores.<br />
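<br />
The iterative computation of such a table can be sketched with dynamic programming. The sketch makes several simplifying assumptions that are not from the paper: the one-end score distribution is the same in every end (so the hammer is ignored), a tied game counts as half a win, and score differences are clamped to a hypothetical <code>max_diff</code>.<br />
<br />
```python
def win_table(score_probs, max_ends, max_diff=20):
    """Build table[n][d] = P(win) with n ends left and score difference d.

    score_probs: dict mapping a one-end score (current team's viewpoint)
    to its probability; assumed identical for every end.
    """
    def clamp(d):
        return max(-max_diff, min(max_diff, d))

    diffs = range(-max_diff, max_diff + 1)
    # No ends left: a lead is a win, a tie counts as half a win (assumed).
    table = [{d: 1.0 if d > 0 else 0.5 if d == 0 else 0.0 for d in diffs}]
    for _ in range(max_ends):
        prev = table[-1]
        # One more end to play: average over the possible end scores.
        table.append({
            d: sum(p * prev[clamp(d + s)] for s, p in score_probs.items())
            for d in diffs
        })
    return table
```
<br />
For example, if each end is equally likely to yield +1 or -1, a team that is one point ahead with one end left wins with probability 0.75 under these assumptions.<br />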
<br />
== Final Algorithms ==<br />
<br />
The authors make use of the following versions of their algorithm:<br />
<br />
=== KR-DL ===<br />
<br />
''Kernel regression-deep learning'': This algorithm is trained only by supervised learning.<br />
<br />
=== KR-DRL ===<br />
<br />
''Kernel regression-deep reinforcement learning'': This algorithm is trained by supervised learning (ie: initialized as the KR-DL algorithm), and again on self-play. During self-play, each shot is selected after 400 MCTS simulations of k=20 randomly selected actions. Data for self-play was collected over a week on 5 GPUs and generated 5 million game positions. The policy-value network was continually updated using samples from the latest 1 million game positions.<br />
<br />
=== KR-DRL-MES ===<br />
<br />
''Kernel regression-deep reinforcement learning-multi-ends-strategy'': This algorithm makes use of the winning percentage table generated from self-play.<br />
<br />
= Testing and Results =<br />
The authors use data from the public program AyumuGAT’16 to test. Testing is done with a simulated curling program [9]. This simulator does not deal with changing ice conditions, or sweeping, but does deal with stone trajectories and collisions.<br />
<br />
== Comparison of KR-DL-UCT and DL-UCT ==<br />
<br />
The first test compares an algorithm trained with kernel regression with an algorithm trained without kernel regression, to show the contribution that kernel regression adds to the performance. Both algorithms have networks initialised with the supervised learning, and then trained with two different algorithms for self-play. KR-DL-UCT uses the algorithm described above. The authors do not go into detail on how DL-UCT selects shots, but state that a constant is set to allow exploration.<br />
<br />
As an evaluation, both algorithms play 2000 games against the DL-UCT algorithm, which is frozen after supervised training. 1000 games are played with the algorithm taking the first shot, and 1000 with it taking the second. The games were two-end games. The figure below shows each algorithm's winning percentage given different amounts of training data. While the DL-UCT outperforms the supervised-training-only-DL-UCT algorithm, the KR-DL-UCT algorithm performs much better.<br />
<br />
<center>[[File:curling_KR_test.png | 400px]]</center><br />
<br />
== Matches ==<br />
<br />
Finally, to test the performance of their multiple algorithms, the authors run matches between their algorithms and other existing programs. Each algorithm plays 200 matches against each other program, 100 of which are played as the first-playing team, and 100 as the second-playing team. Only 1 program was able to out-perform the KR-DRL algorithm. The authors state that this program, ''JiritsukunGAT'17'', also uses a deep network and hand-crafted features. However, the KR-DRL-MES algorithm was still able to out-perform it. Figure 4 shows the Elo ratings of the different programs. Note that the programs in blue are those created by the authors. They also played some games between their KR-DRL-MES and notable programs. Table 1 shows the details of the match results. ''JiritsukunGAT'17'' shows a similar level of performance but KR-DRL-MES is still the winner.<br />
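<br />
For readers unfamiliar with Elo ratings, the sketch below gives the standard Elo expected-score formula, which relates a rating difference to an expected winning fraction. It is included as general background; the summary does not state exactly how the authors computed their ratings.<br />
<br />
```python
def elo_expected_score(rating_a, rating_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
```
<br />
Under this model, two equally rated programs each have an expected score of 0.5, and a 400-point advantage corresponds to an expected score of about 0.91.<br />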
<br />
<br />
<br />
[[File:curling_ratings.png|600px|thumb|center|Figure 4. Elo rating and winning percentages of our models and GAT rankers. Each match has 200 games (each program plays 100 pre-ordered games), because the player which has the last shot (the hammer shot) in each end would have an advantage.]]<br />
<br />
<br />
[[File:ttt.png|600px|thumb|center|Table 1. The 8-end game results for KR-DRL-MES against other programs alternating the opening player each game. The matches are held by following the rules of the latest GAT competition.]]<br />
<br />
= Conclusion & Critique =<br />
<br />
The authors have presented a new framework which incorporates a deep neural network for learning game strategy with a kernel-based Monte Carlo tree search from a continuous space. Without the use of any hand-crafted feature, their policy-value network is successfully trained using supervised learning followed by reinforcement learning with a high-fidelity simulator for the Olympic sport of curling. Following are my critiques on the paper:<br />
<br />
== Strengths ==<br />
<br />
This algorithm out-performs other high-performance algorithms (including past competition champions).<br />
<br />
I think the paper does a decent job of comparing the performance of their algorithm to others. They are able to clearly show the benefits of many of their additions.<br />
<br />
The authors do seem to be able to adopt strategies similar to those used in Go and other games to the continuous action-space domain. In addition, the final strategy needs no hand-crafted features for learning.<br />
<br />
== Weaknesses ==<br />
<br />
Sometimes, I found this paper difficult to follow. One problem was that the algorithms were introduced first, and then how they were used was described. So when the paper stated that self-play shots were taken after 400 simulations, it seemed unclear what simulations were being run and at what stage of the algorithm (ex: MCTS simulations, simulations sped up by using the value network, full simulations on the curling simulator). In particular, both the MCTS statistics and the policy-value network could be used to estimate both action probabilities and state values, so it is difficult to tell which is used in which case. There was also no clear distinction between discrete-space actions and continuous-space actions.<br />
<br />
While I think the comparison of different algorithms was done well, I believe it still lacked significant details. There were one-off statements in the paper that would have been nice to see supported by results. These include the statement that having a policy-value network in place of two networks led to better performance.<br />
<br />
At this point, the algorithms used still rely on initialization by a pre-made program.<br />
<br />
There was little theoretical development or justification done in this paper.<br />
<br />
While curling is an interesting choice for demonstrating the algorithm, the fact that the simulations used did not support many of the key points of curling (ice conditions, sweeping) seems very limiting. Another game, such as pool, would likely have offered some of the same challenges while allowing higher-fidelity simulations/training.<br />
<br />
While the spatial placements of stones were discretized in a grid, the curl of thrown stones was discretized to only +/-1. This seems like it may limit learning high- and low-spin moves. It should be noted that having zero spins is not commonly used, to the best of my knowledge.<br />
<br />
Also, the necessity to discretize state and action in the CNN is disputable. With careful design, it may be possible to incorporate continuous inputs.<br />
<br />
=References=<br />
# Lee, K., Kim, S., Choi, J. & Lee, S. "Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling." Proceedings of the 35th International Conference on Machine Learning, in PMLR 80:2937-2946 (2018)<br />
# https://www.baeldung.com/java-monte-carlo-tree-search<br />
# https://jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/<br />
# https://int8.io/monte-carlo-tree-search-beginners-guide/<br />
# https://en.wikipedia.org/wiki/Monte_Carlo_tree_search<br />
# Silver, D., Huang, A., Maddison, C., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of go with deep neural networks and tree search. Nature, pp. 484–489, 2016.<br />
# Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., and Hassabis, D. Mastering the game of go without human knowledge. Nature, pp. 354–359, 2017.<br />
# Yamamoto, M., Kato, S., and Iizuka, H. Digital curling strategy based on game tree search. In Proceedings of the IEEE Conference on Computational Intelligence and Games, CIG, pp. 474–480, 2015.<br />
# Ohto, K. and Tanaka, T. A curling agent based on the Monte Carlo tree search considering the similarity of the best action among similar states. In Proceedings of Advances in Computer Games, ACG, pp. 151–164, 2017.<br />
# Ito, T. and Kitasei, Y. Proposal and implementation of digital curling. In Proceedings of the IEEE Conference on Computational Intelligence and Games, CIG, pp. 469–473, 2015.<br />
# Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, ICML, pp. 448–456, 2015.<br />
# Nair, V. and Hinton, G. Rectified linear units improve restricted boltzmann machines.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Reinforcement_Learning_in_Continuous_Action_Spaces_a_Case_Study_in_the_Game_of_Simulated_Curling&diff=42152Deep Reinforcement Learning in Continuous Action Spaces a Case Study in the Game of Simulated Curling2018-11-30T23:17:05Z<p>Z43ma: Format update.</p>
<hr />
<div>This page provides a summary and critique of the paper '''Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling''' [[http://proceedings.mlr.press/v80/lee18b/lee18b.pdf Online Source]], published in ICML 2018. The source code for this paper is available [https://github.com/leekwoon/KR-DL-UCT here]<br />
<br />
= Introduction and Motivation =<br />
<br />
In recent years, Reinforcement Learning methods have been applied to many different games, such as chess and checkers. More recently, the use of CNNs has allowed neural networks to out-perform humans in many difficult games, such as Go. However, many of these cases involve a discrete state or action space; the number of actions a player can take and/or the number of possible game states are finite. Deep CNNs are not directly applicable to large, non-convex continuous action spaces. To solve this issue, the authors conduct a policy search with an efficient stochastic continuous action search on top of policy samples generated from a deep CNN. The deep CNN still discretizes the state space and the action space. However, in the stochastic continuous action search, the restriction of the deterministic discretization is lifted and a local search procedure is conducted in a physical simulator with continuous action samples. In this way, the benefits of both deep neural networks and physical simulators can be realized.<br />
<br />
Interacting with the real world (e.g., a scenario that involves moving physical objects) typically involves working with a continuous action space. It is thus important to develop strategies for dealing with continuous action spaces. Deep neural networks that are designed to succeed in finite action spaces are not necessarily suitable for continuous action space problems. This is due to the fact that deterministic discretization of a continuous action space causes strong biases in policy evaluation and improvement. <br />
<br />
This paper introduces a method to allow learning with continuous action spaces. A CNN is used to perform learning on discretized state and action spaces, and then a continuous action search is performed on these discrete results.<br />
<br />
Curling is chosen as a domain to test the network on. Curling was chosen due to its large action space, potential for complicated strategies, and need for precise interactions.<br />
<br />
== Curling ==<br />
<br />
Curling is a sport played by two teams on a long sheet of ice. Roughly, the goal is for each team to slide rocks closer to the target on the other end of the sheet than the other team. The next sections will provide a background on the game play, and potential challenges/concerns for learning algorithms. A terminology section follows.<br />
<br />
=== Game play ===<br />
<br />
A game of curling is divided into ends. In each end, players from both teams alternate throwing (sliding) eight rocks to the other end of the ice sheet, known as the house. Rocks must land in a certain area in order to stay in play, and must touch or be inside concentric rings (12ft diameter and smaller) in order to score points. At the end of each end, the team with rocks closest to the center of the house scores points.<br />
<br />
When throwing a rock, the curler can spin the rock. This allows the rock to 'curl' its path towards the house and can allow rocks to travel around other rocks. Team members are also able to sweep the ice in front of a moving rock in order to decrease friction, which allows for fine-tuning of distance (though the physics of sweeping are not implemented in the simulation used).<br />
<br />
Curling offers many possible high-level actions, which are directed by a team member to the throwing member. An example set of these includes:<br />
<br />
* Draw: Throw a rock to a target location<br />
* Freeze: Draw a rock up against another rock<br />
* Takeout: Knock another rock out of the house. Can be combined with different ricochet directions<br />
* Guard: Place a rock in front of another, to block other rocks (ex: takeouts)<br />
<br />
=== Challenges for AI ===<br />
<br />
Curling offers many challenges for AI based on its physics and rules. This section lists a few concerns.<br />
<br />
The effect of changing actions can be highly nonlinear and discontinuous. This can be seen when considering that a 1-cm deviation in a path can make the difference between a high-speed collision, or lack of collision.<br />
<br />
Curling will require both offensive and defensive strategies. For example, consider the fact that the last team to throw a rock each end only needs to place that rock closer than the opposing team's rocks to score a point and invalidate any opposing rocks in the house. The opposing team should thus be considering how to prevent this from happening, in addition to scoring points themselves.<br />
<br />
Curling also has a concept known as 'the hammer'. The hammer belongs to the team which throws the last rock each end, providing an advantage, and is given to the team that does not score points each end. It could very well be a good strategy to try not to win a single point in an end (if already ahead in points, etc), as this would give the advantage to the opposing team.<br />
<br />
Finally, curling has a rule known as the 'Free Guard Zone'. This applies to the first 4 rocks thrown (2 from each team). If they land short of the house, but still in play, then the rocks are not allowed to be removed (via collisions) until all of the first 4 rocks have been thrown.<br />
<br />
=== Terminology ===<br />
<br />
* End: A round of the game<br />
* House: The end of the sheet of ice, which contains the concentric scoring rings<br />
* Hammer: The team that throws the last rock of an end 'has the hammer'<br />
* Hog Line: thick line that is drawn in front of the house, orthogonal to the length of the ice sheet. Rocks must pass this line to remain in play.<br />
* Back Line: thick line drawn just behind the house. Rocks that pass this line are removed from play.<br />
<br />
<br />
== Related Work ==<br />
<br />
=== AlphaGo Lee ===<br />
<br />
AlphaGo Lee (Silver et al., 2016, [5]) refers to an algorithm used to play the game Go, which was able to defeat international champion Lee Sedol. <br />
<br />
<br />
Go game:<br />
* Start with 19x19 empty board<br />
* One player takes black stones and the other takes white stones<br />
* Two players take turns to put stones on the board<br />
* Once the stone has been placed, the stones cannot be moved anymore<br />
* Rules:<br />
** If one connected part is completely surrounded by the opponent's stones, remove it from the board<br />
** Ko rule: Forbids a board play to repeat a board position<br />
* End when there are no valuable moves. <br />
* Count the territory of both players. The objective of the game is to capture more territory than your opponent. The player with the black stones plays first. However, the black player needs to give 7.5 points (called Komi) to the white player as a tradeoff. There are some variations in how many points the black player should give, based on the different rules used in different Asian countries.<br />
* This game used to be a huge challenge for artificial intelligence for two reasons. First, the search space is extremely large: it is estimated to be on the order of <math>10^{172}</math>, which is more than the number of atoms in the universe and much larger than the number of game states in chess (<math>10^{47}</math>). Second, there was no good heuristic function for evaluating a position in Go, so the traditional alpha-beta pruning algorithm does not perform well. In AlphaGo Lee, the CNN plays the role of a good heuristic function, which results in a huge performance improvement of the AI.<br />
[[File:go.JPG|700px|center]]<br />
<br />
Two neural networks were trained on the moves of human experts, to act as both a policy network and a value network. A Monte Carlo Tree Search algorithm was used for policy improvement.<br />
<br />
The AlphaGo Lee policy network predicts the best move given a board configuration. It has a CNN architecture with 13 hidden layers, and it is trained using expert game play data and improved through self-play.<br />
<br />
The value network evaluates the probability of winning given a board configuration. It consists of a CNN with 14 hidden layers, and it is trained using self-play data from the policy network. <br />
<br />
Finally, the two networks are combined using Monte-Carlo Tree Search, which performs a look-ahead search to select the actions for gameplay.<br />
<br />
The use of both policy and value networks is reflected in this paper's work.<br />
<br />
=== AlphaGo Zero ===<br />
<br />
AlphaGo Zero (Silver et al., 2017, [7]) is an improvement on the AlphaGo Lee algorithm. AlphaGo Zero uses a unified neural network in place of the separate policy and value networks and is trained on self-play, without the need for expert training data.<br />
Previous versions of AlphaGo initially trained on thousands of human amateur and professional games to learn how to play Go. AlphaGo Zero skips this step and learns to play simply by playing games against itself, starting from completely random play. In doing so, it quickly surpassed human level of play and defeated the previously published champion-defeating version of AlphaGo by 100 games to 0.<br />
It is able to do this by using a novel form of reinforcement learning, in which AlphaGo Zero becomes its own teacher. The system starts off with a neural network that knows nothing about the game of Go. It then plays games against itself, by combining this neural network with a powerful search algorithm. As it plays, the neural network is tuned and updated to predict moves, as well as the eventual winner of the games.<br />
<br />
This updated neural network is then recombined with the search algorithm to create a new, stronger version of AlphaGo Zero, and the process begins again. In each iteration, the performance of the system improves by a small amount, and the quality of the self-play games increases, leading to more and more accurate neural networks and ever stronger versions of AlphaGo Zero.<br />
<br />
This technique is more powerful than previous versions of AlphaGo because it is no longer constrained by the limits of human knowledge. Instead, it is able to learn tabula rasa from the strongest player in the world: AlphaGo itself.<br />
<br />
Other differences from the previous AlphaGo iterations are as follows. AlphaGo Zero only uses the black and white stones from the Go board as its input, whereas previous versions of AlphaGo included a small number of hand-engineered features. It uses one neural network rather than two. Earlier versions of AlphaGo used a “policy network” to select the next move to play and a ”value network” to predict the winner of the game from each position. These are combined in AlphaGo Zero, allowing it to be trained and evaluated more efficiently. AlphaGo Zero does not use “rollouts” - fast, random games used by other Go programs to predict which player will win from the current board position. Instead, it relies on its high quality neural networks to evaluate positions. All of these differences help improve the performance of the system and make it more general. But it is the algorithmic change that makes the system much more powerful and efficient.<br />
<br />
The unification of networks and self-play are also reflected in this paper.<br />
<br />
=== Curling Algorithms ===<br />
<br />
Some past algorithms have been proposed to deal with continuous action spaces. For example, Yamamoto et al. (2015, [8]) use game tree search methods in a discretized space. The value of an action is taken as the average of nearby values, with respect to some knowledge of execution uncertainty.<br />
<br />
=== Monte Carlo Tree Search ===<br />
<br />
Monte Carlo Tree Search algorithms have been applied to continuous action spaces. These algorithms, discussed in further detail below, balance exploration of different states with knowledge of paths of execution through past games. One such MCTS variant, KR-UCT, has been applied to continuous action spaces; it uses kernel regression (KR) and kernel density estimation (KDE) to estimate rewards from neighborhood information, which lets it find effective selections.<br />
<br />
For the bandit problem, researchers have used hierarchical optimistic optimization (HOO), which builds a cover tree that divides the action space into small ranges at different depths, so that the most promising nodes yield fine-grained estimates.<br />
<br />
=== Curling Physics and Simulation ===<br />
<br />
Several references in the paper concern the study and simulation of curling physics. Scholars have analyzed friction coefficients between curling stones and ice. Because modelling the changes in friction on ice is not possible, a fixed friction coefficient is predefined in the simulation. The behavior of the stones was also modeled. Important parameters were fitted to data from professional players, and the authors use the same parameters in this paper.<br />
<br />
== General Background of Algorithms ==<br />
<br />
=== Policy and Value Functions ===<br />
<br />
A policy function is trained to provide the best action to take, given a current state. Policy iteration is an algorithm used to improve a policy over time. This is done by alternating between policy evaluation and policy improvement.<br />
<br />
POLICY IMPROVEMENT: LEARNING ACTION POLICY<br />
<br />
Action policy <math> p_{\rho}(a|s) </math> outputs a probability distribution over all eligible moves <math> a </math>. Here <math> \rho </math> denotes the weights of a neural network that approximates the policy, <math>s</math> denotes a state, and <math>a</math> denotes an action taken in the environment. The policy is a function that returns an action given the state the agent is in. Policy gradient reinforcement learning can be used to train the action policy. It is updated by stochastic gradient ascent in the direction that maximizes the expected outcome at each time step <math>t</math>,<br />
\[ \Delta \rho \propto \frac{\partial \log p_{\rho}(a_t|s_t)}{\partial \rho} r(s_t) \]<br />
where <math> r(s_t) </math> is the return.<br />
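<br />
The update above can be sketched for a small softmax policy over a handful of discrete actions. This is only an illustration using the standard log-likelihood (REINFORCE) form of the gradient; the function names, the learning rate <code>lr</code> and the three-action example are assumptions made here, not details from the paper.<br />
<br />
```python
import math


def softmax(logits):
    """Convert raw action preferences into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, action, ret, lr=0.1):
    """One gradient-ascent step on the logits of a softmax policy.

    For a softmax policy, d log p(action) / d logit_k = 1[k == action] - p_k,
    so each logit moves by lr * ret * (1[k == action] - p_k).
    """
    probs = softmax(logits)
    return [
        logit + lr * ret * ((1.0 if k == action else 0.0) - p)
        for k, (logit, p) in enumerate(zip(logits, probs))
    ]


# Hypothetical example: three actions, action 1 was taken and led to return +1.
logits = [0.0, 0.0, 0.0]
new_logits = policy_gradient_step(logits, action=1, ret=1.0)
new_probs = softmax(new_logits)
```
<br />
A positive return raises the probability of the action that was taken, and a negative return lowers it.<br />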
<br />
POLICY EVALUATION: LEARNING VALUE FUNCTIONS<br />
<br />
A value function <math> v_{\theta}(s) </math> with parameters <math> \theta </math> is trained to estimate the value of being in a certain state. It is trained on records of state-outcome pairs <math> (s, r(s)) </math> by using stochastic gradient descent to minimize the mean squared error (MSE) between the predicted regression value and the corresponding outcome,<br />
\[ \Delta \theta \propto \frac{\partial v_{\theta}(s)}{\partial \theta}(r(s)-v_{\theta}(s)) \]<br />
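<br />
As a rough illustration of this update, the sketch below applies it to a linear value function, for which the derivative of the prediction with respect to each weight is simply the corresponding feature. The linear model, the feature vector and the learning rate are assumptions chosen for brevity; the paper uses a deep network instead.<br />
<br />
```python
def value_step(theta, features, outcome, lr=0.05):
    """One SGD step that reduces the squared error between v(s) and the outcome."""
    v = sum(t * f for t, f in zip(theta, features))  # linear prediction v_theta(s)
    error = outcome - v                              # r(s) - v_theta(s)
    # d v / d theta_k = features[k], so each weight moves by lr * error * features[k].
    return [t + lr * error * f for t, f in zip(theta, features)]


# Hypothetical example: repeatedly fit a single state with observed outcome 3.0.
theta = [0.0, 0.0]
for _ in range(200):
    theta = value_step(theta, features=[1.0, 2.0], outcome=3.0)
prediction = theta[0] * 1.0 + theta[1] * 2.0
```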
<br />
=== Monte Carlo Tree Search ===<br />
<br />
Monte Carlo Tree Search (MCTS) is a search algorithm used for finite-horizon tasks (e.g., in curling, only 16 moves, or stone throws, are taken in each end).<br />
<br />
MCTS is a tree search algorithm similar to minimax. However, MCTS is probabilistic and does not need to explore a full game tree or even a tree reduced with alpha-beta pruning. This makes it tractable for games such as Go and curling.<br />
<br />
Nodes of the tree are game states, and branches represent actions. Each node stores statistics on how many times it has been visited by the MCTS, as well as the number of wins encountered by playouts from that position. A node is considered 'visited' if a full playout has started from that node. A node is considered 'expanded' if all its children have been visited.<br />
<br />
MCTS begins with the '''selection''' phase, which involves traversing known states/actions. The traversal begins at the root node and selects the child with the highest 'score'. From each successive node, a path down to a leaf node is followed in a similar fashion.<br />
<br />
The next phase, '''expansion''', begins when the algorithm reaches a node where not all children have been visited (ie: the node has not been fully expanded). In the expansion phase, children of the node are visited, and '''simulations''' run from their states.<br />
<br />
Once the new child is expanded, '''simulation''' takes place. This refers to a full playout of the game from the point of the current node, and can involve many strategies, such as randomly taken moves, the use of heuristics, etc.<br />
<br />
The final phase is '''update''' or '''back-propagation''' (unrelated to the neural network algorithm). In this phase, the result of the '''simulation''' (i.e., win/lose) is updated in the statistics of all parent nodes.<br />
<br />
A selection function known as Upper Confidence Bound applied to Trees (UCT) can be used to choose which node to select. The formula for this equation is shown below [[https://www.baeldung.com/java-monte-carlo-tree-search source]]. Note that the first term essentially acts as an average score of games played from a certain node. The second term, meanwhile, will grow when sibling nodes are expanded. This means that unexplored nodes will gradually increase their UCT score, and be selected in the future. This formula serves the purpose of balancing exploitation (first term) and exploration (second term) in Monte Carlo Tree Search. The philosophy is that nodes with high rewards and nodes poorly explored should both be explored more often.<br />
<br />
Note that the Upper Confidence Bound (UCB) formula is theoretically known to achieve asymptotically optimal (logarithmic) regret on the multi-armed bandit problem.<br />
<br />
<math> \frac{w_i}{n_i} + c \sqrt{\frac{\ln t}{n_i}} </math><br />
<br />
In which<br />
<br />
* <math> w_i = </math> number of wins after <math> i</math>th move<br />
* <math> n_i = </math> number of simulations after <math> i</math>th move<br />
* <math> c = </math> exploration parameter (theoretically equal to <math> \sqrt{2}</math>)<br />
* <math> t = </math> total number of simulations for the parent node<br />
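<br />
The formula can be evaluated directly, as in the small sketch below. Treating an unvisited node as having an infinite score (so that it is tried first) is a common convention assumed here; the function names and the example statistics are illustrative only.<br />
<br />
```python
import math


def uct_score(wins, visits, parent_visits, c=math.sqrt(2)):
    """Average win rate plus an exploration bonus that shrinks with more visits."""
    if visits == 0:
        return math.inf  # assumed convention: always try unvisited children first
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children, parent_visits):
    """children is a list of (wins, visits) pairs; returns the index to select."""
    return max(
        range(len(children)),
        key=lambda i: uct_score(children[i][0], children[i][1], parent_visits),
    )


# Hypothetical statistics for two visited children of a node visited 14 times.
children = [(6, 10), (3, 4)]
chosen = select_child(children, parent_visits=14)
```
<br />
In this example the second child is chosen: it has both a higher average score and fewer visits, so both terms favour it.<br />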
<br />
<br />
Sources: 2,3,4<br />
<br />
[[File:MCTS_Diagram.jpg | 500px|center]]<br />
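<br />
To make the four phases concrete, the following sketch runs MCTS with random playouts on a toy game that is unrelated to curling: players alternately take 1 to 3 stones from a pile, and whoever takes the last stone wins. The game, the class layout and all names are assumptions chosen to keep the example short.<br />
<br />
```python
import math
import random


class Node:
    def __init__(self, stones, player, parent=None, move=None):
        self.stones = stones    # stones left in the pile
        self.player = player    # player to move at this node (0 or 1)
        self.parent = parent
        self.move = move        # move that led from the parent to this node
        self.children = []
        self.untried = [m for m in (1, 2, 3) if m <= stones]
        self.visits = 0
        self.wins = 0.0         # wins for the player who moved into this node


def uct(child, parent_visits, c=math.sqrt(2)):
    return child.wins / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)


def rollout(stones, player, rng):
    """Random playout; returns the player who takes the last stone."""
    while True:
        stones -= rng.randint(1, min(3, stones))
        if stones == 0:
            return player
        player = 1 - player


def mcts(stones, player, iterations, rng):
    root = Node(stones, player)
    for _ in range(iterations):
        node = root
        # Selection: descend while the node is fully expanded and not terminal.
        while not node.untried and node.children:
            node = max(node.children, key=lambda ch: uct(ch, node.visits))
        # Expansion: add one child for a move that has not been tried yet.
        if node.untried:
            move = node.untried.pop()
            child = Node(node.stones - move, 1 - node.player, node, move)
            node.children.append(child)
            node = child
        # Simulation: play out the game (a terminal node was won by the previous mover).
        if node.stones == 0:
            winner = 1 - node.player
        else:
            winner = rollout(node.stones, node.player, rng)
        # Update: propagate the result to every node on the path back to the root.
        while node is not None:
            node.visits += 1
            if node.parent is not None and winner == node.parent.player:
                node.wins += 1
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).move


best_move = mcts(stones=2, player=0, iterations=200, rng=random.Random(0))
```
<br />
With two stones left, taking both wins immediately, so the search concentrates its visits on that move.<br />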
<br />
=== Kernel Regression ===<br />
<br />
Kernel regression is a form of weighted averaging which uses a kernel function as a weight to estimate the conditional expectation of a random variable. Given data points '''x''', each of which has an associated value '''y''', and a choice of kernel '''K''', the kernel function takes two data points and outputs a weighting factor. An estimate of the value of a new, unseen point is then calculated as the weighted average of the values of surrounding points.<br />
<br />
A typical kernel is a Gaussian kernel, shown below. The formula for calculating estimated value is shown below as well (sources: Lee et al.).<br />
<br />
[[File:gaussian_kernel.png | 400 px]]<br />
<br />
[[File:kernel_regression.png | 250 px]]<br />
<br />
The denominator of the conditional expectation is related to kernel density estimation, which is defined as <math display="inline">W(x)=\sum_{i=0}^n K(x,x_i)</math>.<br />
<br />
In this case, the two act together to weight the scores of samples closest to '''x''' more strongly.<br />
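<br />
A minimal one-dimensional sketch of kernel regression and kernel density estimation is given below. The scalar Gaussian kernel with a single <code>bandwidth</code> parameter is an assumption made here for simplicity; the kernel used in the paper operates on curling actions and may be parameterized differently.<br />
<br />
```python
import math


def gaussian_kernel(x, x_i, bandwidth=1.0):
    """Weight that decays with the squared distance between x and x_i."""
    return math.exp(-((x - x_i) ** 2) / (2.0 * bandwidth ** 2))


def kernel_density(x, xs, bandwidth=1.0):
    """W(x): the sum of kernel weights between x and every sample."""
    return sum(gaussian_kernel(x, x_i, bandwidth) for x_i in xs)


def kernel_regression(x, xs, ys, bandwidth=1.0):
    """Kernel-weighted average of the sample values ys."""
    weights = [gaussian_kernel(x, x_i, bandwidth) for x_i in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


# Hypothetical samples: the estimate at x = 1.0 is dominated by the sample at 1.0.
xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.0, 4.0]
estimate = kernel_regression(1.0, xs, ys, bandwidth=0.5)
density = kernel_density(1.0, xs, bandwidth=0.5)
```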
<br />
= Methods =<br />
<br />
== Variable Definitions ==<br />
<br />
The following variables are used often in the paper:<br />
<br />
* <math>s</math>: A state in the game, as described below as the input to the network.<br />
* <math>s_t</math>: The state at a certain time-step of the game. Time-steps refer to full turns in the game<br />
* <math>a_t</math>: The action taken in state <math>s_t</math><br />
* <math>A_t</math>: The actions taken for sibling nodes related to <math>a_t</math> in MCTS<br />
* <math>n_{a_t}</math>: The number of visits to node <math>a_t</math> in MCTS<br />
* <math>v_{a_t}</math>: The MCTS value estimate of a node<br />
<br />
== Network Design ==<br />
<br />
The authors design a CNN called the 'policy-value' network. The network consists of a common network structure, which is then split into 'policy' and 'value' outputs. This network is trained to learn a probability distribution of actions to take, and expected rewards, given an input state.<br />
<br />
=== Shared Structure ===<br />
<br />
The network consists of 1 convolutional layer followed by 9 residual blocks, each block consisting of 2 convolutional layers with 32 3x3 filters. The structure of this network is shown below:<br />
<br />
<br />
[[File:curling_network_layers.png|600px|thumb|center|Figure 2. A detail description of our policy-value network. The shared network is composed of one convolutional layer and nine residual blocks. Each residual block (explained in b) has two convolutional layer with batch normalization (Ioffe & Szegedy, 2015[11]) followed by the addition of the input and the residual block. Each layer in the shared network uses 3x3 filters. The policy head<br />
has two more convolutional layers, while the value head has two fully connected layers on top of a convolutional layer. For the activation function of each convolutional layer, ReLU (Nair & Hinton[12]) is used.]]<br />
<br />
<br />
<br />
The input to this network is the following:<br />
* Location of stones<br />
* Order to tee (the center of the sheet)<br />
* A 32x32 grid representation of the ice sheet, representing which stones are present in each grid cell.<br />
<br />
The authors do not describe how the stone-based information is added to the 32x32 grid as input to the network.<br />
<br />
=== Policy Network ===<br />
<br />
The policy head is created by adding 2 convolutional layers with 2 (two) 3x3 filters to the main body of the network. The output of the policy head is a distribution of probabilities of the actions to select the best shot out of a 32x32x2 set of actions. The actions represent target locations in the grid and spin direction of the stone.<br />
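<br />
One way to read the 32x32x2 output is as a flat vector of 2048 probabilities that is decoded back into a grid cell and a spin direction. The index layout below (spin first, then row, then column) is purely an assumption for illustration; the summary does not state how the authors order the outputs.<br />
<br />
```python
GRID = 32   # grid cells per side
SPINS = 2   # two spin directions


def decode_action(index):
    """Map a flat index in [0, 32*32*2) to (row, col, spin) under the assumed layout."""
    if not 0 <= index < GRID * GRID * SPINS:
        raise ValueError("index out of range")
    spin, cell = divmod(index, GRID * GRID)
    row, col = divmod(cell, GRID)
    return row, col, spin


def best_action(probs):
    """Return the decoded action with the highest predicted probability."""
    index = max(range(len(probs)), key=probs.__getitem__)
    return decode_action(index)


# Hypothetical policy output that puts most of its mass on a single action.
probs = [0.0] * (GRID * GRID * SPINS)
probs[1024 + 33] = 1.0
action = best_action(probs)
```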
<br />
[[File:policy-value-net.PNG | 700px]]<br />
<br />
=== Value Network ===<br />
<br />
The value head is created by adding a convolution layer with 1 3x3 filter, and dense layers of 256 and 17 units, to the shared network. The 17 output units represent a probability distribution over scores in the range of [-8,8], which are the possible scores at each end of a curling game.<br />
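<br />
Given such a 17-way distribution, the expected score of an end is the probability-weighted sum of the possible scores. The sketch below shows this arithmetic; the example probabilities are made up for illustration.<br />
<br />
```python
SCORES = list(range(-8, 9))  # the 17 possible end scores, from -8 to 8


def expected_score(probs):
    """Expected end score under a distribution over the 17 possible scores."""
    if len(probs) != len(SCORES):
        raise ValueError("expected one probability per possible score")
    return sum(s * p for s, p in zip(SCORES, probs))


# Hypothetical value-head output: mostly a one-point end, sometimes two, sometimes a steal.
probs = [0.0] * 17
probs[SCORES.index(1)] = 0.5
probs[SCORES.index(2)] = 0.25
probs[SCORES.index(-1)] = 0.25
expected = expected_score(probs)
```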
<br />
== Continuous Action Search ==<br />
<br />
The policy head of the network only outputs actions from a discretized action space. For real-life interactions, and especially in curling, this will not suffice, as very fine adjustments to actions can make significant differences in outcomes.<br />
<br />
Actions in the continuous space are generated using an MCTS algorithm, with the following steps:<br />
<br />
=== Selection ===<br />
<br />
From a given state, the list of already-visited actions is denoted as A<sub>t</sub>. Scores and the number of visits to each node are estimated using the equations below (the first equation shows the expectation of the end value for one-end games). These are likely estimated rather than simply taken from the MCTS statistics to help account for the differences in a continuous action space.<br />
<br />
[[File:curling_kernel_equations.png | 400px]]<br />
<br />
The UCB formula is then used to select an action to expand.<br />
<br />
The actions that are taken in the simulator appear to be drawn from a Gaussian centered around <math>a_t</math>. This allows exploration in the continuous action space.<br />
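<br />
The selection step can be sketched as follows for one-dimensional actions. This follows the general kernel-regression UCT idea (values and visit counts of nearby actions are shared through a kernel before the UCB rule is applied); it is an assumption-laden illustration and may differ in detail from the equations in the figure above. All visit counts are assumed to be at least 1.<br />
<br />
```python
import math


def kernel(a, b, bandwidth=1.0):
    """Gaussian similarity between two (scalar) actions."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def kr_estimates(actions, values, visits, bandwidth=1.0):
    """Return kernel-smoothed (value, visit count) estimates for every action."""
    est_values, est_visits = [], []
    for a in actions:
        weights = [kernel(a, b, bandwidth) * n for b, n in zip(actions, visits)]
        total = sum(weights)  # kernel density estimate of the visit count
        est_visits.append(total)
        est_values.append(sum(w * v for w, v in zip(weights, values)) / total)
    return est_values, est_visits


def select_action(actions, values, visits, c=1.0, bandwidth=1.0):
    """Pick the action maximizing smoothed value plus a UCB-style exploration bonus."""
    est_values, est_visits = kr_estimates(actions, values, visits, bandwidth)
    log_total = math.log(sum(est_visits))
    scores = [v + c * math.sqrt(log_total / w) for v, w in zip(est_values, est_visits)]
    return max(range(len(actions)), key=scores.__getitem__)


# Hypothetical statistics: two well-explored neighbours and one isolated, rarely tried action.
actions = [0.0, 0.1, 5.0]
values = [1.0, 1.0, 0.0]
visits = [10, 10, 1]
exploit = select_action(actions, values, visits, c=0.0)
explore = select_action(actions, values, visits, c=10.0)
```
<br />
With no exploration bonus one of the two high-value neighbours is selected, while a large bonus favours the isolated action, whose smoothed visit count is small.<br />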
<br />
=== Expansion ===<br />
<br />
The authors use a variant of regular UCT for expansion. In this case, they expand a new node only when existing nodes have been visited a certain number of times. The authors utilize a widening approach to overcome problems with standard UCT performing a shallow search when there is a large action space.<br />
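<br />
A widening rule of this kind is often written as a test on the number of children relative to the number of visits. The power-law form and the constants below are assumptions for illustration; the summary does not give the exact criterion used by the authors.<br />
<br />
```python
def should_expand(num_children, parent_visits, c=1.0, alpha=0.5):
    """Allow a new child only while the child count is below c * visits**alpha."""
    return num_children < c * (parent_visits ** alpha)


# Under these assumed constants, a node visited 16 times may hold up to 4 children.
can_add_fourth = should_expand(num_children=3, parent_visits=16)
can_add_fifth = should_expand(num_children=4, parent_visits=16)
```
<br />
Because the allowed number of children grows slowly with the visit count, the search goes deeper along existing actions before it widens.<br />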
<br />
=== Simulation ===<br />
<br />
Instead of simulating with a random game playout, the authors use the value network to estimate the likely score associated with a state. This speeds up simulation (assuming the network is well trained), as the game does not actually need to be simulated.<br />
<br />
=== Backpropagation ===<br />
<br />
Standard backpropagation is used, updating both the values and number of visits stored in the path of parent nodes.<br />
<br />
<br />
== Supervised Learning ==<br />
<br />
During supervised training, data is gathered from the program AyumuGAT'16 ([9]). This program is a high-performance AI curling program that is also based on an MCTS algorithm. 400,000 state-action pairs were generated for this training.<br />
<br />
=== Policy Network ===<br />
<br />
The policy network was trained to learn the action taken in each state. Here, the likelihood of the taken action was set to be 1, and the likelihood of other actions to be 0.<br />
<br />
=== Value Network ===<br />
<br />
The value network was trained by 'd-depth simulations and bootstrapping of the prediction to handle the high variance in rewards resulting from a sequence of stochastic moves' (quote taken from paper). In this case, ''m'' state-action pairs were sampled from the training data. For each pair, <math>(s_t, a_t)</math>, a state ''d'' steps ahead, <math>s_{t+d}</math>, was generated. This process dealt with uncertainty by treating all actions in this rollout as free of uncertainty except the last action, ''a<sub>t+d-1</sub>'', which is allowed to be uncertain. The value network is used to predict the value for this state, <math>z_t</math>, and the value is used for learning the value at ''s<sub>t</sub>''.<br />
<br />
=== Policy-Value Network ===<br />
<br />
The policy-value network was trained to maximize the similarity of the predicted policy and value, and the actual policy and value from a state. The learning algorithm parameters are:<br />
<br />
* Algorithm: stochastic gradient descent<br />
* Batch size: 256<br />
* Momentum: 0.9<br />
* L2 regularization: 0.0001<br />
* Training time: ~100 epochs<br />
* Learning rate: initialized at 0.01, reduced twice<br />
<br />
A multi-task loss function was used. This takes the summation of the cross-entropy losses of each prediction:<br />
<br />
[[File:curling_loss_function.png | 300px]]<br />
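<br />
As a sketch of the idea, the snippet below adds a cross-entropy term for the policy output to a cross-entropy term for the value (score distribution) output. The small constant <code>eps</code> and the toy three-class distributions are assumptions for illustration, and the exact loss in the figure above may contain further terms such as regularization.<br />
<br />
```python
import math


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def policy_value_loss(pi_target, pi_pred, z_target, z_pred):
    """Multi-task loss: policy cross-entropy plus value cross-entropy."""
    return cross_entropy(pi_target, pi_pred) + cross_entropy(z_target, z_pred)


# Hypothetical targets and predictions over three classes each.
pi_target, z_target = [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]
good = policy_value_loss(pi_target, [0.1, 0.8, 0.1], z_target, [0.9, 0.05, 0.05])
bad = policy_value_loss(pi_target, [0.6, 0.2, 0.2], z_target, [0.3, 0.4, 0.3])
```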
<br />
== Self-Play Reinforcement Learning ==<br />
<br />
After initialization by supervised learning, the algorithm uses self-play to further train itself. During this training, the policy network learns probabilities from the MCTS process, while the value network learns from game outcomes.<br />
<br />
At a game state ''s<sub>t</sub>'':<br />
<br />
1) the algorithm outputs a prediction ''z<sub>t</sub>''. This is an estimate of game score probabilities. It is based on similar past actions, and computed using kernel regression.<br />
<br />
2) the algorithm outputs a prediction <math>\pi_t</math>, representing a probability distribution of actions. These are proportional to estimated visit counts from MCTS, based on kernel density estimation.<br />
<br />
It is not clear how these predictions are created. It would seem likely that the policy-value network generates these, but the wording of the paper suggests they are generated from MCTS statistics.<br />
<br />
The policy-value network is updated by sampling data <math>(s, \pi, z)</math> from recent history of self-play. The same loss function is used as before.<br />
<br />
It is not clear how the improved network is used, as MCTS seems to be the driving process at this point.<br />
<br />
== Long-Term Strategy Learning ==<br />
<br />
Finally, the authors implement a new strategy to augment their algorithm for long-term play. In this context, this refers to playing a game over many ends, where the strategy to win a single end may not be a good strategy to win a full game. For example, scoring one point in an end, while being one point ahead, gives the advantage to the other team in the next round (as they will throw the last stone). The other team could then use the advantage to score two points, taking the lead.<br />
<br />
The authors build a 'winning percentage' table. This table stores the percentage of games won, based on the number of ends left, and the difference in score (current team - opposing team). This can be computed iteratively and using the probability distribution estimation of one-end scores.<br />
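<br />
The iterative computation can be sketched as a small dynamic program over the number of ends left and the score difference. The version below is deliberately simplified: it uses a single one-end score distribution from the current team's perspective, ignores which team holds the last stone, counts a tie after the final end as half a win, and clips large score differences. All of these are assumptions made here, not details from the paper.<br />
<br />
```python
def winning_table(score_probs, max_ends, max_diff=20):
    """table[k][d]: win probability with k ends left and score difference d.

    score_probs maps a one-end score (current team's perspective) to its probability.
    """
    def clip(d):
        return max(-max_diff, min(max_diff, d))

    diffs = range(-max_diff, max_diff + 1)
    # With no ends left, the sign of the score difference decides the game.
    table = [{d: 1.0 if d > 0 else (0.5 if d == 0 else 0.0) for d in diffs}]
    for _ in range(max_ends):
        prev = table[-1]
        table.append({
            d: sum(p * prev[clip(d + s)] for s, p in score_probs.items())
            for d in diffs
        })
    return table


# Hypothetical one-end distribution: win or lose one point with equal probability.
table = winning_table({-1: 0.5, 1: 0.5}, max_ends=2)
lead_by_one_with_one_end_left = table[1][1]
```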
<br />
== Final Algorithms ==<br />
<br />
The authors make use of the following versions of their algorithm:<br />
<br />
=== KR-DL ===<br />
<br />
''Kernel regression-deep learning'': This algorithm is trained only by supervised learning.<br />
<br />
=== KR-DRL ===<br />
<br />
''Kernel regression-deep reinforcement learning'': This algorithm is trained by supervised learning (i.e., initialized as the KR-DL algorithm), and again on self-play. During self-play, each shot is selected after 400 MCTS simulations of k=20 randomly selected actions. Data for self-play was collected over a week on 5 GPUs and generated 5 million game positions. The policy-value network was continually updated using samples from the latest 1 million game positions.<br />
<br />
=== KR-DRL-MES ===<br />
<br />
''Kernel regression-deep reinforcement learning-multi-ends-strategy'': This algorithm makes use of the winning percentage table generated from self-play.<br />
<br />
= Testing and Results =<br />
The authors use data from the public program AyumuGAT’16 to test. Testing is done with a simulated curling program [10]. This simulator does not deal with changing ice conditions, or sweeping, but does deal with stone trajectories and collisions.<br />
<br />
== Comparison of KR-DL-UCT and DL-UCT ==<br />
<br />
The first test compares an algorithm trained with kernel regression with an algorithm trained without kernel regression, to show the contribution that kernel regression adds to the performance. Both algorithms have networks initialised with the supervised learning, and then trained with two different algorithms for self-play. KR-DL-UCT uses the algorithm described above. The authors do not go into detail on how DL-UCT selects shots, but state that a constant is set to allow exploration.<br />
<br />
As an evaluation, both algorithms play 2000 games against the DL-UCT algorithm, which is frozen after supervised training. 1000 games are played with the algorithm taking the first shot, and 1000 with it taking the second. The games were two-end games. The figure below shows each algorithm's winning percentage given different amounts of training data. While the DL-UCT outperforms the supervised-training-only DL-UCT algorithm, the KR-DL-UCT algorithm performs much better.<br />
<br />
<center>[[File:curling_KR_test.png | 400px]]</center><br />
<br />
== Matches ==<br />
<br />
Finally, to test the performance of their multiple algorithms, the authors run matches between their algorithms and other existing programs. Each algorithm plays 200 matches against each other program, 100 of which are played as the first-playing team, and 100 as the second-playing team. Only 1 program was able to out-perform the KR-DRL algorithm. The authors state that this program, ''JiritsukunGAT'17'', also uses a deep network and hand-crafted features. However, the KR-DRL-MES algorithm was still able to out-perform it. Figure 4 shows the Elo ratings of the different programs. Note that the programs in blue are those created by the authors. They also played some games between their KR-DRL-MES and notable programs. Table 1 shows the details of the match results. ''JiritsukunGAT'17'' shows a similar level of performance, but KR-DRL-MES is still the winner.<br />
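<br />
For readers unfamiliar with Elo ratings, the textbook relation between a rating difference and an expected win rate is sketched below. How the authors fitted the ratings in Figure 4 is not described in this summary, so this is only the standard formula, and the example ratings are made up.<br />
<br />
```python
def elo_expected(r_a, r_b):
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


# Hypothetical ratings: a 400-point advantage corresponds to roughly a 91% expected score.
even = elo_expected(1500, 1500)
favourite = elo_expected(1900, 1500)
```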
<br />
<br />
<br />
[[File:curling_ratings.png|600px|thumb|center|Figure 4. Elo rating and winning percentages of our models and GAT rankers. Each match has 200 games (each program plays 100 pre-ordered games), because the player which has the last shot (the hammer shot) in each end would have an advantage.]]<br />
<br />
<br />
[[File:ttt.png|600px|thumb|center|Table 1. The 8-end game results for KR-DRL-MES against other programs alternating the opening player each game. The matches are held by following the rules of the latest GAT competition.]]<br />
<br />
= Conclusion & Critique =<br />
<br />
The authors have presented a new framework which incorporates a deep neural network for learning game strategy with a kernel-based Monte Carlo tree search from a continuous space. Without the use of any hand-crafted feature, their policy-value network is successfully trained using supervised learning followed by reinforcement learning with a high-fidelity simulator for the Olympic sport of curling. Following are my critiques on the paper:<br />
<br />
== Strengths ==<br />
<br />
This algorithm out-performs other high-performance algorithms (including past competition champions).<br />
<br />
I think the paper does a decent job of comparing the performance of their algorithm to others. They are able to clearly show the benefits of many of their additions.<br />
<br />
The authors do seem to be able to adopt strategies similar to those used in Go and other games to the continuous action-space domain. In addition, the final strategy needs no hand-crafted features for learning.<br />
<br />
== Weaknesses ==<br />
<br />
Sometimes, I found this paper difficult to follow. One problem was that the algorithms were introduced first, and then how they were used was described. So when the paper stated that self-play shots were taken after 400 simulations, it seemed unclear what simulations were being run and at what stage of the algorithm (e.g., MCTS simulations, simulations sped up by using the value network, full simulations on the curling simulator). In particular, both the MCTS statistics and the policy-value network could be used to estimate both action probabilities and state values, so it is difficult to tell which is used in which case. There was also no clear distinction between discrete-space actions and continuous-space actions.<br />
<br />
While I think the comparison of different algorithms was done well, I believe it still lacked significant details. There were one-off claims mentioned in the paper which would have been nice to see as results. These include the statement that having a policy-value network in place of two networks led to better performance.<br />
<br />
At this point, the algorithms used still rely on initialization by a pre-made program.<br />
<br />
There was little theoretical development or justification done in this paper.<br />
<br />
While curling is an interesting choice for demonstrating the algorithm, the fact that the simulations used did not support many of the key points of curling (ice conditions, sweeping) seems very limiting. Another game, such as pool, would likely have offered some of the same challenges while allowing higher-fidelity simulations/training.<br />
<br />
While the spatial placements of stones were discretized in a grid, the curl of thrown stones was discretized to only +/-1. This seems like it may limit learning high- and low-spin moves. It should be noted that having zero spins is not commonly used, to the best of my knowledge.<br />
<br />
=References=<br />
# Lee, K., Kim, S., Choi, J. & Lee, S. "Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling." Proceedings of the 35th International Conference on Machine Learning, in PMLR 80:2937-2946 (2018)<br />
# https://www.baeldung.com/java-monte-carlo-tree-search<br />
# https://jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/<br />
# https://int8.io/monte-carlo-tree-search-beginners-guide/<br />
# https://en.wikipedia.org/wiki/Monte_Carlo_tree_search<br />
# Silver, D., Huang, A., Maddison, C., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, pp. 484–489, 2016.<br />
# Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., and Hassabis, D. Mastering the game of Go without human knowledge. Nature, pp. 354–359, 2017.<br />
# Yamamoto, M., Kato, S., and Iizuka, H. Digital curling strategy based on game tree search. In Proceedings of the IEEE Conference on Computational Intelligence and Games, CIG, pp. 474–480, 2015.<br />
# Ohto, K. and Tanaka, T. A curling agent based on the montecarlo tree search considering the similarity of the best action among similar states. In Proceedings of Advances in Computer Games, ACG, pp. 151–164, 2017.<br />
# Ito, T. and Kitasei, Y. Proposal and implementation of digital curling. In Proceedings of the IEEE Conference on Computational Intelligence and Games, CIG, pp. 469–473, 2015.<br />
# Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, ICML, pp. 448–456, 2015.<br />
# Nair, V. and Hinton, G. Rectified linear units improve restricted boltzmann machines.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Reinforcement_Learning_in_Continuous_Action_Spaces_a_Case_Study_in_the_Game_of_Simulated_Curling&diff=42151Deep Reinforcement Learning in Continuous Action Spaces a Case Study in the Game of Simulated Curling2018-11-30T23:15:48Z<p>Z43ma: </p>
<hr />
<div>This page provides a summary and critique of the paper '''Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling''' [[http://proceedings.mlr.press/v80/lee18b/lee18b.pdf Online Source]], published in ICML 2018. The source code for this paper is available [https://github.com/leekwoon/KR-DL-UCT here]<br />
<br />
= Introduction and Motivation =<br />
<br />
In recent years, Reinforcement Learning methods have been applied to many different games, such as chess and checkers. More recently, the use of CNNs has allowed neural networks to out-perform humans in many difficult games, such as Go. However, many of these cases involve a discrete state or action space; the number of actions a player can take and/or the number of possible game states are finite. Deep CNNs are not directly applicable to large, non-convex continuous action spaces. To solve this issue, the authors conduct a policy search with an efficient stochastic continuous action search on top of policy samples generated from a deep CNN. Their deep CNN still discretizes the state space and the action space. However, in the stochastic continuous action search, they lift the restriction of the deterministic discretization and conduct a local search procedure in a physical simulator with continuous action samples. In this way, the benefits of both deep neural networks and physical simulators can be realized.<br />
<br />
Interacting with the real world (e.g.; a scenario that involves moving physical objects) typically involves working with a continuous action space. It is thus important to develop strategies for dealing with continuous action spaces. Deep neural networks that are designed to succeed in finite action spaces are not necessarily suitable for continuous action space problems. This is due to the fact that deterministic discretization of a continuous action space causes strong biases in policy evaluation and improvement. <br />
<br />
This paper introduces a method to allow learning with continuous action spaces. A CNN is used to perform learning on discretized state and action spaces, and then a continuous action search is performed on these discrete results.<br />
<br />
Curling is chosen as a domain to test the network on. Curling was chosen due to its large action space, potential for complicated strategies, and need for precise interactions.<br />
<br />
== Curling ==<br />
<br />
Curling is a sport played by two teams on a long sheet of ice. Roughly, the goal is for each team to slide rocks closer to the target on the other end of the sheet than the other team. The next sections will provide a background on the game play, and potential challenges/concerns for learning algorithms. A terminology section follows.<br />
<br />
=== Game play ===<br />
<br />
A game of curling is divided into ends. In each end, players from both teams alternate throwing (sliding) eight rocks to the other end of the ice sheet, known as the house. Rocks must land in a certain area in order to stay in play, and must touch or be inside concentric rings (12ft diameter and smaller) in order to score points. At the end of each end, the team with rocks closest to the center of the house scores points.<br />
<br />
When throwing a rock, the curler can spin the rock. This allows the rock to 'curl' its path towards the house and can allow rocks to travel around other rocks. Team members are also able to sweep the ice in front of a moving rock in order to decrease friction, which allows for fine-tuning of distance (though the physics of sweeping are not implemented in the simulation used).<br />
<br />
Curling offers many possible high-level actions, which are directed by a team member to the throwing member. An example set of these includes:<br />
<br />
* Draw: Throw a rock to a target location<br />
* Freeze: Draw a rock up against another rock<br />
* Takeout: Knock another rock out of the house. Can be combined with different ricochet directions<br />
* Guard: Place a rock in front of another, to block other rocks (ex: takeouts)<br />
<br />
=== Challenges for AI ===<br />
<br />
Curling poses many challenges for AI because of its physics and rules. This section lists a few concerns.<br />
<br />
The effect of changing actions can be highly nonlinear and discontinuous. For example, a 1-cm deviation in a path can make the difference between a high-speed collision and no collision at all.<br />
<br />
Curling will require both offensive and defensive strategies. For example, consider the fact that the last team to throw a rock each end only needs to place that rock closer than the opposing team's rocks to score a point and invalidate any opposing rocks in the house. The opposing team should thus be considering how to prevent this from happening, in addition to scoring points themselves.<br />
<br />
Curling also has a concept known as 'the hammer'. The hammer belongs to the team which throws the last rock each end, providing an advantage, and is given to the team that does not score points each end. It could very well be a good strategy to try not to win a single point in an end (if already ahead in points, etc), as this would give the advantage to the opposing team.<br />
<br />
Finally, curling has a rule known as the 'Free Guard Zone'. This applies to the first 4 rocks thrown (2 from each team). If they land short of the house, but still in play, then the rocks are not allowed to be removed (via collisions) until all of the first 4 rocks have been thrown.<br />
<br />
=== Terminology ===<br />
<br />
* End: A round of the game<br />
* House: The target area at the end of the sheet of ice, which contains the concentric scoring rings<br />
* Hammer: The team that throws the last rock of an end 'has the hammer'<br />
* Hog Line: thick line that is drawn in front of the house, orthogonal to the length of the ice sheet. Rocks must pass this line to remain in play.<br />
* Back Line: thin line drawn just behind the house. Rocks that pass this line are removed from play.<br />
<br />
<br />
== Related Work ==<br />
<br />
=== AlphaGo Lee ===<br />
<br />
AlphaGo Lee (Silver et al., 2016, [6]) refers to an algorithm used to play the game Go, which was able to defeat international champion Lee Sedol.<br />
<br />
<br />
Go game:<br />
* Start with an empty 19x19 board<br />
* One player takes the black stones and the other takes the white stones<br />
* The two players take turns placing stones on the board<br />
* Once a stone has been placed, it cannot be moved<br />
* Rules:<br />
1. If one connected group of stones is completely surrounded by the opponent's stones, it is removed from the board<br />
<br />
2. Ko rule: Forbids a play that repeats a previous board position<br />
* The game ends when there are no valuable moves left.<br />
* The territory of both players is then counted. The objective of the game is to capture more territory than your opponent. The player with the black stones plays first. However, the black player must give 7.5 points (called Komi) to the white player as compensation. The exact number of points given varies between the rule sets used in different Asian countries.<br />
* This game used to be a huge challenge for artificial intelligence for two reasons. First, the search space is extremely large. It is estimated to be on the order of <math>10^{172}</math>, which is more than the number of atoms in the universe, and much larger than the number of game states in chess (<math>10^{47}</math>). Second, there was no good heuristic function for evaluating a position in Go, so the traditional alpha-beta pruning algorithm performs poorly. In AlphaGo Lee, the CNN plays the role of a good heuristic function, which results in a huge improvement in the AI's performance.<br />
[[File:go.JPG|700px|center]]<br />
<br />
Two neural networks were trained on the moves of human experts, to act as both a policy network and a value network. A Monte Carlo Tree Search algorithm was used for policy improvement.<br />
<br />
The AlphaGo Lee policy network predicts the best move given a board configuration. It has a CNN architecture with 13 hidden layers, and it is trained using expert game play data and improved through self-play.<br />
<br />
The value network evaluates the probability of winning given a board configuration. It consists of a CNN with 14 hidden layers, and it is trained using self-play data from the policy network. <br />
<br />
Finally, the two networks are combined using Monte-Carlo Tree Search, which performs a look-ahead search to select the actions for gameplay.<br />
<br />
The use of both policy and value networks is reflected in this paper's work.<br />
<br />
=== AlphaGo Zero ===<br />
<br />
AlphaGo Zero (Silver et al., 2017, [7]) is an improvement on the AlphaGo Lee algorithm. AlphaGo Zero uses a unified neural network in place of the separate policy and value networks and is trained on self-play, without the need for expert training data.<br />
Previous versions of AlphaGo initially trained on thousands of human amateur and professional games to learn how to play Go. AlphaGo Zero skips this step and learns to play simply by playing games against itself, starting from completely random play. In doing so, it quickly surpassed human level of play and defeated the previously published champion-defeating version of AlphaGo by 100 games to 0.<br />
It is able to do this by using a novel form of reinforcement learning, in which AlphaGo Zero becomes its own teacher. The system starts off with a neural network that knows nothing about the game of Go. It then plays games against itself, by combining this neural network with a powerful search algorithm. As it plays, the neural network is tuned and updated to predict moves, as well as the eventual winner of the games.<br />
<br />
This updated neural network is then recombined with the search algorithm to create a new, stronger version of AlphaGo Zero, and the process begins again. In each iteration, the performance of the system improves by a small amount, and the quality of the self-play games increases, leading to more and more accurate neural networks and ever stronger versions of AlphaGo Zero.<br />
<br />
This technique is more powerful than previous versions of AlphaGo because it is no longer constrained by the limits of human knowledge. Instead, it is able to learn tabula rasa from the strongest player in the world: AlphaGo itself.<br />
<br />
Other differences from the previous AlphaGo iterations are as follows. AlphaGo Zero only uses the black and white stones from the Go board as its input, whereas previous versions of AlphaGo included a small number of hand-engineered features. It uses one neural network rather than two. Earlier versions of AlphaGo used a “policy network” to select the next move to play and a “value network” to predict the winner of the game from each position. These are combined in AlphaGo Zero, allowing it to be trained and evaluated more efficiently. AlphaGo Zero does not use “rollouts”, the fast, random games used by other Go programs to predict which player will win from the current board position. Instead, it relies on its high-quality neural networks to evaluate positions. All of these differences help improve the performance of the system and make it more general. But it is the algorithmic change that makes the system much more powerful and efficient.<br />
<br />
The unification of networks and self-play are also reflected in this paper.<br />
<br />
=== Curling Algorithms ===<br />
<br />
Some past algorithms have been proposed to deal with continuous action spaces. For example, Yamamoto et al. (2015, [8]) use game tree search methods in a discretized space. The value of an action is taken as the average of nearby values, with respect to some knowledge of execution uncertainty.<br />
<br />
=== Monte Carlo Tree Search ===<br />
<br />
Monte Carlo Tree Search algorithms have been applied to continuous action spaces. These algorithms, to be discussed in further detail, balance exploration of different states with knowledge of paths of execution through past games. Researchers have applied an MCTS variant called KR-UCT to continuous action spaces; it finds effective selections by using kernel regression (KR) and kernel density estimation (KDE) to estimate rewards from neighborhood information.<br />
<br />
For bandit problems, scholars have used hierarchical optimistic optimization (HOO) to create a cover tree that divides the action space into small ranges at different depths, where the most promising nodes produce fine-granularity estimates.<br />
<br />
=== Curling Physics and Simulation ===<br />
<br />
Several references in the paper concern the study and simulation of curling physics. Scholars have analyzed friction coefficients between curling stones and ice. Because modelling the changes in friction on ice is not possible, a fixed friction coefficient was predefined in the simulation. The behavior of the stones was also modeled. Important parameters were trained from professional players. The authors used the same parameters in this paper.<br />
<br />
== General Background of Algorithms ==<br />
<br />
=== Policy and Value Functions ===<br />
<br />
A policy function is trained to provide the best action to take, given a current state. Policy iteration is an algorithm used to improve a policy over time. This is done by alternating between policy evaluation and policy improvement.<br />
<br />
POLICY IMPROVEMENT: LEARNING ACTION POLICY<br />
<br />
Action policy <math> p_{\rho}(a|s) </math> outputs a probability distribution over all eligible moves <math> a </math>. Here <math> \rho </math> denotes the weights of a neural network that approximates the policy, <math>s</math> denotes a state and <math>a</math> denotes an action taken in the environment. The policy is a function that returns an action given the state in which the agent finds itself. Policy gradient reinforcement learning can be used to train the action policy. It is updated by stochastic gradient ascent in the direction that maximizes the expected outcome at each time step t,<br />
\[ \Delta \rho \propto \frac{\partial p_{\rho}(a_t|s_t)}{\partial \rho} r(s_t) \]<br />
where <math> r(s_t) </math> is the return.<br />
<br />
POLICY EVALUATION: LEARNING VALUE FUNCTIONS<br />
<br />
A value function <math> v_{\theta}(s) </math> with parameters <math> \theta </math> is trained to estimate the value of being in a certain state. It is trained on records of state-outcome pairs <math> (s, r(s)) </math> by using stochastic gradient descent to minimize the mean squared error (MSE) between the predicted regression value and the corresponding outcome,<br />
\[ \Delta \theta \propto \frac{\partial v_{\theta}(s)}{\partial \theta}(r(s)-v_{\theta}(s)) \]<br />
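<br />
As a rough illustration of the value update above, the sketch below applies the same rule to a linear value function <math>v_{\theta}(s) = \theta \cdot s</math>. The linear form, the toy feature vectors, the learning rate and the function names are assumptions made for illustration; they are not the paper's network.<br />

```python
def value(theta, s):
    """Linear value estimate v_theta(s) = theta . s (an illustrative assumption)."""
    return sum(t * x for t, x in zip(theta, s))


def value_update(theta, s, r, lr=0.1):
    """One gradient step on the squared error between r and v_theta(s).

    For a linear model dv/dtheta = s, so the step is lr * (r - v) * s,
    which matches the update rule quoted above.
    """
    error = r - value(theta, s)
    return [t + lr * error * x for t, x in zip(theta, s)]


# Toy records (s, r(s)): two hypothetical state feature vectors with outcomes.
records = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0)]
theta = [0.0, 0.0]
for _ in range(200):
    for s, r in records:
        theta = value_update(theta, s, r)
```

After training, the predicted values for the two toy states approach their recorded outcomes.<br />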
<br />
=== Monte Carlo Tree Search ===<br />
<br />
Monte Carlo Tree Search (MCTS) is a search algorithm used for finite-horizon tasks (e.g., in curling, only 16 moves, or stone throws, are taken each end).<br />
<br />
MCTS is a tree search algorithm similar to minimax. However, MCTS is probabilistic and does not need to explore a full game tree or even a tree reduced with alpha-beta pruning. This makes it tractable for games such as Go and curling.<br />
<br />
Nodes of the tree are game states, and branches represent actions. Each node stores statistics on how many times it has been visited by the MCTS, as well as the number of wins encountered by playouts from that position. A node is considered 'visited' if a full playout has started from that node. A node is considered 'expanded' if all its children have been visited.<br />
<br />
MCTS begins with the '''selection''' phase, which involves traversing known states/actions. This involves beginning at the root node and selecting the child with the highest 'score'. From each successive node, a path down towards a leaf node is explored in a similar fashion.<br />
<br />
The next phase, '''expansion''', begins when the algorithm reaches a node where not all children have been visited (ie: the node has not been fully expanded). In the expansion phase, children of the node are visited, and '''simulations''' run from their states.<br />
<br />
Once the new child is expanded, '''simulation''' takes place. This refers to a full playout of the game from the point of the current node, and can involve many strategies, such as randomly taken moves, the use of heuristics, etc.<br />
<br />
The final phase is '''update''' or '''back-propagation''' (unrelated to the neural network algorithm). In this phase, the result of the '''simulation''' (ie: win/lose) is updated in the statistics of all parent nodes.<br />
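<br />
The four phases can be sketched on a toy game as follows. This is a minimal sketch, not the paper's implementation: the game (a Nim variant in which players remove 1 to 3 stones and whoever takes the last stone wins), the class and function names, and the iteration count are all assumptions chosen for illustration. It expands one child per iteration, which is a common variant, and scores children with the UCT formula during selection.<br />

```python
import math
import random


class Node:
    def __init__(self, stones, player, parent=None, move=None):
        self.stones = stones      # stones left in the pile
        self.player = player      # player to move at this node (0 or 1)
        self.parent = parent
        self.move = move          # move that led from the parent to this node
        self.children = []
        self.untried = [m for m in (1, 2, 3) if m <= stones]
        self.visits = 0
        self.wins = 0.0           # wins for the player who moved INTO this node


def uct(child, parent_visits, c=math.sqrt(2)):
    """Average win rate plus an exploration bonus."""
    return child.wins / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)


def rollout(stones, player, rng):
    """Random playout; returns the player who takes the last stone."""
    while True:
        stones -= rng.randint(1, min(3, stones))
        if stones == 0:
            return player
        player = 1 - player


def mcts(root_stones, iterations=3000, seed=0):
    rng = random.Random(seed)
    root = Node(root_stones, player=0)
    for _ in range(iterations):
        node = root
        # Selection: descend while the node is fully expanded and not terminal.
        while not node.untried and node.children:
            node = max(node.children, key=lambda ch: uct(ch, node.visits))
        # Expansion: add one child that has not been tried yet, if any.
        if node.untried:
            move = node.untried.pop()
            child = Node(node.stones - move, 1 - node.player, node, move)
            node.children.append(child)
            node = child
        # Simulation: random playout from the selected or newly added node.
        if node.stones == 0:
            winner = 1 - node.player  # the previous mover took the last stone
        else:
            winner = rollout(node.stones, node.player, rng)
        # Update: propagate the result along the path back to the root.
        while node is not None:
            node.visits += 1
            if node.parent is not None and winner == node.parent.player:
                node.wins += 1
            node = node.parent
    # Recommend the most visited move at the root.
    return max(root.children, key=lambda ch: ch.visits).move
```

In this toy game the winning strategy is to leave the opponent a multiple of four stones, so a search from a pile of five should come to favour removing one stone.<br />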
<br />
A selection function known as Upper Confidence Bound applied to Trees (UCT) can be used for choosing which node to select. The formula is shown below [https://www.baeldung.com/java-monte-carlo-tree-search source]. Note that the first term essentially acts as an average score of games played from a certain node. The second term, meanwhile, grows when sibling nodes are visited while the node itself is not. This means that unexplored nodes will gradually increase their UCT score and be selected in the future. This formula serves the purpose of balancing exploitation (first term) and exploration (second term) in Monte Carlo Tree Search. The philosophy is that nodes with high rewards and nodes that are poorly explored should both be explored more often.<br />
<br />
Note that, in theory, the Upper Confidence Bound (UCB) formula achieves asymptotically optimal (logarithmic) regret on the multi-armed bandit problem.<br />
<br />
<math> \frac{w_i}{n_i} + c \sqrt{\frac{\ln t}{n_i}} </math><br />
<br />
In which<br />
<br />
* <math> w_i = </math> number of wins after <math> i</math>th move<br />
* <math> n_i = </math> number of simulations after <math> i</math>th move<br />
* <math> c = </math> exploration parameter (theoretically equal to <math> \sqrt{2}</math>)<br />
* <math> t = </math> total number of simulations for the parent node<br />
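<br />
The formula can be evaluated directly. The sketch below is illustrative only: the function name, the sibling statistics and the convention of giving unvisited nodes an infinite score are assumptions, not taken from the sources.<br />

```python
import math


def uct_score(wins, visits, parent_visits, c=math.sqrt(2)):
    """UCT value w_i / n_i + c * sqrt(ln t / n_i) for one child node."""
    if visits == 0:
        return math.inf  # assumed convention: unvisited children are tried first
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)


# Hypothetical (wins, visits) statistics for three sibling nodes.
children = [(6, 10), (3, 4), (0, 0)]
t = sum(n for _, n in children)  # total simulations for the parent node
scores = [uct_score(w, n, t) for w, n in children]
best = max(range(len(children)), key=scores.__getitem__)
```

Here the unvisited third child is selected first, and among the visited children the less explored second child scores higher than the first because its exploration term is larger.<br />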
<br />
<br />
Sources: 2,3,4<br />
<br />
[[File:MCTS_Diagram.jpg | 500px|center]]<br />
<br />
=== Kernel Regression ===<br />
<br />
Kernel regression is a form of weighted averaging which uses a kernel function as a weight to estimate the conditional expectation of a random variable. Given two data points '''x''', each of which has an associated value '''y''', and a choice of kernel '''K''', the kernel function outputs a weighting factor. An estimate of the value of a new, unseen point is then calculated as the weighted average of the values of surrounding points.<br />
<br />
A typical kernel is a Gaussian kernel, shown below. The formula for calculating estimated value is shown below as well (sources: Lee et al.).<br />
<br />
[[File:gaussian_kernel.png | 400 px]]<br />
<br />
[[File:kernel_regression.png | 250 px]]<br />
<br />
The denominator of the conditional expectation is related to kernel density estimation, which is defined as <math display="inline">W(x)=\sum_{i=0}^n K(x,x_i)</math>.<br />
<br />
In this case, the combination of the two acts to weigh the scores of samples closest to '''x''' more strongly.<br />
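<br />
A minimal sketch of kernel regression and kernel density estimation with a Gaussian kernel is given below. It works on scalar inputs, and the bandwidth value, function names and sample data are assumptions for illustration.<br />

```python
import math


def gaussian_kernel(x, xi, bandwidth=1.0):
    """Gaussian kernel on scalars; the bandwidth is an assumed parameter."""
    return math.exp(-((x - xi) ** 2) / (2.0 * bandwidth ** 2))


def kernel_density(x, xs, bandwidth=1.0):
    """W(x) = sum_i K(x, x_i)."""
    return sum(gaussian_kernel(x, xi, bandwidth) for xi in xs)


def kernel_regression(x, xs, ys, bandwidth=1.0):
    """Average of ys weighted by K(x, x_i), normalised by W(x)."""
    weights = [gaussian_kernel(x, xi, bandwidth) for xi in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


# Hypothetical samples: the estimate at x = 1 is pulled towards nearby values.
xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.0, 4.0]
estimate = kernel_regression(1.0, xs, ys)
```

A smaller bandwidth makes the estimate follow the nearest sample more closely, while a larger one averages over more of the data.<br />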
<br />
= Methods =<br />
<br />
== Variable Definitions ==<br />
<br />
The following variables are used often in the paper:<br />
<br />
* <math>s</math>: A state in the game, as described below as the input to the network.<br />
* <math>s_t</math>: The state at a certain time-step of the game. Time-steps refer to full turns in the game<br />
* <math>a_t</math>: The action taken in state <math>s_t</math><br />
* <math>A_t</math>: The actions taken for sibling nodes related to <math>a_t</math> in MCTS<br />
* <math>n_{a_t}</math>: The number of visits to node <math>a_t</math> in MCTS<br />
* <math>v_{a_t}</math>: The MCTS value estimate of a node<br />
<br />
== Network Design ==<br />
<br />
The authors design a CNN called the 'policy-value' network. The network consists of a common network structure, which is then split into 'policy' and 'value' outputs. This network is trained to learn a probability distribution of actions to take, and expected rewards, given an input state.<br />
<br />
=== Shared Structure ===<br />
<br />
The network consists of 1 convolutional layer followed by 9 residual blocks, each block consisting of 2 convolutional layers with 32 3x3 filters. The structure of this network is shown below:<br />
<br />
<br />
[[File:curling_network_layers.png|600px|thumb|center|Figure 2. A detailed description of our policy-value network. The shared network is composed of one convolutional layer and nine residual blocks. Each residual block (explained in b) has two convolutional layers with batch normalization (Ioffe & Szegedy, 2015[11]) followed by the addition of the input and the residual block. Each layer in the shared network uses 3x3 filters. The policy head has two more convolutional layers, while the value head has two fully connected layers on top of a convolutional layer. For the activation function of each convolutional layer, ReLU (Nair & Hinton[12]) is used.]]<br />
<br />
<br />
<br />
The input to this network is the following:<br />
* Location of stones<br />
* Order to tee (the center of the house)<br />
* A 32x32 grid representation of the ice sheet, indicating which stones are present in each grid cell.<br />
<br />
The authors do not describe how the stone-based information is added to the 32x32 grid as input to the network.<br />
<br />
=== Policy Network ===<br />
<br />
The policy head is created by adding 2 convolutional layers with 2 (two) 3x3 filters to the main body of the network. The output of the policy head is a probability distribution over a 32x32x2 set of actions, from which the best shot is selected. The actions represent target locations in the grid and the spin direction of the stone.<br />
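<br />
The 32x32x2 action set can be treated as a flat vector of 2048 probabilities. The sketch below shows one hypothetical layout (spin first, then row, then column) for converting a flat index back into a grid target and spin direction. The ordering, the spin labels and the function names are assumptions; the summary does not specify the layout actually used.<br />

```python
GRID = 32
SPINS = ("clockwise", "counterclockwise")  # labels are assumptions


def decode_action(index):
    """Map a flat index in [0, 2048) to (row, col, spin) under the assumed layout."""
    spin, cell = divmod(index, GRID * GRID)
    row, col = divmod(cell, GRID)
    return row, col, SPINS[spin]


def best_action(policy):
    """Return the decoded action with the highest probability in a flat policy vector."""
    index = max(range(len(policy)), key=policy.__getitem__)
    return decode_action(index)


# Hypothetical policy output that puts most of its mass on one action.
policy = [0.0] * (GRID * GRID * 2)
policy[GRID * GRID + GRID + 1] = 1.0
shot = best_action(policy)
```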
<br />
[[File:policy-value-net.PNG | 700px]]<br />
<br />
=== Value Network ===<br />
<br />
The value head is created by adding a convolution layer with 1 3x3 filter, and dense layers of 256 and 17 units, to the shared network. The 17 output units represent a probability distribution over scores in the range of [-8,8], which are the possible scores at each end of a curling game.<br />
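<br />
Because the value head outputs a distribution over the 17 possible one-end scores, a single expected score can be read off as a probability-weighted sum. The sketch below assumes the outputs are ordered from -8 to 8; that ordering and the function name are illustrative assumptions.<br />

```python
SCORES = list(range(-8, 9))  # the 17 possible one-end scores, assumed ordered -8..8


def expected_score(probs):
    """Expected one-end score under a 17-way probability distribution."""
    if len(probs) != len(SCORES):
        raise ValueError("expected 17 probabilities")
    return sum(p * s for p, s in zip(probs, SCORES))


# Hypothetical output: 50% chance of scoring 1, 30% of 2, 20% of conceding 1.
probs = [0.0] * 17
probs[SCORES.index(1)] = 0.5
probs[SCORES.index(2)] = 0.3
probs[SCORES.index(-1)] = 0.2
mean_score = expected_score(probs)
```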
<br />
== Continuous Action Search ==<br />
<br />
The policy head of the network only outputs actions from a discretized action space. For real-life interactions, and especially in curling, this will not suffice, as very fine adjustments to actions can make significant differences in outcomes.<br />
<br />
Actions in the continuous space are generated using an MCTS algorithm, with the following steps:<br />
<br />
=== Selection ===<br />
<br />
From a given state, the list of already-visited actions is denoted as A<sub>t</sub>. Scores and the number of visits to each node are estimated using the equations below (the first equation shows the expectation of the end value for one-end games). These are likely estimated rather than simply taken from the MCTS statistics to help account for the differences in a continuous action space.<br />
<br />
[[File:curling_kernel_equations.png | 400px]]<br />
<br />
The UCB formula is then used to select an action to expand.<br />
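<br />
A sketch of this kind of kernel-smoothed selection is given below. It assumes the usual kernel-regression UCT form, in which each visited action's value is a visit-weighted kernel average over its siblings and its visit count is replaced by a kernel density. The Gaussian kernel, its bandwidth, the constant <code>c</code> and all names are assumptions, and the details may differ from the equations in the paper.<br />

```python
import math


def kernel(a, b, bandwidth=0.5):
    """Gaussian kernel between two 2-D actions (the bandwidth is an assumed value)."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2.0 * bandwidth ** 2))


def kr_select(actions, values, visits, c=1.0):
    """Index of the visited action with the highest kernel-smoothed UCB score.

    Every action is assumed to have at least one visit, so each density is >= 1.
    """
    density = [sum(kernel(a, b) * n for b, n in zip(actions, visits)) for a in actions]
    total = sum(density)
    best, best_score = None, -math.inf
    for i, a in enumerate(actions):
        weighted = sum(kernel(a, b) * v * n for b, v, n in zip(actions, values, visits))
        score = weighted / density[i] + c * math.sqrt(math.log(total) / density[i])
        if score > best_score:
            best, best_score = i, score
    return best


# Hypothetical statistics: two nearby, well-visited actions and one isolated action.
actions = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
exploit = kr_select(actions, [1.0, 1.0, 0.0], [10, 10, 10], c=0.0)
explore = kr_select(actions, [1.0, 1.0, 0.0], [10, 10, 1], c=5.0)
```

With no exploration bonus the search picks one of the two good neighbouring actions, while a large bonus favours the isolated, rarely visited action.<br />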
<br />
The actions that are taken in the simulator appear to be drawn from a Gaussian centered around <math>a_t</math>. This allows exploration in the continuous action space.<br />
<br />
=== Expansion ===<br />
<br />
The authors use a variant of regular UCT for expansion. In this case, they expand a new node only when existing nodes have been visited a certain number of times. The authors utilize a widening approach to overcome problems with standard UCT performing a shallow search when there is a large action space.<br />
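<br />
One common way to implement such a widening rule is to allow a new child only while the number of children is below a slowly growing function of the parent's visit count. The sketch below uses the form <code>c * visits ** alpha</code>; the constants and names are assumptions, and the exact criterion used by the authors may differ.<br />

```python
def should_expand(num_children, total_visits, c=1.0, alpha=0.5):
    """Allow a new child only while num_children < c * total_visits ** alpha."""
    return num_children < c * total_visits ** alpha


def children_after(visits, c=1.0, alpha=0.5):
    """Count how many children exist after a given number of visits to the parent."""
    children = 0
    for n in range(1, visits + 1):
        if should_expand(children, n, c, alpha):
            children += 1
    return children
```

Under this rule the number of children grows roughly with the square root of the visit count, so existing actions are revisited many times before the search widens further.<br />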
<br />
=== Simulation ===<br />
<br />
Instead of simulating with a random game playout, the authors use the value network to estimate the likely score associated with a state. This speeds up simulation (assuming the network is well trained), as the game does not actually need to be simulated.<br />
<br />
=== Backpropagation ===<br />
<br />
Standard backpropagation is used, updating both the values and number of visits stored in the path of parent nodes.<br />
<br />
<br />
== Supervised Learning ==<br />
<br />
During supervised training, data is gathered from the program AyumuGAT'16 ([9]). This program is a high-performance AI curling program that is also based on an MCTS algorithm. 400 000 state-action pairs were generated for this training.<br />
<br />
=== Policy Network ===<br />
<br />
The policy network was trained to learn the action taken in each state. Here, the likelihood of the taken action was set to be 1, and the likelihood of other actions to be 0.<br />
<br />
=== Value Network ===<br />
<br />
The value network was trained by 'd-depth simulations and bootstrapping of the prediction to handle the high variance in rewards resulting from a sequence of stochastic moves' (quote taken from paper). In this case, ''m'' state-action pairs were sampled from the training data. For each pair, <math>(s_t, a_t)</math>, a state ''d'' steps ahead was generated, <math>s_{t+d}</math>. This process dealt with uncertainty by treating every action in the rollout except the last as having no uncertainty, and allowing uncertainty only in the last action, ''a<sub>t+d-1</sub>''. The value network is used to predict the value for this state, <math>z_t</math>, and the value is used for learning the value at ''s<sub>t</sub>''.<br />
<br />
=== Policy-Value Network ===<br />
<br />
The policy-value network was trained to maximize the similarity of the predicted policy and value, and the actual policy and value from a state. The learning algorithm parameters are:<br />
<br />
* Algorithm: stochastic gradient descent<br />
* Batch size: 256<br />
* Momentum: 0.9<br />
* L2 regularization: 0.0001<br />
* Training time: ~100 epochs<br />
* Learning rate: initialized at 0.01, reduced twice<br />
<br />
A multi-task loss function was used. This takes the summation of the cross-entropy losses of each prediction:<br />
<br />
[[File:curling_loss_function.png | 300px]]<br />
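<br />
A sketch of a loss of this form is shown below: the sum of a cross-entropy term for the policy output and one for the value output. Equal weighting of the two terms, the small constant inside the logarithm and the function names are assumptions, and the L2 regularization term listed above is left out.<br />

```python
import math


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def policy_value_loss(pi_target, p_pred, z_target, v_pred):
    """Sum of the policy and value cross-entropy terms (equal weights assumed)."""
    return cross_entropy(pi_target, p_pred) + cross_entropy(z_target, v_pred)


# Hypothetical 3-action policy and 3-outcome value distributions.
pi_target = [0.0, 1.0, 0.0]   # one-hot target, as in supervised training
z_target = [0.0, 0.0, 1.0]
good = policy_value_loss(pi_target, [0.1, 0.8, 0.1], z_target, [0.1, 0.1, 0.8])
bad = policy_value_loss(pi_target, [0.8, 0.1, 0.1], z_target, [0.8, 0.1, 0.1])
```

Predictions that put more probability on the target entries give a lower loss.<br />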
<br />
== Self-Play Reinforcement Learning ==<br />
<br />
After initialization by supervised learning, the algorithm uses self-play to further train itself. During this training, the policy network learns probabilities from the MCTS process, while the value network learns from game outcomes.<br />
<br />
At a game state ''s<sub>t</sub>'':<br />
<br />
1) the algorithm outputs a prediction ''z<sub>t</sub>''. This is an estimate of game score probabilities. It is based on similar past actions, and computed using kernel regression.<br />
<br />
2) the algorithm outputs a prediction <math>\pi_t</math>, representing a probability distribution of actions. These are proportional to estimated visit counts from MCTS, based on kernel density estimation.<br />
<br />
It is not clear how these predictions are created. It would seem likely that the policy-value network generates these, but the wording of the paper suggests they are generated from MCTS statistics.<br />
<br />
The policy-value network is updated by sampling data <math>(s, \pi, z)</math> from recent history of self-play. The same loss function is used as before.<br />
<br />
It is not clear how the improved network is used, as MCTS seems to be the driving process at this point.<br />
<br />
== Long-Term Strategy Learning ==<br />
<br />
Finally, the authors implement a new strategy to augment their algorithm for long-term play. In this context, this refers to playing a game over many ends, where the strategy to win a single end may not be a good strategy to win a full game. For example, scoring one point in an end, while being one point ahead, gives the advantage to the other team in the next round (as they will throw the last stone). The other team could then use the advantage to score two points, taking the lead.<br />
<br />
The authors build a 'winning percentage' table. This table stores the percentage of games won, based on the number of ends left, and the difference in score (current team - opposing team). This can be computed iteratively and using the probability distribution estimation of one-end scores.<br />
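<br />
The iterative computation can be sketched as a small dynamic program over the number of ends left and the score difference. The version below is a simplification under stated assumptions: it uses a single fixed one-end score distribution, ignores which team has the hammer, counts a tied game as half a win, and clips extreme score differences. The function name and the example distribution are hypothetical.<br />

```python
def win_table(score_probs, max_ends, max_diff=20):
    """table[e][d] = estimated win probability with e ends left and score difference d.

    score_probs maps a one-end score k (from the current team's view) to its probability.
    """
    def clip(d):
        return max(-max_diff, min(max_diff, d))

    diffs = range(-max_diff, max_diff + 1)
    # With no ends left the game is decided by the sign of the difference.
    table = [{d: 1.0 if d > 0 else 0.5 if d == 0 else 0.0 for d in diffs}]
    for _ in range(max_ends):
        prev = table[-1]
        table.append({d: sum(p * prev[clip(d + k)] for k, p in score_probs.items())
                      for d in diffs})
    return table


# Hypothetical symmetric distribution: each end is won or lost by one point.
table = win_table({-1: 0.5, 1: 0.5}, max_ends=2)
```

A table like this lets the search maximise the chance of winning the whole game rather than the expected score of the current end.<br />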
<br />
== Final Algorithms ==<br />
<br />
The authors make use of the following versions of their algorithm:<br />
<br />
=== KR-DL ===<br />
<br />
''Kernel regression-deep learning'': This algorithm is trained only by supervised learning.<br />
<br />
=== KR-DRL ===<br />
<br />
''Kernel regression-deep reinforcement learning'': This algorithm is trained by supervised learning (ie: initialized as the KR-DL algorithm), and again on self-play. During self-play, each shot is selected after 400 MCTS simulations of k=20 randomly selected actions. Data for self-play was collected over a week on 5 GPUs and generated 5 million game positions. The policy-value network was continually updated using samples from the latest 1 million game positions.<br />
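<br />
Training on samples from only the latest positions can be sketched with a fixed-size buffer that drops its oldest entries. The class below is a hypothetical illustration; its name, methods and the tiny capacity in the example are assumptions, not the authors' code.<br />

```python
import random
from collections import deque


class ReplayBuffer:
    """Keeps only the most recent self-play positions (s, pi, z)."""

    def __init__(self, capacity):
        self.positions = deque(maxlen=capacity)  # oldest entries are dropped

    def add(self, state, pi, z):
        self.positions.append((state, pi, z))

    def sample(self, batch_size, rng=random):
        """Draw a batch uniformly without replacement from the stored positions."""
        size = min(batch_size, len(self.positions))
        return rng.sample(list(self.positions), size)


# Tiny example: a capacity of 3 stands in for the latest 1 million positions.
buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(f"state{step}", None, None)
batch = buffer.sample(2, random.Random(0))
```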
<br />
=== KR-DRL-MES ===<br />
<br />
''Kernel regression-deep reinforcement learning-multi-ends-strategy'': This algorithm makes use of the winning percentage table generated from self-play.<br />
<br />
= Testing and Results =<br />
The authors use data from the public program AyumuGAT’16 to test. Testing is done with a simulated curling program [10]. This simulator does not deal with changing ice conditions, or sweeping, but does deal with stone trajectories and collisions.<br />
<br />
== Comparison of KR-DL-UCT and DL-UCT ==<br />
<br />
The first test compares an algorithm trained with kernel regression with an algorithm trained without kernel regression, to show the contribution that kernel regression adds to the performance. Both algorithms have networks initialised with the supervised learning, and then trained with two different algorithms for self-play. KR-DL-UCT uses the algorithm described above. The authors do not go into detail on how DL-UCT selects shots, but state that a constant is set to allow exploration.<br />
<br />
As an evaluation, both algorithms play 2000 games against the DL-UCT algorithm, which is frozen after supervised training. 1000 games are played with the algorithm taking the first shot, and 1000 with it taking the second shot. The games were two-end games. The figure below shows each algorithm's winning percentage given different amounts of training data. While the DL-UCT outperforms the supervised-training-only DL-UCT algorithm, the KR-DL-UCT algorithm performs much better.<br />
<br />
<center>[[File:curling_KR_test.png | 400px]]</center><br />
<br />
== Matches ==<br />
<br />
Finally, to test the performance of their multiple algorithms, the authors run matches between their algorithms and other existing programs. Each algorithm plays 200 matches against each other program, 100 of which are played as the first-playing team, and 100 as the second-playing team. Only 1 program was able to out-perform the KR-DRL algorithm. The authors state that this program, ''JiritsukunGAT'17'', also uses a deep network and hand-crafted features. However, the KR-DRL-MES algorithm was still able to out-perform it. Figure 4 shows the Elo ratings of the different programs. Note that the programs in blue are those created by the authors. They also played some games between their KR-DRL-MES and notable programs. Table 1 shows the details of the match results. ''JiritsukunGAT'17'' shows a similar level of performance, but KR-DRL-MES is still the winner.<br />
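<br />
Elo ratings such as those reported in Figure 4 relate to expected winning percentages through the standard Elo formula. The sketch below assumes the conventional scale factor of 400; the summary does not state which constants the authors used, and the ratings in the example are made up.<br />

```python
def elo_expected(rating_a, rating_b, scale=400.0):
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Hypothetical ratings: a 400-point gap corresponds to roughly a 91% expected score.
stronger = elo_expected(1900, 1500)
weaker = elo_expected(1500, 1900)
```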
<br />
<br />
<br />
[[File:curling_ratings.png|600px|thumb|center|Figure 4. Elo rating and winning percentages of our models and GAT rankers. Each match has 200 games (each program plays 100 pre-ordered games), because the player which has the last shot (the hammer shot) in each end would have an advantage.]]<br />
<br />
<br />
[[File:ttt.png|600px|thumb|center|Table 1. The 8-end game results for KR-DRL-MES against other programs alternating the opening player each game. The matches are held by following the rules of the latest GAT competition.]]<br />
<br />
= Conclusion & Critique =<br />
<br />
The authors have presented a new framework which incorporates a deep neural network for learning game strategy with a kernel-based Monte Carlo tree search from a continuous space. Without the use of any hand-crafted feature, their policy-value network is successfully trained using supervised learning followed by reinforcement learning with a high-fidelity simulator for the Olympic sport of curling. Following are my critiques on the paper:<br />
<br />
== Strengths ==<br />
<br />
This algorithm out-performs other high-performance algorithms (including past competition champions).<br />
<br />
I think the paper does a decent job of comparing the performance of their algorithm to others. They are able to clearly show the benefits of many of their additions.<br />
<br />
The authors do seem to be able to adopt strategies similar to those used in Go and other games to the continuous action-space domain. In addition, the final strategy needs no hand-crafted features for learning.<br />
<br />
== Weaknesses ==<br />
<br />
Sometimes, I found this paper difficult to follow. One problem was that the algorithms were introduced first, and then how they were used was described. So when the paper stated that self-play shots were taken after 400 simulations, it seemed unclear what simulations were being run and at what stage of the algorithm (ex: MCTS simulations, simulations sped up by using the value network, full simulations on the curling simulator). In particular, both the MCTS statistics and the policy-value network could be used to estimate both action probabilities and state values, so it is difficult to tell which is used in which case. There was also no clear distinction between discrete-space actions and continuous-space actions.<br />
<br />
While I think the comparison of different algorithms was done well, I believe it still lacked significant details. There were one-off statements in the paper which would have been nice to see backed by results. These include the statement that having a policy-value network in place of two networks led to better performance.<br />
<br />
At this point, the algorithms used still rely on initialization by a pre-made program.<br />
<br />
There was little theoretical development or justification done in this paper.<br />
<br />
While curling is an interesting choice for demonstrating the algorithm, the fact that the simulations used did not support many of the key aspects of curling (ice conditions, sweeping) seems very limiting. Another game, such as pool, would likely have offered some of the same challenges while allowing higher-fidelity simulations/training.<br />
<br />
While the spatial placements of stones were discretized in a grid, the curl of thrown stones was discretized to only +/-1. This seems like it may limit learning high- and low-spin moves. It should be noted that having zero spins is not commonly used, to the best of my knowledge.<br />
<br />
=References=<br />
# Lee, K., Kim, S., Choi, J. & Lee, S. "Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling." Proceedings of the 35th International Conference on Machine Learning, in PMLR 80:2937-2946 (2018)<br />
# https://www.baeldung.com/java-monte-carlo-tree-search<br />
# https://jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/<br />
# https://int8.io/monte-carlo-tree-search-beginners-guide/<br />
# https://en.wikipedia.org/wiki/Monte_Carlo_tree_search<br />
# Silver, D., Huang, A., Maddison, C., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of go with deep neural networks and tree search. Nature, pp. 484–489, 2016.<br />
# Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., and Hassabis, D. Mastering the game of go without human knowledge. Nature, pp. 354–359, 2017.<br />
# Yamamoto, M., Kato, S., and Iizuka, H. Digital curling strategy based on game tree search. In Proceedings of the IEEE Conference on Computational Intelligence and Games, CIG, pp. 474–480, 2015.<br />
# Ohto, K. and Tanaka, T. A curling agent based on the Monte-Carlo tree search considering the similarity of the best action among similar states. In Proceedings of Advances in Computer Games, ACG, pp. 151–164, 2017.<br />
# Ito, T. and Kitasei, Y. Proposal and implementation of digital curling. In Proceedings of the IEEE Conference on Computational Intelligence and Games, CIG, pp. 469–473, 2015.<br />
# Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, ICML, pp. 448–456, 2015.<br />
# Nair, V. and Hinton, G. Rectified linear units improve restricted boltzmann machines.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Reinforcement_Learning_in_Continuous_Action_Spaces_a_Case_Study_in_the_Game_of_Simulated_Curling&diff=42150Deep Reinforcement Learning in Continuous Action Spaces a Case Study in the Game of Simulated Curling2018-11-30T23:11:46Z<p>Z43ma: </p>
<hr />
<div>This page provides a summary and critique of the paper '''Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling''' [http://proceedings.mlr.press/v80/lee18b/lee18b.pdf Online Source], published in ICML 2018. The source code for this paper is available [https://github.com/leekwoon/KR-DL-UCT here].<br />
<br />
= Introduction and Motivation =<br />
<br />
In recent years, reinforcement learning methods have been applied to many different games, such as chess and checkers. More recently, the use of CNNs has allowed neural networks to outperform humans in many difficult games, such as Go. However, many of these cases involve a discrete state or action space; the number of actions a player can take and/or the number of possible game states are finite. Deep CNNs are not directly applicable to large, non-convex continuous action spaces. To solve this issue, the authors conduct a policy search with an efficient stochastic continuous action search on top of policy samples generated from a deep CNN. The deep CNN still discretizes the state space and the action space. However, in the stochastic continuous action search, the restriction of the deterministic discretization is lifted and a local search procedure is conducted in a physical simulator with continuous action samples. In this way, the benefits of both deep neural networks and physical simulators can be realized.<br />
<br />
Interacting with the real world (e.g., a scenario that involves moving physical objects) typically involves working with a continuous action space. It is thus important to develop strategies for dealing with continuous action spaces. Deep neural networks that are designed to succeed in finite action spaces are not necessarily suitable for continuous action space problems. This is because deterministic discretization of a continuous action space causes strong biases in policy evaluation and improvement.<br />
<br />
This paper introduces a method to allow learning with continuous action spaces. A CNN is used to perform learning on discretized state and action spaces, and then a continuous action search is performed on these discrete results.<br />
<br />
Curling is chosen as a domain to test the network on. Curling was chosen due to its large action space, potential for complicated strategies, and need for precise interactions.<br />
<br />
== Curling ==<br />
<br />
Curling is a sport played by two teams on a long sheet of ice. Roughly, the goal is for each team to slide rocks closer to the target on the other end of the sheet than the other team. The next sections provide background on the game play and potential challenges/concerns for learning algorithms. A terminology section follows.<br />
<br />
=== Game play ===<br />
<br />
A game of curling is divided into ends. In each end, players from both teams alternate throwing (sliding) eight rocks to the other end of the ice sheet, known as the house. Rocks must land in a certain area in order to stay in play, and must touch or be inside concentric rings (12ft diameter and smaller) in order to score points. At the end of each end, the team with rocks closest to the center of the house scores points.<br />
<br />
When throwing a rock, the curler can spin the rock. This allows the rock to 'curl' its path towards the house and can allow rocks to travel around other rocks. Team members are also able to sweep the ice in front of a moving rock in order to decrease friction, which allows for fine-tuning of distance (though the physics of sweeping are not implemented in the simulation used).<br />
<br />
Curling offers many possible high-level actions, which are directed by a team member to the throwing member. An example set of these includes:<br />
<br />
* Draw: Throw a rock to a target location<br />
* Freeze: Draw a rock up against another rock<br />
* Takeout: Knock another rock out of the house. Can be combined with different ricochet directions<br />
* Guard: Place a rock in front of another, to block other rocks (ex: takeouts)<br />
<br />
=== Challenges for AI ===<br />
<br />
Curling offers many challenges for AI based on its physics and rules. This section lists a few concerns.<br />
<br />
The effect of changing actions can be highly nonlinear and discontinuous. This can be seen when considering that a 1-cm deviation in a path can make the difference between a high-speed collision, or lack of collision.<br />
<br />
Curling will require both offensive and defensive strategies. For example, consider the fact that the last team to throw a rock each end only needs to place that rock closer than the opposing team's rocks to score a point and invalidate any opposing rocks in the house. The opposing team should thus be considering how to prevent this from happening, in addition to scoring points themselves.<br />
<br />
Curling also has a concept known as 'the hammer'. The hammer belongs to the team which throws the last rock each end, providing an advantage, and is given to the team that does not score points each end. It could very well be a good strategy to try not to win a single point in an end (if already ahead in points, etc), as this would give the advantage to the opposing team.<br />
<br />
Finally, curling has a rule known as the 'Free Guard Zone'. This applies to the first 4 rocks thrown (2 from each team). If they land short of the house, but still in play, then the rocks are not allowed to be removed (via collisions) until all of the first 4 rocks have been thrown.<br />
<br />
=== Terminology ===<br />
<br />
* End: A round of the game<br />
* House: The end of the sheet of ice, which contains the concentric scoring rings<br />
* Hammer: The team that throws the last rock of an end 'has the hammer'<br />
* Hog Line: thick line that is drawn in front of the house, orthogonal to the length of the ice sheet. Rocks must pass this line to remain in play.<br />
* Back Line: thick line drawn just behind the house. Rocks that pass this line are removed from play.<br />
<br />
<br />
== Related Work ==<br />
<br />
=== AlphaGo Lee ===<br />
<br />
AlphaGo Lee (Silver et al., 2016, [5]) refers to an algorithm used to play the game Go, which was able to defeat international champion Lee Sedol. <br />
<br />
<br />
Go game:<br />
* Start with 19x19 empty board<br />
* One player takes black stones and the other takes white stones<br />
* Two players take turns to put stones on the board<br />
* Once the stone has been placed, the stones cannot be moved anymore<br />
* Rules:<br />
1. If one connected part is completely surrounded by the opponent's stones, remove it from the board<br />
<br />
2. Ko rule: Forbids a board play to repeat a board position<br />
* End when there are no valuable moves. <br />
* Count the territory of both players. The objective of the game is to capture more territory than your opponent. The player with the black stones plays first. However, the black player needs to give 7.5 points (called komi) to the white player as a tradeoff. There are some variations in how many points the player with the black stones should give, based on the rules used in different Asian countries.<br />
* This game used to be a huge challenge for artificial intelligence for two reasons. One is that the search space is extremely large. It is estimated to be on the order of <math>10^{172}</math>, which is more than the number of atoms in the universe, and much larger than the number of game states in chess (<math>10^{47}</math>). The other is that there was no good heuristic function for evaluating a position in Go, so the traditional alpha-beta pruning algorithm does not perform well. For AlphaGo Lee, the CNN plays the role of a good heuristic function, which results in a huge performance improvement of the AI.<br />
[[File:go.JPG|700px|center]]<br />
<br />
Two neural networks were trained on the moves of human experts, to act as both a policy network and a value network. A Monte Carlo Tree Search algorithm was used for policy improvement.<br />
<br />
The AlphaGo Lee policy network predicts the best move given a board configuration. It has a CNN architecture with 13 hidden layers, and it is trained using expert game play data and improved through self-play.<br />
<br />
The value network evaluates the probability of winning given a board configuration. It consists of a CNN with 14 hidden layers, and it is trained using self-play data from the policy network. <br />
<br />
Finally, the two networks are combined using Monte-Carlo Tree Search, which performs a look-ahead search to select the actions for gameplay.<br />
<br />
The use of both policy and value networks is reflected in this paper's work.<br />
<br />
=== AlphaGo Zero ===<br />
<br />
AlphaGo Zero (Silver et al., 2017, [6]) is an improvement on the AlphaGo Lee algorithm. AlphaGo Zero uses a unified neural network in place of the separate policy and value networks and is trained on self-play, without the need of expert training.<br />
Previous versions of AlphaGo initially trained on thousands of human amateur and professional games to learn how to play Go. AlphaGo Zero skips this step and learns to play simply by playing games against itself, starting from completely random play. In doing so, it quickly surpassed human level of play and defeated the previously published champion-defeating version of AlphaGo by 100 games to 0.<br />
It is able to do this by using a novel form of reinforcement learning, in which AlphaGo Zero becomes its own teacher. The system starts off with a neural network that knows nothing about the game of Go. It then plays games against itself, by combining this neural network with a powerful search algorithm. As it plays, the neural network is tuned and updated to predict moves, as well as the eventual winner of the games.<br />
<br />
This updated neural network is then recombined with the search algorithm to create a new, stronger version of AlphaGo Zero, and the process begins again. In each iteration, the performance of the system improves by a small amount, and the quality of the self-play games increases, leading to more and more accurate neural networks and ever stronger versions of AlphaGo Zero.<br />
<br />
This technique is more powerful than previous versions of AlphaGo because it is no longer constrained by the limits of human knowledge. Instead, it is able to learn tabula rasa from the strongest player in the world: AlphaGo itself.<br />
<br />
Other differences from the previous AlphaGo iterations are as follows. AlphaGo Zero only uses the black and white stones from the Go board as its input, whereas previous versions of AlphaGo included a small number of hand-engineered features. It uses one neural network rather than two. Earlier versions of AlphaGo used a “policy network” to select the next move to play and a “value network” to predict the winner of the game from each position. These are combined in AlphaGo Zero, allowing it to be trained and evaluated more efficiently. AlphaGo Zero does not use “rollouts” - fast, random games used by other Go programs to predict which player will win from the current board position. Instead, it relies on its high quality neural networks to evaluate positions. All of these differences help improve the performance of the system and make it more general. But it is the algorithmic change that makes the system much more powerful and efficient.<br />
<br />
The unification of networks and self-play are also reflected in this paper.<br />
<br />
=== Curling Algorithms ===<br />
<br />
Some past algorithms have been proposed to deal with continuous action spaces. For example, (Yamamoto et al., 2015, [7]) use game tree search methods in a discretized space. The value of an action is taken as the average of nearby values, with respect to some knowledge of execution uncertainty.<br />
<br />
=== Monte Carlo Tree Search ===<br />
<br />
Monte Carlo Tree Search algorithms have been applied to continuous action spaces. These algorithms, to be discussed in further detail, balance exploration of different states with knowledge of paths of execution through past games. An MCTS variant called KR-UCT has been applied to continuous action spaces; it finds effective selections and uses kernel regression (KR) and kernel density estimation (KDE) to estimate rewards from neighborhood information.<br />
<br />
For bandit problems, hierarchical optimistic optimization (HOO) has been used to build a cover tree that divides the action space into small ranges at different depths, where the most promising nodes produce finer-grained estimates.<br />
<br />
=== Curling Physics and Simulation ===<br />
<br />
Several references in the paper refer to the study and simulation of curling physics. Scholars have analyzed friction coefficients between curling stones and ice. Because modelling the changes in friction on ice is not possible, a fixed friction coefficient was predefined in the simulation. The behavior of the stones was also modeled. Important parameters are trained from professional players. The authors used the same parameters in this paper.<br />
<br />
== General Background of Algorithms ==<br />
<br />
=== Policy and Value Functions ===<br />
<br />
A policy function is trained to provide the best action to take, given a current state. Policy iteration is an algorithm used to improve a policy over time. This is done by alternating between policy evaluation and policy improvement.<br />
<br />
POLICY IMPROVEMENT: LEARNING ACTION POLICY<br />
<br />
Action policy <math> p_{\rho}(a|s) </math> outputs a probability distribution over all eligible moves <math> a </math>. Here <math> \rho </math> denotes the weights of a neural network that approximates the policy, <math>s</math> denotes a state, and <math>a</math> denotes an action taken in the environment. The policy is a function that returns an action given the state the agent is in. Policy gradient reinforcement learning can be used to train the action policy. The weights are updated by stochastic gradient ascent in the direction that maximizes the expected outcome at each time step <math>t</math>,<br />
\[ \Delta \rho \propto \frac{\partial p_{\rho}(a_t|s_t)}{\partial \rho} r(s_t) \]<br />
where <math> r(s_t) </math> is the return.<br />
<br />
POLICY EVALUATION: LEARNING VALUE FUNCTIONS<br />
<br />
A value function <math> v_{\theta}(s) </math> with parameters <math> \theta </math> is trained to estimate the value of being in a certain state. It is trained on records of state-outcome pairs <math> (s, r(s)) </math> by using stochastic gradient descent to minimize the mean squared error (MSE) between the predicted regression value and the corresponding outcome,<br />
\[ \Delta \theta \propto \frac{\partial v_{\theta}(s)}{\partial \theta}(r(s)-v_{\theta}(s)) \]<br />
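<br />
As a minimal sketch of the update rule above, assume a linear value function <math>v_{\theta}(s) = \theta \cdot \phi(s)</math>, so that the gradient with respect to <math>\theta</math> is simply the feature vector. The linear form, the function name <code>value_update</code> and the learning rate are illustrative assumptions; the paper uses a deep network instead.<br />
<br />
```python
def value_update(theta, features, outcome, lr=0.1):
    """One stochastic gradient step on the squared error for a linear value function.

    theta    -- current weights (list of floats)
    features -- feature vector phi(s) of the state (assumed representation)
    outcome  -- observed return r(s)
    """
    prediction = sum(t * f for t, f in zip(theta, features))
    error = outcome - prediction  # r(s) - v_theta(s)
    # For a linear model, d v_theta / d theta is the feature vector itself.
    return [t + lr * error * f for t, f in zip(theta, features)]


theta = [0.0, 0.0]
for _ in range(60):
    theta = value_update(theta, [1.0, 2.0], outcome=1.0)
```
<br />
Repeating the step on the same sample moves the prediction toward the observed outcome, which is the behavior the proportionality above describes.<br />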
<br />
=== Monte Carlo Tree Search ===<br />
<br />
Monte Carlo Tree Search (MCTS) is a search algorithm used for finite-horizon tasks (ex: in curling, only 16 moves, or stone throws, are taken each end).<br />
<br />
MCTS is a tree search algorithm similar to minimax. However, MCTS is probabilistic and does not need to explore a full game tree or even a tree reduced with alpha-beta pruning. This makes it tractable for games such as Go and curling.<br />
<br />
Nodes of the tree are game states, and branches represent actions. Each node stores statistics on how many times it has been visited by the MCTS, as well as the number of wins encountered by playouts from that position. A node is considered 'visited' if a full playout has started from that node. A node is considered 'expanded' if all its children have been visited.<br />
<br />
MCTS begins with the '''selection''' phase, which involves traversing known states/actions. Beginning at the root node, the algorithm selects the child with the highest 'score'. From each successive node, a path down toward a leaf node is explored in a similar fashion.<br />
<br />
The next phase, '''expansion''', begins when the algorithm reaches a node where not all children have been visited (ie: the node has not been fully expanded). In the expansion phase, children of the node are visited, and '''simulations''' run from their states.<br />
<br />
Once the new child is expanded, '''simulation''' takes place. This refers to a full playout of the game from the point of the current node, and can involve many strategies, such as randomly taken moves, the use of heuristics, etc.<br />
<br />
The final phase is '''update''' or '''back-propagation''' (unrelated to the neural network algorithm). In this phase, the result of the '''simulation''' (ie: win/lose) is updated in the statistics of all parent nodes.<br />
<br />
A selection function known as Upper Confidence Bound applied to Trees (UCT) can be used to choose which node to select. The formula is shown below [https://www.baeldung.com/java-monte-carlo-tree-search source]. Note that the first term essentially acts as an average score of games played from a certain node. The second term, meanwhile, will grow when sibling nodes are expanded. This means that unexplored nodes will gradually increase their UCT score, and be selected in the future.<br />
<br />
<math> \frac{w_i}{n_i} + c \sqrt{\frac{\ln t}{n_i}} </math><br />
<br />
In which<br />
<br />
* <math> w_i = </math> number of wins after <math> i</math>th move<br />
* <math> n_i = </math> number of simulations after <math> i</math>th move<br />
* <math> c = </math> exploration parameter (theoretically equal to <math> \sqrt{2}</math>)<br />
* <math> t = </math> total number of simulations for the parent node<br />
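<br />
The formula above can be sketched in a few lines of Python. The function names and the <code>(wins, visits)</code> tuple layout are hypothetical choices for illustration, and treating unvisited children as having infinite score is a common convention rather than something stated in the paper.<br />
<br />
```python
import math


def uct_score(wins, visits, parent_visits, c=math.sqrt(2)):
    """UCT score of a child: average win rate plus an exploration bonus."""
    if visits == 0:
        return math.inf  # assumed convention: unvisited children are tried first
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children, parent_visits):
    """Return the index of the child with the highest UCT score.

    children -- list of (wins, visits) tuples (hypothetical layout)
    """
    scores = [uct_score(w, n, parent_visits) for w, n in children]
    return scores.index(max(scores))
```
<br />
With 12 parent simulations, a child with 1 win in 2 visits outranks a child with 9 wins in 10 visits, because the exploration term is much larger for the rarely visited child.<br />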
<br />
<br />
Sources: 2,3,4<br />
<br />
[[File:MCTS_Diagram.jpg | 500px|center]]<br />
<br />
=== Kernel Regression ===<br />
<br />
Kernel regression is a form of weighted averaging which uses a kernel function as a weight to estimate the conditional expectation of a random variable. Given two data points '''x''', each of which has an associated value '''y''', and a choice of kernel '''K''', the kernel function outputs a weighting factor. An estimate of the value of a new, unseen point is then calculated as the weighted average of the values of surrounding points.<br />
<br />
A typical kernel is a Gaussian kernel, shown below. The formula for calculating estimated value is shown below as well (sources: Lee et al.).<br />
<br />
[[File:gaussian_kernel.png | 400 px]]<br />
<br />
[[File:kernel_regression.png | 250 px]]<br />
<br />
The denominator of the conditional expectation is related to kernel density estimation, which is defined as <math display="inline">W(x)=\sum_{i=0}^n K(x,x_i)</math>.<br />
<br />
In this case, the combination of the two acts to weight the scores of samples closest to '''x''' more strongly.<br />
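<br />
A minimal sketch of kernel regression with a Gaussian kernel follows. It assumes an isotropic kernel with a single <code>bandwidth</code> parameter; the exact kernel parameterization used in the paper is given in the figure above and may differ (for example, by using a full covariance matrix). All names are illustrative.<br />
<br />
```python
import math


def gaussian_kernel(x, xi, bandwidth=1.0):
    """Isotropic Gaussian kernel weight between two points (tuples of floats)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def kernel_regression(x, xs, ys, bandwidth=1.0):
    """Return (estimate, density) at x from observed points xs with values ys.

    density is W(x), the sum of kernel weights (the kernel density term).
    """
    weights = [gaussian_kernel(x, xi, bandwidth) for xi in xs]
    density = sum(weights)
    estimate = sum(w * y for w, y in zip(weights, ys)) / density
    return estimate, density


xs = [(0.0,), (2.0,)]
ys = [1.0, 3.0]
midpoint_estimate, _ = kernel_regression((1.0,), xs, ys)
near_first_estimate, _ = kernel_regression((0.0,), xs, ys)
```
<br />
At the midpoint both samples receive equal weight, so the estimate is the plain average; at a query point on top of the first sample, the estimate is pulled toward that sample's value.<br />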
<br />
= Methods =<br />
<br />
== Variable Definitions ==<br />
<br />
The following variables are used often in the paper:<br />
<br />
* <math>s</math>: A state in the game, as described below as the input to the network.<br />
* <math>s_t</math>: The state at a certain time-step of the game. Time-steps refer to full turns in the game<br />
* <math>a_t</math>: The action taken in state <math>s_t</math><br />
* <math>A_t</math>: The actions taken for sibling nodes related to <math>a_t</math> in MCTS<br />
* <math>n_{a_t}</math>: The number of visits to node <math>a_t</math> in MCTS<br />
* <math>v_{a_t}</math>: The MCTS value estimate of a node<br />
<br />
== Network Design ==<br />
<br />
The authors design a CNN called the 'policy-value' network. The network consists of a common network structure, which is then split into 'policy' and 'value' outputs. This network is trained to learn a probability distribution of actions to take, and expected rewards, given an input state.<br />
<br />
=== Shared Structure ===<br />
<br />
The network consists of 1 convolutional layer followed by 9 residual blocks, each block consisting of 2 convolutional layers with 32 3x3 filters. The structure of this network is shown below:<br />
<br />
<br />
[[File:curling_network_layers.png|600px|thumb|center|Figure 2. A detailed description of our policy-value network. The shared network is composed of one convolutional layer and nine residual blocks. Each residual block (explained in b) has two convolutional layers with batch normalization (Ioffe & Szegedy, 2015[11]) followed by the addition of the input and the residual block. Each layer in the shared network uses 3x3 filters. The policy head has two more convolutional layers, while the value head has two fully connected layers on top of a convolutional layer. For the activation function of each convolutional layer, ReLU (Nair & Hinton[12]) is used.]]<br />
<br />
<br />
<br />
The input to this network is the following:<br />
* Location of stones<br />
* Order to tee (the center of the house)<br />
* A 32x32 grid representation of the ice sheet, representing which stones are present in each grid cell.<br />
<br />
The authors do not describe how the stone-based information is added to the 32x32 grid as input to the network.<br />
<br />
=== Policy Network ===<br />
<br />
The policy head is created by adding 2 convolutional layers with 2 (two) 3x3 filters to the main body of the network. The output of the policy head is a distribution of probabilities of the actions to select the best shot out of a 32x32x2 set of actions. The actions represent target locations in the grid and spin direction of the stone.<br />
<br />
[[File:policy-value-net.PNG | 700px]]<br />
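<br />
To make the 32x32x2 action set concrete, the sketch below decodes a flat index into a grid cell and a spin direction, and picks the most probable action from a policy output. The memory layout (spin as the fastest-varying dimension) and the encoding of spin as -1/+1 are assumptions made for illustration; the paper does not specify them in this summary.<br />
<br />
```python
GRID = 32
SPINS = (-1, +1)  # assumed encoding of the two curl directions
NUM_ACTIONS = GRID * GRID * len(SPINS)  # 2048 discrete actions


def decode_action(index):
    """Map a flat action index to (row, col, spin) under the assumed layout."""
    if not 0 <= index < NUM_ACTIONS:
        raise ValueError("action index out of range")
    cell, spin_id = divmod(index, len(SPINS))
    row, col = divmod(cell, GRID)
    return row, col, SPINS[spin_id]


def best_action(probs):
    """Return the decoded action with the highest predicted probability."""
    index = max(range(len(probs)), key=probs.__getitem__)
    return decode_action(index)
```
<br />
For example, index 65 corresponds to cell 32 (row 1, column 0) with spin +1 under this layout.<br />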
<br />
=== Value Network ===<br />
<br />
The value head is created by adding a convolutional layer with one 3x3 filter, and dense layers of 256 and 17 units, to the shared network. The 17 output units represent a probability distribution over scores in the range [-8,8], which are the possible scores for each end of a curling game.<br />
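<br />
As a small illustration of how a 17-way output can be read, the sketch below converts raw outputs to probabilities with a softmax and computes the expected end score. The use of a softmax and the ordering of the outputs from -8 to +8 are assumptions; only the 17 units and the [-8,8] range come from the text.<br />
<br />
```python
import math

SCORES = list(range(-8, 9))  # the 17 possible end scores, assumed ordered -8..8


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_score(probs):
    """Expected end score under a distribution over SCORES."""
    return sum(s * p for s, p in zip(SCORES, probs))


uniform = softmax([0.0] * 17)
```
<br />
A uniform distribution gives an expected score of zero, while a distribution concentrated on one output gives that output's score.<br />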
<br />
== Continuous Action Search ==<br />
<br />
The policy head of the network only outputs actions from a discretized action space. For real-life interactions, and especially in curling, this will not suffice, as very fine adjustments to actions can make significant differences in outcomes.<br />
<br />
Actions in the continuous space are generated using an MCTS algorithm, with the following steps:<br />
<br />
=== Selection ===<br />
<br />
From a given state, the list of already-visited actions is denoted as A<sub>t</sub>. Scores and the number of visits to each node are estimated using the equations below (the first equation shows the expectation of the end value for one-end games). These are likely estimated rather than simply taken from the MCTS statistics to help account for the differences in a continuous action space.<br />
<br />
[[File:curling_kernel_equations.png | 400px]]<br />
<br />
The UCB formula is then used to select an action to expand.<br />
<br />
The actions that are taken in the simulator appear to be drawn from a Gaussian centered around <math>a_t</math>. This allows exploration in the continuous action space.<br />
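<br />
The selection step can be sketched as follows. Since the equations are only available in the figure above, this sketch assumes the usual KR-UCT form: the value of an action is a kernel-weighted, visit-weighted average of the values of all visited actions, the effective visit count is the kernel-weighted sum of visits, and both are plugged into a UCB-style score. The kernel, the <code>bandwidth</code> and the constant <code>c</code> are illustrative assumptions.<br />
<br />
```python
import math


def kernel(a, b, bandwidth=1.0):
    """Isotropic Gaussian kernel between two actions (tuples of floats)."""
    sq = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-sq / (2.0 * bandwidth ** 2))


def kr_estimates(actions, values, visits, bandwidth=1.0):
    """Kernel-smoothed value and visit density for each visited action."""
    est_values, densities = [], []
    for a in actions:
        w = [kernel(a, b, bandwidth) * n for b, n in zip(actions, visits)]
        density = sum(w)  # assumed form of W(a): kernel-weighted visit count
        est_values.append(sum(wi * v for wi, v in zip(w, values)) / density)
        densities.append(density)
    return est_values, densities


def select_action(actions, values, visits, c=1.0, bandwidth=1.0):
    """Index of the action maximizing smoothed value plus exploration bonus.

    Assumes every action in the list has been visited at least once.
    """
    est_values, densities = kr_estimates(actions, values, visits, bandwidth)
    total = sum(densities)
    scores = [v + c * math.sqrt(math.log(total) / w)
              for v, w in zip(est_values, densities)]
    return scores.index(max(scores))
```
<br />
Nearby actions share statistics through the kernel, so a cluster of similar shots is treated as well explored even if each individual shot was tried only once.<br />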
<br />
=== Expansion ===<br />
<br />
The authors use a variant of regular UCT for expansion. In this case, they expand a new node only when existing nodes have been visited a certain number of times. The authors utilize a widening approach to overcome problems with standard UCT performing a shallow search when there is a large action space.<br />
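<br />
A common way to express such a widening rule is to allow a new child only while the number of children is below a slowly growing function of the visit count. The sketch below uses the generic progressive-widening criterion <code>num_children < k * total_visits ** alpha</code>; the constants <code>k</code> and <code>alpha</code> and the exact criterion are assumptions and may differ from the rule used in the paper.<br />
<br />
```python
def should_expand(num_children, total_visits, k=1.0, alpha=0.5):
    """Progressive widening: add a child only while the tree is 'too narrow'.

    With alpha=0.5 the number of children grows like the square root of the
    number of visits, which forces repeated visits to existing children.
    """
    return num_children < k * (total_visits ** alpha)
```
<br />
With the default constants, a node with 4 visits may hold at most 2 children, and a node with 16 visits at most 4, so the search deepens before it widens.<br />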
<br />
=== Simulation ===<br />
<br />
Instead of simulating with a random game playout, the authors use the value network to estimate the likely score associated with a state. This speeds up simulation (assuming the network is well trained), as the game does not actually need to be simulated.<br />
<br />
=== Backpropagation ===<br />
<br />
Standard backpropagation is used, updating both the values and number of visits stored in the path of parent nodes.<br />
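<br />
The update step can be sketched with a minimal node class. The class and attribute names are hypothetical, and the sketch ignores the sign flip that a two-player implementation would normally apply when alternating between the players' perspectives.<br />
<br />
```python
class Node:
    """Minimal MCTS node holding only the statistics needed for the update."""

    def __init__(self, parent=None):
        self.parent = parent
        self.visits = 0
        self.value_sum = 0.0

    @property
    def mean_value(self):
        return self.value_sum / self.visits if self.visits else 0.0


def backpropagate(node, value):
    """Add one visit and the simulation value to the node and all its ancestors."""
    while node is not None:
        node.visits += 1
        node.value_sum += value
        node = node.parent


root = Node()
child = Node(parent=root)
leaf = Node(parent=child)
backpropagate(leaf, 1.0)   # result of a simulation started at the leaf
backpropagate(child, 0.0)  # result of a simulation started at the child
```
<br />
After these two updates the root has been visited twice with a mean value of 0.5, while the leaf has a single visit.<br />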
<br />
<br />
== Supervised Learning ==<br />
<br />
During supervised training, data is gathered from the program AyumuGAT'16 ([8]), a high-performance AI curling program that is also based on an MCTS algorithm. 400,000 state-action pairs were generated for this training.<br />
<br />
=== Policy Network ===<br />
<br />
The policy network was trained to learn the action taken in each state. Here, the likelihood of the taken action was set to be 1, and the likelihood of other actions to be 0.<br />
<br />
=== Value Network ===<br />
<br />
The value network was trained by 'd-depth simulations and bootstrapping of the prediction to handle the high variance in rewards resulting from a sequence of stochastic moves' (quote taken from paper). In this case, ''m'' state-action pairs were sampled from the training data. For each pair, <math>(s_t, a_t)</math>, a state ''d'' steps ahead was generated, <math>s_{t+d}</math>. This process dealt with uncertainty by considering all actions in this rollout to have no uncertainty, and allowing uncertainty in the last action, ''a<sub>t+d-1</sub>''. The value network is used to predict the value for this state, <math>z_t</math>, and the value is used for learning the value at ''s<sub>t</sub>''.<br />
<br />
=== Policy-Value Network ===<br />
<br />
The policy-value network was trained to maximize the similarity of the predicted policy and value, and the actual policy and value from a state. The learning algorithm parameters are:<br />
<br />
* Algorithm: stochastic gradient descent<br />
* Batch size: 256<br />
* Momentum: 0.9<br />
* L2 regularization: 0.0001<br />
* Training time: ~100 epochs<br />
* Learning rate: initialized at 0.01, reduced twice<br />
<br />
A multi-task loss function was used. This takes the summation of the cross-entropy losses of each prediction:<br />
<br />
[[File:curling_loss_function.png | 300px]]<br />
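<br />
The loss described above can be sketched as the sum of two cross-entropy terms, one for the policy distribution and one for the score distribution. The function names are illustrative, the small <code>eps</code> is added only for numerical safety, and the L2 regularization term listed in the hyperparameters is omitted here.<br />
<br />
```python
import math


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def policy_value_loss(pi_target, p_pred, z_target, v_pred):
    """Multi-task loss: policy cross-entropy plus value (score) cross-entropy."""
    return cross_entropy(pi_target, p_pred) + cross_entropy(z_target, v_pred)


loss = policy_value_loss([1.0, 0.0], [0.5, 0.5], [1.0, 0.0], [0.5, 0.5])
```
<br />
With one-hot targets and uniform two-way predictions for both heads, each term equals <math>\ln 2</math>, so the total is <math>2 \ln 2</math>.<br />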
<br />
== Self-Play Reinforcement Learning ==<br />
<br />
After initialization by supervised learning, the algorithm uses self-play to further train itself. During this training, the policy network learns probabilities from the MCTS process, while the value network learns from game outcomes.<br />
<br />
At a game state ''s<sub>t</sub>'':<br />
<br />
1) the algorithm outputs a prediction ''z<sub>t</sub>''. This is an estimate of game score probabilities. It is based on similar past actions, and computed using kernel regression.<br />
<br />
2) the algorithm outputs a prediction <math>\pi_t</math>, representing a probability distribution of actions. These are proportional to estimated visit counts from MCTS, based on kernel density estimation.<br />
<br />
It is not clear how these predictions are created. It would seem likely that the policy-value network generates these, but the wording of the paper suggests they are generated from MCTS statistics.<br />
<br />
The policy-value network is updated by sampling data <math>(s, \pi, z)</math> from recent history of self-play. The same loss function is used as before.<br />
<br />
It is not clear how the improved network is used, as MCTS seems to be the driving process at this point.<br />
<br />
== Long-Term Strategy Learning ==<br />
<br />
Finally, the authors implement a new strategy to augment their algorithm for long-term play. In this context, this refers to playing a game over many ends, where the strategy to win a single end may not be a good strategy to win a full game. For example, scoring one point in an end, while being one point ahead, gives the advantage to the other team in the next round (as they will throw the last stone). The other team could then use the advantage to score two points, taking the lead.<br />
<br />
The authors build a 'winning percentage' table. This table stores the percentage of games won, based on the number of ends left, and the difference in score (current team - opposing team). This can be computed iteratively and using the probability distribution estimation of one-end scores.<br />
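<br />
The iterative computation can be sketched as a small dynamic program. This is a simplified illustration under stated assumptions: the one-end score distribution is the same in every end (the real table would also depend on which team has the hammer), a tied game counts as a 50% win, and score differences are clamped to a fixed range. The function name and arguments are hypothetical.<br />
<br />
```python
def win_table(score_probs, max_ends, max_diff=16):
    """table[n][d] = probability of winning with n ends left and score difference d.

    score_probs -- dict mapping a one-end score (from our perspective) to its
                   probability; assumed identical for every end
    """
    diffs = range(-max_diff, max_diff + 1)

    def clamp(d):
        return max(-max_diff, min(max_diff, d))

    # No ends left: the current score difference decides the game.
    table = [{d: (1.0 if d > 0 else 0.5 if d == 0 else 0.0) for d in diffs}]
    for _ in range(max_ends):
        prev = table[-1]
        table.append({
            d: sum(p * prev[clamp(d + s)] for s, p in score_probs.items())
            for d in diffs
        })
    return table


table = win_table({-1: 0.5, 1: 0.5}, max_ends=2)
```
<br />
With a symmetric one-end distribution, a tied game stays at 50%, and a one-point lead with one end left is worth 75% under the tie assumption above.<br />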
<br />
== Final Algorithms ==<br />
<br />
The authors make use of the following versions of their algorithm:<br />
<br />
=== KR-DL ===<br />
<br />
''Kernel regression-deep learning'': This algorithm is trained only by supervised learning.<br />
<br />
=== KR-DRL ===<br />
<br />
''Kernel regression-deep reinforcement learning'': This algorithm is trained by supervised learning (ie: initialized as the KR-DL algorithm), and again on self-play. During self-play, each shot is selected after 400 MCTS simulations of k=20 randomly selected actions. Data for self-play was collected over a week on 5 GPUs and generated 5 million game positions. The policy-value network was continually updated using samples from the latest 1 million game positions.<br />
<br />
=== KR-DRL-MES ===<br />
<br />
''Kernel regression-deep reinforcement learning-multi-ends-strategy'': This algorithm makes use of the winning percentage table generated from self-play.<br />
<br />
= Testing and Results =<br />
The authors use data from the public program AyumuGAT’16 to test. Testing is done with a simulated curling program [9]. This simulator does not deal with changing ice conditions, or sweeping, but does deal with stone trajectories and collisions.<br />
<br />
== Comparison of KR-DL-UCT and DL-UCT ==<br />
<br />
The first test compares an algorithm trained with kernel regression with an algorithm trained without kernel regression, to show the contribution that kernel regression adds to the performance. Both algorithms have networks initialised with the supervised learning, and then trained with two different algorithms for self-play. KR-DL-UCT uses the algorithm described above. The authors do not go into detail on how DL-UCT selects shots, but state that a constant is set to allow exploration.<br />
<br />
As an evaluation, both algorithms play 2000 games against the DL-UCT algorithm, which is frozen after supervised training. 1000 games are played with the algorithm taking the first shot, and 1000 with it taking the second. The games were two-end games. The figure below shows each algorithm's winning percentage given different amounts of training data. While the DL-UCT outperforms the supervised-training-only DL-UCT algorithm, the KR-DL-UCT algorithm performs much better.<br />
<br />
<center>[[File:curling_KR_test.png | 400px]]</center><br />
<br />
== Matches ==<br />
<br />
Finally, to test the performance of their multiple algorithms, the authors run matches between their algorithms and other existing programs. Each algorithm plays 200 matches against each other program, 100 of which are played as the first-playing team, and 100 as the second-playing team. Only one program was able to out-perform the KR-DRL algorithm. The authors state that this program, ''JiritsukunGAT'17'', also uses a deep network and hand-crafted features. However, the KR-DRL-MES algorithm was still able to out-perform it. Figure 4 shows the Elo ratings of the different programs. Note that the programs in blue are those created by the authors. They also played some games between their KR-DRL-MES and notable programs. Table 1 shows the details of the match results. ''JiritsukunGAT'17'' shows a similar level of performance, but KR-DRL-MES is still the winner.<br />
<br />
<br />
<br />
[[File:curling_ratings.png|600px|thumb|center|Figure 4. Elo rating and winning percentages of our models and GAT rankers. Each match has 200 games (each program plays 100 pre-ordered games), because the player which has the last shot (the hammer shot) in each end would have an advantage.]]<br />
<br />
<br />
[[File:ttt.png|600px|thumb|center|Table 1. The 8-end game results for KR-DRL-MES against other programs alternating the opening player each game. The matches are held by following the rules of the latest GAT competition.]]<br />
<br />
= Conclusion & Critique =<br />
<br />
The authors have presented a new framework which incorporates a deep neural network for learning game strategy with a kernel-based Monte Carlo tree search from a continuous space. Without the use of any hand-crafted feature, their policy-value network is successfully trained using supervised learning followed by reinforcement learning with a high-fidelity simulator for the Olympic sport of curling. Following are my critiques on the paper:<br />
<br />
== Strengths ==<br />
<br />
This algorithm out-performs other high-performance algorithms (including past competition champions).<br />
<br />
I think the paper does a decent job of comparing the performance of their algorithm to others. They are able to clearly show the benefits of many of their additions.<br />
<br />
The authors do seem to be able to adopt strategies similar to those used in Go and other games to the continuous action-space domain. In addition, the final strategy needs no hand-crafted features for learning.<br />
<br />
== Weaknesses ==<br />
<br />
Sometimes, I found this paper difficult to follow. One problem was that the algorithms were introduced first, and then how they were used was described. So when the paper stated that self-play shots were taken after 400 simulations, it seemed unclear what simulations were being run and at what stage of the algorithm (e.g., MCTS simulations, simulations sped up by using the value network, full simulations on the curling simulator). In particular, both the MCTS statistics and the policy-value network could be used to estimate both action probabilities and state values, so it is difficult to tell which is used in which case. There was also no clear distinction between discrete-space actions and continuous-space actions.<br />
<br />
While I think the comparison of different algorithms was done well, I believe it still lacked significant details. There were one-off claims mentioned in the paper which would have been nice to see backed by results. These include the statement that having a policy-value network in place of two networks led to better performance.<br />
<br />
At this point, the algorithms used still rely on initialization by a pre-made program.<br />
<br />
There was little theoretical development or justification done in this paper.<br />
<br />
While curling is an interesting choice for demonstrating the algorithm, the fact that the simulations used did not support many of the key points of curling (ice conditions, sweeping) seems very limiting. Another game, such as pool, would likely have offered some of the same challenges while allowing higher-fidelity simulations/training.<br />
<br />
While the spatial placements of stones were discretized in a grid, the curl of thrown stones was discretized to only +/-1. This seems like it may limit learning high- and low-spin moves. It should be noted that having zero spins is not commonly used, to the best of my knowledge.<br />
<br />
=References=<br />
# Lee, K., Kim, S., Choi, J. & Lee, S. "Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling." Proceedings of the 35th International Conference on Machine Learning, in PMLR 80:2937-2946 (2018)<br />
# https://www.baeldung.com/java-monte-carlo-tree-search<br />
# https://jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/<br />
# https://int8.io/monte-carlo-tree-search-beginners-guide/<br />
# https://en.wikipedia.org/wiki/Monte_Carlo_tree_search<br />
# Silver, D., Huang, A., Maddison, C., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, pp. 484–489, 2016.<br />
# Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., and Hassabis, D. Mastering the game of Go without human knowledge. Nature, pp. 354–359, 2017.<br />
# Yamamoto, M., Kato, S., and Iizuka, H. Digital curling strategy based on game tree search. In Proceedings of the IEEE Conference on Computational Intelligence and Games, CIG, pp. 474–480, 2015.<br />
# Ohto, K. and Tanaka, T. A curling agent based on the Monte-Carlo tree search considering the similarity of the best action among similar states. In Proceedings of Advances in Computer Games, ACG, pp. 151–164, 2017.<br />
# Ito, T. and Kitasei, Y. Proposal and implementation of digital curling. In Proceedings of the IEEE Conference on Computational Intelligence and Games, CIG, pp. 469–473, 2015.<br />
# Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, ICML, pp. 448–456, 2015.<br />
# Nair, V. and Hinton, G. Rectified linear units improve restricted boltzmann machines.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Reinforcement_Learning_in_Continuous_Action_Spaces_a_Case_Study_in_the_Game_of_Simulated_Curling&diff=42149Deep Reinforcement Learning in Continuous Action Spaces a Case Study in the Game of Simulated Curling2018-11-30T23:11:15Z<p>Z43ma: </p>
<hr />
<div>This page provides a summary and critique of the paper '''Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling''' [[http://proceedings.mlr.press/v80/lee18b/lee18b.pdf Online Source]], published in ICML 2018. The source code for this paper is available [https://github.com/leekwoon/KR-DL-UCT here]<br />
<br />
= Introduction and Motivation =<br />
<br />
In recent years, reinforcement learning methods have been applied to many different games, such as chess and checkers. More recently, the use of CNNs has allowed neural networks to out-perform humans in many difficult games, such as Go. However, many of these cases involve a discrete state or action space; the number of actions a player can take and/or the number of possible game states are finite. Deep CNNs are not directly applicable to large, non-convex continuous action spaces. To solve this issue, the authors conduct a policy search with an efficient stochastic continuous action search on top of policy samples generated from a deep CNN. Their deep CNN still discretizes the state space and the action space. However, in the stochastic continuous action search, they lift the restriction of the deterministic discretization and conduct a local search procedure in a physical simulator with continuous action samples. In this way, the benefits of both deep neural networks and physical simulators can be realized.<br />
<br />
Interacting with the real world (e.g., a scenario that involves moving physical objects) typically involves working with a continuous action space. It is thus important to develop strategies for dealing with continuous action spaces. Deep neural networks that are designed to succeed in finite action spaces are not necessarily suitable for continuous action space problems, because deterministic discretization of a continuous action space causes strong biases in policy evaluation and improvement.<br />
<br />
This paper introduces a method to allow learning with continuous action spaces. A CNN is used to perform learning on discretized state and action spaces, and then a continuous action search is performed on these discrete results.<br />
<br />
Curling is chosen as a domain to test the network on. Curling was chosen due to its large action space, potential for complicated strategies, and need for precise interactions.<br />
<br />
== Curling ==<br />
<br />
Curling is a sport played by two teams on a long sheet of ice. Roughly, the goal is for each team to slide rocks closer to the target on the other end of the sheet than the other team. The next sections provide background on the game play and on potential challenges/concerns for learning algorithms. A terminology section follows.<br />
<br />
=== Game play ===<br />
<br />
A game of curling is divided into ends. In each end, players from both teams alternate throwing (sliding) eight rocks to the other end of the ice sheet, known as the house. Rocks must land in a certain area in order to stay in play, and must touch or be inside concentric rings (12ft diameter and smaller) in order to score points. At the end of each end, the team with rocks closest to the center of the house scores points.<br />
<br />
When throwing a rock, the curler can spin the rock. This allows the rock to 'curl' its path towards the house and can allow rocks to travel around other rocks. Team members are also able to sweep the ice in front of a moving rock in order to decrease friction, which allows for fine-tuning of distance (though the physics of sweeping are not implemented in the simulation used).<br />
<br />
Curling offers many possible high-level actions, which are directed by a team member to the throwing member. An example set of these includes:<br />
<br />
* Draw: Throw a rock to a target location<br />
* Freeze: Draw a rock up against another rock<br />
* Takeout: Knock another rock out of the house. Can be combined with different ricochet directions<br />
* Guard: Place a rock in front of another, to block other rocks (ex: takeouts)<br />
<br />
=== Challenges for AI ===<br />
<br />
Curling offers many challenges for AI based on its physics and rules. This section lists a few concerns.<br />
<br />
The effect of changing actions can be highly nonlinear and discontinuous. This can be seen when considering that a 1-cm deviation in a path can make the difference between a high-speed collision, or lack of collision.<br />
<br />
Curling will require both offensive and defensive strategies. For example, consider the fact that the last team to throw a rock each end only needs to place that rock closer than the opposing team's rocks to score a point and invalidate any opposing rocks in the house. The opposing team should thus be considering how to prevent this from happening, in addition to scoring points themselves.<br />
<br />
Curling also has a concept known as 'the hammer'. The hammer belongs to the team which throws the last rock each end, providing an advantage, and is given to the team that does not score points each end. It could very well be a good strategy to try not to win a single point in an end (if already ahead in points, etc), as this would give the advantage to the opposing team.<br />
<br />
Finally, curling has a rule known as the 'Free Guard Zone'. This applies to the first 4 rocks thrown (2 from each team). If they land short of the house, but still in play, then the rocks are not allowed to be removed (via collisions) until all of the first 4 rocks have been thrown.<br />
<br />
=== Terminology ===<br />
<br />
* End: A round of the game<br />
* House: The end of the sheet of ice, which contains the concentric scoring rings<br />
* Hammer: The team that throws the last rock of an end 'has the hammer'<br />
* Hog Line: thick line that is drawn in front of the house, orthogonal to the length of the ice sheet. Rocks must pass this line to remain in play.<br />
* Back Line: thick line drawn just behind the house. Rocks that pass this line are removed from play.<br />
<br />
<br />
== Related Work ==<br />
<br />
=== AlphaGo Lee ===<br />
<br />
AlphaGo Lee (Silver et al., 2016, [5]) refers to an algorithm used to play the game Go, which was able to defeat international champion Lee Sedol. <br />
<br />
<br />
Go game:<br />
* Start with an empty 19x19 board<br />
* One player takes the black stones and the other takes the white stones<br />
* The two players take turns placing stones on the board<br />
* Once a stone has been placed, it cannot be moved<br />
* Rules:<br />
** If a connected group of stones is completely surrounded by the opponent's stones, it is removed from the board<br />
** Ko rule: forbids a play that repeats a previous board position<br />
* The game ends when there are no valuable moves left.<br />
* Count the territory of both players. The objective of the game is to capture more territory than your opponent. The player with the black stones plays first. However, the black player needs to give 7.5 points (called Komi) to the white player as a tradeoff. The number of points given varies with the rule sets used in different Asian countries.<br />
* This game used to be a huge challenge for artificial intelligence for two reasons. First, the search space is extremely large. It is estimated to be on the order of <math>10^{172}</math>, which is more than the number of atoms in the universe, and much larger than the number of game states in chess (<math>10^{47}</math>). Second, there was no good heuristic function for evaluating a position in Go, so the traditional alpha-beta pruning algorithm does not perform well. For AlphaGo Lee, the CNN plays the role of a good heuristic function, which results in a huge improvement in the AI's performance.<br />
[[File:go.JPG|700px|center]]<br />
<br />
Two neural networks were trained on the moves of human experts, to act as both a policy network and a value network. A Monte Carlo Tree Search algorithm was used for policy improvement.<br />
<br />
The AlphaGo Lee policy network predicts the best move given a board configuration. It has a CNN architecture with 13 hidden layers, and it is trained using expert game play data and improved through self-play.<br />
<br />
The value network evaluates the probability of winning given a board configuration. It consists of a CNN with 14 hidden layers, and it is trained using self-play data from the policy network. <br />
<br />
Finally, the two networks are combined using Monte-Carlo Tree Search, which performs a look-ahead search to select the actions for gameplay.<br />
<br />
The use of both policy and value networks is reflected in this paper's work.<br />
<br />
=== AlphaGo Zero ===<br />
<br />
AlphaGo Zero (Silver et al., 2017, [6]) is an improvement on the AlphaGo Lee algorithm. AlphaGo Zero uses a unified neural network in place of the separate policy and value networks and is trained on self-play, without the need of expert training.<br />
Previous versions of AlphaGo initially trained on thousands of human amateur and professional games to learn how to play Go. AlphaGo Zero skips this step and learns to play simply by playing games against itself, starting from completely random play. In doing so, it quickly surpassed human level of play and defeated the previously published champion-defeating version of AlphaGo by 100 games to 0.<br />
It is able to do this by using a novel form of reinforcement learning, in which AlphaGo Zero becomes its own teacher. The system starts off with a neural network that knows nothing about the game of Go. It then plays games against itself, by combining this neural network with a powerful search algorithm. As it plays, the neural network is tuned and updated to predict moves, as well as the eventual winner of the games.<br />
<br />
This updated neural network is then recombined with the search algorithm to create a new, stronger version of AlphaGo Zero, and the process begins again. In each iteration, the performance of the system improves by a small amount, and the quality of the self-play games increases, leading to more and more accurate neural networks and ever stronger versions of AlphaGo Zero.<br />
<br />
This technique is more powerful than previous versions of AlphaGo because it is no longer constrained by the limits of human knowledge. Instead, it is able to learn tabula rasa from the strongest player in the world: AlphaGo itself.<br />
<br />
Other differences from the previous AlphaGo iterations are as follows. AlphaGo Zero only uses the black and white stones from the Go board as its input, whereas previous versions of AlphaGo included a small number of hand-engineered features. It uses one neural network rather than two. Earlier versions of AlphaGo used a “policy network” to select the next move to play and a “value network” to predict the winner of the game from each position. These are combined in AlphaGo Zero, allowing it to be trained and evaluated more efficiently. AlphaGo Zero does not use “rollouts” - fast, random games used by other Go programs to predict which player will win from the current board position. Instead, it relies on its high quality neural networks to evaluate positions. All of these differences help improve the performance of the system and make it more general. But it is the algorithmic change that makes the system much more powerful and efficient.<br />
<br />
The unification of networks and self-play are also reflected in this paper.<br />
<br />
=== Curling Algorithms ===<br />
<br />
Some past algorithms have been proposed to deal with continuous action spaces. For example, (Yamamoto et al., 2015, [7]) use game tree search methods in a discretized space. The value of an action is taken as the average of nearby values, with respect to some knowledge of execution uncertainty.<br />
<br />
=== Monte Carlo Tree Search ===<br />
<br />
Monte Carlo Tree Search algorithms have been applied to continuous action spaces. These algorithms, to be discussed in further detail, balance exploration of different states with knowledge of paths of execution through past games. One such MCTS variant, KR-UCT, has been applied to continuous action spaces; it finds effective selections by using kernel regression (KR) and kernel density estimation (KDE) to estimate rewards from neighborhood information.<br />
<br />
For bandit problems, researchers have used hierarchical optimistic optimization (HOO) to create a cover tree and divide the action space into small ranges at different depths, where the most promising nodes yield fine-grained estimates.<br />
<br />
=== Curling Physics and Simulation ===<br />
<br />
Several references in the paper refer to the study and simulation of curling physics. Scholars have analyzed friction coefficients between curling stones and ice. Because modelling the changes in friction on ice is not possible, a fixed friction coefficient was predefined in the simulation. The behavior of the stones was also modeled. Important parameters are trained from professional players. The authors used the same parameters in this paper.<br />
<br />
== General Background of Algorithms ==<br />
<br />
=== Policy and Value Functions ===<br />
<br />
A policy function is trained to provide the best action to take, given a current state. Policy iteration is an algorithm used to improve a policy over time. This is done by alternating between policy evaluation and policy improvement.<br />
<br />
POLICY IMPROVEMENT: LEARNING ACTION POLICY<br />
<br />
The action policy <math> p_{\rho}(a|s) </math> outputs a probability distribution over all eligible moves <math> a </math>. Here <math> \rho </math> denotes the weights of a neural network that approximates the policy, <math>s</math> denotes a state, and <math>a</math> denotes an action taken in the environment. The policy is a function that returns an action given the state in which the agent finds itself. Policy gradient reinforcement learning can be used to train the action policy. It is updated by stochastic gradient ascent in the direction that maximizes the expected outcome at each time step t,<br />
\[ \Delta \rho \propto \frac{\partial \log p_{\rho}(a_t|s_t)}{\partial \rho} r(s_t) \]<br />
where <math> r(s_t) </math> is the return.<br />
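<br />
As a rough illustration of this update, the sketch below applies one REINFORCE-style step, which uses the gradient of the log-probability, to a tiny softmax policy whose parameters are the logits themselves. The tabular policy, the learning rate and the function names are assumptions made for the example; the paper's policy is a deep CNN.<br />

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, action, ret, lr=0.1):
    """One gradient-ascent step on log p(action), scaled by the return."""
    probs = softmax(logits)
    # For a softmax policy, d log p(a) / d logit_k = 1[k == a] - p_k.
    return [
        x + lr * ret * ((1.0 if k == action else 0.0) - p)
        for k, (x, p) in enumerate(zip(logits, probs))
    ]


logits = [0.0, 0.0, 0.0]
rewarded = policy_gradient_step(logits, action=1, ret=1.0)
penalized = policy_gradient_step(logits, action=1, ret=-1.0)
```

A positive return raises the probability of the action that was taken, and a negative return lowers it.<br />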
<br />
POLICY EVALUATION: LEARNING VALUE FUNCTIONS<br />
<br />
A value function <math> v_{\theta}(s) </math> with parameters <math> \theta </math> is trained to estimate the value of being in a certain state. It is trained on records of state-outcome pairs <math> (s, r(s)) </math> by using stochastic gradient descent to minimize the mean squared error (MSE) between the predicted regression value and the corresponding outcome,<br />
\[ \Delta \theta \propto \frac{\partial v_{\theta}(s)}{\partial \theta}(r(s)-v_{\theta}(s)) \]<br />
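<br />
The value update can be illustrated with a linear value function, for which the gradient with respect to the parameters is simply the feature vector. This is a minimal sketch under that assumption; the feature vector, learning rate and function names are hypothetical, and the paper uses a neural network instead.<br />

```python
def predict(theta, features):
    # Linear value function: v(s) = theta . phi(s).
    return sum(t * f for t, f in zip(theta, features))


def value_update(theta, features, outcome, lr=0.1):
    """One SGD step that reduces the squared error (outcome - v(s))^2."""
    error = outcome - predict(theta, features)
    # d v / d theta_k = features[k] for a linear model.
    return [t + lr * error * f for t, f in zip(theta, features)]


theta = [0.0, 0.0]
features = [1.0, 2.0]
for _ in range(50):
    theta = value_update(theta, features, outcome=1.0)
```

Repeating the update on the same sample drives the prediction towards the observed outcome.<br />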
<br />
=== Monte Carlo Tree Search ===<br />
<br />
Monte Carlo Tree Search (MCTS) is a search algorithm used for finite-horizon tasks (e.g., in curling, only 16 moves, i.e., stone throws, are taken each end).<br />
<br />
MCTS is a tree search algorithm similar to minimax. However, MCTS is probabilistic and does not need to explore a full game tree or even a tree reduced with alpha-beta pruning. This makes it tractable for games such as Go and curling.<br />
<br />
Nodes of the tree are game states, and branches represent actions. Each node stores statistics on how many times it has been visited by the MCTS, as well as the number of wins encountered by playouts from that position. A node is considered 'visited' if a full playout has started from that node. A node is considered 'expanded' if all its children have been visited.<br />
<br />
MCTS begins with the '''selection''' phase, which involves traversing known states/actions. Starting at the root node, the child with the highest score is selected. From each successive node, a path down towards a leaf node is explored in a similar fashion.<br />
<br />
The next phase, '''expansion''', begins when the algorithm reaches a node where not all children have been visited (i.e., the node has not been fully expanded). In the expansion phase, children of the node are visited, and '''simulations''' run from their states.<br />
<br />
Once the new child is expanded, '''simulation''' takes place. This refers to a full playout of the game from the point of the current node, and can involve many strategies, such as randomly taken moves, the use of heuristics, etc.<br />
<br />
The final phase is '''update''' or '''back-propagation''' (unrelated to the neural network algorithm). In this phase, the result of the '''simulation''' (i.e., win/lose) is updated in the statistics of all parent nodes.<br />
<br />
A selection function known as Upper Confidence Bound applied to Trees (UCT) can be used to choose which node to select. The formula is shown below [[https://www.baeldung.com/java-monte-carlo-tree-search source]]. Note that the first term essentially acts as an average score of games played from a certain node. The second term, meanwhile, will grow when sibling nodes are expanded. This means that unexplored nodes will gradually increase their UCT score, and be selected in the future.<br />
<br />
<math> \frac{w_i}{n_i} + c \sqrt{\frac{\ln t}{n_i}} </math><br />
<br />
In which<br />
<br />
* <math> w_i = </math> number of wins after <math> i</math>th move<br />
* <math> n_i = </math> number of simulations after <math> i</math>th move<br />
* <math> c = </math> exploration parameter (theoretically equal to <math> \sqrt{2}</math>)<br />
* <math> t = </math> total number of simulations for the parent node<br />
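<br />
The UCT score above can be computed directly from the stored node statistics. The following is a small sketch in which each child is represented by a hypothetical <code>(wins, visits)</code> pair; treating unvisited children as having an infinite score is a common convention and an assumption here, not something the formula itself specifies.<br />

```python
import math


def uct_score(wins, visits, parent_visits, c=math.sqrt(2)):
    """Average win rate plus an exploration bonus for rarely visited nodes."""
    if visits == 0:
        # Assumed convention: always try unvisited children first.
        return math.inf
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children, parent_visits):
    """Return the index of the child with the highest UCT score."""
    return max(
        range(len(children)),
        key=lambda i: uct_score(children[i][0], children[i][1], parent_visits),
    )


children = [(6, 10), (3, 4)]  # (wins, visits) for two sibling nodes
chosen = select_child(children, parent_visits=14)
```

Here the second child is chosen: it has both the higher win rate and the larger exploration bonus because it has been visited less often.<br />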
<br />
<br />
Sources: 2,3,4<br />
<br />
[[File:MCTS_Diagram.jpg | 500px|center]]<br />
<br />
=== Kernel Regression ===<br />
<br />
Kernel regression is a form of weighted averaging which uses a kernel function as a weight to estimate the conditional expectation of a random variable. Given two data points '''x''', each of which has an associated value '''y''', and a choice of kernel '''K''', the kernel function outputs a weighting factor for the pair. An estimate of the value of a new, unseen point is then calculated as the weighted average of the values of surrounding points.<br />
<br />
A typical kernel is a Gaussian kernel, shown below. The formula for calculating estimated value is shown below as well (sources: Lee et al.).<br />
<br />
[[File:gaussian_kernel.png | 400 px]]<br />
<br />
[[File:kernel_regression.png | 250 px]]<br />
<br />
The denominator of the conditional expectation is related to kernel density estimation, which is defined as <math display="inline">W(x)=\sum_{i=0}^n K(x,x_i)</math>.<br />
<br />
In this case, the combination of the two acts to weight the scores of samples closest to '''x''' more strongly.<br />
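<br />
The weighted average described in this section can be sketched in a few lines. The example assumes one-dimensional inputs and a Gaussian kernel of the form <math>\exp(-(x - x_i)^2 / (2h^2))</math> with a hypothetical bandwidth <math>h</math>; the exact kernel used in the paper is given in the figure above and may be parameterized differently.<br />

```python
import math


def gaussian_kernel(x, xi, bandwidth=1.0):
    # Weight decays smoothly with the squared distance between the points.
    return math.exp(-((x - xi) ** 2) / (2.0 * bandwidth ** 2))


def kernel_density(x, xs, bandwidth=1.0):
    """W(x): sum of kernel weights of all samples with respect to x."""
    return sum(gaussian_kernel(x, xi, bandwidth) for xi in xs)


def kernel_regression(x, xs, ys, bandwidth=1.0):
    """Estimate the value at x as a kernel-weighted average of known values."""
    weights = [gaussian_kernel(x, xi, bandwidth) for xi in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.0, 4.0]
estimate = kernel_regression(1.0, xs, ys)
```

A smaller bandwidth makes the estimate depend almost entirely on the nearest samples, while a larger one averages over a wider neighborhood.<br />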
<br />
= Methods =<br />
<br />
== Variable Definitions ==<br />
<br />
The following variables are used often in the paper:<br />
<br />
* <math>s</math>: A state in the game, as described below as the input to the network.<br />
* <math>s_t</math>: The state at a certain time-step of the game. Time-steps refer to full turns in the game<br />
* <math>a_t</math>: The action taken in state <math>s_t</math><br />
* <math>A_t</math>: The actions taken for sibling nodes related to <math>a_t</math> in MCTS<br />
* <math>n_{a_t}</math>: The number of visits to node <math>a_t</math> in MCTS<br />
* <math>v_{a_t}</math>: The MCTS value estimate of a node<br />
<br />
== Network Design ==<br />
<br />
The authors design a CNN called the 'policy-value' network. The network consists of a common network structure, which is then split into 'policy' and 'value' outputs. This network is trained to learn a probability distribution of actions to take, and expected rewards, given an input state.<br />
<br />
=== Shared Structure ===<br />
<br />
The network consists of 1 convolutional layer followed by 9 residual blocks, each block consisting of 2 convolutional layers with 32 3x3 filters. The structure of this network is shown below:<br />
<br />
<br />
[[File:curling_network_layers.png|600px|thumb|center|Figure 2. A detailed description of our policy-value network. The shared network is composed of one convolutional layer and nine residual blocks. Each residual block (explained in b) has two convolutional layers with batch normalization (Ioffe & Szegedy, 2015[11]) followed by the addition of the input and the residual block. Each layer in the shared network uses 3x3 filters. The policy head has two more convolutional layers, while the value head has two fully connected layers on top of a convolutional layer. For the activation function of each convolutional layer, ReLU (Nair & Hinton[12]) is used.]]<br />
<br />
<br />
<br />
The input to this network is the following:<br />
* Location of stones<br />
* Order to tee (the center of the house)<br />
* A 32x32 grid representation of the ice sheet, representing which stones are present in each grid cell.<br />
<br />
The authors do not describe how the stone-based information is added to the 32x32 grid as input to the network.<br />
<br />
=== Policy Network ===<br />
<br />
The policy head is created by adding 2 convolutional layers with 2 (two) 3x3 filters to the main body of the network. The output of the policy head is a distribution of probabilities of the actions to select the best shot out of a 32x32x2 set of actions. The actions represent target locations in the grid and spin direction of the stone.<br />
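<br />
Picking the best shot from the 32x32x2 output amounts to finding the largest probability and decoding its index into a grid cell and a spin direction. The sketch below assumes a flattened output laid out as spin, then row, then column, and assumes spin values of -1 and +1; both the layout and the function names are hypothetical.<br />

```python
GRID = 32  # the action grid is GRID x GRID target cells


def decode_action(index):
    """Map a flat index in [0, 2 * GRID * GRID) to (row, col, spin)."""
    spin_bit, cell = divmod(index, GRID * GRID)
    row, col = divmod(cell, GRID)
    spin = 1 if spin_bit == 1 else -1  # assumed encoding of the two spin directions
    return row, col, spin


def best_action(probs):
    """Return the decoded action with the highest predicted probability."""
    index = max(range(len(probs)), key=probs.__getitem__)
    return decode_action(index)


probs = [0.0] * (2 * GRID * GRID)
probs[GRID * GRID + 33] = 1.0  # all probability mass on one action
```

With this layout, the example distribution decodes to the cell in row 1, column 1, thrown with the second spin direction.<br />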
<br />
[[File:policy-value-net.PNG | 700px]]<br />
<br />
=== Value Network ===<br />
<br />
The value head is created by adding a convolution layer with 1 3x3 filter, and dense layers of 256 and 17 units, to the shared network. The 17 output units represent a probability distribution over scores in the range of [-8,8], which are the possible scores at each end of a curling game.<br />
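<br />
Given the 17 output probabilities, an expected end score can be read off as a probability-weighted sum. The sketch below assumes that the outputs are ordered from -8 to +8; that ordering, like the names used, is an assumption for illustration.<br />

```python
SCORES = list(range(-8, 9))  # the 17 possible one-end scores, from -8 to +8


def expected_score(score_probs):
    """Probability-weighted average of the possible end scores."""
    if len(score_probs) != len(SCORES):
        raise ValueError("expected one probability per possible score")
    return sum(p * s for p, s in zip(score_probs, SCORES))


uniform = [1.0 / 17] * 17
one_hot = [0.0] * 17
one_hot[10] = 1.0  # index 10 corresponds to a score of +2
```

A uniform distribution gives an expected score of zero, while a distribution concentrated on one output gives exactly that output's score.<br />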
<br />
== Continuous Action Search ==<br />
<br />
The policy head of the network only outputs actions from a discretized action space. For real-life interactions, and especially in curling, this will not suffice, as very fine adjustments to actions can make significant differences in outcomes.<br />
<br />
Actions in the continuous space are generated using an MCTS algorithm, with the following steps:<br />
<br />
=== Selection ===<br />
<br />
From a given state, the list of already-visited actions is denoted as A<sub>t</sub>. Scores and the number of visits to each node are estimated using the equations below (the first equation shows the expectation of the end value for one-end games). These are likely estimated rather than simply taken from the MCTS statistics to help account for the differences in a continuous action space.<br />
<br />
[[File:curling_kernel_equations.png | 400px]]<br />
<br />
The UCB formula is then used to select an action to expand.<br />
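<br />
The selection step can be sketched as follows, assuming the estimates take the usual KR-UCT form: the value of an action is a kernel- and visit-weighted average over all visited actions, and its effective visit count is the kernel-weighted sum of visits. One-dimensional actions, the Gaussian kernel, the exploration constant <code>c</code> and all names are assumptions made for the example; the exact equations are those shown in the figure above.<br />

```python
import math


def kernel(a, b, bandwidth=1.0):
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def kr_estimates(action, visited):
    """Return (value estimate, effective visit count) for `action`.

    `visited` is a list of (action, mean value, visit count) triples.
    """
    density = sum(kernel(action, b) * n for b, _, n in visited)
    value = sum(kernel(action, b) * v * n for b, v, n in visited) / density
    return value, density


def select_action(visited, c=1.0):
    """Pick the visited action with the highest kernel-based UCB score."""
    estimates = [kr_estimates(a, visited) for a, _, _ in visited]
    total = sum(w for _, w in estimates)

    def score(i):
        value, weight = estimates[i]
        return value + c * math.sqrt(math.log(total) / weight)

    return visited[max(range(len(visited)), key=score)][0]


visited = [(0.0, 0.2, 10), (0.1, 0.3, 10), (5.0, 0.25, 1)]
chosen = select_action(visited)
```

The isolated, rarely visited action at 5.0 is chosen because nearby visits are shared between the two close actions, which shrinks their exploration bonus.<br />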
<br />
The actions that are taken in the simulator appear to be drawn from a Gaussian centered around <math>a_t</math>. This allows exploration in the continuous action space.<br />
<br />
=== Expansion ===<br />
<br />
The authors use a variant of regular UCT for expansion. In this case, they expand a new node only when existing nodes have been visited a certain number of times. The authors utilize a widening approach to overcome problems with standard UCT performing a shallow search when there is a large action space.<br />
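<br />
A widening rule of this kind can be expressed as a simple test on the number of existing child actions and the total number of visits. The threshold below, with constants <code>c</code> and <code>alpha</code>, is one common form of progressive widening and is an assumption for illustration; it is not necessarily the exact criterion used in the paper.<br />

```python
def should_expand(num_children, total_visits, c=1.0, alpha=0.5):
    """Allow a new child only once visits have grown enough.

    The number of children is kept below c * total_visits ** alpha, so the
    tree widens slowly and existing actions are searched more deeply.
    """
    return num_children < c * total_visits ** alpha


decisions = [should_expand(n, v) for n, v in [(1, 4), (2, 4), (3, 10)]]
```

With the default constants, a node with four visits may hold at most two children before it must be visited more.<br />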
<br />
=== Simulation ===<br />
<br />
Instead of simulating with a random game playout, the authors use the value network to estimate the likely score associated with a state. This speeds up simulation (assuming the network is well trained), as the game does not actually need to be simulated.<br />
<br />
=== Backpropagation ===<br />
<br />
Standard backpropagation is used, updating both the values and number of visits stored in the path of parent nodes.<br />
<br />
<br />
== Supervised Learning ==<br />
<br />
During supervised training, data is gathered from the program AyumuGAT'16 ([8]), a high-performance AI curling program that is itself based on an MCTS algorithm. 400 000 state-action pairs were generated for this training.<br />
<br />
=== Policy Network ===<br />
<br />
The policy network was trained to learn the action taken in each state. Here, the likelihood of the taken action was set to be 1, and the likelihood of other actions to be 0.<br />
<br />
=== Value Network ===<br />
<br />
The value network was trained by 'd-depth simulations and bootstrapping of the prediction to handle the high variance in rewards resulting from a sequence of stochastic moves' (quote taken from paper). In this case, ''m'' state-action pairs were sampled from the training data. For each pair, <math>(s_t, a_t)</math>, a state ''d'' steps ahead was generated, <math>s_{t+d}</math>. This process dealt with uncertainty by treating all actions in this rollout as having no uncertainty, except for the last action, ''a<sub>t+d-1</sub>'', where uncertainty was allowed. The value network is used to predict the value for this state, <math>z_t</math>, and the value is used for learning the value at ''s<sub>t</sub>''.<br />
<br />
=== Policy-Value Network ===<br />
<br />
The policy-value network was trained to maximize the similarity of the predicted policy and value, and the actual policy and value from a state. The learning algorithm parameters are:<br />
<br />
* Algorithm: stochastic gradient descent<br />
* Batch size: 256<br />
* Momentum: 0.9<br />
* L2 regularization: 0.0001<br />
* Training time: ~100 epochs<br />
* Learning rate: initialized at 0.01, reduced twice<br />
<br />
A multi-task loss function was used. This takes the summation of the cross-entropy losses of each prediction:<br />
<br />
[[File:curling_loss_function.png | 300px]]<br />
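<br />
As a rough illustration of a loss of this form, the sketch below sums one cross-entropy term for the policy head and one for the value head. It assumes both heads output probability vectors; the function names and the small <code>eps</code> guard are illustrative and not taken from the paper.<br />

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def policy_value_loss(pi_target, pi_pred, z_target, z_pred):
    """Multi-task loss: sum of the policy and value cross-entropy terms."""
    return cross_entropy(pi_target, pi_pred) + cross_entropy(z_target, z_pred)
```
<br />
In supervised training the policy target would be one-hot (likelihood 1 for the taken action, 0 for the others), as described above.<br />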
<br />
== Self-Play Reinforcement Learning ==<br />
<br />
After initialization by supervised learning, the algorithm uses self-play to further train itself. During this training, the policy network learns probabilities from the MCTS process, while the value network learns from game outcomes.<br />
<br />
At a game state ''s<sub>t</sub>'':<br />
<br />
1) the algorithm outputs a prediction ''z<sub>t</sub>''. This is an estimate of game score probabilities. It is based on similar past actions, and computed using kernel regression.<br />
<br />
2) the algorithm outputs a prediction <math>\pi_t</math>, representing a probability distribution of actions. These are proportional to estimated visit counts from MCTS, based on kernel density estimation.<br />
<br />
It is not clear how these predictions are created. It would seem likely that the policy-value network generates these, but the wording of the paper suggests they are generated from MCTS statistics.<br />
<br />
The policy-value network is updated by sampling data <math>(s, \pi, z)</math> from recent history of self-play. The same loss function is used as before.<br />
<br />
It is not clear how the improved network is used, as MCTS seems to be the driving process at this point.<br />
<br />
== Long-Term Strategy Learning ==<br />
<br />
Finally, the authors implement a new strategy to augment their algorithm for long-term play. In this context, this refers to playing a game over many ends, where the strategy to win a single end may not be a good strategy to win a full game. For example, scoring one point in an end, while being one point ahead, gives the advantage to the other team in the next round (as they will throw the last stone). The other team could then use the advantage to score two points, taking the lead.<br />
<br />
The authors build a 'winning percentage' table. This table stores the percentage of games won, based on the number of ends left, and the difference in score (current team - opposing team). This can be computed iteratively and using the probability distribution estimation of one-end scores.<br />
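<br />
The iterative computation can be sketched as a small dynamic program. This is a simplified illustration, not the authors' table: it assumes a single fixed distribution of one-end score differences, ignores which team holds the last stone, and counts a tie after the final end as half a win. The names <code>make_win_table</code> and <code>score_dist</code> are hypothetical.<br />

```python
from functools import lru_cache

def make_win_table(score_dist):
    """Build a memoized win-probability function.

    score_dist maps a one-end score difference (our points minus theirs)
    to its probability. The returned function win_prob(ends_left, diff)
    gives the probability of winning with `ends_left` ends to play and a
    current score difference of `diff`.
    """

    @lru_cache(maxsize=None)
    def win_prob(ends_left, diff):
        if ends_left == 0:
            # Game over: win, tie (counted as half a win), or loss.
            return 1.0 if diff > 0 else 0.5 if diff == 0 else 0.0
        # Average over the possible outcomes of the next end.
        return sum(p * win_prob(ends_left - 1, diff + delta)
                   for delta, p in score_dist.items())

    return win_prob
```
<br />
For example, with <code>score_dist = {1: 0.5, -1: 0.5}</code>, a team one point ahead with one end left wins with probability 0.75 under these assumptions.<br />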
<br />
== Final Algorithms ==<br />
<br />
The authors make use of the following versions of their algorithm:<br />
<br />
=== KR-DL ===<br />
<br />
''Kernel regression-deep learning'': This algorithm is trained only by supervised learning.<br />
<br />
=== KR-DRL ===<br />
<br />
''Kernel regression-deep reinforcement learning'': This algorithm is trained by supervised learning (i.e., initialized as the KR-DL algorithm), and again on self-play. During self-play, each shot is selected after 400 MCTS simulations of k=20 randomly selected actions. Data for self-play was collected over a week on 5 GPUs and generated 5 million game positions. The policy-value network was continually updated using samples from the latest 1 million game positions.<br />
<br />
=== KR-DRL-MES ===<br />
<br />
''Kernel regression-deep reinforcement learning-multi-ends-strategy'': This algorithm makes use of the winning percentage table generated from self-play.<br />
<br />
= Testing and Results =<br />
The authors use data from the public program AyumuGAT’16 to test. Testing is done with a simulated curling program [9]. This simulator does not deal with changing ice conditions, or sweeping, but does deal with stone trajectories and collisions.<br />
<br />
== Comparison of KR-DL-UCT and DL-UCT ==<br />
<br />
The first test compares an algorithm trained with kernel regression with an algorithm trained without kernel regression, to show the contribution that kernel regression adds to the performance. Both algorithms have networks initialised with the supervised learning, and then trained with two different algorithms for self-play. KR-DL-UCT uses the algorithm described above. The authors do not go into detail on how DL-UCT selects shots, but state that a constant is set to allow exploration.<br />
<br />
As an evaluation, both algorithms play 2000 games against the DL-UCT algorithm, which is frozen after supervised training. 1000 games are played with the algorithm taking the first shot, and 1000 with it taking the second. The games were two-end games. The figure below shows each algorithm's winning percentage given different amounts of training data. While the DL-UCT outperforms the supervised-training-only-DL-UCT algorithm, the KR-DL-UCT algorithm performs much better.<br />
<br />
<center>[[File:curling_KR_test.png | 400px]]</center><br />
<br />
== Matches ==<br />
<br />
Finally, to test the performance of their multiple algorithms, the authors run matches between their algorithms and other existing programs. Each algorithm plays 200 matches against each other program, 100 of which are played as the first-playing team, and 100 as the second-playing team. Only 1 program was able to out-perform the KR-DRL algorithm. The authors state that this program, ''JiritsukunGAT'17'', also uses a deep network and hand-crafted features. However, the KR-DRL-MES algorithm was still able to out-perform this. Figure 4 shows the Elo ratings of the different programs. Note that the programs in blue are those created by the authors. They also played some games between their KR-DRL-MES and notable programs. Table 1 shows the details of the match results. ''JiritsukunGAT'17'' shows a similar level of performance but KR-DRL-MES is still the winner.<br />
<br />
<br />
<br />
[[File:curling_ratings.png|600px|thumb|center|Figure 4. Elo rating and winning percentages of our models and GAT rankers. Each match has 200 games (each program plays 100 pre-ordered games), because the player which has the last shot (the hammer shot) in each end would have an advantage.]]<br />
<br />
<br />
[[File:ttt.png|600px|thumb|center|Table 1. The 8-end game results for KR-DRL-MES against other programs alternating the opening player each game. The matches are held by following the rules of the latest GAT competition.]]<br />
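<br />
For readers unfamiliar with Elo ratings, the rating gaps in Figure 4 can be read through the standard Elo expected-score formula. The sketch below assumes the conventional base-10, 400-point scale; the summary does not say which scale the authors used.<br />

```python
def elo_expected_score(rating_a, rating_b):
    """Expected score of player A against player B (a draw counts as half)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
```
<br />
Under this convention, equal ratings give an expected score of 0.5, and a 400-point advantage corresponds to an expected score of about 0.91.<br />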
<br />
= Conclusion & Critique =<br />
<br />
The authors have presented a new framework which incorporates a deep neural network for learning game strategy with a kernel-based Monte Carlo tree search from a continuous space. Without the use of any hand-crafted feature, their policy-value network is successfully trained using supervised learning followed by reinforcement learning with a high-fidelity simulator for the Olympic sport of curling. Following are my critiques on the paper:<br />
<br />
== Strengths ==<br />
<br />
This algorithm out-performs other high-performance algorithms (including past competition champions).<br />
<br />
I think the paper does a decent job of comparing the performance of their algorithm to others. They are able to clearly show the benefits of many of their additions.<br />
<br />
The authors do seem to be able to adapt strategies similar to those used in Go and other games to the continuous action-space domain. In addition, the final strategy needs no hand-crafted features for learning.<br />
<br />
== Weaknesses ==<br />
<br />
Sometimes, I found this paper difficult to follow. One problem was that the algorithms were introduced first, and then how they were used was described. So when the paper stated that self-play shots were taken after 400 simulations, it seemed unclear what simulations were being run and at what stage of the algorithm (e.g., MCTS simulations, simulations sped up by using the value network, full simulations on the curling simulator). In particular, both the MCTS statistics and the policy-value network could be used to estimate both action probabilities and state values, so it is difficult to tell which is used in which case. There was also no clear distinction between discrete-space actions and continuous-space actions.<br />
<br />
While I think the comparison of different algorithms was done well, I believe it still lacked significant details. There were one-off claims mentioned in the paper which would have been nice to see as results. These include the statement that having a policy-value network in place of two networks led to better performance.<br />
<br />
At this point, the algorithms used still rely on initialization by a pre-made program.<br />
<br />
There was little theoretical development or justification done in this paper.<br />
<br />
While curling is an interesting choice for demonstrating the algorithm, the fact that the simulations used did not support many of the key aspects of curling (ice conditions, sweeping) seems very limiting. Another game, such as pool, would likely have offered some of the same challenges while allowing higher-fidelity simulations/training.<br />
<br />
While the spatial placements of stones were discretized in a grid, the curl of thrown stones was discretized to only +/-1. This seems like it may limit learning high- and low-spin moves. It should be noted that having zero spins is not commonly used, to the best of my knowledge.<br />
<br />
=References=<br />
# Lee, K., Kim, S., Choi, J. & Lee, S. "Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling." Proceedings of the 35th International Conference on Machine Learning, in PMLR 80:2937-2946 (2018)<br />
# https://www.baeldung.com/java-monte-carlo-tree-search<br />
# https://jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/<br />
# https://int8.io/monte-carlo-tree-search-beginners-guide/<br />
# https://en.wikipedia.org/wiki/Monte_Carlo_tree_search<br />
# Silver, D., Huang, A., Maddison, C., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, pp. 484–489, 2016.<br />
# Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., and Hassabis, D. Mastering the game of Go without human knowledge. Nature, pp. 354–359, 2017.<br />
# Yamamoto, M., Kato, S., and Iizuka, H. Digital curling strategy based on game tree search. In Proceedings of the IEEE Conference on Computational Intelligence and Games, CIG, pp. 474–480, 2015.<br />
# Ohto, K. and Tanaka, T. A curling agent based on the Monte-Carlo tree search considering the similarity of the best action among similar states. In Proceedings of Advances in Computer Games, ACG, pp. 151–164, 2017.<br />
# Ito, T. and Kitasei, Y. Proposal and implementation of digital curling. In Proceedings of the IEEE Conference on Computational Intelligence and Games, CIG, pp. 469–473, 2015.<br />
# Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, ICML, pp. 448–456, 2015.<br />
# Nair, V. and Hinton, G. Rectified linear units improve restricted boltzmann machines.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=CapsuleNets&diff=42148CapsuleNets2018-11-30T23:03:37Z<p>Z43ma: </p>
<hr />
<div>The paper "Dynamic Routing Between Capsules" was written by three researchers at Google Brain: Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. This paper was published and presented at the 31st Conference on Neural Information Processing Systems (NIPS 2017) in Long Beach, California. The same three researchers recently published a highly related paper "[https://openreview.net/pdf?id=HJWLfGWRb Matrix Capsules with EM Routing]" for ICLR 2018.<br />
<br />
=Motivation=<br />
<br />
Ever since AlexNet eclipsed the performance of competing architectures in the 2012 ImageNet challenge, convolutional neural networks have maintained their dominance in computer vision applications. Despite the recent successes and innovations brought about by convolutional neural networks, some assumptions made in these networks are perhaps unwarranted and deficient. Using a novel neural network architecture, the authors create CapsuleNets, a network that they claim is able to learn image representations in a more robust, human-like manner. With only a 3-layer capsule network, they achieved near state-of-the-art results on MNIST.<br />
<br />
The activities of the neurons within an active capsule represent the various properties of a particular entity that is present in the image. These properties can include many different types of instantiation parameter such as pose (position, size, orientation), deformation, velocity, albedo, hue, texture, etc. One very special property is the existence of the instantiated entity in the image. An obvious way to represent existence is by using a separate logistic unit whose output is the probability that the entity exists. This paper explores an interesting alternative which is to use the overall length of the vector of instantiation parameters to represent the existence of the entity and to force the orientation of the vector to represent the properties of the entity. The length of the vector output of a capsule cannot exceed 1 because of an application of a non-linearity that leaves the orientation of the vector unchanged but scales down its magnitude.<br />
<br />
The fact that the output of a capsule is a vector makes it possible to use a powerful dynamic routing mechanism to ensure that the output of the capsule gets sent to an appropriate parent in the layer above. Initially, the output is routed to all possible parents but is scaled down by coupling coefficients that sum to 1. For each possible parent, the capsule computes a “prediction vector” by multiplying its own output by a weight matrix. If this prediction vector has a large scalar product with the output of a possible parent, there is top-down feedback which increases the coupling coefficient for that parent and decreases it for other parents. This increases the contribution that the capsule makes to that parent, thus further increasing the scalar product of the capsule’s prediction with the parent’s output. This type of “routing-by-agreement” should be far more effective than the very primitive form of routing implemented by max-pooling, which allows neurons in one layer to ignore all but the most active feature detector in a local pool in the layer below. The authors demonstrate that their dynamic routing mechanism is an effective way to implement the “explaining away” that is needed for segmenting highly overlapping objects.<br />
<br />
==Adversarial Examples==<br />
<br />
First discussed by Christian Szegedy et al. in late 2013, adversarial examples have been heavily discussed by the deep learning community as a potential security threat to AI learning. Adversarial examples are defined as inputs that an attacker creates intentionally to fool a machine learning model. An adversarial example is shown below: <br />
<br />
[[File:adversarial_img_1.png |center]]<br />
<br />
To human eyes, the image appears to be a panda both before and after noise is injected into the image, whereas the trained ConvNet model classifies the noisy image as a gibbon with almost 100% certainty. The fact that the network is unable to classify the above image as a panda after the epsilon perturbation leads to many potential security risks in AI-dependent systems such as self-driving vehicles. Although various methods have been suggested to combat adversarial examples, robust defenses are hard to construct due to the inherent difficulties in constructing theoretical models for the adversarial example crafting process. However, beyond the fact that these examples may serve as a security threat, they emphasize that these convolutional neural networks do not learn image classification/object detection patterns the same way that a human would. Rather than identifying the core features of a panda such as its eyes, mouth, nose, and the gradient changes in its black/white fur, the convolutional neural network seems to be learning image representations in a completely different manner. Deep learning researchers often attempt to model neural networks after human learning, and it is clear that further steps must be taken to make ConvNets robust against targeted noise perturbations.<br />
<br />
==Drawbacks of CNNs==<br />
Hinton claims that the key fault with traditional CNNs lies within the pooling function. Although pooling builds translational invariance into the network, it fails to preserve spatial relationships between objects. When we pool, we effectively reduce a <math>k \times k</math> window of convolved cells into a single scalar. This results in a desired local invariance without inhibiting the network's ability to detect features but causes valuable spatial information to be lost.<br />
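<br />
A tiny example makes the loss of spatial information concrete. The sketch below implements non-overlapping max pooling over a 2D list; the helper name <code>max_pool</code> and the toy inputs are illustrative only.<br />

```python
def max_pool(grid, k):
    """Non-overlapping k x k max pooling over a 2D list of numbers."""
    rows, cols = len(grid), len(grid[0])
    return [[max(grid[r + dr][c + dc] for dr in range(k) for dc in range(k))
             for c in range(0, cols - k + 1, k)]
            for r in range(0, rows - k + 1, k)]

# The same feature detected at two different positions inside one window.
top_left = [[9, 0],
            [0, 0]]
bottom_right = [[0, 0],
                [0, 9]]

# Both inputs pool to the same scalar, so the position is no longer recoverable.
same_output = max_pool(top_left, 2) == max_pool(bottom_right, 2)
```
<br />
Both inputs pool to <code>[[9]]</code>: the network still knows the feature is present, but not where inside the window it was found.<br />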
<br />
Also, in CNNs, higher-level features combine lower-level features as a weighted sum: activations of a previous layer multiplied by the current layer's weight, then passed to another activation function. In this process, pose relationship between simpler features is not part of the higher-level feature.<br />
<br />
In the example below, the network is able to detect similar features (eyes, mouth, nose, etc.) within both images, but fails to recognize that one image is a human face while the other is a Picasso-esque rearrangement of those features, due to the CNN's inability to encode spatial relationships after multiple pooling layers.<br />
In deep learning, the activation level of a neuron is often interpreted as the likelihood of detecting a specific feature. CNNs are good at detecting features but less effective at exploring the spatial relationships among features (perspective, size, orientation). <br />
<br />
[[File:Equivariance Face.png |center]]<br />
<br />
Here, the CNN could wrongly activate the neuron for face detection. Because it does not register the mismatch in spatial orientation and size, the activation for face detection will be too high.<br />
<br />
Conversely, we hope that a CNN can recognize that both of the following pictures contain a kitten. Unfortunately, when we feed the two images into a ResNet50 architecture, only the first image is correctly classified, while the second image is predicted to be a guinea pig.<br />
<br />
<br />
[[File:kitten.jpeg |center]]<br />
<br />
<br />
[[File:kitten-rotated-180.jpg |center]]<br />
<br />
For a more in depth discussion on the problems with ConvNets, please listen to Geoffrey Hinton's talk "What is wrong with convolutional neural nets?" given at MIT during the Brain & Cognitive Sciences - Fall Colloquium Series (December 4, 2014).<br />
<br />
==Intuition for Capsules==<br />
Human vision ignores irrelevant details by using a carefully determined sequence of fixation points to ensure that only a tiny fraction of the optic array is ever processed at the highest resolution. Hinton argues that our brains reason about visual information by deconstructing it into a hierarchical representation which we then match to familiar patterns and relationships from memory. The key difference between this understanding and the functionality of CNNs is that recognition of an object should not depend on the angle from which it is viewed. <br />
<br />
To enforce rotational and translational equivariance, Capsule Networks store and preserve hierarchical pose relationships between objects. The core idea behind capsule theory is the explicit numerical representation of relative relationships between different objects within an image. With these relationships built into the model, the network is able to recognize a newly seen object as a rotated view of a previously seen object. For example, the image below shows the Statue of Liberty from five different angles. Even if a person had only seen the Statue of Liberty from one angle, they would still be able to ascertain that all five pictures below contain the same object (just from a different angle).<br />
<br />
[[File:Rotational Invariance.jpeg |center]]<br />
<br />
Building on this idea of hierarchical representation of spatial relationships between key entities within an image, the authors introduce Capsule Networks. Compared with traditional CNNs, Capsule Networks are better equipped to classify rotated inputs correctly. Furthermore, the authors managed to achieve state of the art results on MNIST using a fraction of the training samples that alternative state of the art networks require.<br />
<br />
=Background, Notation, and Definitions=<br />
<br />
==What is a Capsule==<br />
"Each capsule learns to recognize an implicitly defined visual entity over a limited domain of viewing conditions and deformations and it outputs both the probability that the entity is present within its limited domain and a set of “instantiation parameters” that may include the precise pose, lighting, and deformation of the visual entity relative to an implicitly defined canonical version of that entity. When the capsule is working properly, the probability of the visual entity being present is locally invariant — it does not change as the entity moves over the manifold of possible appearances within the limited domain covered by the capsule. The instantiation parameters, however, are “equivariant” — as the viewing conditions change and the entity moves over the appearance manifold, the instantiation parameters change by a corresponding amount because they are representing the intrinsic coordinates of the entity on the appearance manifold."<br />
<br />
In essence, capsules store object properties in a vector form; probability of detection is encoded as the vector's length, while spatial properties are encoded as the individual vector components. Thus, when a feature is present but the image captures it under a different angle, the probability of detection remains unchanged.<br />
<br />
A brief overview of capsules can be found in other papers by the same authors. To quote from [https://openreview.net/pdf?id=HJWLfGWRb this paper]:<br />
<br />
<blockquote><br />
A capsule network consists of several layers of capsules. The set of capsules in layer L is denoted<br />
as <math>\Omega_L</math>. Each capsule has a 4x4 pose matrix, <math>M</math>, and an activation probability, <math>a</math>. These are like the<br />
activities in a standard neural net: they depend on the current input and are not stored. In between<br />
each capsule i in layer L and each capsule j in layer L + 1 is a 4x4 trainable transformation matrix,<br />
<math>W_{ij}</math> . These <math>W_{ij}</math>'s (and two learned biases per capsule) are the only stored parameters and they<br />
are learned discriminatively. The pose matrix of capsule i is transformed by <math>W_{ij}</math> to cast a vote<br />
<math>V_{ij} = M_iW_{ij}</math> for the pose matrix of capsule j. The poses and activations of all the capsules in layer<br />
L + 1 are calculated by using a non-linear routing procedure which gets as input <math>V_{ij}</math> and <math>a_i</math> for all<br />
<math>i \in \Omega_L, j \in \Omega_{L+1}</math><br />
</blockquote><br />
<br />
==Notation==<br />
<br />
We want the length of the output vector of a capsule to represent the probability that the entity represented by the capsule is present in the current input. The paper performs a non-linear squashing operation to ensure that vector length falls between 0 and 1, with shorter vectors (entities less likely to exist) being shrunk towards 0. <br />
<br />
\begin{align} \mathbf{v}_j &= \frac{||\mathbf{s}_j||^2}{1+ ||\mathbf{s}_j||^2} \frac{\mathbf{s}_j}{||\mathbf{s}_j||} \end{align}<br />
<br />
where <math>\mathbf{v}_j</math> is the vector output of capsule <math>j</math> and <math>\mathbf{s}_j</math> is its total input.<br />
<br />
For all but the first layer of capsules, the total input to a capsule <math>\mathbf{s}_j</math> is a weighted sum over all “prediction vectors” <math>\hat{\mathbf{u}}_{j|i}</math> from the capsules in the layer below. Each prediction vector is produced by multiplying the output <math>\mathbf{u}_i</math> of a capsule in the layer below by a weight matrix <math>\mathbf{W}_{ij}</math>:<br />
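<br />
The squashing non-linearity above can be written in a few lines. This is a plain-Python sketch for a single vector; the zero-vector guard is an addition for numerical safety and is not part of the formula.<br />

```python
import math

def squash(s):
    """Scale vector s so its length lies in [0, 1) while keeping its direction."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        # Guard against division by zero; a zero input stays zero.
        return [0.0 for _ in s]
    norm = math.sqrt(norm_sq)
    scale = norm_sq / (1.0 + norm_sq) / norm
    return [scale * x for x in s]
```
<br />
For instance, an input of length 5 is mapped to an output of length 25/26, while an input of length 0.1 is shrunk to roughly 0.0099.<br />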
<br />
\begin{align}<br />
\mathbf{s}_j = \sum_i c_{ij}\hat{\mathbf{u}}_{j|i}, ~\hspace{0.5em} \hat{\mathbf{u}}_{j|i}= \mathbf{W}_{ij}\mathbf{u}_i<br />
\end{align}<br />
where the <math>c_{ij}</math> are coupling coefficients that are determined by the iterative dynamic routing process.<br />
<br />
The coupling coefficients between capsule <math>i</math> and all the capsules in the layer above sum to 1 and are determined by a “routing softmax” whose initial logits <math>b_{ij}</math> are the log prior probabilities that capsule <math>i</math> should be coupled to capsule <math>j</math>.<br />
<br />
\begin{align}<br />
c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}<br />
\end{align}<br />
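<br />
A direct transcription of the routing softmax for one lower-level capsule is sketched below. Subtracting the maximum logit is a standard numerical-stability step added here and does not change the result.<br />

```python
import math

def routing_softmax(logits):
    """Turn the logits b_ij of one capsule i into coupling coefficients c_ij."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(b - m) for b in logits]
    total = sum(exps)
    return [e / total for e in exps]
```
<br />
With all logits equal, as at the start of routing when they are initialized to the same value, the output is uniformly split across the capsules in the layer above.<br />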
<br />
=Network Training and Dynamic Routing=<br />
<br />
==Understanding Capsules==<br />
The notation can get somewhat confusing, so I will provide intuition behind the computational steps within a capsule. The following image is taken from naturomic's talk on Capsule Networks.<br />
<br />
[[File:CapsuleNets.jpeg|center|800px]]<br />
<br />
The above image illustrates the key mathematical operations happening within a capsule (and compares them to the structure of a neuron). Although the operations are rather straightforward, it's crucial to note that the capsule applies an affine transformation to each input vector. The length of each input vector <math>\mathbf{u}_{i}</math> represents the probability of entity <math>i</math> existing in a lower level. This vector is then reoriented with an affine transform using <math>\mathbf{W}_{ij}</math> matrices that encode spatial relationships between entity <math>\mathbf{u}_{i}</math> and other lower level features.<br />
<br />
We illustrate the intuition behind vector-to-vector matrix multiplication within capsules using the following example: if vectors <math>\mathbf{u}_{1}</math>, <math>\mathbf{u}_{2}</math>, and <math>\mathbf{u}_{3}</math> represent detection of eyes, nose, and mouth respectively, then after multiplication with trained weight matrices <math>\mathbf{W}_{ij}</math> (where j denotes existence of a face), we should get a rough idea of the location of the higher level feature (face), similar to the image below.<br />
<br />
[[File:Predictions.jpeg |center]]<br />
<br />
==Dynamic Routing==<br />
A capsule <math>i</math> in a lower-level layer needs to decide how to send its output vector to higher-level capsules <math>j</math>. This decision is made with probability proportional to <math>c_{ij}</math>. If there are <math>M</math> capsules in the level that capsule <math>i</math> routes to, then we know the following properties about <math>c_{ij}</math>: <math>\sum_{j=1}^M c_{ij} = 1, c_{ij} \geq 0</math><br />
<br />
In essence, the <math>\{c_{ij}\}_{j=1}^M</math> denotes a discrete probability distribution with respect to capsule <math>i</math>'s output location. Lower level capsules decide which higher level capsules to send vectors into by adjusting the corresponding routing weights <math>\{c_{ij}\}_{j=1}^M</math>. After a few iterations in training, numerous vectors will have already been sent to all higher level capsules. Based on the similarity between the current vector being routed and all vectors already sent into the higher level capsules, we decide which capsule to send the current vector into.<br />
[[File:Dynamic Routing.png|center|900px]]<br />
<br />
From the image above, we notice that a cluster of points similar to the current vector has already been routed into capsule K, while most points in capsule J are highly dissimilar. It thus makes more sense to route the current observations into capsule K; we adjust the corresponding weights upward during training.<br />
<br />
These weights are determined through the dynamic routing procedure:<br />
<br />
<br />
[[File:Routing Algo.png|900px]]<br />
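<br />
The routing procedure can be sketched in plain Python as follows. This follows the routing-by-agreement loop of the original paper as I understand it (softmax over logits, weighted sum, squash, agreement update); it is a minimal illustration on nested lists with illustrative names, not an efficient or official implementation.<br />

```python
import math

def squash(s):
    """Shrink vector s to length in [0, 1) while keeping its direction."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0 for _ in s]
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(b - m) for b in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_routing(u_hat, num_iterations=3):
    """Route prediction vectors to higher-level capsules by agreement.

    u_hat[i][j] is the prediction vector of lower capsule i for higher
    capsule j. Returns (v, c): the higher-level outputs and the coupling
    coefficients used in the last iteration.
    """
    num_lower, num_upper = len(u_hat), len(u_hat[0])
    dim = len(u_hat[0][0])
    b = [[0.0] * num_upper for _ in range(num_lower)]  # routing logits
    for _ in range(num_iterations):
        # Coupling coefficients: softmax over the parents of each capsule i.
        c = [softmax(row) for row in b]
        v = []
        for j in range(num_upper):
            # Total input s_j is the coupling-weighted sum of predictions.
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(num_lower))
                   for d in range(dim)]
            v.append(squash(s_j))
        # Agreement update: raise b_ij when prediction and output align.
        for i in range(num_lower):
            for j in range(num_upper):
                b[i][j] += sum(x * y for x, y in zip(u_hat[i][j], v[j]))
    return v, c
```
<br />
If two lower capsules make the same prediction for one parent but opposite predictions for another, the coupling coefficients shift toward the parent on which they agree.<br />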
<br />
Note that the convergence of this routing procedure has been questioned. Although it is empirically shown that this procedure converges, the convergence has not been proven.<br />
<br />
Although dynamic routing is not the only manner in which we can encode relationships between capsules, the premise of the paper is to demonstrate the capabilities of capsules under a simple implementation. Since the paper was released in 2017, numerous alternative routing implementations have been released including an EM matrix routing algorithm by the same authors (ICLR 2018).<br />
<br />
=Architecture=<br />
The capsule network architecture given by the authors has 11.36 million trainable parameters. The paper itself is not very detailed on exact implementation of each architectural layer, and hence it leaves some degree of ambiguity on coding various aspects of the original network. The capsule network has 6 overall layers, with the first three layers denoting components of the encoder, and the last 3 denoting components of the decoder.<br />
<br />
==Loss Function==<br />
[[File:Loss Function.png|900px]]<br />
<br />
The cost function looks complicated, but can be broken down into intuitive components. Before diving into the equation, remember that the length of a capsule's output vector denotes the probability that the corresponding object exists. The left term of the equation applies to the capsule of the class that is actually present in the image and is zero for all other classes: it penalizes that capsule when its vector norm falls below a fixed quantity <math>m^+ := 0.9</math>. The right term applies to the capsules of absent classes and penalizes the network's confidence in an incorrect label: it is non-zero only when the vector norm exceeds <math>m^- := 0.1</math>.<br />
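<br />
The margin loss can be sketched for a single example as follows. The margins 0.9 and 0.1 come from the text above; the down-weighting factor <code>lam=0.5</code> for absent classes is taken from the original paper rather than from this summary, and the function name is illustrative.<br />

```python
def margin_loss(norms, target, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Margin loss for one example.

    norms[k] is the length of the output vector of class capsule k and
    `target` is the index of the class actually present in the image.
    """
    total = 0.0
    for k, norm in enumerate(norms):
        if k == target:
            # Present class: penalize lengths below m_plus.
            total += max(0.0, m_plus - norm) ** 2
        else:
            # Absent class: penalize lengths above m_minus (down-weighted).
            total += lam * max(0.0, norm - m_minus) ** 2
    return total
```
<br />
A confident, correct output such as <code>margin_loss([0.95, 0.05], 0)</code> incurs zero loss, whereas an undecided output such as <code>[0.5, 0.5]</code> is penalized by both terms.<br />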
<br />
A graphical representation of loss function values under varying vector norms is given below.<br />
[[File:Loss function chart.png|900px]]<br />
<br />
==Encoder Layers==<br />
The main experiments within this paper were conducted on the MNIST dataset, and thus the architecture is built to classify this dataset. For more complex datasets, the experiments were less promising. <br />
<br />
[[File:Architecture.png|center|900px]]<br />
<br />
The encoder layer takes in a 28x28 MNIST image and learns a 16 dimensional representation of instantiation parameters.<br />
<br />
'''Layer 1: Convolution''': <br />
This layer is a standard convolution layer. Using kernels with size 9x9x1, a stride of 1, and a ReLU activation function, we detect the 2D features within the image.<br />
<br />
'''Layer 2: PrimaryCaps''': <br />
We represent the low level features detected during convolution as 32 channels of primary capsules. Each primary capsule is 8-dimensional: it applies eight convolutional kernels with stride 2 to the output of the convolution layer, and the resulting vectors are fed into the DigitCaps layer.<br />
<br />
'''Layer 3: DigitCaps''': <br />
This layer contains 10 digit capsules, one for each digit. As explained in the dynamic routing procedure, each input vector from the PrimaryCaps layer has its own corresponding weight matrix <math>W_{ij}</math>. Using the routing coefficients <math>c_{ij}</math> and the temporary routing logits <math>b_{ij}</math>, the DigitCaps layer is trained to output ten 16-dimensional vectors. The length of the <math>i^{th}</math> vector in this layer corresponds to the probability of detection of digit <math>i</math>.<br />
<br />
==Decoder Layers==<br />
The decoder aims to train the capsules to extract meaningful features for image detection/classification. During training, it takes the 16-dimensional instantiation vector of the correct (not predicted) DigitCaps capsule and attempts to recreate the 28x28 MNIST image as closely as possible. Using the reconstruction error (Euclidean distance between the reconstructed image and original image) as a loss, we tune the capsules to encode features that are meaningful within the actual image.<br />
<br />
[[File:Decoder.png|center|900px]]<br />
<br />
The decoder consists of three fully connected layers and transforms a 16x1 vector from the encoder into a 28x28 image.<br />
<br />
In addition to the DigitCaps margin loss, we add the reconstruction error as a form of regularization. During training, everything but the activity vector of the correct digit capsule is masked, and this activity vector is then used to reconstruct the input image. We minimize the sum of squared differences between the outputs of the decoder's logistic units and the pixel intensities of the original image. We scale down this reconstruction loss by 0.0005 so that it does not dominate the margin loss during training. As illustrated below, reconstructions from the 16D output of the CapsNet are robust while keeping only important details.<br />
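<br />
The masking step and the combined objective can be sketched as follows. This is a hedged illustration on flattened images, assuming the reconstruction penalty is the sum of squared pixel differences (the squared Euclidean distance); the function names are hypothetical, and the decoder itself is left out.<br />

```python
RECON_WEIGHT = 0.0005  # scale factor quoted in the text above


def mask_all_but(capsule_outputs, keep):
    """Zero every capsule vector except the one at index `keep`."""
    return [list(v) if k == keep else [0.0] * len(v)
            for k, v in enumerate(capsule_outputs)]


def reconstruction_loss(original, reconstructed):
    """Sum of squared pixel differences between two flattened images."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed))


def total_loss(margin, original, reconstructed):
    """Margin loss plus the down-weighted reconstruction penalty."""
    return margin + RECON_WEIGHT * reconstruction_loss(original, reconstructed)
```

Even a completely wrong 784-pixel reconstruction (every pixel off by 1) adds only 0.0005 × 784 = 0.392 to the objective, which shows how the scaling keeps the regularizer from dominating.<br />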
<br />
[[File:Reconstruction.png|center|900px]]<br />
<br />
=MNIST Experimental Results=<br />
<br />
==Accuracy==<br />
The paper tests on the MNIST dataset, which has 60K training examples and 10K test examples. Wan et al. [2013] achieve 0.21% test error with ensembling and data augmentation by rotation and scaling, and 0.39% without them. As shown in Table 1, the authors achieve 0.25% test error with only a 3 layer network; previously, this figure had only been beaten by much deeper networks. This result shows the importance of routing and of the reconstruction regularizer, both of which boost performance. Moreover, the CapsNet reaches these accuracies with far fewer parameters than the baseline model.<br />
<br />
[[File:Accuracies.png|center|900px]]<br />
<br />
==What Capsules Represent for MNIST==<br />
The following figure shows the digit representation under capsules. Each row shows the reconstruction when one of the 16 dimensions in the DigitCaps representation is tweaked by intervals of 0.05 in the range [−0.25, 0.25]. By tweaking the values, we notice how the reconstruction changes, and thus get a sense of what each dimension represents. The authors found that some dimensions represent global properties of the digits, while others represent localized properties.<br />
[[File:CapsuleReps.png|center|900px]]<br />
<br />
One example the authors provide is that different dimensions are used for the length of the ascender of a 6 and for the size of its loop. The variations include stroke thickness, skew and width, as well as digit-specific variations. The authors visualize what each dimension represents by feeding a perturbed vector into the decoder network.<br />
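<br />
The perturbation scheme described above can be sketched as follows. This is only an illustration of how the perturbed vectors could be generated; the function names are hypothetical, and passing each variant through the trained decoder is outside the sketch.<br />

```python
def perturbation_grid(low=-0.25, high=0.25, step=0.05):
    """Offsets from low to high inclusive, rounded to avoid float drift."""
    count = int(round((high - low) / step)) + 1
    return [round(low + i * step, 2) for i in range(count)]


def perturb_dimension(capsule_vector, dim):
    """Return one perturbed copy of the vector per offset, varying only `dim`."""
    variants = []
    for delta in perturbation_grid():
        v = list(capsule_vector)
        v[dim] += delta
        variants.append(v)
    return variants
```

With a step of 0.05 over [−0.25, 0.25] this yields 11 variants per dimension, which corresponds to one row of reconstructions in the figure.<br />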
<br />
==Robustness of CapsNet==<br />
The authors conclude that DigitCaps capsules learn more robust representations for each digit class than traditional CNNs. The trained CapsNet becomes moderately robust to small affine transformations in the test data.<br />
<br />
To compare the robustness of CapsNet to affine transformations against traditional CNNs, both models (CapsNet and a traditional CNN with MaxPooling and DropOut) were trained on a padded and translated MNIST training set, in which each example is an MNIST digit placed randomly on a black background of 40 × 40 pixels. The networks were then tested on the [http://www.cs.toronto.edu/~tijmen/affNIST/ affNIST] dataset (MNIST digits with random affine transformation). An under-trained CapsNet which achieved 99.23% accuracy on the MNIST test set achieved a corresponding 79% accuracy on the affNIST test set. A traditional CNN achieved similar accuracy (99.22%) on the MNIST test set, but only 66% on the affNIST test set.<br />
<br />
=MultiMNIST & Other Experiments=<br />
<br />
==MultiMNIST==<br />
To evaluate the performance of the model on highly overlapping digits, the authors generate a 'MultiMNIST' dataset. In MultiMNIST, each image consists of two overlaid MNIST digits from the same set (train or test) but from different classes. The results indicate a classification error rate of 5%. Additionally, CapsNet can be used to segment the image into the two digits that compose it. Moreover, the model is able to deal with the overlaps and reconstruct the digits correctly, since each digit capsule can learn the style from the votes of the PrimaryCaps layer (Figure 5).<br />
<br />
There are some additional steps to generating the MultiMNIST dataset.<br />
<br />
1. Both images are shifted by up to 4 pixels in each direction, resulting in a 36 × 36 image. The bounding boxes of the two digits overlap by approximately 80% on average, so the shift is used to keep both digits identifiable (since there is no RGB difference learnable by the network to separate the digits).<br />
<br />
2. The label becomes a vector of two numbers, representing the original digit and the randomly generated (and overlaid) digit.<br />
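<br />
The two generation steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' generation script: the digits are combined by adding pixel intensities and clipping at 1 (the clipping is mentioned in the caption of Figure 5), shifts are drawn uniformly, and all names are hypothetical.<br />

```python
import random

CANVAS = 36     # side length of a MultiMNIST image
DIGIT = 28      # side length of an MNIST digit
MAX_SHIFT = 4   # maximum shift in each direction, in pixels


def place_shifted(digit, dx, dy):
    """Copy a 28x28 digit onto a blank 36x36 canvas, offset from the centre."""
    canvas = [[0.0] * CANVAS for _ in range(CANVAS)]
    top = (CANVAS - DIGIT) // 2 + dy
    left = (CANVAS - DIGIT) // 2 + dx
    for r in range(DIGIT):
        for c in range(DIGIT):
            canvas[top + r][left + c] = digit[r][c]
    return canvas


def overlay(digit_a, label_a, digit_b, label_b, rng=random):
    """Overlay two digits of different classes; return (image, label pair)."""
    if label_a == label_b:
        raise ValueError("MultiMNIST pairs digits of different classes")
    a = place_shifted(digit_a, rng.randint(-MAX_SHIFT, MAX_SHIFT),
                      rng.randint(-MAX_SHIFT, MAX_SHIFT))
    b = place_shifted(digit_b, rng.randint(-MAX_SHIFT, MAX_SHIFT),
                      rng.randint(-MAX_SHIFT, MAX_SHIFT))
    # Assumed combination rule: add intensities, then clip at 1.
    image = [[min(1.0, a[r][c] + b[r][c]) for c in range(CANVAS)]
             for r in range(CANVAS)]
    return image, (label_a, label_b)
```

Because a centred 28 × 28 digit leaves a 4-pixel border on a 36 × 36 canvas, any shift of up to 4 pixels keeps the digit fully inside the image.<br />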
<br />
<br />
<br />
[[File:CapsuleNets MultiMNIST.PNG|600px|thumb|center|Figure 5: Sample reconstructions of a CapsNet with 3 routing iterations on MultiMNIST test dataset.<br />
The two reconstructed digits are overlayed in green and red as the lower image. The upper image<br />
shows the input image. L:(l1; l2) represents the label for the two digits in the image and R:(r1; r2)<br />
represents the two digits used for reconstruction. The two right most columns show two examples<br />
with wrong classification reconstructed from the label and from the prediction (P). In the (2; 8)<br />
example the model confuses 8 with a 7 and in (4; 9) it confuses 9 with 0. The other columns have<br />
correct classifications and show that the model accounts for all the pixels while being able to assign<br />
one pixel to two digits in extremely difficult scenarios (column 1 − 4). Note that in dataset generation<br />
the pixel values are clipped at 1. The two columns with the (*) mark show reconstructions from a<br />
digit that is neither the label nor the prediction. These columns suggest that the model is not just<br />
finding the best fit for all the digits in the image including the ones that do not exist. Therefore in case<br />
of (5; 0) it cannot reconstruct a 7 because it knows that there is a 5 and 0 that fit best and account for<br />
all the pixels. Also, in the case of (8; 1) the loop of 8 has not triggered 0 because it is already accounted<br />
for by 8. Therefore it will not assign one pixel to two digits if one of them does not have any other<br />
support.]]<br />
<br />
==Other datasets==<br />
The authors also tested the proposed capsule model on the CIFAR10 dataset and achieved an error rate of 10.6%. The model tested was an ensemble of 7 models. Each of the models in the ensemble had the same architecture as the model used for MNIST (apart from 3 colour channels in the input and 64 different types of primary capsules being used). These 7 models were trained on 24x24 patches of the training images with 3 routing iterations. During experimentation, the authors also found that adding an additional none-of-the-above category helped improve the overall performance. The error rate achieved is comparable to what standard CNN models achieved when they were first applied to CIFAR10. According to the authors, one of the reasons for the low performance is that the backgrounds in CIFAR10 images are too varied to be adequately modeled by a reasonably sized capsule net.<br />
<br />
The proposed model was also evaluated on the SVHN dataset. The network was much smaller and was trained on only the small training set of 73257 images. It still managed to achieve an error rate of 4.3% on the test set.<br />
<br />
=Critique=<br />
Although the network performs very favorably in the authors' experiments, it has a long way to go on more complex datasets. On CIFAR10, the network achieved subpar results, and the experimental results seem to worsen as the problem becomes more complex. This is to be expected, since these networks are still in their early stages; later innovations might come in the upcoming years or decades. It would also be wise to apply the model to larger datasets to make its usefulness more convincing. MNIST has simple patterns, and even if the model were to be evaluated on only one dataset, MNIST would be a poor choice, especially since the focus is on human-like vision and digits are not that regular in real life.<br />
<br />
Hinton talks about CapsuleNets revolutionizing areas such as self-driving, but such groundbreaking innovations are far away from CIFAR10, and even further from MNIST. Only time can tell if CapsNets will live up to their hype.<br />
<br />
Moreover, the paper provides no underlying intuition for its main claim, namely that capsule networks preserve the relations between the features extracted by the proposed architecture. An explanation of the intuition behind this idea would go a long way in arguing against CNNs.<br />
<br />
Capsules inherently segment images and learn a lower dimensional embedding in a new manner, which makes them likely to perform well on segmentation and computer vision tasks once further research is done. <br />
<br />
Additionally, these networks are more interpretable than CNNs, and have strong theoretical reasoning for why they could work. Naturally, it would be hard for a new architecture to beat the heavily researched/modified CNNs.<br />
<br />
* ([https://openreview.net/forum?id=HJWLfGWRb]) It is not fully clear how effectively the method can be performed or how scalable it is. Evaluation is performed on a small dataset for shape recognition. The approach will need to be tested on larger, more challenging datasets.<br />
<br />
=Future Work=<br />
The same authors [Geoffrey E. Hinton, Sara Sabour, and Nicholas Frosst] presented another paper, "Matrix Capsules with EM Routing", at ICLR 2018, which achieved better results than the work presented in this paper. They presented a new multi-layered capsule network architecture, implemented an EM routing procedure, and introduced "Coordinate Addition". This new type of capsule network reduced the number of errors by 45% and performed better than a standard CNN on white box adversarial attacks. Capsule architectures are gaining interest because of their ability to achieve equivariance of parts and because they employ a new form of pooling called "routing" (as opposed to max pooling), which groups parts that make similar predictions of the whole to which they belong, rather than relying on spatial co-locality.<br />
Moreover, the authors hint at trying to change the curvature and sensitivities to various factors by introducing a new form of loss function. This may improve the performance of the model on more complicated datasets, which is one of the model's drawbacks.<br />
<br />
Moreover, as mentioned in the critique, good future work for this group would be to make the model more robust across datasets and to achieve acceptable performance on datasets containing the kinds of images regularly seen in real life.<br />
<br />
=References=<br />
#Geoffrey E. Hinton, Sara Sabour, and Nicholas Frosst. Matrix capsules with EM routing. In International Conference on Learning Representations, 2018.<br />
#S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” arXiv preprint arXiv:1710.09829v2, 2017<br />
# Hinton, G. E., Krizhevsky, A. and Wang, S. D. (2011), Transforming Auto-encoders <br />
#Geoffrey Hinton's talk: What is wrong with convolutional neural nets? - Talk given at MIT. Brain & Cognitive Sciences - Fall Colloquium Series. [https://www.youtube.com/watch?v=rTawFwUvnLE ]<br />
#Understanding Hinton’s Capsule Networks - Max Pechyonkin's series [https://medium.com/ai%C2%B3-theory-practice-business/understanding-hintons-capsule-networks-part-i-intuition-b4b559d1159b]<br />
#Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.<br />
#Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014.<br />
#Jia-Ren Chang and Yong-Sheng Chen. Batch-normalized maxout network in network. arXiv preprint arXiv:1511.02583, 2015.<br />
#Dan C Cireşan, Ueli Meier, Jonathan Masci, Luca M Gambardella, and Jürgen Schmidhuber. High-performance neural networks for visual object classification. arXiv preprint arXiv:1102.0183, 2011.<br />
#Ian J Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay Shet. Multi-digit number recognition from street view imagery using deep convolutional neural networks. arXiv preprint arXiv:1312.6082, 2013.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=CapsuleNets&diff=42147CapsuleNets2018-11-30T23:02:48Z<p>Z43ma: </p>
<hr />
<div>The paper "Dynamic Routing Between Capsules" was written by three researchers at Google Brain: Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. This paper was published and presented at the 31st Conference on Neural Information Processing Systems (NIPS 2017) in Long Beach, California. The same three researchers recently published a highly related paper "[https://openreview.net/pdf?id=HJWLfGWRb Matrix Capsules with EM Routing]" for ICLR 2018.<br />
<br />
=Motivation=<br />
<br />
Ever since AlexNet eclipsed the performance of competing architectures in the 2012 ImageNet challenge, convolutional neural networks have maintained their dominance in computer vision applications. Despite the recent successes and innovations brought about by convolutional neural networks, some assumptions made in these networks are perhaps unwarranted and deficient. Using a novel neural network architecture, the authors create CapsuleNets, a network that they claim is able to learn image representations in a more robust, human-like manner. With only a 3 layer capsule network, they achieved near state-of-the-art results on MNIST.<br />
<br />
The activities of the neurons within an active capsule represent the various properties of a particular entity that is present in the image. These properties can include many different types of instantiation parameter such as pose (position, size, orientation), deformation, velocity, albedo, hue, texture, etc. One very special property is the existence of the instantiated entity in the image. An obvious way to represent existence is by using a separate logistic unit whose output is the probability that the entity exists. This paper explores an interesting alternative which is to use the overall length of the vector of instantiation parameters to represent the existence of the entity and to force the orientation of the vector to represent the properties of the entity. The length of the vector output of a capsule cannot exceed 1 because of an application of a non-linearity that leaves the orientation of the vector unchanged but scales down its magnitude.<br />
<br />
The fact that the output of a capsule is a vector makes it possible to use a powerful dynamic routing mechanism to ensure that the output of the capsule gets sent to an appropriate parent in the layer above. Initially, the output is routed to all possible parents but is scaled down by coupling coefficients that sum to 1. For each possible parent, the capsule computes a “prediction vector” by multiplying its own output by a weight matrix. If this prediction vector has a large scalar product with the output of a possible parent, there is top-down feedback which increases the coupling coefficient for that parent and decreases it for other parents. This increases the contribution that the capsule makes to that parent, thus further increasing the scalar product of the capsule’s prediction with the parent’s output. This type of “routing-by-agreement” should be far more effective than the very primitive form of routing implemented by max-pooling, which allows neurons in one layer to ignore all but the most active feature detector in a local pool in the layer below. The authors demonstrate that their dynamic routing mechanism is an effective way to implement the “explaining away” that is needed for segmenting highly overlapping objects.<br />
<br />
==Adversarial Examples==<br />
<br />
First discussed by Christian Szegedy et al. in late 2013, adversarial examples have been heavily discussed by the deep learning community as a potential security threat to AI learning. Adversarial examples are defined as inputs that an attacker intentionally creates to fool a machine learning model. An adversarial example is shown below:<br />
<br />
[[File:adversarial_img_1.png |center]]<br />
<br />
To human eyes, the image appears to be a panda both before and after noise is injected into the image, whereas the trained ConvNet model classifies the noisy image as a gibbon with almost 100% certainty. The fact that the network is unable to classify the above image as a panda after the epsilon perturbation leads to many potential security risks in AI dependent systems such as self-driving vehicles. Although various methods have been suggested to combat adversarial examples, robust defenses are hard to construct due to the inherent difficulties in constructing theoretical models for the adversarial example crafting process. Beyond the security threat they pose, these examples emphasize that convolutional neural networks do not learn image classification/object detection patterns the same way that a human would. Rather than identifying the core features of a panda such as its eyes, mouth, nose, and the gradient changes in its black/white fur, the convolutional neural network seems to be learning image representations in a completely different manner. Deep learning researchers often attempt to model neural networks after human learning, and it is clear that further steps must be taken to make ConvNets robust against targeted noise perturbations.<br />
<br />
==Drawbacks of CNNs==<br />
Hinton claims that the key fault with traditional CNNs lies within the pooling function. Although pooling builds translational invariance into the network, it fails to preserve spatial relationships between objects. When we pool, we effectively reduce a <math>k \times k</math> window of convolved cells to a single scalar. This results in a desired local invariance without inhibiting the network's ability to detect features, but it causes valuable spatial information to be lost.<br />
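<br />
The loss of spatial information can be seen in a toy example. The sketch below is a minimal, hypothetical 2x2 max pooling over nested lists (not a library routine); the two feature maps are made up for illustration and place their strong activations at different positions inside each pooling window.<br />

```python
def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling on a 2D list with even dimensions."""
    pooled = []
    for r in range(0, len(feature_map), 2):
        row = []
        for c in range(0, len(feature_map[0]), 2):
            window = (feature_map[r][c], feature_map[r][c + 1],
                      feature_map[r + 1][c], feature_map[r + 1][c + 1])
            row.append(max(window))
        pooled.append(row)
    return pooled


# Two maps whose strong activations sit at different positions in each window.
map_a = [[9, 0, 0, 7],
         [0, 0, 0, 0],
         [0, 0, 0, 0],
         [0, 5, 3, 0]]
map_b = [[0, 0, 0, 0],
         [0, 9, 7, 0],
         [5, 0, 0, 3],
         [0, 0, 0, 0]]

print(max_pool_2x2(map_a))  # [[9, 7], [5, 3]]
print(max_pool_2x2(map_b))  # [[9, 7], [5, 3]]
```

Both maps pool to the same output, so any layer after the pooling step cannot tell how the features were arranged within each window.<br />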
<br />
Also, in CNNs, higher-level features combine lower-level features as a weighted sum: activations of a previous layer multiplied by the current layer's weight, then passed to another activation function. In this process, pose relationship between simpler features is not part of the higher-level feature.<br />
<br />
In the example below, the network is able to detect similar features (eyes, mouth, nose, etc.) within both images, but fails to recognize that one image is a human face while the other is a Picasso-esque rearrangement of the same features, because the CNN cannot encode spatial relationships after multiple pooling layers.<br />
In deep learning, the activation level of a neuron is often interpreted as the likelihood of detecting a specific feature. CNNs are good at detecting features but less effective at exploring the spatial relationships among features (perspective, size, orientation). <br />
<br />
[[File:Equivariance Face.png |center]]<br />
<br />
Here, the CNN could wrongly activate the neuron for the face detection. Without realizing the mismatch in spatial orientation and size, the activation for the face detection will be too high.<br />
<br />
Conversely, we hope that a CNN can recognize that both of the following pictures contain a kitten. Unfortunately, when we feed the two images into a ResNet50 architecture, only the first image is correctly classified, while the second image is predicted to be a guinea pig.<br />
<br />
<br />
[[File:kitten.jpeg |center]]<br />
<br />
<br />
[[File:kitten-rotated-180.jpg |center]]<br />
<br />
For a more in depth discussion on the problems with ConvNets, please listen to Geoffrey Hinton's talk "What is wrong with convolutional neural nets?" given at MIT during the Brain & Cognitive Sciences - Fall Colloquium Series (December 4, 2014).<br />
<br />
==Intuition for Capsules==<br />
Human vision ignores irrelevant details by using a carefully determined sequence of fixation points to ensure that only a tiny fraction of the optic array is ever processed at the highest resolution. Hinton argues that our brains reason about visual information by deconstructing it into a hierarchical representation, which we then match to familiar patterns and relationships from memory. The key difference between this understanding and the functionality of CNNs is that recognition of an object should not depend on the angle from which it is viewed.<br />
<br />
To enforce rotational and translational equivariance, Capsule Networks store and preserve hierarchical pose relationships between objects. The core idea behind capsule theory is the explicit numerical representation of relative relationships between different objects within an image. With these relationships built into the model, the network is able to recognize a newly seen object as a rotated view of a previously seen object. For example, the image below shows the Statue of Liberty from five different angles. Even if a person had only seen the Statue of Liberty from one angle, they would still be able to ascertain that all five pictures below contain the same object (just from a different angle).<br />
<br />
[[File:Rotational Invariance.jpeg |center]]<br />
<br />
Building on this idea of hierarchical representation of spatial relationships between key entities within an image, the authors introduce Capsule Networks. Unlike traditional CNNs, Capsule Networks are better equipped to classify correctly under rotation. Furthermore, the authors managed to achieve state of the art results on MNIST using a fraction of the training samples that alternative state of the art networks require.<br />
<br />
=Background, Notation, and Definitions=<br />
<br />
==What is a Capsule==<br />
"Each capsule learns to recognize an implicitly defined visual entity over a limited domain of viewing conditions and deformations and it outputs both the probability that the entity is present within its limited domain and a set of “instantiation parameters” that may include the precise pose, lighting, and deformation of the visual entity relative to an implicitly defined canonical version of that entity. When the capsule is working properly, the probability of the visual entity being present is locally invariant — it does not change as the entity moves over the manifold of possible appearances within the limited domain covered by the capsule. The instantiation parameters, however, are “equivariant” — as the viewing conditions change and the entity moves over the appearance manifold, the instantiation parameters change by a corresponding amount because they are representing the intrinsic coordinates of the entity on the appearance manifold."<br />
<br />
In essence, capsules store object properties in a vector form; probability of detection is encoded as the vector's length, while spatial properties are encoded as the individual vector components. Thus, when a feature is present but the image captures it under a different angle, the probability of detection remains unchanged.<br />
<br />
A brief overview of capsules can be found in other papers by the authors. To quote from [https://openreview.net/pdf?id=HJWLfGWRb this paper]:<br />
<br />
<blockquote><br />
A capsule network consists of several layers of capsules. The set of capsules in layer L is denoted<br />
as <math>\Omega_L</math>. Each capsule has a 4x4 pose matrix, <math>M</math>, and an activation probability, <math>a</math>. These are like the<br />
activities in a standard neural net: they depend on the current input and are not stored. In between<br />
each capsule i in layer L and each capsule j in layer L + 1 is a 4x4 trainable transformation matrix,<br />
<math>W_{ij}</math> . These <math>W_{ij}</math>'s (and two learned biases per capsule) are the only stored parameters and they<br />
are learned discriminatively. The pose matrix of capsule i is transformed by <math>W_{ij}</math> to cast a vote<br />
<math>V_{ij} = M_iW_{ij}</math> for the pose matrix of capsule j. The poses and activations of all the capsules in layer<br />
L + 1 are calculated by using a non-linear routing procedure which gets as input <math>V_{ij}</math> and <math>a_i</math> for all<br />
<math>i \in \Omega_L, j \in \Omega_{L+1}</math><br />
</blockquote><br />
<br />
==Notation==<br />
<br />
We want the length of the output vector of a capsule to represent the probability that the entity represented by the capsule is present in the current input. The paper performs a non-linear squashing operation to ensure that the vector length falls between 0 and 1, with short vectors (entities less likely to exist) being shrunk towards 0 and long vectors being shrunk to a length slightly below 1.<br />
<br />
\begin{align} \mathbf{v}_j &= \frac{||\mathbf{s}_j||^2}{1+ ||\mathbf{s}_j||^2} \frac{\mathbf{s}_j}{||\mathbf{s}_j||} \end{align}<br />
<br />
where <math>\mathbf{v}_j</math> is the vector output of capsule <math>j</math> and <math>\mathbf{s}_j</math> is its total input.<br />
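<br />
The squashing non-linearity above can be written in a few lines of plain Python. This is a minimal sketch: the small constant <code>eps</code> is an added assumption to avoid division by zero for an all-zero input and is not part of the formula.<br />

```python
import math


def squash(s, eps=1e-9):
    """Scale s to length |s|^2 / (1 + |s|^2) while keeping its direction."""
    sq_norm = sum(x * x for x in s)
    norm = math.sqrt(sq_norm)
    # eps is an assumption for numerical safety; it is not in the formula.
    scale = sq_norm / (1.0 + sq_norm) / (norm + eps)
    return [scale * x for x in s]


def length(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))
```

For example, an input of length 5 such as [3, 4] is squashed to a length of 25/26 ≈ 0.96, while an input of length 0.01 is squashed to a length of about 0.0001.<br />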
<br />
For all but the first layer of capsules, the total input to a capsule <math>\mathbf{s}_j</math> is a weighted sum over all “prediction vectors” <math>\hat{\mathbf{u}}_{j|i}</math> from the capsules in the layer below. Each prediction vector is produced by multiplying the output <math>\mathbf{u}_i</math> of a capsule in the layer below by a weight matrix <math>\mathbf{W}_{ij}</math>:<br />
<br />
\begin{align}<br />
\mathbf{s}_j = \sum_i c_{ij}\hat{\mathbf{u}}_{j|i}, ~\hspace{0.5em} \hat{\mathbf{u}}_{j|i}= \mathbf{W}_{ij}\mathbf{u}_i<br />
\end{align}<br />
where the <math>c_{ij}</math> are coupling coefficients that are determined by the iterative dynamic routing process.<br />
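<br />
The two equations above amount to one matrix-vector product per child capsule followed by a weighted sum. A minimal sketch for a single parent capsule <math>j</math> follows; the function names and the nested-list layout are assumptions made for illustration.<br />

```python
def mat_vec(W, u):
    """Multiply matrix W (a list of rows) by vector u."""
    return [sum(w * x for w, x in zip(row, u)) for row in W]


def total_input(W_j, outputs, couplings_j):
    """Compute s_j = sum_i c_ij * (W_ij u_i) for one parent capsule j.

    W_j[i] is the weight matrix from child i to parent j, outputs[i] is u_i,
    and couplings_j[i] is the coupling coefficient c_ij.
    """
    dim = len(W_j[0])                    # rows of W_ij = dimension of s_j
    s_j = [0.0] * dim
    for W_ij, u_i, c_ij in zip(W_j, outputs, couplings_j):
        u_hat = mat_vec(W_ij, u_i)       # prediction vector for parent j
        for d in range(dim):
            s_j[d] += c_ij * u_hat[d]
    return s_j
```

In the MNIST network each <math>\mathbf{W}_{ij}</math> would map an 8-dimensional primary capsule output to a 16-dimensional prediction, but the sketch works for any compatible sizes.<br />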
<br />
The coupling coefficients between capsule <math>i</math> and all the capsules in the layer above sum to 1 and are determined by a “routing softmax” whose initial logits <math>b_{ij}</math> are the log prior probabilities that capsule <math>i</math> should be coupled to capsule <math>j</math>.<br />
<br />
\begin{align}<br />
c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}<br />
\end{align}<br />
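<br />
The routing softmax is an ordinary softmax taken over the parents of one child capsule. A minimal sketch follows; subtracting the maximum logit is a standard numerical-stability step added here and does not change the result.<br />

```python
import math


def routing_softmax(logits_i):
    """Turn the logits b_i1..b_iM of one child capsule into couplings c_i1..c_iM."""
    peak = max(logits_i)                          # stability shift (assumption)
    exps = [math.exp(b - peak) for b in logits_i]
    total = sum(exps)
    return [e / total for e in exps]
```

When all logits are equal, for instance all zero, every parent receives the same coupling <math>1/M</math>; raising one logit shifts coupling towards that parent at the expense of the others.<br />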
<br />
=Network Training and Dynamic Routing=<br />
<br />
==Understanding Capsules==<br />
The notation can get somewhat confusing, so I will provide intuition behind the computational steps within a capsule. The following image is taken from naturomic's talk on Capsule Networks.<br />
<br />
[[File:CapsuleNets.jpeg|center|800px]]<br />
<br />
The above image illustrates the key mathematical operations happening within a capsule (and compares them to the structure of a neuron). Although the operations are rather straightforward, it's crucial to note that the capsule applies an affine transformation to each input vector. The length of each input vector <math>\mathbf{u}_{i}</math> represents the probability that entity <math>i</math> exists in the lower level. This vector is then reoriented with an affine transform using the <math>\mathbf{W}_{ij}</math> matrices, which encode spatial relationships between the lower level entity <math>\mathbf{u}_{i}</math> and the higher level feature <math>j</math>.<br />
<br />
We illustrate the intuition behind the matrix-vector multiplication within capsules using the following example: if vectors <math>\mathbf{u}_{1}</math>, <math>\mathbf{u}_{2}</math>, and <math>\mathbf{u}_{3}</math> represent detection of eyes, nose, and mouth respectively, then after multiplication with trained weight matrices <math>\mathbf{W}_{ij}</math> (where <math>j</math> denotes the face capsule), each product gives a prediction of the location of the higher level feature (face), similar to the image below.<br />
<br />
[[File:Predictions.jpeg |center]]<br />
<br />
==Dynamic Routing==<br />
A capsule <math>i</math> in a lower-level layer needs to decide how to send its output vector to higher-level capsules <math>j</math>. This decision is made with probability proportional to <math>c_{ij}</math>. If there are <math>M</math> capsules in the level that capsule <math>i</math> routes to, then we know the following properties about <math>c_{ij}</math>: <math>\sum_{j=1}^M c_{ij} = 1, c_{ij} \geq 0</math><br />
<br />
In essence, the <math>\{c_{ij}\}_{j=1}^M</math> denotes a discrete probability distribution with respect to capsule <math>i</math>'s output location. Lower level capsules decide which higher level capsules to send vectors into by adjusting the corresponding routing weights <math>\{c_{ij}\}_{j=1}^M</math>. After a few iterations in training, numerous vectors will have already been sent to all higher level capsules. Based on the similarity between the current vector being routed and all vectors already sent into the higher level capsules, we decide which capsule to send the current vector into.<br />
[[File:Dynamic Routing.png|center|900px]]<br />
<br />
From the image above, we notice that a cluster of points similar to the current vector has already been routed into capsule K, while most points in capsule J are highly dissimilar. It thus makes more sense to route the current observations into capsule K; we adjust the corresponding weights upward during training.<br />
<br />
These weights are determined through the dynamic routing procedure:<br />
<br />
<br />
[[File:Routing Algo.png|900px]]<br />
<br />
Note that the convergence of this routing procedure has been questioned. Although it is empirically shown that this procedure converges, the convergence has not been proven.<br />
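<br />
The routing procedure can be sketched in plain Python as follows. This is a small illustration of routing-by-agreement between one layer of children and one layer of parents, assuming the prediction vectors have already been computed and that the logits start at zero; it is not the authors' implementation, and the helper names and the <code>eps</code> constant are assumptions.<br />

```python
import math


def squash(s, eps=1e-9):
    """Shrink s to a length below 1 while keeping its direction."""
    sq = sum(x * x for x in s)
    scale = sq / (1.0 + sq) / (math.sqrt(sq) + eps)
    return [scale * x for x in s]


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(u_hat, iterations=3):
    """Route prediction vectors u_hat[i][j] (child i -> parent j).

    Returns (v, c): the parent outputs v[j] and the couplings c[i][j]
    that were used in the final iteration.
    """
    n_child, n_parent = len(u_hat), len(u_hat[0])
    dim = len(u_hat[0][0])
    b = [[0.0] * n_parent for _ in range(n_child)]   # routing logits start at zero
    for _ in range(iterations):
        c = [softmax(b_i) for b_i in b]               # couplings sum to 1 per child
        v = []
        for j in range(n_parent):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_child))
                   for d in range(dim)]
            v.append(squash(s_j))
        for i in range(n_child):                      # agreement update
            for j in range(n_parent):
                b[i][j] += sum(a * p for a, p in zip(u_hat[i][j], v[j]))
    return v, c
```

If several children make similar predictions for one parent and conflicting predictions for another, the scalar products with the first parent's output are larger, so the couplings drift towards the parent on which the children agree.<br />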
<br />
Although dynamic routing is not the only manner in which we can encode relationships between capsules, the premise of the paper is to demonstrate the capabilities of capsules under a simple implementation. Since the paper was released in 2017, numerous alternative routing implementations have been released including an EM matrix routing algorithm by the same authors (ICLR 2018).<br />
<br />
=Architecture=<br />
The capsule network described in the paper has roughly 8.2 million trainable parameters (6.8 million without the reconstruction decoder). The paper itself is not very detailed on the exact implementation of each architectural layer, which leaves some ambiguity when coding various aspects of the original network. The capsule network has 6 layers overall: the first three form the encoder, and the last three form the decoder.<br />
<br />
==Loss Function==<br />
[[File:Loss Function.png|900px]]<br />
<br />
The cost function looks complicated, but can be broken down into intuitive components. Before diving into the equation, remember that the length of a capsule's output vector denotes the probability that the corresponding object exists. The loss is computed separately for each digit capsule and then summed. The left term is active only when the capsule's digit is actually present in the image (the correct label) and is zero otherwise: it subtracts the vector norm from a fixed quantity <math>m^+ := 0.9</math>, so the capsule is penalized only if its output is shorter than 0.9. The right term is active only when the capsule's digit is absent: it penalizes the network's confidence in an incorrect label by subtracting <math>m^- := 0.1</math> from the vector norm, so the capsule is penalized only if its output is longer than 0.1.<br />
<br />
A graphical representation of loss function values under varying vector norms is given below.<br />
[[File:Loss function chart.png|900px]]<br />
<br />
==Encoder Layers==<br />
The architecture described below was built to classify the MNIST dataset, on which most of the paper's experiments were conducted. For more complex datasets, the results were less promising. <br />
<br />
[[File:Architecture.png|center|900px]]<br />
<br />
The encoder layer takes in a 28x28 MNIST image and learns a 16 dimensional representation of instantiation parameters.<br />
<br />
'''Layer 1: Convolution''': <br />
This layer is a standard convolution layer. Using kernels with size 9x9x1, a stride of 1, and a ReLU activation function, we detect low-level 2D features within the input image.<br />
<br />
'''Layer 2: PrimaryCaps''': <br />
We represent the low level features detected during convolution as 32 primary capsules. Each capsule applies eight convolutional kernels with stride 2 to the output of the convolution layer and feeds the corresponding transformed tensors into the DigitCaps layer.<br />
<br />
'''Layer 3: DigitCaps''': <br />
This layer contains 10 digit capsules, one for each digit. As explained in the dynamic routing procedure, each input vector from the PrimaryCaps layer has its own corresponding weight matrix <math>W_{ij}</math>. Using the routing coefficients <math>c_{ij}</math> and temporary coefficients <math>b_{ij}</math>, we train the DigitCaps layer to output ten 16-dimensional vectors. The length of the <math>i^{th}</math> vector in this layer corresponds to the probability that digit <math>i</math> is detected.<br />
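<br />
The spatial sizes flowing through the encoder can be checked with a short calculation. The sketch below assumes convolutions without padding and a 9x9 kernel for the PrimaryCaps layer; the latter comes from the original paper and is not stated in this summary. The helper name is illustrative.<br />
<br />
```python
def conv_output_size(size, kernel, stride):
    """Spatial output size of a convolution without padding."""
    return (size - kernel) // stride + 1

# Layer 1: 9x9 kernels, stride 1, on a 28x28 image.
conv1_size = conv_output_size(28, kernel=9, stride=1)

# Layer 2: PrimaryCaps, assumed 9x9 kernels with stride 2.
primary_size = conv_output_size(conv1_size, kernel=9, stride=2)

# 32 capsule types at every spatial position, each an 8-dimensional vector.
num_primary_capsules = 32 * primary_size * primary_size

print(conv1_size, primary_size, num_primary_capsules)  # 20 6 1152
```
<br />
Under these assumptions, each of the resulting 8-dimensional primary capsule outputs is routed to the 10 digit capsules.<br />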
<br />
==Decoder Layers==<br />
The decoder aims to train the capsules to extract meaningful features for image detection/classification. During training, it takes the 16-dimensional instantiation vector of the correct (not the predicted) digit capsule in the DigitCaps layer, and attempts to recreate the 28x28 MNIST image as closely as possible. By setting the loss function to the reconstruction error (Euclidean distance between the reconstructed image and the original image), we tune the capsules to encode features that are meaningful within the actual image.<br />
<br />
[[File:Decoder.png|center|900px]]<br />
<br />
The layer consists of three fully connected layers, and transforms a 16x1 vector from the encoder layer into a 28x28 image.<br />
<br />
In addition to the margin loss on the DigitCaps outputs, we add the reconstruction error as a form of regularization. During training, everything but the activity vector of the correct digit capsule is masked, and this activity vector is then used to reconstruct the input image. We minimize the sum of squared differences between the outputs of the decoder's logistic units (the reconstructed image) and the pixel intensities of the original image. We scale down this reconstruction loss by 0.0005 so that it does not dominate the margin loss during training. As illustrated below, reconstructions from the 16D output of the CapsNet are robust while keeping only important details.<br />
<br />
[[File:Reconstruction.png|center|900px]]<br />
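<br />
The masking step and the scaled reconstruction term can be sketched as follows. This is a simplified illustration with hypothetical helper names: images are flattened lists of pixel intensities, and the decoder itself (the three fully connected layers) is left out.<br />
<br />
```python
def mask_capsules(capsule_outputs, target):
    """Zero out every digit capsule vector except the one for the true class."""
    return [list(v) if k == target else [0.0] * len(v)
            for k, v in enumerate(capsule_outputs)]

def reconstruction_loss(original, reconstructed):
    """Sum of squared differences between two flattened images."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed))

def total_loss(margin, original, reconstructed, scale=0.0005):
    """Margin loss plus the down-weighted reconstruction regularizer."""
    return margin + scale * reconstruction_loss(original, reconstructed)
```
<br />
Because of the 0.0005 scale factor, even a large pixel-wise error contributes only a small amount compared with the margin loss.<br />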
<br />
=MNIST Experimental Results=<br />
<br />
==Accuracy==<br />
The paper evaluates on the MNIST dataset, with 60K training and 10K test images. Wan et al. [2013] achieve 0.21% test error with ensembling and data augmentation (rotation and scaling), and 0.39% without them. As shown in Table 1, the authors achieve 0.25% test error with only a 3-layer network; the previous state of the art only beat this number with very deep networks. This result shows the importance of routing and the reconstruction regularizer, both of which boost performance. Moreover, CapsNet attains this accuracy with far fewer parameters than the baseline model.<br />
<br />
[[File:Accuracies.png|center|900px]]<br />
<br />
==What Capsules Represent for MNIST==<br />
The following figure shows the digit representation under capsules. Each row shows the reconstruction when one of the 16 dimensions in the DigitCaps representation is tweaked by intervals of 0.05 in the range [−0.25, 0.25]. By tweaking the values, we notice how the reconstruction changes, and thus get a sense of what each dimension represents. The authors found that some dimensions represent global properties of the digits, while others represent localized properties. <br />
[[File:CapsuleReps.png|center|900px]]<br />
<br />
For example, the authors observe that different dimensions are used for the length of the ascender of a 6 and the size of its loop. The variations captured include stroke thickness, skew and width, as well as digit-specific variations. The authors show what each dimension represents by feeding a perturbed vector into the decoder network.<br />
<br />
==Robustness of CapsNet==<br />
The authors conclude that DigitCaps capsules learn more robust representations for each digit class than traditional CNNs. The trained CapsNet becomes moderately robust to small affine transformations in the test data.<br />
<br />
To compare the robustness of CapsNet to affine transformations against traditional CNNs, both models (CapsNet and a traditional CNN with MaxPooling and DropOut) were trained on a padded and translated MNIST training set, in which each example is an MNIST digit placed randomly on a black background of 40 × 40 pixels. The networks were then tested on the [http://www.cs.toronto.edu/~tijmen/affNIST/ affNIST] dataset (MNIST digits with a random affine transformation). An under-trained CapsNet which achieved 99.23% accuracy on the MNIST test set achieved a corresponding 79% accuracy on the affNIST test set. A traditional CNN achieved similar accuracy (99.22%) on the MNIST test set, but only 66% on the affNIST test set.<br />
<br />
=MultiMNIST & Other Experiments=<br />
<br />
==MultiMNIST==<br />
To evaluate the performance of the model on highly overlapping digits, the authors generate a 'MultiMNIST' dataset. In MultiMNIST, each image consists of two overlaid MNIST digits drawn from the same set (train or test) but from different classes. The results indicate a classification error rate of 5%. Additionally, CapsNet can be used to segment the image into the two digits that compose it. Moreover, the model is able to deal with the overlaps and reconstruct the digits correctly, since each digit capsule can learn the style from the votes of the PrimaryCapsules layer (Figure 5).<br />
<br />
There are some additional steps to generating the MultiMNIST dataset.<br />
<br />
1. Both images are shifted by up to 4 pixels in each direction, resulting in a 36 × 36 image. The bounding boxes of the two digits overlap by approximately 80% on average, so the shift is used to keep both digits identifiable (there is no RGB difference that the network could learn in order to separate the digits).<br />
<br />
2. The label becomes a vector of two numbers, representing the original digit and the randomly generated (and overlaid) digit.<br />
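<br />
The two generation steps above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' data pipeline: digits are 28x28 lists of intensities in [0, 1], each digit is placed on a 36x36 canvas with an independent random shift of up to 4 pixels, and overlapping pixel values are summed and clipped at 1 (as noted in the caption of Figure 5). The function names are hypothetical.<br />
<br />
```python
import random

def place(digit, dx, dy, size=36, pad=4):
    """Copy a 28x28 digit onto an empty size x size canvas, shifted by (dx, dy)."""
    canvas = [[0.0] * size for _ in range(size)]
    for r, row in enumerate(digit):
        for c, value in enumerate(row):
            canvas[r + pad + dy][c + pad + dx] = value
    return canvas

def overlay_digits(digit_a, digit_b, label_a, label_b, rng, max_shift=4):
    """Overlay two digits of different classes and return (image, labels)."""
    if label_a == label_b:
        raise ValueError("MultiMNIST pairs digits from different classes")
    a = place(digit_a, rng.randint(-max_shift, max_shift),
              rng.randint(-max_shift, max_shift))
    b = place(digit_b, rng.randint(-max_shift, max_shift),
              rng.randint(-max_shift, max_shift))
    # Sum the two canvases and clip pixel values at 1.
    image = [[min(1.0, x + y) for x, y in zip(row_a, row_b)]
             for row_a, row_b in zip(a, b)]
    return image, (label_a, label_b)
```
<br />
Passing a seeded <code>random.Random</code> instance as <code>rng</code> makes the generated shifts reproducible.<br />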
<br />
<br />
<br />
[[File:CapsuleNets MultiMNIST.PNG|600px|thumb|center|Figure 5: Sample reconstructions of a CapsNet with 3 routing iterations on MultiMNIST test dataset. The two reconstructed digits are overlayed in green and red as the lower image. The upper image shows the input image. L:(l1; l2) represents the label for the two digits in the image and R:(r1; r2) represents the two digits used for reconstruction. The two right most columns show two examples with wrong classification reconstructed from the label and from the prediction (P). In the (2; 8) example the model confuses 8 with a 7 and in (4; 9) it confuses 9 with 0. The other columns have correct classifications and show that the model accounts for all the pixels while being able to assign one pixel to two digits in extremely difficult scenarios (column 1 − 4). Note that in dataset generation the pixel values are clipped at 1. The two columns with the (*) mark show reconstructions from a digit that is neither the label nor the prediction. These columns suggest that the model is not just finding the best fit for all the digits in the image including the ones that do not exist. Therefore in case of (5; 0) it cannot reconstruct a 7 because it knows that there is a 5 and 0 that fit best and account for all the pixels. Also, in the case of (8; 1) the loop of 8 has not triggered 0 because it is already accounted for by 8. Therefore it will not assign one pixel to two digits if one of them does not have any other support.]]<br />
<br />
==Other datasets==<br />
The authors also tested the proposed capsule model on the CIFAR10 dataset and achieved an error rate of 10.6%. The model tested was an ensemble of 7 models. Each model in the ensemble had the same architecture as the model used for MNIST (apart from using 3 color channels and 64 different types of primary capsules). These 7 models were trained on 24x24 patches of the training images with 3 routing iterations. During experimentation, the authors also found that adding a none-of-the-above category helped improve the overall performance. The error rate achieved is comparable to that of a standard CNN model. According to the authors, one reason for the low performance is that the backgrounds in CIFAR-10 images are too varied to be adequately modeled by a reasonably sized capsule net.<br />
<br />
The proposed model was also evaluated using a small subset of the SVHN dataset. The network trained was much smaller and used only 73257 training images. It still managed to achieve an error rate of 4.3% on the test set.<br />
<br />
=Critique=<br />
Although the network performs very favorably in the authors' experiments, it has a long way to go on more complex datasets. On CIFAR 10, the network achieved subpar results, and the experimental results seem to get worse as the problem becomes more complex. This is to be expected, since these networks are still at an early stage; further innovations may come in the upcoming years or decades. It would also be wise to apply the model to other, larger datasets to make its usefulness more convincing. MNIST has simple patterns, and even if the model were to be presented on only one dataset, MNIST would not be the best choice, especially since the focus is on human-like visual recognition and handwritten digits are not that commonly encountered in real life.<br />
<br />
Hinton talks about CapsuleNets revolutionizing areas such as self-driving, but such groundbreaking innovations are far away from CIFAR10, and even further from MNIST. Only time can tell if CapsNets will live up to their hype.<br />
<br />
Moreover, the paper provides no underlying intuition for its main point, namely that capsule nets preserve the relations between the features extracted by the proposed architecture. An explanation of the intuition behind this idea would go a long way in arguing against CNNs.<br />
<br />
Capsules inherently segment images and learn a lower dimensional embedding in a new manner, which makes them likely to perform well on segmentation and computer vision tasks once further research is done. <br />
<br />
Additionally, these networks are more interpretable than CNNs, and have strong theoretical reasoning for why they could work. Naturally, it would be hard for a new architecture to beat the heavily researched/modified CNNs.<br />
<br />
* A reviewer comment ([https://openreview.net/forum?id=HJWLfGWRb]) notes that it is not fully clear how efficiently the method can be performed or how scalable it is. The evaluation is performed on a small dataset for shape recognition, so the approach will need to be tested on larger, more challenging datasets.<br />
<br />
=Future Work=<br />
The same authors [G. E. Hinton, S. Sabour, N. Frosst] presented another paper, "Matrix Capsules with EM Routing", at ICLR 2018, which achieved better results than the work presented in this paper. They presented a new multi-layered capsule network architecture, implemented an EM routing procedure, and introduced "Coordinate Addition". This new type of capsule network reduced the number of errors by 45% and performed better than a standard CNN on white-box adversarial attacks. Capsule architectures are gaining interest because of their ability to achieve equivariance of parts, and because they employ a new form of pooling called "routing" (as opposed to max pooling), which groups parts that make similar predictions of the whole to which they belong, rather than relying on spatial co-locality.<br />
Moreover, the authors hint at changing the curvature and sensitivities to various factors by introducing a new form of loss function. This may improve the performance of the model on more complicated datasets, which is one of the model's drawbacks.<br />
<br />
Moreover, as mentioned in the critique, good future work for this group would be to make the model more robust across datasets and to achieve acceptable performance on datasets with images more commonly seen in real life.<br />
<br />
=References=<br />
#Geoffrey E. Hinton, Sara Sabour, and Nicholas Frosst. Matrix capsules with EM routing. In International Conference on Learning Representations, 2018.
#S. Sabour, N. Frosst, and G. E. Hinton. Dynamic routing between capsules. arXiv preprint arXiv:1710.09829v2, 2017.
#Hinton, G. E., Krizhevsky, A. and Wang, S. D. (2011). Transforming Auto-encoders.
#Geoffrey Hinton's talk: What is wrong with convolutional neural nets? - Talk given at MIT. Brain & Cognitive Sciences - Fall Colloquium Series. [https://www.youtube.com/watch?v=rTawFwUvnLE]<br />
#Understanding Hinton’s Capsule Networks - Max Pechyonkin's series [https://medium.com/ai%C2%B3-theory-practice-business/understanding-hintons-capsule-networks-part-i-intuition-b4b559d1159b]<br />
#Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
#Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014.
#Jia-Ren Chang and Yong-Sheng Chen. Batch-normalized maxout network in network. arXiv preprint arXiv:1511.02583, 2015.
#Dan C. Cireşan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. High-performance neural networks for visual object classification. arXiv preprint arXiv:1102.0183, 2011.
#Ian J. Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay Shet. Multi-digit number recognition from street view imagery using deep convolutional neural networks. arXiv preprint arXiv:1312.6082, 2013.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Unsupervised_Neural_Machine_Translation&diff=42146Unsupervised Neural Machine Translation2018-11-30T22:59:59Z<p>Z43ma: </p>
<hr />
<div>This paper was published at ICLR 2018, authored by Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. An open source implementation of this paper is available [https://github.com/artetxem/undreamt here].<br />
<br />
= Introduction =<br />
The paper presents an unsupervised Neural Machine Translation (NMT) method that uses only monolingual corpora (single-language texts). This contrasts with the usual supervised NMT approach, which relies on parallel corpora (aligned text) in the source and target languages being available for training. This problem is important because parallel corpora do not exist for the majority of language pairs, e.g. German-Russian. Languages can also suffer from having poor resources for translation (e.g. Basque), which leads to the problem of the dataset being too small (Koehn & Knowles, 2017).<br />
<br />
Other authors have recently tried to address this problem with approaches that use only a small set of parallel corpora, including pivoting or triangulation techniques [Chen et al., 2017] and semi-supervised approaches [He, 2016]. However, these methods still require a strong cross-lingual signal. The proposed method eliminates the need for cross-lingual information altogether and relies solely on monolingual data. It builds upon recent work on unsupervised cross-lingual embeddings by Artetxe et al., 2017 and Zhang et al., 2017.<br />
<br />
The general approach of the methodology is to:<br />
<br />
# Use monolingual corpora in the source and target languages to learn single language word embeddings for both languages separately.<br />
# Align the 2 sets of word embeddings into a single cross lingual (language independent) embedding.<br />
Then iteratively perform:<br />
# Train an encoder-decoder model to reconstruct noisy versions of sentences in both source and target languages separately. The model uses a single encoder and different decoders for each language. The encoder uses cross lingual word embedding.<br />
# Tune the decoder in each language by back-translating between the source and target language.<br />
<br />
= Background =<br />
<br />
===Word Embedding Alignment===<br />
<br />
The paper uses word2vec [Mikolov, 2013] to convert each monolingual corpus into vector embeddings. Mikolov et al. improve the continuous Skip-gram model for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. These embeddings have been shown to contain contextual and syntactic features that are independent of language, and so, in theory, there could exist a linear map that maps the embeddings from language L1 to language L2. <br />
<br />
Figure 1 shows an example of aligning the word embeddings in English and French.<br />
<br />
[[File:Figure1_lwali.png|frame|400px|center|Figure 1: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]<br />
<br />
Most cross-lingual word embedding methods use bilingual signals in the form of parallel corpora. Usually, the embedding mapping methods train the embeddings in different languages using monolingual corpora, then use a linear transformation to map them into a shared space based on a bilingual dictionary.<br />
<br />
The paper uses the methodology proposed by [Artetxe, 2017] to align cross-lingual embeddings in an unsupervised manner and without parallel data. Without going into the details, the general approach of that method is to start from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.) and iteratively learn the mapping between the 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration. This is in contrast to earlier work, which used dictionaries of a few thousand words.<br />
<br />
===Other related work and inspirations===<br />
====Statistical Decipherment for Machine Translation====<br />
There has been significant work on statistical decipherment techniques to develop a machine translation model from monolingual data (Ravi & Knight, 2011; Dou & Knight, 2012). Decipherment is the discovery of the meaning of texts written in ancient or obscure languages or scripts. These techniques treat the source language as ciphertext, i.e. an encoded form of an original plaintext that cannot be read by a human or computer without the proper cipher for decoding, and model the generation of the ciphertext as a two-stage process: the generation of the original English sequence, followed by the probabilistic replacement of the words in it. This approach benefits from incorporating syntactic knowledge of the languages. The use of word embeddings has also been shown to improve statistical decipherment.<br />
<br />
====Low-Resource Neural Machine Translation====<br />
There are also proposals that use techniques other than direct parallel corpora to do NMT. Some use a third intermediate language that is well connected to the source and target languages independently. For example, if we want to translate German into Russian, we can use English as an intermediate language (German-English and then English-Russian), since there are plenty of resources connecting English with other languages. Johnson et al. (2017) show that a multilingual extension of a standard NMT architecture performs reasonably well even for language pairs for which no parallel data was used during training. Firat et al. (2016) and Chen et al. (2017) showed that more advanced models, such as a teacher-student framework, can improve over the baseline of translating through a third intermediate language.<br />
<br />
Other works use monolingual data in combination with scarce parallel corpora. A simple but effective technique is back-translation [Sennrich et al, 2016]. A synthetic parallel corpus is created by translating monolingual sentences in the target language back into the source language, and the resulting pairs of synthetic source sentences and original target sentences are used as additional training data.<br />
<br />
The most important contribution to the problem of training an NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directions (e.g. French → English and English → French) and teach each other through reinforcement learning. However, this approach still required a large parallel corpus for a warm start (about 1.2 million sentences), while this paper does not use parallel data.<br />
<br />
= Related Works =<br />
<br />
===Unsupervised Cross-Lingual Embeddings===<br />
<br />
A majority of methods for learning cross-lingual word embeddings depend on some bilingual signal at the document level. Embedding mapping methods independently train the embeddings in different languages using monolingual corpora and subsequently learn a linear transformation that maps them to a shared space based on a bilingual dictionary. While the dictionary used in this earlier work typically contains a few thousand entries, Artetxe et al. (2017) propose a simple self-learning extension that gives comparable results with an automatically generated list of numerals, which is used as a shortcut for practical unsupervised learning.<br />
<br />
===Statistical Decipherment for Machine Translation===<br />
<br />
A considerable body of work on statistical decipherment techniques treats the source language as ciphertext and models the process by which this ciphertext is generated as a two-stage process involving the generation of the original English sequence and the probabilistic replacement of the words in it. The English generative process is modeled using a standard n-gram language model, and the channel model parameters are estimated using either expectation maximization or Bayesian inference. This approach was shown to benefit from the incorporation of syntactic knowledge of the languages involved (Dou & Knight, 2013; Dou et al., 2015). More in line with the paper's proposal, the use of word embeddings has also been shown to bring significant improvements in statistical decipherment for machine translation (Dou et al., 2015). Another recently developed method uses a relatively new deep architecture, the Sum-Product Network (Poon & Domingos, 2011), for machine translation. It is a hybrid model that combines probabilistic modeling with deep architectures. The main advantage of this model is that it has clear semantics and provides great interoperability, and, like many other deep architectures, it can be trained using gradient descent. Sum-Product Networks can be applied to machine translation by modeling translation with Bayes' rule, <math>P(\text{English} \mid \text{French}) = \frac{P(\text{French} \mid \text{English}) \, P(\text{English})}{P(\text{French})}</math>, where <math>P(\text{English} \mid \text{French})</math> is the probability that an English text corresponds to a given French text, and <math>P(\text{French} \mid \text{English})</math> is the converse. A Sum-Product Network can be used to model each of these probabilities and thus perform machine translation.<br />
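<br />
The Bayes' rule decomposition above can be illustrated with a small worked example. The sketch below is purely illustrative: the candidate sentences and all probabilities are made-up numbers, and the dictionaries stand in for whatever models (n-gram, Sum-Product Network, or otherwise) supply the channel and language model probabilities.<br />
<br />
```python
def posterior(p_french_given_english, p_english):
    """Apply Bayes' rule to get P(English | French) for each English candidate.

    p_french_given_english: channel probabilities P(French | English).
    p_english:              language model probabilities P(English).
    """
    joint = {e: p_french_given_english[e] * p_english[e] for e in p_english}
    p_french = sum(joint.values())  # normalizer P(French) over the candidates
    return {e: value / p_french for e, value in joint.items()}

# Made-up numbers for two candidate translations of one French sentence.
channel = {"the house is small": 0.5, "small the house is": 0.25}
language_model = {"the house is small": 0.5, "small the house is": 0.5}

scores = posterior(channel, language_model)
best = max(scores, key=scores.get)
```
<br />
Since <math>P(\text{French})</math> is the same for every candidate, choosing the best translation only requires comparing the products of the channel and language model probabilities.<br />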
<br />
===Low-Resource Neural Machine Translation===<br />
<br />
A simple yet effective approach is to create a synthetic parallel corpus by back-translating a monolingual corpus in the target language (Sennrich et al., 2016a). At the same time, Currey et al. (2017) showed that training an NMT system to directly copy target language text is also helpful and complementary to back-translation. Finally, Ramachandran et al. (2017) pre-train the encoder and the decoder on language modeling. Another method trains two agents to translate in opposite directions (e.g. French → English and English → French), and makes them teach each other through a reinforcement learning process. This approach still requires a parallel corpus of considerable size for a good start.<br />
<br />
= Methodology =<br />
<br />
The corpora are first preprocessed in a standard way by tokenizing and truecasing the words. The authors also experimented with an alternative way of tokenizing words, Byte-Pair Encoding (BPE) [Sennrich, 2016]. Byte pair encoding (or digram coding) is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data; applied to text, it repeatedly merges the most frequent pair of adjacent symbols into a new subword unit. BPE has been shown to improve the embeddings of rare words. The vocabulary was limited to the most frequent 50,000 tokens (BPE tokens or words).<br />
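<br />
The merge-learning step of BPE can be sketched as follows. This is a toy illustration in the spirit of the algorithm, not the tooling used by the authors: words are split into characters plus an end-of-word marker <code></w></code>, and the most frequent adjacent pair is merged repeatedly. The function names and the tiny word list are assumptions made for the example.<br />
<br />
```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges merge operations from a word-frequency dict."""
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair encountered wins
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
```
<br />
On this toy word list the first merges build up the frequent suffix "est", so a rare word such as "widest" shares a subword unit with the more frequent "newest".<br />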
<br />
The tokens were then converted to word embeddings using word2vec with 300 dimensions and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.<br />
<br />
The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2-layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target languages, while the decoder is different for each language.<br />
<br />
Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:<br />
<br />
#Dual structure: NMT systems are usually built for a single translation direction, either English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time, translating English<math>\leftrightarrow</math>French.<br />
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language. <br />
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This approach ensures that the encoder only learns how to compose the language independent representations to build representations of the larger phrases. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in the background. In the proposed method, even though the embeddings used are cross-lingual, the vocabulary used for each language is different. This way if the same word occurs in two different languages and has a different meaning in the respective languages then each word would get a different vector in the respective languages despite being in the same vector space. <br />
<br />
[[File:Figure2_lwali.png|600px|center]]<br />
<br />
The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.<br />
<br />
'''Note on the need for alignment:''' To train the decoders (in an admittedly “supervised” manner) we make the assumption that they decode from the same latent space. Thus, given a sentence in either language, the encoder needs to represent it in the same latent space to allow training. However, during the back-translation training, the shared encoder stays fixed. This implies that the encoder needs to be set beforehand. For this reason, the process of embedding and alignment is needed. <br />
<br />
===Denoising===<br />
Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both languages in a language-independent fashion, and then be decoded by the language dependent decoder.<br />
<br />
Denoising works by reconstructing a noisy version of a sentence back into the original sentence in the same language. In mathematical form, if <math>x</math> is a sentence in language L1:<br />
<br />
# Construct <math>C(x)</math>, noisy version of <math>x</math>. In the proposed model, <math>C(x)</math> is constructed by randomly swapping contiguous words. If the length of the input sequence <math>x</math> is <math>N</math>, then a total of <math>\frac{N}{2}</math> such swaps are made.<br />
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.<br />
<br />
The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.<br />
<br />
In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.<br />
<br />
The proposed noise function is to perform <math>N/2</math> random swaps of words that are contiguous, where <math>N</math> is the number of words in the sentence. This noise model also helps reduce the reliance of the model on the order of words in a sentence which may be different in the source and target languages. The system will also need to correctly learn the internal structure of a language to decode the sentence into the correct order.<br />
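<br />
The noise function described above can be sketched in a few lines. This is a minimal illustration assuming a sentence is a list of tokens; the function name and the optional <code>rng</code> argument (used to make the swaps reproducible) are not from the paper.<br />
<br />
```python
import random

def swap_noise(tokens, rng=None):
    """Return a copy of `tokens` after N/2 random swaps of adjacent words."""
    rng = rng or random.Random()
    noisy = list(tokens)  # leave the original sentence untouched
    n = len(noisy)
    if n < 2:
        return noisy
    for _ in range(n // 2):
        i = rng.randrange(n - 1)  # pick a position and swap it with its neighbor
        noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
    return noisy
```
<br />
The noisy sentence contains exactly the same words as the original, only in a locally perturbed order, so the decoder has to learn the word order of its language in order to reconstruct the input.<br />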
<br />
===Back-Translation===<br />
<br />
With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:<br />
<br />
# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L2,<br />
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,<br />
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.<br />
<br />
The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.<br />
<br />
This approach alleviates issues that would have resulted from the training procedure only dealing with a single language at a time. Each sentence in a monolingual corpus is converted to a synthetic translation, and the system is trained to predict the original sentence from this translation. <br />
<br />
Contrary to standard back-translation that uses an independent model to back-translate the entire corpus at once, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.<br />
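<br />
The three back-translation steps can be sketched as a function that builds one synthetic training pair. This is a toy illustration under stated assumptions: <code>back_translation_pair</code>, <code>toy_translate</code> and <code>identity_noise</code> are hypothetical names, and the dictionary lookup merely stands in for the shared encoder plus the L2 decoder.<br />

```python
def back_translation_pair(x, translate_to_l2, noise):
    """Build one synthetic training pair: (noisy pseudo-translation, original sentence)."""
    y = translate_to_l2(noise(x))  # step 1: translate C(x) with the current model
    return noise(y), x             # steps 2-3: the model learns to map C(y) back to x

# Toy stand-ins, for illustration only (not a real translation model).
toy_lexicon = {"the": "le", "cat": "chat", "sleeps": "dort"}

def toy_translate(words):
    return [toy_lexicon.get(w, w) for w in words]

def identity_noise(words):
    return list(words)

source, target = back_translation_pair(["the", "cat", "sleeps"], toy_translate, identity_noise)
```

In the real system the pair would be fed to the shared encoder and the L1 decoder, and the cross entropy between the reconstruction and <code>target</code> would be minimized.<br />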
<br />
===Training===<br />
<br />
Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence. <br />
Greedy decoding was used to generate translations at training time for back-translation, while inference at test time used beam search with a beam size of 12.<br />
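<br />
The alternation between objectives can be written as a simple generator. This is a sketch of the schedule only; the tuple layout and the name <code>training_schedule</code> are assumptions, and the actual model updates are omitted.<br />

```python
from itertools import islice

def training_schedule():
    """Yield (objective, source language, target language) for successive mini-batches."""
    while True:
        yield ("denoise", "L1", "L1")
        yield ("denoise", "L2", "L2")
        yield ("back-translate", "L1", "L2")
        yield ("back-translate", "L2", "L1")

# One full iteration consists of four mini-batches.
first_iteration = list(islice(training_schedule(), 4))
```

Since the paper trains for a fixed number of iterations, a real loop would stop after that many passes through these four steps rather than run forever.<br />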
<br />
The authors use Adam as their optimizer with a learning rate of α = 0.0002 (Kingma & Ba, 2015). During training, dropout regularization is implemented with a drop probability p = 0.3. Given that no parallel data is used for development purposes, the authors perform a fixed number of iterations (300,000) to train each variant. <br />
<br />
Since Adam has been reported to converge to worse solutions than SGD in some settings, repeating the experiments with other optimizers might provide better results.<br />
<br />
=Experiments and Results=<br />
<br />
The model was evaluated using the Bilingual Evaluation Understudy (BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (ground-truth) translation.<br />
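<br />
For reference, BLEU is usually defined as follows. This is the standard textbook form rather than something taken from the paper, and the common default of <math>N = 4</math> with uniform weights <math>w_n = 1/N</math> is an assumption here:<br />
<br />
<center> <math>\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right), \qquad \text{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}</math> </center><br />
<br />
Here <math>p_n</math> is the modified <math>n</math>-gram precision of the candidate translation against the reference, <math>c</math> is the total candidate length, <math>r</math> is the reference length, and the brevity penalty BP punishes candidates that are shorter than the reference.<br />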
<br />
The paper trains the translation model under 3 different settings to compare performance (Table 1). All training and test data come from a standard NMT dataset, WMT'14.<br />
<br />
[[File:Table1_lwali.png|600px|center]]<br />
<br />
The results show that back-translation is necessary for the proposed system to work properly. Denoising alone scores below the baseline, while large improvements appear once back-translation is introduced.<br />
<br />
===Unsupervised===<br />
<br />
The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.<br />
<br />
The paper adds each component piece-wise during evaluation to test the impact each piece has on the final score. As shown in Table 1, the unsupervised results are strong compared to the word-by-word baseline, with improvements between 40% and 140%. Results also show that back-translation is essential. Denoising alone doesn't show a big improvement; however, it is required for back-translation, because otherwise back-translation would translate nonsensical sentences. The addition of back-translation does show a large improvement in all tested cases.<br />
<br />
For the BPE experiment, results show that it helps for some language pairs but hurts for others. This is because, while BPE helped to translate some rare words, it increased the error rates on other words. It also did not perform well when translating named entities, which occur infrequently.<br />
<br />
===Semi-supervised===<br />
<br />
Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.<br />
<br />
Table 1 shows that the model can greatly benefit from the addition of a small parallel corpus to the monolingual corpora. It is surprising that the semi-supervised system in row 6 outperforms the supervised one in row 7; one possible explanation is that both the semi-supervised training set and the test set belong to the news domain, whereas the supervised training set covers corpora from all domains.<br />
<br />
===Supervised===<br />
<br />
This setting provides an upper bound for the proposed unsupervised system. The data used was the combination of all parallel corpora provided at WMT 2014, which includes Europarl, Common Crawl and News Commentary for both language pairs, plus the UN and the Gigaword corpus for French-English. Moreover, the authors use the same subsets of News Commentary alone to run separate experiments in order to compare with the semi-supervised scenario.<br />
<br />
The Comparable NMT was trained using the same proposed model except that it does not use monolingual corpora, and consequently, it was trained without denoising and back-translation. The proposed model under a supervised setting does much worse than the state-of-the-art NMT system in row 10, which suggests that the additional constraints that enable unsupervised learning also limit the potential performance. To improve these results, the authors also suggest using larger models, longer training times, and incorporating several well-known NMT techniques.<br />
<br />
===Qualitative Analysis===<br />
<br />
[[File:Table2_lwali.png|600px|center]]<br />
<br />
Table 2 shows 4 examples of French to English translations. They show that the proposed system produces high-quality translations and adequately models non-trivial translation relations. Examples 1 and 2 show that the model is able not only to go beyond a literal word-by-word substitution but also to model structural differences between the languages (e.g., it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport"), and that it is capable of producing high-quality translations of long and more complex sentences. However, in Examples 3 and 4, the system fails to translate the months and numbers correctly and has difficulty with odd sentence structures, which shows that the proposed system has limitations. In particular, the authors point out that the proposed model has difficulty preserving some concrete details from source sentences. Results also show that the proposed model's translation quality often lags behind that of a standard supervised NMT system, and there are some cases with both fluency and adequacy problems that severely hinder understanding of the original message, suggesting that there is still room for improvement and possible future work.<br />
<br />
=Conclusions and Future Work=<br />
<br />
The paper presented an unsupervised model that performs translation using only monolingual corpora, based on an attention-based encoder-decoder system trained with denoising and back-translation.<br />
<br />
Although experimental results show that the proposed model is effective as an unsupervised approach, and that combining it with a small parallel corpus brings further gains, there is significant room for improvement when the model is used in a supervised way, suggesting that the model is limited by the architectural modifications. Some ideas for future improvement include:<br />
*Instead of using fixed cross-lingual word embeddings at the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.<br />
*Decouple the shared encoder into 2 independent encoders at some point during training<br />
*Progressively reduce the noise level<br />
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in the authors' manual analysis<br />
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.<br />
<br />
= Critique =<br />
<br />
While the idea is interesting and the results are impressive for an unsupervised approach, much of the model had actually already been proposed by other papers that are referenced. The paper doesn't add a lot of new ideas but only builds on existing techniques and combines them in a different way to achieve good experimental results. The paper is not a significant algorithmic contribution. <br />
<br />
As pointed out, in order to critically analyze the effect of the algorithm, we need to formulate the algorithm in terms of mathematics.<br />
<br />
The results showed that the proposed system performed far worse than the state of the art when used in a supervised setting, which is concerning and suggests that the techniques used create a limitation and a ceiling for performance.<br />
<br />
Additionally, there was no rigorous hyperparameter exploration/optimization for the model. As a result, it is difficult to conclude whether the performance limit observed in the constrained supervised model is the absolute limit, or whether this could be overcome in both supervised/unsupervised models with the right constraints to achieve more competitive results. <br />
<br />
The best results shown are between two very closely related languages (English and French), and the model does much worse for English-German, even though English and German are also closely related (but less so than English and French). This suggests that the model may not be successful at translating between distant language pairs. More testing would be interesting to see.<br />
<br />
The results comparison could have shown how the semi-supervised version of the model scores compared to other semi-supervised approaches as touched on in the other works section.<br />
<br />
Their qualitative analysis just checks whether their proposed unsupervised NMT generates a sensible translation. It is limited and it needs further detailed analysis regarding the characteristics and properties of translation which is generated by unsupervised NMT.<br />
<br />
* (As pointed out by an anonymous reviewer [https://openreview.net/forum?id=Sy2ogebAW]) Future work is vague: “we would like to detect and mitigate the specific causes…” “We also think that a better handling of rare words…” That’s great, but how will you do these things? Do you have specific reasons to think this, or ideas on how to approach them? Otherwise, this is just hand-waving.<br />
<br />
= References =<br />
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."<br />
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".<br />
#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."<br />
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."<br />
#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."<br />
#'''[Ravi & Knight, 2011]''' Sujith Ravi and Kevin Knight, "Deciphering foreign language."<br />
#'''[Dou & Knight, 2012]''' Qing Dou and Kevin Knight, "Large scale decipherment for out-of-domain machine translation."<br />
#'''[Johnson et al. 2017]''' Melvin Johnson,et al, "Google’s multilingual neural machine translation system: Enabling zero-shot translation."<br />
#'''[Zhang et al. 2017]''' Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. "Adversarial training for unsupervised bilingual lexicon induction"<br />
#'''[Koehn & Knowles, 2017]''' Philipp Koehn and Rebecca Knowles. "Six challenges for neural machine translation."<br />
#'''[Chen et al., 2017]''' Yun Chen, Yang Liu, Yong Cheng, and Victor O.K. Li. A teacher-student framework for zero-resource neural machine translation.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_to_Teach&diff=42145Learning to Teach2018-11-30T22:54:23Z<p>Z43ma: </p>
<hr />
<div><br />
<br />
=Introduction=<br />
<br />
This paper proposed the "learning to teach" (L2T) framework with two intelligent agents: a student model/agent, corresponding to the learner in traditional machine learning algorithms, and a teacher model/agent, determining the appropriate data, loss function, and hypothesis space to facilitate the learning of the student model.<br />
<br />
In modern human society, the role of teaching is central to our education system; the goal is to equip students with the necessary knowledge and skills in an efficient manner. This is the fundamental ''student'' and ''teacher'' framework on which education stands. However, in the field of artificial intelligence (AI) and specifically machine learning, researchers have focused most of their efforts on the ''student'' (i.e., designing various optimization algorithms to enhance the learning ability of intelligent agents). The paper argues that a formal study on the role of ‘teaching’ in AI is required. Analogous to teaching in human society, the teaching framework can: select training data that corresponds to the appropriate teaching materials (e.g. textbooks selected for the right difficulty), design loss functions that correspond to targeted examinations, and define the hypothesis space that corresponds to imparting the proper methodologies. Furthermore, an optimization framework (instead of heuristics) should be used to update the teaching skills based on the feedback from students, so as to achieve teacher-student co-evolution.<br />
<br />
Thus, the training phase of L2T would have several episodes of interactions between the teacher and the student model. Based on the state information in each step, the teacher model would update the teaching actions so that the student model could perform better on the Machine Learning problem. The student model would then provide reward signals back to the teacher model. These reward signals are used by the teacher model as part of the Reinforcement Learning process to update its parameters. The paper uses a policy gradient algorithm for this update. This process is end-to-end trainable and the authors are convinced that once converged, the teacher model could be applied to new learning scenarios and even new students, without extra effort on re-training.<br />
<br />
To demonstrate the practical value of the proposed approach, the '''training data scheduling''' problem is chosen as an example. The authors show that by using the proposed method to adaptively select the most suitable training data, they can significantly improve the accuracy and convergence speed of various neural networks, including multi-layer perceptrons (MLPs), convolutional neural networks (CNNs) and recurrent neural networks (RNNs), for different applications including image classification and text understanding.<br />
Furthermore, the teacher model obtained from one task can be smoothly transferred to other tasks. For example, with the teacher model trained on MNIST with the MLP learner, one can achieve satisfactory performance on CIFAR-10 using only roughly half of the training data to train a ResNet model as the student.<br />
<br />
=Related Work=<br />
The L2T framework connects with two emerging trends in machine learning. The first is the movement from simple to advanced learning. This includes meta-learning (Schmidhuber, 1987; Thrun & Pratt, 2012), which explores automatic learning by transferring learned knowledge from meta tasks [1]. This approach has been applied to few-shot learning scenarios and in designing general optimizers and neural network architectures (Hochreiter et al., 2001; Andrychowicz et al., 2016; Li & Malik, 2016; Zoph & Le, 2017).<br />
<br />
The second is teaching, which can be classified into either machine teaching (Zhu, 2015) [2] or hardness-based methods. The former seeks to construct a minimal training set for the student to learn a target model (i.e., an oracle). The latter assumes an order of data from easy instances to hard ones, with hardness being determined in different ways. Curriculum learning (CL) (Bengio et al, 2009; Spitkovsky et al. 2010; Tsvetkov et al, 2016) [3] measures hardness through heuristics of the data, while self-paced learning (SPL) (Kumar et al., 2010; Lee & Grauman, 2011; Jiang et al., 2014; Supancic & Ramanan, 2013) [4] measures hardness by the loss on the data. <br />
<br />
The limitations of these works include the lack of a formally defined teaching problem, and the reliance on heuristics and fixed rules, which hinders generalization of the teaching task.<br />
<br />
=Learning to Teach=<br />
To introduce the problem and framework, without loss of generality, consider the setting of supervised learning.<br />
<br />
In supervised learning, each sample <math>x</math> is from a fixed but unknown distribution <math>P(x)</math>, and the corresponding label <math> y </math> is from a fixed but unknown distribution <math>P(y|x) </math>. The goal is to find a function <math>f_\omega(x)</math> with parameter vector <math>\omega</math> that minimizes the gap between the predicted label and the actual label.<br />
<br />
<br />
<br />
==Problem Definition==<br />
In supervised learning, the goal is to choose a function <math display="inline">f_w(x)</math> with <math display="inline">w</math> as the parameter vector to predict the supervisor's label as well as possible. The goodness of a function <math display="inline">f_w</math> is evaluated by the risk function: <br />
<br />
\begin{align*}R(w) = \int M(y, f_w(x))dP(x,y)\end{align*}<br />
<br />
where <math display="inline">M(\cdot,\cdot)</math> is the metric that evaluates the gap between the label and the prediction.<br />
<br />
The student model, denoted <math>\mu(\cdot)</math>, takes the set of training data <math> D </math>, the function class <math> Ω </math>, and the loss function <math> L </math> as input and outputs a function <math> f_{ω^*} </math> with parameter <math>ω^*</math> that minimizes the risk <math>R(ω)</math>, as in:<br />
<br />
\begin{align*}<br />
ω^* = \arg\min_{ω \in \Omega} \sum_{(x,y) \in D} L(y, f_ω(x)) =: \mu (D, L, \Omega)
\end{align*}<br />
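<br />
As a concrete toy instance of the student model <math>\mu(D, L, \Omega)</math>, the sketch below picks, from a small finite function class, the function with the smallest summed loss on the training data. The names <code>student</code>, <code>squared_loss</code> and the four candidate slopes are hypothetical; a real student would search a parametric class with gradient-based training instead.<br />

```python
def student(data, loss, hypotheses):
    """Toy mu(D, L, Omega): return the hypothesis with the smallest summed loss on D."""
    return min(hypotheses, key=lambda f: sum(loss(y, f(x)) for x, y in data))

def squared_loss(y, prediction):
    return (y - prediction) ** 2

# Hypothetical function class Omega: lines through the origin with a few fixed slopes.
slopes = [0.5, 1.0, 2.0, 3.0]
hypotheses = [lambda x, w=w: w * x for w in slopes]

# Training data D as (x, y) pairs, roughly following y = 2x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

best = student(data, squared_loss, hypotheses)
```

Changing any of the three arguments (the data, the loss, or the function class) changes what the student learns, which is exactly the lever the teacher model controls.<br />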
<br />
The teaching model, denoted φ, tries to provide <math> D </math>, <math> L </math>, and <math> Ω </math> (or any combination, denoted <math> A </math>) to the student model such that the student model either achieves lower risk <math>R(ω)</math> or progresses as fast as possible.<br />
In contrast to traditional machine learning, which is only concerned with the student model, the learning to teach framework is also concerned with a teacher model, which tries to provide appropriate inputs to the student model so that it can achieve a low risk as efficiently as possible. These inputs are:<br />
<br />
<br />
::'''Training Data''': Outputting a good training set <math> D </math>, analogous to human teachers providing students with proper learning materials such as textbooks.<br />
::'''Loss Function''': Designing a good loss function <math> L </math> , analogous to providing useful assessment criteria for students.<br />
::'''Hypothesis Space''': Defining a good function class <math> Ω </math> which the student model can select from. This is analogous to human teachers providing appropriate context, e.g., middle school students are taught math with basic algebra while undergraduate students are taught with calculus. Different choices of Ω lead to different errors and optimization problems (Mohri et al., 2012).<br />
<br />
==Framework==<br />
The training phase consists of the teacher providing the student with the subset <math> A_{train} </math> of <math> A </math> and then taking feedback to improve its own parameters. After the training process converges, the teacher model can be used to teach either new student models, or the same student models in new learning scenarios, such as when another subset <math> A_{test} </math> is provided. Such a generalization is feasible as long as the state representations <math> S </math> are the same across different student models and different scenarios. The L2T process is outlined in the figure below:<br />
<br />
[[File: L2T_process.png | 500px|center]]<br />
<br />
* <math> s_t &isin; S </math> represents information available to the teacher model at time <math> t </math>. <math> s_t </math> is typically constructed from the current student model <math> f_{t−1} </math> and the past teaching history of the teacher model. <math> S </math> represents the set of states.<br />
* <math> a_t &isin; A </math> represents the action taken by the teacher model at time <math> t </math>, given state <math>s_t</math>. <math> A </math> represents the set of actions, where the action(s) can be any combination of teaching tasks involving the training data, loss function, and hypothesis space. <br />
* <math> φ_θ : S → A </math> is the policy used by the teacher model to generate its action <math> φ_θ(s_t) = a_t </math>.<br />
* The student model takes <math> a_t </math> as input and outputs function <math> f_t </math>, using conventional ML techniques.<br />
<br />
Mathematically, taking data teaching as an example, where <math>L</math> and <math>\Omega</math> are fixed, the objective of the teacher in the L2T framework is <br />
<br />
<center> <math>\max\limits_{\theta}{\sum\limits_{t}{r_t}} = \max\limits_{\theta}{\sum\limits_{t}{r(f_t)}} = \max\limits_{\theta}{\sum\limits_{t}{r(\mu(\phi_{\theta}(s_t), L, \Omega))}}</math> </center><br />
<br />
Once the training process converges, the teacher model may be utilized to teach a different subset of <math> A </math> or teach a different student model.<br />
<br />
=Application=<br />
<br />
There are different approaches to training the teacher model; this paper applies reinforcement learning, with <math> φ_θ </math> being the ''policy'' that interacts with <math> S </math>, the ''environment''. The paper applies data teaching to train a deep neural network student, <math> f </math>, for several classification tasks. Thus the student feedback measure is classification accuracy. Its learning rule is mini-batch stochastic gradient descent, where batches of data arrive sequentially in random order. The teacher model is responsible for providing the training data, which in this case means it must determine which instances (subset) of each mini-batch will be fed to the student. To encourage faster convergence, the reward is tied to how quickly the student model learns. <br />
<br />
The authors also designed a state feature vector <math> g(s) </math> in order to efficiently represent the current states which include arrived training data and the student model. Within the State Features, there are three categories including Data features, student model features and the combination of both data and learner model. This state feature will be computed when each mini-batch of data arrives.<br />
<br />
Data features contain information about a data instance, such as its label category, (for texts) the length of the sentence, linguistic features for text segments (Tsvetkov et al., 2016), or (for images) gradient histogram features (Dalal & Triggs, 2005).<br />
<br />
Student model features include the signals reflecting how well the current neural network is trained. The authors collect several simple features, such as the number of mini-batches passed (i.e., the iteration), the average historical training loss and the historical validation accuracy.<br />
<br />
Some additional features are collected to represent the combination of both data and learner model. By using these features, the authors aim to represent how important the arrived training data is for the current learner. The authors mainly use three kinds of such signals in their classification tasks: 1) the predicted probabilities of each class; 2) the loss value on that data, which appears frequently in self-paced learning (Kumar et al., 2010; Jiang et al., 2014a; Sachan & Xing, 2016); 3) the margin value.<br />
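<br />
The loss and margin signals can be computed from the student's predicted class probabilities, as in the sketch below. The margin definition used here (probability of the true class minus the largest probability among the other classes) is an assumption, as is the function name <code>combined_features</code>; the summary above does not spell out the exact formula.<br />

```python
import math

def combined_features(probs, label):
    """Return (loss, margin) for one instance given predicted class probabilities."""
    p_true = probs[label]
    loss = -math.log(p_true)  # cross-entropy loss on this instance
    p_other = max(p for k, p in enumerate(probs) if k != label)
    margin = p_true - p_other  # assumed definition: true class vs. best other class
    return loss, margin

loss, margin = combined_features([0.7, 0.2, 0.1], label=0)
```

A large positive margin means the student already classifies the instance confidently, while a negative margin means it is currently misclassified.<br />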
<br />
The teacher model is trained to maximize the expected reward: <br />
<br />
\begin{align} <br />
J(θ) = E_{φ_θ(a|s)}[R(s,a)]<br />
\end{align}<br />
<br />
Since <math> J(θ) </math> is non-differentiable w.r.t. <math> θ </math>, a likelihood ratio policy gradient algorithm is used to optimize it (Williams, 1992) [4].<br />
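<br />
For reference, the likelihood ratio trick rewrites the gradient of the expected reward as an expectation that can be estimated by sampling. The identity below is the standard one; the symbol <math>v_t</math>, used here for the sampled reward estimate at step <math>t</math> of one episode of length <math>T</math>, is notation introduced for this sketch:<br />
<br />
<center> <math>\nabla_θ J(θ) = E_{φ_θ(a|s)}\left[\nabla_θ \log φ_θ(a|s)\, R(s,a)\right] \approx \sum_{t=1}^{T} \nabla_θ \log φ_θ(a_t|s_t)\, v_t</math> </center><br />
<br />
The gradient only requires differentiating the log-probability of the teacher's own actions, so the reward itself never has to be differentiated.<br />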
<br />
==Experiments==<br />
<br />
The L2T framework is tested on the following student models: multi-layer perceptron (MLP), ResNet (CNN), and Long-Short-Term-Memory network (RNN). <br />
<br />
The student tasks are image classification on MNIST and CIFAR-10, and sentiment classification on the IMDB movie review dataset. <br />
<br />
The strategy will be benchmarked against the following teaching strategies:<br />
<br />
::'''NoTeach''': NoTeach removes the entire Teacher-Student paradigm and reverts back to the classical machine learning paradigm. In the context of data teaching, we consider the architecture fixed, and feed data in a pre-determined way. One would pre-define batch-size and cross-validation procedures as needed.<br />
::'''Self-Paced Learning (SPL)''': Teaching by ''hardness'' of data, defined as the loss. This strategy begins by filtering out data with larger loss value to train the student with "easy" data and gradually increases the hardness. Mathematically speaking, those training data <math>d</math> satisfying loss value <math>l(d) > \eta</math> will be filtered out, where the threshold <math>\eta</math> grows from smaller to larger during the training process. To improve the robustness of SPL, following the widely used trick in common SPL implementations (Jiang et al., 2014b), the authors filter training data using its loss rank in one mini-batch rather than the absolute loss value: they filter the data instances with the top <math>K</math> largest training loss values within an <math>M</math>-sized mini-batch, where <math>K</math> linearly drops from <math>M - 1</math> to 0 during training.<br />
<br />
::'''L2T''': The Learning to Teach framework.<br />
::'''RandTeach''': Randomly filter data in each epoch according to the logged ratio of filtered data instances per epoch (as opposed to deliberate and dynamic filtering by L2T).<br />
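<br />
The rank-based SPL baseline can be sketched as follows. This is an illustrative reading of the description above, not the authors' implementation: the function names, the rounding of <math>K</math>, and the per-step (rather than per-epoch) schedule are assumptions.<br />

```python
def spl_filter_count(step, total_steps, batch_size):
    """K: number of instances to drop, falling linearly from M - 1 to 0."""
    progress = step / max(total_steps - 1, 1)
    return round((batch_size - 1) * (1.0 - progress))

def spl_select(losses, k):
    """Return sorted indices of the instances kept after dropping the k largest losses."""
    order = sorted(range(len(losses)), key=lambda i: losses[i])  # ascending loss
    return sorted(order[:len(losses) - k])

losses = [0.9, 0.1, 2.3, 0.4]
kept_early = spl_select(losses, spl_filter_count(0, 10, len(losses)))  # K = M - 1
kept_late = spl_select(losses, spl_filter_count(9, 10, len(losses)))   # K = 0
```

Early in training only the easiest instance of each mini-batch survives, and by the end every instance is kept.<br />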
<br />
For all teaching strategies, they make sure that the base neural network model will not be updated until <math>M </math> un-trained, yet selected data instances are accumulated. That is to guarantee that the convergence speed is only determined by the quality of taught data, not by different model updating frequencies. The model is implemented with Theano and run on one NVIDIA Tesla K40 GPU for each training/testing process.<br />
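<br />
The update rule described above, where the student is only updated once <math>M</math> selected instances have accumulated, can be sketched with a small buffer. The class name <code>SelectedBuffer</code> and its interface are hypothetical.<br />

```python
class SelectedBuffer:
    """Collect teacher-selected instances and release them in batches of size M."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = []

    def add(self, instances):
        """Add selected instances; return the list of full batches ready for an update."""
        self.pending.extend(instances)
        ready = []
        while len(self.pending) >= self.batch_size:
            ready.append(self.pending[:self.batch_size])
            self.pending = self.pending[self.batch_size:]
        return ready

buffer = SelectedBuffer(batch_size=4)
first = buffer.add([1, 2, 3])   # only 3 selected so far: no update yet
second = buffer.add([4, 5])     # 5 accumulated: one batch of 4 is released
```

Because every strategy updates the student on batches of the same size, differences in convergence speed reflect which data was selected rather than how often the model was updated.<br />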
===Training a New Student===<br />
<br />
In the first set of experiments, the datasets are divided into two folds. The first fold is used to train the teacher; this is done by having the teacher train a student network on that half of the data, with a certain portion being used for computing rewards. After training, the teacher parameters are fixed, and used to train a new student network (with the same structure) on the second half of the dataset. When teaching a new student with the same model architecture, L2T achieves significantly faster convergence than other strategies across all tasks, especially compared to the NoTeach and RandTeach methods:<br />
<br />
[[File: L2T_speed.png | 1100px|center]]<br />
<br />
===Filtration Number===<br />
<br />
When investigating the details of filtered data instances per epoch for the two image classification tasks, the L2T teacher filters an increasing amount of data as training goes on. The authors' intuition is that, for these tasks, the student model can learn from harder instances of data from the beginning, and thus the teacher can filter redundant data. In contrast, for the natural language task, the student model must first learn from easy data instances.<br />
<br />
[[File: L2T_fig3.png | 1100px|center]]<br />
<br />
===Teaching New Student with Different Model Architecture===<br />
<br />
In this part, a teacher model is first trained by interacting with a student model. Then, using the teacher model, another student model with a different architecture is taught.<br />
The results of applying the teacher trained on ResNet32 to teach other architectures are shown below. The L2T algorithm can be seen to obtain higher accuracies earlier than the SPL, RandTeach, or NoTeach algorithms.<br />
<br />
[[File: L2T_fig4.png | 1100px|center]]<br />
<br />
===Training Time Analysis===<br />
<br />
The learning curves demonstrate the efficiency in accuracy achieved by the L2T over the other strategies. This is especially evident during the earlier training stages.<br />
<br />
[[File: L2T_fig5.png | 600px|center]]<br />
<br />
===Accuracy Improvement===<br />
<br />
On the IMDB sentiment classification task, the L2T teaching policy also improves accuracy over NoTeach and SPL.<br />
<br />
[[File: L2T_t1.png | 500px|center]]<br />
<br />
Table 1 shows that, in addition to boosting the convergence speed, the teacher model improves final accuracy. The student model is the LSTM network trained on IMDB. Prior to teaching the student model, the authors train the teacher model on half of the training data, and define the terminal reward as the accuracy on a held-out set after the teacher model trains the student for 15 epochs. Then the teacher model is applied to train the student model on the full dataset until convergence. The state features are kept the same as those in previous experiments. L2T achieves better classification accuracy for training the LSTM network, surpassing the SPL baseline by more than 0.6 points (with p value < 0.001).<br />
<br />
=Future Work=<br />
<br />
There is some useful future work that can be extended from this work: <br />
<br />
1) Recent advances in multi-agent reinforcement learning could be tried on the Reinforcement Learning problem formulation of this paper. <br />
<br />
2) Some human in the loop architectures like CHAT and HAT (https://www.ijcai.org/proceedings/2017/0422.pdf) should give better results for the same framework. <br />
<br />
3) It would be interesting to try out the framework suggested in this paper (L2T) in Imperfect information and partially observable settings. <br />
<br />
4) Since the authors focused on data teaching, exploring loss function teaching would be interesting.<br />
<br />
=Critique=<br />
<br />
While the conceptual framework of L2T is sound, the paper only experimentally demonstrates efficacy for ''data teaching'', which would seem to be the simplest to implement. The feasibility and effectiveness of teaching the loss function and hypothesis space are not explored in a real-world scenario. Also, this paper does not provide enough mathematical foundation to prove that this model can be generalized to other datasets and other general problems. The method presented here, where the teacher model filters data, does not seem to provide enough action space for the teacher model. Furthermore, the experimental results for data teaching suggest that the speed of convergence is the main improvement over other teaching strategies, whereas the difference in accuracy is less remarkable. The paper also assesses accuracy only by comparing L2T with NoTeach and SPL on the IMDB classification task; the improvement (or lack thereof) on the other classification tasks and teaching strategies is omitted. Again, this distinction is not possible to assess in loss function or hypothesis space teaching within the scope of this paper. The authors could have included larger datasets such as ImageNet and CIFAR100 in their experiments, which would have provided some more insight.<br />
<br />
Also, teaching should not be limited to data, loss function and hypothesis space. In a human teacher-student model, the teaching contents are concepts and logical rules, similar to weights of hidden layers in neural networks. How to transfer such knowledge is interesting to investigate.<br />
<br />
The idea of having a generalizable teacher model to enhance student learning is admirable. In fact, the L2T framework is similar to the reinforcement learning actor-critic model, which is known to be effective. In general, one expects that an effective teacher model would facilitate transfer learning and significantly reduce student model training time. However, the L2T framework seems to fall short of that goal. Consider the CIFAR10 training scenario: the L2T model achieves 85% accuracy after 2 million training examples, which is only about 3% more accurate than a no-teacher model. Perhaps in the future, the L2T framework can improve and produce better performance.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Hierarchical_Representations_for_Efficient_Architecture_Search&diff=42144Hierarchical Representations for Efficient Architecture Search2018-11-30T22:50:12Z<p>Z43ma: </p>
<hr />
<div>Summary of the paper: [https://arxiv.org/abs/1711.00436 ''Hierarchical Representations for Efficient Architecture Search'']<br />
<br />
= Introduction =<br />
<br />
Deep Neural Networks (DNNs) have shown remarkable performance in areas such as computer vision and natural language processing; however, improvements over previous benchmarks have required extensive research and experimentation by domain experts. In DNNs, the composition of linear and nonlinear functions produces internal representations of data which are in most cases better than handcrafted ones; consequently, researchers using Deep Learning techniques have lately shifted their focus from working on input features to designing optimal DNN architectures. However, the quest for an optimal DNN architecture by combining layers and modules requires frequent trial-and-error experiments, a task that resembles the earlier search for optimal handcrafted features. As researchers aim to solve more difficult challenges, the complexity of the resulting DNNs also increases; therefore, some studies are introducing automated techniques that search for optimal architectures. The emerging field of Neural Architecture Search aims to tackle exactly this problem. Its goal is to transform the problem of designing a network into a search problem. A search problem needs a clear definition of three things: the search space, the search strategy, and the performance evaluation strategy. The search space is a high-level description of the architecture of the network. It needs to contain enough freedom that the resulting model has enough expressive power, but it cannot be so broad that the search process becomes too computationally expensive. The search strategy determines how to search the space efficiently. The performance evaluation strategy is the method used to evaluate a candidate network. Evaluation is the trickier part because, in order to evaluate a neural network, we need to train it first, and training takes time. 
So it is important to define a proxy task that can help us better evaluate a network. Here, this paper will tackle these problems with a new hierarchical representation.<br />
<br />
Lately, the use of algorithms for finding optimal DNN architectures has attracted the attention of researchers who have tackled the problem through four main groups of techniques. The first such method employs a supplementary network called a “HyperNet”, which generates ideal network weights given a random architecture. There are two main parts to generating an “optimal” architecture. First, we train the HyperNet. One training cycle consists of generating a random architecture from a sample space of allowed architectures and generating its predicted weights with the HyperNet. Then, the validation score of this proposed network is calculated, and the error is backpropagated through the HyperNet. In this manner, the HyperNet can learn to assign robustly optimal initial weights to a given architecture. At “test” time, we generate a random sample of architectures and predict initialized weights for each with our tuned HyperNet. We take the model with the highest validation score and train it as we would a regular architecture. We use this heuristic of “initial validation error” because the relative performance of networks typically stays constant throughout training. That is, if a network starts off better, it will very likely end better. The second technique is Monte Carlo Tree Search (MCTS), which repeatedly narrows the search space by focusing on the most promising architectures previously seen. The third group of techniques uses evolutionary algorithms, where fitness criteria are applied to filter the initial population of DNN candidates; new individuals are then added to the population by selecting the best-performing ones and modifying them with one or several random mutations, as in [https://arxiv.org/abs/1703.01041 [Real, 2017]]. 
The fourth and last group of techniques implements Reinforcement Learning, where a policy-based controller seeks to optimize the expected accuracy of new architectures based on rewards (accuracy) gained from previous proposals in the architecture space. Of these four groups of techniques, Reinforcement Learning has offered the best experimental results; however, the paper we are summarizing implements evolutionary algorithms as its main approach.<br />
<br />
Regardless of the technique used to look for an optimal architecture, searching in the architecture space usually requires the training and evaluation of many DNN candidates; therefore, it demands huge computational resources and poses a significant limitation for practical applications. Consequently, most techniques narrow the search space with predefined heuristics, either at the beginning or dynamically during the search process. In the paper we are summarizing, the authors reduce the number of feasible architectures by forcing a hierarchical structure between network components. In other words, each DNN suggested as a candidate is formed by combining basic building blocks to form small modules; the same basic structures introduced for the building blocks are then used to combine and stack networks on the upper levels of the hierarchy. This approach allows the search algorithm to sample highly complex and modularized networks similar to Inception or ResNet.<br />
<br />
Despite some weaknesses regarding the efficiency of evolutionary algorithms, this study reveals that these techniques can in fact generate architectures which show competitive performance when a narrowing strategy is imposed over the search space. Accordingly, the main contributions of this paper are a well-defined set of hierarchical representations, which acts as the filtering criterion to pick DNN candidates, and a novel evolutionary algorithm which produces image classifiers that achieve state-of-the-art performance among similar evolutionary-based techniques.<br />
<br />
=Architecture representations=<br />
<br />
==Flat architecture representation==<br />
All the evaluated network architectures are directed acyclic graphs with only one source and one sink. Each node in the network represents a feature map and consequently, each directed edge represents an operation that takes the feature map in the departing node as input and outputs a feature map on the arriving node. Under the previous assumption, any given architecture in the narrowed search space is formally expressed as a graph assembled by a series of operations (edges) among a defined set of adjacent feature maps (nodes).<br />
<br />
[[File:flatarch.PNG | 650px|thumb|center|Flat architecture representation of neural networks]]<br />
<br />
Multiple primitive operations defined in [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Hierarchical_Representations_for_Efficient_Architecture_Search#Primitive_operations section 2.3] are used to form small networks defined as ''motifs'' by the authors. To combine the outputs of multiple primitive operations and guarantee a unique output per motif the authors introduce a merge operation which in practice works as a depthwise concatenation that does not require inputs with the same number of channels.<br />
<br />
Accordingly, these motifs can also be combined to form more complex motifs on a higher level in the hierarchy until the network is complex enough to perform competitively in challenging classification tasks.<br />
<br />
==Hierarchical architecture representation==<br />
<br />
The composition of more complex motifs based on simpler motifs at lower levels allows the authors to create a hierarchy-like representation of very complex DNNs starting with only a few primitive operations, as shown in Figure 1. In other words, an architecture with <math> L </math> levels has only primitive operations at its bottom and only one complex motif at its top. Any motif in between the bottom and top levels can be defined as the composition of motifs in lower levels of the hierarchy.<br />
<br />
Formally, the <math>m</math>-th motif in level <math>l</math>, <math>o_m^{(l)}</math>, is recursively defined as the composition of lower-level motifs <math>\textbf{o}^{(l-1)}</math> according to its network structure.<br />
<br />
<center><math> o_m^{(l)}=assemble(G_m^{(l)}, \textbf{o}^{(l-1)})</math></center><br />
<br />
[[File:hierarchicalrep.PNG | 700px|thumb|center|Figure 1. Hierarchical architecture representation]]<br />
<br />
In Figure 1, the architecture of the full model (its flat structure) is shown in the top right corner. The input (source) is the bottom-most node. The output (sink) is the topmost node. The paper presents an alternative hierarchical view of the model shown on the left-hand side (before the assemble function). This view represents the same model in three levels. The first level is a set of primitive operations only (bottom row, middle column). In all other levels, component motifs (computational graphs) <math>G</math> are described by an adjacency matrix and a set of operations. The set of operations comes from the previous level. An example motif <math> G^{(2)}_{1}</math> in the second level is shown in the bottom row (left and middle columns). There are three unique motifs in the second level. These are shown in the middle of the top row. Note that the motifs in one level become the operations in the next level. The higher level can use these motifs multiple times. Finally, the top-level graph, which contains only one motif, <math> G^{(3)}_{1}</math>, is shown in the top row, left column. Here, there are 4 nodes with 6 operations defined between them.<br />
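<br />
The recursive definition above can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: the name <code>assemble</code> is reused only for readability, <code>G[i][j] = k</code> is read as "operation k of the level below sits on the edge from node j to node i" with 0 meaning no edge, and feature maps are replaced by strings so that the flattened structure can be inspected.<br />
<br />
```python
from typing import Callable, List

Op = Callable[[str], str]


def primitive(name: str) -> Op:
    """Stand-in for a primitive operation: wraps its input in the op name."""
    return lambda x: f"{name}({x})"


def assemble(G: List[List[int]], ops: List[Op]) -> Op:
    """Build a motif from an adjacency matrix G and lower-level operations.

    Assumed convention: G[i][j] = k (k >= 1) places ops[k - 1] on the edge
    j -> i, and 0 means "no edge". Node 0 is the source, the last node the sink.
    """
    n = len(G)

    def motif(x: str) -> str:
        nodes = [x]
        for i in range(1, n):
            incoming = [ops[G[i][j] - 1](nodes[j]) for j in range(i) if G[i][j] != 0]
            if len(incoming) == 1:
                nodes.append(incoming[0])
            else:
                # merge is a depthwise concatenation in the paper; a string join here
                nodes.append("merge(" + ", ".join(incoming) + ")")
        return nodes[-1]

    return motif


# Level 1: primitive operations (an illustrative subset).
level1 = [primitive("conv1x1"), primitive("maxpool3x3"), primitive("identity")]

# Level 2: one three-node motif built from level-1 operations.
G2 = [[0, 0, 0],
      [1, 0, 0],
      [3, 2, 0]]
motif2 = assemble(G2, level1)

# Level 3: a two-node top-level motif that uses the level-2 motif as its operation.
G3 = [[0, 0],
      [1, 0]]
top = assemble(G3, [motif2])
print(top("x"))
```
<br />
Because a motif returned by <code>assemble</code> has the same call signature as a primitive, it can be passed as an operation to the next level, which is the property the hierarchy relies on.<br />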
<br />
==Primitive operations==<br />
<br />
The six primitive operations used as building blocks for connecting nodes in either flat or hierarchical representations are:<br />
* 1 × 1 convolution of C channels<br />
* 3 × 3 depthwise convolution<br />
* 3 × 3 separable convolution of C channels<br />
* 3 × 3 max-pooling<br />
* 3 × 3 average-pooling<br />
* Identity mapping<br />
<br />
The authors argue that convolution operations involving larger receptive fields can be obtained by the composition of lower-level motifs with smaller receptive fields. Similarly, convolution operations with a larger number of channels can be generated by the depthwise concatenation of lower-level motifs. Batch normalization and the ''ReLU'' activation function are applied after each convolution in the network. A seventh operation, called null, is used in the adjacency matrix <math> G </math> to state explicitly that there is no operation between two nodes.<br />
<br />
<br />
Side note:<br />
<br />
Some explanations of the different types of convolution:<br />
<br />
* Spatial convolution: Convolutions performed in spatial dimensions - width and height.<br />
* Depthwise convolution: Spatial convolution performed independently over each channel of an input.<br />
* 1x1 convolution: Convolution with the kernel of size 1x1<br />
<br />
[[File:convolutions.png | 350px|thumb|center]]<br />
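<br />
To make the difference between these convolution types concrete, the following sketch counts weights per layer. It uses the standard parameter-count formulas, ignores biases and batch normalization, and treats a separable convolution as a depthwise convolution followed by a 1x1 convolution. The function names and the choice of 64 channels are illustrative assumptions, not values from the paper.<br />
<br />
```python
def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights of a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def depthwise_params(k: int, channels: int) -> int:
    """Weights of a k x k depthwise convolution: one k x k filter per channel."""
    return k * k * channels


def separable_params(k: int, c_in: int, c_out: int) -> int:
    """Depthwise k x k convolution followed by a 1x1 (pointwise) convolution."""
    return depthwise_params(k, c_in) + conv_params(1, c_in, c_out)


C = 64  # illustrative channel count
print("standard 3x3 :", conv_params(3, C, C))       # 36864
print("1x1          :", conv_params(1, C, C))       # 4096
print("depthwise 3x3:", depthwise_params(3, C))     # 576
print("separable 3x3:", separable_params(3, C, C))  # 4672
```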
<br />
=Evolutionary architecture search=<br />
<br />
Before moving forward we introduce the concept of genotypes in the context of the article. In this article, a genotype is a particular neural network architecture defined according to the components described in [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Hierarchical_Representations_for_Efficient_Architecture_Search#Architecture_representations section 2]. To make the NN architectures ''evolve'', the authors implemented a three-stage process that consists of establishing the permitted mutations, creating an initial population, and making the candidates compete in a tournament in which only the best candidates are selected for mutation.<br />
<br />
==Mutation==<br />
<br />
One mutation over a specific architecture is a sequence of five steps in the following order:<br />
<br />
* Sample a level in the hierarchy other than the primitive (bottom) level.<br />
* Sample a motif in that level.<br />
* Sample a successor node <math>(i)</math> in the motif.<br />
* Sample a predecessor node <math>(j)</math> in the motif.<br />
* Replace the current operation between nodes <math>i</math> and <math>j</math> with one of the available operations.<br />
<br />
The original operation between the nodes <math>i</math> and <math>j</math> in the graph is defined as <math> [G_{m}^{\left ( l \right )}] _{ij} = k </math>. Therefore, a mutation between the same pair of nodes is defined as <math> [G_{m}^{\left ( l \right )}] _{ij} = {k}' </math>.<br />
<br />
The allowed mutations include:<br />
# Change the basic primitive between the predecessor and successor nodes (i.e. alter an existing edge): if <math>o_k^{(l-1)} \neq none</math> and <math>o_{k'}^{(l-1)} \neq none</math> and <math>o_{k'}^{(l-1)} \neq o_k^{(l-1)}</math><br />
# Add a connection between two previously unconnected nodes. The connection between the nodes can have any of the six possible primitives: if <math>o_k^{(l-1)}=none</math> and <math>o_{k'}^{(l-1)} \neq none</math><br />
# Remove a connection between existing nodes: if <math>o_k^{(l-1)} \neq none</math> and <math>o_{k'}^{(l-1)} = none</math><br />
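<br />
The five sampling steps and the three resulting cases can be sketched as follows. This is only an illustration under stated assumptions: the genotype layout (a list of levels, each a list of adjacency matrices), the names <code>mutate</code> and <code>kind</code>, and the choice to always pick a replacement different from the current operation are hypothetical and not taken from the authors' implementation.<br />
<br />
```python
import copy
import random
from typing import List

NONE = 0  # assumed index of the "null" operation (no edge)

# genotype[l][m] is the adjacency matrix of motif m at non-primitive level l
Genotype = List[List[List[List[int]]]]


def mutate(genotype: Genotype, num_ops: List[int], rng: random.Random) -> Genotype:
    """Return a mutated copy; num_ops[l] counts the operations usable at level l
    (excluding the null operation)."""
    child = copy.deepcopy(genotype)
    level = rng.randrange(len(child))   # 1. sample a non-primitive level
    motif = rng.choice(child[level])    # 2. sample a motif in that level
    i = rng.randrange(1, len(motif))    # 3. sample a successor node i
    j = rng.randrange(i)                # 4. sample a predecessor node j < i
    old = motif[i][j]
    # 5. replace the operation on the edge j -> i (assumption: never keep the old one)
    motif[i][j] = rng.choice([k for k in range(num_ops[level] + 1) if k != old])
    return child


def kind(old: int, new: int) -> str:
    """Classify a change of one matrix entry into the three allowed mutations."""
    if old == NONE and new != NONE:
        return "add"
    if old != NONE and new == NONE:
        return "remove"
    if old != new:
        return "alter"
    return "no-op"


parent = [
    [[[0, 0, 0], [1, 0, 0], [0, 2, 0]]],  # one level-2 motif with three nodes
    [[[0, 0], [1, 0]]],                   # one level-3 motif with two nodes
]
child = mutate(parent, num_ops=[6, 1], rng=random.Random(0))
print(child)
```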
<br />
==Initialization==<br />
<br />
An initial population is required to start the evolutionary algorithm; therefore, the authors introduced a trivial genotype (candidate solution, the hierarchical architecture of the model) composed only of identity mapping operations. Then a large number of random mutations was run over the ''trivial genotype'' to simulate a diversification process. The authors argue that this diversification process generates a representative population in the search space and at the same time prevents the use of any handcrafted NN structures. Surprisingly, some of these random architectures show a performance comparable to the performance achieved by the architectures found later during the evolutionary search algorithm.<br />
<br />
==Search algorithms==<br />
<br />
Tournament selection and random search are the two search algorithms used by the authors. <br />
<br />
=== Tournament Selection ===<br />
In one iteration of the tournament selection algorithm, 5% of the entire population is randomly selected, trained, and evaluated against a validation set. Then the best performing genotype is picked to go through the mutation process and put back into the population. No genotype is ever removed from the population, but the selection criteria guarantee that only the best performing models will be selected to ''evolve'' through the mutation process.<br />
<br />
We define the pseudocode for tournament selection as follows:<br />
<br />
1. Choose k (the tournament size) individuals from the population at random<br />
<br />
2. Choose the best individual from the tournament with probability p<br />
<br />
3. Choose the second best individual with probability p*(1-p)<br />
<br />
4. Choose the third best individual with probability p*((1-p)^2)<br />
<br />
5. Continue until the number of selected individuals equals the desired number.<br />
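<br />
The numbered steps can be sketched in Python as shown below. The function name, its parameters, and the fallback when no contestant is accepted are assumptions made for illustration. With <code>fraction=0.05</code> and <code>p=1.0</code> the sketch reduces to the deterministic variant described at the start of this section, where the best of a 5% sample is always chosen.<br />
<br />
```python
import random
from typing import Dict


def tournament_select(fitness: Dict[str, float], fraction: float, p: float,
                      rng: random.Random) -> str:
    """Pick one individual by tournament selection.

    fitness maps a genotype id to its fitness, fraction is the share of the
    population entering the tournament, and p is the probability of accepting
    each contestant in rank order (best first).
    """
    k = max(1, int(round(fraction * len(fitness))))
    contestants = rng.sample(sorted(fitness), k)
    ranked = sorted(contestants, key=lambda g: fitness[g], reverse=True)
    for genotype in ranked:
        # best is taken with probability p, second best with p * (1 - p), ...
        if rng.random() < p:
            return genotype
    return ranked[0]  # assumption: fall back to the best if nobody was accepted


rng = random.Random(0)
population = {f"g{i:03d}": i / 100 for i in range(100)}
winner = tournament_select(population, fraction=0.05, p=1.0, rng=rng)
print(winner, population[winner])
```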
<br />
Tournament selection is often chosen over alternative selection schemes in genetic algorithms due to the following benefits: it is efficient to code, works on parallel architectures, and allows the selection pressure to be easily adjusted.<br />
<br />
=== Random Search ===<br />
In the random search algorithm every genotype from the initial population is trained and evaluated, then the best performing model is selected. Compared with the tournament selection algorithm, the random search algorithm is much simpler, and the training and evaluation process for every genotype can be run in parallel to reduce search time. This algorithm has not yet been widely studied in the architecture search literature.<br />
<br />
==Implementation==<br />
<br />
To implement the tournament selection algorithm, two auxiliary algorithms are introduced. The first is called the controller and directs the evolution process over the population: it repeatedly picks 5% of the genotypes from the current population, sends them to the tournament, and then applies a random mutation to the best performing genotype from each group. <br />
<br />
[[File:asyncevoalgorithm1.PNG | 700px|thumb|center|Controller]]<br />
<br />
The second auxiliary algorithm is called the worker and is in charge of training and evaluating each genotype, a task that must be completed each time a new genotype is created and added to the population either by an initialization step or by an evolutionary step.<br />
<br />
[[File:asyncevoalgorithm2.PNG | 700px|thumb|center|Worker]]<br />
<br />
Both auxiliary algorithms work together asynchronously and communicate with each other through a shared tabular memory file where genotypes and their corresponding fitness are recorded.<br />
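<br />
A rough sketch of this controller/worker split is given below. It is not the authors' implementation: Python threads stand in for GPU workers, an in-memory dictionary stands in for the shared tabular memory file, each initial genotype receives a single mutation, and all names (<code>run_search</code>, <code>toy_mutate</code>) and the toy fitness function are hypothetical.<br />
<br />
```python
import queue
import random
import threading


def run_search(fitness_fn, mutate_fn, trivial, init_steps, evo_steps,
               num_workers=4, fraction=0.05, seed=0):
    rng = random.Random(seed)  # used only by the controller
    memory = {}                # shared table: genotype -> fitness
    lock = threading.Lock()
    jobs = queue.Queue()

    def worker():
        # Worker: train and evaluate each queued genotype, then record its fitness.
        while True:
            genotype = jobs.get()
            if genotype is None:  # sentinel: shut down
                jobs.task_done()
                return
            score = fitness_fn(genotype)
            with lock:
                memory[genotype] = score
            jobs.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    # Initialization: diversify the trivial genotype with random mutations.
    for _ in range(init_steps):
        jobs.put(mutate_fn(trivial, rng))
    jobs.join()  # wait until the initial population has been scored

    # Evolution: the controller keeps proposing without waiting for the workers.
    for _ in range(evo_steps):
        with lock:
            snapshot = list(memory.items())
        k = max(1, int(fraction * len(snapshot)))
        contestants = rng.sample(snapshot, k)
        best, _score = max(contestants, key=lambda item: item[1])
        jobs.put(mutate_fn(best, rng))

    for _ in threads:
        jobs.put(None)
    for t in threads:
        t.join()
    return memory


def toy_mutate(genotype, rng):
    """Flip one bit of a tuple-encoded toy genotype."""
    i = rng.randrange(len(genotype))
    return genotype[:i] + (1 - genotype[i],) + genotype[i + 1:]


table = run_search(sum, toy_mutate, (0,) * 8, init_steps=20, evo_steps=100)
print(max(table.values()))
```
<br />
Because workers finish in a nondeterministic order, the set of genotypes visited can differ between runs even with a fixed seed; only the invariants of the table (every recorded fitness matches its genotype) are stable.<br />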
<br />
=Experiments and results=<br />
<br />
==Experimental setup==<br />
<br />
Instead of looking for a complete NN model, the search framework introduced in [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Hierarchical_Representations_for_Efficient_Architecture_Search#Architecture_representations section 2] is applied to look for the best performing architectures of a small neural network module called the convolutional cell. Using small modules as building blocks to form a larger and more complex model is an approach that has proved successful in previous cases such as the Inception architecture. Additionally, this approach allowed the authors to evaluate cell candidates efficiently and scale to larger and more complex models faster.<br />
<br />
In total, three models were implemented as hosts for the experimental cells: the first two use the CIFAR-10 dataset and the third uses the ImageNet dataset. The search framework is run only on the first host model to look for the best performing cells ([https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Hierarchical_Representations_for_Efficient_Architecture_Search#Architecture_search_on_CIFAR-10 section 4.2]). Once found, these cells were inserted into the second and third host models to evaluate overall performance on the respective datasets ([https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Hierarchical_Representations_for_Efficient_Architecture_Search#Architecture_evaluation_on_CIFAR-10_and_ImageNet section 4.3]).<br />
<br />
The terms training time step, initialization time step, and evolutionary time step will be used to describe some parts of the experiments. Be aware that these three terms have different meanings; however, each term will be properly defined when introduced.<br />
<br />
==Architecture search on CIFAR-10==<br />
<br />
The overall goal in this stage is to find the best performing cells. The search framework is run using the small CIFAR-10 model depicted in Figure 2 as the host model for the cells; therefore, during the search process, only the cells change while the rest of the host model’s structure remains the same. In the context of the evolutionary search algorithm, a cell is also called a candidate or a genotype. Additionally, on every time step during the search process, the three cells in the model share the same structure; consequently, every time a new candidate architecture is evaluated, the three cells simultaneously adopt the new candidate’s architecture.<br />
<br />
[[File:smallcifar10.PNG | 350px|thumb|center|Figure 2. Small CIFAR-10 model]]<br />
<br />
To begin the architecture searching process an initial population of genotypes is required. Random mutations are applied over a trivial genotype to generate a candidate and grow the seminal population. This is called an initialization step and is repeated 200 times to produce an equivalent number of candidates. Creating these 200 candidates with random structures is equivalent to running a random search over a constrained architecture space. <br />
<br />
Then, the evolutionary search algorithm takes over and runs from time step 201 up to time step 7000; these are called evolutionary time steps. On each evolutionary time step, a group of genotypes equivalent to 5% of the current population is selected randomly and sent to the tournament for fitness computation. To perform a fitness evaluation, each candidate cell is inserted into the three predefined positions within the small CIFAR-10 host model. Then, for each candidate cell, the host model is trained with stochastic gradient descent for 5000 training steps with a decreasing learning rate. Because a standard deviation of up to 0.2% was observed when evaluating the exact same model, the overall fitness is obtained as the average of four training-evaluation runs. This variance is attributed to the optimization process. Finally, a random mutation is applied to a copy of the best cell within the group to create a new genotype that is added to the current population.<br />
<br />
The fitness of each evaluated genotype is recorded in the shared tabular memory file to avoid recalculation in case the same genotype is selected again in a future evolutionary time step.<br />
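<br />
The averaging over four runs and the reuse of recorded fitness values can be sketched as a small cache. The class <code>FitnessTable</code>, the callback signature <code>train_and_eval(genotype, seed)</code>, and the fake scores are assumptions for illustration; in the paper the table is a shared memory file and each run is a full training of the host model.<br />
<br />
```python
from statistics import mean
from typing import Callable, Dict, Hashable


class FitnessTable:
    """Caches the mean fitness of each genotype over several training runs."""

    def __init__(self, train_and_eval: Callable[[Hashable, int], float], runs: int = 4):
        self._train_and_eval = train_and_eval
        self._runs = runs
        self._table: Dict[Hashable, float] = {}
        self.evaluations = 0  # number of individual training runs performed

    def fitness(self, genotype: Hashable) -> float:
        if genotype not in self._table:
            scores = [self._train_and_eval(genotype, seed) for seed in range(self._runs)]
            self.evaluations += self._runs
            self._table[genotype] = mean(scores)
        return self._table[genotype]


def fake_train_and_eval(genotype: Hashable, seed: int) -> float:
    """Hypothetical stand-in for training the host model; returns a validation accuracy."""
    return 0.89 + 0.001 * seed


table = FitnessTable(fake_train_and_eval)
first = table.fitness("cell-A")
second = table.fitness("cell-A")  # served from the table, no retraining
print(first, second, table.evaluations)
```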
<br />
The search framework is run for 7000 time steps (200 initialization time steps followed by evolutionary time steps) for each of three different types of cell architecture, namely hierarchical representation, flat representation, and flat representation with constrained parameters. <br />
<br />
* A cell that follows a hierarchical representation has NN connections at three different levels; at the bottom level it has only primitive operations, at the second level it contains motifs with four nodes, and at the third level it has only one motif with five nodes.<br />
<br />
* A cell that follows a flat representation has 11 nodes with only primitive operations between them. These cells look similar to level 2 motifs but instead of having four nodes they have 11 and therefore many more pairs of nodes and operations.<br />
<br />
* For a cell that follows a flat representation with constrained parameters, the total number of parameters used by its operations cannot exceed the total number of parameters used by the cells that follow a hierarchical representation.<br />
<br />
Figure 3 shows the current fitness achieved by the best performing cell from each of the three types of cells when plugged into the small CIFAR-10 model. Even though the fitness grows rapidly after the first 200 (initialization) time steps, it tends to plateau between 89% and 90%. Overall, cells that follow a flat representation without restriction on the number of parameters tend to perform better than those following a hierarchical structure. This could be because the flat representation allows more flexibility when adding connections between nodes, especially between distant ones. Unfortunately, the authors do not describe the architecture of the best performing flat cell.<br />
<br />
[[File:currentfitness.PNG | 300px|thumb|center|Figure 3. Current fitness]]<br />
<br />
Figure 4 presents the maximum fitness reached by any cell seen by the search framework for each of the three types of cells. The fitness at time step 200 is, therefore, equivalent to the best model obtained by a random search over 200 architectures from each type of cell.<br />
<br />
[[File:maxfitness.PNG | 300px|thumb|center|Figure 4. Maximum fitness]]<br />
<br />
The total number of parameters used by each genotype at any given time step is shown in Figure 5. It suggests that flat representations tend to add more connections over time and most likely those connections correspond to convolutional operations which in turn require more parameters than other primitive operations.<br />
<br />
[[File:numparameters.PNG | 300px|thumb|center|Figure 5. Number of parameters]]<br />
<br />
To run each time step (either initialization or evolutionary) in the search framework, one GPU needs about one hour to perform the four training and evaluation rounds for a single candidate. The authors therefore used 200 GPUs simultaneously to complete 7000 time steps in 35 hours. Considering the three types of cell (hierarchical, flat, and parameter-constrained flat), approximately 20000 GPU-hours would be required to replicate the experiment.<br />
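<br />
The compute estimate follows from simple arithmetic, reproduced here as a short Python check. The variable names are illustrative; the inputs are the figures quoted in the paragraph above.<br />
<br />
```python
time_steps = 7000        # initialization + evolutionary time steps per cell type
gpu_hours_per_step = 1   # four training-evaluation runs of one candidate
num_gpus = 200           # GPUs used in parallel
cell_types = 3           # hierarchical, flat, parameter-constrained flat

gpu_hours_per_search = time_steps * gpu_hours_per_step
wall_clock_hours = gpu_hours_per_search / num_gpus
total_gpu_hours = gpu_hours_per_search * cell_types

print(wall_clock_hours)  # 35.0 hours per cell type
print(total_gpu_hours)   # 21000 GPU-hours, i.e. roughly 20000
```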
<br />
==Architecture evaluation on CIFAR-10 and ImageNet==<br />
<br />
Once the evolutionary search finds the best-fitted cells, they are plugged into the two larger host models to evaluate their performance in those more complex architectures. The first large model (Figure 6) targets image classification on the CIFAR-10 dataset and the second model (Figure 7) targets image classification on the ImageNet dataset. Although all the parameters in these two larger host models are trained from scratch, including those within the cells, the cells’ architectures do not change, since their structure was already selected during the evolutionary search.<br />
<br />
The large CIFAR-10 model is trained with stochastic gradient descent for 80K training steps with a decreasing learning rate. To account for the non-negligible standard deviation found when evaluating the exact same model, the percentage of error is determined as the average of five training-evaluation runs.<br />
<br />
[[File:largecifar10.PNG | 500px|thumb|center|Figure 6. Large CIFAR-10 model]]<br />
<br />
The ImageNet model is trained with stochastic gradient descent for 200K training steps with a decreasing learning rate. For this model, neither standard deviation nor multiple training-evaluation runs were reported.<br />
<br />
[[File:imagenetmodel.PNG | 600px|thumb|center|Figure 7. ImageNet model]]<br />
<br />
In [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Hierarchical_Representations_for_Efficient_Architecture_Search#Architecture_search_on_CIFAR-10 section 4.2] three types of cells were described: hierarchical, flat, and parameter-constrained flat. For the hierarchical type of cells, the percentage of error in both large models is reported in Table 1 for four different cases: a cell with random architecture, the best-fitted cell from 200 random architectures, the best-fitted cell from 7000 random architectures, and the best-fitted cell after 7000 evolutionary steps. On the other hand, for the flat and parameter-constrained flat types of architecture, only some of the mentioned four cases are reported in Table 1.<br />
<br />
[[File:comparisoncells.PNG | 750px|thumb|center|Table 1. Comparison between types of cells and searching method]]<br />
<br />
According to the results in Table 1, for both large host models, the hierarchical cell found by the evolutionary search algorithm achieved the lowest errors with 3.75% in CIFAR-10, 20.3% top-1 error and 5.2% top-5 error in ImageNet. The errors reported in both datasets are calculated by using the trained large models on test sets of images never seen before during any of the previous stages. Even though the cell that follows a hierarchical representation achieved the lowest error, the ones showing the lowest standard deviations are those following a flat representation.<br />
<br />
The performance achieved by the large CIFAR-10 host model using the best cell is then compared against other classifiers in Table 2. As an additional improvement, the authors increased the number of channels in its first convolutional layer from 64 to 128. It is worth noting that this first convolutional layer is not part of the cell obtained during the evolutionary search process; instead, it is part of the original host model. The results are grouped into three categories depending on how the classifiers involved in the comparison were created, from top to bottom: handcrafted, reinforcement learning, and evolutionary algorithms.<br />
<br />
[[File:comparisonlargecifar10.PNG | 500px|thumb|center|Table 2. Comparison against other classifiers on CIFAR-10]]<br />
<br />
The classification error achieved by the ImageNet host model when using the best cell is also compared against some high performing image classifiers in the literature and the results are presented in Table 3. Although the classification error scored by the architecture introduced in this paper is not significantly lower than those obtained by state of the art classifiers, it shows outstanding results considering that it is not a hand engineered structure.<br />
<br />
[[File:comparisonimagenet.PNG | 500px|thumb|center|Table 3. Comparison against other classifiers on ImageNet]]<br />
<br />
A visualisation of the evolved hierarchical cell is shown below. The detailed visualisations of each motif can be seen in Appendix A of the paper. It can be noted that motif 4 directly links the input and output, and itself contains (among other operations) an identity mapping from input to output. Many other such 'skip connections' can be seen.<br />
<br />
[[File:WF_SecCont_03_hier_vis.png]]<br />
<br />
=Conclusion=<br />
<br />
A new evolutionary framework is introduced for searching neural network architectures over search spaces defined by flat and hierarchical representations of a convolutional cell, which uses smaller operations instead of larger ones as the building blocks. Experiments show that the proposed framework achieves competitive results against state-of-the-art classifiers on the CIFAR-10 and ImageNet datasets.<br />
<br />
Also, compared to contemporary RL-based architecture search approaches, the proposed approach is generally faster with comparable performance.<br />
<br />
=Critique=<br />
<br />
While the method introduced in this paper achieves a lower error in comparison to other evolutionary methods, it is not significantly better than those obtained by handcrafted design or reinforcement learning. A more in-depth analysis considering the number of parameters and required computational resources would be necessary to accurately compare the listed methods.<br />
<br />
The paper does not give sufficient reasons for why the authors chose these two specific search algorithms. More efficient search algorithms may be available, which could lead to better performance. Since the performance of the algorithm is not significantly better than that of previous handcrafted architectures, this is a possible technical improvement.<br />
<br />
In [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Hierarchical_Representations_for_Efficient_Architecture_Search#Architecture_evaluation_on_CIFAR-10_and_ImageNet section 4.3] it is not clear why the results for the four different cases that are reported for the hierarchical cells in Table 1 are not reported for the ones following a flat representation, considering that the flat cells showed a better performance during the evolutionary search. Recall that the four cases are: a cell with random architecture, the best-fitted cell from 200 random architectures, the best-fitted cell from 7000 random architectures, and the best-fitted cell after 7000 evolutionary steps.<br />
<br />
It seems contradictory that the flat cells, which clearly performed better than the hierarchical ones during the architecture search ([https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Hierarchical_Representations_for_Efficient_Architecture_Search#Architecture_search_on_CIFAR-10 section 4.2]), are not the ones scoring the lowest error when evaluated on the two large host models ([https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Hierarchical_Representations_for_Efficient_Architecture_Search#Architecture_evaluation_on_CIFAR-10_and_ImageNet section 4.3]).<br />
<br />
= References =<br />
<br />
# Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, Koray Kavukcuoglu, https://arxiv.org/abs/1711.00436.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Hierarchical_Representations_for_Efficient_Architecture_Search&diff=42141Hierarchical Representations for Efficient Architecture Search2018-11-30T22:46:33Z<p>Z43ma: </p>
<hr />
<div>Summary of the paper: [https://arxiv.org/abs/1711.00436 ''Hierarchical Representations for Efficient Architecture Search'']<br />
<br />
= Introduction =<br />
<br />
Deep Neural Networks (DNNs) have shown remarkable performance in several areas such as computer vision and natural language processing; however, improvements over previous benchmarks have required extensive research and experimentation by domain experts. In DNNs, the composition of linear and nonlinear functions produces internal representations of data which are in most cases better than handcrafted ones; consequently, researchers using Deep Learning techniques have lately shifted their focus from working on input features to designing optimal DNN architectures. However, the quest for finding an optimal DNN architecture by combining layers and modules requires frequent trial and error experiments, a task that resembles the previous work on looking for handcrafted optimal features. As researchers aim to solve more difficult challenges, the complexity of the resulting DNNs is also increasing; therefore, some studies are introducing automated techniques focused on searching for optimal architectures. The emerging field of Neural Architecture Search aims to tackle exactly this problem by transforming the design of a network into a search problem. A search problem needs a clear definition of three things: the search space, the search strategy, and the performance evaluation strategy. The search space is a high-level description of the architecture of the network. It needs to contain enough freedom that the resulting model has enough expressive power, but it cannot be so broad that the search process becomes too computationally expensive. The search strategy describes how to search the search space efficiently. The performance evaluation strategy is the method used to evaluate a network. Evaluation is the trickier part because a neural network must be trained before it can be evaluated, and training takes time. 
So it is important to define a proxy task that can help us better evaluate a network. This paper tackles these problems with a new hierarchical representation.<br />
<br />
Lately, the use of algorithms for finding optimal DNN architectures has attracted the attention of researchers who have tackled the problem through four main groups of techniques. The first such method employs a supplementary network called a “HyperNet”, which generates ideal network weights given a random architecture. There are two main parts to generating an “optimal” architecture. First, we train the HyperNet. One training cycle consists of generating a random architecture from a sample space of allowed architectures and generating its predicted weights with the HyperNet. Then, the validation score of this proposed network is calculated, and the error is used to backpropagate through the HyperNet. In this manner, the HyperNet can learn to assign robustly optimal initial weights to a given architecture. At “test” time, we generate a random sample of architectures and predict initialized weights for each with our tuned HyperNet. We take the model with the highest validation score and train it as we would a regular architecture. We use this heuristic of “initial validation error” because the relative performance of networks typically stays constant throughout training. That is, if a network starts off better, it will very likely end better. The second technique is Monte Carlo Tree Search (MCTS), which repeatedly narrows the search space by focusing on the most promising architectures previously seen. The third group of techniques uses evolutionary algorithms, where fitness criteria are applied to filter the initial population of DNN candidates; new individuals are then added to the population by selecting the best-performing ones and modifying them with one or several random mutations as in [https://arxiv.org/abs/1703.01041 [Real, 2017]]. 
The fourth and last group of techniques implements Reinforcement Learning, where a policy-based controller seeks to optimize the expected accuracy of new architectures based on rewards (accuracy) gained from previous proposals in the architecture space. Of these four groups of techniques, Reinforcement Learning has offered the best experimental results; however, the paper we are summarizing implements evolutionary algorithms as its main approach.<br />
<br />
Regardless of the technique used to look for an optimal architecture, searching in the architecture space usually requires the training and evaluation of many DNN candidates; therefore, it demands huge computational resources and poses a significant limitation for practical applications. Consequently, most techniques narrow the search space with predefined heuristics, either at the beginning or dynamically during the searching process. In the paper we are summarizing, the authors reduce the number of feasible architectures by forcing a hierarchical structure between network components. In other words, each DNN suggested as a candidate is formed by combining basic building blocks to form small modules, then the same basic structures introduced on the building blocks are used to combine and stack networks on the upper levels of the hierarchy. This approach allows the searching algorithm to sample highly complex and modularized networks similar to Inception or ResNet.<br />
<br />
Despite some weaknesses regarding the efficiency of evolutionary algorithms, this study reveals that these techniques can in fact generate architectures which show competitive performance when a narrowing strategy is imposed over the search space. Accordingly, the main contributions of this paper are a well-defined set of hierarchical representations, which act as the filtering criteria to pick DNN candidates, and a novel evolutionary algorithm which produces image classifiers that achieve state of the art performance among similar evolutionary-based techniques.<br />
<br />
=Architecture representations=<br />
<br />
==Flat architecture representation==<br />
All the evaluated network architectures are directed acyclic graphs with only one source and one sink. Each node in the network represents a feature map and consequently, each directed edge represents an operation that takes the feature map in the departing node as input and outputs a feature map on the arriving node. Under the previous assumption, any given architecture in the narrowed search space is formally expressed as a graph assembled by a series of operations (edges) among a defined set of adjacent feature maps (nodes).<br />
<br />
[[File:flatarch.PNG | 650px|thumb|center|Flat architecture representation of neural networks]]<br />
<br />
Multiple primitive operations defined in [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Hierarchical_Representations_for_Efficient_Architecture_Search#Primitive_operations section 2.3] are used to form small networks defined as ''motifs'' by the authors. To combine the outputs of multiple primitive operations and guarantee a unique output per motif the authors introduce a merge operation which in practice works as a depthwise concatenation that does not require inputs with the same number of channels.<br />
<br />
Accordingly, these motifs can also be combined to form more complex motifs on a higher level in the hierarchy until the network is complex enough to perform competitively in challenging classification tasks.<br />
<br />
==Hierarchical architecture representation==<br />
<br />
The composition of more complex motifs based on simpler motifs at lower levels allows the authors to create a hierarchy-like representation of very complex DNN starting with only a few primitive operations as shown in Figure 1. In other words, an architecture with <math> L </math> levels has only primitive operations at its bottom and only one complex motif at its top. Any motif in between the bottom and top levels can be defined as the composition of motifs in lower levels of the hierarchy.<br />
<br />
Formally, the <math>m</math>-th motif in level <math>l</math>, <math>o_m^{(l)}</math>, is recursively defined as the composition of lower-level motifs <math>\textbf{o}^{(l-1)}</math> according to its network structure.<br />
<br />
<center><math> o_m^{(l)}=assemble(G_m^{(l)}, \textbf{o}^{(l-1)})</math></center><br />
<br />
[[File:hierarchicalrep.PNG | 700px|thumb|center|Figure 1. Hierarchical architecture representation]]<br />
<br />
In figure 1, the architecture of the full model (its flat structure) is shown in the top right corner. The input (source) is the bottom-most node. The output (sink) is the topmost node. The paper presents an alternative hierarchical view of the model shown on the left-hand side (before the assemble function). This view represents the same model in three layers. The first layer is a set of primitive operations only (bottom row, middle column). In all other layers component motifs (computational graphs) G are described by an adjacency matrix and a set of operations. The set of operations are from the previous layer. An example motif <math> G^{(2)}_{1}</math> in the second layer is shown in the bottom row (left and middle columns). There are three unique motifs in the second layer. These are shown in the middle layer of the top row. Note that the motifs in the previous layer become the operations in the next layer. The higher layer can use these motifs multiple times. Finally, the top level graph, which contains only one motif, <math> G^{(3)}_{1}</math>, is shown in the top row left column. Here, there are 4 nodes with 6 operations defined between them.<br />
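<br />
The recursive definition above can be illustrated with a minimal Python sketch. It is not the authors' implementation: operations are represented by string-building functions so that the assembled structure can be printed, index 0 is assumed to encode the null operation, and the names <code>make_op</code>, <code>merge</code>, and the example matrices are hypothetical. The entry <code>G[i][j]</code> (with j smaller than i) is taken to be the operation on the edge from node j to node i.<br />

```python
# Minimal sketch of assemble(G, o) with symbolic operations.
# Assumptions (not from the paper's code): index 0 means "no edge", every
# non-source node has at least one incoming edge, and merge is shown as a
# symbolic concatenation.

def make_op(name):
    """Return a symbolic operation that wraps its input in `name(...)`."""
    return lambda x: f"{name}({x})"

def merge(inputs):
    """Combine the feature maps arriving at a node (depthwise concatenation)."""
    if len(inputs) == 1:
        return inputs[0]
    return "concat(" + ", ".join(inputs) + ")"

def assemble(G, ops):
    """Build a motif from adjacency matrix G and the operations of the level below."""
    def motif(x):
        nodes = [x]  # node 0 is the source
        for i in range(1, len(G)):
            incoming = [ops[G[i][j]](nodes[j]) for j in range(i) if G[i][j] != 0]
            nodes.append(merge(incoming))
        return nodes[-1]  # the last node is the sink
    return motif

# Level 1: primitive operations (a hypothetical subset).
primitives = {1: make_op("conv1x1"), 2: make_op("maxpool3x3"), 3: make_op("identity")}

# Level 2: two motifs built from primitives.
G2_1 = [[0, 0, 0],
        [1, 0, 0],
        [3, 2, 0]]
G2_2 = [[0, 0],
        [2, 0]]
level2 = {1: assemble(G2_1, primitives), 2: assemble(G2_2, primitives)}

# Level 3: a single motif whose operations are the level-2 motifs.
G3_1 = [[0, 0, 0],
        [1, 0, 0],
        [0, 2, 0]]
cell = assemble(G3_1, level2)

print(cell("x"))
```

Because the level-2 motifs are ordinary callables, they can be passed as the operation set of the level above, which mirrors how motifs in one layer become the operations of the next.<br />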
<br />
==Primitive operations==<br />
<br />
The six primitive operations used as building blocks for connecting nodes in either flat or hierarchical representations are:<br />
* 1 × 1 convolution of C channels<br />
* 3 × 3 depthwise convolution<br />
* 3 × 3 separable convolution of C channels<br />
* 3 × 3 max-pooling<br />
* 3 × 3 average-pooling<br />
* Identity mapping<br />
<br />
The authors argue that convolution operations involving larger receptive fields can be obtained by the composition of lower-level motifs with smaller receptive fields. Similarly, convolution operations with a larger number of channels can be generated by the depthwise concatenation of lower-level motifs. Batch normalization and the ''ReLU'' activation function are applied after each convolution in the network. A seventh operation, called null, is used in the adjacency matrix <math> G </math> to state explicitly that there is no operation between two nodes.<br />
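<br />
As a small worked example of the receptive-field argument, the sketch below computes the receptive field of several convolutions applied in sequence. It assumes stride 1 and no dilation, which is a simplification chosen for illustration; the function name is hypothetical.<br />

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1, undilated convolutions applied in sequence."""
    receptive_field = 1
    for k in kernel_sizes:
        # Each k x k layer extends the field by k - 1 pixels per spatial dimension.
        receptive_field += k - 1
    return receptive_field

print(stacked_receptive_field([3, 3]))     # two 3x3 convolutions cover a 5x5 window
print(stacked_receptive_field([3, 3, 3]))  # three 3x3 convolutions cover a 7x7 window
```

Under these assumptions, chaining 3 × 3 operations inside a motif reaches the same spatial extent as a single larger kernel.<br />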
<br />
<br />
Side note:<br />
<br />
Some explanations of the different types of convolution:<br />
<br />
* Spatial convolution: Convolutions performed in spatial dimensions - width and height.<br />
* Depthwise convolution: Spatial convolution performed independently over each channel of an input.<br />
* 1x1 convolution: Convolution with the kernel of size 1x1<br />
<br />
[[File:convolutions.png | 350px|thumb|center]]<br />
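<br />
To make the difference between these convolution types concrete, the following sketch counts their weights. It ignores bias terms and batch normalization parameters, treats a separable convolution as a depthwise convolution followed by a 1 × 1 convolution, and uses a hypothetical channel count of 16.<br />

```python
def conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_params(k, channels):
    """Weights of a k x k depthwise convolution: one k x k filter per channel."""
    return k * k * channels

def separable_params(k, c_in, c_out):
    """Depthwise k x k convolution followed by a 1 x 1 (pointwise) convolution."""
    return depthwise_params(k, c_in) + conv_params(1, c_in, c_out)

C = 16  # hypothetical number of channels
print("standard 3x3 :", conv_params(3, C, C))
print("depthwise 3x3:", depthwise_params(3, C))
print("separable 3x3:", separable_params(3, C, C))
print("1x1          :", conv_params(1, C, C))
```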
<br />
=Evolutionary architecture search=<br />
<br />
Before moving forward we introduce the concept of genotypes in the context of the article. In this article, a genotype is a particular neural network architecture defined according to the components described in [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Hierarchical_Representations_for_Efficient_Architecture_Search#Architecture_representations section 2]. To make the NN architectures ''evolve'', the authors implemented a three-stage process that consists of establishing the permitted mutations, creating an initial population, and making its members compete in a tournament where only the best candidates survive.<br />
<br />
==Mutation==<br />
<br />
One mutation over a specific architecture is a sequence of five steps in the following order:<br />
<br />
* Sample a level in the hierarchy other than the bottom (primitive) level.<br />
* Sample a motif in that level.<br />
* Sample a successor node <math>(i)</math> in the motif.<br />
* Sample a predecessor node <math>(j)</math> in the motif.<br />
* Replace the current operation between nodes <math>i</math> and <math>j</math> with one of the available operations.<br />
<br />
The original operation between the nodes <math>i</math> and <math>j</math> in the graph is defined as <math> [G_{m}^{\left ( l \right )}] _{ij} = k </math>. Therefore, a mutation between the same pair of nodes is defined as <math> [G_{m}^{\left ( l \right )}] _{ij} = {k}' </math>.<br />
<br />
The allowed mutations include:<br />
# Change the basic primitive between the predecessor and successor nodes (i.e. alter an existing edge): if <math>o_k^{(l-1)} \neq none</math> and <math>o_{k'}^{(l-1)} \neq none</math> and <math>o_{k'}^{(l-1)} \neq o_k^{(l-1)}</math><br />
# Add a connection between two previously unconnected nodes. The connection between the nodes can have any of the six possible primitives: if <math>o_k^{(l-1)}=none</math> and <math>o_{k'}^{(l-1)} \neq none</math><br />
# Remove a connection between existing nodes: if <math>o_k^{(l-1)} \neq none</math> and <math>o_{k'}^{(l-1)} = none</math><br />
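<br />
The five sampling steps and the three resulting mutation types can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the genotype is assumed to be a list of levels, each holding adjacency matrices whose entry <code>G[i][j]</code> is an operation index, with 0 standing for the null operation. The sketch also assumes the new operation always differs from the old one, and the helper <code>chain_motif</code> and the example parent are hypothetical.<br />

```python
import copy
import random

NONE = 0  # assumed index of the null operation (no edge)

def mutate(genotype, num_primitives, rng):
    """Return a mutated copy of `genotype`.

    genotype[l] lists the motifs of hierarchy level l + 2. Each motif is a
    square matrix G where G[i][j] (j < i) is the index of the operation on the
    edge from predecessor j to successor i.
    """
    child = copy.deepcopy(genotype)
    level = rng.randrange(len(child))   # 1. sample a non-primitive level
    motif = rng.choice(child[level])    # 2. sample a motif in that level
    i = rng.randrange(1, len(motif))    # 3. sample a successor node
    j = rng.randrange(i)                # 4. sample a predecessor node
    # Available operations are the motifs (or primitives) of the level below.
    num_ops = num_primitives if level == 0 else len(child[level - 1])
    old = motif[i][j]
    candidates = [k for k in range(num_ops + 1) if k != old]
    motif[i][j] = rng.choice(candidates)  # 5. replace the operation
    return child

def mutation_kind(old, new):
    """Classify a change of edge operation from `old` to `new` (old != new)."""
    if old == NONE:
        return "add"
    if new == NONE:
        return "remove"
    return "alter"

def chain_motif(num_nodes, op=1):
    """Hypothetical starting motif: node i-1 feeds node i through operation `op`."""
    return [[op if j == i - 1 else NONE for j in range(num_nodes)]
            for i in range(num_nodes)]

# Two four-node level-2 motifs and one five-node level-3 motif (illustrative sizes).
parent = [[chain_motif(4), chain_motif(4)], [chain_motif(5)]]
child = mutate(parent, num_primitives=6, rng=random.Random(0))
```

The parent is deep-copied so that it stays in the population unchanged, in line with the description that no genotype is ever removed.<br />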
<br />
==Initialization==<br />
<br />
An initial population is required to start the evolutionary algorithm; therefore, the authors introduced a trivial genotype (candidate solution, the hierarchical architecture of the model) composed only of identity mapping operations. Then a large number of random mutations was run over the ''trivial genotype'' to simulate a diversification process. The authors argue that this diversification process generates a representative population in the search space and at the same time prevents the use of any handcrafted NN structures. Surprisingly, some of these random architectures show a performance comparable to the performance achieved by the architectures found later during the evolutionary search algorithm.<br />
<br />
==Search algorithms==<br />
<br />
Tournament selection and random search are the two search algorithms used by the authors. <br />
<br />
=== Tournament Selection ===<br />
In one iteration of the tournament selection algorithm, 5% of the entire population is randomly selected, trained, and evaluated against a validation set. Then the best performing genotype is picked to go through the mutation process and put back into the population. No genotype is ever removed from the population, but the selection criteria guarantee that only the best performing models will be selected to ''evolve'' through the mutation process.<br />
<br />
We define the pseudocode for tournament selection as follows:<br />
<br />
1. Choose k (the tournament size) individuals from the population at random<br />
<br />
2. Choose the best individual from the tournament with probability p<br />
<br />
3. Choose the second best individual with probability p*(1-p)<br />
<br />
4. Choose the third best individual with probability p*((1-p)^2)<br />
<br />
5. Continue until the number of selected individuals equals the desired number.<br />
<br />
Tournament selection is often chosen over alternative genetic algorithms due to the following benefits: it is efficient to code, works on parallel architectures and allows the selection pressure to be easily adjusted.<br />
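<br />
The generic pseudocode above can be written as a short Python function. This is a sketch of textbook probabilistic tournament selection, not code from the paper; the function name and the fallback to the weakest contestant are assumptions. Walking down the ranked list and accepting each individual with probability p yields exactly the probabilities p, p(1-p), p(1-p)^2, and so on.<br />

```python
import random

def tournament_select(population, fitness, k, p, rng):
    """Select one individual from a tournament of size k."""
    contestants = rng.sample(population, k)                  # step 1
    ranked = sorted(contestants, key=fitness, reverse=True)  # best first
    for individual in ranked:
        # The i-th best (i = 0, 1, ...) is reached with probability (1 - p)**i
        # and then accepted with probability p.
        if rng.random() < p:
            return individual
    return ranked[-1]  # assumption: fall back to the weakest contestant

# Hypothetical usage: individuals are integers and fitness is their value.
population = list(range(20))
rng = random.Random(0)
selected = [tournament_select(population, lambda x: x, k=5, p=0.8, rng=rng)
            for _ in range(3)]
print(selected)
```

Setting p = 1 makes the choice deterministic, which matches the variant described earlier where the best genotype of a 5% sample is always the one picked for mutation.<br />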
<br />
=== Random Search ===<br />
In the random search algorithm every genotype from the initial population is trained and evaluated, then the best performing model is selected. In contrast to the tournament selection algorithm, the random search algorithm is much simpler and the training and evaluation process for every genotype can be run in parallel to reduce search time. This algorithm is not widely studied in literature yet.<br />
<br />
==Implementation==<br />
<br />
To implement the tournament selection algorithm, two auxiliary algorithms are introduced. The first is called the controller and directs the evolution process over the population. In other words, the controller repeatedly picks 5% of the genotypes from the current population, sends them to the tournament, and then applies a random mutation to the best performing genotype from each group. <br />
<br />
[[File:asyncevoalgorithm1.PNG | 700px|thumb|center|Controller]]<br />
<br />
The second auxiliary algorithm is called the worker and is in charge of training and evaluating each genotype, a task that must be completed each time a new genotype is created and added to the population either by an initialization step or by an evolutionary step.<br />
<br />
[[File:asyncevoalgorithm2.PNG | 700px|thumb|center|Worker]]<br />
<br />
Both auxiliary algorithms work together asynchronously and communicate with each other through a shared tabular memory file where genotypes and their corresponding fitness are recorded.<br />
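<br />
The interplay between controller, worker, and shared memory can be sketched in a simplified, single-process form. The real system is asynchronous and trains neural networks; in this sketch a dictionary stands in for the shared memory table, and the toy genotype (a tuple of integers), <code>toy_evaluate</code>, and <code>toy_mutate</code> are hypothetical placeholders for training and mutation.<br />

```python
import random

def run_search(initial_population, evaluate, mutate, num_steps, rng,
               sample_fraction=0.05):
    """Synchronous sketch of the controller loop with a fitness table."""
    memory = {}  # stands in for the shared table: genotype -> fitness

    def fitness(genotype):
        # Worker role: evaluate a genotype once and record the result.
        if genotype not in memory:
            memory[genotype] = evaluate(genotype)
        return memory[genotype]

    population = list(initial_population)
    for genotype in population:
        fitness(genotype)

    for _ in range(num_steps):
        # Controller role: sample 5% of the population, mutate the best one.
        k = max(1, int(sample_fraction * len(population)))
        contestants = rng.sample(population, k)
        best = max(contestants, key=fitness)
        child = mutate(best, rng)
        fitness(child)
        population.append(child)  # no genotype is ever removed
    return population, memory

def toy_evaluate(genotype):
    """Placeholder for training and validation: higher sums count as fitter."""
    return sum(genotype)

def toy_mutate(genotype, rng):
    """Placeholder mutation: resample one position of the tuple."""
    position = rng.randrange(len(genotype))
    mutated = list(genotype)
    mutated[position] = rng.randrange(7)
    return tuple(mutated)

rng = random.Random(0)
initial = [tuple(rng.randrange(7) for _ in range(6)) for _ in range(40)]
population, memory = run_search(initial, toy_evaluate, toy_mutate, 100, rng)
best = max(population, key=memory.get)
print(len(population), memory[best])
```

Looking up the table before evaluating is what prevents the same genotype from being trained twice when it is sampled again.<br />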
<br />
=Experiments and results=<br />
<br />
==Experimental setup==<br />
<br />
Instead of looking for a complete NN model, the search framework introduced in [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Hierarchical_Representations_for_Efficient_Architecture_Search#Architecture_representations section 2] is applied to look for the best performing architectures of a small neural network module called the convolutional cell. Using small modules as building blocks to form a larger and more complex model is an approach that has proved successful in previous cases such as the Inception architecture. Additionally, this approach allowed the authors to evaluate cell candidates efficiently and scale to larger and more complex models faster.<br />
<br />
In total three models were implemented as hosts for the experimental cells: the first two use the CIFAR-10 dataset and the third uses the ImageNet dataset. The search framework is implemented only in the first host model to look for the best performing cells ([https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Hierarchical_Representations_for_Efficient_Architecture_Search#Architecture_search_on_CIFAR-10 section 4.2]). Once found, these cells were inserted into the second and third host models to evaluate overall performance on the respective datasets ([https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Hierarchical_Representations_for_Efficient_Architecture_Search#Architecture_evaluation_on_CIFAR-10_and_ImageNet section 4.3]).<br />
<br />
The terms training time step, initialization time step, and evolutionary time step will be used to describe some parts of the experiments. Be aware that these three terms have different meanings; however, each term will be properly defined when introduced.<br />
<br />
==Architecture search on CIFAR-10==<br />
<br />
The overall goal in this stage is to find the best performing cells. The search framework is run using the small CIFAR-10 model depicted in Figure 2 as the host model for the cells; therefore, during the searching process, only the cells change while the rest of the host model’s structure remains the same. In the context of the evolutionary search algorithm, a cell is also called a candidate or a genotype. Additionally, on every time step during the search process, the three cells in the model share the same structure; consequently, every time a new candidate architecture is evaluated, the three cells simultaneously adopt the new candidate’s architecture.<br />
<br />
[[File:smallcifar10.PNG | 350px|thumb|center|Figure 2. Small CIFAR-10 model]]<br />
<br />
To begin the architecture searching process an initial population of genotypes is required. Random mutations are applied over a trivial genotype to generate a candidate and grow the seminal population. This is called an initialization step and is repeated 200 times to produce an equivalent number of candidates. Creating these 200 candidates with random structures is equivalent to running a random search over a constrained architecture space. <br />
<br />
Then, the evolutionary search algorithm takes over and runs from time step 201 up to time step 7000; these are called evolutionary time steps. On each evolutionary time step, a group of genotypes equivalent to 5% of the current population is selected randomly and sent to the tournament for fitness computation. To perform a fitness evaluation, each candidate cell is inserted into the three predefined positions within the small CIFAR-10 host model. Then, for each candidate cell, the host model is trained with stochastic gradient descent for 5000 training steps with a decreasing learning rate. Because a standard deviation of up to 0.2% was found when evaluating the exact same model, the overall fitness is obtained as the average of four training-evaluation runs. Finally, a random mutation is applied to a copy of the best cell within the group to create a new genotype that is added to the current population.<br />
<br />
The fitness of each evaluated genotype is recorded in the shared tabular memory file to avoid recalculation in case the same genotype is selected again in a future evolutionary time step.<br />
<br />
The search framework is run for 7000 time steps (200 initialization time steps followed by evolutionary time steps) for each one of three different types of cell architecture, namely hierarchical representation, flat representation, and flat representation with constrained parameters. <br />
<br />
* A cell that follows a hierarchical representation has NN connections at three different levels; at the bottom level it has only primitive operations, at the second level it contains motifs with four nodes, and at the third level it has only one motif with five nodes.<br />
<br />
* A cell that follows a flat representation has 11 nodes with only primitive operations between them. These cells look similar to level 2 motifs but instead of having four nodes they have 11 and therefore many more pairs of nodes and operations.<br />
<br />
* For a cell that follows a flat representation with constrained parameters, the total number of parameters used by its operations cannot exceed the total number of parameters used by the cells that follow a hierarchical representation.<br />
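<br />
To give a sense of how many more pairs of nodes a flat cell has, the sketch below counts the possible edges of a motif and the number of distinct adjacency matrices. It assumes edges only run from a lower-indexed node to a higher-indexed one and that each edge takes one of seven values (the six primitives or null). The count is of encodings, not of functionally distinct networks, since different matrices can describe equivalent architectures.<br />

```python
def num_edges(num_nodes):
    """Ordered node pairs (j, i) with j < i in a directed acyclic motif."""
    return num_nodes * (num_nodes - 1) // 2

def num_encodings(num_nodes, num_choices):
    """Number of adjacency matrices when each edge takes one of num_choices values."""
    return num_choices ** num_edges(num_nodes)

print(num_edges(4))   # four-node level-2 motif
print(num_edges(5))   # five-node level-3 motif
print(num_edges(11))  # eleven-node flat cell
print(num_encodings(11, 7))  # six primitives plus null on every edge of a flat cell
```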
<br />
Figure 3 shows the current fitness achieved by the best performing cell from each one of the three types of cells when plugged into the small CIFAR-10 model. Even though the fitness grows rapidly after the first 200 (initialization) time steps, it tends to plateau between 89% and 90%. Overall, cells that follow a flat representation without restriction in the number of parameters tend to perform better than those following a hierarchical structure. This could be because the flat representation allows more flexibility when adding connections between nodes, especially between distant ones. Unfortunately, the authors do not describe the architecture of the best performing flat cell.<br />
<br />
[[File:currentfitness.PNG | 300px|thumb|center|Figure 3. Current fitness]]<br />
<br />
Figure 4 presents the maximum fitness reached by any cell seen by the search framework for each one of the three types of cells. The fitness at time step 200 is, therefore, equivalent to the best model obtained by a random search over 200 architectures from each type of cell.<br />
<br />
[[File:maxfitness.PNG | 300px|thumb|center|Figure 4. Maximum fitness]]<br />
<br />
The total number of parameters used by each genotype at any given time step is shown in Figure 5. It suggests that flat representations tend to add more connections over time and most likely those connections correspond to convolutional operations which in turn require more parameters than other primitive operations.<br />
<br />
[[File:numparameters.PNG | 300px|thumb|center|Figure 5. Number of parameters]]<br />
<br />
To run each time step (either initialization or evolutionary) in the search framework, it takes one hour for a GPU to perform four training and evaluation rounds for every single candidate. Therefore, the authors used 200 GPUs simultaneously to complete 7000 time steps in 35 hours. Considering the three types of cell (hierarchical, flat, and parameter-constrained flat), roughly 21000 GPU-hours (3 × 7000) could be required to replicate the experiment.<br />
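<br />
The compute estimate can be reproduced with a few lines of arithmetic. The sketch assumes perfect parallel scaling across GPUs, which is an idealization; the function name is hypothetical and the figures are the ones quoted in the text.<br />

```python
def search_cost(time_steps, gpu_hours_per_step, num_gpus, num_cell_types):
    """Return (wall-clock hours per run, total GPU-hours over all cell types)."""
    gpu_hours_per_run = time_steps * gpu_hours_per_step
    wall_clock_hours = gpu_hours_per_run / num_gpus  # assumes perfect scaling
    total_gpu_hours = gpu_hours_per_run * num_cell_types
    return wall_clock_hours, total_gpu_hours

wall_clock, total = search_cost(time_steps=7000, gpu_hours_per_step=1,
                                num_gpus=200, num_cell_types=3)
print(wall_clock)  # 35.0 hours per search run
print(total)       # 21000 GPU-hours for the three cell types
```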
<br />
==Architecture evaluation on CIFAR-10 and ImageNet==<br />
<br />
Once the evolutionary search finds the best-fitted cells, they are plugged into the two larger host models to evaluate their performance in those more complex architectures. The first large model (Figure 6) is targeted at image classification on the CIFAR-10 dataset and the second model (Figure 7) is focused on image classification on the ImageNet dataset. Although all the parameters in these two larger host models are trained from scratch, including those within the cells, the cells’ architectures do not change since their structure was found to be optimal during the evolutionary search.<br />
<br />
The large CIFAR-10 model is trained with stochastic gradient descent for 80K training steps with a decreasing learning rate. To account for the non-negligible standard deviation found when evaluating the exact same model, the percentage of error is determined as the average of five training-evaluation runs.<br />
<br />
[[File:largecifar10.PNG | 500px|thumb|center|Figure 6. Large CIFAR-10 model]]<br />
<br />
The ImageNet model is trained with stochastic gradient descent for 200K training steps with a decreasing learning rate. For this model, neither standard deviation nor multiple training-evaluation runs were reported.<br />
<br />
[[File:imagenetmodel.PNG | 600px|thumb|center|Figure 7. ImageNet model]]<br />
<br />
In [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Hierarchical_Representations_for_Efficient_Architecture_Search#Architecture_search_on_CIFAR-10 section 4.2] three types of cells were described: hierarchical, flat, and parameter-constrained flat. For the hierarchical type of cells, the percentage of error in both large models is reported in Table 1 for four different cases: a cell with random architecture, the best-fitted cell from 200 random architectures, the best-fitted cell from 7000 random architectures, and the best-fitted cell after 7000 evolutionary steps. On the other hand, for the flat and parameter-constrained flat types of architecture, only some of the mentioned four cases are reported in Table 1.<br />
<br />
[[File:comparisoncells.PNG | 750px|thumb|center|Table 1. Comparison between types of cells and searching method]]<br />
<br />
According to the results in Table 1, for both large host models, the hierarchical cell found by the evolutionary search algorithm achieved the lowest errors with 3.75% in CIFAR-10, 20.3% top-1 error and 5.2% top-5 error in ImageNet. The errors reported in both datasets are calculated by using the trained large models on test sets of images never seen before during any of the previous stages. Even though the cell that follows a hierarchical representation achieved the lowest error, the ones showing the lowest standard deviations are those following a flat representation.<br />
<br />
The performance achieved by the large CIFAR-10 host model using the best cell is then compared against other classifiers in Table 2. As an additional improvement, the authors increased the number of channels in its first convolutional layer from 64 to 128. It is worth noting that this first convolutional layer is not part of the cell obtained during the evolutionary search process; instead, it is part of the original host model. The results are grouped into three categories depending on how the classifiers involved in the comparison were created, from top to bottom: handcrafted, reinforcement learning, and evolutionary algorithms.<br />
<br />
[[File:comparisonlargecifar10.PNG | 500px|thumb|center|Table 2. Comparison against other classifiers on CIFAR-10]]<br />
<br />
The classification error achieved by the ImageNet host model when using the best cell is also compared against some high performing image classifiers in the literature and the results are presented in Table 3. Although the classification error scored by the architecture introduced in this paper is not significantly lower than those obtained by state of the art classifiers, it shows outstanding results considering that it is not a hand engineered structure.<br />
<br />
[[File:comparisonimagenet.PNG | 500px|thumb|center|Table 3. Comparison against other classifiers on ImageNet]]<br />
<br />
A visualisation of the evolved hierarchical cell is shown below. The detailed visualisations of each motif can be seen in Appendix A of the paper. It can be noted that motif 4 directly links the input and output, and itself contains (among other operations) an identity mapping from input to output. Many other such 'skip connections' can be seen.<br />
<br />
[[File:WF_SecCont_03_hier_vis.png]]<br />
<br />
=Conclusion=<br />
<br />
A new evolutionary framework is introduced for searching neural network architectures over search spaces defined by flat and hierarchical representations of a convolutional cell, which uses smaller operations instead of larger ones as the building blocks. Experiments show that the proposed framework achieves competitive results against state of the art classifiers on the CIFAR-10 and ImageNet datasets.<br />
<br />
Also, compared to contemporary RL-based architecture search approaches, the proposed approach is generally faster with comparable performance.<br />
<br />
=Critique=<br />
<br />
While the method introduced in this paper achieves a lower error than other evolutionary methods, its error is not significantly lower than those obtained by handcrafted design or reinforcement learning. A more in-depth analysis considering the number of parameters and required computational resources would be necessary to accurately compare the listed methods.<br />
<br />
The paper does not sufficiently justify why the authors chose these two specific search algorithms. More efficient search methods may be available and could lead to better performance. Since the performance of the algorithm is not significantly better than that of previous handcrafted architectures, a better search algorithm is one possible technical improvement.<br />
<br />
In [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Hierarchical_Representations_for_Efficient_Architecture_Search#Architecture_evaluation_on_CIFAR-10_and_ImageNet section 4.3] it is not clear why the results for the four different cases that are reported for the hierarchical cells in Table 1 are not reported for the ones following a flat representation, considering that the flat cells showed a better performance during the evolutionary search. Recall that the four cases are: a cell with random architecture, the best-fitted cell from 200 random architectures, the best-fitted cell from 7000 random architectures, and the best-fitted cell after 7000 evolutionary steps.<br />
<br />
It seems contradictory that the flat cells, which clearly performed better than the hierarchical ones during the architecture search ([https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Hierarchical_Representations_for_Efficient_Architecture_Search#Architecture_search_on_CIFAR-10 section 4.2]), are not the ones scoring the lowest error when evaluated on the two large host models ([https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Hierarchical_Representations_for_Efficient_Architecture_Search#Architecture_evaluation_on_CIFAR-10_and_ImageNet section 4.3]).<br />
<br />
= References =<br />
<br />
# Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, Koray Kavukcuoglu, https://arxiv.org/abs/1711.00436.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Countering_Adversarial_Images_Using_Input_Transformations&diff=42139Countering Adversarial Images Using Input Transformations2018-11-30T22:41:50Z<p>Z43ma: </p>
<hr />
<div>The code for this paper is available here[https://github.com/facebookresearch/adversarial_image_defenses]<br />
<br />
==Motivation ==<br />
As the use of machine intelligence has increased, robustness has become a critical feature to guarantee the reliability of deployed machine-learning systems. However, recent research has shown that existing models are not robust to small, adversarially designed perturbations to the input. Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake. Adversarially perturbed examples have been deployed to attack image classification services (Liu et al., 2016)[11], speech recognition systems (Cisse et al., 2017a)[12], and robot vision (Melis et al., 2017)[13]. The existence of these adversarial examples has motivated proposals for approaches that increase the robustness of learning systems to such examples. In the example below (Goodfellow et al.) [17], a small perturbation is applied to the original image of a panda, changing the prediction to a gibbon.<br />
<br />
[[File:Panda.png|center]]<br />
<br />
==Introduction==<br />
The paper studies strategies that defend against adversarial example attacks on image classification systems by transforming the images before feeding them to a Convolutional Network Classifier. <br />
Generally, defenses against adversarial examples fall into two main categories:<br />
<br />
# Model-Specific – They enforce model properties such as smoothness and invariance via the learning algorithm. <br />
# Model-Agnostic – They try to remove adversarial perturbations from the input. <br />
<br />
Model-specific defense strategies make strong assumptions about expected adversarial attacks. As a result, they do not satisfy Kerckhoffs's principle, under which a defense should remain secure even when the adversary knows how it works: adversaries can circumvent model-specific defenses by simply changing how an attack is executed. This paper focuses on increasing the effectiveness of model-agnostic defense strategies. Specifically, the authors investigate the following image transformations as a means of protecting against adversarial images:<br />
<br />
# Image Cropping and Re-scaling (Graese et al, 2016). <br />
# Bit Depth Reduction (Xu et al, 2017) <br />
# JPEG Compression (Dziugaite et al, 2016) <br />
# Total Variance Minimization (Rudin et al, 1992) <br />
# Image Quilting (Efros & Freeman, 2001). <br />
<br />
These image transformations have been studied against adversarial attacks such as the fast gradient sign method (Goodfellow et al., 2015), its iterative extension (Kurakin et al., 2016a), DeepFool (Moosavi-Dezfooli et al., 2016), and the Carlini & Wagner (2017) <math>L_2</math> attack. <br />
<br />
The authors in this paper try to focus on increasing the effectiveness of model-agnostic defense strategies through approaches that:<br />
# remove the adversarial perturbations from input images,<br />
# maintain sufficient information in input images to correctly classify them,<br />
# and are still effective in situations where the adversary has information about the defense strategy being used.<br />
<br />
From their experiments, the strongest defenses are based on Total Variance Minimization and Image Quilting. These defenses are non-differentiable and inherently random, which makes it difficult for an adversary to get around them.<br />
<br />
==Previous Work==<br />
Recently, a lot of research has focused on countering adversarial threats. Wang et al. [4] proposed a new adversary-resistant technique that obstructs attackers from constructing impactful adversarial images by randomly nullifying features within images. Tramer et al. [2] presented the state-of-the-art Ensemble Adversarial Training method, which augments the training data not only with adversarial images constructed from the model being trained but also with adversarial images generated from an ensemble of other models. Their method, implemented on an Inception V2 classifier, finished 1st among 70 submissions in the NIPS 2017 competition on Defenses against Adversarial Attacks. Graese et al. [3] showed how input transformations such as shifting, blurring and noise can render the majority of adversarial examples non-adversarial. Xu et al. [5] demonstrated how feature squeezing methods, such as reducing the color bit depth of each pixel and spatial smoothing, defend against attacks. Dziugaite et al. [6] studied the effect of JPG compression on adversarial images. Chen et al. [7] introduced an advanced denoising algorithm with GAN-based noise modeling in order to improve blind denoising performance in low-level vision processing. The GAN is trained to estimate the noise distribution over the noisy input images and to generate noise samples. Although meant for image processing, this method can be generalized to target adversarial examples, where the unknown noise-generating algorithm can be leveraged.<br />
<br />
==Terminology==<br />
<br />
'''Gray Box Attack''' : Model Architecture and parameters are public<br />
<br />
'''Black Box Attack''': Adversary does not have access to the model.<br />
<br />
An interesting and important observation of adversarial examples is that they generally are not model or architecture specific. Adversarial examples generated for one neural network architecture will transfer very well to another architecture. In other words, if you wanted to trick a model you could create your own model and adversarial examples based off of it. Then these same adversarial examples will most probably trick the other model as well. This has huge implications as it means that it is possible to create adversarial examples for a completely black box model where we have no prior knowledge of the internal mechanics. [https://ml.berkeley.edu/blog/2018/01/10/adversarial-examples/ reference]<br />
<br />
'''Non Targeted Adversarial Attack''': The goal of the attack is to modify a source image in a way such that the image will be classified incorrectly by the network.<br />
<br />
The following example illustrates a non-targeted adversarial attack [https://ml.berkeley.edu/blog/2018/01/10/adversarial-examples/ reference]:<br />
[[File:non-targeted O.JPG| 600px|center]]<br />
<br />
'''Targeted Adversarial Attack''': The goal of the attack is to modify a source image in a way such that the image will be classified as a ''target'' class by the network.<br />
<br />
The following example illustrates a targeted adversarial attack [https://ml.berkeley.edu/blog/2018/01/10/adversarial-examples/ reference]:<br />
[[File:Targeted O.JPG| 600px|center]]<br />
<br />
'''Defense''': A defense is a strategy that aims to make the prediction on an adversarial example h(x') equal to the prediction on the corresponding clean example h(x).<br />
<br />
== Problem Definition ==<br />
The paper discusses non-targeted adversarial attacks for image recognition systems. Given image space <math>\mathcal{X} = [0,1]^{H \times W \times C}</math>, a source image <math>x \in \mathcal{X}</math>, and a classifier <math>h(\cdot)</math>, a non-targeted adversarial example of <math>x</math> is a perturbed image <math>x'</math> such that <math>h(x) \neq h(x')</math> and <math>d(x, x') \leq \rho</math> for some dissimilarity function <math>d(\cdot, \cdot)</math> and <math>\rho \geq 0</math>. Ideally, <math>d(\cdot, \cdot)</math> measures the perceptual difference between the original image <math>x</math> and the perturbed image <math>x'</math>, but in practice the Euclidean distance (<math>||x - x'||_2</math>) or the Chebyshev distance (<math>||x - x'||_{\infty}</math>) is usually used.<br />
<br />
From a set of N clean images <math>\{x_{1}, \ldots, x_{N}\}</math>, an adversarial attack aims to generate images <math>\{x'_{1}, \ldots, x'_{N}\}</math> such that <math>x'_{n}</math> is an adversarial example of <math>x_{n}</math>.<br />
<br />
The success rate of an attack is given as: <br />
<br />
<center><math><br />
\frac{1}{N}\sum_{n=1}^{N}I[h(x_n) \neq h(x'_n)],<br />
</math></center><br />
<br />
which is the proportion of predictions that were altered by the attack.<br />
<br />
The success rate is generally measured as a function of the magnitude of perturbations performed by the attack. In this paper, L2 perturbations are used and are quantified using the normalized L2-dissimilarity metric:<br />
<math> \frac{1}{N} \sum_{n=1}^N{\frac{\vert \vert x_n - x'_n \vert \vert_2}{\vert \vert x_n \vert \vert_2}} </math><br />
<br />
A strong adversarial attack has a high success rate while keeping its normalized L2-dissimilarity, given by the above equation, low.<br />
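<br />
The two quantities defined above can be sketched in a few lines of Python. This is only an illustration: images are represented as flat lists of pixel values, and the function names <code>success_rate</code> and <code>normalized_l2_dissimilarity</code> are hypothetical, not taken from the paper's code.<br />

```python
import math


def success_rate(clean_preds, adv_preds):
    """Fraction of predictions that the attack changed."""
    changed = sum(1 for c, a in zip(clean_preds, adv_preds) if c != a)
    return changed / len(clean_preds)


def l2_norm(v):
    return math.sqrt(sum(t * t for t in v))


def normalized_l2_dissimilarity(clean, adv):
    """Mean over images of ||x - x'||_2 / ||x||_2 (images are flat lists)."""
    ratios = []
    for x, x_adv in zip(clean, adv):
        diff = [a - b for a, b in zip(x, x_adv)]
        ratios.append(l2_norm(diff) / l2_norm(x))
    return sum(ratios) / len(ratios)


# Toy data: two "images" with two pixels each; only the first is perturbed.
clean_images = [[0.6, 0.8], [1.0, 0.0]]
adv_images = [[0.6, 0.7], [1.0, 0.0]]
rate = success_rate([1, 2, 3, 4], [1, 0, 3, 5])
dissimilarity = normalized_l2_dissimilarity(clean_images, adv_images)
```

In this toy example two of the four predictions change, and only the first image is perturbed, by 0.1 relative to an image of unit norm.<br />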
<br />
In most practical settings, an adversary does not have direct access to the model <math>h(\cdot)</math> and has to mount a black-box attack. <br />
<br />
However, prior work has shown successful attacks by transferring adversarial examples generated for a separately-trained model to an unknown target model (Liu et al., 2016), thus allowing efficient black-box attack. <br />
<br />
As a result, the authors investigate both the black-box setting and a more difficult gray-box setting, in which the adversary has access to the model architecture and the model parameters but is unaware of the defense strategy that is being used.<br />
<br />
A defense is an approach that aims to make the prediction on an adversarial example <math>h(x')</math> equal to the prediction on the corresponding clean example <math>h(x)</math>. In this study, the authors focus on image transformation defenses <math>g(x)</math> that perform prediction via <math>h(g(x'))</math>. Ideally, <math>g(\cdot)</math> is a complex, non-differentiable, and potentially stochastic function: this makes it difficult for an adversary to attack the prediction model <math>h(g(x))</math> even when the adversary knows both <math>h(\cdot)</math> and <math>g(\cdot)</math>.<br />
<br />
==Adversarial Attacks==<br />
<br />
Although the exact effect that adversarial examples have on the network is unknown, the Deep Learning book by Goodfellow et al. states that adversarial examples exploit the linearity of neural networks to perturb the cost function and force incorrect classifications. Images are often high resolution and thus have thousands of pixels (millions for HD images). An epsilon-ball perturbation when the dimensionality is in the thousands or millions greatly affects the cost function (especially if it increases the loss at every pixel). Hence, although methods such as FGSM and iterative FGSM are very straightforward, they greatly influence the network under a white-box attack. <br />
<br />
The paper studies the following four attacks in its experiments:<br />
<br />
1. '''Fast Gradient Sign Method (FGSM; Goodfellow et al. (2015)) [17]''': Given a source input <math>x</math> with true label <math>y</math>, let <math>l(\cdot,\cdot)</math> be the differentiable loss function used to train the classifier <math>h(\cdot)</math>. Then the corresponding adversarial example is given by:<br />
<br />
<math>x' = x + \epsilon \cdot sign(\nabla_x l(x, y))</math><br />
<br />
for some <math>\epsilon \gt 0</math> which controls the perturbation magnitude.<br />
<br />
2. '''Iterative FGSM (I-FGSM; Kurakin et al. (2016b)) [14]''': iteratively applies the FGSM update, where M is the number of iterations. It is given as:<br />
<br />
<math>x^{(m)} = x^{(m-1)} + \epsilon \cdot sign(\nabla_{x^{(m-1)}} l(x^{(m-1)}, y))</math><br />
<br />
where <math>m = 1,...,M; x^{(0)} = x;</math> and <math>x' = x^{(M)}</math>. M is set such that <math>h(x) \neq h(x')</math>.<br />
<br />
Both FGSM and I-FGSM work by minimizing the Chebyshev distance between the inputs and the generated adversarial examples.<br />
<br />
3. '''DeepFool (Moosavi-Dezfooli et al., 2016) [15]''': projects x onto a linearization of the decision boundary defined by binary classifier h(.) for M iterations. This can be particularly effective when a network uses ReLU activation functions. It is given as:<br />
<br />
[[File:DeepFool.PNG|400px |]]<br />
<br />
4. '''Carlini-Wagner's L2 attack (CW-L2; Carlini & Wagner (2017)) [16]''': an optimization-based attack that combines a differentiable surrogate for the model’s classification accuracy with an L2-penalty term which encourages the adversarial image to be close to the original image. Let <math>Z(x)</math> be the operation that computes the logit vector (i.e., the output before the softmax layer) for an input <math>x</math>, and <math>Z(x)_k</math> be the logit value corresponding to class <math>k</math>. The untargeted variant of CW-L2 finds a solution to the following unconstrained optimization problem:<br />
<br />
[[File:Carlini.PNG|500px |]]<br />
<br />
As mentioned earlier, the first two attacks minimize the Chebyshev distance whereas the last two attacks minimize the Euclidean distance between the inputs and the adversarial examples.<br />
<br />
All the methods described above maintain <math>x' \in \mathcal{X}</math> by performing value clipping. <br />
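<br />
The FGSM and I-FGSM updates, together with the value clipping mentioned above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the classifier is a hypothetical two-class linear model with weights <code>w</code> and logistic loss, labels are in {-1, +1}, and all function names are made up for the example.<br />

```python
import math


def sign(t):
    return (t > 0) - (t < 0)


def clip01(t):
    """Value clipping that keeps a pixel inside the image space [0, 1]."""
    return min(1.0, max(0.0, t))


def loss_gradient(w, x, y):
    """Gradient w.r.t. x of the logistic loss -log(sigmoid(y * w.x))."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    coeff = -y / (1.0 + math.exp(margin))  # equals -y * sigmoid(-margin)
    return [coeff * wi for wi in w]


def fgsm(w, x, y, eps):
    """One signed-gradient step of size eps, followed by clipping."""
    grad = loss_gradient(w, x, y)
    return [clip01(xi + eps * sign(gi)) for xi, gi in zip(x, grad)]


def i_fgsm(w, x, y, eps, num_iters):
    """Apply the FGSM update num_iters times, starting from x."""
    x_adv = list(x)
    for _ in range(num_iters):
        x_adv = fgsm(w, x_adv, y, eps)
    return x_adv


def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1


# Toy example: a two-pixel "image" that the linear model classifies as +1.
w = [1.0, -1.0]
x = [0.55, 0.45]
x_fgsm = fgsm(w, x, 1, 0.1)
x_ifgsm = i_fgsm(w, x, 1, 0.01, 10)
x_clipped = i_fgsm(w, [0.05, 0.95], 1, 0.02, 10)
```

In this toy setting both attacks flip the prediction while moving each pixel by at most 0.1, which is the Chebyshev-distance budget. The last call shows clipping keeping the result inside [0, 1].<br />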
<br />
The figure below shows adversarial images and corresponding perturbations at five levels of normalized L2-dissimilarity for all four attacks mentioned above.<br />
<br />
[[File:Strength.PNG|thumb|center| 600px |Figure 1: Adversarial images and corresponding perturbations at five levels of normalized L2- dissimilarity for all four attacks.]]<br />
<br />
==Defenses==<br />
A defense is a strategy that aims to make the prediction on an adversarial example equal to the prediction on the corresponding clean example. The particular structure of the adversarial perturbations <math> x-x' </math> is shown in Figure 1.<br />
Five image transformations that alter the structure of these perturbations have been studied:<br />
# Image Cropping and Re-scaling, <br />
# Bit Depth Reduction, <br />
# JPEG Compression, <br />
# Total Variance Minimization, <br />
# Image Quilting.<br />
<br />
'''Image cropping and Rescaling''' has the effect of altering the spatial positioning of the adversarial perturbation. In this study, images are cropped and re-scaled during training time as part of data augmentation. At test time, the predictions over random image crops are averaged.<br />
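<br />
The test-time averaging over random crops can be sketched as below. This is an illustrative sketch only: the image is a single-channel 2D list, the rescaling step is omitted, and <code>toy_classifier</code> is a hypothetical stand-in for a real network that returns class probabilities.<br />

```python
import random


def random_crop(image, size, rng):
    """Cut a size x size window at a uniformly random position."""
    h, w = len(image), len(image[0])
    top = rng.randrange(h - size + 1)
    left = rng.randrange(w - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]


def crop_ensemble_predict(image, classifier, size, num_crops, rng):
    """Average the class probabilities over num_crops random crops."""
    totals = None
    for _ in range(num_crops):
        probs = classifier(random_crop(image, size, rng))
        if totals is None:
            totals = list(probs)
        else:
            totals = [t + p for t, p in zip(totals, probs)]
    return [t / num_crops for t in totals]


def toy_classifier(crop):
    """Hypothetical classifier: probability of class 1 is the mean brightness."""
    values = [v for row in crop for v in row]
    bright = sum(values) / len(values)
    return [1.0 - bright, bright]


rng = random.Random(0)
image = [[0.8] * 4 for _ in range(4)]
probs = crop_ensemble_predict(image, toy_classifier, size=2, num_crops=30, rng=rng)
```

Because each crop lands at a different position, a perturbation tuned to specific pixel locations contributes differently to each prediction in the average.<br />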
<br />
'''Bit Depth Reduction (Xu et al.) [5]''' performs a simple type of quantization that can remove small (adversarial) variations in pixel values from an image. Images are reduced to 3 bits in the experiment.<br />
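<br />
A minimal sketch of bit depth reduction, assuming pixel values in [0, 1] and uniform quantization to <math>2^b</math> levels (the exact quantization scheme used in the paper's code is not stated here, and the function name is hypothetical):<br />

```python
def reduce_bit_depth(pixels, bits=3):
    """Quantize pixel values in [0, 1] to 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return [round(p * levels) / levels for p in pixels]


# Two pixel values that differ by a small perturbation map to the same level.
clean = reduce_bit_depth([0.40])
perturbed = reduce_bit_depth([0.42])
```

With 3 bits there are only eight possible output values, so perturbations smaller than the spacing between levels are often rounded away.<br />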
<br />
'''JPEG Compression and Decompression (Dziugaite et al., 2016)''' removes small perturbations by performing simple quantization. The authors use a quality level of 75/100 in their experiments.<br />
<br />
'''Total Variance Minimization (Rudin et al.) [9]''' :<br />
This combines pixel dropout with total variance minimization. This approach randomly selects a small set of pixels and reconstructs the “simplest” image that is consistent with the selected pixels. The reconstructed image does not contain the adversarial perturbations because these perturbations tend to be small and localized. Specifically, we first select a random set of pixels by sampling a Bernoulli random variable <math>X(i, j, k)</math> for each pixel location <math>(i, j, k)</math>; we maintain a pixel when <math>X(i, j, k) = 1</math>. Next, we use total variation minimization to construct an image <math>z</math> that is similar to the (perturbed) input image <math>x</math> for the selected set of pixels, whilst also being “simple” in terms of total variation, by solving:<br />
<br />
[[File:TV!.png|300px|]] , <br />
<br />
where <math>TV_{p}(z)</math> represents <math>L_{p}</math> total variation of '''z''' :<br />
<br />
[[File:TV2.png|500px|]]<br />
<br />
The total variation (TV) measures the amount of fine-scale variation in the image z, as a result of which TV minimization encourages removal of small (adversarial) perturbations in the image. The objective function is convex in <math>z</math>, which makes solving for z straightforward. In the paper, p = 2, and a special-purpose solver based on the split Bregman method (Goldstein & Osher, 2009) is employed to perform total variance minimization efficiently.<br />
The effectiveness of TV minimization is illustrated by the images in the middle column of the figure below: in particular, note that the adversarial perturbations that were present in the background of the non-transformed image (see bottom-left image) have nearly completely disappeared in the TV-minimized adversarial image (bottom-center image). As expected, TV minimization also changes image structure in non-homogeneous regions of the image, but as these perturbations were not adversarially designed we expect the negative effect of these changes to be limited.<br />
<br />
[[File:tvx.png]]<br />
<br />
The figure above illustrates total variance minimization and image quilting applied to an original and an adversarial image (produced using I-FGSM with ε = 0.03, corresponding to a normalized L2-dissimilarity of 0.075). From left to right, the columns correspond to (1) no transformation, (2) total variance minimization, and (3) image quilting. From top to bottom, rows correspond to: (1) the original image, (2) the corresponding adversarial image produced by I-FGSM, and (3) the absolute difference between the two images above. Difference images were multiplied by a constant scaling factor to increase visibility.<br />
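<br />
The ingredients of the total variance minimization defense can be sketched as follows. The code only evaluates an objective; it does not implement the split Bregman solver. Both the total variation formula (sum of <math>L_p</math> norms of differences between adjacent rows and between adjacent columns) and the objective (masked reconstruction error plus <math>\lambda_{TV}</math> times the total variation) are assumptions about the form of the equations shown in the images above, and all names are hypothetical. Images are single-channel 2D lists.<br />

```python
import math
import random


def total_variation(z, p=2):
    """Sum of L_p norms of differences between adjacent rows and columns."""
    h, w = len(z), len(z[0])

    def lp(v):
        return sum(abs(t) ** p for t in v) ** (1.0 / p)

    rows = sum(lp([z[i][j] - z[i - 1][j] for j in range(w)]) for i in range(1, h))
    cols = sum(lp([z[i][j] - z[i][j - 1] for i in range(h)]) for j in range(1, w))
    return rows + cols


def sample_mask(h, w, drop_prob, rng):
    """Bernoulli mask: 1 means the pixel is kept, 0 means it is dropped."""
    return [[0 if rng.random() < drop_prob else 1 for _ in range(w)] for _ in range(h)]


def tvm_objective(z, x, mask, lam):
    """Reconstruction error on the kept pixels plus lam times TV(z)."""
    h, w = len(x), len(x[0])
    recon = math.sqrt(sum(mask[i][j] * (z[i][j] - x[i][j]) ** 2
                          for i in range(h) for j in range(w)))
    return recon + lam * total_variation(z)


# A flat image with one perturbed pixel, and a mask that drops that pixel.
x = [[0.5, 0.5], [0.5, 0.6]]
flat = [[0.5, 0.5], [0.5, 0.5]]
mask = [[1, 1], [1, 0]]
obj_flat = tvm_objective(flat, x, mask, 0.03)
obj_same = tvm_objective(x, x, mask, 0.03)
random_mask = sample_mask(4, 4, 0.5, random.Random(0))
```

When the perturbed pixel happens to be dropped, the flat image has a lower objective than the perturbed input, which is the mechanism by which small localized perturbations are removed.<br />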
<br />
<br />
'''Image Quilting (Efros & Freeman, 2001) [8]'''<br />
Image Quilting is a non-parametric technique that synthesizes images by piecing together small patches that are taken from a database of image patches. The algorithm places appropriate patches from the database at a predefined set of grid points and computes minimum graph cuts in all overlapping boundary regions to remove edge artifacts. Image Quilting can be used to remove adversarial perturbations by constructing a patch database that only contains patches from "clean" images (without adversarial perturbations); the patches used to create the synthesized image are selected by finding the K nearest neighbors (in pixel space) of the corresponding patch from the adversarial image in the patch database, and picking one of these neighbors uniformly at random. The motivation for this defense is that the resulting image only contains pixels that were not modified by the adversary: the database of real patches is unlikely to contain the structures that appear in adversarial images.<br />
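<br />
The patch-selection step described above can be sketched as follows. This is a simplified illustration: patches are flat lists of pixel values, the minimum graph cut used to blend overlapping patches is omitted, and the function names are hypothetical.<br />

```python
import random


def squared_distance(a, b):
    return sum((s - t) ** 2 for s, t in zip(a, b))


def pick_clean_patch(query, database, k, rng):
    """Return one of the k database patches nearest to query, chosen uniformly."""
    nearest = sorted(database, key=lambda patch: squared_distance(query, patch))[:k]
    return rng.choice(nearest)


# Database of patches taken from clean images only.
database = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 0.5, 0.5, 0.5],
]
adversarial_patch = [0.9, 1.0, 0.95, 1.0]
rng = random.Random(0)
replacement = pick_clean_patch(adversarial_patch, database, k=1, rng=rng)
random_replacement = pick_clean_patch(adversarial_patch, database, k=2, rng=rng)
```

With <code>k=1</code> the nearest clean patch is always returned; with larger <code>k</code> the choice among the nearest patches is random, which is the source of the stochasticity of this defense.<br />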
<br />
=Experiments=<br />
<br />
Five experiments were performed to test the efficacy of the defenses. The first four consider gray-box and black-box attacks. In the gray-box setting, the adversary can read the model architecture and parameters but does not know the defense strategy; the defenses are applied to the adversarial images before they are fed to the convolutional network. In the black-box setting, the convolutional network is replaced by a network trained on transformed images, which the adversary cannot access. The final experiment compares the authors' defenses with prior work. <br />
<br />
'''Set up:'''<br />
Experiments are performed on the ImageNet image classification dataset. The dataset comprises 1.2 million training images and 50,000 test images that correspond to one of 1000 classes. The adversarial images are produced by attacking a ResNet-50 model with the attacks described in the Adversarial Attacks section. The strength of an adversary is measured in terms of its normalized L2-dissimilarity. To produce the adversarial images, the L2-dissimilarity for each attack was controlled as follows:<br />
<br />
- FGSM. Increasing the step size <math>\epsilon</math>, increases the normalized L2-dissimilarity.<br />
<br />
- I-FGSM. We fix M=10, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.<br />
<br />
- DeepFool. We fix M=5, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.<br />
<br />
- CW-L2. We fix <math>k</math>=0 and <math>\lambda_{f}</math> =10, and multiply the resulting perturbation <br />
<br />
The hyperparameters of the defenses are fixed in all the experiments. Specifically, the pixel dropout probability is set to <math>p</math>=0.5 and the regularization parameter of the total variation minimizer to <math>\lambda_{TV}</math>=0.03.<br />
<br />
The figure below shows the differences between the setups of the experiments that follow. The network is either trained on a) regular images or b) transformed images. The different settings are marked by 8.1, 8.2 and 8.3 <br />
[[File:models3.png |center]] <br />
<br />
==GrayBox - Image Transformation at Test Time== <br />
This experiment applies a transformation to adversarial images at test time before feeding them to a ResNet-50 that was trained to classify clean images. The figure below shows the top-1 accuracy for the five transformations. Some interesting observations from the plot: all of the image transformations partly eliminate the effects of the attacks; the crop ensemble, with an ensemble size of 30, gives the best accuracy, around 40-60 percent; and the accuracy of the image quilting defense hardly deteriorates as the strength of the adversary increases, although it does reduce accuracy on non-adversarial examples.<br />
<br />
[[File:sFig4.png|center|600px |]]<br />
<br />
==BlackBox - Image Transformation at Training and Test Time==<br />
A ResNet-50 model was trained on transformed ImageNet training images. Before the images were fed to the network for training, standard data augmentation (from He et al.) together with bit depth reduction, JPEG compression, TV minimization, or image quilting was applied to them. The classification accuracy on the same adversarial images as in the previous experiment is shown in the figure below. (The adversary cannot use this trained model to generate new adversarial images, so this is treated as a black-box setting.) The figure shows that training convolutional neural networks on images that are transformed in the same way as at test time dramatically improves the effectiveness of all transformation defenses. Nearly 80-90% of the attacks are defended successfully, even when the L2-dissimilarity is high.<br />
<br />
<br />
[[File:sFig5.png|center|600px |]]<br />
<br />
<br />
==Blackbox - Ensembling==<br />
Four networks (ResNet-50, ResNet-101, DenseNet-169, and Inception-v4) along with an ensemble of defenses were studied, as shown in Table 1. The adversarial images are produced by attacking a ResNet-50 model. The results in the table show that Inception-v4 performs best, possibly because that network has a higher accuracy even in non-adversarial settings. The best ensemble of defenses achieves an accuracy of about 71% against all the other attacks. The attacks deteriorate the accuracy of the best defenses (a combination of cropping, TVM, image quilting, and model transfer) by at most 6%. Gains of 1-2% in classification accuracy could be found from ensembling different defenses, while gains of 2-3% were found from transferring attacks to different network architectures.<br />
<br />
<br />
[[File:sTab1.png|600px|thumb|center|Table 1. Top-1 classification accuracy of ensemble and model transfer defenses (columns) against four black-box attacks (rows). The four networks we use to classify images are ResNet-50 (RN50), ResNet-101 (RN101), DenseNet-169 (DN169), and Inception-v4 (Iv4). Adversarial images are generated by running attacks against the ResNet-50 model, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. Higher is better. The best defense against each attack is typeset in boldface.]]<br />
<br />
==GrayBox - Image Transformation at Training and Test Time ==<br />
In this experiment, the adversary has access to the network and its parameters (but not to the input transformations applied at test time). Novel adversarial images were generated by the four attack methods against the networks trained in the earlier experiment (BlackBox - Image Transformation at Training and Test Time). The results show that bit depth reduction and JPEG compression are weak defenses in such a gray-box setting. In contrast, image cropping and rescaling, total variation minimization, and image quilting are more robust against adversarial images in this setting.<br />
The results for this experiment are shown in the figure below. Networks using these defenses classify up to 50% of images correctly.<br />
<br />
[[File:sFig6.png|center| 600px |]]<br />
<br />
==Comparison With Ensemble Adversarial Training==<br />
The results of the experiment are compared with the state-of-the-art ensemble adversarial training approach proposed by Tramer et al. [2]. Ensemble training fits the parameters of a convolutional neural network on adversarial examples that were generated to attack an ensemble of pre-trained models. The model released by Tramer et al. [2] is an Inception-ResNet-v2 trained on adversarial examples generated by FGSM against Inception-ResNet-v2 and Inception-v3 models. The authors compared it with their ResNet-50 models using image cropping, total variance minimization and image quilting defenses. Two differences in assumptions should be noted: the authors' defenses assume that the input transformation is unknown to the adversary, and they use no prior knowledge of the attacks. The results of ensemble training and the pre-processing techniques mentioned in this paper are shown in Table 2. The results show that ensemble adversarial training works better on FGSM attacks (which it uses at training time), but is outperformed by each of the transformation-based defenses on all other attacks.<br />
<br />
<br />
<br />
[[File:sTab2.png|600px|thumb|center|Table 2. Top-1 classification accuracy on images perturbed using attacks against ResNet-50 models trained on input-transformed images and an Inception-v4 model trained using ensemble adversarial training. Adversarial images are generated by running attacks against the models, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. The best defense against each attack is typeset in boldface.]]<br />
<br />
=Discussion/Conclusions=<br />
The paper proposed reasonable approaches to countering adversarial images. The authors evaluated Total Variance Minimization and Image Quilting and compared them with previously proposed ideas such as Image Cropping and Rescaling, Bit Depth Reduction, and JPEG Compression and Decompression on the challenging ImageNet dataset.<br />
Previous work by Wang et al. [10] shows that a strong input defense should be non-differentiable and randomized. Two of the defenses, namely Total Variation Minimization and Image Quilting, possess both properties.<br />
<br />
Image quilting involves a discrete variable that selects a patch from the database, which is a non-differentiable operation. Additionally, total variation minimization randomly selects the pixels it uses to measure reconstruction error when creating the de-noised image, and image quilting selects one of the K nearest neighbors uniformly at random. This inherent randomness makes it difficult to attack the model. <br />
<br />
As future work, the authors suggest applying the same techniques to other domains such as speech recognition and image segmentation. For example, in speech recognition, total variance minimization can be used to remove perturbations from waveforms, and "spectrogram quilting" techniques that reconstruct a spectrogram could be developed. The proposed input-transformation defenses can also be combined with ensemble adversarial training by Tramèr et al. [2] to study new attack methods.<br />
<br />
=Critiques=<br />
1. The terms black-box, white-box, and gray-box attack are not precisely defined.<br />
<br />
2. White-box attacks, where the adversary has full access to the model as well as the pre-processing techniques, could have been considered.<br />
<br />
3. Although the authors did considerable work in showing the effect of four attacks on the ImageNet dataset, much stronger attacks (Madry et al.) [7] could have been evaluated.<br />
<br />
4. The authors claim that the success rate is generally measured as a function of the magnitude of the perturbations performed by the attack, using the L2-dissimilarity, but the claim is not supported by any references. None of the previous work has used these metrics.<br />
<br />
5. ([https://openreview.net/forum?id=SyJ7ClWCb])In the new draft of the paper, the authors add the sentence "our defenses assume that part of the defense strategy (viz., the input transformation) is unknown to the adversary".<br />
<br />
This is a completely unreasonable assumption. Any algorithm which hopes to be secure must allow the adversary to, at the very least, understand what the defense is that's being used. Consider a world where the defense here is implemented in practice: any attacker in the world could just go look up the paper, read the description of the algorithm, and know how it works.<br />
<br />
=References=<br />
<br />
1. Chuan Guo, Mayank Rana, Moustapha Cissé, and Laurens van der Maaten. Countering Adversarial Images Using Input Transformations.<br />
<br />
2. Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, Patrick McDaniel, Ensemble Adversarial Training: Attacks and defenses.<br />
<br />
3. Abigail Graese, Andras Rozsa, and Terrance E. Boult. Assessing threat of adversarial examples of deep neural networks. CoRR, abs/1610.04256, 2016. <br />
<br />
4. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Adversary resistant deep neural networks with an application to malware detection. CoRR, abs/1610.01239, 2016a.<br />
<br />
5. Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. CoRR, abs/1704.01155, 2017. <br />
<br />
6. Gintare Karolina Dziugaite, Zoubin Ghahramani, and Daniel Roy. A study of the effect of JPG compression on adversarial images. CoRR, abs/1608.00853, 2016.<br />
<br />
7. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv:1706.06083v3.<br />
<br />
8. Alexei Efros and William Freeman. Image quilting for texture synthesis and transfer. In Proc. SIGGRAPH, pp. 341–346, 2001.<br />
<br />
9. Leonid Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60:259–268, 1992.<br />
<br />
10. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Learning adversary-resistant deep neural networks. CoRR, abs/1612.01401, 2016b.<br />
<br />
11. Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. CoRR, abs/1611.02770, 2016.<br />
<br />
12. Moustapha Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet. Houdini: Fooling deep structured prediction models. CoRR, abs/1707.05373, 2017 <br />
<br />
13. Marco Melis, Ambra Demontis, Battista Biggio, Gavin Brown, Giorgio Fumera, and Fabio Roli. Is deep learning safe for robot vision? adversarial examples against the icub humanoid. CoRR,abs/1708.06939, 2017.<br />
<br />
14. Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016b.<br />
<br />
15. Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: A simple and accurate method to fool deep neural networks. In Proc. CVPR, pp. 2574–2582, 2016.<br />
<br />
16. Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pp. 39–57, 2017.<br />
<br />
17. Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In Proc. ICLR, 2015.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Countering_Adversarial_Images_Using_Input_Transformations&diff=42138Countering Adversarial Images Using Input Transformations2018-11-30T22:41:07Z<p>Z43ma: </p>
<hr />
<div>The code for this paper is available here[https://github.com/facebookresearch/adversarial_image_defenses]<br />
<br />
==Motivation ==<br />
As the use of machine intelligence has increased, robustness has become a critical feature to guarantee the reliability of deployed machine-learning systems. However, recent research has shown that existing models are not robust to small, adversarially designed perturbations to the input. Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake. Adversarially perturbed examples have been deployed to attack image classification services (Liu et al., 2016)[11], speech recognition systems (Cisse et al., 2017a)[12], and robot vision (Melis et al., 2017)[13]. The existence of these adversarial examples has motivated proposals for approaches that increase the robustness of learning systems to such examples. In the example below (Goodfellow et al., 2015) [17], a small perturbation is applied to the original image of a panda, changing the prediction to a gibbon.<br />
<br />
[[File:Panda.png|center]]<br />
<br />
==Introduction==<br />
The paper studies strategies that defend against adversarial example attacks on image classification systems by transforming the images before feeding them to a Convolutional Network Classifier. <br />
Generally, defenses against adversarial examples fall into two main categories:<br />
<br />
# Model-Specific – They enforce model properties such as smoothness and invariance via the learning algorithm. <br />
# Model-Agnostic – They try to remove adversarial perturbations from the input. <br />
<br />
Model-specific defense strategies make strong assumptions about expected adversarial attacks. As a result, they violate Kerckhoffs's principle, which holds that a system should remain secure even when the adversary knows how it works: adversaries can circumvent model-specific defenses by simply changing how an attack is executed. This paper focuses on increasing the effectiveness of model-agnostic defense strategies. Specifically, the authors investigate the following image transformations as a means of protecting against adversarial images:<br />
<br />
# Image Cropping and Re-scaling (Graese et al, 2016). <br />
# Bit Depth Reduction (Xu et al, 2017) <br />
# JPEG Compression (Dziugaite et al, 2016) <br />
# Total Variance Minimization (Rudin et al, 1992) <br />
# Image Quilting (Efros & Freeman, 2001). <br />
<br />
These image transformations have been studied against adversarial attacks such as the fast gradient sign method (Goodfellow et al., 2015), its iterative extension (Kurakin et al., 2016a), DeepFool (Moosavi-Dezfooli et al., 2016), and the Carlini & Wagner (2017) <math>L_2</math> attack. <br />
<br />
The authors aim to increase the effectiveness of model-agnostic defense strategies through approaches that:<br />
# remove the adversarial perturbations from input images,<br />
# maintain sufficient information in input images to correctly classify them,<br />
# and are still effective in situations where the adversary has information about the defense strategy being used.<br />
<br />
In their experiments, the strongest defenses are based on Total Variance Minimization and Image Quilting. These defenses are non-differentiable and inherently random, which makes it difficult for an adversary to get around them.<br />
<br />
==Previous Work==<br />
Recently, a lot of research has focused on countering adversarial threats. Wang et al. [4] proposed a new adversary-resistant technique that obstructs attackers from constructing impactful adversarial images by randomly nullifying features within images. Tramèr et al. [2] presented the state-of-the-art Ensemble Adversarial Training method, which augments the training data not only with adversarial images constructed from the model being trained but also with adversarial images generated from an ensemble of other models. Their method, implemented on an Inception V2 classifier, finished 1st among 70 submissions in the NIPS 2017 competition on Defenses against Adversarial Attacks. Graese et al. [3] showed how input transformations such as shifting, blurring, and noise can render the majority of adversarial examples non-adversarial. Xu et al. [5] demonstrated how feature squeezing methods, such as reducing the color bit depth of each pixel and spatial smoothing, defend against attacks. Dziugaite et al. [6] studied the effect of JPG compression on adversarial images. Chen et al. [7] introduced an advanced denoising algorithm with GAN-based noise modeling in order to improve blind denoising performance in low-level vision processing. The GAN is trained to estimate the noise distribution over the input noisy images and to generate noise samples. Although meant for image processing, this method can be generalized to target adversarial examples, where the unknown noise-generating algorithm can be leveraged.<br />
<br />
==Terminology==<br />
<br />
'''Gray Box Attack''': The model architecture and parameters are public, but the adversary is unaware of the defense strategy being used.<br />
<br />
'''Black Box Attack''': Adversary does not have access to the model.<br />
<br />
An interesting and important observation of adversarial examples is that they generally are not model or architecture specific. Adversarial examples generated for one neural network architecture will transfer very well to another architecture. In other words, if you wanted to trick a model you could create your own model and adversarial examples based off of it. Then these same adversarial examples will most probably trick the other model as well. This has huge implications as it means that it is possible to create adversarial examples for a completely black box model where we have no prior knowledge of the internal mechanics. [https://ml.berkeley.edu/blog/2018/01/10/adversarial-examples/ reference]<br />
<br />
'''Non Targeted Adversarial Attack''': The goal of the attack is to modify a source image in a way such that the image will be classified incorrectly by the network.<br />
<br />
The following example illustrates a non-targeted adversarial attack [https://ml.berkeley.edu/blog/2018/01/10/adversarial-examples/ reference]:<br />
[[File:non-targeted O.JPG| 600px|center]]<br />
<br />
'''Targeted Adversarial Attack''': The goal of the attack is to modify a source image in such a way that the image will be classified as a ''target'' class by the network.<br />
<br />
The following example illustrates a targeted adversarial attack [https://ml.berkeley.edu/blog/2018/01/10/adversarial-examples/ reference]:<br />
[[File:Targeted O.JPG| 600px|center]]<br />
<br />
'''Defense''': A defense is a strategy that aims to make the prediction on an adversarial example h(x') equal to the prediction on the corresponding clean example h(x).<br />
<br />
== Problem Definition ==<br />
The paper discusses non-targeted adversarial attacks for image recognition systems. Given image space <math>\mathcal{X} = [0,1]^{H \times W \times C}</math>, a source image <math>x \in \mathcal{X}</math>, and a classifier <math>h(\cdot)</math>, a non-targeted adversarial example of <math>x</math> is a perturbed image <math>x'</math>, such that <math>h(x) \neq h(x')</math> and <math>d(x, x') \leq \rho</math> for some dissimilarity function <math>d(\cdot, \cdot)</math> and <math>\rho \geq 0</math>. Ideally, <math>d(\cdot, \cdot)</math> measures the perceptual difference between the original image <math>x</math> and the perturbed image <math>x'</math>, but in practice the Euclidean distance (<math>||x - x'||_2</math>) or the Chebyshev distance (<math>||x - x'||_{\infty}</math>) is usually used.<br />
<br />
From a set of N clean images <math>\{x_{1}, \ldots, x_{N}\}</math>, an adversarial attack aims to generate images <math>\{x'_{1}, \ldots, x'_{N}\}</math>, such that <math>x'_{n}</math> is an adversarial example of <math>x_{n}</math>.<br />
<br />
The success rate of an attack is given as: <br />
<br />
<center><math><br />
\frac{1}{N}\sum_{n=1}^{N}I[h(x_n) \neq h(x'_n)],<br />
</math></center><br />
<br />
which is the proportion of predictions that were altered by the attack.<br />
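<br />
As a minimal sketch of the success-rate formula above, the following Python snippet counts how many predictions change under attack. The function name and the example labels are illustrative assumptions, not part of the paper's code.<br />

```python
def attack_success_rate(clean_preds, adv_preds):
    """Fraction of images whose predicted label changed under attack."""
    if len(clean_preds) != len(adv_preds):
        raise ValueError("prediction lists must have the same length")
    if not clean_preds:
        raise ValueError("need at least one prediction")
    # The indicator I[h(x_n) != h(x'_n)] summed over all N images.
    changed = sum(1 for c, a in zip(clean_preds, adv_preds) if c != a)
    return changed / len(clean_preds)


# Hypothetical labels h(x_n) and h(x'_n) for N = 4 images.
clean = ["panda", "cat", "dog", "ship"]
adversarial = ["gibbon", "cat", "wolf", "ship"]
rate = attack_success_rate(clean, adversarial)
print(rate)  # two of the four predictions changed
```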
<br />
The success rate is generally measured as a function of the magnitude of perturbations performed by the attack. In this paper, L2 perturbations are used and are quantified using the normalized L2-dissimilarity metric:<br />
<math> \frac{1}{N} \sum_{n=1}^N{\frac{\vert \vert x_n - x'_n \vert \vert_2}{\vert \vert x_n \vert \vert_2}} </math><br />
<br />
A strong adversarial attack has a high success rate while its normalized L2-dissimilarity, given by the above equation, is low.<br />
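<br />
The normalized L2-dissimilarity can be sketched in a few lines of Python. This is an illustrative implementation that assumes each image has been flattened into a list of pixel values and that no clean image is all zeros; the function names and sample values are hypothetical.<br />

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values))


def normalized_l2_dissimilarity(clean_images, adv_images):
    """Mean of ||x_n - x'_n||_2 / ||x_n||_2 over flattened image pairs."""
    ratios = []
    for x, x_adv in zip(clean_images, adv_images):
        diff = [a - b for a, b in zip(x, x_adv)]
        ratios.append(l2_norm(diff) / l2_norm(x))
    return sum(ratios) / len(ratios)


# Two tiny hypothetical "images" with two pixels each.
clean_images = [[0.6, 0.8], [1.0, 0.0]]
adv_images = [[0.6, 0.7], [0.8, 0.0]]
score = normalized_l2_dissimilarity(clean_images, adv_images)
print(score)
```

Here the per-image ratios are roughly 0.1 and 0.2, so the average is close to 0.15.<br />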
<br />
In most practical settings, an adversary does not have direct access to the model <math>h(\cdot)</math> and has to mount a black-box attack. <br />
<br />
However, prior work has shown successful attacks by transferring adversarial examples generated for a separately-trained model to an unknown target model (Liu et al., 2016), which makes efficient black-box attacks possible. <br />
<br />
As a result, the authors investigate both the black-box setting and a more difficult gray-box setting, in which the adversary has access to the model architecture and the model parameters but is unaware of the defence strategy that is being used.<br />
<br />
A defence is an approach that aims to make the prediction on an adversarial example <math>h(x')</math> equal to the prediction on the corresponding clean example <math>h(x)</math>. In this study, the authors focus on image transformation defenses <math>g(x)</math> that perform prediction via <math>h(g(x'))</math>. Ideally, <math>g(\cdot)</math> is a complex, non-differentiable, and potentially stochastic function: this makes it difficult for an adversary to attack the prediction model <math>h(g(x))</math> even when the adversary knows both <math>h(\cdot)</math> and <math>g(\cdot)</math>.<br />
<br />
==Adversarial Attacks==<br />
<br />
Although the exact effect that adversarial examples have on the network is unknown, Ian Goodfellow et al.'s Deep Learning book states that adversarial examples exploit the linearity of neural networks to perturb the cost function and force incorrect classifications. Images are often high resolution, and thus have thousands of pixels (millions for HD images). An epsilon-ball perturbation greatly affects the cost function when the dimensionality is in the thousands or millions (especially if it increases the loss at every pixel). Hence, although methods such as FGSM and Iterative FGSM are very straightforward, they greatly influence the network under a white-box attack. <br />
<br />
The following four attacks are studied in the paper:<br />
<br />
1. '''Fast Gradient Sign Method (FGSM; Goodfellow et al. (2015)) [17]''': Let <math>x</math> be a source input with true label <math>y</math>, and let <math>l(\cdot,\cdot)</math> be the differentiable loss function used to train the classifier <math>h(\cdot)</math>. The corresponding adversarial example is given by:<br />
<br />
<math>x' = x + \epsilon \cdot sign(\nabla_x l(x, y))</math><br />
<br />
for some <math>\epsilon \gt 0</math> which controls the perturbation magnitude.<br />
<br />
2. '''Iterative FGSM (I-FGSM; Kurakin et al. (2016b)) [14]''': iteratively applies the FGSM update for <math>M</math> iterations. It is given as:<br />
<br />
<math>x^{(m)} = x^{(m-1)} + \epsilon \cdot sign(\nabla_{x^{(m-1)}} l(x^{(m-1)}, y))</math><br />
<br />
where <math>m = 1,...,M; x^{(0)} = x;</math> and <math>x' = x^{(M)}</math>. M is set such that <math>h(x) \neq h(x')</math>.<br />
<br />
Both FGSM and I-FGSM work by minimizing the Chebyshev distance between the inputs and the generated adversarial examples.<br />
<br />
3. '''DeepFool (Moosavi-Dezfooli et al., 2016) [15]''': projects <math>x</math> onto a linearization of the decision boundary defined by the binary classifier <math>h(\cdot)</math> for M iterations. This can be particularly effective when a network uses ReLU activation functions. It is given as:<br />
<br />
[[File:DeepFool.PNG|400px |]]<br />
<br />
4. '''Carlini-Wagner's L2 attack (CW-L2; Carlini & Wagner (2017)) [16]''': an optimization-based attack that combines a differentiable surrogate for the model’s classification accuracy with an L2-penalty term, which encourages the adversarial image to be close to the original image. Let <math>Z(x)</math> be the operation that computes the logit vector (i.e., the output before the softmax layer) for an input <math>x</math>, and <math>Z(x)_k</math> be the logit value corresponding to class <math>k</math>. The untargeted variant of CW-L2 finds a solution to the unconstrained optimization problem given as:<br />
<br />
[[File:Carlini.PNG|500px |]]<br />
<br />
As mentioned earlier, the first two attacks minimize the Chebyshev distance whereas the last two attacks minimize the Euclidean distance between the inputs and the adversarial examples.<br />
<br />
All the methods described above maintain <math>x' \in \mathcal{X}</math> by performing value clipping. <br />
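<br />
The FGSM and I-FGSM updates, together with value clipping, can be sketched as follows. This is a toy illustration only: it assumes a hypothetical linear logistic-regression classifier (weights <code>w</code>, bias <code>b</code>) so that the loss gradient has a closed form, and it runs I-FGSM for a fixed number of iterations rather than stopping when the prediction changes. The attacks in the paper target deep networks instead.<br />

```python
import math


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def clip01(x):
    """Value clipping that keeps every pixel inside [0, 1]."""
    return [min(1.0, max(0.0, v)) for v in x]


def loss_gradient(x, y, w, b):
    """Gradient w.r.t. x of the logistic loss of a toy linear classifier."""
    logit = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-logit))
    return [(p - y) * wi for wi in w]


def fgsm(x, y, w, b, eps):
    """One FGSM step: x' = x + eps * sign(grad_x loss), then clip."""
    grad = loss_gradient(x, y, w, b)
    return clip01([xi + eps * sign(g) for xi, g in zip(x, grad)])


def iterative_fgsm(x, y, w, b, eps, num_iters):
    """Apply the FGSM update num_iters times, starting from x."""
    x_adv = list(x)
    for _ in range(num_iters):
        x_adv = fgsm(x_adv, y, w, b, eps)
    return x_adv


def predict(x, w, b):
    """Class 1 if the logit is positive, else class 0."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)


# Hypothetical model parameters and a three-pixel input with label 1.
w, b = [2.0, -1.0, 0.5], -0.5
x, y = [0.6, 0.2, 0.4], 1

x_fgsm = fgsm(x, y, w, b, eps=0.25)
x_ifgsm = iterative_fgsm(x, y, w, b, eps=0.05, num_iters=5)
print(predict(x, w, b), predict(x_fgsm, w, b), predict(x_ifgsm, w, b))
```

In this toy setting both attacks move every pixel by at most 0.25, which bounds the Chebyshev distance between the input and the adversarial example.<br />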
<br />
The figure below shows adversarial images and the corresponding perturbations at five levels of normalized L2-dissimilarity for all four attacks mentioned above.<br />
<br />
[[File:Strength.PNG|thumb|center| 600px |Figure 1: Adversarial images and corresponding perturbations at five levels of normalized L2- dissimilarity for all four attacks.]]<br />
<br />
==Defenses==<br />
A defense is a strategy that aims to make the prediction on an adversarial example equal to the prediction on the corresponding clean example. The particular structure of the adversarial perturbations <math> x-x' </math> is shown in Figure 1.<br />
Five image transformations that alter the structure of these perturbations are studied:<br />
# Image Cropping and Re-scaling, <br />
# Bit Depth Reduction, <br />
# JPEG Compression, <br />
# Total Variance Minimization, <br />
# Image Quilting.<br />
<br />
'''Image Cropping and Rescaling''' has the effect of altering the spatial positioning of the adversarial perturbation. In this study, images are cropped and re-scaled at training time as part of data augmentation. At test time, the predictions over random image crops are averaged.<br />
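<br />
A rough sketch of test-time crop averaging is given below. It is a simplification under stated assumptions: images are 2D lists of grayscale values, the rescaling step is omitted, and <code>toy_classifier</code> is a hypothetical stand-in that returns a probability vector; none of these names come from the paper's code.<br />

```python
import random


def random_crop(image, crop_h, crop_w, rng):
    """Cut a random crop_h x crop_w window out of a 2D list-of-lists image."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def crop_ensemble_predict(image, classifier, crop_h, crop_w, num_crops, seed=0):
    """Average the class-probability vectors predicted on random crops."""
    rng = random.Random(seed)
    totals = None
    for _ in range(num_crops):
        probs = classifier(random_crop(image, crop_h, crop_w, rng))
        totals = probs if totals is None else [t + p for t, p in zip(totals, probs)]
    return [t / num_crops for t in totals]


def toy_classifier(crop):
    """Hypothetical classifier: 'bright' vs 'dark' from the mean intensity."""
    pixels = [v for row in crop for v in row]
    bright = sum(pixels) / len(pixels)
    return [bright, 1.0 - bright]


# A 4x4 gradient image with values in [0, 1]; 30 crops of size 3x3.
image = [[(r + c) / 6.0 for c in range(4)] for r in range(4)]
avg_probs = crop_ensemble_predict(image, toy_classifier, 3, 3, num_crops=30)
print(avg_probs)
```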
<br />
'''Bit Depth Reduction (Xu et al.) [5]''' performs a simple type of quantization that can remove small (adversarial) variations in pixel values from an image. Images are reduced to 3 bits in the experiments.<br />
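<br />
Bit depth reduction can be sketched as uniform quantization of pixel values. The snippet below assumes pixel values in [0, 1] and a grayscale image stored as a 2D list; the function name and sample values are illustrative.<br />

```python
def reduce_bit_depth(image, bits=3):
    """Quantize pixel values in [0, 1] to 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return [[round(v * levels) / levels for v in row] for row in image]


# A hypothetical clean image and a slightly perturbed copy of it.
clean = [[0.55, 0.25], [0.75, 1.00]]
perturbed = [[0.57, 0.24], [0.74, 0.99]]

# Small perturbations fall into the same quantization bin as the clean pixels.
print(reduce_bit_depth(clean) == reduce_bit_depth(perturbed))
```

Perturbations that push a pixel across a quantization boundary survive the transformation, which is one reason this defense is limited.<br />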
<br />
'''JPEG Compression and Decompression (Dziugaite et al., 2016)''' removes small perturbations by performing simple quantization. The authors use a quality level of 75/100 in their experiments.<br />
<br />
'''Total Variance Minimization (Rudin et al.) [9]''':<br />
This combines pixel dropout with total variation minimization. The approach randomly selects a small set of pixels and reconstructs the “simplest” image that is consistent with the selected pixels. The reconstructed image does not contain the adversarial perturbations because these perturbations tend to be small and localized. Specifically, a random set of pixels is first selected by sampling a Bernoulli random variable <math>X(i, j, k)</math> for each pixel location <math>(i, j, k)</math>; a pixel is maintained when <math>X(i, j, k) = 1</math>. Next, total variation minimization is used to construct an image <math>z</math> that is similar to the (perturbed) input image <math>x</math> for the selected set of pixels, whilst also being “simple” in terms of total variation, by solving:<br />
<br />
[[File:TV!.png|300px|]] , <br />
<br />
where <math>TV_{p}(z)</math> represents <math>L_{p}</math> total variation of '''z''' :<br />
<br />
[[File:TV2.png|500px|]]<br />
<br />
The total variation (TV) measures the amount of fine-scale variation in the image <math>z</math>, as a result of which TV minimization encourages removal of small (adversarial) perturbations in the image. The objective function is convex in <math>z</math>, which makes solving for <math>z</math> straightforward. In the paper, <math>p = 2</math> and a special-purpose solver based on the split Bregman method (Goldstein & Osher, 2009) is employed to perform total variance minimization efficiently.<br />
The effectiveness of TV minimization is illustrated by the images in the middle column of the figure below: in particular, note that the adversarial perturbations that were present in the background of the non-transformed image (see bottom-left image) have nearly completely disappeared in the TV-minimized adversarial image (bottom-center image). As expected, TV minimization also changes image structure in non-homogeneous regions of the image, but as these perturbations were not adversarially designed, the negative effect of these changes is expected to be limited.<br />
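<br />
To make the two ingredients of this defense concrete, the sketch below samples a Bernoulli pixel mask and computes a total variation score. It rests on assumptions: the image is a single-channel 2D list, and TV is taken as the sum of p-norms of differences between adjacent rows and between adjacent columns, which is one common form and may differ in detail from the paper's definition. The minimization itself (the split Bregman solver) is not shown.<br />

```python
import random


def p_norm(values, p):
    """The p-norm of a flat list of numbers."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


def total_variation(z, p=2):
    """Sum of p-norms of differences between adjacent rows and columns."""
    rows, cols = len(z), len(z[0])
    tv = 0.0
    for i in range(1, rows):
        tv += p_norm([z[i][j] - z[i - 1][j] for j in range(cols)], p)
    for j in range(1, cols):
        tv += p_norm([z[i][j] - z[i][j - 1] for i in range(rows)], p)
    return tv


def sample_pixel_mask(rows, cols, keep_prob, seed=0):
    """Bernoulli mask: 1 marks a pixel that is kept, 0 a dropped pixel."""
    rng = random.Random(seed)
    return [[1 if rng.random() < keep_prob else 0 for _ in range(cols)]
            for _ in range(rows)]


# A flat image has zero TV; adding small noise increases it.
flat = [[0.5] * 4 for _ in range(4)]
noise_rng = random.Random(1)
noisy = [[0.5 + noise_rng.uniform(-0.05, 0.05) for _ in range(4)] for _ in range(4)]
mask = sample_pixel_mask(4, 4, keep_prob=0.5)

print(total_variation(flat), total_variation(noisy))
```

Because small, noise-like perturbations raise the TV score, an image with low TV that still matches the kept pixels tends to discard them.<br />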
<br />
[[File:tvx.png]]<br />
<br />
The figure above illustrates total variance minimization and image quilting applied to an original and an adversarial image (produced using I-FGSM with ε = 0.03, corresponding to a normalized L2-dissimilarity of 0.075). From left to right, the columns correspond to (1) no transformation, (2) total variance minimization, and (3) image quilting. From top to bottom, rows correspond to: (1) the original image, (2) the corresponding adversarial image produced by I-FGSM, and (3) the absolute difference between the two images above. Difference images were multiplied by a constant scaling factor to increase visibility.<br />
<br />
<br />
'''Image Quilting (Efros & Freeman, 2001) [8]'''<br />
Image Quilting is a non-parametric technique that synthesizes images by piecing together small patches that are taken from a database of image patches. The algorithm places appropriate patches from the database at a predefined set of grid points and computes minimum graph cuts in all overlapping boundary regions to remove edge artifacts. Image Quilting can be used to remove adversarial perturbations by constructing a patch database that only contains patches from "clean" images (without adversarial perturbations); the patches used to create the synthesized image are selected by finding the K nearest neighbors (in pixel space) of the corresponding patch from the adversarial image in the patch database, and picking one of these neighbors uniformly at random. The motivation for this defense is that the resulting image only contains pixels that were not modified by the adversary: the database of real patches is unlikely to contain the structures that appear in adversarial images.<br />
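<br />
The patch-selection step of this defense can be sketched as a K-nearest-neighbor lookup followed by a uniform random choice. This is a simplified illustration: patches are flattened lists of pixel values, the database is a tiny hypothetical list, and the graph-cut blending of overlapping patches is omitted entirely.<br />

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two flattened patches."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def pick_clean_patch(query_patch, patch_db, k, rng):
    """Pick uniformly at random among the k database patches nearest to the query."""
    nearest = sorted(patch_db, key=lambda patch: squared_distance(patch, query_patch))[:k]
    return rng.choice(nearest)


# Hypothetical database of clean 2x2 patches (flattened).
patch_db = [
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 0.1, 0.1, 0.1],
    [0.9, 0.9, 0.9, 0.9],
    [1.0, 1.0, 1.0, 1.0],
]
# A dark patch from an adversarial image, carrying a small perturbation.
adv_patch = [0.07, 0.02, 0.08, 0.03]
chosen = pick_clean_patch(adv_patch, patch_db, k=2, rng=random.Random(0))
print(chosen)
```

The returned patch always comes from the clean database, so the perturbation in <code>adv_patch</code> is not copied into the synthesized image.<br />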
<br />
=Experiments=<br />
<br />
Five experiments were performed to test the efficacy of the defenses. The first four experiments consider gray-box and black-box attacks. In the gray-box setting, the adversary can read the model architecture and parameters but does not know the defense strategy; the defenses are applied to the adversarial images before they are fed to the convolutional network. In the black-box setting, the convolutional network is replaced by a network trained on transformed images, to which the adversary has no access. The final experiment compares the authors' defenses with prior work. <br />
<br />
'''Set up:'''<br />
Experiments are performed on the ImageNet image classification dataset. The dataset comprises 1.2 million training images and 50,000 test images, each of which corresponds to one of 1000 classes. The adversarial images are produced by attacking a ResNet-50 model with the different kinds of attacks described in the Adversarial Attacks section. The strength of an adversary is measured in terms of its normalized L2-dissimilarity. To produce the adversarial images, the L2-dissimilarity for each of the attacks was controlled as follows:<br />
<br />
- FGSM. Increasing the step size <math>\epsilon</math>, increases the normalized L2-dissimilarity.<br />
<br />
- I-FGSM. We fix M=10, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.<br />
<br />
- DeepFool. We fix M=5, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.<br />
<br />
- CW-L2. We fix <math>k</math>=0 and <math>\lambda_{f}</math> =10, and multiply the resulting perturbation <br />
<br />
The hyperparameters of the defenses have been fixed in all the experiments. Specifically the pixel dropout probability was set to <math>p</math>=0.5 and regularization parameter of total variation minimizer <math>\lambda_{TV}</math>=0.03.<br />
<br />
The figure below shows how the setup differs across the experiments. The network is trained on either a) regular images or b) transformed images. The different settings are marked by 8.1, 8.2 and 8.3. <br />
[[File:models3.png |center]] <br />
<br />
==GrayBox - Image Transformation at Test Time== <br />
This experiment applies a transformation to adversarial images at test time before feeding them to a ResNet-50 that was trained to classify clean images. The figure below shows the top-1 accuracy for five different transformations. Some interesting observations from the plot: all of the image transformations partly eliminate the effects of the attack; the crop ensemble gives the best accuracy, around 40-60 percent, with an ensemble size of 30; and the accuracy of the image quilting defense hardly deteriorates as the strength of the adversary increases, although it does reduce accuracy on non-adversarial examples.<br />
<br />
[[File:sFig4.png|center|600px |]]<br />
<br />
==BlackBox - Image Transformation at Training and Test Time==<br />
A ResNet-50 model was trained on transformed ImageNet training images. Before the images were fed to the network for training, standard data augmentation (from He et al.) along with bit depth reduction, JPEG compression, TV minimization, or image quilting was applied to them. The classification accuracy on the same adversarial images as in the previous experiment is shown in the figure below. Because the adversary cannot access this trained model to generate new adversarial images, this is treated as a black-box setting. The figure shows that training convolutional neural networks on images that are transformed in the same way as at test time dramatically improves the effectiveness of all transformation defenses. Nearly 80-90% of the attacks are defended successfully, even when the L2-dissimilarity is high.<br />
<br />
<br />
[[File:sFig5.png|center|600px |]]<br />
<br />
<br />
==Blackbox - Ensembling==<br />
Four networks (ResNet-50, ResNet-101, DenseNet-169, and Inception-v4) along with an ensemble of defenses were studied, as shown in Table 1. The adversarial images are produced by attacking a ResNet-50 model. The results in the table indicate that Inception-v4 performs best, possibly because that network has a higher accuracy even in non-adversarial settings. The best ensemble of defenses achieves an accuracy of about 71% against all attacks. The attacks deteriorate the accuracy of the best defenses (a combination of cropping, TVM, image quilting, and model transfer) by at most 6%. Gains of 1-2% in classification accuracy were obtained from ensembling different defenses, while gains of 2-3% were obtained from transferring attacks to different network architectures.<br />
<br />
<br />
[[File:sTab1.png|600px|thumb|center|Table 1. Top-1 classification accuracy of ensemble and model transfer defenses (columns) against four black-box attacks (rows). The four networks we use to classify images are ResNet-50 (RN50), ResNet-101 (RN101), DenseNet-169 (DN169), and Inception-v4 (Iv4). Adversarial images are generated by running attacks against the ResNet-50 model, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. Higher is better. The best defense against each attack is typeset in boldface.]]<br />
<br />
==GrayBox - Image Transformation at Training and Test Time ==<br />
In this experiment, the adversary has access to the network and its parameters (but not to the input transformations applied at test time). Novel adversarial images were generated by the four attack methods from the networks trained in the previous setting (BlackBox - Image Transformation at Training and Test Time). The results show that bit depth reduction and JPEG compression are weak defenses in such a gray-box setting. In contrast, image cropping and rescaling, total variation minimization, and image quilting are more robust against adversarial images in this setting.<br />
The results for this experiment are shown in the figure below. Networks using these defenses classify up to 50% of images correctly.<br />
<br />
[[File:sFig6.png|center| 600px |]]<br />
<br />
==Comparison With Ensemble Adversarial Training==<br />
The results of the experiments are compared with the state-of-the-art ensemble adversarial training approach proposed by Tramèr et al. [2]. Ensemble adversarial training fits the parameters of a convolutional neural network on adversarial examples that were generated to attack an ensemble of pre-trained models. The model released by Tramèr et al. [2] is an Inception-Resnet-v2 trained on adversarial examples generated by FGSM against Inception-Resnet-v2 and Inception-v3 models. The authors compared it with their ResNet-50 models using image cropping, total variance minimization, and image quilting defenses. Two differences in assumptions should be noted: the authors' defenses assume that the input transformation is unknown to the adversary, and they use no prior knowledge of the attacks. The results of ensemble training and the pre-processing techniques mentioned in this paper are shown in Table 2. The results show that ensemble adversarial training works better on FGSM attacks (which it uses at training time), but is outperformed by each of the transformation-based defenses on all other attacks.<br />
<br />
<br />
<br />
[[File:sTab2.png|600px|thumb|center|Table 2. Top-1 classification accuracy on images perturbed using attacks against ResNet-50 models trained on input-transformed images and an Inception-v4 model trained using ensemble adversarial training. Adversarial images are generated by running attacks against the models, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. The best defense against each attack is typeset in boldface.]]<br />
<br />
=Discussion/Conclusions=<br />
The paper proposed reasonable approaches to countering adversarial images. The authors evaluated Total Variance Minimization and Image Quilting and compared them with previously proposed ideas such as Image Cropping and Rescaling, Bit Depth Reduction, and JPEG Compression and Decompression on the challenging ImageNet dataset.<br />
Previous work by Wang et al. [10] shows that a strong input defense should be non-differentiable and randomized. Two of the defenses, namely Total Variation Minimization and Image Quilting, possess both properties.<br />
<br />
Image quilting involves a discrete variable that selects a patch from the database, which is a non-differentiable operation. Additionally, total variation minimization randomly selects the pixels it uses to measure reconstruction error when creating the de-noised image, and image quilting selects one of the K nearest neighbors uniformly at random. This inherent randomness makes it difficult to attack the model. <br />
<br />
As future work, the authors suggest applying the same techniques to other domains such as speech recognition and image segmentation. For example, in speech recognition, total variance minimization can be used to remove perturbations from waveforms, and "spectrogram quilting" techniques that reconstruct a spectrogram could be developed. The proposed input-transformation defenses can also be combined with ensemble adversarial training by Tramèr et al. [2] to study new attack methods.<br />
<br />
=Critiques=<br />
1. The terms black-box, white-box, and gray-box attack are not defined precisely.<br />
<br />
2. White-box attacks, in which the adversary has full access to the model as well as the pre-processing techniques, could have been considered.<br />
<br />
3. Although the authors did considerable work in showing the effect of four attacks on the ImageNet dataset, much stronger attacks (Madry et al.) [7] could have been evaluated.<br />
<br />
4. The authors claim that the success rate is generally measured as a function of the magnitude of the perturbations performed by the attack, using the L2-dissimilarity, but the claim is not supported by any references. None of the previous work has used these metrics.<br />
<br />
5. In the new draft of the paper ([https://openreview.net/forum?id=SyJ7ClWCb]), the authors add the sentence "our defenses assume that part of the defense strategy (viz., the input transformation) is unknown to the adversary".<br />
<br />
This is a completely unreasonable assumption. Any algorithm which hopes to be secure must allow the adversary to, at the very least, understand what the defense is that's being used. Consider a world where the defense here is implemented in practice: any attacker in the world could just go look up the paper, read the description of the algorithm, and know how it works.<br />
<br />
=References=<br />
<br />
1. Chuan Guo, Mayank Rana, Moustapha Cissé, and Laurens van der Maaten. Countering Adversarial Images Using Input Transformations.<br />
<br />
2. Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, Patrick McDaniel, Ensemble Adversarial Training: Attacks and defenses.<br />
<br />
3. Abigail Graese, Andras Rozsa, and Terrance E. Boult. Assessing threat of adversarial examples of deep neural networks. CoRR, abs/1610.04256, 2016. <br />
<br />
4. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Adversary resistant deep neural networks with an application to malware detection. CoRR, abs/1610.01239, 2016a.<br />
<br />
5. Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. CoRR, abs/1704.01155, 2017. <br />
<br />
6. Gintare Karolina Dziugaite, Zoubin Ghahramani, and Daniel Roy. A study of the effect of JPG compression on adversarial images. CoRR, abs/1608.00853, 2016.<br />
<br />
7. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv:1706.06083v3.<br />
<br />
8. Alexei Efros and William Freeman. Image quilting for texture synthesis and transfer. In Proc. SIGGRAPH, pp. 341–346, 2001.<br />
<br />
9. Leonid Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60:259–268, 1992.<br />
<br />
10. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Learning adversary-resistant deep neural networks. CoRR, abs/1612.01401, 2016b.<br />
<br />
11. Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. CoRR, abs/1611.02770, 2016.<br />
<br />
12. Moustapha Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet. Houdini: Fooling deep structured prediction models. CoRR, abs/1707.05373, 2017.<br />
<br />
13. Marco Melis, Ambra Demontis, Battista Biggio, Gavin Brown, Giorgio Fumera, and Fabio Roli. Is deep learning safe for robot vision? Adversarial examples against the iCub humanoid. CoRR, abs/1708.06939, 2017.<br />
<br />
14. Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016b.<br />
<br />
15. Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: A simple and accurate method to fool deep neural networks. In Proc. CVPR, pp. 2574–2582, 2016.<br />
<br />
16. Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pp. 39–57, 2017.<br />
<br />
17. Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In Proc. ICLR, 2015.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Countering_Adversarial_Images_Using_Input_Transformations&diff=42137Countering Adversarial Images Using Input Transformations2018-11-30T22:40:48Z<p>Z43ma: </p>
<hr />
<div>The code for this paper is available here[https://github.com/facebookresearch/adversarial_image_defenses]<br />
<br />
==Motivation ==<br />
As the use of machine intelligence has increased, robustness has become a critical feature to guarantee the reliability of deployed machine-learning systems. However, recent research has shown that existing models are not robust to small, adversarially designed perturbations to the input. Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake. Adversarially perturbed examples have been deployed to attack image classification services (Liu et al., 2016)[11], speech recognition systems (Cisse et al., 2017a)[12], and robot vision (Melis et al., 2017)[13]. The existence of these adversarial examples has motivated proposals for approaches that increase the robustness of learning systems to such examples. In the example below (Goodfellow et al., 2015) [17], a small perturbation is applied to the original image of a panda, changing the prediction to a gibbon.<br />
<br />
[[File:Panda.png|center]]<br />
<br />
==Introduction==<br />
The paper studies strategies that defend against adversarial example attacks on image classification systems by transforming the images before feeding them to a Convolutional Network Classifier. <br />
Generally, defenses against adversarial examples fall into two main categories:<br />
<br />
# Model-Specific – They enforce model properties such as smoothness and invariance via the learning algorithm. <br />
# Model-Agnostic – They try to remove adversarial perturbations from the input. <br />
<br />
Model-specific defense strategies make strong assumptions about expected adversarial attacks. As a result, they violate Kerckhoffs's principle, which requires a defense to remain effective even when the adversary knows how it works: adversaries can circumvent model-specific defenses by simply changing how an attack is executed. This paper focuses on increasing the effectiveness of model-agnostic defense strategies. Specifically, the authors investigate the following image transformations as a means of protecting against adversarial images:<br />
<br />
# Image Cropping and Re-scaling (Graese et al., 2016) <br />
# Bit Depth Reduction (Xu et al., 2017) <br />
# JPEG Compression (Dziugaite et al., 2016) <br />
# Total Variance Minimization (Rudin et al., 1992) <br />
# Image Quilting (Efros & Freeman, 2001) <br />
<br />
These image transformations are evaluated against adversarial attacks such as the fast gradient sign method (Goodfellow et al., 2015), its iterative extension (Kurakin et al., 2016a), DeepFool (Moosavi-Dezfooli et al., 2016), and the Carlini & Wagner (2017) <math>L_2</math> attack. <br />
<br />
The authors aim to increase the effectiveness of model-agnostic defense strategies through approaches that:<br />
# remove the adversarial perturbations from input images,<br />
# maintain sufficient information in input images to correctly classify them,<br />
# and are still effective in situations where the adversary has information about the defense strategy being used.<br />
<br />
From their experiments, the strongest defenses are based on Total Variance Minimization and Image Quilting. These defenses are non-differentiable and inherently random which makes it difficult for an adversary to get around them.<br />
<br />
==Previous Work==<br />
Recently, a lot of research has focused on countering adversarial threats. Wang et al. [4] proposed an adversary-resistant technique that obstructs attackers from constructing impactful adversarial images by randomly nullifying features within images. Tramèr et al. [2] presented the state-of-the-art Ensemble Adversarial Training method, which augments the training data not only with adversarial images constructed from the model being trained but also with adversarial images generated from an ensemble of other models. Their method, implemented on an Inception V2 classifier, finished 1st among 70 submissions in the NIPS 2017 competition on Defenses against Adversarial Attacks. Graese et al. [3] showed how input transformations such as shifting, blurring, and noise can render the majority of adversarial examples non-adversarial. Xu et al. [5] demonstrated how feature squeezing methods, such as reducing the color bit depth of each pixel and spatial smoothing, defend against attacks. Dziugaite et al. [6] studied the effect of JPG compression on adversarial images. Chen et al. introduced an advanced denoising algorithm with GAN-based noise modeling in order to improve blind denoising performance in low-level vision processing. The GAN is trained to estimate the noise distribution over the input noisy images and to generate noise samples. Although meant for image processing, this method can be generalized to target adversarial examples, where the unknown noise-generating algorithm can be leveraged.<br />
<br />
==Terminology==<br />
<br />
'''Gray Box Attack''': The model architecture and parameters are known to the adversary, but the defense strategy is not.<br />
<br />
'''Black Box Attack''': Adversary does not have access to the model.<br />
<br />
An interesting and important observation of adversarial examples is that they generally are not model or architecture specific. Adversarial examples generated for one neural network architecture will transfer very well to another architecture. In other words, if you wanted to trick a model you could create your own model and adversarial examples based off of it. Then these same adversarial examples will most probably trick the other model as well. This has huge implications as it means that it is possible to create adversarial examples for a completely black box model where we have no prior knowledge of the internal mechanics. [https://ml.berkeley.edu/blog/2018/01/10/adversarial-examples/ reference]<br />
<br />
'''Non Targeted Adversarial Attack''': The goal of the attack is to modify a source image in a way such that the image will be classified incorrectly by the network.<br />
<br />
An example of a non-targeted adversarial attack [https://ml.berkeley.edu/blog/2018/01/10/adversarial-examples/ reference]:<br />
[[File:non-targeted O.JPG| 600px|center]]<br />
<br />
'''Targeted Adversarial Attack''': The goal of the attack is to modify a source image in such a way that the image will be classified as a ''target'' class by the network.<br />
<br />
An example of a targeted adversarial attack [https://ml.berkeley.edu/blog/2018/01/10/adversarial-examples/ reference]:<br />
[[File:Targeted O.JPG| 600px|center]]<br />
<br />
'''Defense''': A defense is a strategy that aims to make the prediction on an adversarial example h(x') equal to the prediction on the corresponding clean example h(x).<br />
<br />
== Problem Definition ==<br />
The paper discusses non-targeted adversarial attacks for image recognition systems. Given image space <math>\mathcal{X} = [0,1]^{H \times W \times C}</math>, a source image <math>x \in \mathcal{X}</math>, and a classifier <math>h(\cdot)</math>, a non-targeted adversarial example of <math>x</math> is a perturbed image <math>x'</math>, such that <math>h(x) \neq h(x')</math> and <math>d(x, x') \leq \rho</math> for some dissimilarity function <math>d(\cdot, \cdot)</math> and <math>\rho \geq 0</math>. In the best case scenario, <math>d(\cdot, \cdot)</math> measures the perceptual difference between the original image <math>x</math> and the perturbed image <math>x'</math>, but usually the Euclidean distance (<math>||x - x'||_2</math>) or the Chebyshev distance (<math>||x - x'||_{\infty}</math>) is used.<br />
<br />
From a set of N clean images <math>\{x_{1}, \dots, x_{N}\}</math>, an adversarial attack aims to generate images <math>\{x'_{1}, \dots, x'_{N}\}</math>, such that <math>x'_{n}</math> is an adversarial example of <math>x_{n}</math>.<br />
<br />
The success rate of an attack is given as: <br />
<br />
<center><math><br />
\frac{1}{N}\sum_{n=1}^{N}I[h(x_n) \neq h(x'_n)],<br />
</math></center><br />
<br />
which is the proportion of predictions that were altered by an attack.<br />
<br />
The success rate is generally measured as a function of the magnitude of perturbations performed by the attack. In this paper, L2 perturbations are used and are quantified using the normalized L2-dissimilarity metric:<br />
<math> \frac{1}{N} \sum_{n=1}^N{\frac{\vert \vert x_n - x'_n \vert \vert_2}{\vert \vert x_n \vert \vert_2}} </math><br />
<br />
A strong adversarial attack has a high success rate while keeping its normalized L2-dissimilarity, given by the above equation, low.<br />
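<br />
The two metrics above can be sketched in a few lines of NumPy. This is only an illustration, not the authors' code: the function names <code>attack_success_rate</code> and <code>normalized_l2_dissimilarity</code> are hypothetical, and images are assumed to be stacked along the first axis of an array.<br />

```python
import numpy as np

def attack_success_rate(clean_preds, adv_preds):
    """Fraction of examples whose predicted label was changed by the attack."""
    clean_preds = np.asarray(clean_preds)
    adv_preds = np.asarray(adv_preds)
    return float(np.mean(clean_preds != adv_preds))

def normalized_l2_dissimilarity(clean, adv):
    """Mean over images of ||x_n - x'_n||_2 / ||x_n||_2."""
    # Flatten every image to a vector so that the norm runs over all pixels.
    clean = np.asarray(clean, dtype=np.float64).reshape(len(clean), -1)
    adv = np.asarray(adv, dtype=np.float64).reshape(len(adv), -1)
    num = np.linalg.norm(clean - adv, axis=1)
    den = np.linalg.norm(clean, axis=1)
    return float(np.mean(num / den))
```

For instance, scaling every pixel of every image by 1.1 gives a normalized L2-dissimilarity of 0.1, independent of the image size.<br />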
<br />
In most practical settings, an adversary does not have direct access to the model <math>h(·)</math> and has to do a black-box attack. <br />
<br />
However, prior work has shown successful attacks by transferring adversarial examples generated for a separately-trained model to an unknown target model (Liu et al., 2016), thus allowing efficient black-box attacks. <br />
<br />
As a result, the authors investigate both the black-box and a more difficult gray-box attack setting: the adversary has access to the model architecture and the model parameters, but is unaware of the defense strategy that is being used.<br />
<br />
A defense is an approach that aims to make the prediction on an adversarial example <math>h(x')</math> equal to the prediction on the corresponding clean example <math>h(x)</math>. In this study, the authors focus on image transformation defenses <math>g(x)</math> that perform prediction via <math>h(g(x'))</math>. Ideally, <math>g(\cdot)</math> is a complex, non-differentiable, and potentially stochastic function: this makes it difficult for an adversary to attack the prediction model <math>h(g(x))</math> even when the adversary knows both <math>h(\cdot)</math> and <math>g(\cdot)</math>.<br />
<br />
==Adversarial Attacks==<br />
<br />
Although the exact effect that adversarial examples have on the network is unknown, the Deep Learning book by Goodfellow et al. states that adversarial examples exploit the linearity of neural networks to perturb the cost function and force incorrect classifications. Images are often high resolution and thus have thousands of pixels (millions for HD images). When the dimensionality is in the thousands or millions, a perturbation within an epsilon ball greatly affects the cost function (especially if it increases the loss at every pixel). Hence, although methods such as FGSM and Iterative FGSM are very straightforward, they greatly influence the network under a white-box attack. <br />
<br />
The paper studies the following four attacks:<br />
<br />
1. '''Fast Gradient Sign Method (FGSM; Goodfellow et al. (2015)) [17]''': Let <math>x</math> be a source input with true label <math>y</math>, and let <math>l(\cdot,\cdot)</math> be the differentiable loss function used to train the classifier <math>h(\cdot)</math>. Then the corresponding adversarial example is given by:<br />
<br />
<math>x' = x + \epsilon \cdot sign(\nabla_x l(x, y))</math><br />
<br />
for some <math>\epsilon \gt 0</math> which controls the perturbation magnitude.<br />
<br />
2. '''Iterative FGSM (I-FGSM; Kurakin et al. (2016b)) [14]''': iteratively applies the FGSM update for <math>M</math> iterations. It is given as:<br />
<br />
<math>x^{(m)} = x^{(m-1)} + \epsilon \cdot sign(\nabla_{x^{(m-1)}} l(x^{(m-1)}, y))</math><br />
<br />
where <math>m = 1,...,M; x^{(0)} = x;</math> and <math>x' = x^{(M)}</math>. M is set such that <math>h(x) \neq h(x')</math>.<br />
<br />
Both FGSM and I-FGSM work by minimizing the Chebyshev distance between the inputs and the generated adversarial examples.<br />
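<br />
A minimal sketch of the two updates is given below. To keep it self-contained it attacks a toy logistic-regression classifier whose loss gradient is available in closed form, rather than the ResNet-50 used in the paper; the names <code>loss_gradient</code>, <code>fgsm</code> and <code>i_fgsm</code> and the parameters <code>w</code>, <code>b</code> are assumptions made for illustration. The iterative version runs for a fixed number of steps instead of choosing M so that the prediction flips.<br />

```python
import numpy as np

def loss_gradient(x, y, w, b):
    """Gradient of the logistic loss l(x, y) with respect to the input x.

    Toy model: p = sigmoid(w . x + b), with label y in {0, 1}.
    """
    p = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))
    return (p - y) * w

def fgsm(x, y, w, b, eps):
    """One FGSM step: x' = x + eps * sign(grad_x l(x, y)), clipped to [0, 1]."""
    x_adv = x + eps * np.sign(loss_gradient(x, y, w, b))
    return np.clip(x_adv, 0.0, 1.0)

def i_fgsm(x, y, w, b, eps, num_iters):
    """Apply the FGSM update num_iters times, starting from x^(0) = x."""
    x_adv = np.array(x, dtype=np.float64)
    for _ in range(num_iters):
        x_adv = fgsm(x_adv, y, w, b, eps)
    return x_adv
```

Because every coordinate moves by exactly <math>\epsilon</math> per step (before clipping), the perturbation is bounded in Chebyshev distance by <math>\epsilon</math> for FGSM and by <math>M \epsilon</math> for I-FGSM.<br />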
<br />
3. '''DeepFool (Moosavi-Dezfooli et al., 2016) [15]''': projects <math>x</math> onto a linearization of the decision boundary defined by the binary classifier <math>h(\cdot)</math> for M iterations. This can be particularly effective when a network uses ReLU activation functions. It is given as:<br />
<br />
[[File:DeepFool.PNG|400px |]]<br />
<br />
4. '''Carlini-Wagner's L2 attack (CW-L2; Carlini & Wagner (2017)) [16]''': an optimization-based attack that combines a differentiable surrogate for the model’s classification accuracy with an L2-penalty term which encourages the adversarial image to be close to the original image. Let <math>Z(x)</math> be the operation that computes the logit vector (i.e., the output before the softmax layer) for an input <math>x</math>, and <math>Z(x)_k</math> be the logit value corresponding to class <math>k</math>. The untargeted variant of CW-L2 finds a solution to the following unconstrained optimization problem:<br />
<br />
[[File:Carlini.PNG|500px |]]<br />
<br />
As mentioned earlier, the first two attacks minimize the Chebyshev distance whereas the last two attacks minimize the Euclidean distance between the inputs and the adversarial examples.<br />
<br />
All the methods described above maintain <math>x' \in \mathcal{X}</math> by performing value clipping. <br />
<br />
The figure below shows adversarial images and corresponding perturbations at five levels of normalized L2-dissimilarity for all four attacks.<br />
<br />
[[File:Strength.PNG|thumb|center| 600px |Figure 1: Adversarial images and corresponding perturbations at five levels of normalized L2- dissimilarity for all four attacks.]]<br />
<br />
==Defenses==<br />
A defense is a strategy that aims to make the prediction on an adversarial example equal to the prediction on the corresponding clean example. The particular structure of the adversarial perturbations <math> x-x' </math> is shown in Figure 1.<br />
Five image transformations that alter the structure of these perturbations have been studied:<br />
# Image Cropping and Re-scaling, <br />
# Bit Depth Reduction, <br />
# JPEG Compression, <br />
# Total Variance Minimization, <br />
# Image Quilting.<br />
<br />
'''Image cropping and Rescaling''' has the effect of altering the spatial positioning of the adversarial perturbation. In this study, images are cropped and re-scaled at training time as part of data augmentation. At test time, the predictions over random image crops are averaged.<br />
<br />
'''Bit Depth Reduction (Xu et al., 2017) [5]''' performs a simple type of quantization that can remove small (adversarial) variations in pixel values from an image. Images are reduced to 3 bits in the experiment.<br />
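<br />
As a rough illustration of this quantization, the sketch below maps pixel values in [0, 1] onto <math>2^b</math> evenly spaced levels. The function name <code>reduce_bit_depth</code> and the rounding scheme are assumptions; the original implementation may quantize differently.<br />

```python
import numpy as np

def reduce_bit_depth(image, bits=3):
    """Quantize pixel values in [0, 1] to 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    image = np.clip(np.asarray(image, dtype=np.float64), 0.0, 1.0)
    # Snap each pixel to the nearest representable level.
    return np.round(image * levels) / levels
```

Two pixel values that differ only by a small perturbation, such as 0.30 and 0.31, fall onto the same level at 3 bits, which is how the transformation can erase small adversarial changes.<br />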
<br />
'''JPEG Compression and Decompression (Dziugaite et al., 2016) [6]''' removes small perturbations by performing simple quantization. The authors use a quality level of 75/100 in their experiments.<br />
<br />
'''Total Variance Minimization (Rudin et al., 1992) [9]''':<br />
This combines pixel dropout with total variation minimization. The approach randomly selects a small set of pixels and reconstructs the “simplest” image that is consistent with the selected pixels. The reconstructed image does not contain the adversarial perturbations because these perturbations tend to be small and localized. Specifically, a random set of pixels is first selected by sampling a Bernoulli random variable <math>X(i, j, k)</math> for each pixel location <math>(i, j, k)</math>; a pixel is maintained when <math>X(i, j, k) = 1</math>. Next, total variation minimization is used to construct an image <math>z</math> that is similar to the (perturbed) input image <math>x</math> for the selected set of pixels, whilst also being “simple” in terms of total variation, by solving:<br />
<br />
[[File:TV!.png|300px|]] , <br />
<br />
where <math>TV_{p}(z)</math> represents <math>L_{p}</math> total variation of '''z''' :<br />
<br />
[[File:TV2.png|500px|]]<br />
<br />
The total variation (TV) measures the amount of fine-scale variation in the image <math>z</math>, so TV minimization encourages the removal of small (adversarial) perturbations in the image. The objective function is convex in <math>z</math>, which makes solving for <math>z</math> straightforward. The paper uses p = 2 and employs a special-purpose solver based on the split Bregman method (Goldstein & Osher, 2009) to perform total variance minimization efficiently.<br />
The effectiveness of TV minimization is illustrated by the images in the middle column of the figure below: in particular, note that the adversarial perturbations that were present in the background of the non-transformed image (see bottom-left image) have nearly completely disappeared in the TV-minimized adversarial image (bottom-center image). As expected, TV minimization also changes image structure in non-homogeneous regions of the image, but as these perturbations were not adversarially designed, the negative effect of these changes is expected to be limited.<br />
<br />
[[File:tvx.png]]<br />
<br />
The figure above illustrates total variance minimization and image quilting applied to an original and an adversarial image (produced using I-FGSM with ε = 0.03, corresponding to a normalized L2-dissimilarity of 0.075). From left to right, the columns correspond to (1) no transformation, (2) total variance minimization, and (3) image quilting. From top to bottom, rows correspond to: (1) the original image, (2) the corresponding adversarial image produced by I-FGSM, and (3) the absolute difference between the two images above. Difference images were multiplied by a constant scaling factor to increase visibility.<br />
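<br />
To make the total variation defense more concrete, the sketch below evaluates an anisotropic <math>L_p</math> total variation and a masked reconstruction objective for a candidate image. It is an assumption-laden illustration rather than the paper's implementation: the exact formulas are the ones shown in the equation images above, the names <code>total_variation</code>, <code>sample_pixel_mask</code> and <code>tv_objective</code> are hypothetical, the mask is taken to be 1 for the pixels that are kept, and the minimization itself (split Bregman in the paper) is not implemented.<br />

```python
import numpy as np

def total_variation(z, p=2):
    """Anisotropic L_p total variation of an H x W x C image (one common form)."""
    z = np.asarray(z, dtype=np.float64)
    row_diff = z[1:, :, :] - z[:-1, :, :]   # differences between adjacent rows
    col_diff = z[:, 1:, :] - z[:, :-1, :]   # differences between adjacent columns
    # L_p norm of each difference vector, summed over positions and channels.
    tv_rows = np.sum(np.linalg.norm(row_diff, ord=p, axis=1))
    tv_cols = np.sum(np.linalg.norm(col_diff, ord=p, axis=0))
    return float(tv_rows + tv_cols)

def sample_pixel_mask(shape, keep_prob=0.5, seed=0):
    """Bernoulli mask X(i, j, k): 1 where a pixel is kept, 0 where it is dropped."""
    rng = np.random.default_rng(seed)
    return (rng.random(shape) < keep_prob).astype(np.float64)

def tv_objective(z, x, mask, lam=0.03, p=2):
    """Reconstruction error on the kept pixels plus lam times the total variation."""
    data_term = np.linalg.norm((mask * (z - x)).ravel())
    return float(data_term + lam * total_variation(z, p))
```

A constant image has zero total variation, while adding pixel-level noise increases it, which is why minimizing this objective tends to suppress small, high-frequency perturbations.<br />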
<br />
<br />
'''Image Quilting (Efros & Freeman, 2001) [8]'''<br />
Image quilting is a non-parametric technique that synthesizes images by piecing together small patches taken from a database of image patches. The algorithm places appropriate patches from the database at a predefined set of grid points and computes minimum graph cuts in all overlapping boundary regions to remove edge artifacts. Image quilting can be used to remove adversarial perturbations by constructing a patch database that only contains patches from "clean" images (without adversarial perturbations); the patches used to create the synthesized image are selected by finding the K nearest neighbors (in pixel space) of the corresponding patch from the adversarial image in the patch database, and picking one of these neighbors uniformly at random. The motivation for this defense is that the resulting image only contains pixels that were not modified by the adversary: the database of real patches is unlikely to contain the structures that appear in adversarial images.<br />
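<br />
The patch-selection step described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function name <code>quilt_patch</code> is hypothetical, nearest neighbors are found by brute force with Euclidean distance in pixel space, and the graph-cut blending of overlapping patches is omitted entirely.<br />

```python
import numpy as np

def quilt_patch(patch, clean_patches, k=3, rng=None):
    """Replace a patch by one of its k nearest clean patches, chosen at random."""
    rng = np.random.default_rng() if rng is None else rng
    database = np.asarray(clean_patches, dtype=np.float64)
    flat_db = database.reshape(len(database), -1)
    query = np.asarray(patch, dtype=np.float64).ravel()
    # Euclidean distance in pixel space from the query to every database patch.
    distances = np.linalg.norm(flat_db - query, axis=1)
    k = min(k, len(database))
    nearest = np.argsort(distances)[:k]
    # Uniform random choice among the k nearest neighbors.
    return database[rng.choice(nearest)]
```

The random choice among neighbors is the source of the stochasticity that the paper relies on: the same adversarial patch can be replaced by different clean patches on different runs.<br />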
<br />
=Experiments=<br />
<br />
Five experiments were performed to test the efficacy of the defenses. The first four experiments consider gray-box and black-box attacks. In the gray-box setting, the defenses are applied to adversarial images before they are fed to the convolutional network; the adversary can read the model architecture and parameters but does not know the defense strategy. In the black-box setting, the convolutional network is replaced by a network trained on transformed images. The final experiment compares the authors' defenses with prior work. <br />
<br />
'''Set up:'''<br />
Experiments are performed on the ImageNet image classification dataset. The dataset comprises 1.2 million training images and 50,000 test images that correspond to one of 1000 classes. The adversarial images are produced by attacking a ResNet-50 model with the attacks described in the Adversarial Attacks section. The strength of an adversary is measured in terms of its normalized L2-dissimilarity. To produce the adversarial images, the normalized L2-dissimilarity of each attack was varied as follows:<br />
<br />
- FGSM. Increasing the step size <math>\epsilon</math>, increases the normalized L2-dissimilarity.<br />
<br />
- I-FGSM. We fix M=10, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.<br />
<br />
- DeepFool. We fix M=5, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.<br />
<br />
- CW-L2. We fix <math>k</math>=0 and <math>\lambda_{f}</math> =10, and multiply the resulting perturbation by a scaling factor to alter the normalized L2-dissimilarity.<br />
<br />
The hyperparameters of the defenses are fixed in all the experiments. Specifically, the pixel dropout probability is set to <math>p</math>=0.5 and the regularization parameter of the total variation minimizer to <math>\lambda_{TV}</math>=0.03.<br />
<br />
The figure below shows how the setup differs between the experiments that follow. The network is either trained on a) regular images or b) transformed images. The different settings are marked by 8.1, 8.2 and 8.3 <br />
[[File:models3.png |center]] <br />
<br />
==GrayBox - Image Transformation at Test Time== <br />
This experiment applies a transformation to adversarial images at test time before feeding them to a ResNet-50 that was trained to classify clean images. The figure below shows the Top-1 accuracy for the five transformations. Some interesting observations from the plot: all of the image transformations partly eliminate the effects of the attack; the crop ensemble gives the best accuracy, around 40-60 percent, with an ensemble size of 30; and the accuracy of the image quilting defense hardly deteriorates as the strength of the adversary increases, although it does reduce accuracy on non-adversarial examples.<br />
<br />
[[File:sFig4.png|center|600px |]]<br />
<br />
==BlackBox - Image Transformation at Training and Test Time==<br />
A ResNet-50 model was trained on transformed ImageNet training images. Before the images were fed to the network for training, standard data augmentation (from He et al.) along with bit depth reduction, JPEG compression, TV minimization, or image quilting was applied to them. The classification accuracy on the same adversarial images as in the previous case is shown in the figure below. (The adversary cannot use this trained model to generate new images; hence this is treated as a black-box setting.) The figure shows that training convolutional neural networks on images that are transformed in the same way as at test time dramatically improves the effectiveness of all transformation defenses. Nearly 80-90% of the attacks are defended successfully, even when the L2-dissimilarity is high.<br />
<br />
<br />
[[File:sFig5.png|center|600px |]]<br />
<br />
<br />
==Blackbox - Ensembling==<br />
Four networks (ResNet-50, ResNet-101, DenseNet-169, and Inception-v4) along with an ensemble of defenses were studied, as shown in Table 1. The adversarial images are produced by attacking a ResNet-50 model. The results in the table show that Inception-v4 performs best. This could be because that network has a higher accuracy even in non-adversarial settings. The best ensemble of defenses achieves an accuracy of about 71% against all the other attacks. The attacks deteriorate the accuracy of the best defenses (a combination of cropping, TVM, image quilting, and model transfer) by at most 6%. Gains of 1-2% in classification accuracy could be found from ensembling different defenses, while gains of 2-3% were found from transferring attacks to different network architectures.<br />
<br />
<br />
[[File:sTab1.png|600px|thumb|center|Table 1. Top-1 classification accuracy of ensemble and model transfer defenses (columns) against four black-box attacks (rows). The four networks we use to classify images are ResNet-50 (RN50), ResNet-101 (RN101), DenseNet-169 (DN169), and Inception-v4 (Iv4). Adversarial images are generated by running attacks against the ResNet-50 model, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. Higher is better. The best defense against each attack is typeset in boldface.]]<br />
<br />
==GrayBox - Image Transformation at Training and Test Time ==<br />
In this experiment, the adversary has access to the network and its parameters (but not to the input transformations applied at test time). Novel adversarial images were generated by the four attack methods against the networks trained in the previous black-box experiment (image transformation at training and test time). The results show that bit depth reduction and JPEG compression are weak defenses in such a gray-box setting. In contrast, image cropping and rescaling, total variation minimization, and image quilting are more robust against adversarial images in this setting.<br />
The results for this experiment are shown in the figure below. Networks using these defenses classify up to 50% of images correctly.<br />
<br />
[[File:sFig6.png|center| 600px |]]<br />
<br />
==Comparison With Ensemble Adversarial Training==<br />
The results of the experiment are compared with the state-of-the-art ensemble adversarial training approach proposed by Tramèr et al. [2]. Ensemble adversarial training fits the parameters of a convolutional neural network on adversarial examples that were generated to attack an ensemble of pre-trained models. The model released by Tramèr et al. [2] is an Inception-Resnet-v2 trained on adversarial examples generated by FGSM against Inception-Resnet-v2 and Inception-v3 models. The authors compared it with their ResNet-50 models using image cropping, total variance minimization, and image quilting defenses. Two differences in assumptions should be noted: the authors' defenses assume that the input transformation is unknown to the adversary, and they use no prior knowledge of the attacks. The results of ensemble training and the pre-processing techniques mentioned in this paper are shown in Table 2. The results show that ensemble adversarial training works better on FGSM attacks (which it uses at training time), but is outperformed by each of the transformation-based defenses on all other attacks.<br />
<br />
<br />
<br />
[[File:sTab2.png|600px|thumb|center|Table 2. Top-1 classification accuracy on images perturbed using attacks against ResNet-50 models trained on input-transformed images and an Inception-v4 model trained using ensemble adversarial training. Adversarial images are generated by running attacks against the models, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. The best defense against each attack is typeset in boldface.]]<br />
<br />
=Discussion/Conclusions=<br />
The paper proposed reasonable approaches to countering adversarial images. The authors evaluated Total Variance Minimization and Image Quilting and compared them with previously proposed ideas such as Image Cropping and Rescaling, Bit Depth Reduction, and JPEG Compression and Decompression on the challenging ImageNet dataset.<br />
Previous work by Wang et al. [10] shows that a strong input defense should be non-differentiable and randomized. Two of the defenses, namely Total Variation Minimization and Image Quilting, possess both properties.<br />
<br />
Image quilting involves a discrete variable that selects a patch from the database, which is a non-differentiable operation. Additionally, total variation minimization randomly selects the pixels it uses to measure reconstruction error when creating the de-noised image, and image quilting selects one of the K nearest neighbors uniformly at random. This inherent randomness makes it difficult to attack the model. <br />
<br />
As future work, the authors suggest applying the same techniques to other domains such as speech recognition and image segmentation. For example, in speech recognition, total variance minimization can be used to remove perturbations from waveforms, and "spectrogram quilting" techniques that reconstruct a spectrogram could be developed. The proposed input-transformation defenses can also be combined with ensemble adversarial training by Tramèr et al. [2] to study new attack methods.<br />
<br />
=Critiques=<br />
1. The terms Black Box, White Box, and Grey Box attack are not defined precisely or clearly.<br />
<br />
2. White Box attacks could have been considered, where the adversary has full access to the model as well as the pre-processing techniques.<br />
<br />
3. Though the authors did considerable work in showing the effect of four attacks on the ImageNet database, much stronger attacks (Madry et al.) [7] could have been evaluated.<br />
<br />
4. Authors claim that the success rate is generally measured as a function of the magnitude of perturbations, performed by the attack using the L2- dissimilarity, but the claim is not supported by any references. None of the previous work has used these metrics.<br />
<br />
5. ([https://openreview.net/forum?id=SyJ7ClWCb]) In the new draft of the paper, the authors add the sentence "our defenses assume that part of the defense strategy (viz., the input transformation) is unknown to the adversary".<br />
<br />
This is a completely unreasonable assumption. Any algorithm which hopes to be secure must allow the adversary to, at the very least, understand what the defense is that's being used. Consider a world where the defense here is implemented in practice: any attacker in the world could just go look up the paper, read the description of the algorithm, and know how it works.<br />
<br />
=References=<br />
<br />
1. Chuan Guo, Mayank Rana, Moustapha Cissé, and Laurens van der Maaten. Countering Adversarial Images Using Input Transformations.<br />
<br />
2. Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, Patrick McDaniel, Ensemble Adversarial Training: Attacks and defenses.<br />
<br />
3. Abigail Graese, Andras Rozsa, and Terrance E. Boult. Assessing threat of adversarial examples of deep neural networks. CoRR, abs/1610.04256, 2016. <br />
<br />
4. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Adversary resistant deep neural networks with an application to malware detection. CoRR, abs/1610.01239, 2016a.<br />
<br />
5. Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. CoRR, abs/1704.01155, 2017. <br />
<br />
6. Gintare Karolina Dziugaite, Zoubin Ghahramani, and Daniel Roy. A study of the effect of JPG compression on adversarial images. CoRR, abs/1608.00853, 2016.<br />
<br />
7. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv:1706.06083v3.<br />
<br />
8. Alexei Efros and William Freeman. Image quilting for texture synthesis and transfer. In Proc. SIGGRAPH, pp. 341–346, 2001.<br />
<br />
9. Leonid Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60:259–268, 1992.<br />
<br />
10. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Learning adversary-resistant deep neural networks. CoRR, abs/1612.01401, 2016b.<br />
<br />
11. Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. CoRR, abs/1611.02770, 2016.<br />
<br />
12. Moustapha Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet. Houdini: Fooling deep structured prediction models. CoRR, abs/1707.05373, 2017 <br />
<br />
13. Marco Melis, Ambra Demontis, Battista Biggio, Gavin Brown, Giorgio Fumera, and Fabio Roli. Is deep learning safe for robot vision? Adversarial examples against the iCub humanoid. CoRR, abs/1708.06939, 2017.<br />
<br />
14. Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016b.<br />
<br />
15. Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: A simple and accurate method to fool deep neural networks. In Proc. CVPR, pp. 2574–2582, 2016.<br />
<br />
16. Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pp. 39–57, 2017.<br />
<br />
17. Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In Proc. ICLR, 2015.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Countering_Adversarial_Images_Using_Input_Transformations&diff=42136Countering Adversarial Images Using Input Transformations2018-11-30T22:38:12Z<p>Z43ma: </p>
<hr />
<div>The code for this paper is available here[https://github.com/facebookresearch/adversarial_image_defenses]<br />
<br />
==Motivation ==<br />
As the use of machine intelligence has increased, robustness has become a critical feature to guarantee the reliability of deployed machine-learning systems. However, recent research has shown that existing models are not robust to small, adversarially designed perturbations to the input. Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake. Adversarially perturbed examples have been deployed to attack image classification services (Liu et al., 2016)[11], speech recognition systems (Cisse et al., 2017a)[12], and robot vision (Melis et al., 2017)[13]. The existence of these adversarial examples has motivated proposals for approaches that increase the robustness of learning systems to such examples. In the example below (Goodfellow et al.) [17], a small perturbation is applied to the original image of a panda, changing the prediction to a gibbon.<br />
<br />
[[File:Panda.png|center]]<br />
<br />
==Introduction==<br />
The paper studies strategies that defend against adversarial example attacks on image classification systems by transforming the images before feeding them to a Convolutional Network Classifier. <br />
Generally, defenses against adversarial examples fall into two main categories:<br />
<br />
# Model-Specific – They enforce model properties such as smoothness and invariance via the learning algorithm. <br />
# Model-Agnostic – They try to remove adversarial perturbations from the input. <br />
<br />
Model-specific defense strategies make strong assumptions about expected adversarial attacks. As a result, they violate Kerckhoffs's principle, under which a defense should remain effective even when the adversary knows how it works: an adversary can circumvent a model-specific defense simply by changing how the attack is executed. This paper focuses on increasing the effectiveness of model-agnostic defense strategies. Specifically, the authors investigate the following image transformations as a means of protecting against adversarial images:<br />
<br />
# Image Cropping and Re-scaling (Graese et al., 2016) <br />
# Bit Depth Reduction (Xu et al., 2017) <br />
# JPEG Compression (Dziugaite et al., 2016) <br />
# Total Variance Minimization (Rudin et al., 1992) <br />
# Image Quilting (Efros & Freeman, 2001) <br />
<br />
These image transformations have been studied against adversarial attacks such as the fast gradient sign method (Goodfellow et al., 2015), its iterative extension (Kurakin et al., 2016a), DeepFool (Moosavi-Dezfooli et al., 2016), and the Carlini & Wagner (2017) <math>L_2</math> attack. <br />
<br />
The authors aim to increase the effectiveness of model-agnostic defense strategies through approaches that:<br />
# remove the adversarial perturbations from input images,<br />
# maintain sufficient information in input images to correctly classify them,<br />
# and are still effective in situations where the adversary has information about the defense strategy being used.<br />
<br />
From their experiments, the strongest defenses are based on Total Variance Minimization and Image Quilting. These defenses are non-differentiable and inherently random, which makes it difficult for an adversary to get around them.<br />
<br />
==Previous Work==<br />
Recently, a lot of research has focused on countering adversarial threats. Wang et al. [4] proposed a new adversary-resistant technique that obstructs attackers from constructing impactful adversarial images by randomly nullifying features within images. Tramèr et al. [2] presented the state-of-the-art Ensemble Adversarial Training method, which augments the training data not only with adversarial images constructed from the model being trained but also with adversarial images generated from an ensemble of other models. Their method, implemented on an Inception V2 classifier, finished 1st among 70 submissions in the NIPS 2017 competition on Defenses against Adversarial Attacks. Graese et al. [3] showed how input transformations such as shifting, blurring, and noise can render the majority of adversarial examples non-adversarial. Xu et al. [5] demonstrated how feature squeezing methods, such as reducing the color bit depth of each pixel and spatial smoothing, defend against attacks. Dziugaite et al. [6] studied the effect of JPG compression on adversarial images. Chen et al. [7] introduced an advanced denoising algorithm with GAN-based noise modeling in order to improve blind denoising performance in low-level vision processing. The GAN is trained to estimate the noise distribution over the input noisy images and to generate noise samples. Although meant for image processing, this method can be generalized to target adversarial examples, where the unknown noise-generating algorithm can be leveraged.<br />
<br />
==Terminology==<br />
<br />
'''Gray Box Attack''': The model architecture and parameters are public, but the adversary is unaware of the defense strategy being used.<br />
<br />
'''Black Box Attack''': Adversary does not have access to the model.<br />
<br />
An interesting and important observation of adversarial examples is that they generally are not model or architecture specific. Adversarial examples generated for one neural network architecture will transfer very well to another architecture. In other words, if you wanted to trick a model you could create your own model and adversarial examples based off of it. Then these same adversarial examples will most probably trick the other model as well. This has huge implications as it means that it is possible to create adversarial examples for a completely black box model where we have no prior knowledge of the internal mechanics. [https://ml.berkeley.edu/blog/2018/01/10/adversarial-examples/ reference]<br />
<br />
'''Non Targeted Adversarial Attack''': The goal of the attack is to modify a source image in a way such that the image will be classified incorrectly by the network.<br />
<br />
An example of a non-targeted adversarial attack [https://ml.berkeley.edu/blog/2018/01/10/adversarial-examples/ reference]:<br />
[[File:non-targeted O.JPG| 600px|center]]<br />
<br />
'''Targeted Adversarial Attack''': The goal of the attack is to modify a source image in such a way that the image will be classified as a ''target'' class by the network.<br />
<br />
An example of a targeted adversarial attack [https://ml.berkeley.edu/blog/2018/01/10/adversarial-examples/ reference]:<br />
[[File:Targeted O.JPG| 600px|center]]<br />
<br />
'''Defense''': A defense is a strategy that aims to make the prediction on an adversarial example h(x') equal to the prediction on the corresponding clean example h(x).<br />
<br />
== Problem Definition ==<br />
The paper discusses non-targeted adversarial attacks for image recognition systems. Given image space <math>\mathcal{X} = [0,1]^{H \times W \times C}</math>, a source image <math>x \in \mathcal{X}</math>, and a classifier <math>h(\cdot)</math>, a non-targeted adversarial example of <math>x</math> is a perturbed image <math>x'</math> such that <math>h(x) \neq h(x')</math> and <math>d(x, x') \leq \rho</math> for some dissimilarity function <math>d(\cdot, \cdot)</math> and <math>\rho \geq 0</math>. Ideally, <math>d(\cdot, \cdot)</math> measures the perceptual difference between the original image <math>x</math> and the perturbed image <math>x'</math>, but in practice the Euclidean distance (<math>||x - x'||_2</math>) or the Chebyshev distance (<math>||x - x'||_{\infty}</math>) is usually used.<br />
<br />
From a set of N clean images <math>[{x_{1}, \ldots, x_{N}}]</math>, an adversarial attack aims to generate images <math>[{x'_{1}, \ldots, x'_{N}}]</math> such that <math>x'_{n}</math> is an adversarial example of <math>x_{n}</math>.<br />
<br />
The success rate of an attack is given as: <br />
<br />
<center><math><br />
\frac{1}{N}\sum_{n=1}^{N}I[h(x_n) \neq h(x'_n)],<br />
</math></center><br />
<br />
which is the proportion of predictions that were altered by an attack.<br />
<br />
The success rate is generally measured as a function of the magnitude of perturbations performed by the attack. In this paper, L2 perturbations are used and are quantified using the normalized L2-dissimilarity metric:<br />
<math> \frac{1}{N} \sum_{n=1}^N{\frac{\vert \vert x_n - x'_n \vert \vert_2}{\vert \vert x_n \vert \vert_2}} </math><br />
<br />
A strong adversarial attack has a high success rate while its normalized L2-dissimilarity, given by the above equation, is low.<br />
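<br />
The two quantities above can be computed directly from model predictions and pixel values. The following is a minimal sketch, assuming images are given as flat lists of pixel values in [0, 1]; the function names and the toy data are illustrative and are not taken from the paper's code.<br />
<br />
```python
import math


def success_rate(clean_preds, adv_preds):
    """Fraction of predictions changed by the attack: (1/N) * sum I[h(x_n) != h(x'_n)]."""
    changed = sum(1 for c, a in zip(clean_preds, adv_preds) if c != a)
    return changed / len(clean_preds)


def l2_norm(v):
    """Euclidean norm of a flat list of numbers."""
    return math.sqrt(sum(t * t for t in v))


def normalized_l2_dissimilarity(clean_images, adv_images):
    """(1/N) * sum ||x_n - x'_n||_2 / ||x_n||_2 over pairs of flat pixel lists."""
    total = 0.0
    for x, x_adv in zip(clean_images, adv_images):
        diff = [a - b for a, b in zip(x, x_adv)]
        total += l2_norm(diff) / l2_norm(x)
    return total / len(clean_images)


# Toy data (made up): three 4-pixel "images" and hypothetical class predictions.
clean = [[0.2, 0.4, 0.6, 0.8], [0.5, 0.5, 0.5, 0.5], [0.1, 0.9, 0.1, 0.9]]
adv = [[0.25, 0.35, 0.65, 0.75], [0.5, 0.5, 0.5, 0.5], [0.15, 0.85, 0.15, 0.85]]
print(success_rate([3, 7, 1], [5, 7, 2]))  # two of the three predictions changed
print(normalized_l2_dissimilarity(clean, adv))
```
<br />
A stronger attack pushes the first number up while keeping the second one small.<br />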
<br />
In most practical settings, an adversary does not have direct access to the model <math>h(·)</math> and has to do a black-box attack. <br />
<br />
However, prior work has shown successful attacks by transferring adversarial examples generated for a separately-trained model to an unknown target model (Liu et al., 2016), thus allowing efficient black-box attacks. <br />
<br />
As a result, the authors investigate both the black-box setting and a more difficult gray-box setting, in which the adversary has access to the model architecture and the model parameters but is unaware of the defense strategy that is being used.<br />
<br />
A defense is an approach that aims to make the prediction on an adversarial example <math>h(x')</math> equal to the prediction on the corresponding clean example <math>h(x)</math>. The study focuses on image-transformation defenses <math>g(x)</math> that perform prediction via <math>h(g(x'))</math>. Ideally, <math>g(\cdot)</math> is a complex, non-differentiable, and potentially stochastic function: this makes it difficult for an adversary to attack the prediction model <math>h(g(x))</math> even when the adversary knows both <math>h(\cdot)</math> and <math>g(\cdot)</math>.<br />
<br />
==Adversarial Attacks==<br />
<br />
Although the exact effect that adversarial examples have on the network is unknown, the Deep Learning book by Ian Goodfellow et al. states that adversarial examples exploit the linearity of neural networks to perturb the cost function and force incorrect classifications. Images are often high resolution, and thus have thousands of pixels (millions for HD images). A perturbation within an epsilon ball, when the dimensionality is in the thousands or millions, greatly affects the cost function (especially if it increases the loss at every pixel). Hence, although methods such as FGSM and Iterative FGSM are very straightforward, they greatly influence the network under a white-box attack. <br />
<br />
For experimental purposes, the following four attacks are studied in the paper:<br />
<br />
1. '''Fast Gradient Sign Method (FGSM; Goodfellow et al. (2015)) [17]''': Given a source input <math>x</math> and true label <math>y</math>, let <math>l(\cdot,\cdot)</math> be the differentiable loss function used to train the classifier <math>h(\cdot)</math>. Then the corresponding adversarial example is given by:<br />
<br />
<math>x' = x + \epsilon \cdot sign(\nabla_x l(x, y))</math><br />
<br />
for some <math>\epsilon \gt 0</math> which controls the perturbation magnitude.<br />
<br />
2. '''Iterative FGSM (I-FGSM; Kurakin et al. (2016b)) [14]''': iteratively applies the FGSM update for M iterations. It is given as:<br />
<br />
<math>x^{(m)} = x^{(m-1)} + \epsilon \cdot sign(\nabla_{x^{(m-1)}} l(x^{(m-1)}, y))</math><br />
<br />
where <math>m = 1,...,M; x^{(0)} = x;</math> and <math>x' = x^{(M)}</math>. M is set such that <math>h(x) \neq h(x')</math>.<br />
<br />
Both FGSM and I-FGSM work by minimizing the Chebyshev distance between the inputs and the generated adversarial examples.<br />
<br />
3. '''DeepFool (Moosavi-Dezfooli et al., 2016) [15]''': projects x onto a linearization of the decision boundary defined by the binary classifier h(.) for M iterations. This can be particularly effective when a network uses ReLU activation functions. It is given as:<br />
<br />
[[File:DeepFool.PNG|400px |]]<br />
<br />
4. '''Carlini-Wagner's L2 attack (CW-L2; Carlini & Wagner (2017)) [16]''': an optimization-based attack that combines a differentiable surrogate for the model’s classification accuracy with an L2-penalty term that encourages the adversarial image to be close to the original image. Let <math>Z(x)</math> be the operation that computes the logit vector (i.e., the output before the softmax layer) for an input <math>x</math>, and <math>Z(x)_k</math> be the logit value corresponding to class <math>k</math>. The untargeted variant of CW-L2 finds a solution to an unconstrained optimization problem, given as:<br />
<br />
[[File:Carlini.PNG|500px |]]<br />
<br />
As mentioned earlier, the first two attacks minimize the Chebyshev distance whereas the last two attacks minimize the Euclidean distance between the inputs and the adversarial examples.<br />
<br />
All the methods described above maintain <math>x' \in \mathcal{X}</math> by performing value clipping. <br />
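<br />
To make the FGSM and I-FGSM updates concrete, the following is a minimal sketch on a toy logistic-regression classifier, chosen only because its input gradient has a closed form; the weights, input, and step size are made-up values, and the paper itself attacks a ResNet-50. Value clipping keeps the result in <math>[0,1]</math>, and the iterative loop stops as soon as the prediction changes.<br />
<br />
```python
import math


def sign(t):
    """Sign of a number as -1, 0 or 1."""
    return (t > 0) - (t < 0)


def clip01(v):
    """Value clipping so that the image stays inside [0, 1]."""
    return [min(1.0, max(0.0, t)) for t in v]


def logit(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(x, w, b):
    """Toy binary classifier h(x): class 1 if the logit is positive, else class 0."""
    return int(logit(x, w, b) > 0)


def loss_grad_x(x, y, w, b):
    """Gradient of the cross-entropy loss w.r.t. the input for p = sigmoid(w.x + b)."""
    p = 1.0 / (1.0 + math.exp(-logit(x, w, b)))
    return [(p - y) * wi for wi in w]


def fgsm(x, y, w, b, eps):
    """One step: x' = x + eps * sign(grad_x loss), followed by clipping."""
    g = loss_grad_x(x, y, w, b)
    return clip01([xi + eps * sign(gi) for xi, gi in zip(x, g)])


def i_fgsm(x, y, w, b, eps, max_iters):
    """Repeat the FGSM step until the prediction changes or max_iters is reached."""
    x_adv = list(x)
    for _ in range(max_iters):
        if predict(x_adv, w, b) != predict(x, w, b):
            break
        x_adv = fgsm(x_adv, y, w, b, eps)
    return x_adv


# Hypothetical model parameters and input (not from the paper).
w, b = [2.0, -1.0, 0.5], -0.4
x, y = [0.6, 0.3, 0.4], 1
x_one = fgsm(x, y, w, b, eps=0.08)
x_iter = i_fgsm(x, y, w, b, eps=0.08, max_iters=10)
print(predict(x, w, b), predict(x_one, w, b), predict(x_iter, w, b))
```
<br />
In this toy setting a single step of size 0.08 is too small to change the prediction, while repeating the same step a few times is enough, which mirrors why the iterative variant is the stronger attack.<br />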
<br />
The figure below shows adversarial images and corresponding perturbations at five levels of normalized L2-dissimilarity for all four attacks mentioned above.<br />
<br />
[[File:Strength.PNG|thumb|center| 600px |Figure 1: Adversarial images and corresponding perturbations at five levels of normalized L2- dissimilarity for all four attacks.]]<br />
<br />
==Defenses==<br />
A defense is a strategy that aims to make the prediction on an adversarial example equal to the prediction on the corresponding clean example. The particular structure of the adversarial perturbations <math> x-x' </math> is shown in Figure 1.<br />
Five image transformations that alter the structure of these perturbations have been studied:<br />
# Image Cropping and Re-scaling, <br />
# Bit Depth Reduction, <br />
# JPEG Compression, <br />
# Total Variance Minimization, <br />
# Image Quilting.<br />
<br />
'''Image Cropping and Rescaling''' has the effect of altering the spatial positioning of the adversarial perturbation. In this study, images are cropped and re-scaled at training time as part of data augmentation. At test time, the predictions over random image crops are averaged.<br />
<br />
'''Bit Depth Reduction (Xu et al.) [5]''' performs a simple type of quantization that can remove small (adversarial) variations in pixel values from an image. Images are reduced to 3 bits in the experiment.<br />
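<br />
As an illustration of this kind of quantization, the sketch below rounds each pixel value in [0, 1] to the nearest of <math>2^3 = 8</math> evenly spaced levels. This is one common way to reduce bit depth and is an assumption here, not necessarily the exact procedure used in the paper's code.<br />
<br />
```python
def reduce_bit_depth(pixels, bits=3):
    """Quantize pixel values in [0, 1] to 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return [round(p * levels) / levels for p in pixels]


# A small perturbation that stays within one quantization bin is removed entirely.
clean_pixel, perturbed_pixel = 0.43, 0.45
print(reduce_bit_depth([clean_pixel]), reduce_bit_depth([perturbed_pixel]))
```
<br />
Perturbations that push a pixel across a bin boundary survive quantization, which is one reason this defense only removes small variations.<br />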
<br />
'''JPEG Compression and Decompression (Dziugaite et al., 2016)''' removes small perturbations by performing simple quantization. The authors use a quality level of 75/100 in their experiments.<br />
<br />
'''Total Variance Minimization (Rudin et al.) [9]''':<br />
This combines pixel dropout with total variance minimization. The approach randomly selects a small set of pixels and reconstructs the “simplest” image that is consistent with the selected pixels. The reconstructed image does not contain the adversarial perturbations because these perturbations tend to be small and localized. Specifically, a random set of pixels is first selected by sampling a Bernoulli random variable <math>X(i, j, k)</math> for each pixel location <math>(i, j, k)</math>; a pixel is maintained when <math>X(i, j, k) = 1</math>. Next, total variation minimization is used to construct an image z that is similar to the (perturbed) input image x for the selected set of pixels, whilst also being “simple” in terms of total variation, by solving:<br />
<br />
[[File:TV!.png|300px|]] , <br />
<br />
where <math>TV_{p}(z)</math> represents <math>L_{p}</math> total variation of '''z''' :<br />
<br />
[[File:TV2.png|500px|]]<br />
<br />
The total variation (TV) measures the amount of fine-scale variation in the image z, as a result of which TV minimization encourages removal of small (adversarial) perturbations in the image. The objective function is convex in <math>z</math>, which makes solving for z straightforward. In the paper, p = 2, and a special-purpose solver based on the split Bregman method (Goldstein & Osher, 2009) is employed to perform total variance minimization efficiently.<br />
The effectiveness of TV minimization is illustrated by the images in the middle column of the figure below: in particular, note that the adversarial perturbations that were present in the background of the non-transformed image (see bottom-left image) have nearly completely disappeared in the TV-minimized adversarial image (bottom-center image). As expected, TV minimization also changes image structure in non-homogeneous regions of the image, but as these perturbations were not adversarially designed, the negative effect of these changes is expected to be limited.<br />
<br />
[[File:tvx.png]]<br />
<br />
The figure above represents an illustration of total variance minimization and image quilting applied to an original and an adversarial image (produced using I-FGSM with ε = 0.03, corresponding to a normalized L2 - dissimilarity of 0.075). From left to right, the columns correspond to (1) no transformation, (2) total variance minimization, and (3) image quilting. From top to bottom, rows correspond to: (1) the original image, (2) the corresponding adversarial image produced by I-FGSM, and (3) the absolute difference between the two images above. Difference images were multiplied by a constant scaling factor to increase visibility.<br />
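<br />
The following is a minimal sketch of the quantities involved in the total variance minimization defense, for a single-channel image stored as a list of rows. It assumes the objective has the form "reconstruction error on the maintained pixels plus <math>\lambda_{TV}</math> times the total variation", with the total variation taken as the sum of <math>L_p</math> norms of differences between neighbouring rows and neighbouring columns. It only evaluates the objective and does not implement the split Bregman solver; all function names are illustrative.<br />
<br />
```python
import math
import random


def total_variation(z, p=2):
    """L_p total variation of a single-channel image z given as a list of rows."""
    h, w = len(z), len(z[0])
    tv = 0.0
    for i in range(1, h):  # differences between neighbouring rows
        tv += sum(abs(z[i][j] - z[i - 1][j]) ** p for j in range(w)) ** (1.0 / p)
    for j in range(1, w):  # differences between neighbouring columns
        tv += sum(abs(z[i][j] - z[i][j - 1]) ** p for i in range(h)) ** (1.0 / p)
    return tv


def sample_mask(h, w, keep_prob, rng):
    """Bernoulli mask X: 1 means the pixel is maintained, 0 means it is dropped."""
    return [[1 if rng.random() < keep_prob else 0 for _ in range(w)] for _ in range(h)]


def tvm_objective(z, x, mask, lam_tv=0.03, p=2):
    """Assumed objective: L2 error on maintained pixels plus lam_tv * TV_p(z)."""
    h, w = len(z), len(z[0])
    err = math.sqrt(sum(mask[i][j] * (z[i][j] - x[i][j]) ** 2
                        for i in range(h) for j in range(w)))
    return err + lam_tv * total_variation(z, p)


# Toy example: a flat image with one perturbed pixel that the mask happens to drop.
flat = [[0.5] * 4 for _ in range(4)]
noisy = [row[:] for row in flat]
noisy[1][1] = 0.6
mask = [[1] * 4 for _ in range(4)]
mask[1][1] = 0
print(tvm_objective(flat, noisy, mask), tvm_objective(noisy, noisy, mask))
print(sample_mask(2, 3, 0.5, random.Random(0)))
```
<br />
When the perturbed pixel is dropped, the flat image reaches a lower objective value than the perturbed input itself, so the minimizer prefers the reconstruction without the perturbation.<br />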
<br />
<br />
'''Image Quilting (Efros & Freeman, 2001) [8]'''<br />
Image Quilting is a non-parametric technique that synthesizes images by piecing together small patches that are taken from a database of image patches. The algorithm places appropriate patches from the database at a predefined set of grid points and computes minimum graph cuts in all overlapping boundary regions to remove edge artifacts. Image Quilting can be used to remove adversarial perturbations by constructing a patch database that only contains patches from "clean" images (without adversarial perturbations); the patches used to create the synthesized image are selected by finding the K nearest neighbors (in pixel space) of the corresponding patch from the adversarial image in the patch database, and picking one of these neighbors uniformly at random. The motivation for this defense is that the resulting image only contains pixels that were not modified by the adversary, since the database of real patches is unlikely to contain the structures that appear in adversarial images.<br />
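<br />
The patch-selection step described above can be sketched as follows. This is a deliberately simplified illustration: patches are flat lists of pixel values, they do not overlap, and the minimum graph cut that removes edge artifacts is omitted. The function names and toy database are hypothetical.<br />
<br />
```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two patches in pixel space."""
    return sum((s - t) ** 2 for s, t in zip(a, b))


def quilt(patches, database, k, rng):
    """Replace each input patch by one of its k nearest clean patches, chosen uniformly."""
    out = []
    for patch in patches:
        neighbours = sorted(database, key=lambda d: sq_dist(patch, d))[:k]
        out.append(rng.choice(neighbours))
    return out


# Toy database of "clean" 4-pixel patches and two slightly perturbed input patches.
database = [[0.0] * 4, [1.0] * 4, [0.5] * 4]
patches = [[0.05, 0.0, 0.1, 0.0], [0.9, 1.0, 0.95, 1.0]]
print(quilt(patches, database, k=1, rng=random.Random(0)))
```
<br />
Every pixel of the output comes from the clean database, and for K > 1 the random choice among neighbours is the source of the randomness that makes this defense hard to anticipate.<br />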
<br />
=Experiments=<br />
<br />
Five experiments were performed to test the efficacy of the defenses. The first four experiments consider gray-box and black-box attacks. In the gray-box setting, the adversary is able to read the model architecture and parameters but does not know the defense strategy; the defenses are applied to the adversarial input images before they reach the convolutional network. In the black-box setting, the convolutional network is replaced by a network trained on transformed images, to which the adversary has no access. The final experiment compares the authors' defenses with prior work. <br />
<br />
'''Set up:'''<br />
Experiments are performed on the ImageNet image classification dataset. The dataset comprises 1.2 million training images and 50,000 test images, each corresponding to one of 1000 classes. The adversarial images are produced by attacking a ResNet-50 model with the attacks described in the Adversarial Attacks section. The strength of an adversary is measured in terms of its normalized L2-dissimilarity. To produce the adversarial images, the L2-dissimilarity for each attack was controlled as follows:<br />
<br />
- FGSM. Increasing the step size <math>\epsilon</math> increases the normalized L2-dissimilarity.<br />
<br />
- I-FGSM. We fix M=10, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.<br />
<br />
- DeepFool. We fix M=5, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.<br />
<br />
- CW-L2. We fix <math>k</math>=0 and <math>\lambda_{f}</math> =10, and multiply the resulting perturbation <br />
<br />
The hyperparameters of the defenses were fixed in all the experiments. Specifically, the pixel dropout probability was set to <math>p</math>=0.5 and the regularization parameter of the total variation minimizer to <math>\lambda_{TV}</math>=0.03.<br />
<br />
The figure below shows how the setup differs between the experiments. The network is trained on either a) regular images or b) transformed images. The different settings are marked by 8.1, 8.2 and 8.3. <br />
[[File:models3.png |center]] <br />
<br />
==GrayBox - Image Transformation at Test Time== <br />
This experiment applies a transformation to adversarial images at test time before feeding them to a ResNet-50 that was trained to classify clean images. The figure below shows the Top-1 accuracy for the five transformations. Some interesting observations from the plot: all of the image transformations partly eliminate the effects of the attack; the crop ensemble gives the best accuracy, around 40-60 percent, with an ensemble size of 30; and the accuracy of the Image Quilting defense hardly deteriorates as the strength of the adversary increases, although it does reduce accuracy on non-adversarial examples.<br />
<br />
[[File:sFig4.png|center|600px |]]<br />
<br />
==BlackBox - Image Transformation at Training and Test Time==<br />
A ResNet-50 model was trained on transformed ImageNet training images. Before feeding the images to the network for training, standard data augmentation (from He et al.) along with bit depth reduction, JPEG compression, TV minimization, or image quilting was applied to the images. The classification accuracy on the same adversarial images as in the previous experiment is shown in the figure below. (The adversary cannot use this trained model to generate new adversarial images; hence, this is treated as a black-box setting.) The figure shows that training convolutional neural networks on images that are transformed in the same way as at test time dramatically improves the effectiveness of all transformation defenses. Nearly 80-90% of the attacks are defended successfully, even when the L2-dissimilarity is high.<br />
<br />
<br />
[[File:sFig5.png|center|600px |]]<br />
<br />
<br />
==Blackbox - Ensembling==<br />
Four networks, ResNet-50, ResNet-101, DenseNet-169, and Inception-v4, along with an ensemble of defenses were studied, as shown in Table 1. The adversarial images are produced by attacking a ResNet-50 model. The results in the table show that Inception-v4 performs best, possibly because that network has a higher accuracy even in non-adversarial settings. The best ensemble of defenses achieves an accuracy of about 71% against all the other attacks. The attacks deteriorate the accuracy of the best defenses (a combination of cropping, TVM, image quilting, and model transfer) by at most 6%. Gains of 1-2% in classification accuracy were obtained from ensembling different defenses, while gains of 2-3% were obtained from transferring attacks to different network architectures.<br />
<br />
<br />
[[File:sTab1.png|600px|thumb|center|Table 1. Top-1 classification accuracy of ensemble and model transfer defenses (columns) against four black-box attacks (rows). The four networks we use to classify images are ResNet-50 (RN50), ResNet-101 (RN101), DenseNet-169 (DN169), and Inception-v4 (Iv4). Adversarial images are generated by running attacks against the ResNet-50 model, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. Higher is better. The best defense against each attack is typeset in boldface.]]<br />
<br />
==GrayBox - Image Transformation at Training and Test Time ==<br />
In this experiment, the adversary has access to the network and its parameters (but does not have access to the input transformations applied at test time). Novel adversarial images were generated by the four attack methods against the networks trained in the section "BlackBox - Image Transformation at Training and Test Time". The results show that bit depth reduction and JPEG compression are weak defenses in such a gray-box setting. In contrast, image cropping and rescaling, total variance minimization, and image quilting are more robust against adversarial images in this setting.<br />
The results for this experiment are shown in the figure below. Networks using these defenses classify up to 50% of images correctly.<br />
<br />
[[File:sFig6.png|center| 600px |]]<br />
<br />
==Comparison With Ensemble Adversarial Training==<br />
The results of the experiment are compared with the state-of-the-art ensemble adversarial training approach proposed by Tramèr et al. [2]. Ensemble adversarial training fits the parameters of a convolutional neural network on adversarial examples that were generated to attack an ensemble of pre-trained models. The model released by Tramèr et al. [2] is an Inception-Resnet-v2, trained on adversarial examples generated by FGSM against Inception-Resnet-v2 and Inception-v3 models. The authors compared it with their ResNet-50 models using the image cropping, total variance minimization, and image quilting defenses. Two differences in assumptions should be noted: unlike ensemble adversarial training, the authors' defenses assume that the input transformation is unknown to the adversary, and they use no prior knowledge of the attacks. The results of ensemble training and the pre-processing techniques mentioned in this paper are shown in Table 2. The results show that ensemble adversarial training works better on FGSM attacks (which it uses at training time), but is outperformed by each of the transformation-based defenses on all other attacks.<br />
<br />
<br />
<br />
[[File:sTab2.png|600px|thumb|center|Table 2. Top-1 classification accuracy on images perturbed using attacks against ResNet-50 models trained on input-transformed images and an Inception-v4 model trained using ensemble adversarial. Adversarial images are generated by running attacks against the models, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. The best defense against each attack is typeset in boldface.]]<br />
<br />
=Discussion/Conclusions=<br />
The paper proposed reasonable approaches to countering adversarial images. The authors evaluated Total Variance Minimization and Image Quilting and compared them with previously proposed ideas such as Image Cropping and Rescaling, Bit Depth Reduction, and JPEG Compression and Decompression on the challenging ImageNet dataset.<br />
Previous work by Wang et al. [10] shows that a strong input defense should be non-differentiable and randomized. Two of the defenses, namely Total Variation Minimization and Image Quilting, possess both properties.<br />
<br />
Image quilting involves a discrete variable that selects a patch from the database, which is a non-differentiable operation. Additionally, total variation minimization randomly selects the pixels it uses to measure reconstruction error when creating the de-noised image, and image quilting selects one of the K nearest neighbors uniformly at random. This inherent randomness makes it difficult to attack the model. <br />
<br />
As future work, the authors suggest applying the same techniques to other domains such as speech recognition and image segmentation. For example, in speech recognition, total variance minimization can be used to remove perturbations from waveforms, and "spectrogram quilting" techniques that reconstruct a spectrogram could be developed. The proposed input-transformation defenses can also be combined with ensemble adversarial training by Tramèr et al. [2] to study new attack methods.<br />
<br />
=Critiques=<br />
1. The terminology of Black Box, White Box, and Gray Box attacks is not clearly defined.<br />
<br />
2. White Box attacks, where the adversary has full access to the model as well as the pre-processing techniques, could have been considered.<br />
<br />
3. Though the authors did considerable work in showing the effect of four attacks on the ImageNet database, much stronger attacks (Madry et al.) [7] could have been evaluated.<br />
<br />
4. The authors claim that the success rate is generally measured as a function of the magnitude of the perturbations performed by the attack, using the L2-dissimilarity, but the claim is not supported by any references. None of the previous work has used these metrics.<br />
<br />
5. ([https://openreview.net/forum?id=SyJ7ClWCb]) In the new draft of the paper, the authors add the sentence "our defenses assume that part of the defense strategy (viz., the input transformation) is unknown to the adversary".<br />
<br />
This is a completely unreasonable assumption. Any algorithm which hopes to be secure must allow the adversary to, at the very least, understand what the defense is that's being used. Consider a world where the defense here is implemented in practice: any attacker in the world could just go look up the paper, read the description of the algorithm, and know how it works.<br />
<br />
=References=<br />
<br />
1. Chuan Guo, Mayank Rana, Moustapha Cissé, and Laurens van der Maaten. Countering Adversarial Images Using Input Transformations.<br />
<br />
2. Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, Patrick McDaniel, Ensemble Adversarial Training: Attacks and defenses.<br />
<br />
3. Abigail Graese, Andras Rozsa, and Terrance E. Boult. Assessing threat of adversarial examples of deep neural networks. CoRR, abs/1610.04256, 2016. <br />
<br />
4. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Adversary resistant deep neural networks with an application to malware detection. CoRR, abs/1610.01239, 2016a.<br />
<br />
5. Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. CoRR, abs/1704.01155, 2017. <br />
<br />
6. Gintare Karolina Dziugaite, Zoubin Ghahramani, and Daniel Roy. A study of the effect of JPG compression on adversarial images. CoRR, abs/1608.00853, 2016.<br />
<br />
7. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv:1706.06083v3.<br />
<br />
8. Alexei Efros and William Freeman. Image quilting for texture synthesis and transfer. In Proc. SIGGRAPH, pp. 341–346, 2001.<br />
<br />
9. Leonid Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60:259–268, 1992.<br />
<br />
10. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Learning adversary-resistant deep neural networks. CoRR, abs/1612.01401, 2016b.<br />
<br />
11. Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. CoRR, abs/1611.02770, 2016.<br />
<br />
12. Moustapha Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet. Houdini: Fooling deep structured prediction models. CoRR, abs/1707.05373, 2017 <br />
<br />
13. Marco Melis, Ambra Demontis, Battista Biggio, Gavin Brown, Giorgio Fumera, and Fabio Roli. Is deep learning safe for robot vision? adversarial examples against the icub humanoid. CoRR,abs/1708.06939, 2017.<br />
<br />
14. Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016b.<br />
<br />
15. Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: A simple and accurate method to fool deep neural networks. In Proc. CVPR, pp. 2574–2582, 2016.<br />
<br />
16. Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pp. 39–57, 2017.<br />
<br />
17. Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In Proc. ICLR, 2015.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Obfuscated_Gradients_Give_a_False_Sense_of_Security_Circumventing_Defenses_to_Adversarial_Examples&diff=42135Obfuscated Gradients Give a False Sense of Security Circumventing Defenses to Adversarial Examples2018-11-30T22:33:01Z<p>Z43ma: </p>
<hr />
<div>= Introduction =<br />
<br />
Over the past few years, neural network models have been the source of major breakthroughs in a variety of computer vision problems. However, these networks have been shown to be susceptible to adversarial attacks. In these attacks, small humanly-imperceptible changes are made to images (that are originally correctly classified) which causes these models to misclassify with high confidence. These attacks pose a major threat that needs to be addressed before these systems can be deployed on a large scale, especially in safety-critical scenarios. <br />
<br />
The seriousness of this threat has generated major interest in both the design of such attacks and the defense against them. Recently, many new defenses have been proposed that claim robustness against iterative white-box adversarial attacks. This result is somewhat surprising, given that iterative white-box attacks are among the strongest classes of adversarial attacks. In this paper, the authors identify a common flaw, masked gradients, in many of these defenses that causes them to ''appear'' to achieve high accuracy on adversarial images. This flaw is so prevalent that 7 out of the 9 defenses proposed at the ICLR 2018 conference were found to contain it. The authors develop three attacks that specifically target masked gradients and show that the actual accuracy of these defenses is much lower than claimed. In fact, the majority of these defenses were found to be ineffective against true iterative white-box attacks.<br />
<br />
= Methodology =<br />
<br />
The paper assumes a lot of familiarity with adversarial attack literature. The section below briefly explains some key concepts.<br />
<br />
== Background ==<br />
<br />
==== Adversarial Images Mathematically ====<br />
<br />
Given an image <math>x</math> and a classifier <math>f(x)</math>, an adversarial image <math>x'</math> satisfies two properties:<br />
# <math>D(x,x') < \epsilon </math><br />
# <math>c(x') \neq c^*(x) </math><br />
<br />
Where <math>D</math> is some distance metric, <math>\epsilon </math> is a small constant, <math>c(x')</math> is the output ''class'' predicted by the model, and <math>c^*(x)</math> is the true class for input x. In words, the adversarial image is a small distance from the original image, but the classifier classifies it incorrectly.<br />
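<br />
The two conditions can be checked mechanically. The sketch below is only an illustration: it assumes flat lists of pixel values, uses the <math>\ell_\infty</math> distance for <math>D</math>, and uses a made-up toy classifier (<code>toy_classify</code>) that is not part of the paper.<br />

```python
def linf_distance(x, x_adv):
    """Largest absolute per-pixel difference between two flat images."""
    return max(abs(a - b) for a, b in zip(x, x_adv))


def is_adversarial(x, x_adv, true_class, classify, eps):
    """True if x_adv is within eps of x and the classifier gets it wrong."""
    close = linf_distance(x, x_adv) < eps      # condition 1: D(x, x') < eps
    wrong = classify(x_adv) != true_class      # condition 2: c(x') != c*(x)
    return close and wrong


def toy_classify(img):
    """Hypothetical classifier: class 1 if the mean brightness exceeds 0.5."""
    return int(sum(img) / len(img) > 0.5)


x = [0.49, 0.50, 0.50, 0.50]       # mean below 0.5, so class 0
x_adv = [0.52, 0.53, 0.50, 0.50]   # small change pushes the mean above 0.5
print(is_adversarial(x, x_adv, true_class=0, classify=toy_classify, eps=0.031))
```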
<br />
==== Adversarial Attacks Terminology ====<br />
#Adversarial attacks can be either '''black''' or '''white-box'''. In black box attacks, the attacker has access to the network output only, while white-box attackers have full access to the network, including its gradients, architecture and weights. This makes white-box attackers much more powerful. Given access to gradients, white-box attacks use back propagation to modify inputs (as opposed to the weights) with respect to the loss function.<br />
#In '''untargeted''' attacks, the objective is to ''maximize'' the loss of the true class, <math>x'=x \mathbf{+} \lambda(sign(\nabla_xL(x,c^*(x))))</math>, while in '''targeted''' attacks, the objective is to ''minimize'' the loss for a target class <math>c^t(x)</math> that is different from the true class, <math>x'=x \mathbf{-} \lambda(sign(\nabla_xL(x,c^t(x))))</math>. Here, <math>\nabla_xL()</math> is the gradient of the loss function with respect to the input, <math>\lambda</math> is a small gradient step and <math>sign()</math> is the sign of the gradient.<br />
# An attacker may be allowed to use a single step of back-propagation ('''single step''') or multiple ('''iterative''') steps. Iterative attackers can generate more powerful adversarial images. Typically, to bound iterative attackers a distance measure is used.<br />
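<br />
As a rough illustration of the two update rules, the sketch below applies a single signed-gradient step to a tiny logistic-regression model. The model, its weights <code>w</code>, <code>b</code> and the step size <code>lam</code> are assumptions chosen for the example; for this model with cross-entropy loss the input gradient is <math>(p - y)w</math>.<br />

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sign(v):
    return (v > 0) - (v < 0)


def loss_grad_x(x, label, w, b):
    """Gradient of the cross-entropy loss w.r.t. the input x (not the weights)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - label) * wi for wi in w]


def untargeted_step(x, true_label, w, b, lam):
    """Move x to increase the loss of the true class."""
    g = loss_grad_x(x, true_label, w, b)
    return [xi + lam * sign(gi) for xi, gi in zip(x, g)]


def targeted_step(x, target_label, w, b, lam):
    """Move x to decrease the loss of the chosen target class."""
    g = loss_grad_x(x, target_label, w, b)
    return [xi - lam * sign(gi) for xi, gi in zip(x, g)]


w, b = [2.0, -1.0], 0.0
x = [0.5, 0.2]
print(untargeted_step(x, 1, w, b, lam=0.1))
print(targeted_step(x, 0, w, b, lam=0.1))
```

With only two classes, pushing away from class 1 and pulling toward class 0 produce the same step; with more classes the two rules generally differ. An iterative attack repeats such steps while keeping the result within the allowed distance of the original image.<br />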
<br />
In this paper the authors focus on the attacks that are hardest to defend against: white-box iterative attacks, both targeted and untargeted.<br />
<br />
== Obfuscated Gradients ==<br />
<br />
If gradients are masked, they cannot be followed to generate adversarial images. Gradient masking is known to be an incomplete defense against adversarial images [Papernot et al., 2017; Tramer et al., 2018]. A defense method may appear to provide robustness when, in reality, the gradients of the network simply cannot be followed to generate strong adversarial images. Adversarial images generated from such networks are much weaker and, when used to evaluate model robustness, give a false sense of security against adversarial attacks. The authors use the term obfuscated gradients for the special case in which a defense is designed in such a way that it inevitably leads to gradient masking.<br />
<br />
Some defences break gradient descent deliberately, others may do it unintentionally. Some indicators of a broken gradient descent are as follows:<br />
<br />
#One-step attacks perform better than iterative attacks. Iterative attacks are strictly stronger, so this should not be the case. If single-step methods work better, it is a sign that the iterative attack is becoming stuck in a local minimum.<br />
#Black-box attacks work better than white-box attacks. The black-box threat model is a strict subset of the white-box threat model, so white-box attacks should perform better. When a defence obfuscates gradients, black-box attacks (which do not use gradients) often perform better.<br />
#Unbounded attacks do not reach 100% success. With unbounded distortion, any classifier should eventually fail. An attack that doesn’t achieve this should be improved (i.e., it’s a weak attack, not necessarily a strong defence).<br />
#Random sampling finds adversarial examples. Random sampling (e.g., randomly sampling <math>10^5</math> or more points within some <math>\epsilon</math>-ball) should not find adversarial examples when gradient-based attacks do not.<br />
#Increasing the distortion bound does not increase success. Usually, a monotonically increasing attack success rate with increasing distortion bound is expected.<br />
Among the defenses proposed at ICLR 2018, the authors identify three ways in which defenses obfuscate gradients:<br />
<br />
# '''Shattered gradients''': Non-differentiable operations are introduced into the model, causing the gradient to be nonexistent or incorrect. They are introduced by operations for which following the gradient does not maximize the classification loss globally. <br />
# '''Stochastic gradients''': A stochastic process is added to the model at test time, causing the gradients to become randomized. They are introduced either by randomly transforming inputs before feeding them to the classifier, or by randomizing the network itself. <br />
# '''Vanishing gradients''': Very deep neural networks or those with recurrent connections are used. Because of the vanishing or exploding gradient problem common in these deep networks, effective gradients at the input are small and not very useful. They are introduced by using multiple iterations of neural network evaluation, where the output of one network is fed as the input to the next.<br />
<br />
'''Detecting Obfuscated Gradients''':<br />
<br />
The authors propose a number of tests that might help detect when a defence relies on obfuscated gradients.<br />
<br />
# Iterative attacks should work better than single-step attacks, since iterative attacks are strictly stronger than single-step attacks.<br />
# White-box attacks should perform better than black-box attacks, since the black-box threat model is a strict subset of the white-box threat model.<br />
# Attacks with unbounded distortion (e.g. under the <math>\ell_2</math> norm) should find adversarial examples with 100% success.<br />
# Optimization-based attacks should perform better than brute-force sampling of nearby inputs (sampling within an <math>\epsilon</math>-ball).<br />
These tests may not cover all cases of obfuscated gradients, but they indicate when intuitive properties start to break down. All defences with obfuscated gradients discussed by the authors fail at least one test.<br />
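<br />
The indicators listed above can be turned into a simple sanity check over measured attack success rates. The function below is a hypothetical helper, not something provided by the paper: the argument names and the dictionary format for <code>by_bound</code> are assumptions made for this sketch, and all rates are fractions in [0, 1].<br />

```python
def obfuscation_warnings(single_step, iterative, black_box,
                         unbounded, random_search, by_bound):
    """Return warning strings for success rates that break expected behaviour.

    single_step, iterative: white-box gradient attacks at the same bound.
    black_box, random_search: attacks that do not use the model's gradients.
    unbounded: iterative attack with no distortion bound.
    by_bound: dict mapping distortion bound -> iterative attack success rate.
    """
    warnings = []
    if single_step > iterative:
        warnings.append("one-step attack beats iterative attack")
    if black_box > iterative:
        warnings.append("black-box attack beats white-box attack")
    if unbounded < 1.0:
        warnings.append("unbounded attack does not reach 100% success")
    if random_search > iterative:
        warnings.append("random sampling beats gradient-based attack")
    rates = [by_bound[eps] for eps in sorted(by_bound)]
    if any(later < earlier for earlier, later in zip(rates, rates[1:])):
        warnings.append("success rate drops as the distortion bound grows")
    return warnings


suspicious = obfuscation_warnings(
    single_step=0.40, iterative=0.10, black_box=0.30,
    unbounded=0.70, random_search=0.20,
    by_bound={0.01: 0.10, 0.03: 0.08, 0.10: 0.12},
)
for message in suspicious:
    print(message)
```

An empty list does not prove that gradients are sound; as noted above, the tests may not cover every case.<br />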
<br />
== The Attacks ==<br />
<br />
To circumvent these gradient masking techniques, the authors propose:<br />
# '''Backward Pass Differentiable Approximation (BPDA)''': For defences that introduce non-differentiable components, the authors replace those components with an approximate function that is differentiable on the backward pass. In a white-box setting, the attacker has full access to any added non-linear transformation and can find its approximation. <br />
# '''Expectation over Transformation [Athalye, 2017]''': For defences that add some form of test-time randomness, the authors propose to use the expectation over transformation technique in the backward pass. Rather than moving along a single sampled gradient at every step, several gradients are sampled and the step is taken in the average direction. This counteracts the stochastic misdirection of individual gradients. The technique is similar to mini-batch gradient descent, but applied to the construction of adversarial images.<br />
# '''Re-parameterize the exploration space''': For defences that rely on vanishing or exploding gradients, the authors propose to re-parameterize the input and search over a space in which the gradient does not explode or vanish.<br />
For the third attack, they assume that given a classifier <math display = "inline">f(g(x))</math>, <math display = "inline">g(·)</math> performs some optimization loop to transform the input x to a new input <math display = "inline">\hat x</math>. Often, differentiating through <math display = "inline">g(·)</math> yields exploding or vanishing gradients.<br />
<br />
To resolve this, they make a change-of-variable <math display = "inline">x = h(z)</math> for some function <math display = "inline">h(·)</math> such that <math display = "inline">g(h(z)) = h(z)</math> for all z, but <math display = "inline">h(·)</math> is differentiable. This allows them to compute gradients through f(h(z)) and hence circumvent the defense.<br />
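<br />
To make the BPDA idea concrete, here is a minimal sketch under strong simplifying assumptions: the "defence" is a plain rounding step (<code>quantize</code>), the classifier is a two-weight logistic model, and the backward pass treats the rounding step as the identity. None of these choices come from the paper's experiments; they only show the forward/backward mismatch that BPDA exploits.<br />

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sign(v):
    return (v > 0) - (v < 0)


def quantize(x, levels=4):
    """Toy non-differentiable preprocessing: its true gradient is zero almost everywhere."""
    return [round(xi * levels) / levels for xi in x]


def logit(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(x, w, b):
    """Defended classifier: quantize first, then classify."""
    return int(logit(quantize(x), w, b) > 0)


def bpda_grad(x, label, w, b):
    """Forward pass through the real defence; backward pass pretends d quantize / dx = I."""
    p = sigmoid(logit(quantize(x), w, b))
    return [(p - label) * wi for wi in w]


def bpda_attack(x, label, w, b, eps, step, iters):
    x_adv = list(x)
    for _ in range(iters):
        g = bpda_grad(x_adv, label, w, b)
        x_adv = [xa + step * sign(gi) for xa, gi in zip(x_adv, g)]
        # Project back into the l_inf ball around x, then into the pixel range.
        x_adv = [min(max(xa, xo - eps), xo + eps) for xa, xo in zip(x_adv, x)]
        x_adv = [min(max(xa, 0.0), 1.0) for xa in x_adv]
    return x_adv


w, b = [4.0, -4.0], 0.0
x = [0.7, 0.3]
x_adv = bpda_attack(x, label=1, w=w, b=b, eps=0.4, step=0.1, iters=5)
print(predict(x, w, b), predict(x_adv, w, b))
```

Differentiating through <code>quantize</code> directly would give a zero gradient and the attack would never move; substituting the identity on the backward pass restores a usable direction.<br />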
<br />
= Main Results =<br />
<br />
[[File:Summary_Table.png|600px|center]]<br />
<br />
The table above summarizes the results of their attacks. Attacks are mounted on the same dataset each defence targeted. If multiple datasets were used, attacks were performed on the largest one. Two different distance metrics (<math>\ell_{\infty}</math> and <math>\ell_{2}</math>) were used in the construction of adversarial images. Distance metrics specify how much an adversarial image can vary from an original image. For <math>\ell_{\infty}</math> adversarial images, each pixel is allowed to vary by a maximum amount. For example, <math>\ell_{\infty}=0.031</math> specifies that each pixel can vary from its original value by about 8 intensity levels on a 0 to 255 scale (<math>0.031 \times 255 \approx 8</math>). <math>\ell_{2}</math> distances specify the magnitude of the total distortion allowed over all pixels. For MNIST and CIFAR-10, untargeted adversarial images were constructed using the entire test set, while for ImageNet, 1000 test images were randomly selected and used to generate targeted adversarial images. <br />
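<br />
The difference between the two distance constraints is easiest to see as a projection step. The sketch below, which assumes images stored as flat lists of floats in [0, 1], clips a perturbed image back into an <math>\ell_{\infty}</math> ball (per-pixel limit) or an <math>\ell_{2}</math> ball (limit on the total distortion). The function names are illustrative.<br />

```python
import math


def project_linf(x_adv, x, eps):
    """Clip every pixel so that it differs from the original by at most eps."""
    return [min(max(a, o - eps), o + eps) for a, o in zip(x_adv, x)]


def project_l2(x_adv, x, eps):
    """Shrink the whole perturbation so that its Euclidean norm is at most eps."""
    delta = [a - o for a, o in zip(x_adv, x)]
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= eps:
        return list(x_adv)
    scale = eps / norm
    return [o + d * scale for o, d in zip(x, delta)]


x = [0.5, 0.5]
print(project_linf([0.9, 0.1], x, eps=0.031))
print(project_l2([0.9, 0.1], x, eps=0.031))
print(0.031 * 255)  # per-pixel budget expressed in 8-bit intensity levels
```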
<br />
Standard models were used in evaluating the accuracy of defense strategies under the attacks:<br />
# MNIST: 5-layer Convolutional Neural Network (99.3% top-1 accuracy)<br />
# CIFAR-10: Wide-Resnet (95.0% top-1 accuracy)<br />
# Imagenet: InceptionV3 (78.0% top-1 accuracy)<br />
<br />
The last column shows the accuracy each defence method achieved on the adversarial test set. Except for [Madry, 2018], all defence methods could only achieve an accuracy below 10%. Furthermore, the accuracy of most methods was 0%. The results for [Samangouei, 2018] (double asterisk) show that the attack was not as successful against this defence. The authors claim that this is a result of implementation imperfections and that, theoretically, the defense can be circumvented using their proposed method.<br />
<br />
==== The defense that worked - Adversarial Training [Madry, 2018] ====<br />
<br />
As a defense mechanism, [Madry, 2018] proposes training the neural networks with adversarial images. Although this approach was previously known [Szegedy, 2013], in their formulation the problem is set up in a more systematic way using a min-max formulation:<br />
\begin{align}<br />
\theta^* = \arg \min_{\theta} \mathbb{E}_{x} \bigg[ \max_{\delta \in [-\epsilon,\epsilon]} L(x+\delta,y;\theta) \bigg] <br />
\end{align}<br />
<br />
where <math>\theta</math> is the parameter of the model, <math>\theta^*</math> is the optimal set of parameters and <math>\delta</math> is a small perturbation to the input image <math>x</math> and is bounded by <math>[-\epsilon,\epsilon]</math>. <br />
<br />
Training proceeds in the following way. For each clean input image, a distorted version of the image is found by approximately solving the inner maximization problem for a fixed number of iterations. After each gradient step, the perturbation is projected back into the allowed range (projected gradient descent). Next, the model parameters are updated by taking a step on the outer minimization problem, using the distorted images.<br />
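<br />
The training loop can be sketched on a toy problem. Everything below is an assumption made for illustration (a one-dimensional logistic model, four hand-picked data points, the learning rate and the number of inner steps); it mirrors the min-max structure but is not the setup used by [Madry, 2018].<br />

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sign(v):
    return (v > 0) - (v < 0)


def pgd(x, y, w, b, eps, step, iters):
    """Inner maximization: ascend the loss, projecting onto the eps-ball around x."""
    x_adv = list(x)
    for _ in range(iters):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x_adv)) + b)
        grad_x = [(p - y) * wi for wi in w]
        x_adv = [xa + step * sign(g) for xa, g in zip(x_adv, grad_x)]
        x_adv = [min(max(xa, xo - eps), xo + eps) for xa, xo in zip(x_adv, x)]
    return x_adv


def adversarial_train(data, eps, epochs=500, lr=0.5):
    """Outer minimization: SGD on the loss evaluated at the perturbed inputs."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for x, y in data:
            x_adv = pgd(x, y, w, b, eps, step=eps / 2, iters=4)
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x_adv)) + b)
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x_adv)]
            b -= lr * (p - y)
    return w, b


def predict(x, w, b):
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)


data = [([0.9], 1), ([0.8], 1), ([0.1], 0), ([0.2], 0)]
w, b = adversarial_train(data, eps=0.1)
print([predict(x, w, b) for x, _ in data])
```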
<br />
Among the defenses in the table, this approach was the most resilient to the authors' attacks.<br />
<br />
==== How to check for Obfuscated Gradients ====<br />
For future defence proposals, it is recommended to avoid relying on masked gradients. To assist with this, the authors propose a set of conditions that can help identify whether a defence relies on masked gradients:<br />
# Weaker one-step attacks perform better than iterative attacks.<br />
# Black-box attacks find stronger adversarial images than white-box attacks.<br />
# Unbounded iterative attacks do not reach 100% success.<br />
# Random brute-force attempts are better than gradient-based methods at finding adversarial images.<br />
<br />
= Detailed Results =<br />
<br />
As a case study for evaluating the prevalence of obfuscated gradients, the authors studied the ICLR 2018 non-certified defenses that argue robustness in a white-box threat model. Each of these defenses argues a high robustness to adaptive, white box attacks. It is reported that seven of these nine defenses depend on this phenomenon, and the authors demonstrate that their techniques can completely circumvent six of those (and partially circumvent one) that depend on obfuscated gradients.<br />
<br />
== Non-obfuscated Gradients ==<br />
<br />
==== Cascade Adversarial Training, [Na, 2018] ====<br />
<br />
'''Defense''': Similar to the method of [Madry, 2018], the authors of [Na, 2018] propose adversarial training. The main difference is that instead of using iterative methods to generate adversarial examples at each mini-batch, a separate model is first trained and used to generate adversarial images. These adversarial images are used to augment the training set of another model.<br />
<br />
'''Attack''': The authors found that this technique does not use obfuscated gradients. They were not able to reduce the performance of this method. However, they point out that the claimed accuracy is much lower (15%) compared with [Madry, 2018] under the same perturbation setting.<br />
<br />
== Gradient Shattering ==<br />
<br />
==== Thermometer Coding, [Buckman, 2018] ====<br />
'''Defense''': Inspired by the observation that neural networks learn linear boundaries between classes [Goodfellow, 2014], [Buckman, 2018] sought to break this linearity by explicitly adding a highly non-linear transform at the input of their model. The non-linear transformation they chose was quantizing inputs to binary vectors. The quantization performed was termed thermometer encoding.<br />
<br />
Given an image, for each pixel value <math>x_{i,j,c}</math> and an <math>l</math>-dimensional thermometer code, the <math>k</math>-th bit is given by:<br />
\begin{align}<br />
\tau(x_{i,j,c})_k = \begin{cases}<br />
1 & \text{if } x_{i,j,c} > \dfrac{k}{l} \\<br />
0 & \text{otherwise} \\<br />
\end{cases}<br />
\end{align}<br />
Here it is assumed <math>x_{i,j,c} \in [0, 1] </math> and <math>i, j, c</math> are the row, column and channel index of the pixel respectively. This encoding is like one-hot encoding, except that every bit whose threshold <math>k/l</math> lies below the pixel value (not just one bit) is set to 1. This quantization technique preserves pairwise ordering between pixels.<br />
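<br />
A direct transcription of the encoding for a single pixel is shown below. The index range <math>k = 1, \dots, l</math> is an assumption, since the summary does not state where <math>k</math> starts.<br />

```python
def thermometer(x, l):
    """Thermometer code of one pixel value x in [0, 1] with l bits.

    Bit k (assumed to run over k = 1..l) is 1 exactly when x > k / l.
    """
    return [1 if x > k / l else 0 for k in range(1, l + 1)]


print(thermometer(0.66, 10))
print(thermometer(0.25, 10))
```

The number of leading ones grows with the pixel value, which is how the code preserves the ordering between pixels.<br />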
<br />
On CIFAR-10, the model reached 50% accuracy against <math>\ell_\infty</math> adversarial images with <math>\epsilon=0.031</math>.<br />
<br />
'''Attack''': The authors attack this model using their BPDA approach. Given the non-linear transformation performed in the forward pass, <math>\tau(x)</math>, they develop a differentiable counterpart,<br />
\begin{align}<br />
\hat{\tau}(x_{i,j,c})_k = \min \left( \max \left(x_{i,j,c} - \frac{k}{l}, 0\right), 1 \right)<br />
\end{align}<br />
and use it in place of <math>\tau(x)</math> on the backward pass. With their modifications they were able to bring the accuracy of the model down to 0%.<br />
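<br />
The differentiable counterpart and its derivative can be written out for one pixel. This sketch assumes the clipped form <math>\min(\max(x - k/l, 0), 1)</math> and, as a further assumption, bit indices <math>k = 1, \dots, l</math>.<br />

```python
def tau_hat(x, l):
    """Clipped-linear stand-in for the thermometer code of one pixel value."""
    return [min(max(x - k / l, 0.0), 1.0) for k in range(1, l + 1)]


def tau_hat_grad(x, l):
    """Derivative of each entry of tau_hat with respect to x."""
    return [1.0 if 0.0 < x - k / l < 1.0 else 0.0 for k in range(1, l + 1)]


print(tau_hat(0.66, 10))
print(tau_hat_grad(0.66, 10))
```

The hard threshold has zero gradient almost everywhere, whereas this surrogate passes a gradient of 1 through exactly those bits whose threshold lies below the pixel value, so backpropagation to the input becomes informative again.<br />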
<br />
==== Input Transformation, [Guo, 2018] ====<br />
<br />
'''Defence''': [Guo, 2018] investigated the effect of different input transformations on robustness to adversarial images. As a baseline, the authors evaluate image cropping and rescaling, bit-depth reduction, and JPEG compression. In particular, they found that two techniques provided the greatest resistance: total variation minimization and image quilting. Total variation minimization is a technique that removes high-frequency noise while preserving legitimate edges (good high-frequency components). In image quilting, a large database of image patches from clean images is collected. At test time, input patches that contain a lot of noise are replaced with similar but clean patches from the database. The authors explore different combinations of input transformations along with different underlying ImageNet classifiers, including adversarially trained models. They find that input transformations provide protection even with a vanilla classifier.<br />
<br />
Both techniques remove perturbations from adversarial images, which provides some robustness to adversarial attacks. The best model achieved 60% accuracy on adversarial images with <math>\ell_{2}=0.05</math> perturbations. However, both approaches are non-differentiable and involve test-time randomness, as the modifications made are input dependent. As a result, the gradient with respect to the input is both unavailable through direct differentiation and randomized.<br />
<br />
'''Attack''': The authors used the BPDA attack where the input transformations were replaced by an identity function. They were able to bring the accuracy of the model down to 0% under the same type of adversarial attacks.<br />
<br />
==== Local Intrinsic Dimensionality, [Ma, 2018] ====<br />
<br />
'''Defense''': Local intrinsic dimensionality (LID) is a distance-based metric that measures the similarity between points in a high dimensional space. Given a set of points, let the distance between sample <math>x</math> and its <math>i</math>-th nearest neighbor be <math>r_i(x)</math>; then the LID under the chosen distance metric is given by,<br />
<br />
\begin{align}<br />
LID(x) = - \bigg{(} \frac{1}{k}\sum^k_{i=1} \log \frac{r_i(x)}{r_k(x)} \bigg{)}^{-1}<br />
\end{align}<br />
where <math>k</math> is the number of nearest neighbors considered and <math>r_k(x)</math> is the distance to the farthest of these <math>k</math> neighbors. <br />
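<br />
The estimator is short enough to write out. The sketch below assumes Euclidean distances and strictly positive neighbour distances that are not all equal (otherwise the logarithm or the division is undefined); the helper names are illustrative.<br />

```python
import math


def knn_distances(x, points, k):
    """Sorted Euclidean distances from x to its k nearest neighbours in points."""
    return sorted(math.dist(x, p) for p in points)[:k]


def lid(distances):
    """LID estimate from sorted neighbour distances r_1 <= ... <= r_k."""
    k = len(distances)
    r_k = distances[-1]
    mean_log = sum(math.log(r / r_k) for r in distances) / k
    return -1.0 / mean_log


neighbours = [[3.0, 4.0], [0.0, 1.0], [6.0, 8.0]]
print(knn_distances([0.0, 0.0], neighbours, k=2))
print(lid([1.0, 2.0, 4.0]))
```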
<br />
First, <math>L_2</math> distances were computed for all training and adversarial images. Next, the LID scores for each training and adversarial image were calculated. It was found that LID scores for adversarial images were significantly larger than those of clean images. Based on these results, a separate classifier was created that can be used to detect adversarial inputs. [Ma, 2018] claim that this is not a defence method, but a method to study the properties of adversarial images.<br />
<br />
'''Attack''': Instead of attacking this method, the authors show that this method is not able to detect, and is therefore vulnerable to, attacks of the [Carlini and Wagner, 2017a] variety.<br />
<br />
== Stochastic Gradients ==<br />
<br />
==== Stochastic Activation Pruning, [Dhillon, 2018] ====<br />
'''Defence''': [Dhillon, 2018] use test-time randomness in their model to guard against adversarial attacks. Because adversarial perturbations are like noise, randomly dropping activations can decrease their collective impact on the classifier. Within a layer, the activations of component nodes are randomly dropped with a probability proportional to their absolute value. The remaining activations are scaled up to preserve accuracy. This is akin to test-time dropout. This technique was found to reduce accuracy slightly on clean images, but to improve performance on adversarial images.<br />
<br />
'''Attack''': The authors used the expectation over transformation attack to get useful gradients out of the model. With their attack, they were able to reduce the accuracy of this method down to 0% on CIFAR-10.<br />
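<br />
The expectation over transformation idea can be sketched with a deliberately simplified stand-in for a randomized defence: a logistic model that randomly masks its inputs on every forward pass. This is not stochastic activation pruning itself; the model, the masking rule and the parameter names are assumptions made for the example.<br />

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sampled_grad(x, label, w, b, rng, keep=0.5):
    """Input gradient of the loss for ONE random draw of the defence's mask."""
    mask = [1.0 if rng.random() < keep else 0.0 for _ in x]
    z = sum(wi * mi * xi / keep for wi, mi, xi in zip(w, mask, x)) + b
    p = sigmoid(z)
    return [(p - label) * wi * mi / keep for wi, mi in zip(w, mask)]


def eot_grad(x, label, w, b, rng, n_samples=200, keep=0.5):
    """Average the input gradient over many draws of the randomness."""
    total = [0.0] * len(x)
    for _ in range(n_samples):
        g = sampled_grad(x, label, w, b, rng, keep)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / n_samples for t in total]


rng = random.Random(0)
w, b = [1.0, -2.0, 3.0], 0.0
x = [0.5, 0.5, 0.5]
print(sampled_grad(x, 1, w, b, rng))   # a single draw: noisy, some entries are zero
print(eot_grad(x, 1, w, b, rng))       # the average: a stable attack direction
```

A single sampled gradient can be zero or misleading in some coordinates, while the averaged gradient points in a consistent direction that an iterative attack can follow.<br />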
<br />
==== Mitigation Through Randomization, [Xie, 2018] ====<br />
'''Defence''': [Xie, 2018] add a randomization layer to their model to help defend against adversarial attacks. For an input image of size [299,299], the image is first randomly rescaled to a side length <math>r \in [299,331]</math>. Next, the image is zero-padded to fix the dimension of the modified input. This modified input is then fed into a regular classifier. The authors claim that this strategy can provide an accuracy of 32.8% against ensemble attack patterns (fixed distortions, many of which are picked randomly). Because of the introduced randomness, the authors claim the model builds some robustness to other types of attacks as well.<br />
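<br />
A rough sketch of such a randomization layer is given below for a single-channel image stored as a list of rows. The nearest-neighbour resize, the fixed output size of 331 and the random placement of the resized image inside the zero padding are assumptions made for illustration; the summary only states that the image is rescaled and then zero-padded.<br />

```python
import random


def resize_nn(img, new_size):
    """Nearest-neighbour resize of a square image to new_size x new_size."""
    old = len(img)
    return [[img[(i * old) // new_size][(j * old) // new_size]
             for j in range(new_size)]
            for i in range(new_size)]


def random_resize_pad(img, rng, out_size=331):
    """Randomly rescale a square image, then zero-pad it to out_size x out_size."""
    r = rng.randint(len(img), out_size)          # random side length (inclusive)
    resized = resize_nn(img, r)
    top = rng.randint(0, out_size - r)           # assumed: random pad offsets
    left = rng.randint(0, out_size - r)
    out = [[0.0] * out_size for _ in range(out_size)]
    for i in range(r):
        for j in range(r):
            out[top + i][left + j] = resized[i][j]
    return out


rng = random.Random(0)
small = [[1.0] * 4 for _ in range(4)]            # tiny stand-in for a 299x299 image
padded = random_resize_pad(small, rng, out_size=6)
print(len(padded), len(padded[0]))
```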
<br />
'''Attack''': The EOT method was used to build adversarial images to attack this model. With their attack, the authors were able to bring the accuracy of this model down to 0% using <math>L_{\infty}(\epsilon=0.031)</math> perturbations.<br />
<br />
== Vanishing and Exploding Gradients ==<br />
<br />
==== PixelDefend, [Song, 2018] ====<br />
'''Defence''': [Song, 2018] argues that adversarial images lie in low probability regions of the data manifold. Therefore, one way to handle adversarial attacks is to project them back into the high probability regions before feeding them into a classifier. They chose to do this by using a generative model (pixelCNN) in a denoising capacity. A PixelCNN model directly estimates the conditional probability of generating an image pixel by pixel [Van den Oord, 2016],<br />
<br />
\begin{align}<br />
p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})<br />
\end{align}<br />
<br />
The reason for choosing this model is the long iterative process of generation. In the backward pass, following the gradient all the way to the input would not be possible because of the vanishing/exploding gradient problem of deep networks. The proposed model was able to obtain an accuracy of 46% on CIFAR-10 images with <math>\ell_{\infty} (\epsilon=0.031) </math> perturbations.<br />
<br />
'''Attack''': The model was attacked using the BPDA technique, where back-propagating through the PixelCNN was replaced with an identity function. With this approach, the authors were able to bring the accuracy down to 9% under the same kind of perturbations.<br />
<br />
==== Defense-GAN, [Samangouei, 2018] ====<br />
<br />
Before classifying the samples, Defense-GAN projects them onto the data manifold using a GAN. The intuition behind this approach is similar to that of PixelDefend, except that it uses a GAN instead of a PixelCNN.<br />
<br />
The authors evaluated this defense on MNIST, because it is not argued to be secure on CIFAR-10. They found that adversarial examples exist on the generator manifold and were able to construct one. A perfect projector would not modify such an example; however, the imperfect gradient descent used for the projection does not perfectly preserve points on the manifold. Therefore, the authors attacked Defense-GAN using BPDA, but achieved only a 45% success rate.<br />
<br />
<br />
= Conclusion =<br />
In this paper, it was found that gradient masking is a common flaw in many defences claiming robustness against white-box adversarial attacks. It leads to a perceived robustness against adversarial attacks when in reality it only results in weaker adversarial image construction. The authors develop three attacks that can overcome gradient masking. With their attacks, they found that the actual robustness of 7 out of the 9 defences proposed at ICLR 2018 is significantly lower than claimed. In fact, many defences were found to be completely ineffective.<br />
<br />
Future work suggested by this paper includes designing defences that do not rely on obfuscated gradients for perceived robustness, and using the proposed evaluation checks to detect when gradients are obfuscated. Early categorization of attacks using supervised techniques may also help in the critical evaluation of incoming data.<br />
<br />
= Critique =<br />
<br />
# The third attack method, reparameterization of the input distortion search space, was presented very briefly and at a very high level. Moreover, the one defense proposal they chose to use it against, [Samangouei, 2018], proved to be resilient against the attack. The authors had to resort to one of their other methods to circumvent the defense.<br />
# The BPDA and reparameterization attacks require intrinsic knowledge of the networks. This information is not likely to be available to external users of a network. Most likely, the use-case for these attacks will be in-house to develop more robust networks. This also means that it is still possible to guard against adversarial attack using gradient masking techniques, provided the details of the network are kept secret. <br />
## A notable exception to this case could be applications that are built using open-source (or even published) models that are paired with model-agnostic defense mechanisms. For example, a ResNet-50 using the model-agnostic 'input transformations' technique by [Guo, 2018] may be used in many different image classification tasks, but could still be successfully attacked using BPDA. <br />
# The BPDA algorithm requires replacing a non-linear part of the model with a differentiable approximation. Since different networks are likely to use different transformations, this technique is not plug-and-play. For each network, the attack needs to be manually constructed.<br />
# In general, the research field of adversarial attack would benefit from having an all-encompassing benchmark or dataset, so that the various approaches can be objectively compared and evaluated.<br />
<br />
= Other Sources =<br />
<br />
# Their re-implementation of each of the defenses and implementations of the attacks are available [https://github.com/anishathalye/obfuscated-gradients here].<br />
<br />
= References =<br />
#'''[Madry, 2018]''' Madry, A., Makelov, A., Schmidt, L., Tsipras, D. and Vladu, A., 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.<br />
#'''[Buckman, 2018]''' Buckman, J., Roy, A., Raffel, C. and Goodfellow, I., 2018. Thermometer encoding: One hot way to resist adversarial examples.<br />
#'''[Guo, 2018]''' Guo, C., Rana, M., Cisse, M. and van der Maaten, L., 2017. Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117.<br />
#'''[Xie, 2018]''' Xie, C., Wang, J., Zhang, Z., Ren, Z. and Yuille, A., 2017. Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991.<br />
#'''[Song, 2018]''' Song, Y., Kim, T., Nowozin, S., Ermon, S. and Kushman, N., 2017. Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766.<br />
#'''[Szegedy, 2013]''' Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. and Fergus, R., 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.<br />
#'''[Samangouei, 2018]''' Samangouei, P., Kabkab, M. and Chellappa, R., 2018. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605.<br />
#'''[van den Oord, 2016]''' van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O. and Graves, A., 2016. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems (pp. 4790-4798).<br />
#'''[Athalye, 2017]''' Athalye, A. and Sutskever, I., 2017. Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397.<br />
#'''[Ma, 2018]''' Ma, Xingjun, Bo Li, Yisen Wang, Sarah M. Erfani, Sudanthi Wijewickrema, Michael E. Houle, Grant Schoenebeck, Dawn Song, and James Bailey. "Characterizing adversarial subspaces using local intrinsic dimensionality." arXiv preprint arXiv:1801.02613 (2018).<br />
# '''[Na, 2018]''' Na, T., Ko, J.H. and Mukhopadhyay, S., 2017. Cascade Adversarial Machine Learning Regularized with a Unified Embedding. arXiv preprint arXiv:1708.02582.<br />
# '''[Papernot et al., 2017]''' Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., and Swami, A. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, ASIA CCS ’17, pp. 506–519, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-4944-4.<br />
# '''[Tramer et al., 2018]''' Tramer, F., Kurakin, A., Papernot, N., Goodfellow, I., Boneh, D., and McDaniel, P. Ensemble adversarial training: Attacks and defenses. International Conference on Learning Representations, 2018.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Obfuscated_Gradients_Give_a_False_Sense_of_Security_Circumventing_Defenses_to_Adversarial_Examples&diff=42133Obfuscated Gradients Give a False Sense of Security Circumventing Defenses to Adversarial Examples2018-11-30T22:28:20Z<p>Z43ma: </p>
<hr />
<div>= Introduction =<br />
<br />
Over the past few years, neural network models have been the source of major breakthroughs in a variety of computer vision problems. However, these networks have been shown to be susceptible to adversarial attacks. In these attacks, small humanly-imperceptible changes are made to images (that are originally correctly classified) which causes these models to misclassify with high confidence. These attacks pose a major threat that needs to be addressed before these systems can be deployed on a large scale, especially in safety-critical scenarios. <br />
<br />
The seriousness of this threat has generated major interest in both the design of such attacks and the defense against them. Recently, many new defenses have been proposed that claim robustness against iterative white-box adversarial attacks. This result is somewhat surprising, given that iterative white-box attacks are among the strongest classes of adversarial attacks. In this paper, the authors identify a common flaw, masked gradients, in many of these defenses that causes them to ''appear'' to achieve high accuracy on adversarial images. This flaw is so prevalent that 7 out of the 9 defenses proposed at the ICLR 2018 conference were found to contain it. The authors develop three attacks that specifically target masked gradients and show that the actual accuracy of these defenses is much lower than claimed. In fact, the majority of these defenses were found to be ineffective against true iterative white-box attacks.<br />
<br />
= Methodology =<br />
<br />
The paper assumes a lot of familiarity with adversarial attack literature. The section below briefly explains some key concepts.<br />
<br />
== Background ==<br />
<br />
==== Adversarial Images Mathematically ====<br />
<br />
Given an image <math>x</math> and a classifier <math>f(x)</math>, an adversarial image <math>x'</math> satisfies two properties:<br />
# <math>D(x,x') < \epsilon </math><br />
# <math>c(x') \neq c^*(x) </math><br />
<br />
Where <math>D</math> is some distance metric, <math>\epsilon </math> is a small constant, <math>c(x')</math> is the output ''class'' predicted by the model, and <math>c^*(x)</math> is the true class for input x. In words, the adversarial image is a small distance from the original image, but the classifier classifies it incorrectly.<br />
<br />
==== Adversarial Attacks Terminology ====<br />
#Adversarial attacks can be either '''black''' or '''white-box'''. In black box attacks, the attacker has access to the network output only, while white-box attackers have full access to the network, including its gradients, architecture and weights. This makes white-box attackers much more powerful. Given access to gradients, white-box attacks use back propagation to modify inputs (as opposed to the weights) with respect to the loss function.<br />
#In '''untargeted''' attacks, the objective is to ''maximize'' the loss of the true class, <math>x'=x \mathbf{+} \lambda(sign(\nabla_xL(x,c^*(x))))</math>, while in '''targeted''' attacks, the objective is to ''minimize'' the loss for a target class <math>c^t(x)</math> that is different from the true class, <math>x'=x \mathbf{-} \lambda(sign(\nabla_xL(x,c^t(x))))</math>. Here, <math>\nabla_xL()</math> is the gradient of the loss function with respect to the input, <math>\lambda</math> is a small step size and <math>sign()</math> is the element-wise sign of the gradient.<br />
# An attacker may be allowed to use a single step of back-propagation ('''single step''') or multiple ('''iterative''') steps. Iterative attackers can generate more powerful adversarial images. Typically, to bound iterative attackers a distance measure is used.<br />
<br />
In this paper the authors focus on the more difficult attacks; white-box iterative targeted and untargeted attacks.<br />
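<br />
As a rough illustration of the two update rules above, the following sketch applies one untargeted and one targeted sign-gradient step to a toy two-feature logistic classifier. The model, its weights, the input and the step size are all hypothetical choices made for this example; they do not come from the paper.<br />

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(x, w, b):
    """Predicted class (0 or 1) of the toy logistic classifier."""
    return 1 if logit(x, w, b) > 0 else 0

def loss_grad_x(x, label, w, b):
    """Gradient of the cross-entropy loss for `label` with respect to the input x."""
    p = sigmoid(logit(x, w, b))
    return [(p - label) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def untargeted_step(x, true_label, w, b, lam):
    """x' = x + lam * sign(grad): increase the loss of the true class."""
    g = loss_grad_x(x, true_label, w, b)
    return [xi + lam * sign(gi) for xi, gi in zip(x, g)]

def targeted_step(x, target_label, w, b, lam):
    """x' = x - lam * sign(grad): decrease the loss of the target class."""
    g = loss_grad_x(x, target_label, w, b)
    return [xi - lam * sign(gi) for xi, gi in zip(x, g)]

w, b = [2.0, -1.0], 0.0   # hypothetical model weights
x = [0.3, 0.2]            # clean input, classified as class 1
x_untargeted = untargeted_step(x, true_label=1, w=w, b=b, lam=0.2)
x_targeted = targeted_step(x, target_label=0, w=w, b=b, lam=0.2)
```

In this binary toy problem the two steps coincide, because raising the loss of class 1 is the same as lowering the loss of class 0. With more than two classes the targeted variant additionally selects which wrong class the input is pushed toward.<br />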
<br />
== Obfuscated Gradients ==<br />
<br />
If gradients are masked, they cannot be followed to generate adversarial images. Gradient masking is known to be an incomplete defense against adversarial images [Papernot et al., 2017; Tramer et al., 2018]. A defense method may appear to provide robustness when, in reality, the gradients in the network simply cannot be followed to generate strong adversarial images. Adversarial images generated from these networks are much weaker and, when used to evaluate model robustness, give a false sense of security against adversarial attacks. Obfuscated gradients are the case of gradient masking in which a defense is designed in such a way that its construction inevitably leads to masked gradients.<br />
<br />
Some defences break gradient descent deliberately, others may do it unintentionally. Some indicators of a broken gradient descent are as follows:<br />
<br />
#One-step attacks perform better than iterative attacks, which are strictly stronger, so this shouldn’t be the case. If single-step methods are working better, it’s a sign the iterative attack is becoming stuck at a local minimum.<br />
#Black-box attacks work better than white-box attacks. The black-box threat model is a strict subset of the white-box threat model, so white-box attacks should perform better. When a defence obfuscates gradients, black-box attacks (which do not use the gradients) often perform better.<br />
#Unbounded attacks do not reach 100% success. With unbounded distortion, any classifier should eventually fail. An attack that doesn’t achieve this should be improved (i.e., it’s a weak attack, not necessarily a strong defence).<br />
#Random sampling finds adversarial examples. Random sampling (e.g., randomly sampling <math>10^5</math> or more points within some <math>\epsilon</math>-ball) should not find adversarial examples when gradient-based attacks do not.<br />
#Increasing the distortion bound does not increase success. Usually, a monotonically increasing attack success rate with increasing distortion bound is expected.<br />
<br />
Among the defenses proposed at ICLR 2018, the authors identify three ways in which defenses obfuscate gradients:<br />
<br />
# '''Shattered gradients''': Non-differentiable operations are introduced into the model, causing a gradient to be nonexistent or incorrect. Introduced by using operations where following the gradient doesn't maximize classification loss globally. <br />
# '''Stochastic gradients''': A stochastic process is added into the model at test time, causing the gradients to become randomized. Introduced by either randomly transforming inputs before feeding to the classifier, or randomly permuting the network itself. <br />
# '''Vanishing Gradients ''': Very deep neural networks or those with recurrent connections are used. Because of the vanishing or exploding gradient problem common in these deep networks, effective gradients at the input are small and not very useful. Introduced by using multiple iterations of neural network evaluation, where the output of one network is fed as the input to the next.<br />
<br />
'''Detecting Obfuscated Gradients''':<br />
<br />
The authors propose a number of tests that might help detect when a defence relies on obfuscated gradients.<br />
<br />
# Iterative attacks should work better than single-step attacks, since iterative attacks are strictly stronger than single-step attacks.<br />
# White-box attacks should perform better than black-box attacks, since the black-box threat model is a strict subset of the white-box threat model.<br />
# Attacks with unbounded distortion (e.g., no limit on the <math>\ell_2</math> norm of the perturbation) should find adversarial examples with 100% success.<br />
# Optimization-based attacks should perform better than brute-force sampling of nearby inputs (sampling within an <math>\epsilon</math>-ball).<br />
These tests may not cover all cases of obfuscated gradients, but they indicate when intuitive properties start to break down. All defences with obfuscated gradients discussed by the authors fail at least one test.<br />
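<br />
The tests above can be read as simple comparisons between measured attack success rates. The sketch below is only an illustration of that reading: the function name, its arguments and the example numbers are hypothetical, and it assumes that every success rate was measured against the same defense.<br />

```python
def obfuscation_warnings(one_step, iterative, black_box, white_box,
                         unbounded, random_search, gradient_based):
    """Return the sanity checks that a defense evaluation fails.

    Every argument is an attack success rate in [0, 1] against the same defense.
    """
    warnings = []
    if one_step > iterative:
        warnings.append("one-step attack beats iterative attack")
    if black_box > white_box:
        warnings.append("black-box attack beats white-box attack")
    if unbounded < 1.0:
        warnings.append("unbounded attack does not reach 100% success")
    if random_search > gradient_based:
        warnings.append("random sampling beats gradient-based attack")
    return warnings

# Made-up numbers for a defense that fails three of the four checks.
report = obfuscation_warnings(one_step=0.40, iterative=0.25,
                              black_box=0.30, white_box=0.25,
                              unbounded=0.90,
                              random_search=0.05, gradient_based=0.25)
```

An empty list does not prove that gradients are usable; as noted above, the tests may not cover all cases of obfuscated gradients.<br />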
<br />
== The Attacks ==<br />
<br />
To circumvent these gradient masking techniques, the authors propose:<br />
# '''Backward Pass Differentiable Approximation (BPDA)''': For defences that introduce non-differentiable components, the authors keep each such component unchanged on the forward pass and replace it with a differentiable approximation on the backward pass. In a white-box setting, the attacker has full access to any added non-linear transformation and can find its approximation. <br />
# '''Expectation over Transformation [Athalye, 2017]''': For defences that add some form of test time randomness, the authors propose to use expectation over transformation technique in the backward pass. Rather than moving along the gradient every step, several gradients are sampled and the step is taken in the average direction. This can help with any stochastic misdirection from individual gradients. The technique is similar to using mini-batch gradient descent but applied in the construction of adversarial images.<br />
# '''Re-parameterize the exploration space''': For very deep networks that rely on vanishing or exploding gradients, the authors propose to re-parameterize and search over the range where the gradient does not explode/vanish.<br />
They assume that given a classifier <math display = "inline">f(g(x))</math>, <math display = "inline">g(·)</math> performs some optimization loop to transform the input x to a new input <math display = "inline">\hat x</math>. Often times, differentiating through <math display = "inline">g(·)</math> yields exploding or vanishing gradients.<br />
<br />
To resolve this, they make a change-of-variable <math display = "inline">x = h(z)</math> for some function <math display = "inline">h(·)</math> such that <math display = "inline">g(h(z)) = h(z)</math> for all z, but <math display = "inline">h(·)</math> is differentiable. This allows them to compute gradients through f(h(z)) and hence circumvent the defense.<br />
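<br />
The Expectation over Transformation idea from the list above can be sketched as averaging the input gradient over several random draws of the defense's transformation. The example below is a toy under stated assumptions: the "defense" just adds uniform noise to the input, the loss is a simple quadratic, and all function names are hypothetical.<br />

```python
import random

def eot_gradient(grad_fn, x, sample_transform, n_samples, rng):
    """Average the input gradient over n_samples random draws of the transformation."""
    total = [0.0] * len(x)
    for _ in range(n_samples):
        t = sample_transform(rng)
        g = grad_fn(x, t)
        total = [ti + gi for ti, gi in zip(total, g)]
    return [ti / n_samples for ti in total]

# Toy randomized defense (assumption): add uniform noise t to the input, then
# evaluate the loss L(z) = sum(z_i^2) at z = x + t. Its gradient w.r.t. x is 2 * (x + t).
def sample_noise(rng):
    return [rng.uniform(-1.0, 1.0) for _ in range(2)]

def toy_grad(x, t):
    return [2.0 * (xi + ti) for xi, ti in zip(x, t)]

rng = random.Random(0)
x = [0.5, -0.25]
single = toy_grad(x, sample_noise(rng))                         # one noisy gradient
averaged = eot_gradient(toy_grad, x, sample_noise, 2000, rng)   # approaches 2 * x as n_samples grows
```

A single sampled gradient can point far from the true direction, while the average settles near the gradient of the expected loss, which is the quantity the attacker steps along.<br />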
<br />
= Main Results =<br />
<br />
[[File:Summary_Table.png|600px|center]]<br />
<br />
The table above summarizes the results of their attacks. Attacks are mounted on the same dataset each defence targeted. If multiple datasets were used, attacks were performed on the largest one. Two different distance metrics (<math>\ell_{\infty}</math> and <math>\ell_{2}</math>) were used in the construction of adversarial images. Distance metrics specify how much an adversarial image can vary from an original image. For <math>\ell_{\infty}</math> adversarial images, each pixel is allowed to vary by a maximum amount. For example, with pixel values scaled to <math>[0,1]</math>, <math>\ell_{\infty}=0.031</math> specifies that each pixel can vary by about <math>255 \times 0.031 \approx 8</math> intensity levels from its original value. <math>\ell_{2}</math> distances specify the magnitude of the total distortion allowed over all pixels. For MNIST and CIFAR-10, untargeted adversarial images were constructed using the entire test set, while for Imagenet, 1000 test images were randomly selected and used to generate targeted adversarial images. <br />
<br />
Standard models were used in evaluating the accuracy of defense strategies under the attacks:<br />
# MNIST: 5-layer Convolutional Neural Network (99.3% top-1 accuracy)<br />
# CIFAR-10: Wide-Resnet (95.0% top-1 accuracy)<br />
# Imagenet: InceptionV3 (78.0% top-1 accuracy)<br />
<br />
The last column shows the accuracies each defence method achieved over the adversarial test set. With the notable exception of [Madry, 2018], most defence methods could only achieve an accuracy of <10%, and several were reduced to 0%. The results for [Samangouei, 2018] (double asterisk) show that the attack on their approach was not as successful. The authors claim that this is a result of implementation imperfections but that, theoretically, the defense can be circumvented using their proposed method.<br />
<br />
==== The defense that worked - Adversarial Training [Madry, 2018] ====<br />
<br />
As a defense mechanism, [Madry, 2018] proposes training the neural networks with adversarial images. Although this approach was previously known [Szegedy, 2013], their formulation sets up the problem in a more systematic way as a min-max problem:<br />
\begin{align}<br />
\theta^* = \arg \underset{\theta} \min \mathop{\mathbb{E_x}} \bigg{[} \underset{\delta \in [-\epsilon,\epsilon]}\max L(x+\delta,y;\theta)\bigg{]} <br />
\end{align}<br />
<br />
where <math>\theta</math> is the parameter of the model, <math>\theta^*</math> is the optimal set of parameters and <math>\delta</math> is a small perturbation to the input image <math>x</math> and is bounded by <math>[-\epsilon,\epsilon]</math>. <br />
<br />
Training proceeds in the following way. For each clean input image, a distorted version of the image is found by approximately solving the inner maximization problem with a fixed number of gradient steps. After each step, the perturbation is projected back into the allowed range (projected gradient descent). Next, the model parameters are updated by taking a step on the outer minimization problem using the distorted images.<br />
<br />
In the paper's evaluation, this approach remained resilient to the attacks the authors mounted against it.<br />
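<br />
The inner maximization can be sketched as a short projected-gradient loop. The example below is a minimal illustration on a toy logistic classifier, not the training code of [Madry, 2018]: the model, the weights, the step size and the number of steps are assumptions, and the outer minimization over <math>\theta</math> is omitted.<br />

```python
import math

def logit(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def loss(x, label, w, b):
    """Binary cross-entropy loss of a toy logistic classifier."""
    p = 1.0 / (1.0 + math.exp(-logit(x, w, b)))
    return -math.log(p if label == 1 else 1.0 - p)

def loss_grad_x(x, label, w, b):
    """Gradient of the loss with respect to the input x."""
    p = 1.0 / (1.0 + math.exp(-logit(x, w, b)))
    return [(p - label) * wi for wi in w]

def pgd_linf(x, label, w, b, eps, step, n_steps):
    """Approximate the inner max over delta in [-eps, eps] with projected sign-gradient ascent."""
    delta = [0.0] * len(x)
    for _ in range(n_steps):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        g = loss_grad_x(x_adv, label, w, b)
        # Ascent step on the loss, then projection back onto the box [-eps, eps].
        delta = [max(-eps, min(eps, di + step * ((gi > 0) - (gi < 0))))
                 for di, gi in zip(delta, g)]
    return [xi + di for xi, di in zip(x, delta)]

w, b = [2.0, -1.0], 0.0   # hypothetical model parameters (theta)
x, y = [0.3, 0.2], 1      # clean input and its true label
x_adv = pgd_linf(x, y, w, b, eps=0.1, step=0.03, n_steps=10)
```

In adversarial training, the loss evaluated at <code>x_adv</code> would then be used in place of the clean loss when updating the model parameters.<br />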
<br />
==== How to check for Obfuscated Gradients ====<br />
For future defence proposals, it is recommended to avoid using masked gradients. To assist with this, the authors propose a set of conditions that can help identify if a defence is relying on masked gradients:<br />
# If weaker one-step attacks are performing better than iterative attacks.<br />
# Black-box attacks can find stronger adversarial images compared with white-box attacks.<br />
# Unbounded iterative attacks do not reach 100% success.<br />
# If random brute force attempts are better than gradient-based methods at finding adversarial images.<br />
<br />
= Detailed Results =<br />
<br />
As a case study for evaluating the prevalence of obfuscated gradients, the authors studied the ICLR 2018 non-certified defenses that argue robustness in a white-box threat model. Each of these defenses argues a high robustness to adaptive, white box attacks. It is reported that seven of these nine defenses depend on this phenomenon, and the authors demonstrate that their techniques can completely circumvent six of those (and partially circumvent one) that depend on obfuscated gradients.<br />
<br />
== Non-obfuscated Gradients ==<br />
<br />
==== Cascade Adversarial Training, [Na, 2018] ====<br />
<br />
'''Defense''': Similar to the method of [Madry, 2018], the authors of [Na, 2018] propose adversarial training. The main difference is that instead of using iterative methods to generate adversarial examples at each mini-batch, a separate model is first trained and used to generate adversarial images. These adversarial images are used to augment the train set of another model.<br />
<br />
'''Attack''': The authors found that this technique does not use obfuscated gradients. They were not able to reduce the performance of this method. However, they point out that the claimed accuracy is much lower (15%) compared with [Madry, 2018] under the same perturbation setting.<br />
<br />
== Gradient Shattering ==<br />
<br />
==== Thermometer Coding, [Buckman, 2018] ====<br />
'''Defense''': Inspired by the observation that neural networks learn linear boundaries between classes [Goodfellow, 2014], [Buckman, 2018] sought to break this linearity by explicitly adding a highly non-linear transform at the input of their model. The non-linear transformation they chose was quantizing inputs to binary vectors, a quantization they termed thermometer encoding.<br />
<br />
Given an image, for each pixel value <math>x_{i,j,c}</math>, the <math>k</math>-th bit of an <math>l</math>-dimensional thermometer code is given by:<br />
\begin{align}<br />
\tau(x_{i,j,c})_k = \begin{cases}<br />
1 & \text{if } x_{i,j,c} > \dfrac{k}{l} \\<br />
0 & \text{otherwise}<br />
\end{cases}<br />
\end{align}<br />
Here it is assumed <math>x_{i,j,c} \in [0, 1] </math> and <math>i, j, c</math> are the row, column and channel index of the pixel respectively. This encoding is like one-hot encoding, except that every bit whose threshold <math>k/l</math> lies below the pixel value (not just one bit) is set to 1. This quantization technique preserves pairwise ordering between pixels.<br />
<br />
On CIFAR-10, the model gave 50% accuracy against <math>\ell_\infty</math> adversarial images with <math>\epsilon=0.031</math> attacks.<br />
<br />
'''Attack''': The authors attack this model using their BPDA approach. Given the non-linear transformation performed in the forward pass, <math>\tau(x)</math>, they develop a differentiable counterpart,<br />
\begin{align}<br />
\hat{\tau}(x_{i,j,c})_k = \min ( \max (x_{i,j,c} - \frac{k}{l}, 0), 1 )<br />
\end{align}<br />
and use it in place of <math>\tau(x)</math> on the backward pass. With their modifications they were able to bring the accuracy of the model down to 0%.<br />
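<br />
To make the BPDA step concrete, the sketch below encodes a single pixel with the hard thermometer code on the forward pass and chains an upstream gradient through a clipped-linear surrogate, <math>\min(\max(x - k/l, 0), 1)</math>, on the backward pass. It assumes the bit index runs over <math>k = 1, \dots, l</math>, which the text above does not state, and the function names and the upstream gradient values are hypothetical.<br />

```python
def thermometer(x, levels):
    """Hard code: bit k is 1 if x > k / levels (assumed k = 1..levels)."""
    return [1 if x > k / levels else 0 for k in range(1, levels + 1)]

def thermometer_surrogate(x, levels):
    """Differentiable stand-in: min(max(x - k / levels, 0), 1) for each bit."""
    return [min(max(x - k / levels, 0.0), 1.0) for k in range(1, levels + 1)]

def surrogate_grad(x, levels):
    """Derivative of each surrogate bit w.r.t. x: 1 inside the linear region, else 0."""
    return [1.0 if 0.0 < x - k / levels < 1.0 else 0.0
            for k in range(1, levels + 1)]

def bpda_input_grad(x, levels, upstream):
    """Backward pass: chain the upstream gradient through the surrogate, not the hard code."""
    return sum(u * d for u, d in zip(upstream, surrogate_grad(x, levels)))

code = thermometer(0.55, 10)                   # forward pass uses the exact encoding
grad = bpda_input_grad(0.55, 10, [0.1] * 10)   # backward pass uses the surrogate
```

The hard code is piecewise constant, so its true derivative is zero almost everywhere; the surrogate supplies a non-zero signal for exactly the bits that are switched on.<br />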
<br />
==== Input Transformation, [Guo, 2018] ====<br />
<br />
'''Defence''': [Guo, 2018] investigated the effect of including different input transformations on the robustness to adversarial images. In particular, they found two techniques provided the greatest resistance: total variation minimization and image quilting. Total variation minimization is a technique that removes high-frequency noise while preserving legitimate edges (good high-frequency components). In image quilting, a large database of image patches from clean images is collected. At test time, input patches that contain a lot of noise are replaced with similar but clean patches from the database.<br />
<br />
Both techniques remove perturbations from adversarial images, which provides some robustness to adversarial attacks. The best model achieved 60% accuracy on adversarial images with <math>l_{2}=0.05</math> perturbations. However, both approaches are non-differentiable and contain test-time randomness, as the modifications made are input dependent. As a result, the gradient with respect to the input cannot be obtained by ordinary back-propagation and is random.<br />
<br />
'''Attack''': The authors used the BPDA attack where the input transformations were replaced by an identity function. They were able to bring the accuracy of the model down to 0% under the same type of adversarial attacks.<br />
<br />
==== Local Intrinsic Dimensionality, [Ma, 2018] ====<br />
<br />
'''Defense''': Local intrinsic dimensionality (LID) is a distance-based metric that measures the similarity between points in a high dimensional space. Given a set of points, let the distance between sample <math>x</math> and its <math>i</math>-th nearest neighbor be <math>r_i(x)</math>; then the LID under the chosen distance metric is given by,<br />
<br />
\begin{align}<br />
LID(x) = - \bigg{(} \frac{1}{k}\sum^k_{i=1}log \frac{r_i(x)}{r_k(x)} \bigg{)}^{-1}<br />
\end{align}<br />
where <math>k</math> is the number of nearest neighbors considered and <math>r_k(x)</math> is the largest of these <math>k</math> neighbor distances. <br />
<br />
First, <math>L_2</math> distances were computed for all training and adversarial images. Next, the LID scores for each training and adversarial image were calculated. It was found that LID scores for adversarial images were significantly larger than those of clean images. Based on these results, a separate classifier was created that can be used to detect adversarial inputs. [Ma, 2018] claim that this is not a defence method, but a method to study the properties of adversarial images.<br />
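<br />
The LID estimate defined above can be computed directly from a list of distances. The following is a minimal sketch that assumes all distances are positive and not all equal (otherwise the logarithm or the final division is undefined); the function name and the example distances are hypothetical.<br />

```python
import math

def lid_estimate(distances, k):
    """LID(x) = -( (1/k) * sum_i log(r_i / r_k) )^(-1) over the k nearest neighbors."""
    r = sorted(distances)[:k]   # distances to the k nearest neighbors
    r_k = r[-1]                 # largest of those k distances
    mean_log = sum(math.log(r_i / r_k) for r_i in r) / k
    return -1.0 / mean_log

# Distances from one sample to its three nearest neighbors (made-up values).
score = lid_estimate([4.0, 1.0, 2.0], k=3)
```

For these values the mean log-ratio is <math>-\ln 2</math>, so the estimate equals <math>1/\ln 2</math>.<br />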
<br />
'''Attack''': Instead of attacking this method, the authors show that this method is not able to detect, and is therefore vulnerable to, attacks of the [Carlini and Wagner, 2017a] variety.<br />
<br />
== Stochastic Gradients ==<br />
<br />
==== Stochastic Activation Pruning, [Dhillon, 2018] ====<br />
'''Defence''': [Dhillon, 2018] use test-time randomness in their model to guard against adversarial attacks. Because adversarial perturbations behave like noise, randomly dropping activations can decrease their collective impact on the classifier. Within a layer, the activations of component nodes are randomly dropped with a probability proportional to their absolute value. The remaining activations are scaled up to preserve accuracy. This is akin to test-time drop-out. This technique was found to drop accuracy slightly on clean images, but improved performance on adversarial images.<br />
<br />
'''Attack''': The authors used the expectation over transformation attack to get useful gradients out of the model. With their attack, they were able to reduce the accuracy of this method down to 0% on CIFAR-10.<br />
<br />
==== Mitigation Through Randomization, [Xie, 2018] ====<br />
'''Defence''': [Xie, 2018] add a randomization layer to their model to help defend against adversarial attacks. For an input image of size [299,299], first the image is randomly re-scaled to <math>r \in [299,331]</math>. Next, the image is zero-padded to fix the dimensions of the modified input. This modified input is then fed into a regular classifier. [Xie, 2018] claim that this strategy can provide an accuracy of 32.8% against ensemble attack patterns (fixed distortions, many of which are picked randomly). Because of the introduced randomness, they claim the model builds some robustness to other types of attacks as well.<br />
<br />
'''Attack''': The EOT method was used to build adversarial images to attack this model. With their attack, the authors were able to bring the accuracy of this model down to 0% using <math>L_{\infty}(\epsilon=0.031)</math> perturbations.<br />
<br />
== Vanishing and Exploding Gradients ==<br />
<br />
==== PixelDefend, [Song, 2018] ====<br />
'''Defence''': [Song, 2018] argue that adversarial images lie in low probability regions of the data manifold. Therefore, one way to handle adversarial attacks is to project them back into the high probability regions before feeding them into a classifier. They chose to do this by using a generative model (PixelCNN) in a denoising capacity. A PixelCNN model directly estimates the conditional probability of generating an image pixel by pixel [Van den Oord, 2016],<br />
<br />
\begin{align}<br />
p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})<br />
\end{align}<br />
<br />
The reason for choosing this model is the long iterative process of generation. In the backward pass, following the gradient all the way to the input would not be possible because of the vanishing/exploding gradient problem of deep networks. The proposed model was able to obtain an accuracy of 46% on CIFAR-10 images with <math>l_{\infty} (\epsilon=0.031) </math> perturbations.<br />
<br />
'''Attack''': The model was attacked using the BPDA technique, where back-propagating through the PixelCNN was replaced with an identity function. With this approach, the authors were able to bring down the accuracy to 9% under the same kind of perturbations.<br />
<br />
==== Defense-GAN, [Samangouei, 2018] ====<br />
<br />
'''Defence''': Before classifying the samples, Defense-GAN projects them onto the data manifold using a GAN. The intuition behind this approach is similar to that of PixelDefend, except that it uses a GAN instead of a PixelCNN.<br />
<br />
'''Attack''': The authors evaluated on MNIST because the defense is not argued to be secure on CIFAR-10. They found that adversarial examples exist on the generator manifold and that such an example can be constructed. A perfect projector would not modify this example; however, the imperfect gradient descent approach used for projection does not perfectly preserve points on the manifold. Therefore, the authors attacked Defense-GAN using BPDA, but could only achieve a 45% success rate.<br />
<br />
<br />
= Conclusion =<br />
In this paper, it was found that gradient masking is a common flaw in many defences claiming robustness against white-box adversarial attacks. This leads to a perceived robustness against adversarial attacks when in reality it only results in weaker adversarial image construction. The authors develop three attacks that can overcome gradient masking. With their attacks, they found that the actual robustness of 7 out of the 9 defences proposed at ICLR 2018 is significantly lower than claimed. In fact, many defences were found to be completely ineffective.<br />
<br />
Some future work that can come out of this paper includes avoiding reliance on obfuscated gradients for perceived robustness and using the proposed evaluation approach to detect when a defense relies on them. Early categorization of attacks using some supervised techniques can also help in critical evaluations of incoming data.<br />
<br />
= Critique =<br />
<br />
# The third attack method, reparameterization of the input distortion search space, was presented very briefly and at a very high level. Moreover, the one defense proposal they chose to use it against, [Samangouei, 2018], proved to be resilient against the attack. The authors had to resort to one of their other methods to circumvent the defense.<br />
# The BPDA and reparameterization attacks require intrinsic knowledge of the networks. This information is not likely to be available to external users of a network. Most likely, the use-case for these attacks will be in-house to develop more robust networks. This also means that it is still possible to guard against adversarial attack using gradient masking techniques, provided the details of the network are kept secret. <br />
## A notable exception to this case could be applications that are built using open-source (or even published) models that are paired with model-agnostic defense mechanisms. For example, a ResNet-50 using the model-agnostic 'input transformations' technique by [Guo, 2018] may be used in many different image classification tasks, but could still be successfully attacked using BPDA. <br />
# The BPDA algorithm requires replacing a non-linear part of the model with a differentiable approximation. Since different networks are likely to use different transformations, this technique is not plug-and-play. For each network, the attack needs to be manually constructed.<br />
# In general, the research field of adversarial attack would benefit from having an all-encompassing benchmark or dataset, so that the various approaches can be objectively compared and evaluated.<br />
<br />
= Other Sources =<br />
<br />
# Their re-implementation of each of the defenses and implementations of the attacks are available [https://github.com/anishathalye/obfuscated-gradients here].<br />
<br />
= References =<br />
#'''[Madry, 2018]''' Madry, A., Makelov, A., Schmidt, L., Tsipras, D. and Vladu, A., 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.<br />
#'''[Buckman, 2018]''' Buckman, J., Roy, A., Raffel, C. and Goodfellow, I., 2018. Thermometer encoding: One hot way to resist adversarial examples.<br />
#'''[Guo, 2018]''' Guo, C., Rana, M., Cisse, M. and van der Maaten, L., 2017. Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117.<br />
#'''[Xie, 2018]''' Xie, C., Wang, J., Zhang, Z., Ren, Z. and Yuille, A., 2017. Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991.<br />
#'''[Song, 2018]''' Song, Y., Kim, T., Nowozin, S., Ermon, S. and Kushman, N., 2017. Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766.<br />
#'''[Szegedy, 2013]''' Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. and Fergus, R., 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.<br />
#'''[Samangouei, 2018]''' Samangouei, P., Kabkab, M. and Chellappa, R., 2018. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605.<br />
#'''[van den Oord, 2016]''' van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O. and Graves, A., 2016. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems (pp. 4790-4798).<br />
#'''[Athalye, 2017]''' Athalye, A. and Sutskever, I., 2017. Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397.<br />
#'''[Ma, 2018]''' Ma, Xingjun, Bo Li, Yisen Wang, Sarah M. Erfani, Sudanthi Wijewickrema, Michael E. Houle, Grant Schoenebeck, Dawn Song, and James Bailey. "Characterizing adversarial subspaces using local intrinsic dimensionality." arXiv preprint arXiv:1801.02613 (2018).<br />
# '''[Na, 2018]''' Na, T., Ko, J.H. and Mukhopadhyay, S., 2017. Cascade Adversarial Machine Learning Regularized with a Unified Embedding. arXiv preprint arXiv:1708.02582.<br />
# '''[Papernot et al., 2017]''' Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., and Swami, A. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, ASIA CCS ’17, pp. 506–519, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-4944-4.<br />
# '''[Tramer et al., 2018]''' Tramer, F., Kurakin, A., Papernot, N., Goodfellow, I., Boneh, D., and McDaniel, P. Ensemble adversarial training: Attacks and defenses. International Conference on Learning Representations, 2018.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Obfuscated_Gradients_Give_a_False_Sense_of_Security_Circumventing_Defenses_to_Adversarial_Examples&diff=42132Obfuscated Gradients Give a False Sense of Security Circumventing Defenses to Adversarial Examples2018-11-30T22:26:15Z<p>Z43ma: </p>
<hr />
<div>= Introduction =<br />
<br />
Over the past few years, neural network models have been the source of major breakthroughs in a variety of computer vision problems. However, these networks have been shown to be susceptible to adversarial attacks. In these attacks, small humanly-imperceptible changes are made to images (that are originally correctly classified) which causes these models to misclassify with high confidence. These attacks pose a major threat that needs to be addressed before these systems can be deployed on a large scale, especially in safety-critical scenarios. <br />
<br />
The seriousness of this threat has generated major interest in both the design of such attacks and the defenses against them. Recently, many new defenses have been proposed that claim robustness against iterative white-box adversarial attacks. This result is somewhat surprising, given that iterative white-box attacks are one of the most difficult classes of adversarial attacks to defend against. In this paper, the authors identify a common flaw, masked gradients, in many of these defenses that causes them to ''appear'' to achieve high accuracy on adversarial images. This flaw is so prevalent that 7 out of the 9 defenses proposed at the ICLR 2018 conference were found to contain it. The authors develop three attacks specifically targeting masked gradients and show that the actual accuracy of these defenses is much lower than claimed. In fact, the majority of these defenses were found to be ineffective against true iterative white-box attacks.<br />
<br />
= Methodology =<br />
<br />
The paper assumes a lot of familiarity with adversarial attack literature. The section below briefly explains some key concepts.<br />
<br />
== Background ==<br />
<br />
==== Adversarial Images Mathematically ====<br />
<br />
Given an image <math>x</math> and a classifier <math>f(x)</math>, an adversarial image <math>x'</math> satisfies two properties:<br />
# <math>D(x,x') < \epsilon </math><br />
# <math>c(x') \neq c^*(x) </math><br />
<br />
Where <math>D</math> is some distance metric, <math>\epsilon </math> is a small constant, <math>c(x')</math> is the output ''class'' predicted by the model, and <math>c^*(x)</math> is the true class for input x. In words, the adversarial image is a small distance from the original image, but the classifier classifies it incorrectly.<br />
<br />
==== Adversarial Attacks Terminology ====<br />
#Adversarial attacks can be either '''black''' or '''white-box'''. In black box attacks, the attacker has access to the network output only, while white-box attackers have full access to the network, including its gradients, architecture and weights. This makes white-box attackers much more powerful. Given access to gradients, white-box attacks use back propagation to modify inputs (as opposed to the weights) with respect to the loss function.<br />
#In '''untargeted''' attacks, the objective is to ''maximize'' the loss of the true class, <math>x'=x \mathbf{+} \lambda(sign(\nabla_xL(x,c^*(x))))</math>, while in '''targeted''' attacks, the objective is to ''minimize'' the loss for a target class <math>c^t(x)</math> that is different from the true class, <math>x'=x \mathbf{-} \lambda(sign(\nabla_xL(x,c^t(x))))</math>. Here, <math>\nabla_xL()</math> is the gradient of the loss function with respect to the input, <math>\lambda</math> is a small step size and <math>sign()</math> is the element-wise sign of the gradient.<br />
# An attacker may be allowed to use a single step of back-propagation ('''single step''') or multiple ('''iterative''') steps. Iterative attackers can generate more powerful adversarial images. Typically, to bound iterative attackers a distance measure is used.<br />
<br />
In this paper the authors focus on the more difficult attacks; white-box iterative targeted and untargeted attacks.<br />
<br />
== Obfuscated Gradients ==<br />
<br />
If gradients are masked, they cannot be followed to generate adversarial images. Gradient masking is known to be an incomplete defense against adversarial images [Papernot et al., 2017; Tramer et al., 2018]. A defense method may appear to provide robustness when, in reality, the gradients in the network simply cannot be followed to generate strong adversarial images. Adversarial images generated from these networks are much weaker and, when used to evaluate model robustness, give a false sense of security against adversarial attacks. Obfuscated gradients are the case of gradient masking in which a defense is designed in such a way that its construction inevitably leads to masked gradients.<br />
<br />
Some defenses break gradient descent deliberately; others may do so unintentionally. Some indicators of broken gradient descent are as follows:<br />
<br />
#One-step attacks perform better than iterative attacks. Iterative attacks are strictly stronger, so this should not happen. If single-step methods work better, it is a sign that the iterative attack is getting stuck in a local minimum.<br />
#Black-box attacks work better than white-box attacks. The black-box threat model is a strict subset of the white-box threat model, so white-box attacks should perform better. When a defence obfuscates gradients, black-box attacks (which do not use gradients) often perform better.<br />
#Unbounded attacks do not reach 100% success. With unbounded distortion, any classifier should eventually fail. An attack that does not achieve this should be improved (i.e., it is a weak attack, not necessarily a strong defence).<br />
#Random sampling finds adversarial examples. Random sampling (e.g., randomly sampling <math>10^5</math> or more points within some <math>\epsilon</math>-ball) should not find adversarial examples when gradient-based attacks do not.<br />
#Increasing the distortion bound does not increase success. Usually, a monotonically increasing attack success rate with increasing distortion bound is expected.<br />
<br />
In the defenses proposed at ICLR 2018, there are three ways in which defenses obfuscate gradients:<br />
<br />
# '''Shattered gradients''': Non-differentiable operations are introduced into the model, causing the gradient to be nonexistent or incorrect. They are introduced by operations for which following the gradient does not maximize the classification loss globally. <br />
# '''Stochastic gradients''': A stochastic process is added to the model at test time, causing the gradients to become randomized. They are introduced either by randomly transforming inputs before feeding them to the classifier or by randomly permuting the network itself. <br />
# '''Vanishing gradients''': Very deep neural networks or those with recurrent connections are used. Because of the vanishing or exploding gradient problem common in these deep networks, the effective gradients at the input are too small (or too unstable) to be useful. They are introduced by using multiple iterations of neural network evaluation, where the output of one network is fed as the input to the next.<br />
<br />
'''Detecting Obfuscated Gradients''':<br />
<br />
The authors propose a number of tests that might help detect when a defence relies on obfuscated gradients.<br />
<br />
#Iterative attacks should work better than single-step attacks, since iterative attacks are strictly stronger than single-step attacks.<br />
#White-box attacks should perform better than black-box attacks, since the black-box threat model is a strict subset of the white-box threat model.<br />
#Attacks with an unbounded distortion metric (e.g. <math>\ell_2</math> norm) should find adversarial examples with 100% success.<br />
#Optimization-based attacks should perform better than brute-force sampling of nearby inputs (sampling within an <math>\epsilon</math>-ball).<br />
<br />
These tests may not cover all cases of obfuscated gradients, but they indicate when intuitive properties start to break down. All defences with obfuscated gradients discussed by the authors fail at least one test.<br />
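<br />
These sanity checks are easy to automate once attack success rates have been measured. The sketch below assumes the success rates (fractions in [0, 1]) are already available; the function name <code>obfuscation_warnings</code>, its parameters and the example numbers are hypothetical and not part of the paper.<br />

```python
def obfuscation_warnings(single_step, iterative, black_box, white_box,
                         unbounded, random_search, gradient_based):
    """Return the list of failed sanity checks, given attack success rates in [0, 1]."""
    warnings = []
    if single_step > iterative:
        warnings.append("single-step attack beats iterative attack")
    if black_box > white_box:
        warnings.append("black-box attack beats white-box attack")
    if unbounded < 1.0:
        warnings.append("unbounded attack does not reach 100% success")
    if random_search > gradient_based:
        warnings.append("random sampling beats gradient-based attack")
    return warnings

# Made-up success rates for a defence that masks gradients.
suspicious = obfuscation_warnings(
    single_step=0.40, iterative=0.10,
    black_box=0.55, white_box=0.10,
    unbounded=0.80,
    random_search=0.05, gradient_based=0.10,
)
```

A non-empty result does not prove that a defence is broken; it only signals that the attack used for evaluation is probably weaker than it should be.<br />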
<br />
== The Attacks ==<br />
<br />
To circumvent these gradient masking techniques, the authors propose:<br />
# '''Backward Pass Differentiable Approximation (BPDA)''': For defences that introduce non-differentiable components, the authors replace each such component with a differentiable approximation on the backward pass. In a white-box setting, the attacker has full access to any added non-linear transformation and can find its approximation. <br />
# '''Expectation over Transformation (EOT) [Athalye, 2017]''': For defences that add some form of test-time randomness, the authors propose using the expectation over transformation technique in the backward pass. Rather than moving along a single sampled gradient at every step, several gradients are sampled and the step is taken in the average direction. This helps to counter the stochastic misdirection of individual gradients. The technique is similar to mini-batch gradient descent, but applied to the construction of adversarial images.<br />
# '''Re-parameterize the exploration space''': For defences that rely on vanishing or exploding gradients in very deep networks, the authors propose to re-parameterize and search over the range where the gradient does not explode or vanish.<br />
They assume that given a classifier <math display = "inline">f(g(x))</math>, <math display = "inline">g(·)</math> performs some optimization loop to transform the input x to a new input <math display = "inline">\hat x</math>. Often, differentiating through <math display = "inline">g(·)</math> yields exploding or vanishing gradients.<br />
<br />
To resolve this, they make a change-of-variable <math display = "inline">x = h(z)</math> for some function <math display = "inline">h(·)</math> such that <math display = "inline">g(h(z)) = h(z)</math> for all z, but <math display = "inline">h(·)</math> is differentiable. This allows them to compute gradients through f(h(z)) and hence circumvent the defense.<br />
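<br />
The expectation over transformation idea from the list above can be illustrated with a one-dimensional toy problem. This is only a sketch under stated assumptions: the randomized defence is taken to be additive Gaussian noise and the loss is <math>L(u)=u^2</math>, both chosen for illustration; the function names are hypothetical.<br />

```python
import random

def loss_grad(u):
    """Gradient of the toy loss L(u) = u**2 at the transformed input u."""
    return 2.0 * u

def random_transform(x, rng, sigma=1.0):
    """Hypothetical randomized defence: add Gaussian noise to the input at test time."""
    return x + rng.gauss(0.0, sigma)

def single_sample_grad(x, rng):
    """Gradient through one sampled transformation (a noisy, possibly misleading direction)."""
    # d t(x) / dx = 1 for additive noise, so the chain rule gives L'(t(x)).
    return loss_grad(random_transform(x, rng))

def eot_grad(x, rng, n_samples=2000):
    """Monte Carlo estimate of grad_x E_t[L(t(x))], averaging over sampled transformations."""
    return sum(single_sample_grad(x, rng) for _ in range(n_samples)) / n_samples

rng = random.Random(0)
x = 0.5
g = eot_grad(x, rng)  # close to the true expected gradient 2 * x = 1.0
```

A single sampled gradient can even point in the wrong direction here, whereas the average converges to the gradient of the expected loss as the number of samples grows.<br />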
<br />
= Main Results =<br />
<br />
[[File:Summary_Table.png|600px|center]]<br />
<br />
The table above summarizes the results of their attacks. Attacks are mounted on the same dataset each defence targeted. If multiple datasets were used, attacks were performed on the largest one. Two different distance metrics (<math>\ell_{\infty}</math> and <math>\ell_{2}</math>) were used in the construction of adversarial images. Distance metrics specify how much an adversarial image can vary from an original image. For <math>\ell_{\infty}</math> adversarial images, each pixel is allowed to vary by a maximum amount. For example, an <math>\ell_{\infty}</math> bound of <math>\epsilon=0.031</math> specifies that each pixel can vary by about <math>255 \times 0.031 \approx 8</math> intensity levels from its original value on a 0 to 255 scale. <math>\ell_{2}</math> distances specify the magnitude of the total distortion allowed over all pixels. For MNIST and CIFAR-10, untargeted adversarial images were constructed using the entire test set, while for ImageNet, 1000 test images were randomly selected and used to generate targeted adversarial images. <br />
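<br />
As a small worked example of the two distance metrics, the snippet below computes both for a made-up four-pixel image with values in [0, 1]. The pixel values are illustrative assumptions, not data from the paper.<br />

```python
import math

def linf_distance(x, x_adv):
    """Largest change applied to any single pixel."""
    return max(abs(a - b) for a, b in zip(x, x_adv))

def l2_distance(x, x_adv):
    """Euclidean norm of the total distortion over all pixels."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_adv)))

# Made-up image and perturbed copy, with pixel values scaled to [0, 1].
x = [0.20, 0.50, 0.90, 0.10]
x_adv = [0.23, 0.47, 0.90, 0.13]

eps = 0.031
levels = eps * 255  # the same bound expressed in 8-bit intensity levels
```

Here three pixels each move by 0.03, so the perturbation satisfies the <math>\ell_{\infty}</math> bound of 0.031 while its <math>\ell_{2}</math> norm is larger, because the latter accumulates the changes over all pixels.<br />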
<br />
Standard models were used in evaluating the accuracy of defense strategies under the attacks:<br />
# MNIST: 5-layer Convolutional Neural Network (99.3% top-1 accuracy)<br />
# CIFAR-10: Wide-Resnet (95.0% top-1 accuracy)<br />
# ImageNet: InceptionV3 (78.0% top-1 accuracy)<br />
<br />
The last column shows the accuracy each defence method achieved on the adversarial test set. Except for [Madry, 2018], all defence methods achieved an accuracy below 10%. Furthermore, the accuracy of most methods was 0%. The results for [Samangouei, 2018] (double asterisk) show that the attack was not as successful against this defence. The authors claim that this is a result of implementation imperfections and that, in theory, the defense can be circumvented using their proposed method.<br />
<br />
==== The defense that worked - Adversarial Training [Madry, 2018] ====<br />
<br />
As a defense mechanism, [Madry, 2018] proposes training neural networks with adversarial images. Although this approach was previously known [Szegedy, 2013], their formulation sets up the problem in a more systematic way as a min-max problem:<br />
\begin{align}<br />
\theta^* = \arg \min_{\theta} \mathbb{E}_{x} \bigg[ \max_{\delta \in [-\epsilon,\epsilon]} L(x+\delta,y;\theta) \bigg] <br />
\end{align}<br />
<br />
where <math>\theta</math> denotes the parameters of the model, <math>\theta^*</math> is the optimal set of parameters and <math>\delta</math> is a small perturbation to the input image <math>x</math>, each component of which is bounded by <math>[-\epsilon,\epsilon]</math>. <br />
<br />
Training proceeds in the following way. For each clean input image, a distorted version of the image is found by approximately solving the inner maximization problem for a fixed number of iterations. After each gradient step, the perturbed image is projected back into the allowed range (projected gradient descent). Next, the model parameters are updated by taking a step on the outer minimization problem, i.e. by minimizing the classification loss on the distorted image.<br />
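<br />
The training procedure can be sketched on a toy problem. This is a minimal illustration, assuming a two-feature logistic-regression model, a four-point dataset and hand-picked hyperparameters; it is not the setup of [Madry, 2018], and the function names are hypothetical.<br />

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

def pgd(w, b, x, y, eps, step, iters):
    """Inner maximization: projected sign-gradient ascent on the loss within the l_inf ball."""
    x_adv = list(x)
    for _ in range(iters):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x_adv)) + b)
        grad_x = [(p - y) * wi for wi in w]
        x_adv = [xa + step * sign(g) for xa, g in zip(x_adv, grad_x)]
        # Projection: clip each coordinate back into [x - eps, x + eps].
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv

def adversarial_training(data, eps, lr=0.5, epochs=200):
    """Outer minimization: gradient descent on the loss evaluated at the adversarial inputs."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            x_adv = pgd(w, b, x, y, eps, step=eps / 2, iters=5)
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x_adv)) + b)
            w = [wi - lr * (p - y) * xa for wi, xa in zip(w, x_adv)]
            b -= lr * (p - y)
    return w, b

# Toy, linearly separable data (assumed values): (features, label).
data = [([1.0, 1.0], 1), ([0.8, 1.2], 1), ([-1.0, -1.0], 0), ([-1.2, -0.8], 0)]
w, b = adversarial_training(data, eps=0.3)
```

Because the weights are only ever updated on the perturbed inputs, the resulting decision boundary keeps a margin of at least the perturbation budget around the training points in this toy setting.<br />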
<br />
Among the defences summarized in the table above, this was the only one that retained substantial accuracy under the authors' attacks.<br />
<br />
==== How to check for Obfuscated Gradients ====<br />
For future defence proposals, it is recommended to avoid using masked gradients. To assist with this, the authors propose a set of conditions that can help identify if a defence is relying on masked gradients:<br />
# Weaker one-step attacks perform better than iterative attacks.<br />
# Black-box attacks find stronger adversarial images than white-box attacks.<br />
# Unbounded iterative attacks do not reach 100% success.<br />
# Random brute-force search is better than gradient-based methods at finding adversarial images.<br />
<br />
= Detailed Results =<br />
<br />
As a case study for evaluating the prevalence of obfuscated gradients, the authors studied the ICLR 2018 non-certified defenses that argue robustness in a white-box threat model. Each of these defenses argues a high robustness to adaptive, white box attacks. It is reported that seven of these nine defenses depend on this phenomenon, and the authors demonstrate that their techniques can completely circumvent six of those (and partially circumvent one) that depend on obfuscated gradients.<br />
<br />
== Non-obfuscated Gradients ==<br />
<br />
==== Cascade Adversarial Training, [Na, 2018] ====<br />
<br />
'''Defense''': Similar to the method of [Madry, 2018], the authors of [Na, 2018] propose adversarial training. The main difference is that instead of using iterative methods to generate adversarial examples at each mini-batch, a separate model is first trained and used to generate adversarial images. These adversarial images are used to augment the train set of another model.<br />
<br />
'''Attack''': The authors found that this technique does not use obfuscated gradients. They were not able to reduce the performance of this method. However, they point out that the claimed accuracy is much lower (15%) compared with [Madry, 2018] under the same perturbation setting.<br />
<br />
== Gradient Shattering ==<br />
<br />
==== Thermometer Coding, [Buckman, 2018] ====<br />
'''Defense''': Inspired by the observation that neural networks learn linear boundaries between classes [Goodfellow, 2014], [Buckman, 2018] sought to break this linearity by explicitly adding a highly non-linear transform at the input of their model. The non-linear transformation they chose was quantizing inputs to binary vectors. This quantization was termed thermometer encoding.<br />
<br />
Given an image, for each pixel value <math>x_{i,j,c}</math>, in an <math>l</math>-dimensional thermometer code, the <math>k</math>-th bit is given by:<br />
\begin{align}<br />
\tau(x_{i,j,c})_k = \begin{cases}<br />
1 & \text{if } x_{i,j,c} > \dfrac{k}{l} \\<br />
0 & \text{otherwise}<br />
\end{cases}<br />
\end{align}<br />
Here it is assumed <math>x_{i,j,c} \in [0, 1] </math> and <math>i, j, c</math> are the row, column and channel index of the pixel respectively. This encoding is like one-hot encoding, except that all bits whose threshold <math>k/l</math> lies below the pixel value (not just one bit) are set to 1. This quantization technique preserves pairwise ordering between pixels.<br />
<br />
On CIFAR-10, the model gave 50% accuracy against <math>\ell_\infty</math> adversarial images with <math>\epsilon=0.031</math> attacks.<br />
<br />
'''Attack''': The authors attack this model using their BPDA approach. Given the non-linear transformation performed in the forward pass, <math>\tau(x)</math>, they develop a differentiable counterpart,<br />
\begin{align}<br />
\hat{\tau}(x_{i,j,c})_k = \min \left( \max \left( x_{i,j,c} - \frac{k}{l}, 0 \right), 1 \right)<br />
\end{align}<br />
and use it in place of <math>\tau(x)</math> on the backward pass. With their modifications they were able to bring the accuracy of the model down to 0%.<br />
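<br />
The forward/backward split used by BPDA can be sketched for a single pixel. The snippet below is an illustration only: it assumes the bit index <math>k</math> runs from 1 to <math>l</math>, and the function names and the upstream gradient values are hypothetical.<br />

```python
def thermometer(x, l):
    """Forward pass: the k-th bit is 1 if x > k / l (assumes k = 1, ..., l and x in [0, 1])."""
    return [1 if x > k / l else 0 for k in range(1, l + 1)]

def tau_hat(x, l):
    """Differentiable surrogate: min(max(x - k / l, 0), 1) for each bit."""
    return [min(max(x - k / l, 0.0), 1.0) for k in range(1, l + 1)]

def tau_hat_grad(x, l):
    """Derivative of each surrogate bit with respect to x (1 inside the linear region, else 0)."""
    return [1.0 if 0.0 < x - k / l < 1.0 else 0.0 for k in range(1, l + 1)]

def bpda_input_grad(x, l, upstream):
    """Backward pass: chain rule through the surrogate instead of the true step function."""
    return sum(u * d for u, d in zip(upstream, tau_hat_grad(x, l)))

x, l = 0.55, 4
bits = thermometer(x, l)               # what the classifier actually sees
upstream = [0.2, -0.5, 0.1, 0.3]       # made-up gradient of the loss w.r.t. each bit
g = bpda_input_grad(x, l, upstream)    # usable gradient w.r.t. the pixel
```

The true encoding has zero derivative almost everywhere, so back-propagating through it gives the attacker nothing; substituting the surrogate's derivative on the backward pass is what yields a usable gradient.<br />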
<br />
==== Input Transformation, [Guo, 2018] ====<br />
<br />
'''Defence''': [Guo, 2018] investigated the effect of different input transformations on robustness to adversarial images. In particular, they found that two techniques provided the greatest resistance: total variance minimization and image quilting. Total variance minimization is a technique that removes high-frequency noise while preserving legitimate edges (good high-frequency components). In image quilting, a large database of image patches from clean images is collected. At test time, input patches that contain a lot of noise are replaced with similar but clean patches from the database.<br />
<br />
Both techniques remove perturbations from adversarial images, which provides some robustness to adversarial attacks. The best model achieved 60% accuracy on adversarial images with <math>\ell_{2}=0.05</math> perturbations. However, both approaches are non-differentiable and contain test-time randomness, as the modifications made are input dependent. As a result, the gradient with respect to the input is both unavailable and random.<br />
<br />
'''Attack''': The authors used the BPDA attack where the input transformations were replaced by an identity function. They were able to bring the accuracy of the model down to 0% under the same type of adversarial attacks.<br />
<br />
==== Local Intrinsic Dimensionality, [Ma, 2018] ====<br />
<br />
'''Defense''': Local intrinsic dimensionality (LID) is a distance-based metric that measures the similarity between points in a high dimensional space. Given a set of points, let the distance between sample <math>x</math> and its <math>i</math>-th nearest neighbor be <math>r_i(x)</math>. The LID under the chosen distance metric is then given by:<br />
<br />
\begin{align}<br />
LID(x) = - \bigg{(} \frac{1}{k}\sum^k_{i=1} \log \frac{r_i(x)}{r_k(x)} \bigg{)}^{-1}<br />
\end{align}<br />
where <math>k</math> is the number of nearest neighbors considered and <math>r_k(x)</math> is the distance to the farthest of these <math>k</math> neighbors. <br />
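<br />
The estimator above is straightforward to compute from a list of neighbor distances. The sketch below assumes the distances to the <math>k</math> nearest neighbors have already been found; the function name <code>lid</code> and the example distances are illustrative.<br />

```python
import math

def lid(distances):
    """LID estimate from the distances to the k nearest neighbours of a sample.

    Assumes all distances are positive and not all equal; otherwise the
    mean log-ratio is zero and the estimate is undefined.
    """
    r = sorted(distances)
    k = len(r)
    r_k = r[-1]  # distance to the farthest of the k neighbours
    mean_log_ratio = sum(math.log(r_i / r_k) for r_i in r) / k
    return -1.0 / mean_log_ratio

# Made-up neighbour distances for one sample.
estimate = lid([1.0, 0.5, 2.0])
```

For these distances the mean log-ratio is <math>-\log 2</math>, so the estimate equals <math>1/\log 2 \approx 1.44</math>.<br />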
<br />
First, <math>L_2</math> distances were computed for all training and adversarial images. Next, the LID scores for each training and adversarial image were calculated. It was found that LID scores for adversarial images were significantly larger than those of clean images. Based on these results, a separate classifier was created that can be used to detect adversarial inputs. [Ma, 2018] claim that this is not a defence method, but a method to study the properties of adversarial images.<br />
<br />
'''Attack''': Instead of attacking this method, the authors show that this method is not able to detect, and is therefore vulnerable to, attacks of the [Carlini and Wagner, 2017a] variety.<br />
<br />
== Stochastic Gradients ==<br />
<br />
==== Stochastic Activation Pruning, [Dhillon, 2018] ====<br />
'''Defence''': [Dhillon, 2018] use test-time randomness in their model to guard against adversarial attacks. Because adversarial perturbations behave like noise, randomly dropping activations can decrease their collective impact on the classifier. Within a layer, activations are randomly pruned, with each activation retained with a probability proportional to its absolute value, so that small activations are dropped preferentially. The remaining activations are scaled up to preserve accuracy. This is akin to test-time dropout. This technique was found to drop accuracy slightly on clean images, but improved performance on adversarial images.<br />
<br />
'''Attack''': The authors used the expectation over transformation attack to get useful gradients out of the model. With their attack, they were able to reduce the accuracy of this method down to 0% on CIFAR-10.<br />
<br />
==== Mitigation Through Randomization, [Xie, 2018] ====<br />
'''Defence''': [Xie, 2018] add a randomization layer to their model to help defend against adversarial attacks. For an input image of size [299,299], the image is first randomly re-scaled to size <math>r \in [299,331]</math>. Next, the image is zero-padded to fix the dimension of the modified input. This modified input is then fed into a regular classifier. [Xie, 2018] claim that this strategy can provide an accuracy of 32.8% against ensemble attack patterns (fixed distortions, many of which are picked randomly). Because of the introduced randomness, they claim the model builds some robustness to other types of attacks as well.<br />
<br />
'''Attack''': The EOT method was used to build adversarial images to attack this model. With their attack, the authors were able to bring the accuracy of this model down to 0% using <math>L_{\infty}(\epsilon=0.031)</math> perturbations.<br />
<br />
== Vanishing and Exploding Gradients ==<br />
<br />
==== PixelDefend, [Song, 2018] ====<br />
'''Defence''': [Song, 2018] argue that adversarial images lie in low probability regions of the data manifold. Therefore, one way to handle adversarial attacks is to project them back into the high probability regions before feeding them into a classifier. They chose to do this by using a generative model (PixelCNN) in a denoising capacity. A PixelCNN model directly estimates the conditional probability of generating an image pixel by pixel [Van den Oord, 2016]:<br />
<br />
\begin{align}<br />
p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})<br />
\end{align}<br />
<br />
The reason for choosing this model is the long iterative process of generation. In the backward pass, following the gradient all the way to the input would not be possible because of the vanishing/exploding gradient problem of deep networks. The proposed model was able to obtain an accuracy of 46% on CIFAR-10 images with <math>\ell_{\infty} (\epsilon=0.031) </math> perturbations.<br />
<br />
'''Attack''': The model was attacked using the BPDA technique, where back-propagating through the PixelCNN was replaced with an identity function. With this approach, the authors were able to bring down the accuracy to 9% under the same kind of perturbations.<br />
<br />
==== Defence-GAN, [Samangouei, 2018] ====<br />
<br />
Before classifying the samples, Defence-GAN projects them onto the data manifold using a GAN. The intuition behind this approach is similar to that of PixelDefend, except that it uses a GAN instead of a PixelCNN.<br />
<br />
The authors evaluated Defence-GAN on MNIST because it is not argued to be secure on CIFAR-10. They found that adversarial examples exist on the generator manifold and that such an example can be constructed. A perfect projector would leave this example unchanged; however, the imperfect gradient-descent projection used in practice does not exactly preserve points on the manifold. The authors therefore attacked Defence-GAN using BPDA, but achieved a success rate of only 45%.<br />
<br />
<br />
= Conclusion =<br />
In this paper, it was found that gradient masking is a common flaw in many defences claiming robustness against white-box adversarial attacks. This leads to a perceived robustness against adversarial attacks when in reality it only results in weaker adversarial image construction. The authors develop three attacks that can overcome gradient masking. With their attacks, they found that the actual robustness of 7 out of the 9 defences proposed at ICLR 2018 is significantly lower than claimed. In fact, many defences were found to be completely ineffective.<br />
<br />
Future work suggested by this paper includes designing defences that do not rely on obfuscated gradients for perceived robustness and using the proposed evaluation approach to detect when gradient obfuscation occurs. Early categorization of attacks using some supervised techniques can also help in critical evaluations of incoming data.<br />
<br />
= Critique =<br />
<br />
# The third attack method, reparameterization of the input distortion search space, was presented very briefly and at a very high level. Moreover, the one defense proposal they chose to use it against, [Samangouei, 2018], proved to be resilient against the attack. The authors had to resort to one of their other methods to circumvent the defense.<br />
# The BPDA and reparameterization attacks require intrinsic knowledge of the networks. This information is not likely to be available to external users of a network. Most likely, the use-case for these attacks will be in-house to develop more robust networks. This also means that it is still possible to guard against adversarial attack using gradient masking techniques, provided the details of the network are kept secret. <br />
## A notable exception to this case could be applications that are built using open-source (or even published) models that are paired with model-agnostic defense mechanisms. For example, a ResNet-50 using the model-agnostic 'input transformations' technique by [Guo, 2018] may be used in many different image classification tasks, but could still be successfully attacked using BPDA. <br />
# The BPDA algorithm requires replacing a non-linear part of the model with a differentiable approximation. Since different networks are likely to use different transformations, this technique is not plug-and-play. For each network, the attack needs to be manually constructed.<br />
# In general, the research field of adversarial attack would benefit from having an all-encompassing benchmark or dataset, so that the various approaches can be objectively compared and evaluated.<br />
<br />
= Other Sources =<br />
<br />
# Their re-implementation of each of the defenses and implementations of the attacks are available [https://github.com/anishathalye/obfuscated-gradients here].<br />
<br />
= References =<br />
#'''[Madry, 2018]''' Madry, A., Makelov, A., Schmidt, L., Tsipras, D. and Vladu, A., 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.<br />
#'''[Buckman, 2018]''' Buckman, J., Roy, A., Raffel, C. and Goodfellow, I., 2018. Thermometer encoding: One hot way to resist adversarial examples.<br />
#'''[Guo, 2018]''' Guo, C., Rana, M., Cisse, M. and van der Maaten, L., 2017. Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117.<br />
#'''[Xie, 2018]''' Xie, C., Wang, J., Zhang, Z., Ren, Z. and Yuille, A., 2017. Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991.<br />
#'''[Song, 2018]''' Song, Y., Kim, T., Nowozin, S., Ermon, S. and Kushman, N., 2017. Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766.<br />
#'''[Szegedy, 2013]''' Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. and Fergus, R., 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.<br />
#'''[Samangouei, 2018]''' Samangouei, P., Kabkab, M. and Chellappa, R., 2018. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605.<br />
#'''[van den Oord, 2016]''' van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O. and Graves, A., 2016. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems (pp. 4790-4798).<br />
#'''[Athalye, 2017]''' Athalye, A. and Sutskever, I., 2017. Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397.<br />
#'''[Ma, 2018]''' Ma, Xingjun, Bo Li, Yisen Wang, Sarah M. Erfani, Sudanthi Wijewickrema, Michael E. Houle, Grant Schoenebeck, Dawn Song, and James Bailey. "Characterizing adversarial subspaces using local intrinsic dimensionality." arXiv preprint arXiv:1801.02613 (2018).<br />
# '''[Na, 2018]''' Na, T., Ko, J.H. and Mukhopadhyay, S., 2017. Cascade Adversarial Machine Learning Regularized with a Unified Embedding. arXiv preprint arXiv:1708.02582.<br />
# '''[Papernot et al., 2017]''' Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., and Swami, A. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, ASIA CCS ’17, pp. 506–519, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-4944-4.<br />
# '''[Tramer et al., 2018]''' Tramer, F., Kurakin, A., Papernot, N., Goodfellow, I., Boneh, D., and McDaniel, P. Ensemble adversarial training: Attacks and defenses. International Conference on Learning Representations, 2018.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Obfuscated_Gradients_Give_a_False_Sense_of_Security_Circumventing_Defenses_to_Adversarial_Examples&diff=42131Obfuscated Gradients Give a False Sense of Security Circumventing Defenses to Adversarial Examples2018-11-30T22:22:01Z<p>Z43ma: </p>
<hr />
<div>= Introduction =<br />
Over the past few years, neural network models have been the source of major breakthroughs in a variety of computer vision problems. However, these networks have been shown to be susceptible to adversarial attacks. In these attacks, small, humanly imperceptible changes are made to images (that are originally correctly classified), which cause these models to misclassify them with high confidence. These attacks pose a major threat that needs to be addressed before these systems can be deployed on a large scale, especially in safety-critical scenarios. <br />
<br />
The seriousness of this threat has generated major interest in both the design of these attacks and the defenses against them. Recently, many new defenses have been proposed that claim robustness against iterative white-box adversarial attacks. This result is somewhat surprising, given that iterative white-box attacks are one of the most difficult classes of adversarial attacks to defend against. In this paper, the authors identify a common flaw, masked gradients, in many of these defenses that causes them to ''appear'' to achieve a high accuracy on adversarial images. This flaw is so prevalent that 7 out of the 9 defenses proposed at the ICLR 2018 conference were found to exhibit it. The authors develop three attacks, specifically targeting masked gradients, and show that the actual accuracy of these defenses is much lower than claimed. In fact, the majority of these defenses were found to be ineffective against true iterative white-box attacks.<br />
<br />
= Methodology =<br />
<br />
The paper assumes a lot of familiarity with adversarial attack literature. The section below briefly explains some key concepts.<br />
<br />
== Background ==<br />
<br />
==== Adversarial Images Mathematically ====<br />
Given an image <math>x</math> and a classifier <math>f(x)</math>, an adversarial image <math>x'</math> satisfies two properties:<br />
# <math>D(x,x') < \epsilon </math><br />
# <math>c(x') \neq c^*(x) </math><br />
<br />
Where <math>D</math> is some distance metric, <math>\epsilon </math> is a small constant, <math>c(x')</math> is the output ''class'' predicted by the model, and <math>c^*(x)</math> is the true class for input x. In words, the adversarial image is a small distance from the original image, but the classifier classifies it incorrectly.<br />
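<br />
The two conditions can be written as a simple predicate. This is a minimal sketch, assuming an <math>\ell_{\infty}</math> distance and a toy threshold classifier; the names <code>is_adversarial</code> and <code>classify</code> and all values are hypothetical.<br />

```python
def linf(x, x_adv):
    """l_inf distance between two inputs given as lists of floats."""
    return max(abs(a - b) for a, b in zip(x, x_adv))

def is_adversarial(x, x_adv, classify, true_class, distance, eps):
    """Check D(x, x') < eps and c(x') != c*(x)."""
    return distance(x, x_adv) < eps and classify(x_adv) != true_class

def classify(v):
    """Toy classifier: class 1 if the features sum to more than 1, else class 0."""
    return int(sum(v) > 1.0)

x = [0.6, 0.5]             # correctly classified as class 1
close_flip = [0.55, 0.44]  # small change that flips the prediction
far_flip = [0.1, 0.1]      # flips the prediction, but is far from x

a = is_adversarial(x, close_flip, classify, 1, linf, eps=0.1)
b = is_adversarial(x, far_flip, classify, 1, linf, eps=0.1)
```

Only the first candidate counts as adversarial: the second one also changes the prediction, but it violates the distance constraint.<br />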
<br />
==== Adversarial Attacks Terminology ====<br />
#Adversarial attacks can be either '''black-box''' or '''white-box'''. In black-box attacks, the attacker has access to the network output only, while white-box attackers have full access to the network, including its gradients, architecture and weights. This makes white-box attackers much more powerful. Given access to gradients, white-box attacks use back-propagation to modify the inputs (as opposed to the weights) with respect to the loss function.<br />
#In '''untargeted''' attacks, the objective is to ''maximize'' the loss of the true class, <math>x'=x \mathbf{+} \lambda(sign(\nabla_xL(x,c^*(x))))</math>. In '''targeted''' attacks, the objective is to ''minimize'' the loss for a target class <math>c^t(x)</math> that is different from the true class, <math>x'=x \mathbf{-} \lambda(sign(\nabla_xL(x,c^t(x))))</math>. Here, <math>\nabla_xL()</math> is the gradient of the loss function with respect to the input, <math>\lambda</math> is a small step size and <math>sign()</math> is the element-wise sign of the gradient.<br />
# An attacker may be allowed to use a single step of back-propagation ('''single step''') or multiple ('''iterative''') steps. Iterative attackers can generate more powerful adversarial images. Iterative attacks are typically bounded by a distance measure between the original and the perturbed image.<br />
<br />
In this paper, the authors focus on the attacks that are hardest to defend against: white-box, iterative attacks, both targeted and untargeted.<br />
<br />
== Obfuscated Gradients ==<br />
<br />
If gradients are masked, they cannot be followed to generate adversarial images. Gradient masking is known to be an incomplete defense against adversarial images [Papernot et al., 2017; Tramer et al., 2018]. A defense method may appear to provide robustness when, in reality, the gradients in the network simply cannot be followed to generate strong adversarial images. Adversarial images generated from these networks are much weaker and, when used to evaluate model robustness, give a false sense of security against adversarial attacks. Obfuscated gradients are the special case in which the design of the defense itself inevitably leads to gradient masking.<br />
<br />
Some defenses break gradient descent deliberately; others may do so unintentionally. Some indicators of broken gradient descent are as follows:<br />
<br />
#One-step attacks perform better than iterative attacks. Iterative attacks are strictly stronger, so this should not happen. If single-step methods work better, it is a sign that the iterative attack is getting stuck in a local minimum.<br />
#Black-box attacks work better than white-box attacks. The black-box threat model is a strict subset of the white-box threat model, so white-box attacks should perform better. When a defense obfuscates gradients, black-box attacks (which do not use gradients) often perform better.<br />
#Unbounded attacks do not reach 100% success. With unbounded distortion, any classifier should eventually fail. An attack that does not achieve this should be improved (i.e., it is a weak attack, not necessarily a strong defense).<br />
#Random sampling finds adversarial examples. Random sampling (e.g., randomly sampling <math>10^5</math> or more points within some <math>\epsilon</math>-ball) should not find adversarial examples when gradient-based attacks do not.<br />
#Increasing the distortion bound does not increase success. Usually, a monotonically increasing attack success rate with increasing distortion bound is expected.<br />
<br />
In the defenses proposed at ICLR 2018, there are three ways in which defenses obfuscate gradients:<br />
<br />
# '''Shattered gradients''': Non-differentiable operations are introduced into the model, causing the gradient to be nonexistent or incorrect. They are introduced by operations for which following the gradient does not maximize the classification loss globally. <br />
# '''Stochastic gradients''': A stochastic process is added to the model at test time, causing the gradients to become randomized. They are introduced either by randomly transforming inputs before feeding them to the classifier or by randomly permuting the network itself. <br />
# '''Vanishing gradients''': Very deep neural networks or those with recurrent connections are used. Because of the vanishing or exploding gradient problem common in these deep networks, the effective gradients at the input are too small (or too unstable) to be useful. They are introduced by using multiple iterations of neural network evaluation, where the output of one network is fed as the input to the next.<br />
<br />
'''Detecting Obfuscated Gradients''':<br />
<br />
The authors propose a number of tests that might help detect when a defense relies on obfuscated gradients.<br />
<br />
#Iterative attacks should work better than single-step attacks, since iterative attacks are strictly stronger than single-step attacks.<br />
#White-box attacks should perform better than black-box attacks, since the black-box threat model is a strict subset of the white-box threat model.<br />
#Attacks with an unbounded distortion metric (e.g. <math>\ell_2</math> norm) should find adversarial examples with 100% success.<br />
#Optimization-based attacks should perform better than brute-force sampling of nearby inputs (sampling within an <math>\epsilon</math>-ball).<br />
<br />
These tests may not cover all cases of obfuscated gradients, but they indicate when intuitive properties start to break down. All defenses with obfuscated gradients discussed by the authors fail at least one test.<br />
<br />
== The Attacks ==<br />
To circumvent these gradient masking techniques, the authors propose:<br />
# '''Backward Pass Differentiable Approximation (BPDA)''': For defenses that introduce non-differentiable components, the authors replace each such component with an approximate function that is differentiable on the backward pass. In a white-box setting, the attacker has full access to any added non-linear transformation and can find its approximation. <br />
# '''Expectation over Transformation [Athalye, 2017]''': For defenses that add some form of test time randomness, the authors propose to use expectation over transformation technique in the backward pass. Rather than moving along the gradient every step, several gradients are sampled and the step is taken in the average direction. This can help with any stochastic misdirection from individual gradients. The technique is similar to using mini-batch gradient descent but applied in the construction of adversarial images.<br />
# '''Re-parameterize the exploration space''': For very deep networks that rely on vanishing or exploding gradients, the authors propose to re-parameterize and search over the range where the gradient does not explode/vanish.<br />
They assume that, given a classifier <math display = "inline">f(g(x))</math>, <math display = "inline">g(·)</math> performs some optimization loop to transform the input x to a new input <math display = "inline">\hat x</math>. Oftentimes, differentiating through <math display = "inline">g(·)</math> yields exploding or vanishing gradients.<br />
<br />
To resolve this, they make a change-of-variable <math display = "inline">x = h(z)</math> for some function <math display = "inline">h(·)</math> such that <math display = "inline">g(h(z)) = h(z)</math> for all z, but <math display = "inline">h(·)</math> is differentiable. This allows them to compute gradients through f(h(z)) and hence circumvent the defense.<br />
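<br />
The BPDA idea from the list above can be illustrated with a toy example. This is only a minimal sketch under stated assumptions: the "defense" is a hypothetical rounding step, the "classifier loss" is a made-up quadratic, and for simplicity the loop descends the loss (an actual attack would ascend the classification loss instead). None of the names below come from the paper.<br />

```python
def quantize(x):
    """Hypothetical non-differentiable 'defense': round to the nearest integer."""
    return float(round(x))


def loss(z, target=3.0):
    """Toy stand-in for the loss computed on the defended input z."""
    return (z - target) ** 2


def loss_grad(z, target=3.0):
    """Derivative of the toy loss with respect to z."""
    return 2.0 * (z - target)


def bpda_gradient(x):
    """Forward pass uses quantize; backward pass treats quantize as the identity."""
    z = quantize(x)             # real, non-differentiable forward pass
    return loss_grad(z) * 1.0   # identity derivative replaces the true one (0 almost everywhere)


def bpda_descent(x0, step=0.1, iters=100):
    """Follow the approximate gradient; the true gradient would never move x."""
    x = x0
    for _ in range(iters):
        x -= step * bpda_gradient(x)
    return x
```

Starting from <code>bpda_descent(0.0)</code>, the iterate keeps moving until the quantized value reaches the target, even though the exact derivative of the rounding step is zero almost everywhere.<br />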
<br />
= Main Results =<br />
[[File:Summary_Table.png|600px|center]]<br />
<br />
The table above summarizes the results of their attacks. Attacks are mounted on the same dataset each defense targeted. If multiple datasets were used, attacks were performed on the largest one. Two different distance metrics (<math>\ell_{\infty}</math> and <math>\ell_{2}</math>) were used in the construction of adversarial images. Distance metrics specify how much an adversarial image can vary from an original image. For <math>\ell_{\infty}</math> adversarial images, each pixel is allowed to vary by a maximum amount. For example, <math>\ell_{\infty}=0.031</math> specifies that each pixel can vary by <math>256*0.031 \approx 8</math> intensity levels from its original value. <math>\ell_{2}</math> distances specify the magnitude of the total distortion allowed over all pixels. For MNIST and CIFAR-10, untargeted adversarial images were constructed using the entire test set, while for ImageNet, 1000 test images were randomly selected and used to generate targeted adversarial images. <br />
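<br />
The two distance metrics can be made concrete with a small sketch. It assumes images are flattened lists of pixel values in [0, 1] and that 8-bit images use intensity levels 0 to 255; the example pixel values are invented for illustration.<br />

```python
import math


def linf_distance(a, b):
    """Largest absolute per-pixel difference between two images."""
    return max(abs(p - q) for p, q in zip(a, b))


def l2_distance(a, b):
    """Euclidean norm of the total distortion over all pixels."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


# Hypothetical three-pixel images with values in [0, 1].
original = [0.20, 0.50, 0.90]
adversarial = [0.23, 0.47, 0.90]

# An l-infinity bound of 0.031 corresponds to roughly 8 intensity levels.
levels = round(0.031 * 255)
```

Here the adversarial image stays inside the <math>\ell_{\infty}=0.031</math> ball because no single pixel moves by more than 0.03, while the <math>\ell_{2}</math> distance aggregates the change over all pixels.<br />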
<br />
Standard models were used in evaluating the accuracy of defense strategies under the attacks:<br />
# MNIST: 5-layer Convolutional Neural Network (99.3% top-1 accuracy)<br />
# CIFAR-10: Wide-Resnet (95.0% top-1 accuracy)<br />
# ImageNet: InceptionV3 (78.0% top-1 accuracy)<br />
<br />
The last column shows the accuracy each defense method achieved on the adversarial test set. Except for [Madry, 2018], all defense methods achieved an accuracy below 10%, and most were reduced to 0%. The result for [Samangouei, 2018] (double asterisk) shows that the attack was not as successful there. The authors claim that this is a result of implementation imperfections and that, in theory, the defense can be circumvented using their proposed method.<br />
<br />
==== The defense that worked - Adversarial Training [Madry, 2018] ====<br />
<br />
As a defense mechanism, [Madry, 2018] propose training neural networks with adversarial images. Although this approach was previously known [Szegedy, 2013], their formulation sets up the problem in a more systematic way using a min-max formulation:<br />
\begin{align}<br />
\theta^* = \arg \min_{\theta} \mathbb{E}_{x} \bigg[ \max_{\delta \in [-\epsilon,\epsilon]} L(x+\delta,y;\theta) \bigg] <br />
\end{align}<br />
<br />
where <math>\theta</math> is the parameter of the model, <math>\theta^*</math> is the optimal set of parameters and <math>\delta</math> is a small perturbation to the input image <math>x</math> and is bounded by <math>[-\epsilon,\epsilon]</math>. <br />
<br />
Training proceeds in the following way. For each clean input image, a distorted version of the image is found by approximately solving the inner maximization problem for a fixed number of iterations. After each gradient step, the perturbation is projected back into the allowed range (projected gradient descent). Next, the classification problem is solved by taking a step on the outer minimization problem using the distorted images.<br />
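<br />
The training loop described above can be sketched on a toy model. This is a minimal sketch, assuming a linear classifier with logistic loss and labels in {-1, +1} rather than the deep networks used by [Madry, 2018]; the step sizes and function names are illustrative choices, not values from the paper.<br />

```python
import math


def logistic_loss(w, x, y):
    """L(x, y; w) = log(1 + exp(-y * w.x)) for a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return math.log(1.0 + math.exp(-margin))


def loss_coeff(w, x, y):
    """Shared factor of the loss gradient: dL/d(w.x) = -y / (1 + exp(margin))."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return -y / (1.0 + math.exp(margin))


def sign(v):
    return (v > 0) - (v < 0)


def pgd_perturbation(w, x, y, eps=0.1, step=0.03, iters=10):
    """Inner maximization: signed gradient ascent on the loss, projected onto [-eps, eps]."""
    delta = [0.0] * len(x)
    for _ in range(iters):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        coeff = loss_coeff(w, x_adv, y)
        grad_x = [coeff * wi for wi in w]                  # dL/dx
        delta = [min(eps, max(-eps, di + step * sign(gi)))  # projection step
                 for di, gi in zip(delta, grad_x)]
    return delta


def adversarial_training_step(w, x, y, lr=0.5, eps=0.1):
    """Outer minimization: one gradient descent step on the adversarial example."""
    delta = pgd_perturbation(w, x, y, eps=eps)
    x_adv = [xi + di for xi, di in zip(x, delta)]
    coeff = loss_coeff(w, x_adv, y)
    return [wi - lr * coeff * xi for wi, xi in zip(w, x_adv)]  # dL/dw = coeff * x_adv
```

The perturbation never leaves the <math>[-\epsilon,\epsilon]</math> box, and each outer step lowers the loss on the worst-case input found by the inner loop.<br />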
<br />
This approach was shown to remain resilient to the adversarial attacks considered by the authors.<br />
<br />
==== How to check for Obfuscated Gradients ====<br />
For future defense proposals, it is recommended to avoid relying on masked gradients. To assist with this, the authors propose a set of warning signs that can help identify whether a defense is relying on masked gradients:<br />
# Weaker one-step attacks perform better than iterative attacks.<br />
# Black-box attacks find stronger adversarial images than white-box attacks.<br />
# Unbounded iterative attacks do not reach 100% success.<br />
# Random brute-force attempts are better than gradient-based methods at finding adversarial images.<br />
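<br />
The checklist above can be turned into a simple sanity check over measured attack success rates. This is only an illustrative sketch: the dictionary keys and the function name are hypothetical, and success rates are assumed to be fractions in [0, 1].<br />

```python
def obfuscation_warnings(rates):
    """Return the checklist items violated by a dict of attack success rates."""
    warnings = []
    if rates["one_step"] > rates["iterative"]:
        warnings.append("one-step attack beats iterative attack")
    if rates["black_box"] > rates["white_box"]:
        warnings.append("black-box attack beats white-box attack")
    if rates["unbounded"] < 1.0:
        warnings.append("unbounded attack does not reach 100% success")
    if rates["random_search"] > rates["gradient_based"]:
        warnings.append("random search beats gradient-based attack")
    return warnings
```

An empty result does not prove that a defense is sound; it only means none of these particular warning signs was triggered.<br />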
<br />
= Detailed Results =<br />
<br />
As a case study for evaluating the prevalence of obfuscated gradients, the authors studied the ICLR 2018 non-certified defenses that argue robustness in a white-box threat model. Each of these defenses argues a high robustness to adaptive, white box attacks. It is reported that seven of these nine defenses depend on this phenomenon, and the authors demonstrate that their techniques can completely circumvent six of those (and partially circumvent one) that depend on obfuscated gradients.<br />
<br />
== Non-obfuscated Gradients ==<br />
<br />
==== Cascade Adversarial Training, [Na, 2018] ====<br />
'''Defense''': Similar to the method of [Madry, 2018], the authors of [Na, 2018] propose adversarial training. The main difference is that instead of using iterative methods to generate adversarial examples at each mini-batch, a separate model is first trained and used to generate adversarial images. These adversarial images are used to augment the train set of another model.<br />
<br />
'''Attack''': The authors found that this technique does not use obfuscated gradients. They were not able to reduce the performance of this method. However, they point out that the claimed accuracy is much lower (15%) compared with [Madry, 2018] under the same perturbation setting.<br />
<br />
== Gradient Shattering ==<br />
<br />
==== Thermometer Coding, [Buckman, 2018] ====<br />
'''Defense''': Inspired by the observation that neural networks learn linear boundaries between classes [Goodfellow, 2014], [Buckman, 2018] sought to break this linearity by explicitly adding a highly non-linear transform at the input of their model. The non-linear transformation they chose was quantizing inputs to binary vectors, a quantization termed thermometer encoding.<br />
<br />
Given an image, for each pixel value <math>x_{i,j,c}</math>, the <math>k</math>-th bit of an <math>l</math>-dimensional thermometer code is given by:<br />
\begin{align}<br />
\tau(x_{i,j,c})_k = \begin{cases}<br />
1 & \text{if } x_{i,j,c} > \dfrac{k}{l} \\<br />
0 & \text{otherwise}<br />
\end{cases}<br />
\end{align}<br />
Here it is assumed that <math>x_{i,j,c} \in [0, 1] </math> and that <math>i, j, c</math> are the row, column and channel indices of the pixel, respectively. This encoding is like one-hot encoding, except that every bit whose threshold <math>k/l</math> lies below the pixel value (not just one bit) is set to 1. This quantization technique preserves pairwise ordering between pixels.<br />
<br />
On CIFAR-10, the model achieved 50% accuracy against <math>\ell_\infty</math> adversarial images with <math>\epsilon=0.031</math>.<br />
<br />
'''Attack''': The authors attack this model using their BPDA approach. Given the non-linear transformation performed in the forward pass, <math>\tau(x)</math>, they develop a differentiable counterpart,<br />
\begin{align}<br />
\hat{\tau}(x_{i,j,c})_k = \min \left( \max \left( x_{i,j,c} - \frac{k}{l}, 0 \right), 1 \right)<br />
\end{align}<br />
and use it in place of <math>\tau(x)</math> on the backward pass. With their modifications they were able to bring the accuracy of the model down to 0%.<br />
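<br />
The encoding and its differentiable counterpart can be sketched for a single pixel value. The sketch assumes that the bits are indexed <math>k = 1, \dots, l</math> and that the counterpart is the clipped difference <math>\min(\max(x - k/l, 0), 1)</math>; the function names are illustrative.<br />

```python
def thermometer_encode(x, levels):
    """tau(x)_k = 1 if x > k / levels else 0, for k = 1..levels (assumed indexing)."""
    return [1 if x > k / levels else 0 for k in range(1, levels + 1)]


def thermometer_soft(x, levels):
    """Differentiable counterpart: min(max(x - k / levels, 0), 1)."""
    return [min(max(x - k / levels, 0.0), 1.0) for k in range(1, levels + 1)]


def thermometer_soft_grad(x, levels):
    """Derivative of the counterpart w.r.t. x: 1 where the clip is inactive, else 0."""
    return [1.0 if 0.0 < x - k / levels < 1.0 else 0.0
            for k in range(1, levels + 1)]
```

In a BPDA-style attack, <code>thermometer_encode</code> would be used on the forward pass and <code>thermometer_soft_grad</code> on the backward pass, so that useful gradient information reaches the input exactly for the bits that are switched on.<br />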
<br />
==== Input Transformation, [Guo, 2018] ====<br />
'''Defense''': [Guo, 2018] investigated the effect of different input transformations on robustness to adversarial images. In particular, they found that two techniques provided the greatest resistance: total variance minimization and image quilting. Total variance minimization is a technique that removes high-frequency noise while preserving legitimate edges (good high-frequency components). In image quilting, a large database of image patches from clean images is collected. At test time, input patches that contain a lot of noise are replaced with similar but clean patches from the database.<br />
<br />
Both techniques remove perturbations from adversarial images, which provides some robustness to adversarial attacks. The best model achieved 60% accuracy on adversarial images with <math>l_{2}=0.05</math> perturbations. However, both approaches are non-differentiable and contain test-time randomness, as the modifications made are input dependent. As a result, gradients cannot be reliably propagated back to the input.<br />
<br />
'''Attack''': The authors used the BPDA attack where the input transformations were replaced by an identity function. They were able to bring the accuracy of the model down to 0% under the same type of adversarial attacks.<br />
<br />
==== Local Intrinsic Dimensionality, [Ma, 2018] ====<br />
'''Defense''': Local intrinsic dimensionality (LID) is a distance-based metric that measures the similarity between points in a high dimensional space. Given a set of points, let the distance between sample <math>x</math> and its <math>i</math>-th nearest neighbor be <math>r_i(x)</math>. Then the LID under the chosen distance metric is given by<br />
<br />
\begin{align}<br />
LID(x) = - \bigg{(} \frac{1}{k}\sum^k_{i=1} \log \frac{r_i(x)}{r_k(x)} \bigg{)}^{-1}<br />
\end{align}<br />
where <math>k</math> is the number of nearest neighbors considered and <math>r_k(x)</math> is the maximum distance from <math>x</math> to any of these <math>k</math> neighbors. <br />
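<br />
The LID estimate above can be computed directly from a list of neighbor distances. This is a minimal sketch that assumes the distances are positive and not all equal (otherwise the mean logarithm is zero and the estimate is undefined); the function name is illustrative.<br />

```python
import math


def lid_estimate(distances):
    """LID(x) = -( (1/k) * sum_i log(r_i / r_k) )^(-1) for k neighbor distances."""
    r = sorted(distances)
    r_k = r[-1]  # distance to the farthest of the k neighbors
    mean_log = sum(math.log(r_i / r_k) for r_i in r) / len(r)
    return -1.0 / mean_log
```

For example, with neighbor distances 1, 2 and 4 the mean logarithm equals <math>-\log 2</math>, so the estimate is <math>1/\log 2</math>.<br />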
<br />
First, <math>L_2</math> distances were computed for all training and adversarial images. Next, the LID scores of the training and adversarial images were calculated. It was found that LID scores for adversarial images were significantly larger than those of clean images. Based on these results, a separate classifier was created that can be used to detect adversarial inputs. [Ma, 2018] claim that this is not a defense method, but a method to study the properties of adversarial images.<br />
<br />
'''Attack''': Instead of attacking this method, the authors show that it is not able to detect, and is therefore vulnerable to, attacks of the [Carlini and Wagner, 2017a] variety.<br />
<br />
== Stochastic Gradients ==<br />
<br />
==== Stochastic Activation Pruning, [Dhillon, 2018] ====<br />
'''Defense''': [Dhillon, 2018] use test-time randomness in their model to guard against adversarial attacks. Because adversarial perturbations resemble noise, randomly dropping activations can decrease their collective impact on the classifier. Within a layer, the activations of nodes are randomly dropped with a probability proportional to their absolute value. The remaining activations are scaled up to preserve accuracy. This is akin to test-time dropout. This technique was found to reduce accuracy slightly on clean images but to improve performance on adversarial images.<br />
<br />
'''Attack''': The authors used the expectation over transformation attack to get useful gradients out of the model. With their attack, they were able to reduce the accuracy of this method down to 0% on CIFAR-10.<br />
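<br />
The expectation over transformation idea can be illustrated with a toy stochastic model. This sketch assumes a hypothetical randomized loss <math>(t x - 1)^2</math> with <math>t</math> drawn uniformly from [0.5, 1.5] in place of a real randomized defense; the attack averages many sampled gradients instead of trusting a single noisy one.<br />

```python
import random


def sample_gradient(x, rng):
    """Gradient of the toy loss (t*x - 1)^2 for one random transformation t."""
    t = rng.uniform(0.5, 1.5)        # stands in for test-time randomness
    return 2.0 * t * (t * x - 1.0)   # d/dx of (t*x - 1)^2


def eot_gradient(x, n_samples=10000, seed=0):
    """Monte Carlo estimate of the expected gradient over the random transformation."""
    rng = random.Random(seed)
    total = sum(sample_gradient(x, rng) for _ in range(n_samples))
    return total / n_samples
```

At <code>x = 0</code> the individual gradients range from -3 to -1, while their average is close to the expected value of -2, which gives a stable direction for the attack step.<br />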
<br />
==== Mitigation Through Randomization, [Xie, 2018] ====<br />
'''Defense''': [Xie, 2018] add a randomization layer to their model to help defend against adversarial attacks. For an input image of size [299,299], the image is first randomly rescaled to a size <math>r \in [299,331]</math>. Next, the image is zero-padded to fix the dimension of the modified input. This modified input is then fed into a regular classifier. [Xie, 2018] claim that this strategy can provide an accuracy of 32.8% against ensemble attack patterns (fixed distortions, many of which are picked randomly). Because of the introduced randomness, they claim the model builds some robustness to other types of attacks as well.<br />
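<br />
The randomization layer can be sketched on a small grayscale image stored as a list of rows. This is an illustrative sketch only: it assumes square single-channel images, nearest-neighbor resizing, and a randomly chosen padding offset, none of which is specified in the summary above.<br />

```python
import random


def resize_nearest(image, new_size):
    """Nearest-neighbor resize of a square image to new_size x new_size."""
    old = len(image)
    return [[image[(i * old) // new_size][(j * old) // new_size]
             for j in range(new_size)]
            for i in range(new_size)]


def random_resize_and_pad(image, min_size, out_size, rng):
    """Resize to a random size r in [min_size, out_size], then zero-pad to out_size."""
    r = rng.randint(min_size, out_size)      # inclusive on both ends
    resized = resize_nearest(image, r)
    top = rng.randint(0, out_size - r)       # assumed: random placement inside the canvas
    left = rng.randint(0, out_size - r)
    padded = [[0] * out_size for _ in range(out_size)]
    for i in range(r):
        for j in range(r):
            padded[top + i][left + j] = resized[i][j]
    return padded
```

For the setting described above one would call it with <code>min_size=299</code> and <code>out_size=331</code>; every call yields a differently scaled and shifted input, which is what randomizes the gradients seen by an attacker.<br />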
<br />
'''Attack''': The EOT method was used to build adversarial images to attack this model. With their attack, the authors were able to bring the accuracy of this model down to 0% using <math>L_{\infty}(\epsilon=0.031)</math> perturbations.<br />
<br />
== Vanishing and Exploding Gradients ==<br />
<br />
==== PixelDefend, [Song, 2018] ====<br />
'''Defense''': [Song, 2018] argue that adversarial images lie in low probability regions of the data manifold. Therefore, one way to handle adversarial attacks is to project them back into the high probability regions before feeding them into a classifier. They chose to do this by using a generative model (PixelCNN) in a denoising capacity. A PixelCNN model directly estimates the conditional probability of generating an image pixel by pixel [van den Oord, 2016]:<br />
<br />
\begin{align}<br />
p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})<br />
\end{align}<br />
<br />
This model was chosen because of its long iterative generation process. In the backward pass, following the gradient all the way to the input is not possible because of the vanishing/exploding gradient problem of deep networks. The proposed model was able to obtain an accuracy of 46% on CIFAR-10 images with <math>l_{\infty} (\epsilon=0.031) </math> perturbations.<br />
<br />
'''Attack''': The model was attacked using the BPDA technique, where back-propagation through the PixelCNN was replaced with an identity function. With this approach, the authors were able to bring the accuracy down to 9% under the same kind of perturbations.<br />
<br />
==== Defense-GAN, [Samangouei, 2018] ====<br />
<br />
Before classifying the samples, Defense-GAN projects them onto the data manifold using a GAN. The intuition behind this approach is similar to that of PixelDefend, except that it uses a GAN instead of a PixelCNN.<br />
<br />
The authors evaluated on MNIST because the defense is not argued to be secure on CIFAR-10. They found that adversarial examples exist on the generator manifold and that such an example can be constructed. A perfect projector would not modify this example; however, the imperfect gradient descent approach used for projection does not perfectly preserve points on the manifold. Therefore, the authors attacked Defense-GAN using BPDA, but achieved only a 45% success rate.<br />
<br />
<br />
= Conclusion =<br />
In this paper, it was found that gradient masking is a common flaw in many defenses claiming robustness against white-box adversarial attacks. This leads to a perceived robustness against adversarial attacks, when in reality it merely weakens the construction of adversarial images. The authors develop three attacks that can overcome gradient masking. With their attacks, they found that the actual robustness of 7 of the 9 defenses proposed at ICLR 2018 is significantly lower than claimed. In fact, many defenses were found to be completely ineffective.<br />
<br />
Future work that can come out of this paper includes avoiding reliance on obfuscated gradients for perceived robustness and using the evaluation approach to detect when the attack occurs. Early categorization of attacks using supervised techniques can also help in the critical evaluation of incoming data.<br />
<br />
= Critique =<br />
# The third attack method, reparameterization of the input distortion search space, was presented very briefly and at a very high level. Moreover, the one defense proposal they chose to use it against, [Samangouei, 2018], proved to be resilient against the attack. The authors had to resort to one of their other methods to circumvent the defense.<br />
# The BPDA and reparameterization attacks require intrinsic knowledge of the networks. This information is not likely to be available to external users of a network. Most likely, the use-case for these attacks will be in-house to develop more robust networks. This also means that it is still possible to guard against adversarial attack using gradient masking techniques, provided the details of the network are kept secret. <br />
## A notable exception to this case could be applications that are built using open-source (or even published) models that are paired with model-agnostic defense mechanisms. For example, a ResNet-50 using the model-agnostic 'input transformations' technique by [Guo, 2018] may be used in many different image classification tasks, but could still be successfully attacked using BPDA. <br />
# The BPDA algorithm requires replacing a non-linear part of the model with a differentiable approximation. Since different networks are likely to use different transformations, this technique is not plug-and-play. For each network, the attack needs to be manually constructed.<br />
# In general, the research field of adversarial attack would benefit from having an all-encompassing benchmark or dataset, so that the various approaches can be objectively compared and evaluated.<br />
<br />
= Other Sources =<br />
# Their re-implementation of each of the defenses and implementations of the attacks are available [https://github.com/anishathalye/obfuscated-gradients here].<br />
<br />
= References =<br />
#'''[Madry, 2018]''' Madry, A., Makelov, A., Schmidt, L., Tsipras, D. and Vladu, A., 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.<br />
#'''[Buckman, 2018]''' Buckman, J., Roy, A., Raffel, C. and Goodfellow, I., 2018. Thermometer encoding: One hot way to resist adversarial examples.<br />
#'''[Guo, 2018]''' Guo, C., Rana, M., Cisse, M. and van der Maaten, L., 2017. Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117.<br />
#'''[Xie, 2018]''' Xie, C., Wang, J., Zhang, Z., Ren, Z. and Yuille, A., 2017. Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991.<br />
#'''[Song, 2018]''' Song, Y., Kim, T., Nowozin, S., Ermon, S. and Kushman, N., 2017. Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766.<br />
#'''[Szegedy, 2013]''' Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. and Fergus, R., 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.<br />
#'''[Samangouei, 2018]''' Samangouei, P., Kabkab, M. and Chellappa, R., 2018. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605.<br />
#'''[van den Oord, 2016]''' van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O. and Graves, A., 2016. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems (pp. 4790-4798).<br />
#'''[Athalye, 2017]''' Athalye, A. and Sutskever, I., 2017. Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397.<br />
#'''[Ma, 2018]''' Ma, Xingjun, Bo Li, Yisen Wang, Sarah M. Erfani, Sudanthi Wijewickrema, Michael E. Houle, Grant Schoenebeck, Dawn Song, and James Bailey. "Characterizing adversarial subspaces using local intrinsic dimensionality." arXiv preprint arXiv:1801.02613 (2018).<br />
# '''[Na, 2018]''' Na, T., Ko, J.H. and Mukhopadhyay, S., 2017. Cascade Adversarial Machine Learning Regularized with a Unified Embedding. arXiv preprint arXiv:1708.02582.<br />
# '''[Papernot et al., 2017]''' Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., and Swami, A. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, ASIA CCS ’17, pp. 506–519, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-4944-4.<br />
# '''[Tramer et al., 2018]''' Tramer, F., Kurakin, A., Papernot, N., Goodfellow, I., Boneh, D., and McDaniel, P. Ensemble adversarial training: Attacks and defenses. International Conference on Learning Representations, 2018.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Unsupervised_Neural_Machine_Translation&diff=42127Unsupervised Neural Machine Translation2018-11-30T22:11:55Z<p>Z43ma: </p>
<hr />
<div>This paper was published in ICLR 2018, authored by Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. Open source implementation of this paper is available [https://github.com/artetxem/undreamt here]<br />
<br />
= Introduction =<br />
The paper presents an unsupervised Neural Machine Translation (NMT) method that uses only monolingual corpora (single language texts). This contrasts with the usual supervised NMT approach, which relies on parallel corpora (aligned text) from the source and target languages being available for training. This problem is important because parallel corpora do not exist for a majority of language pairs, e.g. German-Russian. Often, languages can also suffer from having poor resources for translation (e.g. Basque), which could lead to the problem of the dataset being too small (Koehn & Knowles, 2017).<br />
<br />
Other authors have recently tried to address this problem with approaches that need only a small set of parallel corpora, including pivoting or triangulation techniques [Chen et al., 2017] and semi-supervised approaches [He, 2016]. However, these methods still require a strong cross-lingual signal. The proposed method eliminates the need for cross-lingual information altogether and relies solely on monolingual data. It builds upon recent work on unsupervised cross-lingual embeddings by Artetxe et al., 2017 and Zhang et al., 2017.<br />
<br />
The general approach of the methodology is to:<br />
<br />
# Use monolingual corpora in the source and target languages to learn single language word embeddings for both languages separately.<br />
# Align the 2 sets of word embeddings into a single cross lingual (language independent) embedding.<br />
Then iteratively perform:<br />
# Train an encoder-decoder model to reconstruct noisy versions of sentences in both source and target languages separately. The model uses a single encoder and different decoders for each language. The encoder uses cross lingual word embedding.<br />
# Tune the decoder in each language by back-translating between the source and target language.<br />
<br />
= Background =<br />
<br />
===Word Embedding Alignment===<br />
<br />
The paper uses word2vec [Mikolov, 2013] to convert each monolingual corpus to vector embeddings. [Mikolov, 2013] improve the continuous Skip-gram model for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. These embeddings have been shown to contain contextual and syntactic features independent of language, so, in theory, there could exist a linear map that maps the embeddings from language L1 to language L2. <br />
<br />
Figure 1 shows an example of aligning the word embeddings in English and French.<br />
<br />
[[File:Figure1_lwali.png|frame|400px|center|Figure 1: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]<br />
<br />
Most cross-lingual word embedding methods use bilingual signals in the form of parallel corpora. Usually, the embedding mapping methods train the embeddings in different languages using monolingual corpora, then use a linear transformation to map them into a shared space based on a bilingual dictionary.<br />
<br />
The paper uses the methodology proposed by [Artetxe, 2017] to align cross-lingual embeddings in an unsupervised manner and without parallel data. Without going into the details, the general approach of [Artetxe, 2017] is to start from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.) and to iteratively learn the mapping between the 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration. This is in contrast to earlier work, which used dictionaries of a few thousand words.<br />
<br />
===Other related work and inspirations===<br />
====Statistical Decipherment for Machine Translation====<br />
There has been significant work in statistical deciphering techniques (decipherment is the discovery of the meaning of texts written in ancient or obscure languages or scripts) to develop a machine translation model from monolingual data (Ravi & Knight, 2011; Dou & Knight, 2012). These techniques treat the source language as ciphertext (encrypted or encoded information because it contains a form of the original plaintext that is unreadable by a human or computer without the proper cipher for decoding) and model the generation process of the ciphertext as a two-stage process, which includes the generation of the original English sequence and the probabilistic replacement of the words in it. This approach takes advantage of the incorporation of syntactic knowledge of the languages. The use of word embeddings has also shown improvements in statistical decipherment.<br />
<br />
====Low-Resource Neural Machine Translation====<br />
There are also proposals that use techniques other than direct parallel corpora to do NMT. Some use a third intermediate language that is well connected to the source and target languages independently. For example, if we want to translate German into Russian, we can use English as an intermediate language (German-English and then English-Russian) since there are plenty of resources to connect English and other languages. Johnson et al. (2017) show that a multilingual extension of a standard NMT architecture performs reasonably well for language pairs for which no parallel data between the source and target languages was used during training. Firat et al. (2016) and Chen et al. (2017) showed that advanced models such as the teacher-student framework can improve over the baseline of translating through a third intermediate language.<br />
<br />
Other works use monolingual data in combination with scarce parallel corpora. A simple but effective technique is back-translation [Sennrich et al, 2016]. A synthetic parallel corpus is created by back-translating a monolingual corpus in the target language: each sentence is translated into the source language, and the translation is paired with the original sentence.<br />
<br />
The most important contribution to the problem of training an NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directions (e.g. French → English and English → French) and teach each other through reinforcement learning. However, this approach still required a large parallel corpus for a warm start (about 1.2 million sentences), while this paper does not use parallel data.<br />
<br />
= Related Works =<br />
<br />
=== 2.1 UNSUPERVISED CROSS-LINGUAL EMBEDDINGS ===<br />
<br />
A majority of methods for learning cross-lingual word embeddings depend on some bilingual signal at the document level. Embedding mapping methods independently train the embeddings in different languages using monolingual corpora and subsequently learn a linear transformation that maps them to a shared space based on a bilingual dictionary. While the dictionaries used in these earlier works typically contain a few thousand entries, Artetxe et al. (2017) propose a simple self-learning extension that gives comparable results with an automatically generated list of numerals, which is used as a shortcut for practical unsupervised learning.<br />
<br />
=== 2.2 STATISTICAL DECIPHERMENT FOR MACHINE TRANSLATION ===<br />
<br />
A considerable body of work in statistical decipherment techniques treats the source language as ciphertext and models the process by which this ciphertext is generated as a two-stage process involving the generation of the original English sequence and the probabilistic replacement of the words in it. The English generative process is modeled using a standard n-gram language model, and the channel model parameters are estimated using either expectation maximization or Bayesian inference. This approach was shown to benefit from the incorporation of syntactic knowledge of the languages involved (Dou & Knight, 2013; Dou et al., 2015). More in line with the proposed method, the use of word embeddings has also been shown to bring significant improvements in statistical decipherment for machine translation (Dou et al., 2015). Another recently developed method uses a relatively new deep architecture called the sum-product network (Poon & Domingos, 2011) to do machine translation. It is a hybrid model that combines probabilistic modeling and deep architectures. The main advantage of this model is that it has clear semantics and provides great interoperability, and like many other deep architectures, it can be trained using gradient descent. Sum-product networks can be applied to machine translation by modeling translation with Bayes' rule, <math>P(\text{English} \mid \text{French}) = P(\text{French} \mid \text{English}) \, P(\text{English}) / P(\text{French})</math>, where <math>P(\text{English} \mid \text{French})</math> is the probability that an English text corresponds to a given French text, and <math>P(\text{French} \mid \text{English})</math> is the reverse. A sum-product network can be used to model each of these probabilities and thus perform machine translation.<br />
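<br />
The Bayes' rule decomposition mentioned above can be illustrated with a toy noisy-channel decoder. This is a minimal sketch with invented probabilities; the two dictionaries stand in for a language model P(English) and a channel model P(French | English), and the denominator P(French) is dropped because it is the same for every candidate.<br />

```python
def best_translation(french, language_model, channel_model):
    """Return the English candidate maximizing P(french | english) * P(english)."""
    def score(english):
        # Unseen (french, english) pairs get probability 0.
        return channel_model.get((french, english), 0.0) * language_model[english]
    return max(language_model, key=score)


# Hypothetical probabilities, chosen only for illustration.
language_model = {"the cat": 0.6, "cat the": 0.1}
channel_model = {("le chat", "the cat"): 0.5, ("le chat", "cat the"): 0.5}
```

Both candidates explain the French input equally well under this channel model, so the language model decides in favor of the fluent word order.<br />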
<br />
=== 2.3 LOW-RESOURCE NEURAL MACHINE TRANSLATION ===<br />
<br />
A simple yet effective approach is to create a synthetic parallel corpus by back-translating a monolingual corpus in the target language (Sennrich et al., 2016a). At the same time, Currey et al. (2017) showed that training an NMT system to directly copy target language text is also helpful and complementary with back-translation. Finally, Ramachandran et al. (2017) pre-train the encoder and the decoder in language modeling. Another method trains two agents to translate in opposite directions (e.g. French → English and English → French) and makes them teach each other through a reinforcement learning process. This approach still requires a parallel corpus of considerable size for a good start.<br />
<br />
= Methodology =<br />
<br />
The corpora are first preprocessed in a standard way to tokenize and truecase the words. The authors also experimented with an alternate way of tokenizing words by using Byte-Pair Encoding (BPE) [Sennrich, 2016] (byte pair encoding or digram coding is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data). BPE has been shown to improve embeddings of rare words. The vocabulary was limited to the most frequent 50,000 tokens (BPE tokens or words).<br />
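<br />
A single BPE merge step can be sketched as follows. The sketch operates on characters rather than bytes, as is common when BPE is adapted to word segmentation, and the toy vocabulary with its frequencies is invented for illustration.<br />

```python
from collections import Counter


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(vocab, pair):
    """Replace every occurrence of the given pair with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Hypothetical vocabulary: words split into characters, with corpus frequencies.
vocab = {("l", "o", "w"): 5, ("l", "o", "t"): 2, ("n", "e", "w"): 6}
pairs = count_pairs(vocab)
best = max(pairs, key=pairs.get)
new_vocab = merge_pair(vocab, best)
```

Repeating this step a fixed number of times yields a subword vocabulary in which frequent words stay whole while rare words are split into smaller, more frequent units.<br />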
<br />
The tokens were then converted to word embeddings using word2vec with 300 dimensions and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.<br />
<br />
The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2-layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target languages, while the decoder is different for each language.<br />
<br />
Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:<br />
<br />
#Dual structure: NMT systems are usually built to translate in one direction, English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time, translating English<math>\leftrightarrow</math>French.<br />
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language. <br />
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This approach ensures that the encoder only learns how to compose the language independent representations to build representations of the larger phrases. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in the background. In the proposed method, even though the embeddings used are cross-lingual, the vocabulary used for each language is different. This way if the same word occurs in two different languages and has a different meaning in the respective languages then each word would get a different vector in the respective languages despite being in the same vector space. <br />
<br />
[[File:Figure2_lwali.png|600px|center]]<br />
<br />
The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.<br />
<br />
'''Note on the need for alignment:''' To train the decoders (in an admittedly “supervised” manner), we assume that they decode from the same latent space. Thus, given a sentence in either language, the shared encoder needs to represent it in the same latent space to allow training. However, during the back-translation training, the shared encoder stays fixed. This implies that the encoder needs to be set beforehand. For this reason, the process of embedding and alignment is needed. <br />
<br />
===Denoising===<br />
Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both languages in a language-independent fashion, and then be decoded by the language dependent decoder.<br />
<br />
Denoising works by reconstructing a noisy version of a sentence back into the original sentence in the same language. In mathematical form, if <math>x</math> is a sentence in language L1:<br />
<br />
# Construct <math>C(x)</math>, noisy version of <math>x</math>. In the proposed model, <math>C(x)</math> is constructed by randomly swapping contiguous words. If the length of the input sequence <math>x</math> is <math>N</math>, then a total of <math>\frac{N}{2}</math> such swaps are made.<br />
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.<br />
<br />
The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.<br />
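<br />
To make the objective concrete, the following minimal sketch computes a token-level cross entropy between a reference sentence and the decoder's predicted distributions. It assumes the decoder outputs one probability distribution over the vocabulary per position; the function name and the toy distributions are hypothetical.<br />

```python
import math


def cross_entropy(reference, predicted_distributions, floor=1e-12):
    """Sum of -log p(reference token) over all positions.

    `predicted_distributions[i]` is a dict mapping tokens to the probability
    the decoder assigns to them at position i. `floor` avoids log(0) when the
    reference token received no probability mass.
    """
    if len(reference) != len(predicted_distributions):
        raise ValueError("one distribution is required per reference token")
    return -sum(
        math.log(max(dist.get(token, 0.0), floor))
        for token, dist in zip(reference, predicted_distributions)
    )


# Hypothetical decoder output for the reference sentence "the cat".
reference = ["the", "cat"]
predicted = [
    {"the": 0.5, "a": 0.5},
    {"cat": 0.25, "dog": 0.75},
]
loss = cross_entropy(reference, predicted)
print(loss)
```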
<br />
In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.<br />
<br />
The proposed noise function is to perform <math>N/2</math> random swaps of words that are contiguous, where <math>N</math> is the number of words in the sentence. This noise model also helps reduce the reliance of the model on the order of words in a sentence which may be different in the source and target languages. The system will also need to correctly learn the internal structure of a language to decode the sentence into the correct order.<br />
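<br />
The noise function described above can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: the function name is hypothetical, and rounding <math>N/2</math> down for odd-length sentences is an assumption.<br />

```python
import random


def add_swap_noise(tokens, rng=random):
    """Return a copy of `tokens` after N // 2 random swaps of adjacent words."""
    noisy = list(tokens)
    n = len(noisy)
    if n < 2:
        return noisy  # nothing to swap
    for _ in range(n // 2):
        i = rng.randrange(n - 1)  # pick a position with a right neighbour
        noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
    return noisy


sentence = "the quick brown fox jumps over the lazy dog".split()
noisy_sentence = add_swap_noise(sentence, random.Random(0))
print(noisy_sentence)
```

Because only adjacent words are swapped, the noisy sentence keeps exactly the same words as the original and only the local order is disturbed.<br />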
<br />
===Back-Translation===<br />
<br />
With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:<br />
<br />
# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L2,<br />
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,<br />
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.<br />
<br />
The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.<br />
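<br />
The three steps above can be sketched as a function that builds one synthetic training pair. The encoder, decoder and noise function are passed in as plain callables; the toy stand-ins below (an identity encoder, a word-by-word lexicon decoder and a no-op noise function) are hypothetical and exist only to make the sketch runnable.<br />

```python
def make_back_translation_pair(x, encode, decode_l2, add_noise):
    """Build one (input, target) training pair for back-translation."""
    # Step 1: translate the noisy source sentence C(x) into L2.
    y = decode_l2(encode(add_noise(x)))
    # Step 2: corrupt the synthetic translation to obtain C(y).
    noisy_y = add_noise(y)
    # Step 3 (training, not shown): feed C(y) through the shared encoder and
    # the L1 decoder, and minimise the cross entropy against x.
    return noisy_y, x


# Toy stand-ins for the real model components (assumptions for illustration).
TOY_LEXICON = {"le": "the", "chat": "cat", "dort": "sleeps"}


def toy_encode(tokens):
    return list(tokens)


def toy_decode_l2(latent):
    return [TOY_LEXICON.get(token, token) for token in latent]


def no_noise(tokens):
    return list(tokens)


model_input, target = make_back_translation_pair(
    ["le", "chat", "dort"], toy_encode, toy_decode_l2, no_noise
)
print(model_input, target)
```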
<br />
This approach alleviates issues that would have resulted from the training procedure only dealing with a single language at a time. The corpus of a language is converted to a synthetic translation, and trained to predict the original sentence from this translation. <br />
<br />
Contrary to standard back-translation that uses an independent model to back-translate the entire corpus at once, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.<br />
<br />
===Training===<br />
<br />
Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence. <br />
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.<br />
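<br />
The alternation between objectives can be sketched as a simple schedule generator. This is only an illustration of the ordering described above; the step labels are hypothetical, and the actual parameter updates are omitted.<br />

```python
def training_schedule(num_iterations):
    """Yield (iteration, objective, direction) for every mini-batch."""
    steps = [
        ("denoise", "L1"),
        ("denoise", "L2"),
        ("back-translate", "L1->L2"),
        ("back-translate", "L2->L1"),
    ]
    for iteration in range(num_iterations):
        for objective, direction in steps:
            # A real training loop would sample a mini-batch here and take
            # one optimizer step on the corresponding loss.
            yield iteration, objective, direction


schedule = list(training_schedule(2))
for step in schedule:
    print(step)
```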
<br />
The authors use Adam as their optimizer with a learning rate of α = 0.0002 (Kingma & Ba, 2015). During training, dropout regularization is implemented with a drop probability p = 0.3. Given that no parallel data is used for development purposes, the authors perform a fixed number of iterations (300,000) to train each variant. <br />
<br />
Considering recently demonstrated weaker convergence of Adam (compared to SGD), repeating the experiments with other optimizers might provide better results.<br />
<br />
=Experiments and Results=<br />
<br />
The model was evaluated using the Bilingual Evaluation Understudy (BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (ground-truth) translation.<br />
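<br />
As a rough guide to what BLEU measures, the sketch below computes a simplified sentence-level score: the geometric mean of clipped n-gram precisions multiplied by a brevity penalty. It assumes a single reference and no smoothing, and the function names are hypothetical. Reported BLEU scores are normally computed at the corpus level with standard tooling, so this sketch will not reproduce the numbers in Table 1.<br />

```python
import collections
import math


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return collections.Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


def sentence_bleu(candidate, reference, max_n=4):
    """Simplified BLEU for one candidate and one reference (no smoothing)."""
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if total == 0 or clipped == 0:
            return 0.0
        log_precision_sum += math.log(clipped / total)
    # Penalise candidates that are shorter than the reference.
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_precision_sum / max_n)


reference = "the cat is on the mat".split()
print(sentence_bleu(reference, reference))
print(sentence_bleu("the cat sat".split(), "the cat sat down".split(), max_n=2))
```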
<br />
The paper trains the translation model under 3 different settings to compare performance (Table 1). All training and testing data came from a standard NMT dataset, WMT'14.<br />
<br />
[[File:Table1_lwali.png|600px|center]]<br />
<br />
The results show that back-translation is necessary for the proposed system to work properly. Denoising alone falls below the baseline, while large improvements appear once back-translation is introduced.<br />
<br />
===Unsupervised===<br />
<br />
The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.<br />
<br />
The paper adds each component piece-wise during evaluation to test the impact each piece has on the final score. As shown in Table 1, the unsupervised results are strong compared to the word-by-word baseline, with improvements between 40% and 140%. Results also show that back-translation is essential. Denoising alone does not show a big improvement; however, it is required for back-translation, because otherwise back-translation would translate nonsensical sentences. The addition of back-translation, however, does show a large improvement in all tested cases.<br />
<br />
For the BPE experiment, results show that it helps in some language pairs but detracts in others. This is because, while BPE helped to translate some rare words, it increased the error rates for other words. It also did not perform well when translating named entities, which occur infrequently.<br />
<br />
===Semi-supervised===<br />
<br />
Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.<br />
<br />
Table 1 shows that the model can greatly benefit from the addition of a small parallel corpus to the monolingual corpora. It is surprising that the semi-supervised system in row 6 outperforms the supervised system in row 7. One possible explanation is that both the semi-supervised training set and the test set belong to the news domain, whereas the supervised training set covers corpora from all domains.<br />
<br />
===Supervised===<br />
<br />
This setting provides an upper bound for the proposed unsupervised system. The data used was the combination of all parallel corpora provided at WMT 2014, which includes Europarl, Common Crawl and News Commentary for both language pairs, plus the UN and the Gigaword corpus for French-English. Moreover, the authors use the same subsets of News Commentary alone to run separate experiments in order to compare with the semi-supervised scenario.<br />
<br />
The Comparable NMT was trained using the same proposed model, except that it does not use monolingual corpora; consequently, it was trained without denoising and back-translation. The proposed model under a supervised setting does much worse than the state-of-the-art NMT system in row 10, which suggests that the additional constraints that enable unsupervised learning also limit the potential performance. To improve these results, the authors suggest using larger models, longer training times, and incorporating several well-known NMT techniques.<br />
<br />
===Qualitative Analysis===<br />
<br />
[[File:Table2_lwali.png|600px|center]]<br />
<br />
Table 2 shows 4 examples of French to English translations, which indicate that the proposed system produces high-quality translations and adequately models non-trivial translation relations. Examples 1 and 2 show that the model is able not only to go beyond a literal word-by-word substitution but also to model structural differences between the languages (e.g., it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport"), and that it is capable of producing high-quality translations of long and more complex sentences. However, in Examples 3 and 4, the system fails to translate the months and numbers correctly and has difficulty with odd sentence structures, which shows that the proposed system has limitations. In particular, the authors point out that the proposed model has difficulty preserving some concrete details from source sentences. The results also show that the proposed model's translation quality often lags behind that of a standard supervised NMT system, and that there are cases where both fluency and adequacy problems severely hinder understanding of the original message from the proposed translation, suggesting that there is still room for improvement and possible future work.<br />
<br />
=Conclusions and Future Work=<br />
<br />
The paper presented an unsupervised model that performs translation using only monolingual corpora, by combining an attention-based encoder-decoder system with training through denoising and back-translation.<br />
<br />
Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:<br />
*Instead of using fixed cross-lingual word embeddings at the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.<br />
*Decouple the shared encoder into 2 independent encoders at some point during training<br />
*Progressively reduce the noise level<br />
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis<br />
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.<br />
<br />
= Critique =<br />
<br />
While the idea is interesting and the results are impressive for an unsupervised approach, much of the model had actually already been proposed by other papers that are referenced. The paper doesn't add a lot of new ideas but only builds on existing techniques and combines them in a different way to achieve good experimental results. The paper is not a significant algorithmic contribution. <br />
<br />
As pointed out, in order to critically analyze the effect of the algorithm, we need to formulate the algorithm in terms of mathematics.<br />
<br />
The results showed that the proposed system performed far worse than the state of the art when used in a supervised setting, which is concerning and shows that the techniques used create a limitation and a ceiling for performance.<br />
<br />
Additionally, there was no rigorous hyperparameter exploration/optimization for the model. As a result, it is difficult to conclude whether the performance limit observed in the constrained supervised model is the absolute limit, or whether this could be overcome in both supervised/unsupervised models with the right constraints to achieve more competitive results. <br />
<br />
The best results are reported for two very closely related languages (English and French); the model does much worse for English-German, even though English and German are also closely related (though less so than English and French). This suggests that the model may not be successful at translating between distant language pairs. More testing on such pairs would be interesting to see.<br />
<br />
The results comparison could have shown how the semi-supervised version of the model scores compared to other semi-supervised approaches as touched on in the other works section.<br />
<br />
Their qualitative analysis just checks whether their proposed unsupervised NMT generates a sensible translation. It is limited and it needs further detailed analysis regarding the characteristics and properties of translation which is generated by unsupervised NMT.<br />
<br />
* (As pointed out by an anonymous reviewer [https://openreview.net/forum?id=Sy2ogebAW])Future work is vague: “we would like to detect and mitigate the specific causes…” “We also think that a better handling of rare words…” That’s great, but how will you do these things? Do you have specific reasons to think this, or ideas on how to approach them? Otherwise, this is just hand-waving.<br />
<br />
= References =<br />
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."<br />
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".<br />
#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."<br />
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."<br />
#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."<br />
#'''[Ravi & Knight, 2011]''' Sujith Ravi and Kevin Knight, "Deciphering foreign language."<br />
#'''[Dou & Knight, 2012]''' Qing Dou and Kevin Knight, "Large scale decipherment for out-of-domain machine translation."<br />
#'''[Johnson et al. 2017]''' Melvin Johnson et al., "Google’s multilingual neural machine translation system: Enabling zero-shot translation."<br />
#'''[Zhang et al. 2017]''' Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. "Adversarial training for unsupervised bilingual lexicon induction."<br />
#'''[Koehn & Knowles, 2017]''' Philipp Koehn and Rebecca Knowles. "Six challenges for neural machine translation."<br />
#'''[Chen et al., 2017]''' Yun Chen, Yang Liu, Yong Cheng, and Victor O.K. Li. A teacher-student framework for zero-resource neural machine translation.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Unsupervised_Neural_Machine_Translation&diff=42126Unsupervised Neural Machine Translation2018-11-30T22:11:08Z<p>Z43ma: </p>
<hr />
<div>This paper was published in ICLR 2018, authored by Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. Open source implementation of this paper is available [https://github.com/artetxem/undreamt here]<br />
<br />
= Introduction =<br />
The paper presents an unsupervised Neural Machine Translation (NMT) method that uses monolingual corpora (single language texts) only. This contrasts with the usual supervised NMT approach, which relies on parallel corpora (aligned text) from the source and target languages being available for training. This problem is important because parallel corpora do not exist for the majority of language pairs, e.g. German-Russian. Languages can also suffer from having poor resources for translation (e.g. Basque), which can lead to the dataset being too small (Koehn & Knowles, 2017).<br />
<br />
Other authors have recently tried to address this problem using semi-supervised approaches (with a small set of parallel corpora). Their approaches have included pivoting or triangulation techniques [Chen et al., 2017], and semi-supervised approaches [He, 2016]. However, these methods still require a strong cross-lingual signal. The proposed method eliminates the need for cross-lingual information altogether and relies solely on monolingual data. The proposed method builds upon the work done recently on unsupervised cross-lingual embeddings by Artetxe et al., 2017 and Zhang et al., 2017.<br />
<br />
The general approach of the methodology is to:<br />
<br />
# Use monolingual corpora in the source and target languages to learn single language word embeddings for both languages separately.<br />
# Align the 2 sets of word embeddings into a single cross lingual (language independent) embedding.<br />
Then iteratively perform:<br />
# Train an encoder-decoder model to reconstruct noisy versions of sentences in both source and target languages separately. The model uses a single encoder and different decoders for each language. The encoder uses cross lingual word embedding.<br />
# Tune the decoder in each language by back-translating between the source and target language.<br />
<br />
= Background =<br />
<br />
===Word Embedding Alignment===<br />
<br />
The paper uses word2vec [Mikolov, 2013] to convert each monolingual corpus to vector embeddings. Mikolov et al. improve the continuous Skip-gram model for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. These embeddings have been shown to contain contextual and syntactic features independent of language, and so, in theory, there could exist a linear map that maps the embeddings from language L1 to language L2. <br />
<br />
Figure 1 shows an example of aligning the word embeddings in English and French.<br />
<br />
[[File:Figure1_lwali.png|frame|400px|center|Figure 1: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]<br />
<br />
Most cross-lingual word embedding methods use bilingual signals in the form of parallel corpora. Usually, the embedding mapping methods train the embeddings in different languages using monolingual corpora, then use a linear transformation to map them into a shared space based on a bilingual dictionary.<br />
<br />
The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration. This is in contrast to earlier work which used dictionaries of a few thousand words.<br />
<br />
===Other related work and inspirations===<br />
====Statistical Decipherment for Machine Translation====<br />
There has been significant work in statistical deciphering techniques (decipherment is the discovery of the meaning of texts written in ancient or obscure languages or scripts) to develop a machine translation model from monolingual data (Ravi & Knight, 2011; Dou & Knight, 2012). These techniques treat the source language as ciphertext (encrypted or encoded information because it contains a form of the original plaintext that is unreadable by a human or computer without the proper cipher for decoding) and model the generation process of the ciphertext as a two-stage process, which includes the generation of the original English sequence and the probabilistic replacement of the words in it. This approach takes advantage of the incorporation of syntactic knowledge of the languages. The use of word embeddings has also shown improvements in statistical decipherment.<br />
<br />
====Low-Resource Neural Machine Translation====<br />
There are also proposals that use techniques other than direct parallel corpora to do NMT. Some use a third intermediate language that is well connected to the source and target languages independently. For example, if we want to translate German into Russian, we can use English as an intermediate language (German-English and then English-Russian) since there are plenty of resources to connect English and other languages. Johnson et al. (2017) show that a multilingual extension of a standard NMT architecture performs reasonably well for language pairs when no parallel data for the source and target data was used during training. Firat et al. (2016) and Chen et al. (2017) showed that the use of advanced models like teacher-student framework can be used to improve over the baseline of translating using a third intermediate language.<br />
<br />
Other works use monolingual data in combination with scarce parallel corpora. A simple but effective technique is back-translation [Sennrich et al, 2016]: a monolingual corpus in the target language is translated back into the source language, and the resulting sentence pairs form a synthetic parallel corpus that is used as additional training data.<br />
<br />
The most important contribution to the problem of training an NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directions (e.g. French → English and English → French) and teach each other through reinforcement learning. However, this approach still required a large parallel corpus for a warm start (about 1.2 million sentences), while this paper does not use parallel data.<br />
<br />
= Related Works =<br />
<br />
=== 2.1 UNSUPERVISED CROSS-LINGUAL EMBEDDINGS ===<br />
<br />
A majority of methods for learning cross-lingual word embeddings depend on some bilingual signal at the document level. Embedding mapping methods independently train the embeddings in different languages using monolingual corpora and subsequently learn a linear transformation that maps them to a shared space based on a bilingual dictionary. While the dictionaries used in these earlier works typically contain a few thousand entries, Artetxe et al. (2017) propose a simple self-learning extension that gives comparable results with an automatically generated list of numerals, which is used as a shortcut for practical unsupervised learning.<br />
<br />
=== 2.2 STATISTICAL DECIPHERMENT FOR MACHINE TRANSLATION ===<br />
<br />
A considerable body of work on statistical decipherment treats the source language as ciphertext and models the generation of this ciphertext as a two-stage process: the generation of the original English sequence, followed by the probabilistic replacement of the words in it. The English generative process is modeled using a standard n-gram language model, and the channel model parameters are estimated using either expectation maximization or Bayesian inference. This approach was shown to benefit from the incorporation of syntactic knowledge of the languages involved (Dou & Knight, 2013; Dou et al., 2015). More in line with the proposed method, the use of word embeddings has also been shown to bring significant improvements in statistical decipherment for machine translation (Dou et al., 2015). Another recently developed method applies a relatively new deep architecture, the Sum-Product Network (Poon & Domingos, 2011), to machine translation. It is a hybrid model that combines probabilistic modeling with deep architectures. The main advantage of this model is that it has clear semantics and provides great interoperability, and, like many other deep architectures, it can be trained using gradient descent. In machine translation, the translation probability can be decomposed with Bayes' rule as <math>P(\text{English} \mid \text{French}) = \frac{P(\text{French} \mid \text{English}) \, P(\text{English})}{P(\text{French})}</math>, where <math>P(\text{English} \mid \text{French})</math> is the probability that an English text corresponds to a given French text, and <math>P(\text{French} \mid \text{English})</math> is the reverse. A Sum-Product Network can be used to model each of these probabilities and thereby perform machine translation.<br />
<br />
=== 2.3 LOW-RESOURCE NEURAL MACHINE TRANSLATION ===<br />
<br />
A simple yet effective approach is to create a synthetic parallel corpus by back-translating a monolingual corpus in the target language (Sennrich et al., 2016a). At the same time, Currey et al. (2017) showed that training an NMT system to directly copy target language text is also helpful and complementary to back-translation. Finally, Ramachandran et al. (2017) pre-train the encoder and the decoder on language modeling. Another method [He, 2016] trains two agents to translate in opposite directions (e.g. French → English and English → French) and has them teach each other through a reinforcement learning process. This approach still requires a parallel corpus of considerable size for a warm start.<br />
<br />
= Methodology =<br />
<br />
The corpus data is first preprocessed in a standard way: the words are tokenized and truecased. The authors also experimented with an alternate way of tokenizing words by using Byte-Pair Encoding (BPE) [Sennrich, 2016]. (Byte pair encoding, or digram coding, is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data.) BPE has been shown to improve embeddings of rare words. The vocabulary was limited to the most frequent 50,000 tokens (BPE tokens or words).<br />
<br />
The tokens were then converted to word embeddings using word2vec with 300 dimensions and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.<br />
<br />
The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2-layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target languages, while the decoder is different for each language.<br />
<br />
Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:<br />
<br />
#Dual structure: NMT systems are usually built to translate in one direction, English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time, translating English<math>\leftrightarrow</math>French.<br />
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language. <br />
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This approach ensures that the encoder only learns how to compose the language independent representations to build representations of the larger phrases. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in the background. In the proposed method, even though the embeddings used are cross-lingual, the vocabulary used for each language is different. This way if the same word occurs in two different languages and has a different meaning in the respective languages then each word would get a different vector in the respective languages despite being in the same vector space. <br />
<br />
[[File:Figure2_lwali.png|600px|center]]<br />
<br />
The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.<br />
<br />
'''Note on the need for alignment:''' To train the decoders (in an admittedly “supervised” manner), we assume that they decode from the same latent space. Thus, given a sentence in either language, the shared encoder needs to represent it in the same latent space to allow training. However, during the back-translation training, the shared encoder stays fixed. This implies that the encoder needs to be set beforehand. For this reason, the process of embedding and alignment is needed. <br />
<br />
===Denoising===<br />
Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both languages in a language-independent fashion, and then be decoded by the language dependent decoder.<br />
<br />
Denoising works by reconstructing a noisy version of a sentence back into the original sentence in the same language. In mathematical form, if <math>x</math> is a sentence in language L1:<br />
<br />
# Construct <math>C(x)</math>, noisy version of <math>x</math>. In the proposed model, <math>C(x)</math> is constructed by randomly swapping contiguous words. If the length of the input sequence <math>x</math> is <math>N</math>, then a total of <math>\frac{N}{2}</math> such swaps are made.<br />
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.<br />
<br />
The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.<br />
<br />
In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.<br />
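<br />
As a hedged sketch of this objective, assuming a standard autoregressive decoder with parameters <math>\theta</math> (the symbols <math>P_\theta</math> and <math>x_t</math> are illustrative notation, not taken from the paper), the denoising loss for a sentence <math>x = (x_1, \dots, x_N)</math> can be written as a token-level cross entropy:<br />
<br />
<center> <math>\mathcal{L}_{\text{denoise}}(\theta) = -\sum_{t=1}^{N} \log P_\theta\left(x_t \mid x_1, \dots, x_{t-1}, C(x)\right)</math> </center><br />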
<br />
The proposed noise function is to perform <math>N/2</math> random swaps of words that are contiguous, where <math>N</math> is the number of words in the sentence. This noise model also helps reduce the reliance of the model on the order of words in a sentence which may be different in the source and target languages. The system will also need to correctly learn the internal structure of a language to decode the sentence into the correct order.<br />
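<br />
The noise function described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name <code>swap_noise</code>, the use of integer division for <math>N/2</math>, and sampling swap positions uniformly are all assumptions.<br />
<br />
```python
import random


def swap_noise(tokens, rng=None):
    """Return a noisy copy of `tokens` built from N // 2 random adjacent swaps.

    Assumptions (not from the paper): swap positions are drawn uniformly,
    and N / 2 is rounded down for odd-length sentences.
    """
    if rng is None:
        rng = random.Random()
    noisy = list(tokens)  # copy so the clean sentence is left untouched
    n = len(noisy)
    if n < 2:
        return noisy  # nothing to swap
    for _ in range(n // 2):
        i = rng.randrange(n - 1)  # position that has a right-hand neighbour
        noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
    return noisy


if __name__ == "__main__":
    sentence = "the cat sat on the mat".split()
    print(swap_noise(sentence, random.Random(0)))
```
<br />
The output always contains the same words as the input, only in a locally perturbed order, so the decoder has to recover the correct word order in order to reconstruct <math>x</math>.<br />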
<br />
===Back-Translation===<br />
<br />
With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:<br />
<br />
# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L2,<br />
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,<br />
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.<br />
<br />
The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.<br />
<br />
This approach alleviates issues that would have resulted from the training procedure only dealing with a single language at a time. The corpus of a language is converted to a synthetic translation, and trained to predict the original sentence from this translation. <br />
<br />
Contrary to standard back-translation that uses an independent model to back-translate the entire corpus at once, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.<br />
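<br />
The three back-translation steps can be sketched as follows. The names <code>back_translation_example</code>, <code>translate</code>, and <code>noise</code> are hypothetical: <code>translate</code> stands in for the shared encoder followed by the decoder of the requested language, and <code>noise</code> stands in for <math>C(\cdot)</math>. The word-for-word toy translator exists only to make the sketch runnable.<br />
<br />
```python
def back_translation_example(x, translate, noise, target_lang="L2"):
    """Build one synthetic training pair for reconstructing x.

    `translate(tokens, lang)` is a stand-in for the shared encoder plus the
    decoder of `lang`; `noise(tokens)` is a stand-in for C(.).
    """
    y = translate(noise(x), target_lang)  # step 1: translate C(x) into L2
    noisy_y = noise(y)                    # step 2: corrupt the translation
    return noisy_y, x                     # step 3: train the model on C(y) -> x


# Toy stand-ins, purely illustrative.
TOY_LEXICON = {"L2": {"hello": "bonjour", "world": "monde"}}


def toy_translate(tokens, lang):
    """Word-for-word lookup; unknown words are copied unchanged."""
    return [TOY_LEXICON[lang].get(tok, tok) for tok in tokens]


def no_noise(tokens):
    """Identity noise, so the example output is deterministic."""
    return list(tokens)


if __name__ == "__main__":
    model_input, target = back_translation_example(
        ["hello", "world"], toy_translate, no_noise
    )
    print(model_input, "->", target)
```
<br />
In the actual system the pair would be regenerated for every mini-batch with the current model, which is what lets the synthetic translations improve as training progresses.<br />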
<br />
===Training===<br />
<br />
Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence. <br />
Greedy decoding was used at training time for back-translation, while inference at test time used beam search with a beam size of 12.<br />
<br />
The authors use Adam as their optimizer with a learning rate of α = 0.0002 (Kingma & Ba, 2015). During training, dropout regularization is implemented with a drop probability p = 0.3. Given that no parallel data is used for development purposes, the authors perform a fixed number of iterations (300,000) to train each variant. <br />
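<br />
The alternation of objectives can be sketched as a simple schedule generator. The hyperparameter values are the ones reported above; the dictionary layout, the key names, and the function <code>training_schedule</code> are illustrative assumptions, and the actual model update for each step is left out.<br />
<br />
```python
# Values as reported in the summary; the key names are hypothetical.
HYPERPARAMETERS = {
    "optimizer": "adam",
    "learning_rate": 0.0002,
    "dropout": 0.3,
    "num_iterations": 300_000,
    "test_beam_size": 12,
}

# One iteration = four mini-batches, in this order.
STEPS_PER_ITERATION = [
    ("denoising", "L1", "L1"),
    ("denoising", "L2", "L2"),
    ("back-translation", "L1", "L2"),
    ("back-translation", "L2", "L1"),
]


def training_schedule(num_iterations):
    """Yield (iteration, objective, source, target) for every mini-batch."""
    for iteration in range(num_iterations):
        for objective, source, target in STEPS_PER_ITERATION:
            yield iteration, objective, source, target


if __name__ == "__main__":
    for step in training_schedule(1):
        print(step)
```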
<br />
Considering recently<br />
<br />
=Experiments and Results=<br />
<br />
The model was evaluated using the Bilingual Evaluation Understudy (BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (ground-truth) translation.<br />
<br />
The paper trained the translation model under 3 different settings to compare performance (Table 1). All training and testing data came from a standard NMT dataset, WMT'14.<br />
<br />
[[File:Table1_lwali.png|600px|center]]<br />
<br />
The results show that back-translation is necessary for the proposed system to work properly. Denoising alone scores below the baseline, while large improvements appear once back-translation is introduced.<br />
<br />
===Unsupervised===<br />
<br />
The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.<br />
<br />
The paper adds each component piece-wise during evaluation to test the impact each piece has on the final score. As shown in Table 1, the unsupervised results are strong compared to the word-by-word baseline, with improvements of between 40% and 140%. The results also show that back-translation is essential. Denoising alone does not show a big improvement, but it is required for back-translation, because otherwise back-translation would translate nonsensical sentences. The addition of back-translation, however, shows a large improvement in all tested cases.<br />
<br />
For the BPE experiment, results show that it helps for some language pairs but hurts for others. This is because, while BPE helped to translate some rare words, it increased the error rates for other words. It also did not perform well when translating named entities, which occur infrequently.<br />
<br />
===Semi-supervised===<br />
<br />
Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.<br />
<br />
Table 1 shows that the model can greatly benefit from the addition of a small parallel corpus to the monolingual corpora. It is surprising that the semi-supervised system in row 6 outperforms the supervised system in row 7. One possible explanation is that both the semi-supervised training set and the test set belong to the news domain, whereas the supervised training set draws on corpora from all domains.<br />
<br />
===Supervised===<br />
<br />
This setting provides an upper bound for the proposed unsupervised system. The data used was the combination of all parallel corpora provided at WMT 2014, which includes Europarl, Common Crawl and News Commentary for both language pairs, plus the UN and Gigaword corpora for French-English. Moreover, the authors use the same subsets of News Commentary alone to run separate experiments in order to compare with the semi-supervised scenario.<br />
<br />
The Comparable NMT was trained using the same proposed model, except that it does not use monolingual corpora and, consequently, was trained without denoising and back-translation. The proposed model under a supervised setting does much worse than the state-of-the-art NMT system in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance. To improve these results, the authors also suggest using larger models, longer training times, and incorporating several well-known NMT techniques.<br />
<br />
===Qualitative Analysis===<br />
<br />
[[File:Table2_lwali.png|600px|center]]<br />
<br />
Table 2 shows 4 examples of French to English translations. They show that the proposed system produces high-quality translations and adequately models non-trivial translation relations. Examples 1 and 2 show that the model is able to go beyond a literal word-by-word substitution and to model structural differences between the languages (e.g., it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport"), and that it is capable of producing high-quality translations of long and more complex sentences. However, in Examples 3 and 4, the system fails to translate the months and numbers correctly and has difficulty with odd sentence structures, which shows that the proposed system has limitations. In particular, the authors point out that the proposed model has difficulty preserving some concrete details from the source sentences. The results also show that the proposed model's translation quality often lags behind that of a standard supervised NMT system, and that there are some cases with both fluency and adequacy problems that severely hinder understanding of the original message from the proposed translation. This suggests that there is still room for improvement and possible future work.<br />
<br />
=Conclusions and Future Work=<br />
<br />
The paper presented an unsupervised model that performs translation using only monolingual corpora, by combining an attention-based encoder-decoder system with training through denoising and back-translation.<br />
<br />
Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:<br />
*Instead of using fixed cross-lingual word embeddings at the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.<br />
*Decouple the shared encoder into 2 independent encoders at some point during training<br />
*Progressively reduce the noise level<br />
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in the authors' manual analysis<br />
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.<br />
<br />
= Critique =<br />
<br />
While the idea is interesting and the results are impressive for an unsupervised approach, much of the model had actually already been proposed by other papers that are referenced. The paper doesn't add a lot of new ideas but only builds on existing techniques and combines them in a different way to achieve good experimental results. The paper is not a significant algorithmic contribution. <br />
<br />
As pointed out, in order to critically analyze the effect of the algorithm, we need to formulate the algorithm in terms of mathematics.<br />
<br />
The results showed that the proposed system performed far worse than the state of the art when used in a supervised setting, which is concerning and suggests that the techniques used create a limitation and a ceiling for performance.<br />
<br />
Additionally, there was no rigorous hyperparameter exploration/optimization for the model. As a result, it is difficult to conclude whether the performance limit observed in the constrained supervised model is the absolute limit, or whether this could be overcome in both supervised/unsupervised models with the right constraints to achieve more competitive results. <br />
<br />
The best results shown are for two very closely related languages (English and French), and the model does much worse for English-German, even though English and German are also closely related (but less so than English and French). This suggests that the model may not be successful at translating between distant language pairs. More testing on such pairs would be interesting to see.<br />
<br />
The results comparison could have shown how the semi-supervised version of the model scores compared to other semi-supervised approaches as touched on in the other works section.<br />
<br />
Their qualitative analysis just checks whether their proposed unsupervised NMT generates a sensible translation. It is limited and it needs further detailed analysis regarding the characteristics and properties of translation which is generated by unsupervised NMT.<br />
<br />
* (As pointed out by an anonymous reviewer [https://openreview.net/forum?id=Sy2ogebAW])Future work is vague: “we would like to detect and mitigate the specific causes…” “We also think that a better handling of rare words…” That’s great, but how will you do these things? Do you have specific reasons to think this, or ideas on how to approach them? Otherwise, this is just hand-waving.<br />
<br />
= References =<br />
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."<br />
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, and Eneko Agirre. "Learning bilingual word embeddings with (almost) no bilingual data."<br />
#'''[Gouws, 2016]''' Stephan Gouws, Yoshua Bengio, and Greg Corrado. "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."<br />
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."<br />
#'''[Sennrich, 2016]''' Rico Sennrich, Barry Haddow, and Alexandra Birch. "Neural Machine Translation of Rare Words with Subword Units."<br />
#'''[Ravi & Knight, 2011]''' Sujith Ravi and Kevin Knight. "Deciphering foreign language."<br />
#'''[Dou & Knight, 2012]''' Qing Dou and Kevin Knight. "Large scale decipherment for out-of-domain machine translation."<br />
#'''[Johnson et al., 2017]''' Melvin Johnson, et al. "Google’s multilingual neural machine translation system: Enabling zero-shot translation."<br />
#'''[Zhang et al., 2017]''' Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. "Adversarial training for unsupervised bilingual lexicon induction."<br />
#'''[Koehn & Knowles, 2017]''' Philipp Koehn and Rebecca Knowles. "Six challenges for neural machine translation."<br />
#'''[Chen et al., 2017]''' Yun Chen, Yang Liu, Yong Cheng, and Victor O.K. Li. "A teacher-student framework for zero-resource neural machine translation."</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_to_Teach&diff=42121Learning to Teach2018-11-30T21:52:59Z<p>Z43ma: </p>
<hr />
<div><br />
<br />
=Introduction=<br />
<br />
This paper proposed the "learning to teach" (L2T) framework with two intelligent agents: a student model/agent, corresponding to the learner in traditional machine learning algorithms, and a teacher model/agent, determining the appropriate data, loss function, and hypothesis space to facilitate the learning of the student model.<br />
<br />
In modern human society, the role of teaching is heavily implicated in our education system; the goal is to equip students with the necessary knowledge and skills in an efficient manner. This is the fundamental ''student'' and ''teacher'' framework on which education stands. However, in the field of artificial intelligence (AI) and specifically machine learning, researchers have focused most of their efforts on the ''student'' (i.e., designing various optimization algorithms to enhance the learning ability of intelligent agents). The paper argues that a formal study on the role of ‘teaching’ in AI is required. Analogous to teaching in human society, the teaching framework can: select training data that corresponds to the appropriate teaching materials (e.g. textbooks selected for the right difficulty), design loss functions that correspond to targeted examinations, and define the hypothesis space that corresponds to imparting the proper methodologies. Furthermore, an optimization framework (instead of heuristics) should be used to update the teaching skills based on the feedback from students, so as to achieve teacher-student co-evolution.<br />
<br />
Thus, the training phase of L2T has several episodes of interactions between the teacher and the student model. Based on the state information at each step, the teacher model updates the teaching actions so that the student model can perform better on the machine learning problem. The student model then provides reward signals back to the teacher model. These reward signals are used by the teacher model as part of the reinforcement learning process to update its parameters; in this paper, a policy gradient algorithm is used. This process is end-to-end trainable, and the authors are convinced that once converged, the teacher model can be applied to new learning scenarios and even new students, without extra effort on re-training.<br />
<br />
To demonstrate the practical value of the proposed approach, the '''training data scheduling''' problem is chosen as an example. The authors show that by using the proposed method to adaptively select the most suitable training data, they can significantly improve the accuracy and convergence speed of various neural networks, including multi-layer perceptrons (MLPs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs), for different applications including image classification and text understanding.<br />
Furthermore, the teacher model obtained from one task can be smoothly transferred to other tasks. For example, with the teacher model trained on MNIST with an MLP learner, one can achieve satisfactory performance on CIFAR-10 using only roughly half of the training data to train a ResNet model as the student.<br />
<br />
=Related Work=<br />
The L2T framework connects with two emerging trends in machine learning. The first is the movement from simple to advanced learning. This includes meta-learning (Schmidhuber, 1987; Thrun & Pratt, 2012) which explores automatic learning by transferring learned knowledge from meta tasks [1]. This approach has been applied to few-shot learning scenarios and in designing general optimizers and neural network architectures. (Hochreiter et al., 2001; Andrychowicz et al., 2016; Li & Malik, 2016; Zoph & Le, 2017)<br />
<br />
The second is teaching, which can be classified into either machine teaching (Zhu, 2015) [2] or hardness-based methods. The former seeks to construct a minimal training set for the student to learn a target model (i.e., an oracle). The latter assumes an ordering of data from easy instances to hard ones, with hardness determined in different ways. Curriculum learning (CL) (Bengio et al, 2009; Spitkovsky et al. 2010; Tsvetkov et al, 2016) [3] measures hardness through heuristics on the data, while self-paced learning (SPL) (Kumar et al., 2010; Lee & Grauman, 2011; Jiang et al., 2014; Supancic & Ramanan, 2013) [4] measures hardness by the loss on the data. <br />
<br />
The limitations of these works include the lack of a formally defined teaching problem, and the reliance on heuristics and fixed rules, which hinders generalization of the teaching task.<br />
<br />
=Learning to Teach=<br />
To introduce the problem and framework, without loss of generality, consider the setting of supervised learning.<br />
<br />
In supervised learning, each sample <math>x</math> is from a fixed but unknown distribution <math>P(x)</math>, and the corresponding label <math> y </math> is from a fixed but unknown distribution <math>P(y|x) </math>. The goal is to find a function <math>f_\omega(x)</math> with parameter vector <math>\omega</math> that minimizes the gap between the predicted label and the actual label.<br />
<br />
<br />
<br />
==Problem Definition==<br />
The student model, denoted <math> \mu </math>, takes the training data <math> D </math>, the function class <math> Ω </math>, and the loss function <math> L </math> as input and outputs a function <math> f_{ω^*} </math> whose parameter <math>ω^*</math> minimizes the risk <math>R(ω)</math>, as in:<br />
<br />
\begin{align*}<br />
\omega^* = \arg\min_{\omega \in \Omega} \sum_{(x,y) \in D} L(y, f_\omega(x)) =: \mu (D, L, \Omega)<br />
\end{align*}<br />
<br />
The teaching model, denoted φ, tries to provide <math> D </math>, <math> L </math>, and <math> Ω </math> (or any combination, denoted <math> A </math>) to the student model such that the student model either achieves lower risk R(ω) or progresses as fast as possible.<br />
In contrast to traditional machine learning, which is only concerned with the student model, the learning to teach framework is also concerned with a teacher model, which tries to provide appropriate inputs to the student model so that it can achieve a low risk functional as efficiently as possible.<br />
<br />
<br />
::'''Training Data''': Outputting a good training set <math> D </math>, analogous to human teachers providing students with proper learning materials such as textbooks.<br />
::'''Loss Function''': Designing a good loss function <math> L </math> , analogous to providing useful assessment criteria for students.<br />
::'''Hypothesis Space''': Defining a good function class <math> Ω </math> which the student model can select from. This is analogous to human teachers providing appropriate context, e.g., middle school students are taught math with basic algebra while undergraduate students are taught with calculus. Different choices of <math> Ω </math> lead to different errors and optimization problems (Mohri et al., 2012).<br />
<br />
==Framework==<br />
The training phase consists of the teacher providing the student with the subset <math> A_{train} </math> of <math> A </math> and then taking feedback to improve its own parameters. After the training process converges, the teacher model can be used to teach either new student models, or the same student models in new learning scenarios, such as when another subset <math> A_{test} </math> is provided. Such generalization is feasible as long as the state representations <math> S </math> are the same across different student models and different scenarios. The L2T process is outlined in the figure below:<br />
<br />
[[File: L2T_process.png | 500px|center]]<br />
<br />
* <math> s_t &isin; S </math> represents information available to the teacher model at time <math> t </math>. <math> s_t </math> is typically constructed from the current student model <math> f_{t−1} </math> and the past teaching history of the teacher model. <math> S </math> represents the set of states.<br />
* <math> a_t &isin; A </math> represents the action taken by the teacher model at time <math> t </math>, given state <math>s_t</math>. <math> A </math> represents the set of actions, where the action(s) can be any combination of teaching tasks involving the training data, loss function, and hypothesis space. <br />
* <math> φ_θ : S → A </math> is the policy used by the teacher model to generate its action <math> φ_θ(s_t) = a_t </math>.<br />
* Student model takes <math> a_t </math> as input and outputs function <math> f_t </math>, by using the conventional ML techniques.<br />
<br />
Mathematically, taking data teaching as an example, where <math>L</math> and <math>\Omega</math> are fixed, the objective of the teacher in the L2T framework is <br />
<br />
<center> <math>\max\limits_{\theta}{\sum\limits_{t}{r_t}} = \max\limits_{\theta}{\sum\limits_{t}{r(f_t)}} = \max\limits_{\theta}{\sum\limits_{t}{r(\mu(\phi_{\theta}(s_t), L, \Omega))}}</math> </center><br />
<br />
Once the training process converges, the teacher model may be utilized to teach a different subset of <math> A </math> or teach a different student model.<br />
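<br />
The interaction described by the bullets and the objective above can be sketched as a generic loop. All names here (<code>run_teaching_episode</code>, <code>teacher_policy</code>, <code>student_step</code>, <code>make_state</code>, <code>reward_fn</code>) are hypothetical placeholders, and the toy usage replaces the neural networks with simple counters so that the sketch runs on its own.<br />
<br />
```python
def run_teaching_episode(teacher_policy, student_step, make_state, reward_fn, num_steps):
    """Run one teacher-student episode and return the accumulated reward.

    teacher_policy(state)         -> action   (phi_theta)
    student_step(student, action) -> student  (mu applied to the action)
    make_state(student, history)  -> state    (s_t)
    reward_fn(student)            -> float    (r(f_t))
    """
    student = None  # f_0: no student model trained yet
    history = []
    total_reward = 0.0
    for _ in range(num_steps):
        state = make_state(student, history)     # s_t from f_{t-1} and history
        action = teacher_policy(state)           # a_t = phi_theta(s_t)
        student = student_step(student, action)  # f_t
        reward = reward_fn(student)              # r_t = r(f_t)
        history.append((state, action, reward))
        total_reward += reward
    return student, history, total_reward


if __name__ == "__main__":
    # Toy stand-ins: the "student" is a counter that grows with each action.
    student, history, total = run_teaching_episode(
        teacher_policy=lambda state: 1,
        student_step=lambda f, a: (f or 0) + a,
        make_state=lambda f, h: (f, len(h)),
        reward_fn=float,
        num_steps=3,
    )
    print(student, total)
```
<br />
The teacher's parameters <math>\theta</math> would then be updated to increase the returned total reward, which is the sum that appears in the objective.<br />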
<br />
=Application=<br />
<br />
There are different approaches to training the teacher model; this paper applies reinforcement learning, with <math> φ_θ </math> being the ''policy'' that interacts with <math> S </math>, the ''environment''. The paper applies data teaching to train a deep neural network student, <math> f </math>, for several classification tasks. Thus the student feedback measure is classification accuracy. Its learning rule is mini-batch stochastic gradient descent, where batches of data arrive sequentially in random order. The teacher model is responsible for providing the training data, which in this case means it must determine which instances (subset) of the mini-batch of data will be fed to the student. In order to reach convergence faster, the reward was set to relate to the speed at which the student model learns. <br />
<br />
The authors also designed a state feature vector <math> g(s) </math> in order to efficiently represent the current states which include arrived training data and the student model. Within the State Features, there are three categories including Data features, student model features and the combination of both data and learner model. This state feature will be computed when each mini-batch of data arrives.<br />
<br />
<br />
The objective for training the teacher model is to maximize the expected reward: <br />
<br />
\begin{align} <br />
J(θ) = E_{φ_θ(a|s)}[R(s,a)]<br />
\end{align}<br />
<br />
This objective is non-differentiable w.r.t. <math> θ </math>, so a likelihood ratio policy gradient algorithm is used to optimize <math> J(θ) </math> (Williams, 1992) [4].<br />
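<br />
The likelihood ratio estimator replaces the gradient of the expectation with <math>\nabla_θ J(θ) = E_{φ_θ(a|s)}[R(s,a) \nabla_θ \log φ_θ(a|s)]</math>, which can be estimated from sampled actions. The sketch below illustrates this for a binary keep/filter action. It is an assumption-laden illustration, not the authors' teacher: the logistic policy over a feature vector <math>g(s)</math>, the function names, and the single terminal reward per episode are all chosen here for simplicity.<br />
<br />
```python
import math
import random


def keep_probability(theta, features):
    """phi_theta(a = 1 | s): logistic policy over the state features g(s)."""
    z = sum(w * f for w, f in zip(theta, features))
    return 1.0 / (1.0 + math.exp(-z))


def sample_action(theta, features, rng):
    """Sample a = 1 (keep the instance) or a = 0 (filter it out)."""
    return 1 if rng.random() < keep_probability(theta, features) else 0


def reinforce_update(theta, episode, reward, learning_rate=0.1):
    """One likelihood-ratio (REINFORCE) step.

    `episode` is a list of (features, action) pairs and `reward` is the
    terminal reward of that episode. For a logistic policy,
    d/d theta log phi_theta(a | s) = (a - p) * g(s).
    """
    grad = [0.0] * len(theta)
    for features, action in episode:
        p = keep_probability(theta, features)
        for i, f in enumerate(features):
            grad[i] += (action - p) * f
    return [w + learning_rate * reward * g for w, g in zip(theta, grad)]


if __name__ == "__main__":
    rng = random.Random(0)
    theta = [0.0, 0.0]
    features = [1.0, 0.5]
    action = sample_action(theta, features, rng)
    print(reinforce_update(theta, [(features, action)], reward=1.0))
```
<br />
Actions that were followed by a high reward become more likely under the updated <math>θ</math>, and no gradient needs to flow through the student's training procedure.<br />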
<br />
==Experiments==<br />
<br />
The L2T framework is tested on the following student models: multi-layer perceptron (MLP), ResNet (CNN), and Long-Short-Term-Memory network (RNN). <br />
<br />
The student tasks are image classification on MNIST and CIFAR-10, and sentiment classification on the IMDB movie review dataset. <br />
<br />
The strategy will be benchmarked against the following teaching strategies:<br />
<br />
::'''NoTeach''': NoTeach removes the entire Teacher-Student paradigm and reverts back to the classical machine learning paradigm. In the context of data teaching, we consider the architecture fixed, and feed data in a pre-determined way. One would pre-define batch-size and cross-validation procedures as needed.<br />
::'''Self-Paced Learning (SPL)''': Teaching by ''hardness'' of data, defined as the loss. This strategy begins by filtering out data with larger loss values to train the student with "easy" data, and gradually increases the hardness. Mathematically speaking, training data <math>d </math> satisfying <math>l(d) > \eta </math> are filtered out, where the threshold <math> \eta </math> grows during the training process. To improve the robustness of SPL, following a widely used trick in common SPL implementations (Jiang et al., 2014b), the authors filter training data using its loss rank within one mini-batch rather than the absolute loss value: they filter the data instances with the top <math>K </math> largest training loss values within an <math>M</math>-sized mini-batch, where <math>K</math> drops linearly from <math>M − 1 </math> to 0 during training.<br />
<br />
::'''L2T''': The Learning to Teach framework.<br />
::'''RandTeach''': Randomly filter data in each epoch according to the logged ratio of filtered data instances per epoch (as opposed to deliberate and dynamic filtering by L2T).<br />
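<br />
The rank-based SPL filter described in the list above can be sketched as follows. The function names and the rounding used for the linear schedule of <math>K</math> are assumptions; only the rule itself (drop the <math>K</math> largest-loss instances in an <math>M</math>-sized mini-batch, with <math>K</math> falling linearly from <math>M - 1</math> to 0) comes from the description.<br />
<br />
```python
def spl_num_filtered(step, total_steps, batch_size):
    """K for the current step: linear from M - 1 (first step) to 0 (last step)."""
    if total_steps <= 1:
        return 0
    remaining = 1.0 - step / (total_steps - 1)
    return round((batch_size - 1) * remaining)


def spl_filter(batch, losses, k):
    """Drop the k instances with the largest loss, keeping the original order."""
    if k <= 0:
        return list(batch)
    by_loss = sorted(range(len(batch)), key=lambda i: losses[i], reverse=True)
    dropped = set(by_loss[:k])
    return [x for i, x in enumerate(batch) if i not in dropped]


if __name__ == "__main__":
    batch = ["a", "b", "c", "d"]
    losses = [0.1, 0.9, 0.5, 0.3]
    for step in (0, 5, 10):
        k = spl_num_filtered(step, total_steps=11, batch_size=len(batch))
        print(step, k, spl_filter(batch, losses, k))
```
<br />
Early in training only the lowest-loss instance survives, and by the final step the whole mini-batch is passed to the student.<br />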
<br />
For all teaching strategies, they make sure that the base neural network model will not be updated until <math>M </math> un-trained, yet selected data instances are accumulated. That is to guarantee that the convergence speed is only determined by the quality of taught data, not by different model updating frequencies. The model is implemented with Theano and run on one NVIDIA Tesla K40 GPU for each training/testing process.<br />
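<br />
The accumulation rule above can be sketched with a small buffer that releases data to the student only in full <math>M</math>-sized groups. The class name <code>SelectedDataBuffer</code> and its interface are hypothetical and are not taken from the authors' Theano code.<br />
<br />
```python
class SelectedDataBuffer:
    """Collect teacher-selected instances and release them M at a time."""

    def __init__(self, batch_size):
        self.batch_size = batch_size  # M
        self._pending = []            # selected but not yet trained on

    def add(self, selected):
        """Add newly selected instances; return every full batch now ready."""
        self._pending.extend(selected)
        ready = []
        while len(self._pending) >= self.batch_size:
            ready.append(self._pending[: self.batch_size])
            self._pending = self._pending[self.batch_size:]
        return ready


if __name__ == "__main__":
    buffer = SelectedDataBuffer(batch_size=3)
    print(buffer.add([1, 2]))     # not enough instances yet, no update
    print(buffer.add([3, 4]))     # one full batch, instance 4 keeps waiting
```
<br />
Because every strategy triggers a student update only when a full batch is available, each strategy performs one update per <math>M</math> selected instances, regardless of how aggressively it filters.<br />
<br />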
===Training a New Student===<br />
<br />
In the first set of experiments, the datasets are divided into two folds. The first fold is used to train the teacher; this is done by having the teacher train a student network on that half of the data, with a certain portion being used for computing rewards. After training, the teacher parameters are fixed and used to train a new student network (with the same structure) on the second half of the dataset. When teaching a new student with the same model architecture, L2T achieves significantly faster convergence than the other strategies across all tasks, especially compared to the NoTeach and RandTeach methods:<br />
<br />
[[File: L2T_speed.png | 1100px|center]]<br />
<br />
===Filtration Number===<br />
<br />
When investigating the details of filtered data instances per epoch, for the two image classification tasks, the L2T teacher filters an increasing amount of data as training goes on. The authors' intuition for the two image classification tasks is that the student model can learn from harder instances of data from the beginning, and thus the teacher can filter redundant data. In contrast, for the natural language task, the student model must first learn from easy data instances.<br />
<br />
[[File: L2T_fig3.png | 1100px|center]]<br />
<br />
===Teaching New Student with Different Model Architecture===<br />
<br />
In this part, a teacher model is first trained by interacting with a student model. The teacher model is then used to teach another student model with a different architecture.<br />
The results of applying the teacher trained on ResNet32 to teach other architectures are shown below. The L2T algorithm can be seen to obtain higher accuracies earlier than the SPL, RandTeach, or NoTeach algorithms.<br />
<br />
[[File: L2T_fig4.png | 1100px|center]]<br />
<br />
===Training Time Analysis===<br />
<br />
The learning curves demonstrate the efficiency in accuracy achieved by the L2T over the other strategies. This is especially evident during the earlier training stages.<br />
<br />
[[File: L2T_fig5.png | 600px|center]]<br />
<br />
===Accuracy Improvement===<br />
<br />
When comparing training accuracy on the IMDB sentiment classification task, the L2T teaching policy improves on both NoTeach and SPL.<br />
<br />
[[File: L2T_t1.png | 500px|center]]<br />
<br />
Table 1 shows that, in addition to boosting the convergence speed, the teacher model also improves final accuracy. The student model is the LSTM network trained on IMDB. Prior to teaching the student model, the authors train the teacher model on half of the training data, and define the terminal reward as the set accuracy after the teacher model trains the student for 15 epochs. Then the teacher model is applied to train the student model on the full dataset until convergence. The state features are kept the same as those in previous experiments. L2T achieves better classification accuracy for training the LSTM network, surpassing the SPL baseline by more than 0.6 points (with p value < 0.001).<br />
<br />
=Future Work=<br />
<br />
Several directions for future work follow from this paper: <br />
<br />
1) Recent advances in multi-agent reinforcement learning could be tried on the Reinforcement Learning problem formulation of this paper. <br />
<br />
2) Some human in the loop architectures like CHAT and HAT (https://www.ijcai.org/proceedings/2017/0422.pdf) should give better results for the same framework. <br />
<br />
3) It would be interesting to try out the framework suggested in this paper (L2T) in Imperfect information and partially observable settings. <br />
<br />
4) Since the authors focused on data teaching, exploring loss function teaching would be interesting.<br />
<br />
=Critique=<br />
<br />
While the conceptual framework of L2T is sound, the paper only experimentally demonstrates efficacy for ''data teaching'', which would seem to be the simplest to implement. The feasibility and effectiveness of teaching the loss function and hypothesis space are not explored in a real-world scenario. Also, this paper does not provide enough mathematical foundation to prove that this model can be generalized to other datasets and other general problems. The method presented here, where the teacher model filters data, does not seem to provide a large enough action space for the teacher model. Furthermore, the experimental results for data teaching suggest that the speed of convergence is the main improvement over other teaching strategies, whereas the difference in accuracy is less remarkable. The paper also assesses accuracy only by comparing L2T with NoTeach and SPL on the IMDB classification task; the improvement (or lack thereof) on the other classification tasks and teaching strategies is omitted. Again, this distinction is not possible to assess in loss function or hypothesis space teaching within the scope of this paper. The authors could have included larger datasets such as ImageNet and CIFAR100 in their experiments, which would have provided some more insight.<br />
<br />
The idea of having a generalizable teacher model to enhance student learning is admirable. In fact, the L2T framework is similar to the actor-critic model in reinforcement learning, which is known to be effective. In general, one expects that an effective teacher model would facilitate transfer learning and significantly reduce student model training time. However, the L2T framework seems to fall short of that goal. Consider the CIFAR-10 training scenario: the L2T model achieves 85% accuracy after 2 million training data instances, which is only about 3% more accurate than a no-teacher model. Perhaps in the future the L2T framework can be improved to produce better performance.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_to_Teach&diff=42119Learning to Teach2018-11-30T21:37:47Z<p>Z43ma: </p>
<hr />
<div><br />
<br />
=Introduction=<br />
<br />
This paper proposed the "learning to teach" (L2T) framework with two intelligent agents: a student model/agent, corresponding to the learner in traditional machine learning algorithms, and a teacher model/agent, determining the appropriate data, loss function, and hypothesis space to facilitate the learning of the student model.<br />
<br />
In modern human society, teaching is central to our education system; the goal is to equip students with the necessary knowledge and skills in an efficient manner. This is the fundamental ''student'' and ''teacher'' framework on which education stands. However, in the field of artificial intelligence (AI), and specifically machine learning, researchers have focused most of their efforts on the ''student'' (i.e., designing various optimization algorithms to enhance the learning ability of intelligent agents). The paper argues that a formal study of the role of "teaching" in AI is required. Analogous to teaching in human society, the teaching framework can select training data that corresponds to the appropriate teaching materials (e.g., textbooks selected for the right difficulty), design loss functions that correspond to targeted examinations, and define the hypothesis space that corresponds to imparting the proper methodologies. Furthermore, an optimization framework (instead of heuristics) should be used to update the teaching skills based on feedback from students, so as to achieve teacher-student co-evolution.<br />
<br />
Thus, the training phase of L2T consists of several episodes of interaction between the teacher and the student model. Based on the state information at each step, the teacher model updates its teaching actions so that the student model can perform better on the machine learning problem. The student model then provides reward signals back to the teacher model. The teacher model uses these reward signals in a reinforcement learning process to update its parameters; the paper uses a policy gradient algorithm for this. The process is end-to-end trainable, and the authors are convinced that, once converged, the teacher model could be applied to new learning scenarios and even new students without extra effort on re-training.<br />
<br />
To demonstrate the practical value of the proposed approach, the '''training data scheduling''' problem is chosen as an example. The authors show that by using the proposed method to adaptively select the most suitable training data, they can significantly improve the accuracy and convergence speed of various neural networks, including multi-layer perceptrons (MLPs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs), for applications including image classification and text understanding. Furthermore, the teacher model obtained from one task can be transferred smoothly to other tasks. For example, with the teacher model trained on MNIST with an MLP learner, one can achieve satisfactory performance on CIFAR-10 using only roughly half of the training data to train a ResNet model as the student.<br />
<br />
=Related Work=<br />
The L2T framework connects with two emerging trends in machine learning. The first is the movement from simple to advanced learning. This includes meta-learning (Schmidhuber, 1987; Thrun & Pratt, 2012) which explores automatic learning by transferring learned knowledge from meta tasks [1]. This approach has been applied to few-shot learning scenarios and in designing general optimizers and neural network architectures. (Hochreiter et al., 2001; Andrychowicz et al., 2016; Li & Malik, 2016; Zoph & Le, 2017)<br />
<br />
The second is teaching, which can be classified into either machine teaching (Zhu, 2015) [2] or hardness-based methods. The former seeks to construct a minimal training set for the student to learn a target model (i.e., an oracle). The latter assumes an order of data from easy instances to hard ones, with hardness determined in different ways. Curriculum learning (CL) (Bengio et al., 2009; Spitkovsky et al., 2010; Tsvetkov et al., 2016) [3] measures hardness through heuristics of the data, while self-paced learning (SPL) (Kumar et al., 2010; Lee & Grauman, 2011; Jiang et al., 2014; Supancic & Ramanan, 2013) [4] measures hardness by the loss on the data.<br />
<br />
The limitations of these works include the lack of a formally defined teaching problem, and the reliance on heuristics and fixed rules, which hinders generalization of the teaching task.<br />
<br />
=Learning to Teach=<br />
To introduce the problem and framework, without loss of generality, consider the setting of supervised learning.<br />
<br />
In supervised learning, each sample <math>x</math> is from a fixed but unknown distribution <math>P(x)</math>, and the corresponding label <math> y </math> is from a fixed but unknown distribution <math>P(y|x) </math>. The goal is to find a function <math>f_\omega(x)</math> with parameter vector <math>\omega</math> that minimizes the gap between the predicted label and the actual label.<br />
<br />
<br />
<br />
==Problem Definition==<br />
The student model, denoted <math>\mu</math>, takes the set of training data <math> D </math>, the function class <math> \Omega </math>, and the loss function <math> L </math> as input and outputs a function <math> f_{\omega^*} </math> whose parameter <math>\omega^*</math> minimizes the risk <math>R(\omega)</math>, as in:<br />
<br />
\begin{align*}<br />
\omega^* = \arg\min_{\omega \in \Omega} \sum_{(x,y) \in D} L(y, f_\omega(x)) =: \mu (D, L, \Omega)<br />
\end{align*}<br />
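<br />
As an illustration of the definition above, the following minimal Python sketch treats the student model <math>\mu(D, L, \Omega)</math> as a brute-force empirical risk minimizer. The finite candidate list <code>omega_space</code>, the linear model <code>linear_model</code>, and the squared loss are assumptions made for this example and do not come from the paper.<br />
<br />
```python
def squared_loss(y, y_hat):
    """Hypothetical loss L(y, f(x)): squared error."""
    return (y - y_hat) ** 2


def linear_model(w, x):
    """Hypothetical student hypothesis f_w(x) = w * x."""
    return w * x


def student(data, loss, omega_space, model=linear_model):
    """mu(D, L, Omega): return the parameter in omega_space with the
    smallest summed loss over the training set D."""
    def empirical_risk(w):
        return sum(loss(y, model(w, x)) for x, y in data)
    return min(omega_space, key=empirical_risk)


# Toy usage: the data follow y = 2x, so w = 2 has zero empirical risk.
D = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
best_w = student(D, squared_loss, omega_space=[0.0, 1.0, 2.0, 3.0])
```
<br />
In practice the student would be trained by gradient descent over a continuous parameter space; the exhaustive search here only makes the roles of <math>D</math>, <math>L</math>, and <math>\Omega</math> explicit.<br />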
<br />
The teacher model, denoted φ, tries to provide <math> D </math>, <math> L </math>, and <math> Ω </math> (or any combination of them, denoted <math> A </math>) to the student model such that the student model either achieves a lower risk R(ω) or progresses as fast as possible. Traditional machine learning is concerned only with the student model. The learning to teach framework is also concerned with a teacher model, which tries to provide appropriate inputs to the student model so that it can achieve a low risk as efficiently as possible. The teacher can act on three components:<br />
<br />
<br />
::'''Training Data''': Outputting a good training set <math> D </math>, analogous to human teachers providing students with proper learning materials such as textbooks.<br />
::'''Loss Function''': Designing a good loss function <math> L </math> , analogous to providing useful assessment criteria for students.<br />
::'''Hypothesis Space''': Defining a good function class <math> Ω </math> from which the student model can select. This is analogous to human teachers providing appropriate context, e.g., middle school students are taught math with basic algebra while undergraduate students are taught with calculus. Different choices of Ω lead to different errors and optimization problems (Mohri et al., 2012).<br />
<br />
==Framework==<br />
The training phase consists of the teacher providing the student with the subset <math> A_{train} </math> of <math> A </math> and then taking feedback to improve its own parameters. After the training process converges, the teacher model can be used to teach either new student models or the same student models in new learning scenarios, such as when another subset <math> A_{test} </math> is provided. Such generalization is feasible as long as the state representations <math> S </math> are the same across different student models and different scenarios. The L2T process is outlined in the figure below:<br />
<br />
[[File: L2T_process.png | 500px|center]]<br />
<br />
* <math> s_t &isin; S </math> represents information available to the teacher model at time <math> t </math>. <math> s_t </math> is typically constructed from the current student model <math> f_{t−1} </math> and the past teaching history of the teacher model. <math> S </math> represents the set of states.<br />
* <math> a_t &isin; A </math> represents the action taken by the teacher model at time <math> t </math>, given state <math>s_t</math>. <math> A </math> represents the set of actions, where the action(s) can be any combination of teaching tasks involving the training data, loss function, and hypothesis space. <br />
* <math> φ_θ : S → A </math> is the policy used by the teacher model to generate its action <math> φ_θ(s_t) = a_t </math>.<br />
* The student model takes <math> a_t </math> as input and outputs the function <math> f_t </math> using conventional ML techniques.<br />
<br />
Once the training process converges, the teacher model may be utilized to teach a different subset of <math> A </math> or teach a different student model.<br />
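<br />
The interaction described above can be sketched as a simple loop. This is only an illustrative toy under stated assumptions: the one-feature state (the current loss), the thresholded linear policy <code>teacher_policy</code>, and the one-dimensional least-squares student are all hypothetical, and the reward-driven update of <math>θ</math> is omitted.<br />
<br />
```python
import random


def teacher_policy(theta, state):
    """Hypothetical policy phi_theta(s_t): keep the sample (True) when a
    linear score of the state features is non-negative."""
    score = theta[0] * state["loss"] + theta[1]
    return score >= 0.0


def student_update(w, x, y, lr=0.1):
    """One SGD step for a toy student f_w(x) = w * x with squared loss."""
    grad = 2.0 * (w * x - y) * x
    return w - lr * grad


def run_episode(theta, stream, w0=0.0):
    """Teacher and student interact once per arriving sample."""
    w, kept = w0, 0
    for x, y in stream:
        state = {"loss": (w * x - y) ** 2}   # s_t, built from the current student
        if teacher_policy(theta, state):     # a_t = phi_theta(s_t)
            w = student_update(w, x, y)      # student consumes a_t and yields f_t
            kept += 1
    return w, kept


rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(500)]
stream = [(x, 2.0 * x) for x in xs]
w_all, kept_all = run_episode(theta=(0.0, 1.0), stream=stream)     # keeps every sample
w_none, kept_none = run_episode(theta=(0.0, -1.0), stream=stream)  # filters every sample
```
<br />
With a policy that keeps every sample the student recovers the slope of the toy data, while a policy that filters everything leaves the student untouched; a learned teacher would sit between these two extremes.<br />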
<br />
=Application=<br />
<br />
There are different approaches to training the teacher model. This paper applies reinforcement learning, with <math> φ_θ </math> being the ''policy'' that interacts with <math> S </math>, the ''environment''. The paper applies data teaching to train a deep neural network student, <math> f </math>, for several classification tasks, so the student feedback measure is classification accuracy. The student's learning rule is mini-batch stochastic gradient descent, where batches of data arrive sequentially in random order. The teacher model is responsible for providing the training data, which in this case means it must determine which instances (a subset) of each mini-batch are fed to the student. To encourage faster convergence, the reward is tied to how quickly the student model learns. <br />
<br />
The authors also designed a state feature vector <math> g(s) </math> to efficiently represent the current state, which includes the arrived training data and the student model. The state features fall into three categories: data features, student model features, and features that combine the data and the learner model. The state feature vector is computed each time a mini-batch of data arrives.<br />
<br />
<br />
The teacher model is trained to maximize the expected reward: <br />
<br />
\begin{align} <br />
J(θ) = E_{φ_θ(a|s)}[R(s,a)]<br />
\end{align}<br />
<br />
This objective is non-differentiable w.r.t. <math> θ </math>, so a likelihood ratio policy gradient algorithm is used to optimize <math> J(θ) </math> (Williams, 1992) [4].<br />
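<br />
The likelihood ratio estimator rests on the identity <math>\nabla_θ J(θ) = E_{φ_θ(a|s)}[\nabla_θ \log φ_θ(a|s) R(s,a)]</math>, which can be approximated from sampled episodes. The sketch below is a minimal illustration and not the authors' implementation: it assumes a Bernoulli keep/filter policy with a sigmoid of a linear score, a single terminal reward per episode, and hypothetical function names.<br />
<br />
```python
import math


def keep_probability(theta, features):
    """Bernoulli policy phi_theta(a=1 | s) = sigmoid(theta . g(s))."""
    z = sum(t * f for t, f in zip(theta, features))
    return 1.0 / (1.0 + math.exp(-z))


def reinforce_gradient(theta, episode, reward):
    """Likelihood-ratio estimate of grad J(theta) from one episode.

    episode: list of (g(s_t), a_t) pairs with a_t in {0, 1}.
    reward:  scalar terminal reward for the whole episode (assumption).
    For a Bernoulli-sigmoid policy, grad log phi(a|s) = (a - p) * g(s).
    """
    grad = [0.0] * len(theta)
    for features, action in episode:
        p = keep_probability(theta, features)
        for i, f in enumerate(features):
            grad[i] += (action - p) * f * reward
    return grad


def ascent_step(theta, grad, lr=0.01):
    """Gradient ascent on the expected reward."""
    return [t + lr * g for t, g in zip(theta, grad)]


theta = [0.0, 0.0]
episode = [([1.0, 2.0], 1)]          # one step: features g(s), action "keep"
grad = reinforce_gradient(theta, episode, reward=2.0)
theta_new = ascent_step(theta, grad, lr=0.1)
```
<br />
A positive reward pushes <math>θ</math> toward making the sampled actions more likely, and a negative reward pushes it away from them.<br />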
<br />
==Experiments==<br />
<br />
The L2T framework is tested on the following student models: a multi-layer perceptron (MLP), a ResNet (CNN), and a Long Short-Term Memory network (LSTM, an RNN). <br />
<br />
The student tasks are image classification on MNIST and CIFAR-10, and sentiment classification on the IMDB movie review dataset. <br />
<br />
The strategy will be benchmarked against the following teaching strategies:<br />
<br />
::'''NoTeach''': NoTeach removes the entire Teacher-Student paradigm and reverts back to the classical machine learning paradigm. In the context of data teaching, we consider the architecture fixed, and feed data in a pre-determined way. One would pre-define batch-size and cross-validation procedures as needed.<br />
::'''Self-Paced Learning (SPL)''': Teaching by ''hardness'' of data, defined as the loss. This strategy begins by filtering out data with larger loss value to train the student with "easy" data and gradually increases the hardness. Mathematically speaking, those training data <math>d </math> satisfying loss value <math>l(d) > \eta </math> will be filtered out, where the threshold <math> \eta </math> grows from smaller to larger during the training process. To improve the robustness of SPL, following the widely used trick in common SPL implementation (Jiang et al., 2014b), the authors filter training data using its loss rank in one mini-batch rather than the absolute loss value: they filter data instances with top <math>K </math>largest training loss values within a <math>M</math>-sized mini-batch, where <math>K</math> linearly drops from <math>M − 1 </math>to 0 during training.<br />
<br />
::'''L2T''': The Learning to Teach framework.<br />
::'''RandTeach''': Randomly filter data in each epoch according to the logged ratio of filtered data instances per epoch (as opposed to deliberate and dynamic filtering by L2T).<br />
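<br />
The rank-based SPL filter described above can be sketched as follows. The function name, the <code>progress</code> argument, and the use of a floor when interpolating <math>K</math> are assumptions made for this example; the source only states that <math>K</math> drops linearly from <math>M - 1</math> to 0.<br />
<br />
```python
def spl_keep_indices(losses, progress):
    """Rank-based self-paced filtering for one mini-batch.

    losses:   per-instance training losses for a mini-batch of size M.
    progress: fraction of training completed, in [0, 1] (assumption:
              K is interpolated linearly as K = floor((M - 1) * (1 - progress))).
    Returns the sorted indices of the instances that are kept.
    """
    m = len(losses)
    k = int((m - 1) * (1.0 - progress))          # number of hardest instances to drop
    by_loss = sorted(range(m), key=lambda i: losses[i])
    return sorted(by_loss[: m - k])


batch_losses = [0.5, 0.1, 0.9, 0.3]
start = spl_keep_indices(batch_losses, progress=0.0)   # K = 3: only the easiest instance
middle = spl_keep_indices(batch_losses, progress=0.5)  # K = 1: drop the single hardest
end = spl_keep_indices(batch_losses, progress=1.0)     # K = 0: keep everything
```
<br />
Filtering by rank rather than by an absolute threshold means the number of kept instances depends only on training progress, not on the scale of the loss.<br />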
<br />
For all teaching strategies, they make sure that the base neural network model will not be updated until <math>M </math> un-trained, yet selected data instances are accumulated. That is to guarantee that the convergence speed is only determined by the quality of taught data, not by different model updating frequencies. The model is implemented with Theano and run on one NVIDIA Tesla K40 GPU for each training/testing process.<br />
===Training a New Student===<br />
<br />
In the first set of experiments, the datasets are divided into two folds. The first fold is used to train the teacher. This is done by having the teacher train a student network on that half of the data, with a certain portion being used for computing rewards. After training, the teacher parameters are fixed and used to train a new student network (with the same structure) on the second half of the dataset. When teaching a new student with the same model architecture, L2T achieves significantly faster convergence than the other strategies across all tasks, especially compared to the NoTeach and RandTeach methods:<br />
<br />
[[File: L2T_speed.png | 1100px|center]]<br />
<br />
===Filtration Number===<br />
<br />
When investigating the details of filtered data instances per epoch, for the two image classification tasks, the L2T teacher filters an increasing amount of data as training goes on. The authors' intuition for the two image classification tasks is that the student model can learn from harder instances of data from the beginning, and thus the teacher can filter redundant data. In contrast, for the natural language task, the student model must first learn from easy data instances.<br />
<br />
[[File: L2T_fig3.png | 1100px|center]]<br />
<br />
===Teaching New Student with Different Model Architecture===<br />
<br />
In this part, a teacher model is first trained by interacting with a student model. The trained teacher is then used to teach another student model with a different architecture. The results of applying the teacher trained on ResNet32 to teach other architectures are shown below. L2T obtains higher accuracies earlier than the SPL, RandTeach, or NoTeach strategies.<br />
<br />
[[File: L2T_fig4.png | 1100px|center]]<br />
<br />
===Training Time Analysis===<br />
<br />
The learning curves demonstrate the efficiency in accuracy achieved by the L2T over the other strategies. This is especially evident during the earlier training stages.<br />
<br />
[[File: L2T_fig5.png | 600px|center]]<br />
<br />
===Accuracy Improvement===<br />
<br />
When comparing training accuracy on the IMDB sentiment classification task, the L2T teaching policy improves on both NoTeach and SPL.<br />
<br />
[[File: L2T_t1.png | 500px|center]]<br />
<br />
Table 1 shows that the teacher model improves final accuracy in addition to boosting convergence speed. The student model is the LSTM network trained on IMDB. Prior to teaching the student model, the teacher model is trained on half of the training data, with the terminal reward defined as the held-out (development) set accuracy after the teacher model trains the student for 15 epochs. The teacher model is then applied to train the student model on the full dataset until convergence. The state features are kept the same as those in the previous experiments. L2T achieves better classification accuracy for training the LSTM network, surpassing the SPL baseline by more than 0.6 points (with p-value < 0.001).<br />
<br />
=Future Work=<br />
<br />
Several directions for future work follow from this paper:<br />
<br />
1) Recent advances in multi-agent reinforcement learning could be applied to the reinforcement learning formulation used in this paper.<br />
<br />
2) Human-in-the-loop architectures such as CHAT and HAT (https://www.ijcai.org/proceedings/2017/0422.pdf) might give better results within the same framework.<br />
<br />
3) It would be interesting to try the L2T framework in imperfect-information and partially observable settings.<br />
<br />
4) Since the authors focused on data teaching, exploring loss function teaching would be interesting.<br />
<br />
=Critique=<br />
<br />
While the conceptual framework of L2T is sound, the paper only demonstrates efficacy experimentally for ''data teaching'', which would seem to be the simplest case to implement. The feasibility and effectiveness of teaching the loss function and hypothesis space are not explored in a real-world scenario. The paper also does not provide enough mathematical foundation to show that the model generalizes to other datasets and other general problems. Restricting the teacher model to filtering data does not seem to give it a large enough action space. Furthermore, the experimental results for data teaching suggest that the speed of convergence is the main improvement over other teaching strategies, whereas the difference in accuracy is less remarkable. The paper also assesses accuracy only by comparing L2T with NoTeach and SPL on the IMDB classification task; the improvement (or lack thereof) on the other classification tasks and teaching strategies is omitted. Again, this distinction cannot be assessed for loss function or hypothesis space teaching within the scope of this paper. Experiments on larger datasets such as ImageNet and CIFAR-100 would have provided more insight.<br />
<br />
The idea of having a generalizable teacher model to enhance student learning is admirable. In fact, the L2T framework is similar to the actor-critic model in reinforcement learning, which is known to be effective. In general, one expects that an effective teacher model would facilitate transfer learning and significantly reduce student model training time. However, the L2T framework seems to fall short of that goal. Consider the CIFAR-10 training scenario: the L2T model achieves 85% accuracy after 2 million training data instances, which is only about 3% more accurate than a no-teacher model. Perhaps in the future the L2T framework can be improved to produce better performance.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=ShakeDrop_Regularization&diff=42000ShakeDrop Regularization2018-11-30T02:54:22Z<p>Z43ma: </p>
<hr />
<div>=Introduction=<br />
Current state-of-the-art techniques for object classification are deep neural networks based on the residual block, first published by He et al. (2016). This technique has been the foundation of several improved networks, including Wide ResNet (Zagoruyko & Komodakis, 2016), PyramidNet (Han et al., 2017) and ResNeXt (Xie et al., 2017). They have been further improved by regularization methods, such as Stochastic Depth (ResDrop) (Huang et al., 2016) and Shake-Shake (Gastaldi, 2017), which can mitigate problems such as vanishing gradients. Shake-Shake applied to ResNeXt has achieved one of the lowest error rates on the CIFAR-10 and CIFAR-100 datasets. However, it is only applicable to multi-branch architectures and is not memory efficient, since it requires two branches of residual blocks. To address this problem, the paper proposes ShakeDrop regularization, which can realize a disturbance similar to Shake-Shake on a single residual block. ShakeDrop disturbs learning more strongly by multiplying the output of a convolutional layer by a factor that may even be negative in the forward training pass. In addition, a different factor from the one in the forward pass is used in the backward training pass. As a byproduct, however, the learning process becomes unstable, so the authors use ResDrop to stabilize it. This paper seeks to formulate a general expansion of Shake-Shake that can be applied to any residual-block-based network.<br />
<br />
=Existing Methods=<br />
<br />
'''Deep Approaches'''<br />
<br />
'''ResNet''' was the first use of residual blocks, a foundational feature in many modern state-of-the-art convolutional neural networks. They can be formulated as <math>G(x) = x + F(x)</math>, where <math>x</math> and <math>G(x)</math> are the input and output of the residual block, and <math>F(x)</math> is the output of the residual branch of the residual block. A residual block typically performs a convolution operation and then passes the result plus its input on to the next block.<br />
<br />
Intuition behind residual blocks:<br />
If the identity mapping is optimal, it is easier to push the residual to zero (<math>F(x) = 0</math>) than to fit an identity mapping (output equal to input <math>x</math>) with a stack of non-linear layers. In other words, a stack of non-linear CNN layers can reach a solution like <math>F(x) = 0</math> much more easily than <math>F(x) = x</math>. This function <math>F(x)</math> is what the authors call the residual function ([https://medium.com/@14prakash/understanding-and-implementing-architectures-of-resnet-and-resnext-for-state-of-the-art-image-cf51669e1624 Reference]).<br />
<br />
<br />
[[File:ResidualBlock.png|580px|centre|thumb|An example of a simple residual block from Deep Residual Learning for Image Recognition by He et al., 2016]]<br />
<br />
ResNet is constructed out of a large number of these residual blocks sequentially stacked. It is interesting to note that having too many layers can cause overfitting, as pointed out by He et al. (2016) with the high error rates for the 1,202-layer ResNet on CIFAR datasets. Another paper (Veit et al., 2016) empirically showed that the cause of the high error rates can be mostly attributed to specific residual blocks whose channels increase greatly.<br />
<br />
'''PyramidNet''' is an important iteration that built on ResNet and Wide ResNet by gradually increasing the number of channels on each residual block. The residual block is similar to those used in ResNet. It has been used to build some of the first successful convolutional neural networks with very large depth, at 272 layers. Amongst unmodified residual network architectures, it performs the best on the CIFAR datasets.<br />
<br />
[[File:ResidualBlockComparison.png|980px|centre|thumb|A simple illustration of different residual blocks from Deep Pyramidal Residual Networks by Han et al., 2017. The width of a block reflects the number of channels used in that layer.]]<br />
<br />
<br />
'''Non-Deep Approaches'''<br />
<br />
'''Wide ResNet''' modified ResNet by increasing channels in each layer, having a wider and shallower structure. Similarly to PyramidNet, this architecture avoids some of the pitfalls in the original formulation of ResNet.<br />
<br />
'''ResNeXt''' achieved performance beyond that of Wide ResNet with only a small increase in the number of parameters. It can be formulated as <math>G(x) = x + F_1(x)+F_2(x)</math>. In this case, <math>F_1(x)</math> and <math>F_2(x)</math> are the outputs of two paired convolution operations in a single residual block. The number of branches is not limited to 2, and will control the result of this network.<br />
<br />
<br />
[[File:SimplifiedResNeXt.png|600px|centre|thumb|Simplified ResNeXt Convolution Block. Yamada et al., 2018]]<br />
<br />
<br />
'''Regularization Methods For Residual Blocks'''<br />
<br />
'''Stochastic Depth''' works by randomly dropping paths in the residual blocks. On the <math>l^{th}</math> residual block the Stochastic Depth process is given as <math>G(x)=x+b_lF(x)</math>, where <math>b_l \in \{0,1\}</math> is a Bernoulli random variable that equals 1 (the residual branch is kept) with probability <math>p_l</math>. Unlike sequential networks, there are many paths from the input to the output in these networks. By dropping some of the connections, the network is forced to flow through different paths to get the final deep layer representation. In a way it is similar to dropout, but for paths in multi-path networks. Using a constant value for <math>p_l</math> didn't work well, so instead a linear decay rule <math>p_l = 1 - \frac{l}{L}(1-p_L)</math> was used. In this equation, <math>L</math> is the number of layers (residual blocks), and <math>p_L</math> is the keep probability assigned to the final block. Essentially, the probability of a residual branch being dropped grows linearly with its depth in the network.<br />
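<br />
The linear decay rule and the training-time forward pass can be sketched as follows. This is a minimal illustration that assumes <math>p_l</math> is the probability that <math>b_l = 1</math> (the branch is kept); the list-based tensors and the residual branch <code>double</code> are hypothetical stand-ins for real layers.<br />
<br />
```python
import random


def survival_probability(l, num_blocks, p_final):
    """Linear decay rule p_l = 1 - (l / L) * (1 - p_L)."""
    return 1.0 - (l / num_blocks) * (1.0 - p_final)


def stochastic_depth_block(x, residual_fn, p_l, rng):
    """Training-time forward pass G(x) = x + b_l * F(x), b_l ~ Bernoulli(p_l)."""
    b_l = 1 if rng.random() < p_l else 0
    return [xi + b_l * fi for xi, fi in zip(x, residual_fn(x))]


def double(x):
    """Hypothetical residual branch F(x) = 2x, used only for illustration."""
    return [2.0 * xi for xi in x]


rng = random.Random(0)
p_first = survival_probability(1, num_blocks=54, p_final=0.5)
p_mid = survival_probability(27, num_blocks=54, p_final=0.5)
p_last = survival_probability(54, num_blocks=54, p_final=0.5)
always_kept = stochastic_depth_block([1.0, 2.0], double, p_l=1.0, rng=rng)
always_dropped = stochastic_depth_block([1.0, 2.0], double, p_l=0.0, rng=rng)
```
<br />
Early blocks are almost always kept, while the last block is kept only with probability <math>p_L</math>.<br />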
<br />
'''Shake-Shake''' is a regularization method that specifically improves the ResNeXt (multiple residual branches) architecture. It is given as <math>G(x)=x+\alpha F_1(x)+(1-\alpha)F_2(x)</math>, where <math>\alpha \in [0,1]</math> is a random coefficient. Essentially, the parallel residual branches are randomly blended in the forward direction. This is similar to stochastic depth regularization, but a residual path always exists.<br />
Moreover, on the backward pass an independent random variable <math>\beta</math> is used in place of <math>\alpha</math> to rescale the gradient flowing through each branch. This has the effect of adding noise to the gradient update process and improves performance over the vanilla ResNeXt network.<br />
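<br />
The forward formula <math>G(x)=x+\alpha F_1(x)+(1-\alpha)F_2(x)</math> can be illustrated with a short sketch. The two residual branches below are hypothetical placeholders, tensors are plain Python lists, and the backward-pass coefficient <math>\beta</math> is not modelled because it requires an automatic differentiation framework.<br />
<br />
```python
import random


def shake_shake_forward(x, branch1, branch2, alpha):
    """G(x) = x + alpha * F1(x) + (1 - alpha) * F2(x)."""
    f1, f2 = branch1(x), branch2(x)
    return [xi + alpha * a + (1.0 - alpha) * b for xi, a, b in zip(x, f1, f2)]


def plus_one(x):
    """Hypothetical residual branch F1, used only for illustration."""
    return [xi + 1.0 for xi in x]


def times_two(x):
    """Hypothetical residual branch F2, used only for illustration."""
    return [2.0 * xi for xi in x]


rng = random.Random(0)
alpha = rng.uniform(0.0, 1.0)                 # fresh coefficient per training pass
noisy = shake_shake_forward([1.0, 2.0], plus_one, times_two, alpha)
only_f1 = shake_shake_forward([1.0, 2.0], plus_one, times_two, alpha=1.0)
only_f2 = shake_shake_forward([1.0, 2.0], plus_one, times_two, alpha=0.0)
```
<br />
Any sampled <math>\alpha</math> yields an output between the two single-branch extremes, which is why the forward pass can be read as an interpolation of the branches.<br />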
<br />
<br />
[[File:Paper 32.jpg|600px|centre|thumb| Shake-Shake (ResNeXt + Shake-Shake) (Gastaldi, 2017), in which some processing layers omitted for conciseness.]]<br />
<br />
=Proposed Method=<br />
The authors give an intuitive interpretation of the forward pass of Shake-Shake regularization. To the best of their knowledge, no such interpretation had been given before, while the behaviour in the backward pass was investigated experimentally by Gastaldi (2017). In the forward pass, Shake-Shake interpolates the outputs of two residual branches with a random variable α that controls the degree of interpolation. Since DeVries & Taylor (2017a) demonstrated that interpolating two data points in feature space can synthesize reasonable augmented data, the interpolation of two residual branches in the forward pass of Shake-Shake can be interpreted as synthesizing data. Using a random variable α generates many different augmented data points. On the other hand, in the backward pass, a different random variable β is used to disturb learning so that the network can keep learning over a long training period. Gastaldi (2017) demonstrated how the difference between <math>\alpha</math> and <math>\beta</math> affects the results.<br />
<br />
The regularization mechanism of Shake-Shake relies on two or more residual branches, so it can be applied only to multi-branch network architectures. In addition, 2-branch network architectures consume more memory than 1-branch network architectures. One might think that the number of learnable parameters of ResNeXt can be kept the same between 1-branch and 2-branch architectures by controlling the cardinality and the number of channels (filters). For example, a 1-branch network (e.g., ResNeXt 1-64d) and its corresponding 2-branch network (e.g., ResNeXt 2-40d) have almost the same number of learnable parameters. Even so, the 2-branch network consumes more memory because of overhead such as keeping the inputs of the residual blocks. Comparing ResNeXt 1-64d and 2-40d, the latter requires 8% more memory than the former in theory (for one layer) and 11% more in measured values (for 152 layers).<br />
<br />
This paper seeks to generalize the method proposed in Shake-Shake so that it can be applied to any network with a residual structure. The initial formulation of 1-branch shake is <math>G(x) = x + \alpha F(x)</math>. In this case, <math>\alpha</math> is a coefficient that disturbs the forward pass, but is not necessarily constrained to be in [0,1]. Another corresponding coefficient <math>\beta</math> is used in the backward pass. Applying this simple adaptation of Shake-Shake to a 110-layer version of PyramidNet with <math>\alpha \in [0,1]</math> and <math>\beta \in [0,1]</math> performs abysmally, with an error rate of 77.99%.<br />
<br />
This failure is a result of the setup causing too much perturbation. A trick is needed to promote learning under large perturbations while preserving the regularization effect. The authors' idea is to borrow from ResDrop and combine it with Shake-Shake by randomly deciding whether to apply 1-branch shake. This creates, in effect, two networks: the original network without a regularization component, and a regularized network. When mixing the two networks, the authors expected the following effects: when the non-regularized network is selected, learning is promoted; when the perturbed network is selected, learning is disturbed. Achieving good performance requires a balance between the two. <br />
<br />
'''ShakeDrop''' is given as <br />
<br />
<div align="center"><br />
<math>G(x) = x + (b_l + \alpha - b_l \alpha)F(x)</math>,<br />
</div><br />
<br />
where <math>b_l</math> is a Bernoulli random variable following the linear decay rule used in Stochastic Depth. An alternative presentation is <br />
<br />
<div align="center"><br />
<math><br />
G(x) = \begin{cases}<br />
x + F(x) ~~ \text{if } b_l = 1 \\<br />
x + \alpha F(x) ~~ \text{otherwise}<br />
\end{cases}<br />
</math><br />
</div><br />
<br />
If <math>b_l = 1</math> then ShakeDrop is equivalent to the original network, otherwise it is the network + 1-branch Shake. The authors also found that the linear decay rule of ResDrop works well, compared with the uniform rule. Regardless of the value of <math>\beta</math> on the backwards pass, network weights will be updated.<br />
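<br />
The ShakeDrop forward rule can be sketched in a few lines. This is only a training-time illustration under stated assumptions: tensors are plain Python lists, the residual branch <code>double</code> is a hypothetical placeholder, <math>\alpha</math> is drawn uniformly from a configurable range, and the backward-pass coefficient <math>\beta</math> is not modelled because it requires an automatic differentiation framework.<br />
<br />
```python
import random


def shakedrop_coefficient(b_l, alpha):
    """Scale applied to F(x): b_l + alpha - b_l * alpha."""
    return b_l + alpha - b_l * alpha


def shakedrop_forward(x, residual_fn, p_l, rng, alpha_range=(-1.0, 1.0)):
    """Training-time forward pass G(x) = x + (b_l + alpha - b_l * alpha) * F(x).

    b_l ~ Bernoulli(p_l); alpha ~ Uniform(alpha_range).  The backward-pass
    coefficient beta needs an autograd framework and is not modelled here.
    """
    b_l = 1 if rng.random() < p_l else 0
    alpha = rng.uniform(*alpha_range)
    scale = shakedrop_coefficient(b_l, alpha)
    return [xi + scale * fi for xi, fi in zip(x, residual_fn(x))]


def double(x):
    """Hypothetical residual branch F(x) = 2x, used only for illustration."""
    return [2.0 * xi for xi in x]


rng = random.Random(0)
unperturbed = shakedrop_forward([1.0, 2.0], double, p_l=1.0, rng=rng)  # b_l = 1
perturbed = shakedrop_forward([1.0, 2.0], double, p_l=0.0, rng=rng)    # b_l = 0
```
<br />
When <math>b_l = 1</math> the coefficient collapses to 1 and the block behaves like the original network; when <math>b_l = 0</math> the coefficient is <math>\alpha</math>, which may be negative and can therefore flip the sign of the residual branch.<br />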
<br />
=Experiments=<br />
<br />
'''Parameter Search'''<br />
<br />
The authors' experiments began with a hyperparameter search using ShakeDrop on pyramidal networks. The PyramidNet used was made up of a total of 110 layers, which included a convolutional layer and a final fully connected layer. It had 54 additive pyramidal residual blocks, and the final residual block had 286 channels. The results of this search are presented below. <br />
<br />
[[File:ShakeDropHyperParameterSearch.png|600px|centre|thumb|Average Top-1 errors (%) of “PyramidNet + ShakeDrop” with several ranges of parameters of 4 runs at the final (300th) epoch on CIFAR-100 dataset in the “Batch” level. In some settings, it is equivalent to PyramidNet and PyramidDrop. Borrowed from ShakeDrop Regularization by Yamada et al., 2018.]]<br />
<br />
The settings used throughout the rest of the experiments are then <math>\alpha \in [-1,1]</math> and <math>\beta \in [0,1]</math>. Cases H and F outperform PyramidNet, suggesting that the strong perturbations imposed by ShakeDrop are functioning as intended. However, fully applying the perturbations in the backward pass appears to destabilize the network, resulting in performance that is worse than standard PyramidNet.<br />
<br />
[[File:ParameterUpdateShakeDrop.png|400px|centre]]<br />
<br />
Following this initial parameter decision, the authors tested four strategies for updating the coefficients: "Batch" (the same coefficients for all images in a mini-batch for each residual block), "Image" (the same scaling coefficients for each image for each residual block), "Channel" (the same scaling coefficients for each channel for each residual block), and "Pixel" (separate scaling coefficients for each element for each residual block). While Pixel was the best in terms of error rate, it is not very memory efficient, so Image was selected, as it had the second-best performance without the memory drawback.<br />
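<br />
One way to picture the four levels is through the shape of the random coefficient tensor that is broadcast against a residual block's output. The sketch below is an interpretation rather than the authors' code: it assumes an output layout of (batch, channels, height, width), assumes "Channel" means one coefficient per image and channel and "Pixel" means one coefficient per element, and uses hypothetical function names and sizes.<br />
<br />
```python
def coefficient_shape(level, batch, channels, height, width):
    """Shape of the random-coefficient tensor for one residual block whose
    output has shape (batch, channels, height, width).  Size-1 axes are
    meant to be broadcast."""
    shapes = {
        "batch": (1, 1, 1, 1),                       # one draw shared by the mini-batch
        "image": (batch, 1, 1, 1),                   # one draw per image
        "channel": (batch, channels, 1, 1),          # one draw per image and channel
        "pixel": (batch, channels, height, width),   # one draw per element
    }
    return shapes[level]


def num_coefficients(level, batch, channels, height, width):
    """Number of independent random draws needed at the given level."""
    n = 1
    for dim in coefficient_shape(level, batch, channels, height, width):
        n *= dim
    return n


counts = {level: num_coefficients(level, 128, 16, 32, 32)
          for level in ("batch", "image", "channel", "pixel")}
```
<br />
The number of draws grows by orders of magnitude from Batch to Pixel, which is consistent with the memory concern raised for the Pixel level.<br />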
<br />
'''Comparison with Regularization Methods'''<br />
<br />
For these experiments, a few modifications were made to assist with training. For ResNeXt, the EraseReLU formulation has each residual block end in batch normalization. The Wide ResNet is also compared with and without batch normalization. Batch normalization keeps the outputs of residual blocks in a certain range, as otherwise <math>\alpha</math> and <math>\beta</math> could cause perturbations that are too large, causing divergent learning. There is also a comparison of ResDrop/ShakeDrop Type A (where the regularization unit is inserted before the add unit for a residual branch) and the alternative placement (where the regularization unit is inserted after the add unit for a residual branch). <br />
<br />
These experiments are performed on the CIFAR-100 dataset.<br />
<br />
[[File:ShakeDropArchitectureComparison1.png|800px|centre|thumb|]]<br />
<br />
[[File:ShakeDropArchitectureComparison2.png|800px|centre|thumb|]]<br />
<br />
[[File:ShakeDropArchitectureComparison3.png|800px|centre|thumb|]]<br />
<br />
For a final round of testing, the training setup was modified to incorporate other techniques used in state-of-the-art methods. For most of the tests, the learning rate for the 300 epoch version started at 0.1 and decayed by a factor of 0.1 at 1/2 and 3/4 of the way through training. The alternative was cosine annealing, based on the presentation by Loshchilov and Hutter in their paper SGDR: Stochastic Gradient Descent with Warm Restarts. This is indicated in the Cos column, with a check indicating cosine annealing. <br />
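<br />
The two learning rate schedules can be sketched as below. This is a hedged illustration: the function names are hypothetical, and the cosine schedule is assumed to be a single cycle without restarts that anneals to zero, which may differ from the exact configuration used in the paper.<br />
<br />
```python
import math

def step_decay_lr(epoch, total_epochs=300, base_lr=0.1, factor=0.1):
    """Multiply the rate by `factor` at 1/2 and again at 3/4 of training."""
    lr = base_lr
    if epoch >= total_epochs // 2:
        lr *= factor
    if epoch >= (3 * total_epochs) // 4:
        lr *= factor
    return lr

def cosine_annealing_lr(epoch, total_epochs=300, base_lr=0.1, min_lr=0.0):
    """Single-cycle cosine annealing from base_lr down to min_lr (assumed form)."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)

if __name__ == "__main__":
    for epoch in (0, 75, 150, 225, 299):
        print(epoch, step_decay_lr(epoch), cosine_annealing_lr(epoch))
```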
<br />
[[File:CosineAnnealing.png|400px|centre|thumb|]]<br />
<br />
The Reg column indicates the regularization method used, either none, ResDrop (RD), Shake-Shake (SS), or ShakeDrop (SD). Finally, the Fil column indicates the type of data augmentation used, either none, cutout (CO) (DeVries & Taylor, 2017b), or Random Erasing (RE) (Zhong et al., 2017). <br />
<br />
[[File:ShakeDropComparison.png|800px|centre|thumb|Top-1 Errors (%) at final epoch on CIFAR-10/100 datasets]]<br />
<br />
'''State-of-the-Art Comparisons'''<br />
<br />
A direct comparison with state of the art methods is favorable for this new method. <br />
<br />
# Fair comparison of ResNeXt + Shake-Shake with PyramidNet + ShakeDrop gives an improvement of 0.19% on CIFAR-10 and 1.86% on CIFAR-100. Under these conditions, the final error rate is then 2.67% for CIFAR-10 and 13.99% for CIFAR-100.<br />
# Fair comparison of ResNeXt + Shake-Shake + Cutout with PyramidNet + ShakeDrop + Random Erasing gives an improvement of 0.25% on CIFAR-10 and 3.01% on CIFAR-100. Under these conditions, the final error rate is then 2.31% for CIFAR-10 and 12.19% for CIFAR-100.<br />
# Comparison with the state of the art: PyramidNet + ShakeDrop improves on ResNeXt + Shake-Shake + Cutout by 0.25% on CIFAR-10, and on Coupled Ensemble by 2.85% on CIFAR-100.<br />
<br />
=Implementation details=<br />
<br />
'''CIFAR-10/100 datasets'''<br />
<br />
All the images in these datasets were color normalized and then horizontally flipped with a probability of 50%. All of them were then zero padded to a dimensionality of 40 by 40 pixels.<br />
<br />
<br />
=Conclusion=<br />
The paper proposes a new form of regularization that is an extension of "Shake-Shake" regularization [Gastaldi, 2017]. The original "Shake-Shake" proposes using two residual paths adding to the same output, and during training, considering different randomly selected convex combinations of the two paths (while using an equally weighted combination at test time). This paper contends that this requires additional memory, and attempts to achieve similar regularization with a single path. To do so, the authors train a network with a single residual path, where the residual is included without attenuation in some cases with some fixed probability, and attenuated randomly (or even inverted) in others. The paper contends that this achieves better performance than simply choosing a random attenuation for every sample (although this can be seen as choosing an attenuation under a distribution with some fixed probability mass).<br />
<br />
Their stochastic regularization method, ShakeDrop, outperforms previous state-of-the-art methods while maintaining similar memory efficiency. It demonstrates that heavily perturbing a network can help to overcome issues with overfitting. It is also an effective way to regularize residual networks for image classification. The method was tested on the CIFAR-10/100 and Tiny ImageNet datasets and showed strong performance.<br />
<br />
=Critique=<br />
<br />
The novelty of this paper is low, as pointed out by the reviewers. It is also unclear whether the results can be replicated, since <math>\alpha</math> and <math>\beta</math> are chosen randomly. The proposed ShakeDrop regularization is essentially a combination of the PyramidDrop and Shake-Shake regularization. The most surprising part is that the forward weight can be negative, thus inverting the output of a convolution. The mathematical justification for ShakeDrop regularization is limited, relying on intuition and empirical evidence instead.<br />
<br />
One downside of this method (as was identified in the presentation as well) is that training the cosine annealing variation of the model takes 1800 epochs, which is time intensive compared to the methods used as baselines. This can limit the practical use of this algorithm.<br />
<br />
As pointed out above, the method relies heavily on intuition. This means that its performance may not extend beyond the CIFAR datasets and could vary considerably depending on the characteristics of the data set used. However, the performance is still impressive since it is better than that of known algorithms. It is not clear how the proposed technique would work with a non-residual architecture.<br />
The paper lacks conclusive proof that ShakeDrop is a generically useful regularization technique. For one, the method is evaluated only on small toy datasets: CIFAR-10 and CIFAR-100. Evaluation on ImageNet would perhaps have been valuable.<br />
<br />
=References=<br />
[Yamada et al., 2018] Yamada Y, Iwamura M, Kise K. ShakeDrop regularization. arXiv preprint arXiv:1802.02375. 2018 Feb 7.<br />
<br />
[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. CVPR, 2016.<br />
<br />
[Zagoruyko & Komodakis, 2016] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Proc. BMVC, 2016.<br />
<br />
[Han et al., 2017] Dongyoon Han, Jiwhan Kim, and Junmo Kim. Deep pyramidal residual networks. In Proc. CVPR, 2017a.<br />
<br />
[Xie et al., 2017] Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proc. CVPR, 2017.<br />
<br />
[Huang et al., 2016] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. arXiv preprint arXiv:1603.09382v3, 2016.<br />
<br />
[Gastaldi, 2017] Xavier Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485v2, 2017.<br />
<br />
[Loshchilov & Hutter, 2016] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.<br />
<br />
[DeVries & Taylor, 2017b] Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017b.<br />
<br />
[Zhong et al., 2017] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.<br />
<br />
[Dutt et al., 2017] Anuvabh Dutt, Denis Pellerin, and Georges Quénot. Coupled ensembles of neural networks. arXiv preprint arXiv:1709.06053v1, 2017.<br />
<br />
[Veit et al., 2016] Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. Advances in Neural Information Processing Systems 29, 2016.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DON%27T_DECAY_THE_LEARNING_RATE_,_INCREASE_THE_BATCH_SIZE&diff=41986DON'T DECAY THE LEARNING RATE , INCREASE THE BATCH SIZE2018-11-30T00:48:16Z<p>Z43ma: </p>
<hr />
<div>Summary of the ICLR 2018 paper: '''Don't Decay the learning Rate, Increase the Batch Size ''' <br />
<br />
Link: [https://arxiv.org/pdf/1711.00489.pdf]<br />
<br />
Summarized by: Afify, Ahmed [ID: 20700841]<br />
<br />
==INTUITION==<br />
Nowadays, it is common practice not to use a single fixed learning rate for the learning phase of neural network models. Instead, we use adaptive learning rates with the standard gradient descent method. The intuition behind this is that when we are far away from the minimum, it is beneficial to take large steps towards it, as fewer steps are then needed to converge, but as we approach the minimum, our step size should decrease, otherwise we may just keep oscillating around it. In practice, this is generally achieved by methods like SGD with momentum, Nesterov momentum, and Adam. However, the core claim of this paper is that the same effect can be achieved by increasing the batch size during the gradient descent process while keeping the learning rate constant throughout. In addition, the paper argues that such an approach also reduces the number of parameter updates required to reach the minimum, thus leading to greater parallelism and shorter training times.<br />
<br />
== INTRODUCTION ==<br />
Although stochastic gradient descent (SGD) is widely used for training deep learning models because it finds minima that generalize well (Zhang et al., 2016; Wilson et al., 2017), the optimization process is slow. This has motivated researchers (Goyal et al., 2017; Hoffer et al., 2017; You et al., 2017a) to try to speed up optimization by taking bigger steps, and hence reduce the number of parameter updates needed to train a model, by using large-batch training, which can be divided across many machines. <br />
<br />
However, increasing the batch size tends to decrease the test set accuracy (Keskar et al., 2016; Goyal et al., 2017). Smith and Le (2017) argued that SGD has a scale of random fluctuations <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, N is the number of training samples, and B is the batch size. They concluded that there is an optimal batch size proportional to the learning rate when <math> B \ll N </math>, and an optimum fluctuation scale g that maximizes test set accuracy.<br />
<br />
In this paper, the authors' main goal is to provide evidence that increasing the batch size is quantitatively equivalent to decreasing the learning rate: for the same number of training epochs, both decrease the scale of random fluctuations, but increasing the batch size requires far fewer parameter updates. Moreover, the number of parameter updates can be reduced further by increasing the learning rate and scaling <math> B \propto \epsilon </math>, or further still by increasing the momentum coefficient and scaling <math> B \propto \frac{1}{1-m} </math>, although the latter decreases the test accuracy. This is demonstrated by several experiments on the CIFAR-10 dataset using a wide ResNet and on the ImageNet dataset using the Inception-ResNet-V2 and ResNet-50 architectures.<br />
<br />
== STOCHASTIC GRADIENT DESCENT AND CONVEX OPTIMIZATION ==<br />
As mentioned in the previous section, the drawback of SGD when compared to full-batch training is the noise that it introduces that hinders optimization. According to (Robbins & Monro, 1951), there are two equations that govern how to reach the minimum of a convex function: (<math> \epsilon_i </math> denotes the learning rate at the <math> i^{th} </math> gradient update)<br />
<br />
<math> \sum_{i=1}^{\infty} \epsilon_i = \infty </math>. This equation guarantees that we will reach the minimum <br />
<br />
<math> \sum_{i=1}^{\infty} \epsilon^2_i < \infty </math>. This equation, which is valid only for a fixed batch size, guarantees that learning rate decays fast enough allowing us to reach the minimum rather than bouncing due to noise.<br />
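<br />
As a quick numerical illustration of the two conditions above, the sketch below compares the partial sums for a decaying schedule <math> \epsilon_i = 1/i </math> with those for a constant schedule. Both schedules and the helper names are chosen only for illustration and do not come from the paper.<br />
<br />
```python
import math

def partial_sums(schedule, n_terms):
    """Return (sum of eps_i, sum of eps_i^2) over i = 1..n_terms."""
    total = 0.0
    total_sq = 0.0
    for i in range(1, n_terms + 1):
        eps = schedule(i)
        total += eps
        total_sq += eps * eps
    return total, total_sq

def inverse_decay(i):
    """eps_i = 1/i: the sum diverges while the sum of squares converges."""
    return 1.0 / i

def constant_rate(i):
    """eps_i = 0.1: both sums diverge, so the second condition fails."""
    return 0.1

if __name__ == "__main__":
    print("limit of the squared sum for 1/i:", math.pi ** 2 / 6)
    for n in (10, 1000, 100000):
        print(n, partial_sums(inverse_decay, n), partial_sums(constant_rate, n))
```
<br />
For <math> \epsilon_i = 1/i </math> the first sum keeps growing (logarithmically) while the second stays below <math> \pi^2/6 </math>; for the constant schedule the sum of squares grows without bound.<br />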
<br />
These equations indicate that the learning rate must decay during training, and the second equation only applies when the batch size is constant. To allow the batch size to change, Smith and Le (2017) proposed to interpret SGD as integrating the stochastic differential equation <math> \frac{dw}{dt} = -\frac{dC}{dw} + \eta(t) </math>, where C represents the cost function, w represents the parameters, and η represents Gaussian random noise. Furthermore, they showed that the noise scale g controls the magnitude of random fluctuations in the training dynamics through the formula <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, N is the training set size and <math>B</math> is the batch size. As we usually have <math> B \ll N </math>, we can write <math> g \approx \epsilon \frac{N}{B} </math>. This explains why the noise <math>g</math> decreases when the learning rate decreases, enabling us to converge to the minimum of the cost function. However, increasing the batch size has the same effect and makes <math>g</math> decay at a constant learning rate. In this work, the batch size is increased until <math> B \approx \frac{N}{10} </math>, after which the conventional approach of decaying the learning rate is followed.<br />
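<br />
The equivalence can be checked numerically with the formulas above. The sketch below is only an illustration: the function names are hypothetical, and the values (50,000 training samples, learning rate 0.1, batch size 128, factor of 5) are borrowed from the CIFAR-10 experiments described later.<br />
<br />
```python
import math

def noise_scale(lr, train_size, batch_size):
    """Exact noise scale g = eps * (N / B - 1)."""
    return lr * (train_size / batch_size - 1.0)

def approx_noise_scale(lr, train_size, batch_size):
    """Approximation g ~ eps * N / B, valid when B << N."""
    return lr * train_size / batch_size

if __name__ == "__main__":
    N = 50_000
    lr, batch, factor = 0.1, 128, 5

    decayed_lr = approx_noise_scale(lr / factor, N, batch)
    bigger_batch = approx_noise_scale(lr, N, batch * factor)
    # Under the approximation both changes reduce g by the same factor.
    print(decayed_lr, bigger_batch, math.isclose(decayed_lr, bigger_batch))

    # With the exact formula the two differ slightly because of the -1 term.
    print(noise_scale(lr / factor, N, batch), noise_scale(lr, N, batch * factor))
```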
<br />
== SIMULATED ANNEALING AND THE GENERALIZATION GAP ==<br />
'''Simulated Annealing:''' Introducing random noise or fluctuations whose scale falls during training.<br />
<br />
'''Generalization Gap:''' Models trained with small batches generalize better to the test set than models trained with large batches.<br />
<br />
Smith and Le (2017) found that there is an optimal batch size which corresponds to an optimal noise scale g <math> (g \approx \epsilon \frac{N}{B}) </math> and concluded that the batch size giving maximum test set accuracy satisfies <math> B_{opt} \propto \epsilon N </math>. This means that gradient noise is helpful, as it lets SGD escape sharp minima, which do not generalize well. <br />
<br />
Simulated annealing is a well-known technique in non-convex optimization. Starting with noise in the training process helps us to explore a wide range of parameters; once we are near the optimum, the noise is reduced to fine-tune the final parameters. However, more and more researchers prefer sharper decay schedules such as cosine decay or step-function drops. In the physical sciences, slowly annealing (or decaying) the temperature (which corresponds to the noise scale here) helps the system converge to the global minimum, which is sharp. Decaying the temperature in discrete steps, however, can leave the system stuck in a local minimum, which has higher cost and lower curvature. The authors think that deep learning has the same intuition.<br />
<br />
== THE EFFECTIVE LEARNING RATE AND THE ACCUMULATION VARIABLE ==<br />
'''The Effective Learning Rate''' : <math> \epsilon_{eff} = \frac{\epsilon}{1-m} </math><br />
<br />
Smith and Le (2017) extended the vanilla SGD noise scale defined above to include momentum: <math> g = \frac{\epsilon}{1-m}(\frac{N}{B}-1)\approx \frac{\epsilon N}{B(1-m)} </math>, which reduces to the previous equation when m goes to 0. They found that increasing the learning rate and momentum coefficient and scaling <math> B \propto \frac{\epsilon }{1-m} </math> reduces the number of parameter updates, but the test accuracy decreases when the momentum coefficient is increased. <br />
<br />
To understand the reasons behind this, we need to analyze momentum update equations below:<br />
<br />
<center><math><br />
\Delta A = -(1-m)A + \frac{d\widehat{C}}{dw} <br />
</math><br />
<br />
<math><br />
\Delta w = -A\epsilon<br />
</math><br />
</center><br />
<br />
We can see that the accumulation variable A, which is initially set to 0, increases exponentially towards its steady-state value over <math> \frac{B}{N(1-m)} </math> training epochs. During this time <math> \Delta w </math> is suppressed, which can reduce the rate of convergence. Moreover, at high momentum, we face the following challenges:<br />
<br />
1- Additional epochs are needed to catch up with the accumulation.<br />
<br />
2- Accumulation needs more time <math> \frac{B}{N(1-m)} </math> to forget old gradients. <br />
<br />
3- After this time, however, the accumulation cannot adapt to changes in the loss landscape.<br />
<br />
4- In the early stage, a large batch size can lead to instabilities.<br />
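<br />
The warm-up of the accumulation variable can be simulated directly from the update equation <math> \Delta A = -(1-m)A + \frac{d\widehat{C}}{dw} </math>. The sketch below is a toy illustration under the assumption of a constant unit gradient; the function names and the 90% threshold are hypothetical choices, not taken from the paper.<br />
<br />
```python
def accumulation_trace(momentum, n_updates, gradient=1.0):
    """Iterate Delta A = -(1 - m) * A + gradient, starting from A = 0."""
    a = 0.0
    trace = []
    for _ in range(n_updates):
        a += -(1.0 - momentum) * a + gradient
        trace.append(a)
    return trace

def updates_to_fraction(momentum, fraction=0.9, gradient=1.0):
    """Count updates until A reaches `fraction` of its steady state g / (1 - m)."""
    steady_state = gradient / (1.0 - momentum)
    a = 0.0
    updates = 0
    while a < fraction * steady_state:
        a += -(1.0 - momentum) * a + gradient
        updates += 1
    return updates

if __name__ == "__main__":
    for m in (0.9, 0.98):
        n = updates_to_fraction(m)
        # With N / B updates per epoch, n updates take n * B / N epochs.
        print(m, "steady state:", 1.0 / (1.0 - m), "updates to reach 90%:", n)
```
<br />
The number of updates needed grows roughly like <math> \frac{1}{1-m} </math>, which is <math> \frac{B}{N(1-m)} </math> epochs, so raising the momentum from 0.9 to 0.98 lengthens the warm-up about fivefold in this toy setting.<br />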
<br />
== EXPERIMENTS ==<br />
=== SIMULATED ANNEALING IN A WIDE RESNET ===<br />
<br />
'''Dataset:''' CIFAR-10 (50,000 training images)<br />
<br />
'''Network Architecture:''' “16-4” wide ResNet<br />
<br />
'''Training Schedules used as in the below figure:''' <br />
<br />
- Decaying learning rate: learning rate decays by a factor of 5 at a sequence of “steps”, and the batch size is constant<br />
<br />
- Increasing batch size: learning rate is constant, and the batch size is increased by a factor of 5 at every step.<br />
<br />
- Hybrid: At the beginning, the learning rate is constant and the batch size is increased by a factor of 5. Then, the learning rate decays by a factor of 5 at each subsequent step, and the batch size is constant. This is the schedule to use when hardware imposes a limit on the maximum batch size.<br />
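<br />
The three schedules can be written down and compared through the approximate noise scale <math> g \approx \epsilon \frac{N}{B} </math>. In this sketch the function names, the starting values (learning rate 0.1, batch size 128) and the number of phases are assumptions made for illustration.<br />
<br />
```python
import math

def build_schedule(kind, n_phases=4, base_lr=0.1, base_batch=128,
                   factor=5, batch_steps=1):
    """Return a list of (learning_rate, batch_size), one entry per phase."""
    lr, batch = base_lr, base_batch
    phases = [(lr, batch)]
    for step in range(1, n_phases):
        if kind == "decaying_lr":
            lr /= factor
        elif kind == "increasing_batch":
            batch *= factor
        elif kind == "hybrid":
            # Grow the batch for the first `batch_steps` steps, then decay the rate.
            if step <= batch_steps:
                batch *= factor
            else:
                lr /= factor
        else:
            raise ValueError(f"unknown schedule: {kind}")
        phases.append((lr, batch))
    return phases

def noise_scales(phases, train_size=50_000):
    """Approximate noise scale eps * N / B for every phase."""
    return [lr * train_size / batch for lr, batch in phases]

if __name__ == "__main__":
    for kind in ("decaying_lr", "increasing_batch", "hybrid"):
        phases = build_schedule(kind)
        print(kind, phases, noise_scales(phases))
```
<br />
All three schedules produce the same sequence of noise scales; they differ only in how many parameter updates each epoch costs.<br />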
<br />
[[File:Paper_40_Fig_1.png | 800px|center]]<br />
<br />
As shown in the figure below, on the training set (left, figure 2a) the three learning curves are essentially the same, while figure 2b shows that increasing the batch size greatly reduces the number of parameter updates.<br />
This suggests that it is the noise scale that needs to be decayed, not the learning rate itself.
[[File:Paper_40_Fig_2.png | 800px|center]] <br />
<br />
To check that these results also hold on the test set, figure 3 shows that the three learning curves are again essentially the same for SGD with momentum and for Nesterov momentum.
[[File:Paper_40_Fig_3.png | 800px|center]]<br />
<br />
To check other optimizers as well, the figure below repeats the experiment of figure 3 (the three test-set learning curves) for vanilla SGD and Adam. <br />
[[File:Paper_40_Fig_4.png | 800px|center]]<br />
<br />
'''Conclusion:''' Decreasing the learning rate and increasing the batch size during training are equivalent<br />
<br />
=== INCREASING THE EFFECTIVE LEARNING RATE===<br />
<br />
'''Dataset:''' CIFAR-10 (50,000 training images)<br />
<br />
'''Network Architecture:''' “16-4” wide ResNet<br />
<br />
'''Training Parameters:''' Optimization Algorithm: SGD with momentum / Maximum batch size = 5120<br />
<br />
'''Training Schedules:''' <br />
<br />
Four training schedules, all of which decay the noise scale by a factor of five in a series of three steps with the same number of epochs.<br />
<br />
Original training schedule: initial learning rate of 0.1 which decays by a factor of 5 at each step, a momentum coefficient of 0.9, and a batch size of 128. <br />
<br />
Increasing batch size: learning rate of 0.1, momentum coefficient of 0.9, initial batch size of 128 that increases by a factor of 5 at each step. <br />
<br />
Increased initial learning rate: initial learning rate of 0.5, initial batch size of 640 that increases during training.<br />
<br />
Increased momentum coefficient: increased initial learning rate of 0.5, initial batch size of 3200 that increases during training, and an increased momentum coefficient of 0.98.<br />
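<br />
The initial settings of the four schedules can be checked against the momentum noise scale <math> g \approx \frac{\epsilon N}{B(1-m)} </math>. The sketch below is illustrative only: the function name is hypothetical, and the third schedule is assumed to keep the original momentum coefficient of 0.9.<br />
<br />
```python
import math

def momentum_noise_scale(lr, momentum, batch_size, train_size=50_000):
    """Approximate noise scale eps * N / (B * (1 - m))."""
    return lr * train_size / (batch_size * (1.0 - momentum))

# (initial learning rate, momentum coefficient, initial batch size)
INITIAL_SETTINGS = {
    "original": (0.1, 0.9, 128),
    "increasing batch size": (0.1, 0.9, 128),
    "increased initial learning rate": (0.5, 0.9, 640),    # momentum 0.9 assumed
    "increased momentum coefficient": (0.5, 0.98, 3200),
}

if __name__ == "__main__":
    for name, (lr, m, batch) in INITIAL_SETTINGS.items():
        g = momentum_noise_scale(lr, m, batch)
        updates_per_epoch = 50_000 / batch
        print(f"{name}: g = {g:.3f}, updates per epoch = {updates_per_epoch:.1f}")
```
<br />
All four settings start from the same noise scale, while the number of parameter updates per epoch shrinks as the batch size grows.<br />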
<br />
The results of all training schedules are presented in the table and figure below:<br />
<br />
[[File:Paper_40_Table_1.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_5.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the effective learning rate and scaling the batch size results in further reduction in the number of parameter updates<br />
<br />
=== TRAINING IMAGENET IN 2500 PARAMETER UPDATES===<br />
<br />
'''A) Experiment Goal:''' Control Batch Size<br />
<br />
'''Dataset:''' ImageNet (1.28 million training images)<br />
<br />
The paper modified the setup of Goyal et al. (2017), and used the following configuration:<br />
<br />
'''Network Architecture:''' Inception-ResNet-V2 <br />
<br />
'''Training Parameters:''' <br />
<br />
90 epochs / noise decayed at epoch 30, 60, and 80 by a factor of 10 / Initial ghost batch size = 32 / Learning rate = 3 / momentum coefficient = 0.9 / Initial batch size = 8192<br />
<br />
Two training schedules were used:<br />
<br />
“Decaying learning rate”, where batch size is fixed and the learning rate is decayed<br />
<br />
“Increasing batch size”, where batch size is increased to 81920 then the learning rate is decayed at two steps.<br />
<br />
[[File:Paper_40_Table_2.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_6.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the batch size resulted in reducing the number of parameter updates from 14,000 to 6,000.<br />
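<br />
The reported reduction can be reproduced approximately by counting updates per epoch as <math> \frac{N}{B} </math>. This sketch assumes that the batch size jumps from 8192 to 81920 at epoch 30 and then stays fixed for the remaining 60 epochs; the function name and the rounded training set size are illustrative.<br />
<br />
```python
def parameter_updates(phases, train_size=1_280_000):
    """Total updates for a list of (n_epochs, batch_size) phases."""
    return sum(n_epochs * train_size / batch_size
               for n_epochs, batch_size in phases)

if __name__ == "__main__":
    decaying_lr = [(90, 8192)]
    increasing_batch = [(30, 8192), (60, 81920)]  # assumed phase boundaries
    print("decaying learning rate:", parameter_updates(decaying_lr))
    print("increasing batch size:", parameter_updates(increasing_batch))
```
<br />
Under these assumptions the counts come out at roughly 14,000 and just under 6,000 updates, in line with the conclusion above.<br />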
<br />
'''B) Experiment Goal:''' Control Batch Size and Momentum Coefficient<br />
<br />
'''Training Parameters:''' Ghost batch size = 64 / noise decayed at epoch 30, 60, and 80 by a factor of 10. <br />
<br />
The table below shows the number of parameter updates and the accuracy for different sets of training parameters:<br />
<br />
[[File:Paper_40_Table_3.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_7.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the momentum reduces the number of parameter updates, but leads to a drop in the test accuracy.<br />
<br />
=== TRAINING IMAGENET IN 30 MINUTES===<br />
<br />
'''Dataset:''' ImageNet (Already introduced in the previous section)<br />
<br />
'''Network Architecture:''' ResNet-50<br />
<br />
The paper replicated the setup of Goyal et al. (2017) while varying the number of TPU devices, the batch size, and the learning rate, then measured the time to complete 90 epochs and the resulting accuracy. The experiments are summarized below:<br />
<br />
[[File:Paper_40_Table_4.png | 800px|center]]<br />
<br />
'''Conclusion:''' Model training times can be reduced by increasing the batch size during training.<br />
<br />
== RELATED WORK ==<br />
Main related work mentioned in the paper is as follows:<br />
<br />
- Smith & Le (2017) interpreted stochastic gradient descent as a stochastic differential equation; the paper builds on this idea to include decaying learning rates.<br />
<br />
- Mandt et al. (2017) analyzed how SGD performs in Bayesian posterior sampling.<br />
<br />
- Keskar et al. (2016) focused on the analysis of noise once the training is started.<br />
<br />
- Moreover, the proportional relationship between batch size and learning rate was first discovered by Goyal et al. (2017), who used it to train ResNet-50 on ImageNet in one hour.<br />
<br />
- Furthermore, You et al. (2017a) presented Layer-wise Adaptive Rate Scaling (LARS), which applies different learning rates to different layers, to train ImageNet in 14 minutes with 74.9% accuracy. <br />
<br />
- Finally, another strategy, Asynchronous-SGD, allowed Recht et al. (2011) and Dean et al. (2012) to use multiple GPUs even with small batch sizes.<br />
<br />
== CONCLUSIONS ==<br />
Increasing the batch size during training has the same benefits as decaying the learning rate, and in addition reduces the number of parameter updates, which corresponds to faster training. Experiments were performed on different image datasets and various optimizers with different training schedules to support this result. The paper proposed to increase the learning rate and momentum parameter m, while scaling <math> B \propto \frac{\epsilon}{1-m} </math>, which achieves fewer parameter updates but slightly lower test set accuracy, as detailed in the experiments section. In summary, on the ImageNet dataset, Inception-ResNet-V2 achieved 77% validation accuracy in under 2500 parameter updates, and ResNet-50 achieved 76.1% validation set accuracy on TPU in less than 30 minutes. One of the notable findings of this paper is that hyperparameters from the literature were used, and no hyperparameter tuning was needed.<br />
<br />
== CRITIQUE ==<br />
'''Pros:'''<br />
<br />
- The paper showed empirically that increasing batch size and decaying learning rate are equivalent.<br />
<br />
- Several experiments were performed on different optimizers such as SGD and Adam.<br />
<br />
- Had several comparisons with previous experimental setups.<br />
<br />
'''Cons:'''<br />
<br />
<br />
- All datasets used are image datasets. Other experiments should have been done on datasets from different domains to ensure generalization. <br />
<br />
- The number of parameter updates was used as a comparison criterion, but wall-clock times could have provided additional measurable judgment although they depend on the hardware used.<br />
<br />
- Special hardware is needed for large batch training, which is not always feasible. As batch-size increases, we generally need more RAM to train the same model. However, if learning rate is decreased, the RAM use remains constant. As a result, learning rate decay will allow us to train bigger models.<br />
<br />
- In section 5.2 (Increasing the Effective Learning rate), the authors did not test a range of learning rate values and used only (0.1 and 0.5). Additional results from varying the initial learning rate values from 0.1 to 3.2 are provided in the appendix, which indicates that the test accuracy begins to fall for initial learning rates greater than ~0.4. The appended results do not show validation set accuracy curves like in Figure 6, however. It would be beneficial to see if they were similar to the original 0.1 and 0.5 initial learning rate baselines.<br />
<br />
- Although the main idea of the paper is interesting, its results do not seem too surprising in comparison with other recent papers on the subject.<br />
<br />
- The paper could benefit from using some other models to demonstrate its claim and generalize its idea by adding some comparisons with other models as well as other recent methods to increase batch size.<br />
<br />
- The paper presents interesting ideas. However, it lacks mathematical and theoretical analysis beyond the core idea. Since the experiments are primarily on image datasets and little theory is provided, the paper's applicability to other types of data is limited. <br />
<br />
- Also, in the experimental setting, only a single training run from one random initialization is used. It would be better to report several runs and show confidence intervals.<br />
<br />
== REFERENCES ==<br />
- Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.<br />
<br />
- Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086, 2016.<br />
<br />
- Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.<br />
<br />
- Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012.<br />
<br />
- Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.<br />
<br />
- Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.<br />
<br />
- Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231, 2012.<br />
<br />
- Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.<br />
<br />
- Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.<br />
<br />
- Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.<br />
<br />
- Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.<br />
<br />
- Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. ACM, 2017.<br />
<br />
- Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.<br />
<br />
- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.<br />
<br />
- Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.<br />
<br />
- Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. arXiv preprint arXiv:1511.06251, 2017.<br />
<br />
- Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with restarts. arXiv preprint arXiv:1608.03983, 2016.<br />
<br />
- Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.<br />
<br />
- James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.<br />
<br />
- Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.<br />
<br />
- Lutz Prechelt. Early stopping-but when? Neural Networks: Tricks of the trade, pp. 553–553, 1998.<br />
<br />
- Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pp. 693–701, 2011.<br />
<br />
- Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.<br />
<br />
- Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.<br />
<br />
- Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pp. 4278–4284, 2017.<br />
<br />
- Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.<br />
<br />
- Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.<br />
<br />
- Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017a.<br />
<br />
- Yang You, Zhao Zhang, C Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes. CoRR, abs/1709.05011, 2017b.<br />
<br />
- Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.<br />
<br />
- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DON%27T_DECAY_THE_LEARNING_RATE_,_INCREASE_THE_BATCH_SIZE&diff=41984DON'T DECAY THE LEARNING RATE , INCREASE THE BATCH SIZE2018-11-30T00:45:25Z<p>Z43ma: </p>
<hr />
<div>Summary of the ICLR 2018 paper: '''Don't Decay the Learning Rate, Increase the Batch Size''' <br />
<br />
Link: [https://arxiv.org/pdf/1711.00489.pdf]<br />
<br />
Summarized by: Afify, Ahmed [ID: 20700841]<br />
<br />
==INTUITION==<br />
It is now common practice not to use a single fixed learning rate throughout the training of a neural network. Instead, the learning rate is adapted as gradient descent proceeds. The intuition is that when we are far from a minimum, large steps are beneficial because fewer of them are needed to converge, but as we approach the minimum the step size should shrink, otherwise we may keep oscillating around it. In practice, such schedules are used together with optimizers like SGD with momentum, Nesterov momentum, and Adam. The core claim of this paper is that the same effect can be achieved by increasing the batch size during training while keeping the learning rate constant throughout. In addition, the paper argues that this approach reduces the number of parameter updates required to reach the minimum, which allows greater parallelism and shorter training times.<br />
<br />
== INTRODUCTION ==<br />
Although stochastic gradient descent (SGD) is widely used to train deep networks because it finds minima that generalize well (Zhang et al., 2016; Wilson et al., 2017), the optimization process is slow. This has motivated researchers (Goyal et al., 2017; Hoffer et al., 2017; You et al., 2017a) to speed up optimization with large-batch training, which takes bigger steps, reduces the number of parameter updates needed to train a model, and can be divided across many machines. <br />
<br />
However, increasing the batch size tends to decrease test set accuracy (Keskar et al., 2016; Goyal et al., 2017). Smith and Le (2017) argued that SGD has a scale of random fluctuations <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, <math>N</math> is the number of training samples, and <math>B</math> is the batch size. They concluded that there is an optimum fluctuation scale <math>g</math> that maximizes test set accuracy and, when <math> B \ll N </math>, a corresponding optimal batch size proportional to the learning rate.<br />
<br />
In this paper, the authors' main goal is to provide evidence that increasing the batch size is quantitatively equivalent to decreasing the learning rate: for the same number of training epochs, both decrease the scale of random fluctuations, but increasing the batch size requires significantly fewer parameter updates. Moreover, the number of parameter updates can be reduced further by increasing the learning rate and scaling <math> B \propto \epsilon </math>, and further still by increasing the momentum coefficient and scaling <math> B \propto \frac{1}{1-m} </math>, although the latter decreases the test accuracy. This is demonstrated by experiments on CIFAR-10 using a wide ResNet and on ImageNet using the Inception-ResNet-V2 and ResNet-50 architectures.<br />
<br />
== STOCHASTIC GRADIENT DESCENT AND CONVEX OPTIMIZATION ==<br />
As mentioned in the previous section, the drawback of SGD compared to full-batch training is the noise it introduces, which hinders optimization. According to Robbins & Monro (1951), two conditions on the learning rate govern convergence to the minimum of a convex function (<math> \epsilon_i </math> denotes the learning rate at the <math> i^{th} </math> gradient update):<br />
<br />
<math> \sum_{i=1}^{\infty} \epsilon_i = \infty </math>. This condition ensures that we can reach the minimum no matter how far away the parameters are initialized.<br />
<br />
<math> \sum_{i=1}^{\infty} \epsilon^2_i < \infty </math>. This condition, which is valid only for a fixed batch size, ensures that the learning rate decays fast enough for us to converge to the minimum rather than bouncing around it due to gradient noise.<br />
<br />
These conditions indicate that the learning rate must decay during training, but the second condition applies only when the batch size is constant. To handle a changing batch size, Smith and Le (2017) proposed interpreting SGD as integrating the stochastic differential equation <math> \frac{dw}{dt} = -\frac{dC}{dw} + \eta(t) </math>, where <math>C</math> is the cost function, <math>w</math> represents the parameters, and <math>\eta(t)</math> is Gaussian random noise. They showed that the magnitude of the random fluctuations in the training dynamics is controlled by the noise scale <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, <math>N</math> is the training set size and <math>B</math> is the batch size. Since we usually have <math> B \ll N </math>, we can approximate <math> g \approx \epsilon \frac{N}{B} </math>. This explains why decreasing the learning rate decreases the noise <math>g</math>, enabling us to converge to the minimum of the cost function. Increasing the batch size has the same effect: it makes <math>g</math> decay while the learning rate stays constant. In this work, the batch size is increased until <math> B \approx \frac{N}{10} </math>, after which the conventional approach of decaying the learning rate is followed.<br />
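<br />
The equivalence can be checked numerically with a minimal sketch. The function name <code>noise_scale</code> and the starting values (learning rate 0.1, batch size 128, factor 5) are illustrative assumptions taken from the experiments described later in this summary, not code from the paper.<br />
<br />
```python
def noise_scale(lr, n_train, batch_size):
    """Noise scale g = lr * (N / B - 1) from the formula above."""
    return lr * (n_train / batch_size - 1.0)


N_TRAIN = 50_000  # CIFAR-10 training set size, as used in the experiments

baseline = noise_scale(lr=0.1, n_train=N_TRAIN, batch_size=128)
# Option 1: decay the learning rate by a factor of 5, keep the batch size.
decayed_lr = noise_scale(lr=0.1 / 5, n_train=N_TRAIN, batch_size=128)
# Option 2: keep the learning rate, increase the batch size by a factor of 5.
bigger_batch = noise_scale(lr=0.1, n_train=N_TRAIN, batch_size=128 * 5)

print(f"baseline g     = {baseline:.4f}")
print(f"decayed lr g   = {decayed_lr:.4f}")
print(f"bigger batch g = {bigger_batch:.4f}")
```
<br />
Both options reduce the noise scale to within about one percent of each other. The small gap comes from the <math>-1</math> term, which vanishes in the approximation <math> g \approx \epsilon \frac{N}{B} </math>.<br />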
<br />
== SIMULATED ANNEALING AND THE GENERALIZATION GAP ==<br />
'''Simulated Annealing:''' Introducing random noise or fluctuations whose scale falls during training.<br />
<br />
'''Generalization Gap:''' Models trained with small batches generalize better to the test set than models trained with large batches.<br />
<br />
Smith and Le (2017) found that there is an optimal noise scale <math> g \approx \epsilon \frac{N}{B} </math> that maximizes test set accuracy, and concluded that the corresponding optimal batch size satisfies <math> B_{opt} \propto \epsilon N </math>. This means that gradient noise is helpful, as it lets SGD escape sharp minima, which do not generalize well. <br />
<br />
Simulated annealing is a well-known technique in non-convex optimization. Starting with a high level of noise in the training process lets us explore a wide range of parameters; once we are near the optimum, the noise is reduced to fine-tune the final parameters. However, researchers increasingly favor sharp decay schedules such as cosine decay or step-function drops. In the physical sciences, slowly annealing (or decaying) the temperature, which plays the role of the noise scale here, helps the system converge to the global minimum, which is sharp. Decaying the temperature in discrete steps, by contrast, can leave the system stuck in a local minimum with higher cost and lower curvature. The authors suggest that a similar intuition holds in deep learning.<br />
<br />
== THE EFFECTIVE LEARNING RATE AND THE ACCUMULATION VARIABLE ==<br />
'''The Effective Learning Rate''' : <math> \epsilon_{eff} = \frac{\epsilon}{1-m} </math><br />
<br />
Smith and Le (2017) extended the vanilla SGD noise scale defined above to include momentum: <math> g = \frac{\epsilon}{1-m}(\frac{N}{B}-1)\approx \frac{\epsilon N}{B(1-m)} </math>, which reduces to the previous expression as <math>m</math> goes to 0. The present paper finds that increasing the learning rate and momentum coefficient while scaling <math> B \propto \frac{\epsilon }{1-m} </math> reduces the number of parameter updates, but that the test accuracy decreases when the momentum coefficient is increased. <br />
<br />
To understand the reasons behind this, we need to analyze the momentum update equations below:<br />
<br />
<center><math><br />
\Delta A = -(1-m)A + \frac{d\widehat{C}}{dw} <br />
</math><br />
<br />
<math><br />
\Delta w = -A\epsilon<br />
</math><br />
</center><br />
<br />
The accumulation variable <math>A</math> is initialized to 0 and then grows exponentially toward its steady state value over a timescale of roughly <math> \frac{B}{N(1-m)} </math> training epochs. During this time <math> \Delta w </math> is suppressed, which can reduce the rate of convergence. Moreover, high momentum raises four challenges:<br />
<br />
1- Additional epochs are needed to compensate for the initial suppression while the accumulation catches up.<br />
<br />
2- The accumulation needs more time, on the order of <math> \frac{B}{N(1-m)} </math> epochs, to forget old gradients. <br />
<br />
3- Once this timescale becomes long, the accumulation cannot adapt quickly to changes in the loss landscape.<br />
<br />
4- In the early stage of training, a large batch size combined with high momentum can lead to instabilities.<br />
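<br />
The warm-up of the accumulation variable can be illustrated with a small simulation. This is a sketch under a strong simplifying assumption, a constant unit gradient, which is not how real training behaves. The helper names <code>accumulation_trace</code> and <code>updates_to_reach</code> are hypothetical.<br />
<br />
```python
def accumulation_trace(momentum, grad=1.0, steps=200):
    """Iterate A <- A + dA with dA = -(1 - m) * A + grad, starting from A = 0."""
    a = 0.0
    trace = []
    for _ in range(steps):
        a += -(1.0 - momentum) * a + grad
        trace.append(a)
    return trace


def updates_to_reach(trace, steady_state, fraction=0.9):
    """Number of updates until A first reaches `fraction` of its steady state."""
    for i, a in enumerate(trace, start=1):
        if a >= fraction * steady_state:
            return i
    return None


for m in (0.9, 0.98):
    steady = 1.0 / (1.0 - m)  # fixed point of the update for grad = 1
    trace = accumulation_trace(m)
    print(f"m={m}: steady state {steady:.1f}, "
          f"updates to reach 90%: {updates_to_reach(trace, steady)}")
```
<br />
Under this assumption, raising the momentum from 0.9 to 0.98 makes the accumulation take roughly five times as many updates to approach its steady state, in line with the <math> \frac{1}{1-m} </math> dependence of the timescale. Since one epoch contains <math> \frac{N}{B} </math> updates, this corresponds to the <math> \frac{B}{N(1-m)} </math> epochs quoted above.<br />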
<br />
== EXPERIMENTS ==<br />
=== SIMULATED ANNEALING IN A WIDE RESNET ===<br />
<br />
'''Dataset:''' CIFAR-10 (50,000 training images)<br />
<br />
'''Network Architecture:''' “16-4” wide ResNet<br />
<br />
'''Training Schedules used as in the below figure:''' <br />
<br />
- Decaying learning rate: learning rate decays by a factor of 5 at a sequence of “steps”, and the batch size is constant<br />
<br />
- Increasing batch size: learning rate is constant, and the batch size is increased by a factor of 5 at every step.<br />
<br />
- Hybrid: At the first step, the learning rate is held constant and the batch size is increased by a factor of 5. At each subsequent step, the learning rate decays by a factor of 5 and the batch size is held constant. This schedule is useful when hardware limits the maximum batch size.<br />
<br />
[[File:Paper_40_Fig_1.png | 800px|center]]<br />
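<br />
The three schedules above can be sketched as a small generator of (learning rate, batch size) pairs. The function <code>build_schedule</code>, its argument names, and the defaults (initial learning rate 0.1, initial batch size 128, three steps) are illustrative assumptions, not the authors' code.<br />
<br />
```python
def build_schedule(kind, lr0=0.1, batch0=128, factor=5, n_steps=3, batch_steps=1):
    """Return a list of (learning_rate, batch_size) pairs, one per phase.

    kind is "decay_lr", "increase_batch", or "hybrid". For "hybrid", the
    first `batch_steps` steps grow the batch size and the later steps
    decay the learning rate.
    """
    if kind not in ("decay_lr", "increase_batch", "hybrid"):
        raise ValueError(f"unknown schedule kind: {kind}")
    lr, batch = lr0, batch0
    phases = [(lr, batch)]
    for step in range(n_steps):
        grow_batch = kind == "increase_batch" or (
            kind == "hybrid" and step < batch_steps
        )
        if grow_batch:
            batch *= factor
        else:
            lr /= factor
        phases.append((lr, batch))
    return phases


for kind in ("decay_lr", "increase_batch", "hybrid"):
    phases = build_schedule(kind)
    # lr / batch is proportional to the approximate noise scale lr * N / B.
    ratios = [lr / batch for lr, batch in phases]
    print(kind, phases, ratios)
```
<br />
In every phase the ratio of learning rate to batch size is the same for all three schedules, so the approximate noise scale follows the same trajectory. Only the number of parameter updates per epoch differs.<br />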
<br />
As shown in the figure below, the three training set learning curves in the left panel (2a) are nearly identical, while panel 2b shows that increasing the batch size greatly reduces the number of parameter updates.<br />
This suggests that it is the noise scale, not the learning rate itself, that needs to be decayed.<br />
[[File:Paper_40_Fig_2.png | 800px|center]] <br />
<br />
To confirm that these results also hold on the test set, figure 3 shows that the three test set learning curves are again nearly identical for both SGD with momentum and Nesterov momentum.<br />
[[File:Paper_40_Fig_3.png | 800px|center]]<br />
<br />
To check other optimizers as well, the figure below repeats the experiment of figure 3, showing the three test set learning curves for vanilla SGD and Adam.<br />
[[File:Paper_40_Fig_4.png | 800px|center]]<br />
<br />
'''Conclusion:''' Decreasing the learning rate and increasing the batch size during training are equivalent.<br />
<br />
=== INCREASING THE EFFECTIVE LEARNING RATE===<br />
<br />
'''Dataset:''' CIFAR-10 (50,000 training images)<br />
<br />
'''Network Architecture:''' “16-4” wide ResNet<br />
<br />
'''Training Parameters:''' Optimization Algorithm: SGD with momentum / Maximum batch size = 5120<br />
<br />
'''Training Schedules:''' <br />
<br />
Four training schedules, all of which decay the noise scale by a factor of five in a series of three steps with the same number of epochs.<br />
<br />
Original training schedule: initial learning rate of 0.1 which decays by a factor of 5 at each step, a momentum coefficient of 0.9, and a batch size of 128. <br />
<br />
Increasing batch size: learning rate of 0.1, momentum coefficient of 0.9, initial batch size of 128 that increases by a factor of 5 at each step. <br />
<br />
Increased initial learning rate: initial learning rate of 0.5, initial batch size of 640 that increases during training.<br />
<br />
Increased momentum coefficient: increased initial learning rate of 0.5, initial batch size of 3200 that increases during training, and an increased momentum coefficient of 0.98.<br />
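<br />
The initial settings of the four schedules above were chosen so that they share the same starting noise scale. The sketch below checks this with the approximate formula <math> g \approx \frac{\epsilon N}{B(1-m)} </math>. The helper <code>approx_noise_scale</code> is hypothetical, and the momentum of 0.9 for the "increased initial learning rate" schedule is an assumption, since the description above does not state it.<br />
<br />
```python
N_TRAIN = 50_000  # CIFAR-10 training set size

# Initial (learning rate, batch size, momentum) of each schedule.
initial_settings = {
    "original": (0.1, 128, 0.9),
    "increasing batch size": (0.1, 128, 0.9),
    "increased initial learning rate": (0.5, 640, 0.9),  # momentum assumed
    "increased momentum coefficient": (0.5, 3200, 0.98),
}


def approx_noise_scale(lr, batch_size, momentum, n_train=N_TRAIN):
    """Approximate noise scale g = lr * N / (B * (1 - m)), valid for B << N."""
    return lr * n_train / (batch_size * (1.0 - momentum))


scales = {
    name: approx_noise_scale(lr, batch, m)
    for name, (lr, batch, m) in initial_settings.items()
}
for name, g in scales.items():
    print(f"{name}: g = {g:.3f}")
```
<br />
All four schedules start from the same approximate noise scale, so the larger learning rate and momentum are exactly offset by the larger initial batch size.<br />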
<br />
The results of all training schedules, which are presented in the below figure, are documented in the following table:<br />
<br />
[[File:Paper_40_Table_1.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_5.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the effective learning rate and scaling the batch size accordingly results in a further reduction in the number of parameter updates.<br />
<br />
=== TRAINING IMAGENET IN 2500 PARAMETER UPDATES===<br />
<br />
'''A) Experiment Goal:''' Control Batch Size<br />
<br />
'''Dataset:''' ImageNet (1.28 million training images)<br />
<br />
The paper modified the setup of Goyal et al. (2017), and used the following configuration:<br />
<br />
'''Network Architecture:''' Inception-ResNet-V2 <br />
<br />
'''Training Parameters:''' <br />
<br />
90 epochs / noise decayed at epoch 30, 60, and 80 by a factor of 10 / Initial ghost batch size = 32 / Learning rate = 3 / momentum coefficient = 0.9 / Initial batch size = 8192<br />
<br />
Two training schedules were used:<br />
<br />
“Decaying learning rate”, where batch size is fixed and the learning rate is decayed<br />
<br />
“Increasing batch size”, where the batch size is increased to 81920 at the first step and the learning rate is then decayed at the following two steps.<br />
<br />
[[File:Paper_40_Table_2.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_6.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the batch size resulted in reducing the number of parameter updates from 14,000 to 6,000.<br />
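<br />
The reduction in parameter updates can be reproduced approximately from the configuration above. This is a back-of-the-envelope sketch: it assumes 1.28 million training images, that the batch size jumps from 8192 to 81920 at epoch 30 and stays there, and it ignores partial batches. The function <code>parameter_updates</code> is a hypothetical helper.<br />
<br />
```python
def parameter_updates(phases, n_train):
    """Total updates for a list of (epochs, batch_size) phases."""
    return sum(epochs * n_train / batch_size for epochs, batch_size in phases)


N_TRAIN = 1_280_000  # approximate ImageNet training set size

# Fixed batch size of 8192 for all 90 epochs (decaying learning rate).
fixed_batch = parameter_updates([(90, 8192)], N_TRAIN)
# Batch size 8192 for 30 epochs, then 81920 for the remaining 60 epochs.
increasing_batch = parameter_updates([(30, 8192), (60, 81920)], N_TRAIN)

print(f"decaying learning rate: {fixed_batch:.0f} updates")
print(f"increasing batch size:  {increasing_batch:.0f} updates")
```
<br />
Under these assumptions the estimates come out at roughly 14,000 and somewhat under 6,000 updates, consistent with the figures quoted in the conclusion above.<br />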
<br />
'''B) Experiment Goal:''' Control Batch Size and Momentum Coefficient<br />
<br />
'''Training Parameters:''' Ghost batch size = 64 / noise decayed at epoch 30, 60, and 80 by a factor of 10. <br />
<br />
The table below shows the number of parameter updates and the accuracy for different sets of training parameters:<br />
<br />
[[File:Paper_40_Table_3.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_7.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the momentum reduces the number of parameter updates, but leads to a drop in the test accuracy.<br />
<br />
=== TRAINING IMAGENET IN 30 MINUTES===<br />
<br />
'''Dataset:''' ImageNet (Already introduced in the previous section)<br />
<br />
'''Network Architecture:''' ResNet-50<br />
<br />
The paper replicated the setup of Goyal et al. (2017) while varying the number of TPU devices, the batch size, and the learning rate. For each configuration it measured the time to complete 90 epochs and the resulting accuracy, as shown in the table below:<br />
<br />
[[File:Paper_40_Table_4.png | 800px|center]]<br />
<br />
'''Conclusion:''' Model training times can be reduced by increasing the batch size during training.<br />
<br />
== RELATED WORK ==<br />
Main related work mentioned in the paper is as follows:<br />
<br />
- Smith & Le (2017) interpreted stochastic gradient descent as a stochastic differential equation. The paper builds on this idea to cover decaying learning rates.<br />
<br />
- Mandt et al. (2017) analyzed how SGD performs in Bayesian posterior sampling.<br />
<br />
- Keskar et al. (2016) focused on the analysis of noise once the training is started.<br />
<br />
- Goyal et al. (2017) observed the proportional relationship between batch size and learning rate and used it to train ResNet-50 on ImageNet in one hour.<br />
<br />
- You et al. (2017a) presented Layer-wise Adaptive Rate Scaling (LARS), which applies different learning rates to different layers, and trained ImageNet in 14 minutes with 74.9% accuracy. <br />
<br />
- Finally, another strategy, asynchronous SGD, allowed Recht et al. (2011) and Dean et al. (2012) to use multiple GPUs even with small batch sizes.<br />
<br />
== CONCLUSIONS ==<br />
Increasing the batch size during training has the same benefits as decaying the learning rate, and in addition reduces the number of parameter updates, which corresponds to shorter training time. Experiments were performed on different image datasets and with various optimizers and training schedules to support this result. The paper also proposes increasing the learning rate and the momentum parameter <math>m</math> while scaling <math> B \propto \frac{\epsilon}{1-m} </math>, which requires fewer parameter updates but gives slightly lower test set accuracy, as detailed in the experiments section. In summary, on the ImageNet dataset, Inception-ResNet-V2 achieved 77% validation accuracy in under 2500 parameter updates, and ResNet-50 achieved 76.1% validation set accuracy on TPU in less than 30 minutes. A notable practical point is that training parameters from the literature were reused, with no additional hyperparameter tuning.<br />
<br />
== CRITIQUE ==<br />
'''Pros:'''<br />
<br />
- The paper showed empirically that increasing batch size and decaying learning rate are equivalent.<br />
<br />
- Several experiments were performed on different optimizers such as SGD and Adam.<br />
<br />
- The paper includes several comparisons with previous experimental setups.<br />
<br />
'''Cons:'''<br />
<br />
- All datasets used are image datasets. Other experiments should have been done on datasets from different domains to ensure generalization. <br />
<br />
- The number of parameter updates was used as a comparison criterion, but wall-clock times could have provided additional measurable judgment although they depend on the hardware used.<br />
<br />
- Special hardware is needed for large batch training, which is not always feasible. As batch-size increases, we generally need more RAM to train the same model. However, if learning rate is decreased, the RAM use remains constant. As a result, learning rate decay will allow us to train bigger models.<br />
<br />
- In section 5.2 (Increasing the Effective Learning rate), the authors did not test a range of learning rate values and used only (0.1 and 0.5). Additional results from varying the initial learning rate values from 0.1 to 3.2 are provided in the appendix, which indicates that the test accuracy begins to fall for initial learning rates greater than ~0.4. The appended results do not show validation set accuracy curves like in Figure 6, however. It would be beneficial to see if they were similar to the original 0.1 and 0.5 initial learning rate baselines.<br />
<br />
- Although the main idea of the paper is interesting, its results do not seem very surprising in comparison with other recent papers on the subject.<br />
<br />
- The paper could benefit from using some other models to demonstrate its claim and generalize its idea by adding some comparisons with other models as well as other recent methods to increase batch size.<br />
<br />
- The paper presents interesting ideas, but it lacks mathematical and theoretical analysis beyond the core idea. Since the experiments are primarily on image datasets and the paper does not provide sufficient theory, its applicability to other types of data is limited. <br />
<br />
== REFERENCES ==<br />
- Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.<br />
<br />
- Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086, 2016.<br />
<br />
- Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.<br />
<br />
- Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012.<br />
<br />
- Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.<br />
<br />
- Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.<br />
<br />
- Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231, 2012.<br />
<br />
- Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.<br />
<br />
- Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.<br />
<br />
- Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.<br />
<br />
- Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.<br />
<br />
- Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. ACM, 2017.<br />
<br />
- Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.<br />
<br />
- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.<br />
<br />
- Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.<br />
<br />
- Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. arXiv preprint arXiv:1511.06251, 2017.<br />
<br />
- Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with restarts. arXiv preprint arXiv:1608.03983, 2016.<br />
<br />
- Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.<br />
<br />
- James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.<br />
<br />
- Yurii Nesterov. A method of solving a convex programming problem with convergence rate <math>O(1/k^2)</math>. In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.<br />
<br />
- Lutz Prechelt. Early stopping-but when? Neural Networks: Tricks of the trade, pp. 553–553, 1998.<br />
<br />
- Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pp. 693–701, 2011.<br />
<br />
- Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.<br />
<br />
- Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.<br />
<br />
- Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pp. 4278–4284, 2017.<br />
<br />
- Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.<br />
<br />
- Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.<br />
<br />
- Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017a.<br />
<br />
- Yang You, Zhao Zhang, C Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes. CoRR, abs/1709.05011, 2017b.<br />
<br />
- Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.<br />
<br />
- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DON%27T_DECAY_THE_LEARNING_RATE_,_INCREASE_THE_BATCH_SIZE&diff=41983DON'T DECAY THE LEARNING RATE , INCREASE THE BATCH SIZE2018-11-30T00:44:50Z<p>Z43ma: </p>
<hr />
<div>Summary of the ICLR 2018 paper: '''Don't Decay the Learning Rate, Increase the Batch Size''' <br />
<br />
Link: [https://arxiv.org/pdf/1711.00489.pdf]<br />
<br />
Summarized by: Afify, Ahmed [ID: 20700841]<br />
<br />
==INTUITION==<br />
It is now common practice not to use a single fixed learning rate throughout the training of a neural network. Instead, the learning rate is adapted as gradient descent proceeds. The intuition is that when we are far from a minimum, large steps are beneficial because fewer of them are needed to converge, but as we approach the minimum the step size should shrink, otherwise we may keep oscillating around it. In practice, such schedules are used together with optimizers like SGD with momentum, Nesterov momentum, and Adam. The core claim of this paper is that the same effect can be achieved by increasing the batch size during training while keeping the learning rate constant throughout. In addition, the paper argues that this approach reduces the number of parameter updates required to reach the minimum, which allows greater parallelism and shorter training times.<br />
<br />
== INTRODUCTION ==<br />
Although stochastic gradient descent (SGD) is widely used to train deep networks because it finds minima that generalize well (Zhang et al., 2016; Wilson et al., 2017), the optimization process is slow. This has motivated researchers (Goyal et al., 2017; Hoffer et al., 2017; You et al., 2017a) to speed up optimization with large-batch training, which takes bigger steps, reduces the number of parameter updates needed to train a model, and can be divided across many machines. <br />
<br />
However, increasing the batch size tends to decrease test set accuracy (Keskar et al., 2016; Goyal et al., 2017). Smith and Le (2017) argued that SGD has a scale of random fluctuations <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, <math>N</math> is the number of training samples, and <math>B</math> is the batch size. They concluded that there is an optimum fluctuation scale <math>g</math> that maximizes test set accuracy and, when <math> B \ll N </math>, a corresponding optimal batch size proportional to the learning rate.<br />
<br />
In this paper, the authors' main goal is to provide evidence that increasing the batch size is quantitatively equivalent to decreasing the learning rate: for the same number of training epochs, both decrease the scale of random fluctuations, but increasing the batch size requires significantly fewer parameter updates. Moreover, the number of parameter updates can be reduced further by increasing the learning rate and scaling <math> B \propto \epsilon </math>, and further still by increasing the momentum coefficient and scaling <math> B \propto \frac{1}{1-m} </math>, although the latter decreases the test accuracy. This is demonstrated by experiments on CIFAR-10 using a wide ResNet and on ImageNet using the Inception-ResNet-V2 and ResNet-50 architectures.<br />
<br />
== STOCHASTIC GRADIENT DESCENT AND CONVEX OPTIMIZATION ==<br />
As mentioned in the previous section, the drawback of SGD compared to full-batch training is the noise it introduces, which hinders optimization. According to Robbins & Monro (1951), two conditions on the learning rate govern convergence to the minimum of a convex function (<math> \epsilon_i </math> denotes the learning rate at the <math> i^{th} </math> gradient update):<br />
<br />
<math> \sum_{i=1}^{\infty} \epsilon_i = \infty </math>. This condition ensures that we can reach the minimum no matter how far away the parameters are initialized.<br />
<br />
<math> \sum_{i=1}^{\infty} \epsilon^2_i < \infty </math>. This condition, which is valid only for a fixed batch size, ensures that the learning rate decays fast enough for us to converge to the minimum rather than bouncing around it due to gradient noise.<br />
<br />
These conditions indicate that the learning rate must decay during training, but the second condition applies only when the batch size is constant. To handle a changing batch size, Smith and Le (2017) proposed interpreting SGD as integrating the stochastic differential equation <math> \frac{dw}{dt} = -\frac{dC}{dw} + \eta(t) </math>, where <math>C</math> is the cost function, <math>w</math> represents the parameters, and <math>\eta(t)</math> is Gaussian random noise. They showed that the magnitude of the random fluctuations in the training dynamics is controlled by the noise scale <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, <math>N</math> is the training set size and <math>B</math> is the batch size. Since we usually have <math> B \ll N </math>, we can approximate <math> g \approx \epsilon \frac{N}{B} </math>. This explains why decreasing the learning rate decreases the noise <math>g</math>, enabling us to converge to the minimum of the cost function. Increasing the batch size has the same effect: it makes <math>g</math> decay while the learning rate stays constant. In this work, the batch size is increased until <math> B \approx \frac{N}{10} </math>, after which the conventional approach of decaying the learning rate is followed.<br />
<br />
== SIMULATED ANNEALING AND THE GENERALIZATION GAP ==<br />
'''Simulated Annealing:''' Introducing random noise or fluctuations whose scale falls during training.<br />
<br />
'''Generalization Gap:''' Models trained with small batches generalize better to the test set than models trained with large batches.<br />
<br />
Smith and Le (2017) found that there is an optimal noise scale <math> g \approx \epsilon \frac{N}{B} </math> that maximizes test set accuracy, and concluded that the corresponding optimal batch size satisfies <math> B_{opt} \propto \epsilon N </math>. This means that gradient noise is helpful, as it lets SGD escape sharp minima, which do not generalize well. <br />
<br />
Simulated annealing is a well-known technique in non-convex optimization. Starting with a high level of noise in the training process lets us explore a wide range of parameters; once we are near the optimum, the noise is reduced to fine-tune the final parameters. However, researchers increasingly favor sharp decay schedules such as cosine decay or step-function drops. In the physical sciences, slowly annealing (or decaying) the temperature, which plays the role of the noise scale here, helps the system converge to the global minimum, which is sharp. Decaying the temperature in discrete steps, by contrast, can leave the system stuck in a local minimum with higher cost and lower curvature. The authors suggest that a similar intuition holds in deep learning.<br />
<br />
== THE EFFECTIVE LEARNING RATE AND THE ACCUMULATION VARIABLE ==<br />
'''The Effective Learning Rate''' : <math> \epsilon_{eff} = \frac{\epsilon}{1-m} </math><br />
<br />
Smith and Le (2017) extended the vanilla SGD noise scale defined above to include momentum: <math> g = \frac{\epsilon}{1-m}(\frac{N}{B}-1)\approx \frac{\epsilon N}{B(1-m)} </math>, which reduces to the previous equation as m goes to 0. They found that increasing the learning rate and momentum coefficient while scaling <math> B \propto \frac{\epsilon }{1-m} </math> reduces the number of parameter updates, but the test accuracy decreases when the momentum coefficient is increased. <br />
<br />
To understand the reasons behind this, we need to analyze the momentum update equations below:<br />
<br />
<center><math><br />
\Delta A = -(1-m)A + \frac{d\widehat{C}}{dw} <br />
</math><br />
<br />
<math><br />
\Delta w = -A\epsilon<br />
</math><br />
</center><br />
<br />
We can see that the accumulation variable A, which is initially set to 0, grows exponentially towards its steady-state value over <math> \frac{B}{N(1-m)} </math> training epochs. During this time <math> \Delta w </math> is suppressed, which can reduce the rate of convergence. Moreover, at high momentum, we face four challenges:<br />
<br />
1- Additional epochs are needed to catch up with the accumulation.<br />
<br />
2- Accumulation needs more time <math> \frac{B}{N(1-m)} </math> to forget old gradients. <br />
<br />
3- After this time, however, the accumulation cannot adapt to changes in the loss landscape.<br />
<br />
4- In the early stage, a large batch size will lead to instabilities.<br />
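<br />
The warm-up of the accumulation variable can be illustrated with a small simulation. This is a sketch under a simplifying assumption that is not in the paper: the gradient estimate is held constant at 1, so only the dynamics of A are visible. The helper names are hypothetical. In units of parameter updates the timescale is roughly 1/(1-m), which corresponds to <math> \frac{B}{N(1-m)} </math> epochs.<br />
<br />
```python
def accumulate(m, grad=1.0, steps=200):
    """Iterate Delta A = -(1 - m) * A + grad, starting from A = 0."""
    a, history = 0.0, []
    for _ in range(steps):
        a += -(1.0 - m) * a + grad
        history.append(a)
    return history


def steps_to_reach(history, target, fraction=0.9):
    """First update (1-based) at which A reaches `fraction` of `target`."""
    for i, a in enumerate(history, start=1):
        if a >= fraction * target:
            return i
    return None


for m in (0.9, 0.98):
    steady_state = 1.0 / (1.0 - m)  # value of A at which Delta A = 0 for grad = 1
    n = steps_to_reach(accumulate(m), steady_state)
    print(f"m = {m}: steady state {steady_state:.1f}, 90% reached after {n} updates")
```
<br />
Raising m from 0.9 to 0.98 makes the warm-up several times longer, which is consistent with the challenges listed above.<br />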
<br />
== EXPERIMENTS ==<br />
=== SIMULATED ANNEALING IN A WIDE RESNET ===<br />
<br />
'''Dataset:''' CIFAR-10 (50,000 training images)<br />
<br />
'''Network Architecture:''' “16-4” wide ResNet<br />
<br />
'''Training schedules (shown in the figure below):''' <br />
<br />
- Decaying learning rate: learning rate decays by a factor of 5 at a sequence of “steps”, and the batch size is constant<br />
<br />
- Increasing batch size: learning rate is constant, and the batch size is increased by a factor of 5 at every step.<br />
<br />
- Hybrid: At the first step, the learning rate is kept constant and the batch size is increased by a factor of 5. At each subsequent step, the learning rate decays by a factor of 5 and the batch size is kept constant. This schedule is useful when hardware imposes a maximum batch size.<br />
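<br />
The three schedules can be written as a small function. This is a minimal sketch: the function name, the `KINDS` tuple, and the starting values (learning rate 0.1, batch size 128, borrowed from the original schedule described in the next experiment) are assumptions for illustration. At every step, all three schedules share the same ratio of learning rate to batch size, and therefore approximately the same noise scale.<br />
<br />
```python
KINDS = ("decaying_lr", "increasing_batch", "hybrid")


def schedule(kind, step, lr0=0.1, batch0=128, factor=5):
    """Return (learning_rate, batch_size) after `step` noise-decay steps."""
    if kind == "decaying_lr":
        return lr0 / factor**step, batch0
    if kind == "increasing_batch":
        return lr0, batch0 * factor**step
    if kind == "hybrid":
        if step == 0:
            return lr0, batch0
        # Grow the batch once, then fall back to learning-rate decay.
        return lr0 / factor**(step - 1), batch0 * factor
    raise ValueError(f"unknown schedule: {kind}")


for step in range(4):
    for kind in KINDS:
        lr, batch = schedule(kind, step)
        print(f"step {step} {kind:>16}: lr={lr:.4g} B={batch} lr/B={lr / batch:.3g}")
```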
<br />
[[File:Paper_40_Fig_1.png | 800px|center]]<br />
<br />
As shown in the figure below, the three training-set learning curves in figure 2a are the same, while figure 2b shows that increasing the batch size greatly reduces the number of parameter updates.<br />
This suggests that it is the noise scale that needs to be decayed, not the learning rate itself.<br />
[[File:Paper_40_Fig_2.png | 800px|center]] <br />
<br />
To confirm that these results also hold for the test set, figure 3 shows that the three learning curves are again the same for SGD with momentum and for Nesterov momentum.<br />
[[File:Paper_40_Fig_3.png | 800px|center]]<br />
<br />
To check other optimizers as well, the figure below shows the same experiment as in figure 3, i.e. the three test-set learning curves, but for vanilla SGD and Adam.<br />
[[File:Paper_40_Fig_4.png | 800px|center]]<br />
<br />
'''Conclusion:''' Decreasing the learning rate and increasing the batch size during training are equivalent.<br />
<br />
=== INCREASING THE EFFECTIVE LEARNING RATE===<br />
<br />
'''Dataset:''' CIFAR-10 (50,000 training images)<br />
<br />
'''Network Architecture:''' “16-4” wide ResNet<br />
<br />
'''Training Parameters:''' Optimization Algorithm: SGD with momentum / Maximum batch size = 5120<br />
<br />
'''Training Schedules:''' <br />
<br />
Four training schedules, all of which decay the noise scale by a factor of five in a series of three steps with the same number of epochs.<br />
<br />
Original training schedule: initial learning rate of 0.1 which decays by a factor of 5 at each step, a momentum coefficient of 0.9, and a batch size of 128. <br />
<br />
Increasing batch size: learning rate of 0.1, momentum coefficient of 0.9, initial batch size of 128 that increases by a factor of 5 at each step. <br />
<br />
Increased initial learning rate: initial learning rate of 0.5, initial batch size of 640 that increases during training.<br />
<br />
Increased momentum coefficient: increased initial learning rate of 0.5, initial batch size of 3200 that increases during training, and an increased momentum coefficient of 0.98.<br />
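<br />
A quick check suggests that all four schedules start from the same noise scale, using <math> g \approx \frac{\epsilon N}{B(1-m)} </math>. This is a sketch: the helper name is hypothetical, and a momentum of 0.9 is assumed for the "increased initial learning rate" schedule, since the summary does not state it.<br />
<br />
```python
import math


def relative_noise(lr, batch_size, momentum):
    """Quantity proportional to g ~ lr * N / (B * (1 - m)); N is shared, so it is dropped."""
    return lr / (batch_size * (1.0 - momentum))


# (learning rate, initial batch size, momentum) for each schedule.
initial_settings = {
    "original": (0.1, 128, 0.9),
    "increasing batch size": (0.1, 128, 0.9),
    "increased initial learning rate": (0.5, 640, 0.9),  # momentum assumed
    "increased momentum coefficient": (0.5, 3200, 0.98),
}

scales = {name: relative_noise(*s) for name, s in initial_settings.items()}
for name, value in scales.items():
    print(f"{name:>32}: {value:.6f}")
```
<br />
The larger initial batch sizes of 640 and 3200 are exactly what is needed to offset the larger learning rate and momentum.<br />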
<br />
The results of all training schedules are documented in the table and figure below:<br />
<br />
[[File:Paper_40_Table_1.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_5.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the effective learning rate and scaling the batch size accordingly further reduces the number of parameter updates.<br />
<br />
=== TRAINING IMAGENET IN 2500 PARAMETER UPDATES===<br />
<br />
'''A) Experiment Goal:''' Control Batch Size<br />
<br />
'''Dataset:''' ImageNet (1.28 million training images)<br />
<br />
The paper modified the setup of Goyal et al. (2017), and used the following configuration:<br />
<br />
'''Network Architecture:''' Inception-ResNet-V2 <br />
<br />
'''Training Parameters:''' <br />
<br />
90 epochs / noise decayed at epoch 30, 60, and 80 by a factor of 10 / Initial ghost batch size = 32 / Learning rate = 3 / momentum coefficient = 0.9 / Initial batch size = 8192<br />
<br />
Two training schedules were used:<br />
<br />
“Decaying learning rate”, where batch size is fixed and the learning rate is decayed<br />
<br />
“Increasing batch size”, where the batch size is first increased to 81920 and the learning rate is then decayed at the two remaining steps.<br />
<br />
[[File:Paper_40_Table_2.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_6.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the batch size resulted in reducing the number of parameter updates from 14,000 to 6,000.<br />
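<br />
These counts can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below rests on assumptions not spelled out in this summary: exactly 1,280,000 training images, one parameter update per mini-batch, and a batch size that jumps from 8192 to 81920 at epoch 30 and stays there for the remaining 60 epochs. The function name is hypothetical.<br />
<br />
```python
import math


def parameter_updates(train_size, phases):
    """Total updates over (epochs, batch_size) phases, one update per mini-batch."""
    return sum(epochs * math.ceil(train_size / batch) for epochs, batch in phases)


N = 1_280_000  # approximate ImageNet training set size
fixed = parameter_updates(N, [(90, 8192)])
increasing = parameter_updates(N, [(30, 8192), (60, 81920)])

print(f"fixed batch size:      {fixed} updates")
print(f"increasing batch size: {increasing} updates")
```
<br />
Under these assumptions the totals come out close to the 14,000 and 6,000 updates quoted above.<br />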
<br />
'''B) Experiment Goal:''' Control Batch Size and Momentum Coefficient<br />
<br />
'''Training Parameters:''' Ghost batch size = 64 / noise decayed at epoch 30, 60, and 80 by a factor of 10. <br />
<br />
The table below shows the number of parameter updates and the accuracy for different sets of training parameters:<br />
<br />
[[File:Paper_40_Table_3.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_7.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the momentum reduces the number of parameter updates, but leads to a drop in the test accuracy.<br />
<br />
=== TRAINING IMAGENET IN 30 MINUTES===<br />
<br />
'''Dataset:''' ImageNet (Already introduced in the previous section)<br />
<br />
'''Network Architecture:''' ResNet-50<br />
<br />
The paper replicated the setup of Goyal et al. (2017) while varying the number of TPU devices, the batch size, and the learning rate, then measured the time to complete 90 epochs and the resulting accuracy in the following experiments:<br />
<br />
[[File:Paper_40_Table_4.png | 800px|center]]<br />
<br />
'''Conclusion:''' Model training times can be reduced by increasing the batch size during training.<br />
<br />
== RELATED WORK ==<br />
Main related work mentioned in the paper is as follows:<br />
<br />
- Smith & Le (2017) interpreted stochastic gradient descent as a stochastic differential equation; the paper builds on this idea to include decaying learning rates.<br />
<br />
- Mandt et al. (2017) analyzed how SGD performs in Bayesian posterior sampling.<br />
<br />
- Keskar et al. (2016) focused on the analysis of noise once the training is started.<br />
<br />
- Moreover, the proportional relationship between batch size and learning rate was first discovered by Goyal et al. (2017), who used it to train ResNet-50 on ImageNet in one hour.<br />
<br />
- Furthermore, You et al. (2017a) presented Layer-wise Adaptive Rate Scaling (LARS), which applies different learning rates to different layers, and trained ImageNet in 14 minutes with 74.9% accuracy. <br />
<br />
- Finally, another strategy, asynchronous SGD (Recht et al., 2011; Dean et al., 2012), allows multiple GPUs to be used even with small batch sizes.<br />
<br />
== CONCLUSIONS ==<br />
Increasing the batch size during training has the same benefits as decaying the learning rate while also reducing the number of parameter updates, which corresponds to faster training. Experiments were performed on different image datasets and various optimizers with different training schedules to support this result. The paper also proposed increasing the learning rate and momentum parameter m while scaling <math> B \propto \frac{\epsilon}{1-m} </math>, which achieves fewer parameter updates but slightly lower test set accuracy, as detailed in the experiments section. In summary, on the ImageNet dataset, Inception-ResNet-V2 achieved 77% validation accuracy in under 2500 parameter updates, and ResNet-50 achieved 76.1% validation set accuracy on TPU in less than 30 minutes. A notable aspect of this paper is that hyperparameters from the literature were used, and no hyperparameter tuning was needed.<br />
<br />
== CRITIQUE ==<br />
'''Pros:'''<br />
<br />
- The paper showed empirically that increasing batch size and decaying learning rate are equivalent.<br />
<br />
- Several experiments were performed on different optimizers such as SGD and Adam.<br />
<br />
- The paper includes several comparisons with previous experimental setups.<br />
<br />
'''Cons:'''<br />
<br />
- All datasets used are image datasets. Other experiments should have been done on datasets from different domains to ensure generalization. <br />
<br />
- The number of parameter updates was used as a comparison criterion, but wall-clock times could have provided additional measurable judgment although they depend on the hardware used.<br />
<br />
- Special hardware is needed for large batch training, which is not always feasible. As batch-size increases, we generally need more RAM to train the same model. However, if learning rate is decreased, the RAM use remains constant. As a result, learning rate decay will allow us to train bigger models.<br />
<br />
- In section 5.2 (Increasing the Effective Learning rate), the authors did not test a range of learning rate values and used only (0.1 and 0.5). Additional results from varying the initial learning rate values from 0.1 to 3.2 are provided in the appendix, which indicates that the test accuracy begins to fall for initial learning rates greater than ~0.4. The appended results do not show validation set accuracy curves like in Figure 6, however. It would be beneficial to see if they were similar to the original 0.1 and 0.5 initial learning rate baselines.<br />
<br />
- Although the main idea of the paper is interesting, its results do not seem too surprising in comparison with other recent papers on the subject.<br />
<br />
- The paper could benefit from using some other models to demonstrate its claim and generalize its idea by adding some comparisons with other models as well as other recent methods to increase batch size.<br />
<br />
- The paper presents interesting ideas. However, it lacks mathematical and theoretical analysis beyond the idea. Since the experiments are primarily on image datasets and the paper does not provide sufficient theory, its demonstrated applicability to other types of data is limited. <br />
<br />
== REFERENCES ==<br />
- Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.<br />
<br />
- Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086, 2016.<br />
<br />
- Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.<br />
<br />
- Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012.<br />
<br />
- Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.<br />
<br />
- Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.<br />
<br />
- Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231, 2012.<br />
<br />
- Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.<br />
<br />
- Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.<br />
<br />
- Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.<br />
<br />
- Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.<br />
<br />
- Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. ACM, 2017.<br />
<br />
- Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.<br />
<br />
- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.<br />
<br />
- Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.<br />
<br />
- Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. arXiv preprint arXiv:1511.06251, 2017.<br />
<br />
- Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with restarts. arXiv preprint arXiv:1608.03983, 2016.<br />
<br />
- Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.<br />
<br />
- James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.<br />
<br />
- Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.<br />
<br />
- Lutz Prechelt. Early stopping-but when? Neural Networks: Tricks of the trade, pp. 553–553, 1998.<br />
<br />
- Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pp. 693–701, 2011.<br />
<br />
- Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.<br />
<br />
- Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.<br />
<br />
- Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pp. 4278–4284, 2017.<br />
<br />
- Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.<br />
<br />
- Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.<br />
<br />
- Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017a.<br />
<br />
- Yang You, Zhao Zhang, C Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes. CoRR, abs/1709.05011, 2017b.<br />
<br />
- Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.<br />
<br />
- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DON%27T_DECAY_THE_LEARNING_RATE_,_INCREASE_THE_BATCH_SIZE&diff=41982DON'T DECAY THE LEARNING RATE , INCREASE THE BATCH SIZE2018-11-30T00:40:51Z<p>Z43ma: </p>
<hr />
<div>Summary of the ICLR 2018 paper: '''Don't Decay the learning Rate, Increase the Batch Size ''' <br />
<br />
Link: [https://arxiv.org/pdf/1711.00489.pdf]<br />
<br />
Summarized by: Afify, Ahmed [ID: 20700841]<br />
<br />
==INTUITION==<br />
Nowadays, it is common practice not to use a single fixed learning rate throughout the training of neural network models. Instead, the learning rate is adapted during standard gradient descent. The intuition is that when we are far from the minimum, it is beneficial to take large steps towards it, since fewer steps are then needed to converge; as we approach the minimum, the step size should decrease, otherwise we may keep oscillating around it. In practice, this is generally achieved by methods like SGD with momentum, Nesterov momentum, and Adam. However, the core claim of this paper is that the same effect can be achieved by increasing the batch size during gradient descent while keeping the learning rate constant throughout. In addition, the paper argues that such an approach also reduces the number of parameter updates required to reach the minimum, thus leading to greater parallelism and shorter training times.<br />
<br />
== INTRODUCTION ==<br />
Although stochastic gradient descent (SGD) is widely used in deep learning because it finds minima that generalize well (Zhang et al., 2016; Wilson et al., 2017), the optimization process is slow. According to (Goyal et al., 2017; Hoffer et al., 2017; You et al., 2017a), this has motivated researchers to speed up optimization by taking bigger steps, and hence reduce the number of parameter updates needed to train a model, by using large-batch training, which can be divided across many machines. <br />
<br />
However, increasing the batch size leads to decreasing the test set accuracy (Keskar et al., 2016; Goyal et al., 2017). Smith and Le (2017) argued that SGD has a scale of random fluctuations <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, N is the number of training samples, and B is the batch size. They concluded that, when <math> B \ll N </math>, there is an optimal batch size proportional to the learning rate, and an optimal fluctuation scale g that maximizes test set accuracy.<br />
<br />
In this paper, the authors' main goal is to provide evidence that increasing the batch size is quantitatively equivalent to decreasing the learning rate: for the same number of training epochs, both decrease the scale of random fluctuations, but increasing the batch size requires far fewer parameter updates. Moreover, a further reduction in the number of parameter updates can be attained by increasing the learning rate and scaling <math> B \propto \epsilon </math>, and an even greater reduction by increasing the momentum coefficient and scaling <math> B \propto \frac{1}{1-m} </math>, although the latter decreases the test accuracy. This is demonstrated by several experiments on CIFAR-10 with a wide ResNet and on ImageNet with the Inception-ResNet-V2 and ResNet-50 architectures.<br />
<br />
== STOCHASTIC GRADIENT DESCENT AND CONVEX OPTIMIZATION ==<br />
As mentioned in the previous section, the drawback of SGD compared to full-batch training is the noise it introduces, which hinders optimization. According to Robbins & Monro (1951), two equations govern how to reach the minimum of a convex function (<math> \epsilon_i </math> denotes the learning rate at the <math> i^{th} </math> gradient update):<br />
<br />
<math> \sum_{i=1}^{\infty} \epsilon_i = \infty </math>. This equation guarantees that we will reach the minimum.<br />
<br />
<math> \sum_{i=1}^{\infty} \epsilon^2_i < \infty </math>. This equation, which is valid only for a fixed batch size, guarantees that learning rate decays fast enough allowing us to reach the minimum rather than bouncing due to noise.<br />
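<br />
As a worked illustration (not taken from the paper), consider the hypothetical schedule <math> \epsilon_i = \frac{\epsilon_0}{i} </math> with <math> \epsilon_0 > 0 </math>. It satisfies both equations, whereas a constant learning rate <math> \epsilon_i = \epsilon_0 </math> satisfies only the first, as the last sum shows:<br />
<br />
<center><math><br />
\sum_{i=1}^{\infty} \frac{\epsilon_0}{i} = \infty, \qquad \sum_{i=1}^{\infty} \frac{\epsilon_0^2}{i^2} = \frac{\pi^2}{6}\epsilon_0^2 < \infty, \qquad \sum_{i=1}^{\infty} \epsilon_0^2 = \infty<br />
</math><br />
</center><br />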
<br />
These equations indicate that the learning rate must decay during training, and the second equation holds only when the batch size is constant. To allow the batch size to change, Smith and Le (2017) proposed to interpret SGD as integrating the stochastic differential equation <math> \frac{dw}{dt} = -\frac{dC}{dw} + \eta(t) </math>, where C represents the cost function, w represents the parameters, and η represents Gaussian random noise. Furthermore, they showed that the noise scale g controls the magnitude of random fluctuations in the training dynamics: <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, N is the training set size and B is the batch size. As we usually have <math> B \ll N </math>, we can approximate <math> g \approx \epsilon \frac{N}{B} </math>. This explains why decreasing the learning rate decreases the noise scale g, enabling us to converge to the minimum of the cost function. However, increasing the batch size has the same effect and makes g decay at a constant learning rate. In this work, the batch size is increased until <math> B \approx \frac{N}{10} </math>, after which the conventional approach of decaying the learning rate is followed.<br />
<br />
== SIMULATED ANNEALING AND THE GENERALIZATION GAP ==<br />
'''Simulated Annealing:''' Introducing random noise or fluctuations whose scale falls during training.<br />
<br />
'''Generalization Gap:''' Small batch data generalizes better to the test set than large batch data.<br />
<br />
Smith and Le (2017) found that there is an optimal batch size, corresponding to an optimal noise scale g <math> (g \approx \epsilon \frac{N}{B}) </math>, that maximizes test set accuracy, and concluded that <math> B_{opt} \propto \epsilon N </math>. This means that gradient noise is helpful, as it lets SGD escape sharp minima, which do not generalize well. <br />
<br />
Simulated annealing is a well-known technique in non-convex optimization. Starting with a large amount of noise in the training process helps us explore a wide range of parameters; once we are near the optimum, the noise is reduced to fine-tune the final parameters. However, more and more researchers use sharp decay schedules such as cosine decay or step-function drops. In the physical sciences, slowly annealing (or decaying) the temperature (which plays the role of the noise scale here) helps the system converge to the global minimum, which may be sharp. Decaying the temperature in discrete steps, by contrast, can trap the system in a local minimum with higher cost but lower curvature. The authors suggest that the same intuition holds in deep learning.<br />
<br />
== THE EFFECTIVE LEARNING RATE AND THE ACCUMULATION VARIABLE ==<br />
'''The Effective Learning Rate''' : <math> \epsilon_{eff} = \frac{\epsilon}{1-m} </math><br />
<br />
Smith and Le (2017) extended the vanilla SGD noise scale defined above to include momentum: <math> g = \frac{\epsilon}{1-m}(\frac{N}{B}-1)\approx \frac{\epsilon N}{B(1-m)} </math>, which reduces to the previous equation as m goes to 0. They found that increasing the learning rate and momentum coefficient while scaling <math> B \propto \frac{\epsilon }{1-m} </math> reduces the number of parameter updates, but the test accuracy decreases when the momentum coefficient is increased. <br />
<br />
To understand the reasons behind this, we need to analyze the momentum update equations below:<br />
<br />
<center><math><br />
\Delta A = -(1-m)A + \frac{d\widehat{C}}{dw} <br />
</math><br />
<br />
<math><br />
\Delta w = -A\epsilon<br />
</math><br />
</center><br />
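<br />
A short worked step, assuming the gradient estimate stays roughly constant over the time it takes A to settle, shows where the effective learning rate comes from. Setting <math> \Delta A = 0 </math> gives the steady-state accumulation <math> A^{*} </math>, and substituting it into the weight update yields:<br />
<br />
<center><math><br />
A^{*} = \frac{1}{1-m}\frac{d\widehat{C}}{dw}, \qquad \Delta w = -\epsilon A^{*} = -\frac{\epsilon}{1-m}\frac{d\widehat{C}}{dw} = -\epsilon_{eff}\frac{d\widehat{C}}{dw}<br />
</math><br />
</center><br />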
<br />
We can see that the accumulation variable A, which is initially set to 0, grows exponentially towards its steady-state value over <math> \frac{B}{N(1-m)} </math> training epochs. During this time <math> \Delta w </math> is suppressed, which can reduce the rate of convergence. Moreover, at high momentum, we face four challenges:<br />
<br />
1- Additional epochs are needed to catch up with the accumulation.<br />
<br />
2- Accumulation needs more time <math> \frac{B}{N(1-m)} </math> to forget old gradients. <br />
<br />
3- After this time, however, the accumulation cannot adapt to changes in the loss landscape.<br />
<br />
4- In the early stage, a large batch size will lead to instabilities.<br />
<br />
== EXPERIMENTS ==<br />
=== SIMULATED ANNEALING IN A WIDE RESNET ===<br />
<br />
'''Dataset:''' CIFAR-10 (50,000 training images)<br />
<br />
'''Network Architecture:''' “16-4” wide ResNet<br />
<br />
'''Training schedules (shown in the figure below):''' <br />
<br />
- Decaying learning rate: learning rate decays by a factor of 5 at a sequence of “steps”, and the batch size is constant<br />
<br />
- Increasing batch size: learning rate is constant, and the batch size is increased by a factor of 5 at every step.<br />
<br />
- Hybrid: At the first step, the learning rate is kept constant and the batch size is increased by a factor of 5. At each subsequent step, the learning rate decays by a factor of 5 and the batch size is kept constant. This schedule is useful when hardware imposes a maximum batch size.<br />
<br />
[[File:Paper_40_Fig_1.png | 800px|center]]<br />
<br />
As shown in the figure below, the three training-set learning curves in figure 2a are the same, while figure 2b shows that increasing the batch size greatly reduces the number of parameter updates.<br />
This suggests that it is the noise scale that needs to be decayed, not the learning rate itself.<br />
[[File:Paper_40_Fig_2.png | 800px|center]] <br />
<br />
To confirm that these results also hold for the test set, figure 3 shows that the three learning curves are again the same for SGD with momentum and for Nesterov momentum.<br />
[[File:Paper_40_Fig_3.png | 800px|center]]<br />
<br />
To check other optimizers as well, the figure below shows the same experiment as in figure 3, i.e. the three test-set learning curves, but for vanilla SGD and Adam.<br />
[[File:Paper_40_Fig_4.png | 800px|center]]<br />
<br />
'''Conclusion:''' Decreasing the learning rate and increasing the batch size during training are equivalent.<br />
<br />
=== INCREASING THE EFFECTIVE LEARNING RATE===<br />
<br />
'''Dataset:''' CIFAR-10 (50,000 training images)<br />
<br />
'''Network Architecture:''' “16-4” wide ResNet<br />
<br />
'''Training Parameters:''' Optimization Algorithm: SGD with momentum / Maximum batch size = 5120<br />
<br />
'''Training Schedules:''' <br />
<br />
Four training schedules, all of which decay the noise scale by a factor of five in a series of three steps with the same number of epochs.<br />
<br />
Original training schedule: initial learning rate of 0.1 which decays by a factor of 5 at each step, a momentum coefficient of 0.9, and a batch size of 128. <br />
<br />
Increasing batch size: learning rate of 0.1, momentum coefficient of 0.9, initial batch size of 128 that increases by a factor of 5 at each step. <br />
<br />
Increased initial learning rate: initial learning rate of 0.5, initial batch size of 640 that increases during training.<br />
<br />
Increased momentum coefficient: increased initial learning rate of 0.5, initial batch size of 3200 that increases during training, and an increased momentum coefficient of 0.98.<br />
<br />
The results of all training schedules are documented in the table and figure below:<br />
<br />
[[File:Paper_40_Table_1.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_5.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the effective learning rate and scaling the batch size accordingly further reduces the number of parameter updates.<br />
<br />
=== TRAINING IMAGENET IN 2500 PARAMETER UPDATES===<br />
<br />
'''A) Experiment Goal:''' Control Batch Size<br />
<br />
'''Dataset:''' ImageNet (1.28 million training images)<br />
<br />
The paper modified the setup of Goyal et al. (2017), and used the following configuration:<br />
<br />
'''Network Architecture:''' Inception-ResNet-V2 <br />
<br />
'''Training Parameters:''' <br />
<br />
90 epochs / noise decayed at epoch 30, 60, and 80 by a factor of 10 / Initial ghost batch size = 32 / Learning rate = 3 / momentum coefficient = 0.9 / Initial batch size = 8192<br />
<br />
Two training schedules were used:<br />
<br />
“Decaying learning rate”, where batch size is fixed and the learning rate is decayed<br />
<br />
“Increasing batch size”, where the batch size is first increased to 81920 and the learning rate is then decayed at the two remaining steps.<br />
<br />
[[File:Paper_40_Table_2.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_6.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the batch size resulted in reducing the number of parameter updates from 14,000 to 6,000.<br />
<br />
'''B) Experiment Goal:''' Control Batch Size and Momentum Coefficient<br />
<br />
'''Training Parameters:''' Ghost batch size = 64 / noise decayed at epoch 30, 60, and 80 by a factor of 10. <br />
<br />
The table below shows the number of parameter updates and the accuracy for different sets of training parameters:<br />
<br />
[[File:Paper_40_Table_3.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_7.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the momentum reduces the number of parameter updates, but leads to a drop in the test accuracy.<br />
<br />
=== TRAINING IMAGENET IN 30 MINUTES===<br />
<br />
'''Dataset:''' ImageNet (Already introduced in the previous section)<br />
<br />
'''Network Architecture:''' ResNet-50<br />
<br />
The paper replicated the setup of Goyal et al. (2017) while varying the number of TPU devices, the batch size, and the learning rate, then measured the time to complete 90 epochs and the resulting accuracy in the following experiments:<br />
<br />
[[File:Paper_40_Table_4.png | 800px|center]]<br />
<br />
'''Conclusion:''' Model training times can be reduced by increasing the batch size during training.<br />
<br />
== RELATED WORK ==<br />
Main related work mentioned in the paper is as follows:<br />
<br />
- Smith & Le (2017) interpreted stochastic gradient descent as a stochastic differential equation; the paper builds on this idea to include decaying learning rates.<br />
<br />
- Mandt et al. (2017) analyzed how SGD performs in Bayesian posterior sampling.<br />
<br />
- Keskar et al. (2016) focused on the analysis of noise once the training is started.<br />
<br />
- Moreover, the proportional relationship between batch size and learning rate was first discovered by Goyal et al. (2017), who used it to train ResNet-50 on ImageNet in one hour.<br />
<br />
- Furthermore, You et al. (2017a) presented Layer-wise Adaptive Rate Scaling (LARS), which applies different learning rates to different layers, and trained ImageNet in 14 minutes with 74.9% accuracy. <br />
<br />
- Finally, another strategy, asynchronous SGD (Recht et al., 2011; Dean et al., 2012), allows multiple GPUs to be used even with small batch sizes.<br />
<br />
== CONCLUSIONS ==<br />
Increasing the batch size during training has the same benefits as decaying the learning rate and, in addition, reduces the number of parameter updates, which corresponds to faster training. Experiments were performed on different image datasets and various optimizers with different training schedules to support this result. The paper also proposed increasing the learning rate and the momentum parameter m while scaling <math> B \propto \frac{\epsilon}{1-m} </math>, which achieves fewer parameter updates but slightly lower test set accuracy, as detailed in the experiments section. In summary, on the ImageNet dataset, Inception-ResNet-V2 achieved 77% validation accuracy in under 2500 parameter updates, and ResNet-50 achieved 76.1% validation set accuracy on TPU in less than 30 minutes. A notable aspect of the paper is that hyperparameters from the literature were used, and no hyperparameter tuning was needed.<br />
<br />
== CRITIQUE ==<br />
'''Pros:'''<br />
<br />
- The paper showed empirically that increasing batch size and decaying learning rate are equivalent.<br />
<br />
- Several experiments were performed on different optimizers such as SGD and Adam.<br />
<br />
- Had several comparisons with previous experimental setups.<br />
<br />
'''Cons:'''<br />
<br />
- All datasets used are image datasets. Other experiments should have been done on datasets from different domains to ensure generalization. <br />
<br />
- The number of parameter updates was used as a comparison criterion, but wall-clock times could have provided additional measurable judgment although they depend on the hardware used.<br />
<br />
- Special hardware is needed for large batch training, which is not always feasible.<br />
<br />
- In section 5.2 (Increasing the Effective Learning rate), the authors did not test a range of learning rate values and used only (0.1 and 0.5). Additional results from varying the initial learning rate values from 0.1 to 3.2 are provided in the appendix, which indicates that the test accuracy begins to fall for initial learning rates greater than ~0.4. The appended results do not show validation set accuracy curves like in Figure 6, however. It would be beneficial to see if they were similar to the original 0.1 and 0.5 initial learning rate baselines.<br />
<br />
- Although the main idea of the paper is interesting, its results do not seem very surprising in comparison with other recent papers on the subject.<br />
<br />
- The paper could benefit from using some other models to demonstrate its claim and generalize its idea by adding some comparisons with other models as well as other recent methods to increase batch size.<br />
<br />
- The paper presents interesting ideas, but it lacks mathematical and theoretical analysis beyond the core idea. Since the experiments are primarily on image datasets and little theory is provided, the paper's applicability to other types of data is limited. <br />
<br />
== REFERENCES ==<br />
- Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.<br />
<br />
- Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086, 2016.<br />
<br />
- Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.<br />
<br />
- Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012.<br />
<br />
- Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.<br />
<br />
- Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.<br />
<br />
- Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231, 2012.<br />
<br />
- Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.<br />
<br />
- Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.<br />
<br />
- Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.<br />
<br />
- Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.<br />
<br />
- Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. ACM, 2017.<br />
<br />
- Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.<br />
<br />
- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.<br />
<br />
- Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.<br />
<br />
- Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. arXiv preprint arXiv:1511.06251, 2017.<br />
<br />
- Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with restarts. arXiv preprint arXiv:1608.03983, 2016.<br />
<br />
- Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.<br />
<br />
- James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.<br />
<br />
- Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.<br />
<br />
- Lutz Prechelt. Early stopping-but when? Neural Networks: Tricks of the trade, pp. 553–553, 1998.<br />
<br />
- Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pp. 693–701, 2011.<br />
<br />
- Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.<br />
<br />
- Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.<br />
<br />
- Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pp. 4278–4284, 2017.<br />
<br />
- Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.<br />
<br />
- Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.<br />
<br />
- Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017a.<br />
<br />
- Yang You, Zhao Zhang, C Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes. CoRR, abs/1709.05011, 2017b.<br />
<br />
- Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.<br />
<br />
- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DON%27T_DECAY_THE_LEARNING_RATE_,_INCREASE_THE_BATCH_SIZE&diff=41981DON'T DECAY THE LEARNING RATE , INCREASE THE BATCH SIZE2018-11-30T00:39:07Z<p>Z43ma: </p>
<hr />
<div>Summary of the ICLR 2018 paper: '''Don't Decay the Learning Rate, Increase the Batch Size''' <br />
<br />
Link: [https://arxiv.org/pdf/1711.00489.pdf]<br />
<br />
Summarized by: Afify, Ahmed [ID: 20700841]<br />
<br />
==INTUITION==<br />
Nowadays, it is common practice not to use a single fixed learning rate while training neural network models. Instead, we use adaptive learning rates with the standard gradient descent method. The intuition is that when we are far from the minimum, it is beneficial to take large steps toward it, since fewer steps are then needed to reach it; as we approach it, the step size should decrease, otherwise we may just keep oscillating around the minimum. In practice, this is generally achieved by methods like SGD with momentum, Nesterov momentum, and Adam. However, the core claim of this paper is that the same effect can be achieved by increasing the batch size during gradient descent while keeping the learning rate constant throughout. In addition, the paper argues that such an approach reduces the number of parameter updates required to reach the minimum, thus leading to greater parallelism and shorter training times.<br />
<br />
== INTRODUCTION ==<br />
Although stochastic gradient descent (SGD) is widely used for training deep learning models because it finds minima that generalize well (Zhang et al., 2016; Wilson et al., 2017), the optimization process is slow. This has motivated researchers (Goyal et al., 2017; Hoffer et al., 2017; You et al., 2017a) to speed up optimization by taking bigger steps, reducing the number of parameter updates needed to train a model through large-batch training, which can be divided across many machines. <br />
<br />
However, increasing the batch size decreases the test set accuracy (Keskar et al., 2016; Goyal et al., 2017). Smith and Le (2017) argued that SGD has a scale of random fluctuations <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, N is the number of training samples, and B is the batch size. They concluded that, when <math> B \ll N </math>, there is an optimal batch size proportional to the learning rate, and an optimum fluctuation scale g that maximizes test set accuracy.<br />
<br />
In this paper, the authors' main goal is to provide evidence that increasing the batch size is quantitatively equivalent to decreasing the learning rate: for the same number of training epochs, both reduce the scale of random fluctuations, but increasing the batch size requires far fewer parameter updates. Moreover, the number of parameter updates can be reduced further by increasing the learning rate and scaling <math> B \propto \epsilon </math>, or reduced even more by increasing the momentum coefficient and scaling <math> B \propto \frac{1}{1-m} </math>, although the latter decreases the test accuracy. This is demonstrated by several experiments on the CIFAR-10 dataset with a wide ResNet and on the ImageNet dataset with the Inception-ResNet-V2 and ResNet-50 architectures.<br />
<br />
== STOCHASTIC GRADIENT DESCENT AND CONVEX OPTIMIZATION ==<br />
As mentioned in the previous section, the drawback of SGD compared to full-batch training is the noise it introduces, which hinders optimization. According to Robbins & Monro (1951), two conditions govern convergence to the minimum of a convex function (<math> \epsilon_i </math> denotes the learning rate at the <math> i^{th} </math> gradient update):<br />
<br />
<math> \sum_{i=1}^{\infty} \epsilon_i = \infty </math>. This condition guarantees that we can reach the minimum no matter how far away the parameters are initialized. <br />
<br />
<math> \sum_{i=1}^{\infty} \epsilon^2_i < \infty </math>. This condition, which is valid only for a fixed batch size, guarantees that the learning rate decays fast enough for us to converge to the minimum rather than bounce around it due to noise.<br />
<br />
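As a worked check of the two conditions, consider the schedule <math> \epsilon_i = \epsilon_0 / i </math>. This schedule is an illustrative assumption, not one taken from the paper:<br />
<br />
```latex
\sum_{i=1}^{\infty} \frac{\epsilon_0}{i} = \infty
\quad \text{(the harmonic series diverges)},
\qquad
\sum_{i=1}^{\infty} \frac{\epsilon_0^2}{i^2} = \frac{\pi^2 \epsilon_0^2}{6} < \infty .
```
<br />
By contrast, a constant learning rate <math> \epsilon_i = \epsilon_0 </math> satisfies the first condition but violates the second, because <math> \sum_i \epsilon_0^2 </math> diverges.<br />
<br />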
These conditions indicate that the learning rate must decay during training, but the second one applies only when the batch size is constant. To handle a changing batch size, Smith and Le (2017) proposed interpreting SGD as integrating the stochastic differential equation <math> \frac{dw}{dt} = -\frac{dC}{dw} + \eta(t) </math>, where C is the cost function, w the parameters, and η Gaussian random noise. Furthermore, they showed that the noise scale g controls the magnitude of random fluctuations in the training dynamics: <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, N is the training set size and B is the batch size. As we usually have <math> B \ll N </math>, we can write <math> g \approx \epsilon \frac{N}{B} </math>. This explains why decreasing the learning rate decreases the noise scale g, enabling us to converge to the minimum of the cost function. However, increasing the batch size has the same effect and makes g decay at a constant learning rate. In this work, the batch size is increased until <math> B \approx \frac{N}{10} </math>, after which the conventional approach of decaying the learning rate is followed.<br />
<br />
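The equivalence can be checked numerically with a minimal sketch. The function name <code>noise_scale</code> is hypothetical, and the starting values (learning rate 0.1, batch size 128, N = 50,000) are borrowed from the CIFAR-10 experiments reported later purely for illustration:<br />
<br />
```python
def noise_scale(lr, n_train, batch_size):
    """SGD noise scale g = lr * (N / B - 1), as defined above."""
    return lr * (n_train / batch_size - 1)


N = 50_000  # CIFAR-10 training set size

g_start = noise_scale(0.1, N, 128)
g_decayed_lr = noise_scale(0.1 / 5, N, 128)    # decay the learning rate by 5
g_bigger_batch = noise_scale(0.1, N, 128 * 5)  # grow the batch size by 5 instead

print(g_start, g_decayed_lr, g_bigger_batch)
```
<br />
The last two values agree to within about one percent; the small difference comes from the "-1" term, which is negligible when <math> B \ll N </math>.<br />
<br />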
== SIMULATED ANNEALING AND THE GENERALIZATION GAP ==<br />
'''Simulated Annealing:''' Introducing random noise or fluctuations whose scale falls during training.<br />
<br />
'''Generalization Gap:''' Models trained with small batches generalize better to the test set than models trained with large batches.<br />
<br />
Smith and Le (2017) found that there is an optimal batch size corresponding to an optimal noise scale g <math> (g \approx \epsilon \frac{N}{B}) </math>, and concluded that the batch size maximizing test set accuracy satisfies <math> B_{opt} \propto \epsilon N </math>. This means that gradient noise is helpful, as it lets SGD escape sharp minima, which do not generalize well. <br />
<br />
Simulated annealing is a well-known technique in non-convex optimization. Starting with a high noise level helps explore a wide range of parameters; once we are near the optimum, the noise is reduced to fine-tune the final parameters. However, researchers increasingly favor sharp decay schedules such as cosine decay or step-function drops. In the physical sciences, slowly annealing (decaying) the temperature, which plays the role of the noise scale here, helps the system converge to the global minimum, which may be sharp. Decaying the temperature in discrete steps, by contrast, can trap the system in a local minimum with higher cost but lower curvature. The authors suspect that a similar intuition holds in deep learning.<br />
<br />
== THE EFFECTIVE LEARNING RATE AND THE ACCUMULATION VARIABLE ==<br />
'''The Effective Learning Rate''' : <math> \epsilon_{eff} = \frac{\epsilon}{1-m} </math><br />
<br />
Smith and Le (2017) extended the vanilla SGD noise scale defined above to include momentum: <math> g = \frac{\epsilon}{1-m}(\frac{N}{B}-1)\approx \frac{\epsilon N}{B(1-m)} </math>, which reduces to the previous equation as m goes to 0. They found that increasing the learning rate and momentum coefficient while scaling <math> B \propto \frac{\epsilon }{1-m} </math> reduces the number of parameter updates, but that the test accuracy decreases when the momentum coefficient is increased. <br />
<br />
To understand the reasons behind this, we need to analyze the momentum update equations below:<br />
<br />
<center><math><br />
\Delta A = -(1-m)A + \frac{d\widehat{C}}{dw} <br />
</math><br />
<br />
<math><br />
\Delta w = -A\epsilon<br />
</math><br />
</center><br />
<br />
The accumulation variable A is initialized to 0 and grows exponentially toward its steady-state value over a timescale of <math> \frac{B}{N(1-m)} </math> training epochs. During this period <math> \Delta w </math> is suppressed, which can reduce the rate of convergence. Moreover, high momentum raises four challenges:<br />
<br />
1- Additional epochs are needed to catch up with the accumulation.<br />
<br />
2- The accumulation needs more time, <math> \frac{B}{N(1-m)} </math> epochs, to forget old gradients. <br />
<br />
3- If this timescale becomes too long, the accumulation cannot adapt to changes in the loss landscape.<br />
<br />
4- In the early stage, a large batch size can lead to instabilities.<br />
<br />
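The warm-up of the accumulation can be illustrated with a minimal sketch that iterates the update for A above under a constant gradient. The constant gradient and the function name <code>updates_to_reach</code> are simplifying assumptions made here, not part of the paper:<br />
<br />
```python
import math


def updates_to_reach(fraction, grad, momentum):
    """Count updates until A reaches `fraction` of its steady state grad / (1 - m).

    Iterates dA = -(1 - m) * A + grad from A = 0 with a constant gradient.
    """
    steady_state = grad / (1.0 - momentum)
    a, steps = 0.0, 0
    while a < fraction * steady_state:
        a += -(1.0 - momentum) * a + grad
        steps += 1
    return steps


target = 1.0 - math.exp(-1.0)  # one e-folding time, about 63% of steady state
for m in (0.9, 0.98):
    print(m, updates_to_reach(target, grad=1.0, momentum=m), 1.0 / (1.0 - m))
```
<br />
The number of updates grows roughly like <math> \frac{1}{1-m} </math>. Since one epoch contains <math> \frac{N}{B} </math> updates, this corresponds to the <math> \frac{B}{N(1-m)} </math> epochs quoted above, so raising m from 0.9 to 0.98 lengthens the warm-up about fivefold.<br />
<br />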
== EXPERIMENTS ==<br />
=== SIMULATED ANNEALING IN A WIDE RESNET ===<br />
<br />
'''Dataset:''' CIFAR-10 (50,000 training images)<br />
<br />
'''Network Architecture:''' “16-4” wide ResNet<br />
<br />
'''Training Schedules used as in the below figure:''' <br />
<br />
- Decaying learning rate: learning rate decays by a factor of 5 at a sequence of “steps”, and the batch size is constant<br />
<br />
- Increasing batch size: learning rate is constant, and the batch size is increased by a factor of 5 at every step.<br />
<br />
- Hybrid: At the beginning, the learning rate is constant and the batch size is increased by a factor of 5. Then, the learning rate decays by a factor of 5 at each subsequent step, and the batch size is constant. This schedule is useful when hardware imposes a maximum batch size.<br />
<br />
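The three schedules can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: the function names, the starting values (learning rate 0.1, batch size 128), the hybrid cap of 640, and the equal phase length of 50 epochs are all assumptions.<br />
<br />
```python
import math


def schedule(kind, n_steps=3, lr0=0.1, batch0=128, factor=5, max_batch=None):
    """Return the (learning_rate, batch_size) pair used in each training phase.

    kind is "decay_lr", "increase_batch", or "hybrid". The hybrid schedule
    grows the batch until the next increase would exceed max_batch, then
    decays the learning rate instead.
    """
    lr, batch = lr0, batch0
    phases = [(lr, batch)]
    for _ in range(n_steps):
        if kind == "increase_batch":
            grow = True
        elif kind == "hybrid":
            grow = max_batch is None or batch * factor <= max_batch
        else:  # "decay_lr"
            grow = False
        if grow:
            batch *= factor
        else:
            lr /= factor
        phases.append((lr, batch))
    return phases


def parameter_updates(phases, epochs_per_phase, n_train):
    """Total updates, assuming every phase lasts the same number of epochs."""
    return sum(epochs_per_phase * n_train / batch for _, batch in phases)


decay = schedule("decay_lr")
grow = schedule("increase_batch")
hybrid = schedule("hybrid", max_batch=640)

# lr / batch (proportional to the noise scale) matches in every phase.
for (lr_a, b_a), (lr_b, b_b), (lr_c, b_c) in zip(decay, grow, hybrid):
    print(lr_a / b_a, lr_b / b_b, lr_c / b_c)

for name, phases in (("decay_lr", decay), ("increase_batch", grow), ("hybrid", hybrid)):
    print(name, parameter_updates(phases, epochs_per_phase=50, n_train=50_000))
```
<br />
All three schedules follow the same noise-scale trajectory, but the schedules that grow the batch size need fewer parameter updates for the same number of epochs.<br />
<br />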
[[File:Paper_40_Fig_1.png | 800px|center]]<br />
<br />
As shown in the figure below, the three training-set learning curves in the left panel (2a) are nearly identical, while panel 2b shows that increasing the batch size has the large advantage of reducing the number of parameter updates.<br />
This indicates that it is the noise scale that needs to be decayed, not the learning rate itself.<br />
[[File:Paper_40_Fig_2.png | 800px|center]] <br />
<br />
To check that these results also hold for the test set, figure 3 shows that the three learning curves are again nearly identical for SGD with momentum and for Nesterov momentum.<br />
[[File:Paper_40_Fig_3.png | 800px|center]]<br />
<br />
To check other optimizers as well, the figure below repeats the test-set experiment of figure 3 for vanilla SGD and Adam.<br />
[[File:Paper_40_Fig_4.png | 800px|center]]<br />
<br />
'''Conclusion:''' Decreasing the learning rate and increasing the batch size during training are equivalent<br />
<br />
=== INCREASING THE EFFECTIVE LEARNING RATE===<br />
<br />
'''Dataset:''' CIFAR-10 (50,000 training images)<br />
<br />
'''Network Architecture:''' “16-4” wide ResNet<br />
<br />
'''Training Parameters:''' Optimization Algorithm: SGD with momentum / Maximum batch size = 5120<br />
<br />
'''Training Schedules:''' <br />
<br />
Four training schedules, all of which decay the noise scale by a factor of five in a series of three steps with the same number of epochs.<br />
<br />
Original training schedule: initial learning rate of 0.1 which decays by a factor of 5 at each step, a momentum coefficient of 0.9, and a batch size of 128. <br />
<br />
Increasing batch size: learning rate of 0.1, momentum coefficient of 0.9, initial batch size of 128 that increases by a factor of 5 at each step. <br />
<br />
Increased initial learning rate: initial learning rate of 0.5, initial batch size of 640 that increases during training.<br />
<br />
Increased momentum coefficient: increased initial learning rate of 0.5, initial batch size of 3200 that increases during training, and an increased momentum coefficient of 0.98.<br />
<br />
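The initial batch sizes of the last two schedules follow from the scaling rule <math> B \propto \frac{\epsilon}{1-m} </math> applied to the original schedule. A minimal sketch, where the helper name <code>scaled_batch_size</code> is hypothetical:<br />
<br />
```python
def scaled_batch_size(base_batch, base_lr, base_m, new_lr, new_m):
    """Batch size that keeps the noise scale fixed when B is proportional to lr / (1 - m)."""
    base_eff = base_lr / (1.0 - base_m)  # effective learning rate of the baseline
    new_eff = new_lr / (1.0 - new_m)     # effective learning rate of the new schedule
    return round(base_batch * new_eff / base_eff)


# Baseline from the original schedule: lr = 0.1, m = 0.9, B = 128.
print(scaled_batch_size(128, 0.1, 0.9, new_lr=0.5, new_m=0.9))
print(scaled_batch_size(128, 0.1, 0.9, new_lr=0.5, new_m=0.98))
```
<br />
Raising the learning rate fivefold multiplies the batch size by 5, and raising the momentum from 0.9 to 0.98 multiplies it by a further factor of 5, which reproduces the initial batch sizes of 640 and 3200 listed above.<br />
<br />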
The results of all training schedules, which are presented in the below figure, are documented in the following table:<br />
<br />
[[File:Paper_40_Table_1.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_5.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the effective learning rate and scaling the batch size results in further reduction in the number of parameter updates<br />
<br />
=== TRAINING IMAGENET IN 2500 PARAMETER UPDATES===<br />
<br />
'''A) Experiment Goal:''' Control Batch Size<br />
<br />
'''Dataset:''' ImageNet (1.28 million training images)<br />
<br />
The paper modified the setup of Goyal et al. (2017), and used the following configuration:<br />
<br />
'''Network Architecture:''' Inception-ResNet-V2 <br />
<br />
'''Training Parameters:''' <br />
<br />
90 epochs / noise decayed at epoch 30, 60, and 80 by a factor of 10 / Initial ghost batch size = 32 / Learning rate = 3 / momentum coefficient = 0.9 / Initial batch size = 8192<br />
<br />
Two training schedules were used:<br />
<br />
“Decaying learning rate”, where batch size is fixed and the learning rate is decayed<br />
<br />
“Increasing batch size”, where batch size is increased to 81920 then the learning rate is decayed at two steps.<br />
<br />
[[File:Paper_40_Table_2.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_6.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the batch size resulted in reducing the number of parameter updates from 14,000 to 6,000.<br />
<br />
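These counts can be reproduced approximately from the settings above, assuming <math> N \approx 1.28 \times 10^6 </math> and assuming the batch size jumps from 8192 to 81920 at the first noise decay (epoch 30); both are simplifications made here for the arithmetic:<br />
<br />
```latex
\text{Decaying learning rate:}\quad
\frac{90 \times 1.28 \times 10^{6}}{8192} \approx 14{,}063

\text{Increasing batch size:}\quad
\frac{30 \times 1.28 \times 10^{6}}{8192} + \frac{60 \times 1.28 \times 10^{6}}{81920}
\approx 4{,}688 + 938 \approx 5{,}625
```
<br />
Both values are consistent with the rounded figures of 14,000 and 6,000 quoted in the conclusion.<br />
<br />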
'''B) Experiment Goal:''' Control Batch Size and Momentum Coefficient<br />
<br />
'''Training Parameters:''' Ghost batch size = 64 / noise decayed at epoch 30, 60, and 80 by a factor of 10. <br />
<br />
The table below shows the number of parameter updates and the accuracy for different sets of training parameters:<br />
<br />
[[File:Paper_40_Table_3.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_7.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the momentum reduces the number of parameter updates, but leads to a drop in the test accuracy.<br />
<br />
=== TRAINING IMAGENET IN 30 MINUTES===<br />
<br />
'''Dataset:''' ImageNet (Already introduced in the previous section)<br />
<br />
'''Network Architecture:''' ResNet-50<br />
<br />
The paper replicated the setup of Goyal et al. (2017) while varying the number of TPU devices, the batch size, and the learning rate. For each configuration, it measured the time to complete 90 epochs and the resulting accuracy, as shown in the table below:<br />
<br />
[[File:Paper_40_Table_4.png | 800px|center]]<br />
<br />
'''Conclusion:''' Model training times can be reduced by increasing the batch size during training.<br />
<br />
== RELATED WORK ==<br />
Main related work mentioned in the paper is as follows:<br />
<br />
- Smith & Le (2017) interpreted stochastic gradient descent as a stochastic differential equation; the paper builds on this idea to cover decaying learning rates.<br />
<br />
- Mandt et al. (2017) analyzed how SGD performs in Bayesian posterior sampling.<br />
<br />
- Keskar et al. (2016) focused on the analysis of noise once the training is started.<br />
<br />
- The proportional relationship between batch size and learning rate was first discovered by Goyal et al. (2017), who used it to train ResNet-50 on ImageNet in one hour.<br />
<br />
- You et al. (2017a) presented Layer-wise Adaptive Rate Scaling (LARS), which applies different learning rates to different layers, and used it to train ImageNet in 14 minutes with 74.9% accuracy. <br />
<br />
- Finally, asynchronous SGD allowed Recht et al. (2011) and Dean et al. (2012) to use multiple GPUs even with small batch sizes.<br />
<br />
== CONCLUSIONS ==<br />
Increasing the batch size during training has the same benefits as decaying the learning rate and, in addition, reduces the number of parameter updates, which corresponds to faster training. Experiments were performed on different image datasets and various optimizers with different training schedules to support this result. The paper also proposed increasing the learning rate and the momentum parameter m while scaling <math> B \propto \frac{\epsilon}{1-m} </math>, which achieves fewer parameter updates but slightly lower test set accuracy, as detailed in the experiments section. In summary, on the ImageNet dataset, Inception-ResNet-V2 achieved 77% validation accuracy in under 2500 parameter updates, and ResNet-50 achieved 76.1% validation set accuracy on TPU in less than 30 minutes. A notable aspect of the paper is that hyperparameters from the literature were used, and no hyperparameter tuning was needed.<br />
<br />
== CRITIQUE ==<br />
'''Pros:'''<br />
<br />
- The paper showed empirically that increasing batch size and decaying learning rate are equivalent.<br />
<br />
- Several experiments were performed on different optimizers such as SGD and Adam.<br />
<br />
- Had several comparisons with previous experimental setups.<br />
<br />
'''Cons:'''<br />
<br />
- All datasets used are image datasets. Other experiments should have been done on datasets from different domains to ensure generalization. <br />
<br />
- The number of parameter updates was used as a comparison criterion, but wall-clock times could have provided additional measurable judgment although they depend on the hardware used.<br />
<br />
- Special hardware is needed for large batch training, which is not always feasible.<br />
<br />
- In section 5.2 (Increasing the Effective Learning rate), the authors did not test a range of learning rate values and used only (0.1 and 0.5). Additional results from varying the initial learning rate values from 0.1 to 3.2 are provided in the appendix, which indicates that the test accuracy begins to fall for initial learning rates greater than ~0.4. The appended results do not show validation set accuracy curves like in Figure 6, however. It would be beneficial to see if they were similar to the original 0.1 and 0.5 initial learning rate baselines.<br />
<br />
- Although the main idea of the paper is interesting, its results do not seem very surprising in comparison with other recent papers on the subject.<br />
<br />
- The paper could benefit from using some other models to demonstrate its claim and generalize its idea by adding some comparisons with other models as well as other recent methods to increase batch size.<br />
<br />
- The paper presents interesting ideas, but it lacks mathematical and theoretical analysis beyond the core idea. Since the experiments are primarily on image datasets and little theory is provided, the paper's applicability to other types of data is limited. <br />
<br />
== REFERENCES ==<br />
- Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.<br />
<br />
- Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086, 2016.<br />
<br />
- Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.<br />
<br />
- Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012.<br />
<br />
- Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.<br />
<br />
- Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.<br />
<br />
- Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231, 2012.<br />
<br />
- Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.<br />
<br />
- Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.<br />
<br />
- Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.<br />
<br />
- Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.<br />
<br />
- Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. ACM, 2017.<br />
<br />
- Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.<br />
<br />
- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.<br />
<br />
- Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.<br />
<br />
- Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. arXiv preprint arXiv:1511.06251, 2017.<br />
<br />
- Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with restarts. arXiv preprint arXiv:1608.03983, 2016.<br />
<br />
- Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.<br />
<br />
- James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.<br />
<br />
- Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.<br />
<br />
- Lutz Prechelt. Early stopping-but when? Neural Networks: Tricks of the trade, pp. 553–553, 1998.<br />
<br />
- Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pp. 693–701, 2011.<br />
<br />
- Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.<br />
<br />
- Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.<br />
<br />
- Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pp. 4278–4284, 2017.<br />
<br />
- Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.<br />
<br />
- Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.<br />
<br />
- Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017a.<br />
<br />
- Yang You, Zhao Zhang, C Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes. CoRR, abs/1709.05011, 2017b.<br />
<br />
- Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.<br />
<br />
- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DON%27T_DECAY_THE_LEARNING_RATE_,_INCREASE_THE_BATCH_SIZE&diff=41980DON'T DECAY THE LEARNING RATE , INCREASE THE BATCH SIZE2018-11-30T00:38:53Z<p>Z43ma: </p>
<hr />
<div>Summary of the ICLR 2018 paper: '''Don't Decay the Learning Rate, Increase the Batch Size''' <br />
<br />
Link: [https://arxiv.org/pdf/1711.00489.pdf]<br />
<br />
Summarized by: Afify, Ahmed [ID: 20700841]<br />
<br />
==INTUITION==<br />
Nowadays, it is common practice not to use a single fixed learning rate throughout the training of a neural network. Instead, we use adaptive learning rates with the standard gradient descent method. The intuition is that when we are far from a minimum it is beneficial to take large steps towards it, since fewer steps are then needed to reach it, but as we approach it the step size should decrease; otherwise we may keep oscillating around the minimum. In practice, this is generally achieved by methods like SGD with momentum, Nesterov momentum, and Adam. However, the core claim of this paper is that the same effect can be achieved by increasing the batch size during training while keeping the learning rate constant throughout. In addition, the paper argues that this approach reduces the number of parameter updates required to reach the minimum, leading to greater parallelism and shorter training times.<br />
<br />
== INTRODUCTION ==<br />
Although stochastic gradient descent (SGD) is widely used for training deep learning models because it finds minima that generalize well (Zhang et al., 2016; Wilson et al., 2017), the optimization process is slow. According to (Goyal et al., 2017; Hoffer et al., 2017; You et al., 2017a), this has motivated researchers to speed up optimization by taking bigger steps, reducing the number of parameter updates needed to train a model through large-batch training, which can be divided across many machines. <br />
<br />
However, increasing the batch size tends to decrease the test set accuracy (Keskar et al., 2016; Goyal et al., 2017). Smith and Le (2017) argued that SGD has a scale of random fluctuations <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, N is the number of training samples, and B is the batch size. They concluded that, when <math> B \ll N </math>, there is an optimal batch size proportional to the learning rate, and an optimal fluctuation scale g that maximizes the test set accuracy.<br />
<br />
In this paper, the authors' main goal is to provide evidence that, for the same number of training epochs, increasing the batch size is quantitatively equivalent to decreasing the learning rate in terms of reducing the scale of random fluctuations, but requires significantly fewer parameter updates. Moreover, a further reduction in the number of parameter updates can be attained by increasing the learning rate and scaling <math> B \propto \epsilon </math>, and an even larger reduction by increasing the momentum coefficient and scaling <math> B \propto \frac{1}{1-m} </math>, although the latter decreases the test accuracy. This is demonstrated by several experiments on CIFAR-10 (using a wide ResNet) and on ImageNet (using the Inception-ResNet-V2 and ResNet-50 architectures).<br />
<br />
== STOCHASTIC GRADIENT DESCENT AND CONVEX OPTIMIZATION ==<br />
As mentioned in the previous section, the drawback of SGD compared to full-batch training is that the noise it introduces hinders optimization. According to (Robbins & Monro, 1951), two conditions govern whether we reach the minimum of a convex function (<math> \epsilon_i </math> denotes the learning rate at the <math> i^{th} </math> gradient update):<br />
<br />
<math> \sum_{i=1}^{\infty} \epsilon_i = \infty </math>. This equation guarantees that we will reach the minimum <br />
<br />
<math> \sum_{i=1}^{\infty} \epsilon^2_i < \infty </math>. This equation, which is valid only for a fixed batch size, guarantees that learning rate decays fast enough allowing us to reach the minimum rather than bouncing due to noise.<br />
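<br />
As a concrete illustration (this example is not taken from the paper), the schedule <math> \epsilon_i = \frac{\epsilon_0}{i} </math> satisfies both requirements, since <math> \sum_{i=1}^{\infty} \frac{1}{i} = \infty </math> while <math> \sum_{i=1}^{\infty} \frac{1}{i^2} = \frac{\pi^2}{6} < \infty </math>. A constant learning rate <math> \epsilon_i = \epsilon_0 </math> satisfies the first requirement but violates the second, because <math> \sum_{i=1}^{\infty} \epsilon_0^2 = \infty </math>.<br />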
<br />
These conditions indicate that the learning rate must decay during training, and the second condition applies only when the batch size is constant. To allow the batch size to change, Smith and Le (2017) proposed interpreting SGD as integrating the stochastic differential equation <math> \frac{dw}{dt} = -\frac{dC}{dw} + \eta(t) </math>, where C represents the cost function, w represents the parameters, and η represents Gaussian random noise. Furthermore, they showed that the noise scale g controls the magnitude of random fluctuations in the training dynamics through the formula <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, N is the training set size and B is the batch size. Since we usually have <math> B \ll N </math>, we can approximate <math> g \approx \epsilon \frac{N}{B} </math>. This explains why decreasing the learning rate decreases the noise scale g, enabling us to converge to the minimum of the cost function. However, increasing the batch size has the same effect and makes g decay at a constant learning rate. In this work, the batch size is increased until <math> B \approx \frac{N}{10} </math>, after which the conventional approach of decaying the learning rate is followed.<br />
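<br />
The following is a minimal sketch of the noise scale formula, written for illustration only. The function names are hypothetical, and the values (a CIFAR-10-sized training set, a learning rate of 0.1, a batch size of 128 and a factor of 5) are assumptions chosen to mirror the experiments described later. It shows that dividing the learning rate by 5 and multiplying the batch size by 5 give the same approximate noise scale.<br />
<br />
```python
def noise_scale(lr, train_size, batch_size):
    """Exact SGD noise scale g = lr * (N / B - 1)."""
    return lr * (train_size / batch_size - 1)


def noise_scale_approx(lr, train_size, batch_size):
    """Approximation g ~= lr * N / B, reasonable when B << N."""
    return lr * train_size / batch_size


N = 50_000  # assumed training-set size (CIFAR-10 scale)

base = noise_scale_approx(0.1, N, 128)
decayed_lr = noise_scale_approx(0.1 / 5, N, 128)    # decay the learning rate
bigger_batch = noise_scale_approx(0.1, N, 128 * 5)  # or grow the batch instead

# Relative error of the B << N approximation for the baseline setting.
rel_error = abs(base - noise_scale(0.1, N, 128)) / noise_scale(0.1, N, 128)

print(base, decayed_lr, bigger_batch, rel_error)
```
<br />
Under these assumptions both modifications reduce the noise scale by the same factor, which is the equivalence the paper tests empirically.<br />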
<br />
== SIMULATED ANNEALING AND THE GENERALIZATION GAP ==<br />
'''Simulated Annealing:''' Introducing random noise or fluctuations whose scale falls during training.<br />
<br />
'''Generalization Gap:''' Models trained with small batches generalize better to the test set than models trained with large batches.<br />
<br />
Smith and Le (2017) found that there is an optimal batch size which corresponds to an optimal noise scale g <math> (g \approx \epsilon \frac{N}{B}) </math>, and concluded that the batch size maximizing test set accuracy scales as <math> B_{opt} \propto \epsilon N </math>. This means that gradient noise is helpful, as it lets SGD escape sharp minima, which do not generalize well. <br />
<br />
Simulated annealing is a well-known technique in non-convex optimization. Starting with noise in the training process helps us explore a wide range of parameters; once we are near the optimum, the noise is reduced to fine-tune the final parameters. However, researchers increasingly favor sharp decay schedules such as cosine decay or step-function drops. In the physical sciences, slowly annealing (or decaying) the temperature (which is the noise scale in this situation) helps the system converge to the global minimum, which may be sharp. Decaying the temperature in discrete steps, by contrast, can trap the system in a local minimum with higher cost but lower curvature. The authors think that deep learning follows the same intuition.<br />
<br />
== THE EFFECTIVE LEARNING RATE AND THE ACCUMULATION VARIABLE ==<br />
'''The Effective Learning Rate''' : <math> \epsilon_{eff} = \frac{\epsilon}{1-m} </math><br />
<br />
Smith and Le (2017) extended the vanilla SGD noise scale defined above to include momentum: <math> g = \frac{\epsilon}{1-m}(\frac{N}{B}-1)\approx \frac{\epsilon N}{B(1-m)} </math>, which reduces to the previous equation when m goes to 0. The authors found that increasing the learning rate and momentum coefficient while scaling <math> B \propto \frac{\epsilon }{1-m} </math> reduces the number of parameter updates, but that the test accuracy decreases when the momentum coefficient is increased. <br />
<br />
To understand the reasons behind this, we need to analyze momentum update equations below:<br />
<br />
<center><math><br />
\Delta A = -(1-m)A + \frac{d\widehat{C}}{dw} <br />
</math><br />
<br />
<math><br />
\Delta w = -A\epsilon<br />
</math><br />
</center><br />
<br />
We can see that the accumulation variable A, which is initially set to 0, approaches its steady-state value exponentially over a timescale of <math> \frac{B}{N(1-m)} </math> training epochs. During this period <math> \Delta w </math> is suppressed, which can reduce the rate of convergence. Moreover, at high momentum, we face the following challenges:<br />
<br />
1- Additional epochs are needed to catch up with the accumulation.<br />
<br />
2- The accumulation needs more time, about <math> \frac{B}{N(1-m)} </math> epochs, to forget old gradients. <br />
<br />
3- Once this timescale becomes several epochs long, the accumulation cannot adapt to changes in the loss landscape.<br />
<br />
4- In the early stage, large batch size will lead to the instabilities.<br />
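<br />
The behaviour of the accumulation variable can be sketched numerically. The snippet below is an illustrative assumption, not the authors' code: it iterates the update <math> \Delta A = -(1-m)A + \frac{d\widehat{C}}{dw} </math> with a constant gradient of 1 (a deliberate simplification) and counts how many updates A needs to reach 95% of its steady-state value <math> \frac{1}{1-m} </math>. The function names and the 95% threshold are hypothetical choices.<br />
<br />
```python
def accumulate(grad, momentum, steps):
    """Iterate dA = -(1 - m) * A + grad from A = 0 and return the history of A."""
    a = 0.0
    history = []
    for _ in range(steps):
        a += -(1.0 - momentum) * a + grad
        history.append(a)
    return history


def steps_to_fraction(momentum, fraction=0.95):
    """Updates needed for A to reach `fraction` of its steady state 1 / (1 - m)."""
    history = accumulate(1.0, momentum, 10_000)
    target = fraction / (1.0 - momentum)
    return next(i + 1 for i, a in enumerate(history) if a >= target)


low = steps_to_fraction(0.9)    # momentum used in the original schedule
high = steps_to_fraction(0.98)  # increased momentum coefficient
steady = accumulate(1.0, 0.9, 200)[-1]

print(low, high, steady)
```
<br />
With a constant gradient, A settles at the gradient divided by (1-m), so the steady-state step <math> \Delta w = -A\epsilon </math> corresponds to the effective learning rate <math> \frac{\epsilon}{1-m} </math>. The number of updates needed to get there grows roughly like <math> \frac{1}{1-m} </math>, which is <math> \frac{B}{N(1-m)} </math> epochs when each epoch contains <math> \frac{N}{B} </math> updates.<br />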
<br />
== EXPERIMENTS ==<br />
=== SIMULATED ANNEALING IN A WIDE RESNET ===<br />
<br />
'''Dataset:''' CIFAR-10 (50,000 training images)<br />
<br />
'''Network Architecture:''' “16-4” wide ResNet<br />
<br />
'''Training schedules (shown in the figure below):''' <br />
<br />
- Decaying learning rate: learning rate decays by a factor of 5 at a sequence of “steps”, and the batch size is constant<br />
<br />
- Increasing batch size: learning rate is constant, and the batch size is increased by a factor of 5 at every step.<br />
<br />
- Hybrid: At the first step, the learning rate is held constant and the batch size is increased by a factor of 5. At each subsequent step, the learning rate decays by a factor of 5 and the batch size is held constant. This schedule is useful when hardware limits the maximum batch size.<br />
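<br />
The three schedules can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, and the initial learning rate of 0.1, the initial batch size of 128 and the three decay steps are assumptions borrowed from the settings quoted in the next experiment. The sketch checks that all three schedules produce the same approximate noise scale <math> g \approx \epsilon \frac{N}{B} </math> in every phase.<br />
<br />
```python
def schedule(kind, lr0=0.1, batch0=128, factor=5, steps=3):
    """Return [(learning_rate, batch_size)] for the initial phase plus `steps` steps."""
    lr, batch = lr0, batch0
    phases = [(lr, batch)]
    for step in range(steps):
        if kind == "decay_lr":
            lr /= factor
        elif kind == "increase_batch":
            batch *= factor
        elif kind == "hybrid":
            # Grow the batch at the first step only, then decay the learning rate.
            if step == 0:
                batch *= factor
            else:
                lr /= factor
        else:
            raise ValueError(f"unknown schedule: {kind}")
        phases.append((lr, batch))
    return phases


def noise_scales(phases, train_size=50_000):
    """Approximate noise scale lr * N / B for each phase."""
    return [lr * train_size / batch for lr, batch in phases]


scales = {kind: noise_scales(schedule(kind))
          for kind in ("decay_lr", "increase_batch", "hybrid")}

for kind, values in scales.items():
    print(kind, values)
```
<br />
The schedules differ only in how the noise scale is reduced, not in the noise scale itself, which is why the paper expects their learning curves to match.<br />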
<br />
[[File:Paper_40_Fig_1.png | 800px|center]]<br />
<br />
As shown in the figure below, the left panel (2a) shows that the three learning curves on the training set are essentially identical, while panel 2b shows that increasing the batch size substantially reduces the number of parameter updates.<br />
This suggests that it is the noise scale, not the learning rate itself, that needs to be decayed.
[[File:Paper_40_Fig_2.png | 800px|center]] <br />
<br />
To confirm that these results also hold on the test set, figure 3 shows that the three learning curves are again essentially identical for SGD with momentum and for Nesterov momentum.
[[File:Paper_40_Fig_3.png | 800px|center]]<br />
<br />
To check other optimizers as well, the figure below repeats the experiment of figure 3 (the three test-set learning curves) for vanilla SGD and Adam. <br />
[[File:Paper_40_Fig_4.png | 800px|center]]<br />
<br />
'''Conclusion:''' Decreasing the learning rate and increasing the batch size during training are equivalent<br />
<br />
=== INCREASING THE EFFECTIVE LEARNING RATE===<br />
<br />
'''Dataset:''' CIFAR-10 (50,000 training images)<br />
<br />
'''Network Architecture:''' “16-4” wide ResNet<br />
<br />
'''Training Parameters:''' Optimization Algorithm: SGD with momentum / Maximum batch size = 5120<br />
<br />
'''Training Schedules:''' <br />
<br />
Four training schedules, all of which decay the noise scale by a factor of five in a series of three steps with the same number of epochs.<br />
<br />
Original training schedule: initial learning rate of 0.1 which decays by a factor of 5 at each step, a momentum coefficient of 0.9, and a batch size of 128. <br />
<br />
Increasing batch size: learning rate of 0.1, momentum coefficient of 0.9, initial batch size of 128 that increases by a factor of 5 at each step. <br />
<br />
Increased initial learning rate: initial learning rate of 0.5, initial batch size of 640 that increases during training.<br />
<br />
Increased momentum coefficient: increased initial learning rate of 0.5, initial batch size of 3200 that increases during training, and an increased momentum coefficient of 0.98.<br />
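<br />
The initial batch sizes quoted above are consistent with holding the noise scale <math> g \approx \frac{\epsilon N}{B(1-m)} </math> fixed, i.e. scaling <math> B \propto \frac{\epsilon}{1-m} </math>. The short sketch below illustrates that arithmetic; the helper name is hypothetical, and rounding to the nearest integer is an assumption made here to absorb floating-point error.<br />
<br />
```python
def scaled_batch_size(batch0, lr0, m0, lr, m):
    """Batch size that keeps g ~= lr * N / (B * (1 - m)) unchanged."""
    return round(batch0 * (lr / lr0) * ((1 - m0) / (1 - m)))


# Original schedule: learning rate 0.1, momentum 0.9, batch size 128.
baseline = dict(batch0=128, lr0=0.1, m0=0.9)

higher_lr = scaled_batch_size(lr=0.5, m=0.9, **baseline)
higher_momentum = scaled_batch_size(lr=0.5, m=0.98, **baseline)

print(higher_lr, higher_momentum)
```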
<br />
The results of all training schedules, which are presented in the below figure, are documented in the following table:<br />
<br />
[[File:Paper_40_Table_1.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_5.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the effective learning rate and scaling the batch size results in further reduction in the number of parameter updates<br />
<br />
=== TRAINING IMAGENET IN 2500 PARAMETER UPDATES===<br />
<br />
'''A) Experiment Goal:''' Control Batch Size<br />
<br />
'''Dataset:''' ImageNet (1.28 million training images)<br />
<br />
The paper modified the setup of Goyal et al. (2017), and used the following configuration:<br />
<br />
'''Network Architecture:''' Inception-ResNet-V2 <br />
<br />
'''Training Parameters:''' <br />
<br />
90 epochs / noise decayed at epoch 30, 60, and 80 by a factor of 10 / Initial ghost batch size = 32 / Learning rate = 3 / momentum coefficient = 0.9 / Initial batch size = 8192<br />
<br />
Two training schedules were used:<br />
<br />
“Decaying learning rate”, where batch size is fixed and the learning rate is decayed<br />
<br />
“Increasing batch size”, where batch size is increased to 81920 then the learning rate is decayed at two steps.<br />
<br />
[[File:Paper_40_Table_2.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_6.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the batch size resulted in reducing the number of parameter updates from 14,000 to 6,000.<br />
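<br />
The order of magnitude of these update counts can be checked with simple arithmetic. The sketch below is an illustration under stated assumptions: the training set is rounded to 1.28 million images, a trailing partial batch is counted as one update, and the batch size is assumed to jump from 8192 to 81920 at epoch 30 and stay there for the remaining 60 epochs. The function name is hypothetical.<br />
<br />
```python
import math


def parameter_updates(train_size, phases):
    """Total updates for phases given as (epochs, batch_size) pairs."""
    return sum(epochs * math.ceil(train_size / batch) for epochs, batch in phases)


N = 1_280_000  # "1.28 million training images", rounded

# Decaying learning rate: batch size fixed at 8192 for all 90 epochs.
fixed_batch = parameter_updates(N, [(90, 8192)])

# Increasing batch size: 8192 for 30 epochs, then 81920 for the last 60.
increasing_batch = parameter_updates(N, [(30, 8192), (60, 81920)])

print(fixed_batch, increasing_batch)
```
<br />
Under these assumptions the two totals come out close to the rounded figures of 14,000 and 6,000 quoted above.<br />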
<br />
'''B) Experiment Goal:''' Control Batch Size and Momentum Coefficient<br />
<br />
'''Training Parameters:''' Ghost batch size = 64 / noise decayed at epoch 30, 60, and 80 by a factor of 10. <br />
<br />
The table below shows the number of parameter updates and the accuracy for different sets of training parameters:<br />
<br />
[[File:Paper_40_Table_3.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_7.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the momentum reduces the number of parameter updates, but leads to a drop in the test accuracy.<br />
<br />
=== TRAINING IMAGENET IN 30 MINUTES===<br />
<br />
'''Dataset:''' ImageNet (Already introduced in the previous section)<br />
<br />
'''Network Architecture:''' ResNet-50<br />
<br />
The paper replicated the setup of Goyal et al. (2017), varying the number of TPU devices, the batch size, and the learning rate, and measured the accuracy and the time needed to complete 90 epochs in the following experiments:<br />
<br />
[[File:Paper_40_Table_4.png | 800px|center]]<br />
<br />
'''Conclusion:''' Model training times can be reduced by increasing the batch size during training.<br />
<br />
== RELATED WORK ==<br />
Main related work mentioned in the paper is as follows:<br />
<br />
- Smith & Le (2017) interpreted stochastic gradient descent as a stochastic differential equation; the paper builds on this idea to include decaying learning rates.<br />
<br />
- Mandt et al. (2017) analyzed how SGD performs in Bayesian posterior sampling.<br />
<br />
- Keskar et al. (2016) focused on the analysis of noise once the training is started.<br />
<br />
- Moreover, Goyal et al. (2017) first discovered the proportional relationship between batch size and learning rate, and used it to train ResNet-50 on ImageNet in one hour.<br />
<br />
- Furthermore, You et al. (2017a) presented Layer-wise Adaptive Rate Scaling (LARS), which applies different learning rates, and trained ImageNet in 14 minutes with 74.9% accuracy. <br />
<br />
- Finally, another strategy called Asynchronous-SGD allowed (Recht et al., 2011; Dean et al., 2012) to use multiple GPUs even with small batch sizes.<br />
<br />
== CONCLUSIONS ==<br />
Increasing the batch size during training has the same benefits as decaying the learning rate, while also reducing the number of parameter updates, which corresponds to faster training. Experiments were performed on different image datasets and various optimizers with different training schedules to support this result. The paper also proposed increasing the learning rate and momentum parameter m while scaling <math> B \propto \frac{\epsilon}{1-m} </math>, which achieves fewer parameter updates but slightly lower test set accuracy, as detailed in the experiments section. In summary, on the ImageNet dataset, Inception-ResNet-V2 achieved 77% validation accuracy in under 2500 parameter updates, and ResNet-50 achieved 76.1% validation set accuracy on TPU in less than 30 minutes. One notable finding of this paper is that hyperparameters from the literature were used, and no hyperparameter tuning was needed.<br />
<br />
== CRITIQUE ==<br />
'''Pros:'''<br />
<br />
- The paper showed empirically that increasing batch size and decaying learning rate are equivalent.<br />
<br />
- Several experiments were performed on different optimizers such as SGD and Adam.<br />
<br />
- Had several comparisons with previous experimental setups.<br />
<br />
'''Cons:'''<br />
<br />
- All datasets used are image datasets. Other experiments should have been done on datasets from different domains to ensure generalization. <br />
<br />
- The number of parameter updates was used as a comparison criterion, but wall-clock times could have provided additional measurable judgment although they depend on the hardware used.<br />
<br />
- Special hardware is needed for large batch training, which is not always feasible.<br />
<br />
- In section 5.2 (Increasing the Effective Learning rate), the authors did not test a range of learning rate values and used only (0.1 and 0.5). Additional results from varying the initial learning rate values from 0.1 to 3.2 are provided in the appendix, which indicates that the test accuracy begins to fall for initial learning rates greater than ~0.4. The appended results do not show validation set accuracy curves like in Figure 6, however. It would be beneficial to see if they were similar to the original 0.1 and 0.5 initial learning rate baselines.<br />
<br />
- Although the main idea of the paper is interesting, its results do not seem too surprising in comparison with other recent papers on the subject.<br />
<br />
- The paper could benefit from using some other models to demonstrate its claim and generalize its idea by adding some comparisons with other models as well as other recent methods to increase batch size.<br />
<br />
- The paper presents interesting ideas. However, it lacks mathematical and theoretical analysis beyond the core idea. Since the experiments are primarily on image datasets and little theory is provided, the paper has limited demonstrated applicability to other types of data. <br />
<br />
== REFERENCES ==<br />
- Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.<br />
<br />
- Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086, 2016.<br />
<br />
- Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.<br />
<br />
- Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012.<br />
<br />
- Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.<br />
<br />
- Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.<br />
<br />
- Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231, 2012.<br />
<br />
- Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.<br />
<br />
- Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.<br />
<br />
- Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.<br />
<br />
- Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.<br />
<br />
- Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. ACM, 2017.<br />
<br />
- Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.<br />
<br />
- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.<br />
<br />
- Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.<br />
<br />
- Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. arXiv preprint arXiv:1511.06251, 2017.<br />
<br />
- Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with restarts. arXiv preprint arXiv:1608.03983, 2016.<br />
<br />
- Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.<br />
<br />
- James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.<br />
<br />
- Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.<br />
<br />
- Lutz Prechelt. Early stopping-but when? Neural Networks: Tricks of the trade, pp. 553–553, 1998.<br />
<br />
- Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pp. 693–701, 2011.<br />
<br />
- Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.<br />
<br />
- Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.<br />
<br />
- Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pp. 4278–4284, 2017.<br />
<br />
- Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.<br />
<br />
- Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.<br />
<br />
- Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017a.<br />
<br />
- Yang You, Zhao Zhang, C Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes. CoRR, abs/1709.05011, 2017b.<br />
<br />
- Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.<br />
<br />
- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.</div>Z43mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DETECTING_STATISTICAL_INTERACTIONS_FROM_NEURAL_NETWORK_WEIGHTS&diff=41978DETECTING STATISTICAL INTERACTIONS FROM NEURAL NETWORK WEIGHTS2018-11-30T00:36:45Z<p>Z43ma: </p>
<hr />
<div>=Introduction=<br />
<br />
It has been commonly believed that one major advantage of neural networks is their capability of modelling complex statistical interactions between features for automatic feature learning. Statistical interactions capture important information on where features often have joint effects with other features on predicting an outcome. The discovery of interactions is especially useful for scientific discoveries and hypothesis validation. For example, physicists may be interested in understanding what joint factors provide evidence for new elementary particles; doctors may want to know what interactions are accounted for in risk prediction models, to compare against known interactions from existing medical literature.<br />
<br />
With the growth in available computational power, neural networks have been able to solve many complex tasks in a wide variety of fields. This is mainly due to their ability to model complex and non-linear interactions. Neural networks have traditionally been treated as “black box” models, preventing their adoption in many application domains, such as those where explainability is desirable. It has been noted that complex machine learning models can learn unintended patterns from data, raising significant risks to stakeholders [14]. Therefore, in applications where machine learning models are intended for making critical decisions, such as healthcare or finance, it is paramount to understand how they make predictions [9]. In several areas, such as computational social science, interpretability is of utmost importance. Since we do not understand how a neural network comes to its decision, practitioners in these areas tend to prefer simpler models like linear regression and decision trees, which are much more interpretable. This paper presents one way of making a neural network interpretable.<br />
<br />
Existing approaches to interpreting neural networks can be summarized into two types. One type is direct interpretation, which focuses on 1) explaining individual feature importance, for example by computing input gradients [13] and decomposing predictions [8], 2) developing attention-based models, which illustrate where neural networks focus during inference [11], and 3) providing model-specific visualizations, such as feature map and gate activation visualizations [15]. The other type is indirect interpretation, for example post-hoc interpretations of feature importance [12] and knowledge distillation to simpler interpretable models [10].<br />
<br />
In this paper, the authors propose Neural Interaction Detection (NID), which can detect any order or form of statistical interaction captured by the feedforward neural network by examining its weight matrix.<br />
<br />
Note that this paper considers only one specific type of neural network, the feedforward neural network. Based on the methodology discussed here, the authors suggest that interpretation methods can be built for other types of networks as well.<br />
<br />
=Related Work=<br />
<br />
1. Interaction Detection approaches: <br />
* Conduct individual tests for all combinations of features, as in ANOVA and Additive Groves.<br />
* Define all interaction forms of interest in advance, then find the important ones.<br />
- The paper's goal is to detect interactions without restricting the functional forms they may take. The method also detects higher-order interactions while avoiding a high false positive or false discovery rate.<br />
<br />
2. Interpretability: A lot of work has also been done in this area, and it can be divided into the following broad categories:<br />
* Feature Importance through Decomposition: Methods like Input Gradient (Sundararajan et al., 2017) learn the importance of features through a gradient-based approach similar to backpropagation. Works like Li et al. (2017), Murdoch (2017) and Murdoch (2018) study the interpretability of LSTMs by looking at phrase- and word-level importance scores. Bach et al. (2015) and Shrikumar et al. (2016) (DeepLift) study pixel importance in CNNs.<br />
* Studying Visualizations in Models: Karpathy et al. (2015) worked with character-generating LSTMs and studied activation and firing in certain hidden units for meaningful attributes. Yosinski et al. (2015) study feature map visualizations. <br />
* Attention-Based Models (Bahdanau et al., 2014): These are a different class of models, which use attention modules (in different architectures) to help the neural network decide which parts of the input it should look at more closely or give more importance to. Looking at the results of this type of model, an indirect sense of interpretability can be gauged.<br />
<br />
The approach in this paper is to extract non-additive interactions between variables from the neural network weights.<br />
<br />
=Notations=<br />
Before we dive into the methodology, we define some notation. Most of it is standard.<br />
<br />
1. Vector: Vectors are written in bold lowercase, '''v, w'''<br />
<br />
2. Matrix: Matrices are written in bold uppercase, '''V, W'''<br />
<br />
3. Integer Set: For some integer p <math>\in</math> Z, we define [p] := {1,2,3,...,p}<br />
<br />
=Interaction=<br />
First of all, in order to explain the model, we need to be able to explain the interactions and their effects on the output. Therefore, we define an 'interaction' between variables as below. <br />
<br />
[[File:def_interaction.PNG|900px|center]]<br />
<br />
From the definition above, a function such as <math>x_1x_2 + \sin(x_3 + x_4 + x_5)</math> has the interactions <math>{[x_1, x_2]}</math> and <math>{[x_3, x_4, x_5]}</math>. The latter is called a 3-way interaction.<br />
<br />
Note that from the definition above, we can deduce that a d-way interaction can exist only if all of its (d-1)-way sub-interactions exist. For example, the 3-way interaction above implies the 2-way interactions <math>{[x_3, x_4]}</math>, <math>{[x_4, x_5]}</math> and <math>{[x_3, x_5]}</math>.<br />
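<br />
The pairwise interactions of the example function can be probed numerically. The sketch below is an illustration only and is not the NID method: it uses a second-order mixed finite difference, which vanishes for every pair of variables that enter the function additively. The helper names, the base point and the step size are hypothetical choices. A nonzero value at one point shows that a pair interacts, while a zero at a single point does not by itself rule an interaction out.<br />
<br />
```python
import math
from itertools import combinations


def f(x):
    """Example function from the text (0-based: x[0] is x_1, ..., x[4] is x_5)."""
    return x[0] * x[1] + math.sin(x[2] + x[3] + x[4])


def mixed_difference(func, x, i, j, h=0.5):
    """Second-order mixed finite difference of func in coordinates i and j."""
    def shifted(di, dj):
        y = list(x)
        y[i] += di
        y[j] += dj
        return func(y)
    return shifted(h, h) - shifted(h, 0.0) - shifted(0.0, h) + shifted(0.0, 0.0)


base = [0.3] * 5  # arbitrary evaluation point

# Pairs whose mixed difference is clearly nonzero at the base point.
pairs = [(i, j) for i, j in combinations(range(5), 2)
         if abs(mixed_difference(f, base, i, j)) > 1e-9]

print(pairs)
```
<br />
At this base point the detected pairs are the one inside the product term and the three pairs inside the sine term, matching the 2-way interactions listed above.<br />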
<br />
One thing to keep in mind is that, for models like neural networks, most interactions happen within the hidden layers. This means that we need a proper way of measuring interaction strength.<br />
<br />
The key observation is that, for any interaction, there is some hidden unit in some hidden layer that has all of the interacting features as ancestors. In graph-theoretical language, the network can be viewed as a directed graph, and for any interaction <math>\Gamma \subseteq [p]</math> there exists at least one vertex that has all of the features of <math>\Gamma</math> as ancestors. The statement can be made rigorous as follows:<br />
<br />
<br />
[[File:prop2.PNG|900px|center]]<br />
<br />
The above statement guarantees that we can measure interaction strengths at ANY hidden layer. For example, if we want to study interactions at some specific hidden layer, we now know that there exist corresponding vertices between that hidden layer and the output layer. Therefore, all we need to do is find an appropriate measure that summarizes the information between those two layers.<br />
<br />
Before doing so, let's consider a neural network with a single hidden layer. Any one hidden unit <math>i</math> can give rise to as many as <math>2^{\|\mathbf{W}_{i,:}\|_0}</math> interactions, where <math>\|\mathbf{W}_{i,:}\|_0</math> counts the nonzero input weights of that unit. This means that the search space can be too large for multi-layered networks, so we need a reasonable way to approximate it. The authors achieve fast interaction detection by limiting the search to interactions created at the first hidden layer.<br />
[[File:network1.PNG|500px|center]]<br />
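<br />
To give a feel for how quickly this search space grows, the following sketch counts the feature subsets that a single hidden unit could involve, assuming the exponent is the number of nonzero input weights of that unit. The function name <code>candidate_count</code> and the example weights are hypothetical.<br />

```python
def candidate_count(weight_row, tol=0.0):
    """Upper bound on the number of feature subsets seen by one hidden unit.

    Assumes the exponent is the count of input weights whose magnitude exceeds tol.
    """
    nonzero = sum(1 for w in weight_row if abs(w) > tol)
    return 2 ** nonzero


# Hypothetical first-layer weights of one hidden unit over five input features.
row = [0.8, 0.0, -0.3, 0.0, 1.2]
print(candidate_count(row))

# With 30 nonzero weights the count already exceeds one billion.
print(candidate_count([1.0] * 30))
```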
<br />
==Measuring influence in hidden layers==<br />
As discussed above, to consider interactions between units in any layer, we need to think about their outgoing paths. However, for a fully-connected multi-layer neural network, the search space of paths may be too large to examine. Therefore, we summarize the outgoing paths with an upper bound on the gradient. To represent the influence of the outgoing paths at the <math>l</math>-th hidden layer, we define the cumulative impact of the weights between the output layer and layer <math>l+1</math>. We define the aggregated weights as, <br />
<br />
[[File:def3.PNG|900px|center]]<br />
<br />
<br />
Note that <math>z^{(l)} \in \mathbb{R}^{p_l}</math> where <math>p_l</math> is the number of hidden units in the <math>l</math>-th layer.<br />
Moreover, this quantity is a Lipschitz constant of the gradients. The gradient has been an important measure of feature influence, especially considering that the derivative at the input layer gives the direction normal to the decision boundary.<br />
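<br />
A minimal sketch of the aggregated weights follows. It assumes the definition in the figure above is the product of the elementwise absolute weight matrices, <math>z^{(l)} = |w^y|^T |W^{(L)}| \cdots |W^{(l+1)}|</math>, with <math>W^{(k)}</math> of shape <math>p_k \times p_{k-1}</math>. The function names and the toy weights are hypothetical.<br />

```python
def abs_vec_mat(vec, mat):
    """Multiply a row vector by the elementwise absolute value of a matrix."""
    cols = len(mat[0])
    return [sum(v * abs(mat[r][c]) for r, v in enumerate(vec)) for c in range(cols)]


def aggregated_weights(w_out, hidden_mats, l):
    """Compute z^(l) under the assumed definition |w^y|^T |W^(L)| ... |W^(l+1)|.

    hidden_mats[k] holds W^(k+1) with shape (p_{k+1}, p_k); l is 1-based.
    """
    z = [abs(w) for w in w_out]
    # Multiply from W^(L) down to W^(l+1).
    for mat in reversed(hidden_mats[l:]):
        z = abs_vec_mat(z, mat)
    return z


# Toy network: 2 inputs, 3 units in hidden layer 1, 2 units in hidden layer 2.
W1 = [[1.0, -2.0], [0.5, 0.0], [0.0, 3.0]]
W2 = [[1.0, -1.0, 0.0], [0.0, 2.0, -1.0]]
w_out = [1.0, -0.5]

z1 = aggregated_weights(w_out, [W1, W2], 1)
z2 = aggregated_weights(w_out, [W1, W2], 2)
print(z1, z2)
```

Here <code>z1</code> has one entry per unit of the first hidden layer, matching <math>z^{(1)} \in \mathbb{R}^{p_1}</math>.<br />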
<br />
==Quantifying influence==<br />
For a hidden unit <math>i</math> in the first hidden layer, which is the layer closest to the input layer, we define the influence strength of an interaction as, <br />
<br />
[[File:measure1.PNG|900px|center]]<br />
<br />
The function <math>\mu</math> will be defined later. Essentially, the formula shows that the strength of influence is defined as the product of the aggregated weight on the first hidden layer and some measure of influence between the first hidden layer and the input layer. <br />
<br />
For the function <math>\mu</math>, any function with positive real values, such as max, min, or average, can be a candidate. The effects of those candidates will be tested later.<br />
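<br />
The product described above can be sketched as follows, assuming the strength of an interaction <math>I</math> at hidden unit <math>i</math> is <math>z_i^{(1)} \mu(|W_{i,I}|)</math>. The function names, the 0-based feature indices, and the example numbers are assumptions made for illustration.<br />

```python
def average(values):
    """Arithmetic mean, one possible choice of mu."""
    return sum(values) / len(values)


def interaction_strength(z_i, weight_row, interaction, mu=min):
    """Assumed form: aggregated weight z_i times mu of the input weight magnitudes.

    interaction lists 0-based indices of the input features involved.
    """
    magnitudes = [abs(weight_row[j]) for j in interaction]
    return z_i * mu(magnitudes)


# Hypothetical hidden unit with aggregated weight 2.0 and three input weights.
row = [1.0, -2.0, 0.5]
strength_min = interaction_strength(2.0, row, [0, 1])
strength_max = interaction_strength(2.0, row, [0, 1], mu=max)
strength_avg = interaction_strength(2.0, row, [0, 1], mu=average)
print(strength_min, strength_max, strength_avg)
```

With the minimum, an interaction is only as strong as its weakest input weight, whereas the maximum lets a single large weight dominate.<br />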
<br />
Based on the specifications above, the authors suggested the following algorithm for searching for influential interactions between input layer units:<br />
<br />
[[File:algorithm1.PNG|850px|center]]<br />
<br />
It was pointed out that restricting the search to the first hidden layer might miss some important feature interactions. However, the authors state that it is not straightforward to incorporate hidden units at intermediate layers to get better interaction detection performance.<br />
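<br />
The greedy search can be sketched roughly as below. This is a simplified reading, not the authors' exact pseudocode: for each first-layer hidden unit, the input features are sorted by weight magnitude, the top-2, top-3, ... features are taken as candidate interactions, and each candidate's strength is accumulated across hidden units. The function name, the 0-based feature indices, and the toy weights are hypothetical.<br />

```python
def greedy_ranking(first_layer, z1, mu=min):
    """Rank candidate interactions from first-layer weights (simplified sketch).

    first_layer[i][j] is the weight from input feature j to hidden unit i;
    z1[i] is the aggregated weight of hidden unit i.
    """
    strengths = {}
    for row, z_i in zip(first_layer, z1):
        # Features ordered by decreasing weight magnitude at this hidden unit.
        order = sorted(range(len(row)), key=lambda j: -abs(row[j]))
        for size in range(2, len(row) + 1):
            candidate = frozenset(order[:size])
            score = z_i * mu([abs(row[j]) for j in candidate])
            # Sum the strength of the same candidate over all hidden units.
            strengths[candidate] = strengths.get(candidate, 0.0) + score
    return sorted(strengths.items(), key=lambda kv: -kv[1])


# Toy first layer: 2 hidden units, 3 input features.
first_layer = [[2.0, -1.0, 0.0], [0.0, 1.0, 3.0]]
z1 = [1.0, 2.0]
ranking = greedy_ranking(first_layer, z1)
for interaction, strength in ranking:
    print(sorted(interaction), strength)
```

Each hidden unit contributes at most <math>p-1</math> candidates, so the search is linear in the number of features per unit rather than exponential.<br />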
<br />
=Cut-off Model=<br />
Using the greedy algorithm defined above, we can rank the interactions by their strength. However, to determine which of the ranked interactions are true interactions, we build a cut-off model, which is a generalized additive model (GAM), as below,<br />
<br />
<center><math><br />
c_K(\mathbf{x}) = \sum_{i=1}^{p}g_i(x_i) + \sum_{i=1}^{K}{g_i}^\prime(\mathbf{x}_{\chi_i})<br />
</math></center><br />
<br />
In the above model, <math>\chi_i</math> denotes the <math>i</math>-th ranked interaction, and each <math>g_i</math> and <math>g_i^\prime</math> is a feed-forward neural network. Interactions are added one at a time until the performance reaches a plateau.<br />
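<br />
The plateau rule can be sketched as a simple stopping criterion. This assumes that the validation error of <math>c_K</math> has already been measured for each <math>K</math>. The function name, the tolerance, and the error values are illustrative and are not results from the paper.<br />

```python
def choose_cutoff(val_errors, tol=1e-3):
    """Return the smallest K after which one more interaction helps by at most tol.

    val_errors[K] is the validation error of c_K (K = 0 means main effects only).
    """
    for k in range(len(val_errors) - 1):
        if val_errors[k] - val_errors[k + 1] <= tol:
            return k
    return len(val_errors) - 1


# Illustrative validation errors for K = 0, 1, 2, 3, 4.
errors = [0.90, 0.40, 0.15, 0.149, 0.150]
best_k = choose_cutoff(errors, tol=0.01)
print(best_k)
```

The top <code>best_k</code> interactions in the ranking would then be kept as the detected interactions.<br />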
<br />
=Experiment=<br />
For the experiments, the authors compared three neural network models with traditional statistical interaction detection algorithms. The first neural network model is an MLP. The second is MLP-M, which is an MLP with an additional univariate network at the output. The last is the cut-off model defined above, denoted MLP-cutoff. The MLP-M model is graphically represented below.<br />
<br />
[[File:output11.PNG|300px|center]]<br />
<br />
The interaction detection framework is studied on both simulated and real-world experiments. For the simulated experiments, it is tested on 10 synthetic functions, as shown in Table I.<br />
<br />
[[File:synthetic.PNG|900px|center]]<br />
<br />
Four real-world datasets are used, of which two are regression datasets and the other two are binary classification datasets. The datasets are a mixture of common prediction tasks in the cal housing and bike sharing datasets, a scientific discovery task in the higgs boson dataset, and an example of very-high-order interaction detection in the letter dataset.<br />
<br />
The authors also reported comparisons between the models. The neural-network-based models perform better on average. Compared to traditional methods like ANOVA, the MLP and MLP-M methods show a 20% increase in performance.<br />
<br />
[[File:performance_mlpm.PNG|900px|center]]<br />
<br />
<br />
[[File:performance2_mlpm.PNG|900px|center]]<br />
<br />
The above result shows that MLP-M almost perfectly captures the most influential pairwise interactions.<br />
<br />
=Limitations=<br />
Even though the MLP methods showed superior performance in the above synthetic experiments, the method still has some limitations. For example, for a function like <math>x_1x_2 + x_2x_3 + x_1x_3</math>, the neural network fails to distinguish interlinked pairwise interactions from a single higher-order interaction. Moreover, correlation between features deteriorates the ability of the network to distinguish interactions. However, correlation issues are present in most interaction detection algorithms. <br />
<br />
Because this method relies on the neural network fitting the data well, there are some additional concerns. Notably, if the NN is unable to make an appropriate fit (under/overfitting), the resulting interactions will be flawed. This can occur if the datasets are too small or too noisy, which often happens in practical settings. <br />
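<br />
To see why <math>x_1x_2 + x_2x_3 + x_1x_3</math> contains only pairwise interactions, one can use a mixed finite difference: it vanishes over a set of coordinates whenever the function has no interaction involving all of them together. The sketch below is a numerical check on the function itself, not part of the paper's method, and the helper name <code>mixed_difference</code> is hypothetical.<br />

```python
def mixed_difference(f, point, indices, h=1.0):
    """Mixed finite difference of f over the coordinates listed in indices.

    The result is zero when f has no interaction involving all of those
    coordinates together.
    """
    total = 0.0
    n = len(indices)
    for mask in range(2 ** n):
        shifted = list(point)
        bits = 0
        for k, idx in enumerate(indices):
            if (mask >> k) & 1:
                shifted[idx] += h
                bits += 1
        # Alternate signs according to how many coordinates were left unshifted.
        sign = 1 if (n - bits) % 2 == 0 else -1
        total += sign * f(shifted)
    return total


def pairwise(x):
    return x[0] * x[1] + x[1] * x[2] + x[0] * x[2]


def triple(x):
    return x[0] * x[1] * x[2]


point = [1.0, 2.0, 3.0]
print(mixed_difference(pairwise, point, [0, 1, 2]))  # no 3-way interaction
print(mixed_difference(pairwise, point, [0, 1]))     # pairwise interaction present
print(mixed_difference(triple, point, [0, 1, 2]))    # genuine 3-way interaction
```

A detector that reports a single 3-way interaction for the first function has merged the three interlinked pairs into one higher-order interaction.<br />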
<br />
=Conclusion=<br />
Here we presented a method for detecting interactions using an MLP. Compared to other state-of-the-art methods like Additive Groves (AG), its performance is competitive while requiring far less computational power. The method should therefore be especially useful for practitioners with comparatively limited computational resources. Moreover, the NID algorithm successfully reduced the size of the computation. Most importantly, users of neural networks can now bring interpretability into their use of the model, which should considerably improve usability for practitioners outside of the machine learning and deep learning areas.<br />
<br />
For future work, the authors want to detect feature interactions by using the common units in the intermediate hidden layers of feedforward networks, and also want to use such interaction detection to interpret weights in other deep neural networks. Also, it was pointed out that the interaction detection depends heavily on L1-regularized neural network training, but a group lasso penalty may work better.<br />
<br />
=Critique=<br />
1. The authors need to conduct large-scale experiments, rather than only experiments on synthetic datasets with small feature dimensionality, to make their claims stronger.<br />
<br />
2. Although the method proposed in this paper is interesting, the paper would benefit from providing some more explanations to support its idea and fill the possible gaps in its experimental evaluation. In some parts there are repetitive explanations that could be replaced by other essential clarifications.<br />
<br />
=Reference=<br />
<br />
[1] Jacob Bien, Jonathan Taylor, and Robert Tibshirani. A lasso for hierarchical interactions. Annals of statistics, 41(3):1111, 2013. <br />
<br />
[2] G David Garson. Interpreting neural-network connection weights. AI Expert, 6(4):46–51, 1991.<br />
<br />
[3] Yotam Hechtlinger. Interpretation of prediction models using the input gradient. arXiv preprint arXiv:1611.07634, 2016.<br />
<br />
[4] Shiyu Liang and R Srikant. Why deep neural networks for function approximation? 2016. <br />
<br />
[5] David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. International Conference on Learning Representations, 2018. <br />
<br />
[6] Daria Sorokina, Rich Caruana, and Mirek Riedewald. Additive groves of regression trees. Machine Learning: ECML 2007, pp. 323–334, 2007.<br />
<br />
[7] Simon Wood. Generalized additive models: an introduction with R. CRC press, 2006.<br />
<br />
[8] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.<br />
<br />
[9] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730. ACM, 2015.<br />
<br />
[10] Zhengping Che, Sanjay Purushotham, Robinder Khemani, and Yan Liu. Interpretable deep models for icu outcome prediction. In AMIA Annual Symposium Proceedings, volume 2016, pp. 371. American Medical Informatics Association, 2016.<br />
<br />
[11] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on pattern analysis and machine intelligence, 20(11):1254–1259, 1998.<br />
<br />
[12] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, 2016.<br />
<br />
[13] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.<br />
<br />
[14] Kush R Varshney and Homa Alemzadeh. On the safety of machine learning: Cyber-physical systems, decision sciences, and data products. arXiv preprint arXiv:1610.01256, 2016.<br />
<br />
[15] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.</div>Z43ma